A little forward, a little backward with lots of attention in the interim
Let's go cha cha cha with NNs

Well, of course we are not talking about cha cha cha or any other dance form here. Apologies for the clickbait. But we do need your attention here, and there is a pun intended as well. You shall see it in a minute.

We are, however, talking about the very dear Neural Networks, as is so fitting for an AI blog. Turns out, Neural Networks work by a twin mechanism: Forward Propagation followed by something called Backward Propagation, or Backprop. For the beginners in the course and the discourse in AI, Neural Networks have been the game-changers in the field of Deep Learning, which has really altered the landscape of AI in the last decade or so. This has been made possible by rapid advancement in hardware like GPUs and TPUs, and by transistor density growing according to Moore's law (I shall delve deeper into that in another blog), and also by the increase in data available for training. But here we need to understand that the advent of the NN architecture has taken the field of AI from the Classical phase to the Deep Learning phase very rapidly, in a span of a few years.

So what are these NNs and how do they work? At the end of the day, they are sets of computer code and math algorithms which make complex and detailed computations on some input data to produce some output, which solves a problem in NLP, in Computer Vision (image classification, object recognition, facial recognition), or say in Speech Recognition. But there is a certain architecture they follow which is layered and repeated. There is an input layer which we feed the data into after some cleaning and preprocessing, including the data labeling part, which we shall take up separately. Just think that we make our data cleaner and more understandable for our programs before we feed it into them. After the input layer, we have what we call hidden layers, which can number into dozens. The more the hidden layers, the more powerful our NN and the better the problem solving. Then we have the output layer. This one gives the final output, which the examples would always call ŷ or y-hat, a numerical entity. We calculate our cost function, or error function, on these outputs and then try to minimize it. That is the main computational problem we solve in the entire process. The cost function is better understood intuitively as the error function: it is the equation, using mean squared error or other formulae depending on the kind of problem you are working with (binary classification, logistic regression, etc.), which defines the function. But essentially it is about how far off the number thrown out of the final layer is from the expected output. So you get a function of some variables which the NN has reduced your real-world problem to, and you now apply some error minimizing on it, by a process called gradient descent. It's just finding the minima of that function of many variables. As you may understand, the number of variables the problem has been reduced to can be large, so this cost function can get very complex too.
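To make the layered picture concrete, here is a minimal NumPy sketch of one forward pass through a tiny network, with made-up layer sizes (3 inputs, 4 hidden neurons, 1 output) and a mean squared error cost. The names and shapes are illustrative, not from any particular framework.

```python
import numpy as np

def mse_cost(y_hat, y):
    """Mean squared error: how far the network's output is from the target."""
    return np.mean((y_hat - y) ** 2)

# Toy forward pass: input layer -> one hidden layer -> output layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))        # 3 input features (after cleaning/preprocessing)
W1 = rng.normal(size=(4, 3))       # hidden layer: 4 neurons
b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4))       # output layer: produces the single y-hat
b2 = np.zeros((1, 1))

a1 = np.maximum(0, W1 @ x + b1)    # hidden layer output (ReLU activation)
y_hat = W2 @ a1 + b2               # the final numerical output, y-hat
print(mse_cost(y_hat, np.array([[1.0]])))  # the error we want to minimize
```

Minimizing that last printed number, over many examples, is the whole game described above.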
So finding the minima of this is a very computationally involved problem in practice, and that is where we need so much of the computational power which is always quoted in the literature. So essentially you realize that our initial problem is broken down into some math variables using something called feature learning, breaking it into parts which can be turned into math functions. Then every layer in the network goes deeper into working with these variables. Within each layer is a set of individual computational units which we call neurons. So a layer may have one or many neurons. This is where they try to make the whole thing similar in sound and structure to the human mind. Each neuron is an intelligence unit, philosophically speaking. It has learnt something about the problem. Now that learning can be more accurate, less accurate, or contextual. Remember, every piece of real-life problem our NN is going to answer is going to have a different context, and hence each neuron's learning needs to be applied in proportion. And thus we assign what are called "weights" and "biases" to each neuron in our network. You know what weights are in math. And since this computation is so complex, you guessed it correctly that the weights we assign can't be correct the very first time. It has to be an iterative process to arrive at the correct weights and biases. And that's what we do during those hours of training the NN for days on end: experiment with these parameters along with some other hyperparameters. The weights are the parameters; they are controlled by other settings like the learning rate, denoted generally by the Greek letter alpha, which are called hyperparameters. There are many other knobs we can feed into the big NN program too, like the number of layers and the number of neurons within each layer. And then the iteration starts.
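A single neuron with its weights and bias can be sketched in a few lines. The inputs and weight values below are made up purely for illustration: each weight says how much the neuron cares about one input, and the bias is its baseline tendency.

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: a weighted sum of inputs plus a bias, squashed by a sigmoid."""
    z = np.dot(w, x) + b              # the weighted sum the text describes
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation -> value in (0, 1)

x = np.array([0.5, -1.2, 3.0])        # made-up inputs to this neuron
w = np.array([0.1, 0.4, -0.2])        # weights: how much each input matters
b = 0.05                              # bias: the neuron's baseline tendency
print(neuron(x, w, b))                # a number between 0 and 1
```

Training is nothing more than nudging `w` and `b` for every neuron, iteration after iteration, under the control of hyperparameters like the learning rate.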

 But you shall ask me: where is the backward step in the dance, as promised? Three steps forward and then three steps backward was the whole premise of the dance, remember. This is where the variables dw and db come into the scene. You recall the weights and biases we spoke about earlier, and that they shall be iteratively changed till our final error function is reduced to its minima. So the algorithm shall change the value of each weight by dw and each bias by db, many a time. And how does it know what values to change to? The values of dw and db are computed by the program in the step of Backward Propagation, or backprop. Don't expect to learn the full details in this blog; frankly, that is beyond the scope, as this is more of a beginner's write-up. But just remember that the final error function we created as the output of the final layer had the ŷ output. You calculated the error function on that, say using a mean squared formula. From the same computation, you can calculate what is called da in common parlance, where a is the output of an Activation Function. This is the delta change in the activation function used in the NN. The activation function is a kind of math operator you have applied to the intermediate outputs of your variables in the NN, all through the layers. Some common activation functions are ReLU, Sigmoid, etc. They are needed to make the calculations try out different points in the function; otherwise, no matter how many layers you put in the NN, it shall not experiment enough and shall shell out some basic non-learned output that is not useful. In very basic terms, the activation function is the explorer that goes all around the vast ocean to find that promised land. Of course it's experimental in nature and computationally rigorous, but that's the whole point of the exercise, right? So we get da calculated at the final output layer, and we then retrace the journey backwards, or leftward, through the same layers.
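The two activation functions just mentioned are short enough to write out in full. This is a plain NumPy sketch; any deep learning library ships its own versions of these.

```python
import numpy as np

def relu(z):
    """ReLU: pass positive values through unchanged, clamp negatives to zero."""
    return np.maximum(0, z)

def sigmoid(z):
    """Sigmoid: squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # values strictly between 0 and 1
```

Without a non-linearity like these between layers, a stack of layers would collapse into one big linear formula, which is why the text says the network would "not experiment enough".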
From da, we can calculate dw and db eventually for every layer and neuron. The math is a set of equations which can be covered later. The important thing to understand is that we have now started retracing the steps and changing the weights and biases, the assumptions and learnt features, in our deep algorithms in those layers, based on our error function. So there is a feedback loop now. This is the iteration that needs to be done, as many times as we can. The parameters and hyperparameters are changed in iterations based on the final cost function, and we have entered the most important part of the journey, called gradient descent, for our error or cost function. Remember, it is a function of the many variables our algorithms have reduced the problem statement into. And remember your basic calculus, where a function is minimized as the gradient falls and eventually becomes zero. It is simply the same process. Repeated many times overall, it eventually gives you some kind of a minima, which is where the output layer shall give you the least error and the best accuracy.
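The "nudge the weight by dw, scaled by the learning rate alpha" loop can be shown on a deliberately tiny example. Here the cost is a made-up one-parameter function, J(w) = (w − 3)², whose gradient we can write by hand; real backprop computes the same kind of dw for millions of weights at once.

```python
# Gradient descent on a toy one-parameter cost: J(w) = (w - 3)^2.
# The gradient dw says which way to nudge w; alpha is the learning rate.
w = 0.0
alpha = 0.1                 # hyperparameter: the learning rate (our Greek alpha)
for _ in range(100):        # the iterative "retrace and update" loop
    dw = 2 * (w - 3)        # dJ/dw, the hand-computed analogue of backprop's dw
    w = w - alpha * dw      # the update rule: w := w - alpha * dw
print(w)                    # converges very close to the minimum at w = 3
```

Notice that at w = 3 the gradient dw is zero, exactly the "gradient falls and eventually becomes zero" picture from basic calculus.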

As you shall appreciate, it is now a problem of experimenting more and more, in iterations, with more and more inputs, trying different parameters and hyperparameters to make the algorithms more mature and reduce our error function to a minima. And that takes in huge data, like the millions of images from datasets such as ImageNet that popular models have been trained on. This data is very well curated now, and labeled too, enabling better transfer learning, a concept we shall delve into later in the series. Of course, the data and context change in production environments over time, and we are presented with issues of what are called data drift and concept drift. This is a constant challenge and needs retraining of the models in production, applying different parameters and hyperparameters, as the accuracy of LIVE models decreases over the course of time due to the introduction of new data in the production environment which is different from the training datasets we used initially. Like the lighting may be different in real-life images which users upload from their mobiles, compared to the very good ImageNet pictures you used initially. There is a lot of data bias inherent in every model which has been trained. And this means your LIVE models are always to be in some kind of a beta, as opposed to other software which is non-decision-making in nature, with fixed inputs and confirmed, predetermined goals. This has spun off a completely new field called Data-Centric AI, where model-based effort gives way to data-based effort for improving outcomes. But I digress a little.

We discussed the forward steps and we discussed the backward step in our discourse. But where is the "Attention" we were talking about? Well, that comes circa the fall of 2017, when a team at Google published a paper titled "Attention Is All You Need". That revolutionized ML onto its next phase, it appears. It gave rise to the new Transformer architecture for NNs, which are encoder- and decoder-based systems. Some layers encode the input with something called attention data. This applied very well to NLP problems, where words are the basic building blocks of the input. So you create a vector for every word fed into the NN, which has numerical values in a matrix, say 4 by 1, for each word. Now, there can be thousands of words preceding any given word in a text you input. The main difference in this new Transformer architecture was that the algorithm could now, in theory, reference the entire preceding set at run time, so to say. The backward reference window was unbounded, as opposed to a limited number in preceding architectures like RNNs. And in the encoder stage, you enrich the word vector with some additional attention data, as the writers of the paper named it. And that attention data, together with the positional encoding for each word, is the output of the encoder for every word in the text. So in the resulting matrix of words, every word shall have a numerical connection to every other word in the input text. So the NN can now work with this additional attention data, which the preceding model architectures could not. The decoder then, of course, works towards the final output, and the regular mechanics of the error function follow. But this has revolutionized Natural Language Processing as a field, with NLU and NLG. Just go to Hugging Face and have a blast with Transformer-based NNs.
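The "every word gets a numerical connection to every other word" idea is the scaled dot-product attention from that paper, and it fits in a few lines of NumPy. The sizes here (5 words, 4-dimensional vectors) are toy values for illustration; in a real Transformer, Q, K, and V are themselves produced from the word vectors by learned weight matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Turn a row of scores into weights that sum to 1."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every word's query is scored against
    every word's key, and the scores weight a sum over the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # word-to-word connection strengths
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # attention-mixed vector per word

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))             # 5 words, 4-dimensional vectors (toy sizes)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out = attention(Q, K, V)
print(out.shape)                        # (5, 4): one output vector per word
```

The key point for this blog: the `scores` matrix is 5-by-5, i.e. every word attends to every word, which is the unbounded reference window the text contrasts with RNNs.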

In conclusion, NNs, with the advent of GPUs and the parallelizability and increased computational power they bring, gave researchers the tools to iterate on the experiment runs needed to reduce the error in the output and hence increase its accuracy. And this is also what we help ML engineers with here at newron: the experiment runs, and tracking those with great visual representations of what is happening in those hidden layers. It is the most important part, I reckon, for the future: the more robust an understanding we have of the intricate processes going on inside these layers, the more intuitive we become, and the faster and better the training process is for any transfer learning task we are given. That, along with Data-Centric AI as a construct, shall be the next step in the AI and ML revolution, in my humble opinion.