Motivation
Updating the weights in each layer is the core of training a neural network. Now consider a simple regression model. If I ask you which method we use to find the weights, you'll probably reply: the method of least squares. And that is indeed the most common choice. But it is to be noted that least squares is not the only way to get those weights. In this blog, I want to show you another very powerful method for obtaining them. And if you have studied Numerical Analysis, you probably already know it. It's nothing but the very popular Newton-Raphson method.
Well, in general, this method is used to numerically solve equations that cannot be solved analytically. If we simply have to define the method, it goes as follows:
Consider the equation, $$$f(x)=0$$$. Then, we can solve for $$$x$$$ starting from any initial guess $$$x^{(0)}$$$; and for the $$$i$$$-th iteration, the update of the solution $$$x$$$ is given by,
$$$x^{(i)} = x^{(i-1)} - \frac{f\left(x^{(i-1)}\right)}{f'\left(x^{(i-1)}\right)}\quad[i \gt 0]$$$
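The update rule above translates directly into a few lines of code. Here is a minimal sketch in Python (the function name is my own), using $$$x^2 - 2 = 0$$$ as the example equation, so the iteration converges to $$$\sqrt{2}$$$:

```python
def newton_raphson(f, f_prime, x0, iterations=10):
    """Refine a root of f, starting from the initial guess x0."""
    x = x0
    for _ in range(iterations):
        # x^(i) = x^(i-1) - f(x^(i-1)) / f'(x^(i-1))
        x = x - f(x) / f_prime(x)
    return x

# Solve x^2 - 2 = 0, i.e. approximate sqrt(2).
root = newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.4142135623...
```

A fixed iteration count is the simplest stopping rule; in practice you would stop once $$$|f(x)|$$$ falls below a tolerance.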
Now let's apply this for weights $$$(\Theta \in \mathbb{R}^n)$$$ in a simple Linear Regression model.
Warning: Heavy Mathematics coming up!
The Idea
Let's consider a simple linear regression model.
Suppose we have $$$m$$$ training examples and $$$n$$$ inputs, or more formally, $$$x^{(i)} \in \mathbb{R}^n$$$ for every $$$i \in \{1, \ldots, m\}$$$. For simplicity, let's take $$$y \in \mathbb{R}$$$ here; the same ideas carry over to multidimensional outputs.
Consider the loss function,
$$$J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big( h_\Theta(x^{(i)}) - y^{(i)} \Big)^2$$$, where $$$h_\Theta(x^{(i)}) = \theta_0 + \sum_{j=1}^{n} \theta_j x_j^{(i)}$$$ for each $$$i \in \{1, \ldots, m\}$$$.
First, let's talk about the Least Squares Method. In this method, to get the optimal weights $$$\Theta$$$, we simply set,
$$$\nabla_{\Theta} J(\Theta) = 0$$$
To be noted that,
$$$\nabla_{\Theta} J(\Theta) = \frac{1}{m} X^\top (X \Theta - y)$$$
where $$$X \in \mathbb{R}^{m \times (n+1)}$$$ is the design matrix whose $$$i$$$-th row is $$$(1, x^{(i)\top})$$$, and $$$y \in \mathbb{R}^m$$$ stacks the targets. You can get this result simply by calculating each partial derivative $$$\frac{\partial J}{\partial \theta_j}$$$.
Now, we need $$$\nabla_{\Theta} J(\Theta) = 0$$$
Hence, $$$X^\top (X \Theta - y)=0$$$
Which gives, $$$\Theta = (X^\top X)^{-1} X^\top y$$$ (assuming $$$X^\top X$$$ is invertible)
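The normal equation above is easy to verify numerically. Here is a small sketch with NumPy (the data is synthetic and noiseless, so the fit recovers the true weights exactly); note that numerically it is better to solve the linear system $$$X^\top X \Theta = X^\top y$$$ than to form the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
# Design matrix with a leading column of ones for theta_0.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta  # noiseless targets, so the fit is exact

# Normal equation: solve (X^T X) Theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers true_theta up to floating-point error
```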
That's the final result that we obtain in this method. Now let's talk about the Newton-Raphson Method.
In the $$$k$$$-th iteration of the Newton-Raphson method, for multidimensional quantities, we have,
$$$\Theta^{(k)} = \Theta^{(k-1)} - H\big(\Theta^{(k-1)}\big)^{-1} \nabla_{\Theta} J\big(\Theta^{(k-1)}\big)$$$
Where $$$H$$$ is the Hessian Matrix, in which,
$$$H_{ij} = \frac{\partial^2 J(\Theta)}{\partial \theta_i \, \partial \theta_j}$$$
If you work out the matrix, you'll end up with $$$H(\Theta) = \nabla_{\Theta}^2 J(\Theta) = \frac{1}{m} X^\top X$$$, which does not depend on $$$\Theta$$$.
Let's begin the iterations of the Newton-Raphson method with $$$\Theta^{(0)} = O$$$, the zero vector. Then we end up with, $$$\Theta^{(1)} = -(\nabla_{\Theta}^2 J(\Theta^{(0)}))^{-1}\nabla_{\Theta} J(\Theta^{(0)})$$$.
Also, it is to be noted that $$$J(\Theta) = \frac{1}{2m} (X \Theta - y)^\top (X \Theta - y) = \frac{1}{2m} \text{tr}\big((X \Theta - y)^\top (X \Theta - y)\big)$$$.
So we can substitute $$$\Theta^{(0)} = O$$$: the gradient becomes $$$\nabla_{\Theta} J(\Theta^{(0)}) = -\frac{1}{m} X^\top y$$$ and the Hessian is $$$\frac{1}{m} X^\top X$$$, and we'll end up with,
$$$\Theta^{(1)} = (X^\top X)^{-1} X^\top y$$$
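We can confirm this numerically as well. Here is a sketch of a single Newton-Raphson step starting from $$$\Theta^{(0)} = O$$$, compared against NumPy's least-squares solver (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # bias column for theta_0
y = rng.normal(size=m)

theta = np.zeros(n + 1)                   # Theta^(0) = O
grad = X.T @ (X @ theta - y) / m          # gradient of J at Theta^(0)
H = X.T @ X / m                           # Hessian of J (constant in Theta)
theta = theta - np.linalg.solve(H, grad)  # one Newton-Raphson step

# Compare with the direct least-squares solution.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, theta_ls))  # True
```

Because the loss is quadratic, the single step lands exactly on the least-squares minimizer.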
Which is the same result as Least Squares!
Conclusion
While solving general scalar equations, we usually run 3-4 iterations of the Newton-Raphson method. Least Squares is already a well-practised method in Statistics and Machine Learning, and here I showed you how the Newton-Raphson method reaches the very same solution in only one iteration. The reason is that our loss is quadratic in $$$\Theta$$$, so a single Newton step lands exactly on the minimum. Neural networks generally use Gradient Descent with a given learning rate to update the weights, because computing and inverting the Hessian becomes expensive in high dimensions. But when you are working things out analytically, with pen and paper rather than a computer, the Newton-Raphson method is a remarkably direct way to reach the optimum.
And this brings us to the end of this long, mathematical discussion. I really hope you enjoyed it. Thank you for reading so far!



