Beating Least Squares in Only 1 Iteration

Revision en29, by SrabonGitikar, 2025-08-23 14:26:56

Motivation

Updating the weights in each layer is the core of tuning a Neural Network. Now consider a simple regression model. If I ask you what method we use to find the weights, you'll probably reply: the method of least squares. And that is the most common answer. But it should be noted that Least Squares may not always be the most convenient way to obtain the weights. In this blog, I want to tell you about a very powerful alternative method for computing them. If you have studied Numerical Analysis, you probably already know it: it's nothing but the very popular Newton-Raphson Method.

In general, this method is used to numerically solve equations that cannot be solved analytically. If we simply have to define the method, it goes as follows:

Consider the equation, $$$f(x)=0$$$. Then, we can solve for $$$x$$$ starting from any initial guess $$$x^{(0)}$$$; and for the $$$i$$$-th iteration, the update of the solution $$$x$$$ is given by,

$$$x^{(i)} = x^{(i-1)} - \frac{f\left(x^{(i-1)}\right)}{f'\left(x^{(i-1)}\right)}\quad[i \gt 0]$$$
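As a quick sanity check, the scalar update above can be sketched in a few lines of Python (the function and starting point here are illustrative choices, not from the text):

```python
# Newton-Raphson iteration for a scalar equation f(x) = 0.
def newton_raphson(f, f_prime, x0, iterations=10):
    x = x0
    for _ in range(iterations):
        x = x - f(x) / f_prime(x)  # x^(i) = x^(i-1) - f/f'
    return x

# Example: solve x^2 - 2 = 0, i.e. approximate sqrt(2).
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)  # converges to sqrt(2) ~ 1.41421356...
```

Thanks to quadratic convergence, a handful of iterations already pins the root down to machine precision.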

Now let's apply this to the weights $$$(\Theta \in \mathbb{R}^n)$$$ of a simple Linear Regression model.

Warning: Heavy Mathematics coming up!

The Idea

Let's consider a simple linear regression model with $$$m$$$ training examples and $$$n$$$ inputs; more formally, $$$x^{(i)} \in \mathbb{R}^n$$$ for every $$$i \in \{1, \ldots, m\}$$$. For simplicity, let's take $$$y \in \mathbb{R}$$$ here; in general, the output could be multidimensional as well.

Consider the loss function,

$$$J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big( h_\Theta(x^{(i)}) - y^{(i)} \Big)^2$$$ where $$$h_\Theta(x^{(i)})=\theta_0+\sum_{j=1}^{n}\theta_jx_j^{(i)}$$$ for $$$i \in \{1, \ldots, m\}$$$.
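In vectorized form this loss is straightforward to evaluate. Here is a minimal NumPy sketch on made-up toy data (the numbers are purely illustrative), with the intercept $$$\theta_0$$$ absorbed by a leading column of ones:

```python
import numpy as np

# Toy data: m = 4 examples, n = 2 features (values chosen arbitrarily).
X_raw = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 7.0])
m = X_raw.shape[0]
X = np.hstack([np.ones((m, 1)), X_raw])  # prepend x_0 = 1 for theta_0

def loss(theta):
    """J(theta) = (1 / 2m) * sum of squared residuals."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

print(loss(np.zeros(3)))  # all-zero weights give J = 10.03125 on this data
```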

First, let's talk about the Least Squares Method. In this method, to get the optimal weights $$$\Theta$$$, we simply set,

$$$\nabla_{\Theta} J(\Theta) = 0$$$

To be noted that,

$$$ \nabla_{\Theta} J(\Theta) = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \frac{\partial J}{\partial \theta_1} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} = \frac{1}{m} X^\top (X \Theta - y) $$$

where $$$X$$$ is the $$$m \times (n+1)$$$ design matrix whose first column is all ones (absorbing the intercept $$$\theta_0$$$) and $$$y$$$ is the vector of targets.

You can get this result simply by calculating each partial derivative $$$\frac{\partial J}{\partial \theta_j}$$$.

Now, we require $$$\nabla_{\Theta} J(\Theta) = 0$$$

Hence, $$$X^\top (X \Theta - y)=0$$$

which gives, assuming $$$X^\top X$$$ is invertible (i.e. the columns of $$$X$$$ are linearly independent),

$$$\Theta = (X^\top X)^{-1} X^\top y$$$
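This closed form is easy to verify numerically. A small NumPy sketch (toy data, arbitrary values) compares the normal equations against NumPy's built-in least-squares routine:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 7.0])

# Normal equations: solve (X^T X) theta = X^T y rather than forming
# the explicit inverse, which is numerically better behaved.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))  # True
```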

That's the final result that we obtain in this method. Now let's talk about the Newton-Raphson Method.

In the $$$k$$$-th iteration of the Newton-Raphson method, for multidimensional quantities, we have,

$$$ \Theta^{(k)} = \Theta^{(k-1)} - H(\Theta^{(k-1)})^{-1} \nabla_{\Theta} J(\Theta^{(k-1)}) $$$

where $$$H$$$ is the Hessian matrix, whose entries are

$$$H_{ij} = \frac{\partial^2 J(\Theta)}{\partial \theta_i \, \partial \theta_j}$$$

If you work out the matrix for our quadratic loss, you'll end up with $$$H(\Theta)=\nabla_{\Theta}^2 J(\Theta)=\frac{1}{m}X^\top X$$$, a constant matrix that does not depend on $$$\Theta$$$.

Let's begin the iterations of the Newton-Raphson method with $$$\Theta^{(0)}=\mathbf{0}$$$. Then we end up with $$$\Theta^{(1)}=-(\nabla_{\Theta}^2 J(\Theta^{(0)}))^{-1}\nabla_{\Theta} J(\Theta^{(0)})$$$.

Also, it is to be noted that $$$J(\Theta) = \frac{1}{2m} (X \Theta - y)^\top (X \Theta - y) = \frac{1}{2m} \text{tr}\big((X \Theta - y)^\top (X \Theta - y)\big)$$$, since the trace of a scalar is the scalar itself.

Carrying out the calculation at $$$\Theta^{(0)}=\mathbf{0}$$$: the gradient is $$$\nabla_{\Theta} J(\Theta^{(0)}) = \frac{1}{m}X^\top(X\cdot\mathbf{0} - y) = -\frac{1}{m}X^\top y$$$ and the Hessian is $$$\frac{1}{m}X^\top X$$$, so

$$$\Theta^{(1)} = \mathbf{0} + \left(\tfrac{1}{m}X^\top X\right)^{-1}\tfrac{1}{m}X^\top y = (X^\top X)^{-1} X^\top y$$$

Which is the same result as Least Squares!
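The whole argument can be checked numerically in a few lines: starting from the zero vector, one Newton step reproduces the least-squares solution (toy data again, values arbitrary):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 7.0])
m = X.shape[0]

theta0 = np.zeros(X.shape[1])        # Theta^(0) = 0
grad = X.T @ (X @ theta0 - y) / m    # gradient of J at Theta^(0)
H = X.T @ X / m                      # Hessian (constant for this loss)

# One Newton-Raphson step: Theta^(1) = Theta^(0) - H^{-1} grad
theta1 = theta0 - np.linalg.solve(H, grad)

theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta1, theta_ls))  # True
```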

Conclusion

While solving general scalar equations, we usually need several iterations of the Newton-Raphson method. Least Squares is already a standard method in Statistics and Machine Learning, and here I showed you that a single Newton-Raphson iteration, started from $$$\Theta^{(0)}=\mathbf{0}$$$, lands exactly on the Least Squares solution. This is no accident: the loss is quadratic, so the second-order Taylor expansion that Newton's method uses is exact, and one step jumps straight to the minimum; further iterations leave the answer unchanged. The real payoff comes with non-quadratic losses (logistic regression, for instance), where a handful of Newton iterations typically converge far faster than Gradient Descent. Neural Networks generally use Gradient Descent with a given learning rate to update weights, but analytically it is quite tough to work with Gradient Descent (we aren't perceptrons, after all). So when you do not have a computer, the Newton-Raphson method lets you reach the optimum with pen and paper in a single step.

And this brings us to the end of the long, mathematical discussion. I really hope you enjoyed reading it, thank you for reading so far!

Tags: machine learning, neural networks, math
