Mathematics behind Machine Learning: Beating Least Squares in Only 1 Iteration

Revision en30, by SrabonGitikar, 2025-08-23 14:29:31

Motivation

Updating the weights in each layer is the core of tuning a neural network. Now consider a simple regression model. If I ask you which method we use to find the weights, you'll probably reply — the method of least squares. And that is the most common answer. But note that least squares is not always the most convenient way to obtain the weights. In this blog, I want to tell you about another very powerful method, and if you have studied Numerical Analysis, you probably already know it. It's nothing but the very popular Newton-Raphson method.

In general, this method is used to numerically solve equations that cannot be solved analytically. If we simply have to define the method, it goes as follows:

Consider the equation, $$$f(x)=0$$$. Then, we can solve for $$$x$$$ starting from any initial guess $$$x^{(0)}$$$; and for the $$$i$$$-th iteration, the update of the solution $$$x$$$ is given by,

$$$x^{(i)} = x^{(i-1)} - \frac{f\left(x^{(i-1)}\right)}{f'\left(x^{(i-1)}\right)}\quad[i \gt 0]$$$
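As a tiny illustration, here's the scalar update in code, solving $$$x^2-2=0$$$ (i.e., computing $$$\sqrt{2}$$$); the function, derivative, starting point, and tolerance are all just example choices:

```python
def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=50):
    """Solve f(x) = 0 starting from x0 via Newton-Raphson."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)  # x^(i) = x^(i-1) - f(x^(i-1)) / f'(x^(i-1))
        x -= step
        if abs(step) < tol:      # stop once the update is negligible
            break
    return x

root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.41421356...
```

Note how few iterations it takes: starting from $$$x^{(0)}=1$$$, the iterates are roughly $$$1.5$$$, $$$1.4167$$$, $$$1.41422$$$, and by the fifth step the answer is accurate to machine precision.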

Now let's apply this to the weights $$$(\Theta \in \mathbb{R}^n)$$$ of a simple Linear Regression model.

Warning: Heavy Mathematics coming up!

The Idea

Let's consider a simple linear regression model.

Let's say we have $$$m$$$ training examples and $$$n$$$ inputs, or more formally, $$$x^{(i)} \in \mathbb{R}^n$$$ for any $$$i \in \{1, \ldots, m\}$$$. For simplicity, let's take $$$y \in \mathbb{R}$$$ here; we could just as well predict a multidimensional output.

Consider the loss function,

$$$J(\Theta) = \frac{1}{2} \sum_{i=1}^{m} \Big( h_\Theta(x^{(i)}) - y^{(i)} \Big)^2$$$ where $$$h_\Theta(x^{(i)})=\theta_0+\sum_{j=1}^{n}\theta_jx_j^{(i)}$$$ for $$$i \in \{1, \ldots, m\}$$$. (Scaling $$$J$$$ by the usual $$$\frac{1}{m}$$$ would not change the minimizer, so we drop it to keep the algebra clean.)

First, let's talk about the Least Squares method. In this method, to get the optimal weights $$$\Theta$$$, we simply set,

$$$\nabla_{\Theta} J(\Theta) = 0$$$

Note that,

$$$ \nabla_{\Theta} J(\Theta) = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \frac{\partial J}{\partial \theta_1} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} = X^\top (X \Theta - y) $$$

where $$$X$$$ is the design matrix whose $$$i$$$-th row is $$$\big(1, x_1^{(i)}, \ldots, x_n^{(i)}\big)$$$ — the leading $$$1$$$ absorbs the intercept $$$\theta_0$$$ as the weight of a constant feature.

You can get this result simply by calculating each partial derivative $$$\frac{\partial J}{\partial \theta_j}$$$.
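If you'd rather check the formula numerically than by hand, here's a small sketch (random illustrative data, NumPy as the tool — neither is part of the original derivation) comparing the analytic gradient $$$X^\top(X\Theta - y)$$$ against central finite differences of $$$J$$$:

```python
import numpy as np

# Sanity check: grad J = X^T (X Theta - y) for J(Theta) = 1/2 ||X Theta - y||^2.
rng = np.random.default_rng(0)
m, n = 8, 3                          # small random problem, purely illustrative
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta = rng.standard_normal(n)

J = lambda t: 0.5 * np.sum((X @ t - y) ** 2)
analytic = X.T @ (X @ theta - y)

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
print(np.max(np.abs(analytic - numeric)))  # tiny: limited only by rounding error
```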

Now, we require $$$\nabla_{\Theta} J(\Theta) = 0$$$

Hence, $$$X^\top (X \Theta - y)=0$$$

Which gives, assuming $$$X^\top X$$$ is invertible, $$$\Theta = (X^\top X)^{-1} X^\top y$$$
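As a quick sketch, this closed form can be computed directly with NumPy (random illustrative data; `np.linalg.lstsq` serves as an independent cross-check):

```python
import numpy as np

# Closed-form least squares: Theta = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(1)
m, n = 20, 4
# a column of ones plays the role of the intercept theta_0
X = np.column_stack([np.ones(m), rng.standard_normal((m, n))])
y = rng.standard_normal(m)

# prefer solving the linear system over forming the explicit inverse
theta = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check against NumPy's own least-squares solver
reference, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, reference))  # True
```

In practice one would call `np.linalg.lstsq` (or a QR-based solver) directly, since forming $$$X^\top X$$$ squares the condition number; the explicit normal equations are shown here only to mirror the derivation.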

That's the final result that we obtain in this method. Now let's talk about the Newton-Raphson Method.

In the $$$k$$$-th iteration of the Newton-Raphson method, for multidimensional quantities, we have,

$$$ \Theta^{(k)} = \Theta^{(k-1)} - H(\Theta^{(k-1)})^{-1} \nabla_{\Theta} J(\Theta^{(k-1)}) $$$

where $$$H$$$ is the Hessian matrix, whose entries are,

$$$H_{ij} = \frac{\partial^2 J(\Theta)}{\partial \theta_i \, \partial \theta_j}$$$

If you work out the matrix for our quadratic loss, you'll end up with $$$H(\Theta)=\nabla_{\Theta}^2 J(\Theta)=X^\top X$$$ — a constant matrix that does not depend on $$$\Theta$$$.

Let's begin the iterations of the Newton-Raphson method with $$$\Theta^{(0)}=\mathbf{0}$$$. Then we end up with, $$$\Theta^{(1)}=-(\nabla_{\Theta}^2 J(\Theta^{(0)}))^{-1}\nabla_{\Theta} J(\Theta^{(0)})$$$.

Also, it is to be noted that $$$J(\Theta) = \frac{1}{2} (X \Theta - y)^\top (X \Theta - y) = \frac{1}{2} \text{tr}\big((X \Theta - y)^\top (X \Theta - y)\big)$$$.

So, evaluating at $$$\Theta^{(0)}$$$ (the zero vector), the gradient is $$$\nabla_{\Theta} J(\Theta^{(0)}) = X^\top(X\mathbf{0} - y) = -X^\top y$$$ and the Hessian is $$$X^\top X$$$, so we end up with,

$$$\Theta^{(1)} = (X^\top X)^{-1} X^\top y$$$

Which is the same result as Least Squares!
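Here's a minimal sketch verifying this numerically: one Newton-Raphson step from the zero vector, compared against the normal-equation solution (random illustrative data):

```python
import numpy as np

# One Newton step from Theta^(0) = 0 for the quadratic loss:
# gradient = X^T (X Theta - y), Hessian = X^T X (constant for this loss).
rng = np.random.default_rng(2)
m, n = 30, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

theta = np.zeros(n)                       # Theta^(0) = 0
grad = X.T @ (X @ theta - y)              # equals -X^T y at Theta = 0
H = X.T @ X                               # Hessian of the quadratic loss
theta = theta - np.linalg.solve(H, grad)  # Theta^(1): one Newton step

# identical to the closed-form least-squares solution
closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta, closed_form))  # True
```

Since the loss is exactly quadratic, the Hessian is constant and a single step lands on the minimizer; a second iteration would compute a zero gradient and leave $$$\Theta$$$ unchanged.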

Conclusion

When solving general scalar equations, we usually run 3-4 iterations of the Newton-Raphson method. Least squares is a well-established method in Statistics and Machine Learning, and here I showed you that a single Newton-Raphson iteration already reproduces its exact solution — for a quadratic loss, the method converges in one step. For losses that are not quadratic, a few more iterations typically still converge very quickly. Neural networks generally use gradient descent with a given learning rate to update the weights, but analytically, gradient descent is quite tough to work with — we aren't perceptrons, after all. So, when you don't have a computer, the Newton-Raphson method is a clean way to reach the optimum by hand.

And this brings us to the end of the long, mathematical discussion. I really hope you enjoyed reading it, thank you for reading so far!

Tags machine learning, neural networks, math
