Beating Least Squares in Only 1 Iteration

Revision en29, by SrabonGitikar, 2025-08-23 14:26:56

Motivation

Updating the weights in each layer is the core of tuning a Neural Network. Now consider a simple regression model. If I ask you what method we use to find the weights, you'll probably reply: the method of least squares. And that is the most common answer. But it should be noted that Least Squares may not always be the most convenient way to obtain the weights. In this blog, I want to tell you about a very powerful alternative method for computing them. If you have studied Numerical Analysis, you probably already know it: it's nothing but the very popular Newton-Raphson Method.

In general, this method is used to numerically solve equations that cannot be solved analytically. If we simply have to define the method, it goes as follows:

Consider the equation, $$$f(x)=0$$$. Then, we can solve for $$$x$$$ starting from any initial guess $$$x^{(0)}$$$; and for the $$$i$$$-th iteration, the update of the solution $$$x$$$ is given by,

$$$x^{(i)} = x^{(i-1)} - \frac{f\left(x^{(i-1)}\right)}{f'\left(x^{(i-1)}\right)}\quad[i \gt 0]$$$
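As a quick sanity check, the scalar update above can be sketched in a few lines of Python (the function and starting point here are illustrative choices, not from the text):

```python
# Newton-Raphson iteration for a scalar equation f(x) = 0.
def newton_raphson(f, f_prime, x0, iterations=10):
    x = x0
    for _ in range(iterations):
        x = x - f(x) / f_prime(x)  # x^(i) = x^(i-1) - f/f'
    return x

# Example: solve x^2 - 2 = 0, i.e. approximate sqrt(2).
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)  # converges to sqrt(2) ~ 1.41421356...
```

Thanks to quadratic convergence, a handful of iterations already pins the root down to machine precision.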

Now let's apply this to the weights $$$(\Theta \in \mathbb{R}^n)$$$ of a simple Linear Regression model.

Warning: Heavy Mathematics coming up!

The Idea

Let's consider a simple linear regression model with $$$m$$$ training examples and $$$n$$$ inputs; more formally, $$$x^{(i)} \in \mathbb{R}^n$$$ for every $$$i \in \{1, \ldots, m\}$$$. For simplicity, let's take $$$y \in \mathbb{R}$$$ here; in general, the output could be multidimensional as well.

Consider the loss function,

$$$J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big( h_\Theta(x^{(i)}) - y^{(i)} \Big)^2$$$ where $$$h_\Theta(x^{(i)})=\theta_0+\sum_{j=1}^{n}\theta_jx_j^{(i)}$$$ for $$$i \in \{1, \ldots, m\}$$$.
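In vectorized form this loss is straightforward to evaluate. Here is a minimal NumPy sketch on made-up toy data (the numbers are purely illustrative), with the intercept $$$\theta_0$$$ absorbed by a leading column of ones:

```python
import numpy as np

# Toy data: m = 4 examples, n = 2 features (values chosen arbitrarily).
X_raw = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 7.0])
m = X_raw.shape[0]
X = np.hstack([np.ones((m, 1)), X_raw])  # prepend x_0 = 1 for theta_0

def loss(theta):
    """J(theta) = (1 / 2m) * sum of squared residuals."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

print(loss(np.zeros(3)))  # all-zero weights give J = 10.03125 on this data
```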

First, let's talk about the Least Squares Method. In this method, to get the optimal weights $$$\Theta$$$, we simply set,

$$$\nabla_{\Theta} J(\Theta) = 0$$$

To be noted that,

$$$ \nabla_{\Theta} J(\Theta) = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \frac{\partial J}{\partial \theta_1} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} = \frac{1}{m} X^\top (X \Theta - y) $$$

where $$$X$$$ is the $$$m \times (n+1)$$$ design matrix whose first column is all ones (absorbing the intercept $$$\theta_0$$$) and $$$y$$$ is the vector of targets.

You can get this result simply by calculating each partial derivative $$$\frac{\partial J}{\partial \theta_j}$$$.

Now, we require $$$\nabla_{\Theta} J(\Theta) = 0$$$

Hence, $$$X^\top (X \Theta - y)=0$$$

which gives, assuming $$$X^\top X$$$ is invertible (i.e. the columns of $$$X$$$ are linearly independent),

$$$\Theta = (X^\top X)^{-1} X^\top y$$$
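This closed form is easy to verify numerically. A small NumPy sketch (toy data, arbitrary values) compares the normal equations against NumPy's built-in least-squares routine:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 7.0])

# Normal equations: solve (X^T X) theta = X^T y rather than forming
# the explicit inverse, which is numerically better behaved.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))  # True
```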

That's the final result that we obtain in this method. Now let's talk about the Newton-Raphson Method.

In the $$$k$$$-th iteration of the Newton-Raphson method, for multidimensional quantities, we have,

$$$ \Theta^{(k)} = \Theta^{(k-1)} - H(\Theta^{(k-1)})^{-1} \nabla_{\Theta} J(\Theta^{(k-1)}) $$$

where $$$H$$$ is the Hessian matrix, whose entries are

$$$H_{ij} = \frac{\partial^2 J(\Theta)}{\partial \theta_i \, \partial \theta_j}$$$

If you work out the matrix for our quadratic loss, you'll end up with $$$H(\Theta)=\nabla_{\Theta}^2 J(\Theta)=\frac{1}{m}X^\top X$$$, a constant matrix that does not depend on $$$\Theta$$$.

Let's begin the iterations of the Newton-Raphson method with $$$\Theta^{(0)}=\mathbf{0}$$$. Then we end up with $$$\Theta^{(1)}=-(\nabla_{\Theta}^2 J(\Theta^{(0)}))^{-1}\nabla_{\Theta} J(\Theta^{(0)})$$$.

Also, it is to be noted that $$$J(\Theta) = \frac{1}{2m} (X \Theta - y)^\top (X \Theta - y) = \frac{1}{2m} \text{tr}\big((X \Theta - y)^\top (X \Theta - y)\big)$$$, since the trace of a scalar is the scalar itself.

Carrying out the calculation at $$$\Theta^{(0)}=\mathbf{0}$$$: the gradient is $$$\nabla_{\Theta} J(\Theta^{(0)}) = \frac{1}{m}X^\top(X\cdot\mathbf{0} - y) = -\frac{1}{m}X^\top y$$$ and the Hessian is $$$\frac{1}{m}X^\top X$$$, so

$$$\Theta^{(1)} = \mathbf{0} + \left(\tfrac{1}{m}X^\top X\right)^{-1}\tfrac{1}{m}X^\top y = (X^\top X)^{-1} X^\top y$$$

Which is the same result as Least Squares!
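The whole argument can be checked numerically in a few lines: starting from the zero vector, one Newton step reproduces the least-squares solution (toy data again, values arbitrary):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 7.0])
m = X.shape[0]

theta0 = np.zeros(X.shape[1])        # Theta^(0) = 0
grad = X.T @ (X @ theta0 - y) / m    # gradient of J at Theta^(0)
H = X.T @ X / m                      # Hessian (constant for this loss)

# One Newton-Raphson step: Theta^(1) = Theta^(0) - H^{-1} grad
theta1 = theta0 - np.linalg.solve(H, grad)

theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta1, theta_ls))  # True
```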

Conclusion

While solving general scalar equations, we usually need several iterations of the Newton-Raphson method. Least Squares is already a standard method in Statistics and Machine Learning, and here I showed you that a single Newton-Raphson iteration, started from $$$\Theta^{(0)}=\mathbf{0}$$$, lands exactly on the Least Squares solution. This is no accident: the loss is quadratic, so the second-order Taylor expansion that Newton's method uses is exact, and one step jumps straight to the minimum; further iterations leave the answer unchanged. The real payoff comes with non-quadratic losses (logistic regression, for instance), where a handful of Newton iterations typically converge far faster than Gradient Descent. Neural Networks generally use Gradient Descent with a given learning rate to update weights, but analytically it is quite tough to work with Gradient Descent (we aren't perceptrons, after all). So when you do not have a computer, the Newton-Raphson method lets you reach the optimum with pen and paper in a single step.

And this brings us to the end of the long, mathematical discussion. I really hope you enjoyed reading it, thank you for reading so far!

Tags: machine learning, neural networks, math
