# Least Square Regression

## Chi-squared and maximum likelihood estimation

Suppose that our measurement data $y_i$ is independent of each other and obeys $N(\bar{y}_i, \sigma_i)$.
Then likelihood function $\mathcal{L}(\bar{\mathbf{y}} | \mathbf{y})$ is given by

\begin{align*}
\mathcal{L}(\bar{\mathbf{y}} | \mathbf{y}) &= \prod_i P(\bar{y}_i | y_i) \\
&= C \exp \left ( - \frac{1}{2} \sum_i \left(\frac{y_i-\bar{y}_i}{\sigma_i}\right)^2 \right)
\end{align*}
, for some constant $C$.

Define $\chi^2$ as

\begin{equation*}
\chi^2 = \sum_i \left(\frac{y_i-\bar{y}_i}{\sigma_i}\right)^2
\end{equation*}

then, 

\begin{equation*}
P(\bar{\mathbf{y}} | \mathbf{y}) = C \exp \left(-\chi^2/2 \right)
\end{equation*}

So, the log likelihood function $\log \mathcal{L}(\bar{\mathbf{y}} | \mathbf{y})$ is
\begin{equation*}
\log \mathcal{L}(\bar{\mathbf{y}}, \mathbf{y}) = \log C - \frac{\chi^2}{2}
\end{equation*}

Thus, maximizing likelihood or log-likelihood is the same as minimizing $\chi^2$.

In common fitting process we estimate $\bar{y}_i$ as 

\begin{equation*}
\bar{y}_i = f(x_i, \mathbf{\theta})
\end{equation*}

, so our likelihood, log-likelihood and chi-squared function are the function of fitting parameter $\mathbf{\theta}$.

## Linear Least Square

Suppose that our fitting function $f(x, \mathbf{\theta})$ is the linear combination of some function $g_i(x)$ which does not depend on $\mathbf{\theta}$.

\begin{align*}
f(x, \mathbf{\theta}) &= \sum_i \theta_i g_i(x) \\
&= \mathbf{\theta}^T \mathbf{g}(x)
\end{align*}

Define matrix $G$ as

\begin{equation*}
G = \left [ \frac{g_j(x_i)}{\sigma_i} \right ]_{i,j}
\end{equation*}

and set $\mathbf{y}' = \mathbf{y}/\mathbf{\sigma}$ then

\begin{equation*}
\chi^2(\mathbf{\theta}) = \| G \mathbf{\theta} - \mathbf{y}' \|^2
\end{equation*}

To minimize $\chi^2$, we require

\begin{equation*}
\frac{\partial \chi^2}{\partial \theta_i} = 0
\end{equation*}

Then, we have the following equation, usually called the normal equation.

\begin{equation*}
\mathbf{\theta} = (G^T G)^{-1} G^T \mathbf{y}'
\end{equation*}

The $(G^T G)^{-1}$ is called the parameter covariance matrix, which is denoted by ${Cov}$.

The standard error of paramter ${Err}(\mathbf{\theta})$ is defined as

\begin{equation*}
{Err}(\mathbf{\theta})^2 = \frac{\chi^2}{N-p} {diag}(Cov)
\end{equation*}

, where $N$ is the total number of data points, and $p$ is the number of parameter.

Note that the $G$ is also the scaled Jacobian of the model function $f(x, \mathbf{\theta})$ concerning parameter $\mathbf{\theta}$.

So, one can extend the definition of the parameter's standard error in linear least square regression to a non-linear one.

\begin{align*}
{Cov} &= (J^T J)^{-1} \\
{Err}(\mathbf{\theta})^2 &= \frac{\chi^2}{N-p} {diag}(Cov)
\end{align*}
, where $J$ is the scaled jacobian of non-linear model function $f(x, \mathbf{\theta})$ with respect to paramter $\mathbf{\theta}$.

Such parameter error estimation is called Asymptotic Standard Error.
However, strictly speaking, Asymptotic Standard Error estimation should not be used in non-linear least square regression.

Our package `TRXASprefitpack` provides an alternative error parameter estimation method based on the `F-test.`

## Alternative Paramter Error Estimation

Define $\chi^2_i(x)$ as

\begin{equation*}
\chi^2_i (x) = {arg}\,{min}_{\mathbf{\theta}, \theta_i = x} \chi^2 (\theta)
\end{equation*}

Then the number of parameters corresponding to $\chi^2_i$ is $P-1$.

### F-test based paramter error estimation

Let $\chi^2_0 = \chi^2(\theta_0)$ be the minimum chi-square value.
One can estimates confidence interval of $i$th optimal parameter $\theta_{0, i}$ with significant level $\alpha$ by

\begin{equation*}
F_{\alpha}(1, n-p) = \frac{\chi^2_i(\theta)-\chi^2_0}{\chi^2_0/(n-p)}
\end{equation*}

## Compare two different fit

Assume that model 2 is the restriction of model 1. Then, you can compare two models based on the f-test.

## Separation Scheme

Suppose that

\begin{equation*}
f(t, \mathbf{\theta}_{l}, \mathbf{\theta}_{nl}) = \mathbf{\theta}_{l}^T \mathbf{g}(t, \mathbf{\theta}_{nl})
\end{equation*}

Then

\begin{equation*}
 {arg}\,{min}_{\mathbf{\theta}_l, \mathbf{\theta_{nl}}} \chi^2 = 
 {\arg}\,{min}_{\mathbf{\theta}} \left({\arg}\,{min}_{\mathbf{\theta}_l} \chi^2(\mathbf{\theta}_l, \mathbf{\theta})\right)
\end{equation*}

The optimization problem

\begin{equation*}
{\arg}\,{min}_{\mathbf{\theta}_l} \chi^2(\mathbf{\theta}_l, \mathbf{\theta})
\end{equation*}

is just a linear least square problem described in a linear least square section, and we know the exact solution to such a problem.
Let $\mathbf{\theta}_{l} = \mathbf{C}(\mathbf{\theta})$ be the least norm solution of the linear least square problem then,

\begin{align*}
{arg}\,{min}_{\mathbf{\theta}} \chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta}) &= {arg}\,{min}_{\mathbf{\theta}_l, \mathbf{\theta_{nl}}} \chi^2 \\
\frac{\partial \chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta})}{\partial \mathbf{C}(\mathbf{\theta})}  &= 0 
\end{align*}

So, by chain rule the gradient of $\chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta})$ is

\begin{align*}
\frac{\partial \chi^2(\mathbf{C}, \mathbf{\theta})}{\partial \mathbf{\theta}} &= 
\frac{\partial \chi^2}{\partial \mathbf{C}(\mathbf{\theta})} \frac{\partial \mathbf{C}(\mathbf{\theta})}{\partial \mathbf{\theta}} 
+ \frac{\partial \chi^2}{\partial \mathbf{\theta}} \\
&= \frac{\partial \chi^2}{\partial \mathbf{\theta}}
\end{align*}

Because of $\frac{\partial \mathbf{C}(\mathbf{\theta})}{\partial \mathbf{\theta}}$ term, the analytic hessian of $\chi^2(\mathbf{C}, \mathbf{\theta})$ is quite complicated. Since v0.8, the analytic Hessian is implemented for the following three fitting drivers.

1. `fit_static_voigt` 
2. `fit_transient_exp`
3. `fit_transient_raise`

The Hessian of $\chi^2(\mathbf{C}, \mathbf{\theta})$ is

\begin{equation*}
\frac{\partial^2 \chi^2(\mathbf{C}, \mathbf{\theta})}{\partial \mathbf{\theta}_i \mathbf{\theta}_j} = \frac{\partial^2 \chi^2}{\partial \mathbf{\theta}_i \partial \mathbf{\theta}_j} + \sum_k \frac{\partial^2 \chi^2}{\partial \mathbf{\theta}_j \partial \mathbf{C}_k} \frac{\partial \mathbf{C}_k}{\partial \mathbf{\theta}_i}
\end{equation*}

Note that $\frac{\partial \chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta})}{\partial \mathbf{C}_j(\mathbf{\theta})} = 0$ for all $\theta$. Take derivative of $\mathbf{\theta}_i$ then

\begin{equation*}
\frac{\partial^2 \chi^2}{\partial \mathbf{\theta}_i \partial \mathbf{C}_j} + \sum_k \frac{\partial^2 \chi^2}{\partial \mathbf{C}_j \partial \mathbf{C}_k} \frac{\partial \mathbf{C}_k}{\partial \mathbf{\theta}_i} = 0.
\end{equation*}

For simplicity, denote $H_c = [\frac{\partial^2 \chi^2}{\partial C_i \partial C_j}]_{ij}$, $H_{\theta} = [\frac{\partial^2 \chi^2}{\partial \theta_i \partial \theta_j}]_{ij}$, $H_{\theta c} = [\frac{\partial^2 \chi^2}{\partial \theta_i \partial C_j}]_{ij}$, and $B = [\frac{\partial C_i}{\partial \theta_j}]_{ij}$.
Let the hessian matrix of $\chi^2(\mathbf{C}, \mathbf{\theta})$ be $H'$ then

\begin{align*}
H_{\theta c}^T &= - H_c B \\
H' &= H_{\theta} + H_{\theta_c} B \\
&= H_{\theta} - B^T H_c B
\end{align*}

The solution $B$ satisfying $H_{\theta c}^T = - H_c B$ is in general not unique. Let $B' = B + N$, where $H_c N = 0$. Then 
\begin{align*}
B'^T H_c B' &= B^T H_c B + N^T H_c B + B^T H_c N + N^T H_c N \\
&= B^T H_c B + (H_c N)^T B \\
&= B^T H_c B
\end{align*}

Therefore, $H'$ is well defined, even though $B$ is not unique.

The separation scheme reduces the dimension of the optimization problem, and the gradient of $\chi^2(C,\theta)$ is the same as that of the original $\chi^2$ function, 
so implementing a separation scheme will speed up the optimization process.