# Least Square Regression ## Chi-squared and maximum likelihood estimation Suppose that our measurement data $y_i$ is independent of each other and obeys $N(\bar{y}_i, \sigma_i)$. Then likelihood function $\mathcal{L}(\bar{\mathbf{y}} | \mathbf{y})$ is given by \begin{align*} \mathcal{L}(\bar{\mathbf{y}} | \mathbf{y}) &= \prod_i P(\bar{y}_i | y_i) \\ &= C \exp \left ( - \frac{1}{2} \sum_i \left(\frac{y_i-\bar{y}_i}{\sigma_i}\right)^2 \right) \end{align*} , for some constant $C$. Define $\chi^2$ as \begin{equation*} \chi^2 = \sum_i \left(\frac{y_i-\bar{y}_i}{\sigma_i}\right)^2 \end{equation*} then, \begin{equation*} P(\bar{\mathbf{y}} | \mathbf{y}) = C \exp \left(-\chi^2/2 \right) \end{equation*} So, the log likelihood function $\log \mathcal{L}(\bar{\mathbf{y}} | \mathbf{y})$ is \begin{equation*} \log \mathcal{L}(\bar{\mathbf{y}}, \mathbf{y}) = \log C - \frac{\chi^2}{2} \end{equation*} Thus, maximizing likelihood or log-likelihood is the same as minimizing $\chi^2$. In common fitting process we estimate $\bar{y}_i$ as \begin{equation*} \bar{y}_i = f(x_i, \mathbf{\theta}) \end{equation*} , so our likelihood, log-likelihood and chi-squared function are the function of fitting parameter $\mathbf{\theta}$. ## Linear Least Square Suppose that our fitting function $f(x, \mathbf{\theta})$ is the linear combination of some function $g_i(x)$ which does not depend on $\mathbf{\theta}$. \begin{align*} f(x, \mathbf{\theta}) &= \sum_i \theta_i g_i(x) \\ &= \mathbf{\theta}^T \mathbf{g}(x) \end{align*} Define matrix $G$ as \begin{equation*} G = \left [ \frac{g_j(x_i)}{\sigma_i} \right ]_{i,j} \end{equation*} and set $\mathbf{y}' = \mathbf{y}/\mathbf{\sigma}$ then \begin{equation*} \chi^2(\mathbf{\theta}) = \| G \mathbf{\theta} - \mathbf{y}' \|^2 \end{equation*} To minimize $\chi^2$, we require \begin{equation*} \frac{\partial \chi^2}{\partial \theta_i} = 0 \end{equation*} Then, we have the following equation, usually called the normal equation. \begin{equation*} \mathbf{\theta} = (G^T G)^{-1} G^T \mathbf{y}' \end{equation*} The $(G^T G)^{-1}$ is called the parameter covariance matrix, which is denoted by ${Cov}$. The standard error of paramter ${Err}(\mathbf{\theta})$ is defined as \begin{equation*} {Err}(\mathbf{\theta})^2 = \frac{\chi^2}{N-p} {diag}(Cov) \end{equation*} , where $N$ is the total number of data points, and $p$ is the number of parameter. Note that the $G$ is also the scaled Jacobian of the model function $f(x, \mathbf{\theta})$ concerning parameter $\mathbf{\theta}$. So, one can extend the definition of the parameter's standard error in linear least square regression to a non-linear one. \begin{align*} {Cov} &= (J^T J)^{-1} \\ {Err}(\mathbf{\theta})^2 &= \frac{\chi^2}{N-p} {diag}(Cov) \end{align*} , where $J$ is the scaled jacobian of non-linear model function $f(x, \mathbf{\theta})$ with respect to paramter $\mathbf{\theta}$. Such parameter error estimation is called Asymptotic Standard Error. However, strictly speaking, Asymptotic Standard Error estimation should not be used in non-linear least square regression. Our package `TRXASprefitpack` provides an alternative error parameter estimation method based on the `F-test.` ## Alternative Paramter Error Estimation Define $\chi^2_i(x)$ as \begin{equation*} \chi^2_i (x) = {arg}\,{min}_{\mathbf{\theta}, \theta_i = x} \chi^2 (\theta) \end{equation*} Then the number of parameters corresponding to $\chi^2_i$ is $P-1$. ### F-test based paramter error estimation Let $\chi^2_0 = \chi^2(\theta_0)$ be the minimum chi-square value. One can estimates confidence interval of $i$th optimal parameter $\theta_{0, i}$ with significant level $\alpha$ by \begin{equation*} F_{\alpha}(1, n-p) = \frac{\chi^2_i(\theta)-\chi^2_0}{\chi^2_0/(n-p)} \end{equation*} ## Compare two different fit Assume that model 2 is the restriction of model 1. Then, you can compare two models based on the f-test. ## Separation Scheme Suppose that \begin{equation*} f(t, \mathbf{\theta}_{l}, \mathbf{\theta}_{nl}) = \mathbf{\theta}_{l}^T \mathbf{g}(t, \mathbf{\theta}_{nl}) \end{equation*} Then \begin{equation*} {arg}\,{min}_{\mathbf{\theta}_l, \mathbf{\theta_{nl}}} \chi^2 = {\arg}\,{min}_{\mathbf{\theta}} \left({\arg}\,{min}_{\mathbf{\theta}_l} \chi^2(\mathbf{\theta}_l, \mathbf{\theta})\right) \end{equation*} The optimization problem \begin{equation*} {\arg}\,{min}_{\mathbf{\theta}_l} \chi^2(\mathbf{\theta}_l, \mathbf{\theta}) \end{equation*} is just a linear least square problem described in a linear least square section, and we know the exact solution to such a problem. Let $\mathbf{\theta}_{l} = \mathbf{C}(\mathbf{\theta})$ be the least norm solution of the linear least square problem then, \begin{align*} {arg}\,{min}_{\mathbf{\theta}} \chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta}) &= {arg}\,{min}_{\mathbf{\theta}_l, \mathbf{\theta_{nl}}} \chi^2 \\ \frac{\partial \chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta})}{\partial \mathbf{C}(\mathbf{\theta})} &= 0 \end{align*} So, by chain rule the gradient of $\chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta})$ is \begin{align*} \frac{\partial \chi^2(\mathbf{C}, \mathbf{\theta})}{\partial \mathbf{\theta}} &= \frac{\partial \chi^2}{\partial \mathbf{C}(\mathbf{\theta})} \frac{\partial \mathbf{C}(\mathbf{\theta})}{\partial \mathbf{\theta}} + \frac{\partial \chi^2}{\partial \mathbf{\theta}} \\ &= \frac{\partial \chi^2}{\partial \mathbf{\theta}} \end{align*} Because of $\frac{\partial \mathbf{C}(\mathbf{\theta})}{\partial \mathbf{\theta}}$ term, the analytic hessian of $\chi^2(\mathbf{C}, \mathbf{\theta})$ is quite complicated. Since v0.8, the analytic Hessian is implemented for the following three fitting drivers. 1. `fit_static_voigt` 2. `fit_transient_exp` 3. `fit_transient_raise` The Hessian of $\chi^2(\mathbf{C}, \mathbf{\theta})$ is \begin{equation*} \frac{\partial^2 \chi^2(\mathbf{C}, \mathbf{\theta})}{\partial \mathbf{\theta}_i \mathbf{\theta}_j} = \frac{\partial^2 \chi^2}{\partial \mathbf{\theta}_i \partial \mathbf{\theta}_j} + \sum_k \frac{\partial^2 \chi^2}{\partial \mathbf{\theta}_j \partial \mathbf{C}_k} \frac{\partial \mathbf{C}_k}{\partial \mathbf{\theta}_i} \end{equation*} Note that $\frac{\partial \chi^2(\mathbf{C}(\mathbf{\theta}), \mathbf{\theta})}{\partial \mathbf{C}_j(\mathbf{\theta})} = 0$ for all $\theta$. Take derivative of $\mathbf{\theta}_i$ then \begin{equation*} \frac{\partial^2 \chi^2}{\partial \mathbf{\theta}_i \partial \mathbf{C}_j} + \sum_k \frac{\partial^2 \chi^2}{\partial \mathbf{C}_j \partial \mathbf{C}_k} \frac{\partial \mathbf{C}_k}{\partial \mathbf{\theta}_i} = 0. \end{equation*} For simplicity, denote $H_c = [\frac{\partial^2 \chi^2}{\partial C_i \partial C_j}]_{ij}$, $H_{\theta} = [\frac{\partial^2 \chi^2}{\partial \theta_i \partial \theta_j}]_{ij}$, $H_{\theta c} = [\frac{\partial^2 \chi^2}{\partial \theta_i \partial C_j}]_{ij}$, and $B = [\frac{\partial C_i}{\partial \theta_j}]_{ij}$. Let the hessian matrix of $\chi^2(\mathbf{C}, \mathbf{\theta})$ be $H'$ then \begin{align*} H_{\theta c}^T &= - H_c B \\ H' &= H_{\theta} + H_{\theta_c} B \\ &= H_{\theta} - B^T H_c B \end{align*} The solution $B$ satisfying $H_{\theta c}^T = - H_c B$ is in general not unique. Let $B' = B + N$, where $H_c N = 0$. Then \begin{align*} B'^T H_c B' &= B^T H_c B + N^T H_c B + B^T H_c N + N^T H_c N \\ &= B^T H_c B + (H_c N)^T B \\ &= B^T H_c B \end{align*} Therefore, $H'$ is well defined, even though $B$ is not unique. The separation scheme reduces the dimension of the optimization problem, and the gradient of $\chi^2(C,\theta)$ is the same as that of the original $\chi^2$ function, so implementing a separation scheme will speed up the optimization process.