# Model identification
An identification problem is essentially a parametric optimization problem: we have a set of data $y(1), y(2), \dots$ (and possibly inputs $u(1), u(2), \dots$) and we want to find the best model that approximates these data. Here we focus on **parametric** identification of dynamic systems.
The idea of building a model using data as the source is the foundation of what we will see in [Machine Learning](../../Machine%20Learning/Machine%20Learning.md).

Choice of the criterion $J_N(\theta) > 0$: we follow a **predictive approach**.
Ideal objective:
$
J(\theta) = \mathbb{E}\left[(y(t+1) - \hat{y}(t+1|t,\theta))^2\right]
$
A good model should return a low variance of the 1-step-ahead prediction error.
Sample version of the objective:
$
J_N(\theta) = \frac{1}{N}\sum_{t=1}^N (y(t) - \hat{y}(t|t-1,\theta))^2
$
Why 1-step-ahead prediction, and not $2$-step or $k$-step?
If $\hat{y}(t+1|t)$ is the optimal predictor of $y(t+1)$, the variance of $\epsilon(t+1|t)$ is:
$
\text{var}[\epsilon(t+1|t)] = \text{var}[e(t)] = \lambda^2
$
so we obtain $\lambda^2$ (or better, an estimate of $\lambda^2$) for free.
The criterion is a function
$
J_N(\theta): \mathbb{R}^{n_\theta} \rightarrow \mathbb{R}^+
$
Two relevant situations:
1. $J_N(\theta)$ quadratic (AR, ARX)
2. $J_N(\theta)$ non-quadratic (ARMA, MA, ARMAX)
We made a few assumptions:
- $S$ (the true system) lies within the chosen model set
- $n_a$, $n_b$, $n_c$ are fixed a priori
- the data may have been collected by somebody else (and could be non-informative)
## Identification of AR/ARX Models
Generic expression of $AR(X)$:
$
M(\theta): y(t) = \frac{B(z)}{A(z)}u(t-d) + \frac{1}{A(z)}e(t), \quad e(t) \sim WN(0,\lambda^2)$
with:
$A(z) = 1 - a_1z^{-1} - a_2z^{-2} - ... - a_mz^{-m}$
$B(z) = b_0 + b_1z^{-1} + b_2z^{-2} + ... + b_{p-1}z^{-(p-1)}$
$C(z) = 1 \quad \text{(ARX model! No MA part)}$
$\theta = [a_1 \, a_2 \, ... \, a_{n_a} \, b_1 \, b_2 \, ... \, b_{n_b}]^T$ are the parameters, belonging to $\mathbb{H}$ (the parameter domain), and $\varphi(t)$ is the corresponding regressor vector:
$\theta=\begin{bmatrix}a_1\\\vdots\\a_{n_a}\\b_1\\\vdots\\b_{n_b}\end{bmatrix}\quad\varphi(t)=\begin{bmatrix}y(t-1)\\\vdots\\y(t-n_a)\\u(t-1)\\\vdots\\u(t-n_b)\end{bmatrix}$
Express the 1-step ahead predictor:
$
\hat{y}(t|t-1,\theta) = \phi(t)^T \theta
$
Define the prediction error criterion as we said before:
$J_N(\theta)=\frac{1}{N}\sum_{t=1}^{N}\left(y(t)-\hat{y}(t\mid t-1,\theta)\right)^2$
A good model should return a low empirical variance of the prediction error. Substitute the predictor expression:
$
J_N(\theta) = \frac{1}{N} \sum_{t=1}^N (y(t) - \phi(t)^T \theta)^2
$
Take the derivative with respect to $\theta$ and set to zero:
$\frac{\partial J_{N}(\theta)}{\partial\theta}=\frac{d}{d\theta}[J_{N}(\theta)]=\frac{d}{d\theta}\left[\frac{1}{N}\sum_{t=1}^{N}(y(t)-\varphi(t)^{T}\theta)^{2}\right]=0$
$\begin{aligned}
\frac{\partial J_N(\theta)}{\partial\theta}& =-\frac{2}{N}\sum_{t=1}^N\left(y(t)-\varphi(t)^T\theta\right)\varphi(t)^T \\
&=-\frac{2}{N}\left(\sum_{t=1}^{N}y(t)\varphi(t)^T-\sum_{t=1}^{N}\theta^T\varphi(t)\varphi(t)^T\right)=0\end{aligned}$
Rearrange to get the normal equations:
$
\left[\frac{1}{N} \sum_{t=1}^N \varphi(t)\varphi(t)^T\right] \theta = \frac{1}{N} \sum_{t=1}^N \varphi(t)y(t)
$
or equivalently (the $\frac{1}{N}$ factors cancel):
$\left[\sum_{t=1}^{N}\varphi(t)\varphi(t)^{T}\right]\hat{\theta}_{N}=\sum_{t=1}^{N}\varphi(t)y(t)$
We obtain the Least Squares (LS) method for the AR(X) model family:
$\hat{\theta}_N=\left[\sum_{t=1}^N\varphi(t)\varphi(t)^T\right]^{-1}\sum_{t=1}^N\varphi(t)y(t)$
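As an illustration, here is a minimal numpy sketch of this LS estimate (the function name, the regressor-building details and the input-delay convention are assumptions of this sketch, not prescribed by the course):

```python
import numpy as np

def arx_least_squares(y, u, na, nb, d=1):
    """Least Squares estimate of an ARX(na, nb) model with input delay d.

    Predictor: y_hat(t) = a1*y(t-1) + ... + a_na*y(t-na)
                        + b1*u(t-d) + ... + b_nb*u(t-d-nb+1)
    """
    y, u = np.asarray(y, float), np.asarray(u, float)
    t0 = max(na, d + nb - 1)                      # first t with a complete regressor
    Phi = np.array([
        np.concatenate([y[t - na:t][::-1],                   # y(t-1), ..., y(t-na)
                        u[t - d - nb + 1:t - d + 1][::-1]])  # u(t-d), ..., u(t-d-nb+1)
        for t in range(t0, len(y))
    ])
    Y = y[t0:]
    # Solve the normal equations [sum phi phi^T] theta = sum phi y
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to explicitly inverting $\sum_t \varphi(t)\varphi(t)^T$.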
To check if $\hat{\theta}_N$ corresponds to a minimum of the cost function $J_N(\theta)$, we need to examine the first and second derivatives of $J_N(\theta)$.
The first derivative is given by:
$
\frac{\partial J_N(\theta)}{\partial \theta} = -\frac{2}{N} \sum_{t=1}^N \phi(t)(y(t) - \phi(t)^T \theta)
$
Setting this to zero gives us the condition for a stationary point.
The second derivative is:
$
\frac{\partial^2 J_N(\theta)}{\partial \theta^2} = \frac{2}{N} \sum_{t=1}^N \phi(t) \phi(t)^T
$
For $\hat{\theta}_N$ to be a minimum, the Hessian must be positive definite:
$
x^T M x > 0 \quad \forall x \neq 0
$
Applying this to our Hessian:
$
x^T \left(\frac{2}{N} \sum_{t=1}^N \phi(t) \phi(t)^T\right) x > 0 \quad \forall x \neq 0
$
This can be rewritten as:
$
\frac{2}{N} \sum_{t=1}^N (x^T \phi(t))^2 > 0
$
This quantity is always $\geq 0$, so $\hat{\theta}_N$ is indeed a (possibly non-unique) minimum. Moreover, the LS estimates converge asymptotically to $\Delta$ (the set of global minima of $J(\theta)$), and two cases are possible:
1. $\mathbb{E}[\varphi(t)\varphi(t)^T]$ non-singular $\rightarrow \Delta = \{\theta^*\}$ and we have **identifiability**
2. $\mathbb{E}[\varphi(t)\varphi(t)^T]$ singular $\rightarrow J(\theta)$ has $\infty$ global minima, including $\theta^*$

If multiple global minima are found, something is wrong, since we know that a single system generated the data. Multiple minima can appear because of:
- **over**-parametrization: the chosen model class is too complex for the system
- **under**-excitation: the data is not representative/informative enough
In short, we have **identifiability** when there is only **one** solution. Since AR(X) models have no MA part ($C(z)=1$), explicit minimization is possible:
$\hat{\theta}_N=\left(\sum_{t=1}^N\varphi(t)\varphi(t)^T\right)^{-1}\sum_{t=1}^N\varphi(t)\,y(t)$
In the above formula, the crucial thing is that the matrix $\left[\sum_{t=1}^N\varphi(t)\varphi(t)^T\right]$ is invertible: since it is always positive semi-definite, it must actually be positive definite.
A necessary condition for invertibility is that the input $u(t)$ is **persistently exciting** of order $n$ (with $n \geq n_b$). Persistent excitation of order $n$ is a property of the $n \times n$ Toeplitz matrix of the input autocovariances:
$\begin{bmatrix}\gamma_{uu}(0)&\gamma_{uu}(1)&\cdots&\gamma_{uu}(n-1)\\\gamma_{uu}(1)&\gamma_{uu}(0)&\ddots&\vdots\\\vdots&\ddots&\ddots&\gamma_{uu}(1)\\\gamma_{uu}(n-1)&\cdots&\gamma_{uu}(1)&\gamma_{uu}(0)\end{bmatrix}$
This matrix must be invertible: it is a necessary condition for the matrix $\bar R = \left[\sum_{t=1}^N\varphi(t)\varphi(t)^T\right]$ to be invertible!
Note that a $WN$ input is **persistently exciting** of any order, since its autocovariance matrix is the identity multiplied by $\lambda^2$ (surely invertible):
$\begin{bmatrix}\gamma_{uu}(0)&\gamma_{uu}(1)&\gamma_{uu}(2)&\cdots\\\gamma_{uu}(1)&\gamma_{uu}(0)&\gamma_{uu}(1)&\ddots\\\vdots&\ddots&\ddots&\ddots\end{bmatrix}=\begin{bmatrix}\lambda^2&0&0&\cdots\\0&\lambda^2&0&\ddots\\\vdots&\ddots&\ddots&\ddots\end{bmatrix}=\lambda^2I$
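A small sketch (helper names are illustrative) that estimates the input autocovariances and checks persistent excitation of a given order by testing whether the Toeplitz matrix above is positive definite:

```python
import numpy as np
from scipy.linalg import toeplitz

def sample_autocov(u, tau):
    """Biased sample autocovariance gamma_uu(tau)."""
    u = np.asarray(u, float) - np.mean(u)
    N = len(u)
    return np.sum(u[:N - tau] * u[tau:]) / N

def is_persistently_exciting(u, order):
    """True if the order x order Toeplitz autocovariance matrix is positive definite."""
    gammas = [sample_autocov(u, k) for k in range(order)]
    R = toeplitz(gammas)
    eigvals = np.linalg.eigvalsh(R)
    return eigvals.min() > 1e-10 * max(eigvals.max(), 1.0)

# A white-noise input is persistently exciting of any order;
# a constant input only of order 1, a single sinusoid of order 2, and so on.
u_wn = np.random.normal(size=5000)
print(is_persistently_exciting(u_wn, order=10))   # expected: True
```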
## ARMA/ARMAX model identification
ARMA (or ARMAX) identification differs from AR (or ARX) identification because the model is no longer linear in the parameters. Traditional methods like Least Squares cannot be applied directly: we must use an iterative numerical method (such as gradient descent), starting from an initial estimate $\theta^{(1)}$ and applying an update rule that computes the next guess for $\theta$ from the previous iteration.
ARMA/ARMAX generic formula:
$M(\theta): y(t) = \frac{B(z)}{A(z)}u(t-d) + \frac{C(z)}{A(z)}e(t), \quad e(t) \sim WN(0,\lambda^2)$
where $A(z)$, $B(z)$, and $C(z)$ are polynomials in $z^{-1}$ and $\theta = [a_1 \, a_2 \, ... \, a_m \, b_0 \, b_1 \, ... \, b_{p-1} \, c_1 \, c_2 \, ... \, c_n]^T$.
Applying the 1-step-ahead predictor, the criterion becomes
$
J_N(\theta) = \frac{1}{N} \sum_{t=1}^N \varepsilon(t|t-1,\theta)^2
$
where the prediction error is:
$\varepsilon(t|t-1,\theta) = y(t) - \hat{y}(t|t-1,\theta)=$
$=y(t) - \frac{C(z)-A(z)}{C(z)}y(t) - \frac{B(z)}{C(z)}u(t-d)$
$=\left[1 - \frac{C(z)-A(z)}{C(z)}\right]y(t) - \frac{B(z)}{C(z)}u(t-d)$
$=\frac{A(z)}{C(z)}y(t) - \frac{B(z)}{C(z)}u(t-d)$
The prediction error has a nonlinear dependence on the coefficients of the $C(z)$ polynomial $(c_1, c_2, ..., c_n)$. This nonlinearity arises because $C(z)$ appears in the denominator of both terms.
As a consequence, $J_N$ (the mean square of $\varepsilon$) is no longer a quadratic function of the parameter vector $\theta$.
Because of this, ARMA/ARMAX models need iterative approaches with update rules based on gradient and Hessian information.
This iterative update rule comes in three main flavors in this course (a minimal sketch in code follows the list):
- **Newton's rule**:
$\theta^{(i+1)}=\theta^{(i)}-[\text{Hessian}]^{-1}\cdot \text{gradient}$
- **Gradient descent**: this is what is used for [gradient descent in neural networks](../../Artificial%20Neural%20Networks%20and%20Deep%20Learning/src/02.%20FFNN%20in%20practice.md#Gradient%20descent%20algorithms), where it is more effective since there are a lot of parameters (computing the Hessian would be too expensive).
$\theta^{(i+1)}=\theta^{(i)}-\eta\cdot \text{gradient}$
- **Quasi-Newton's rule**: the variant most often used in the system identification framework.
$\theta^{(i+1)}=\theta^{(i)}-[\text{approximate Hessian}]^{-1}\cdot\text{gradient}$
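A minimal generic sketch of the three rules, assuming we already have the gradient and an (approximate) Hessian of $J_N$ at the current $\theta$ (all names are illustrative):

```python
import numpy as np

def newton_step(theta, grad, hess):
    # theta(i+1) = theta(i) - Hessian^-1 * gradient
    return theta - np.linalg.solve(hess, grad)

def gradient_descent_step(theta, grad, eta=0.01):
    # theta(i+1) = theta(i) - eta * gradient
    return theta - eta * grad

def quasi_newton_step(theta, grad, approx_hess):
    # same as Newton, but with the cheaper (and always PSD) Hessian approximation
    return theta - np.linalg.solve(approx_hess, grad)
```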
### Quasi-Newton method proof
**TLDR**: the core of the Quasi-Newton method proof is:
1) Approximate the cost function $J_N$ with its Taylor expansion;
2) Work on the Taylor expansion to obtain the update rule;
3) Conclude that (in the quasi-Newton case) all you need is the first derivative of $\epsilon$ with respect to the parameter vector;
4) Define the auxiliary signals $\alpha$, $\beta$, $\gamma$ to speed up the computations;
5) Get to the final result.
Newton's method basically involves taking the quadratic expansion of the cost function around the current estimate and finding the minimum of the resulting paraboloid to update the parameters.

The approximating function is obtained from the Taylor expansion of $J_N(\theta)$ around $\theta^{(i)}$:
$V(\theta)=J_N(\theta)|_{\theta=\theta^{(i)}} + (\theta-\theta^{(i)})^\top\left.\frac{\partial J_N(\theta)}{\partial\theta}\right|_{\theta=\theta^{(i)}} + \frac{1}{2}(\theta-\theta^{(i)})^\top\left.\frac{\partial^2 J_N(\theta)}{\partial\theta^2}\right|_{\theta=\theta^{(i)}}(\theta-\theta^{(i)})$
Minimizing this quadratic approximation gives the update rule:
$\theta^{(i+1)} = \theta^{(i)} - \left[\frac{\partial^2 J_N(\theta)}{\partial \theta^2}\bigg|_{\theta=\theta^{(i)}}\right]^{-1} \frac{\partial J_N(\theta)}{\partial \theta}\bigg|_{\theta=\theta^{(i)}}$
So we need both the **gradient** and the **hessian**.
The gradient of the cost function is given by:
$\frac{\partial J_N(\theta)}{\partial \theta} = \frac{2}{N} \sum_{t=1}^N \varepsilon(t|t-1,\theta) \cdot \frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}$
The Hessian is obtained by differentiating the gradient (so it is the derivative of a product):
$ \frac{\partial^2 J_N(\theta)}{\partial \theta^2} = \frac{2}{N} \sum_{t=1}^N \frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta} \left(\frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}\right)^T + \frac{2}{N} \sum_{t=1}^N \varepsilon(t|t-1,\theta) \frac{\partial^2 \varepsilon(t|t-1,\theta)}{\partial \theta^2} $
The second-order term is usually neglected for simplicity (with this approximation passage we are moving from "Newton's rule" to "Quasi-Newton rule"):
$\frac{\partial^2 J_N(\theta)}{\partial \theta^2} \approx \frac{2}{N} \sum_{t=1}^N \frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta} \left(\frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}\right)^T$
With this approximation the (approximate) Hessian is always positive semi-definite, which is useful because we always get the "right" (upward) concavity.
> "I'm sure to take the right paraboloid"

and Newton's rule becomes:
$\theta^{(i+1)} = \theta^{(i)} - \left[\sum_{t=1}^N \frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta} \frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}^T\right]^{-1} \left[\sum_{t=1}^N \varepsilon(t|t-1,\theta)\frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}\right]$
#### Computing the gradient of $\epsilon$
Now that we have the update rule, we need to compute the protagonist $\frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}$: it is the vector of partial derivatives with respect to each parameter in $\theta = [a_1, a_2, ..., a_m, b_1, ..., c_1, ..., c_n]^T$, evaluated at the current estimate $\theta^{(i)}$.
3 auxiliary signals are defined to simplify computations:
$\alpha(t) = -\frac{1}{C(z)}y(t)$
$\beta(t) = -\frac{1}{C(z)}u(t)$
$\gamma(t) = -\frac{1}{C(z)}\varepsilon(t|t-1,\theta)$
All of these are time signals: the derivative of a signal with respect to a parameter is another (time-dependent) signal.
Note that $\gamma(t)$ depends on the current value of $\theta$.
The gradient of the prediction error with respect to the parameter vector $\theta$ can then be expressed as:
$\frac{\partial\varepsilon(t|t-1,\theta)}{\partial\theta}=[\alpha(t-1)\;\ldots\;\alpha(t-m)\;\;\beta(t-1)\;\ldots\;\beta(t-p)\;\;\gamma(t-1)\;\ldots\;\gamma(t-n)]^{T}$
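A sketch (assuming `scipy.signal.lfilter`, the conventions of these notes and input delay $d=1$; function names are illustrative) of how $\alpha$, $\beta$, $\gamma$ and the gradient vector can be computed by filtering through $1/C(z)$:

```python
import numpy as np
from scipy.signal import lfilter

def auxiliary_signals(y, u, eps, c):
    """alpha, beta, gamma obtained by filtering y, u, eps through -1/C(z),
    where c = [c1, ..., cn] and C(z) = 1 + c1 z^-1 + ... + cn z^-n."""
    C = np.concatenate(([1.0], np.asarray(c, float)))
    alpha = -lfilter([1.0], C, y)     # alpha(t) = -(1/C(z)) y(t)
    beta = -lfilter([1.0], C, u)      # beta(t)  = -(1/C(z)) u(t)
    gamma = -lfilter([1.0], C, eps)   # gamma(t) = -(1/C(z)) eps(t|t-1)
    return alpha, beta, gamma

def eps_gradient(alpha, beta, gamma, t, m, p, n):
    """d eps(t|t-1,theta)/d theta = [alpha(t-1)..alpha(t-m),
    beta(t-1)..beta(t-p), gamma(t-1)..gamma(t-n)]^T."""
    return np.concatenate([alpha[t - m:t][::-1],
                           beta[t - p:t][::-1],
                           gamma[t - n:t][::-1]])
```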
After computing the signals $\alpha(t, \theta^{(i)})$, $\beta(t, \theta^{(i)})$, $\gamma(t, \theta^{(i)})$, to find the minimum we would also need the Hessian (which is a matrix):
$\frac{\partial^2\varepsilon(t|t-1,\theta)}{\partial\theta^2}$
But using the "approximate Hessian" we just need:
$\frac{\partial^2 J_N(\theta)}{\partial \theta^2} \approx \frac{2}{N} \sum_{t=1}^N \frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta} \left(\frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}\right)^T$
(so no second derivative of $\varepsilon$ is needed).
Skipping the lengthy computations, the final update rule is:
$\theta^{(i+1)} = \theta^{(i)} - \left[\frac{1}{N}\sum_{t=1}^N \frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta} \left(\frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}\right)^T\right]^{-1} \left[\frac{1}{N}\sum_{t=1}^N \varepsilon(t|t-1,\theta)\frac{\partial \varepsilon(t|t-1,\theta)}{\partial \theta}\right]$
(the $\frac{1}{N}$ factors cancel, so they can also be dropped)
which is (recall):
$\theta^{(i+1)}=\theta^{(i)}-[\text{Hessian}]^{-1}\cdot \text{gradient}$
The iterative process continues until a convergence criterion is met, such as a small change in the parameter estimates or a maximum number of iterations is reached.
This approach allows for the identification of ARMA and ARMAX models, which are nonlinear in parameters, unlike the simpler AR and ARX models that can be solved using least squares.
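Putting everything together, here is a rough end-to-end sketch of one quasi-Newton update for an ARMAX model (it reuses `auxiliary_signals` and `eps_gradient` from the sketch above; the function names, the delay $d=1$ and the sign conventions follow these notes but are otherwise my own assumptions):

```python
import numpy as np
from scipy.signal import lfilter

def prediction_error(y, u, a, b, c, d=1):
    """eps(t|t-1) = (A(z)/C(z)) y(t) - (B(z)/C(z)) u(t-d),
    with A(z) = 1 - a1 z^-1 - ..., C(z) = 1 + c1 z^-1 + ..."""
    A = np.concatenate(([1.0], -np.asarray(a, float)))
    B = np.concatenate((np.zeros(d), np.asarray(b, float)))  # delay of d samples
    C = np.concatenate(([1.0], np.asarray(c, float)))
    return lfilter(A, C, y) - lfilter(B, C, u)

def armax_quasi_newton_update(y, u, theta, m, p, n):
    """One iteration: theta(i+1) = theta(i) - [sum grad grad^T]^-1 [sum eps grad]."""
    a, b, c = theta[:m], theta[m:m + p], theta[m + p:]
    eps = prediction_error(y, u, a, b, c)
    alpha, beta, gamma = auxiliary_signals(y, u, eps, c)
    t0 = max(m, p, n)                       # first t with a full gradient history
    H = np.zeros((m + p + n, m + p + n))    # approximate Hessian (outer products)
    g = np.zeros(m + p + n)                 # gradient of J_N (up to the 2/N factor)
    for t in range(t0, len(y)):
        de = eps_gradient(alpha, beta, gamma, t, m, p, n)
        H += np.outer(de, de)
        g += eps[t] * de
    return theta - np.linalg.solve(H, g)
```

In practice this step would be repeated until the convergence criterion mentioned above (small parameter change or maximum number of iterations) is met.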

## Dummy identification exercise
In the exam, something like this can happen:
$y(t) = e(t) + \frac{1}{2}e(t-1)$
where $e(t) \sim WN(0,1)$
which we would like to identify with a model $M$ :
$y(t) = \eta(t) + b\eta(t-1)$
where $\eta(t) \sim WN(0,\lambda^2)$
We must not be noobs here: using our brain, without even moving the pen, we can recognize that with $b^* = \frac{1}{2}$ we have $\lambda^{2*} = 1$.
The equivalence of the two models is easy to check:
$\varepsilon(t|t-1) = \frac{A_m(z)}{C_m(z)}y(t) = \frac{1}{1+bz^{-1}}y(t) = \frac{1}{1+bz^{-1}}(1+\frac{1}{2}z^{-1})e(t)$
$\varepsilon(t|t-1) = \frac{1+\frac{1}{2}z^{-1}}{1+bz^{-1}}e(t)$
For $\varepsilon(t)$ to be white noise, equal to $e(t)$ and have $\lambda^{2*} = 1$ we need:
$b^* = \frac{1}{2}$
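Just to double-check numerically, a quick sketch (purely illustrative) that simulates the true MA(1) process and evaluates the empirical prediction-error variance $J_N(b)$ over a grid of candidate $b$ values:

```python
import numpy as np
from scipy.signal import lfilter

np.random.seed(0)
N = 50_000
e = np.random.normal(0.0, 1.0, N)            # e(t) ~ WN(0, 1)
y = lfilter([1.0, 0.5], [1.0], e)            # y(t) = e(t) + 0.5 e(t-1)

bs = np.linspace(-0.9, 0.9, 181)             # candidate b values (step 0.01)
J = [np.mean(lfilter([1.0], [1.0, b], y) ** 2) for b in bs]  # J_N(b): variance of eps

print(bs[int(np.argmin(J))], min(J))         # expected: b* ~ 0.5, J ~ lambda^2 = 1
```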
## Non-parametric identification
So far, if we want to infer something about the underlying stochastic process, we need to learn a model first, i.e., perform a **parametric identification**.
Now let's try a non-parametric approach: estimate $m_y$, $\gamma_y(\tau)$ and $\Gamma_y(\omega)$ directly from data, without first identifying a full model $W(z)$.
But before diving into the estimators, we need to define what we mean by a "good estimator". We will use two definitions:
- **Correctness:** An estimator is considered correct (or unbiased) if its expected value is equal to the true parameter it is estimating, indicating that the mean of many independent estimates would converge to the true parameter. $E[{\hat{\mu}}_n] = \mu$ where $\mu$ is the real mean.
- **Consistency:** An estimator is consistent if the probability of the estimates being close to the true parameter increases as the sample size grows, meaning that with more data, the estimates become progressively more precise. $Var[\hat s_N ] \rightarrow 0$ as $N \rightarrow \infty$ .
### Estimation of Mean
The most natural estimator is the sample mean $\hat{\mu}_N$, where $N$ is the number of samples:
$\hat\mu_N=\frac1N\sum_{i=1}^N y(i)$
For stationary ARMA processes this estimator is correct and consistent: since the process correlation fades with the lag, every additional sample adds information, without relying on the full historical memory of the process.
### Estimation of Covariance
$\hat\gamma_{y}(\tau)=\frac{1}{N-\tau}\sum_{t=1}^{N-\tau}y(t)y(t+\tau)$
This estimator is consistent for stationary ARMA processes, but it is only asymptotically correct.
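A minimal numpy sketch of the two estimators above (names are illustrative; the estimated mean is removed before computing the autocovariance):

```python
import numpy as np

def sample_mean(y):
    return np.mean(y)

def sample_autocovariance(y, tau):
    """gamma_hat_y(tau) with the 1/(N - tau) normalization used above."""
    y = np.asarray(y, float) - np.mean(y)   # remove the estimated mean first
    N = len(y)
    return np.sum(y[:N - tau] * y[tau:]) / (N - tau)
```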
### Estimation of Spectrum
We want to estimate the spectrum
$\Gamma_{y}(\omega)=\sum_{\tau=-\infty}^{\infty}\gamma_y(\tau)e^{-j\omega\tau}$
with:
$\hat{\Gamma}_{N}(\omega)=\sum_{\tau=-(N-1)}^{N-1}\hat{\gamma}_{N}(\tau)e^{-j\omega\tau}$
Estimating the spectrum $\hat{\Gamma}_N(\omega)$ is complex as it is derived from the estimated covariance function, making it an indirect estimation.
- This estimator is not initially correct but becomes asymptotically correct with a large dataset.
- Consistency is challenging as it's shown that the error variance does not tend to zero even with infinite samples:
$\mathbb{E}[(\hat{\Gamma}_{N}(\omega)-\Gamma_{y}(\omega))^{2}] \xrightarrow[N\to\infty]{} \Gamma_{y}(\omega)^2$
The best we can do to address this issue is to apply the so-called **Bartlett method**: divide the dataset into $r$ sub-series (each of length $N/r$) and compute the spectrum estimate on each part:
$\hat{\Gamma}^{(i)}_{N/r}(\omega), \quad i = 1,\ldots,r$
Then we average these estimators:
$\bar{\hat{\Gamma}}(\omega) = \frac{1}{r} \sum_{i=1}^r \hat{\Gamma}^{(i)}_{N/r}(\omega)$
Under the assumption that the data of different sub-series are uncorrelated (hence the requirement $N \gg r$), the variance is approximately:
$\text{Var}[\bar{\hat{\Gamma}}(\omega)] \approx \frac{1}{r^2}\sum_{i=1}^{r}\Gamma_y^2(\omega) = \frac{1}{r}\,\Gamma_y^2(\omega)$
The uncertainty is now "significantly" reduced.
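A sketch of the Bartlett idea, using the FFT-based periodogram of each sub-series as the raw spectral estimate (function names are illustrative):

```python
import numpy as np

def periodogram(y):
    """Basic spectral estimate of one (zero-mean) sub-series: |FFT|^2 / N."""
    y = np.asarray(y, float) - np.mean(y)
    return np.abs(np.fft.rfft(y)) ** 2 / len(y)

def bartlett_spectrum(y, r):
    """Split y into r sub-series, average their periodograms (requires N >> r)."""
    y = np.asarray(y, float)
    L = len(y) // r                        # length of each sub-series
    segments = y[:r * L].reshape(r, L)
    return np.mean([periodogram(seg) for seg in segments], axis=0)
```

For a fixed total $N$, increasing $r$ reduces the variance but shortens each sub-series, so the frequency resolution degrades: the usual trade-off behind the requirement $N \gg r$.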
### Data preprocessing
What if $y(t)$ is non-stationary? These are the two possible causes of non-stationarity that we address in this course:
- trend
- seasonality
To work with non-stationary processes, we first need to estimate the possible trends or seasonalities. Then we can remove them and work with the remaining **SSP (Stationary Stochastic Process)**.
#### Trend removal
$y(t)=\tilde y (t) + kt +m$
So in order to work with $\tilde y(t)$ we first estimate $k$ and $m$.
Then we remove the trend from the data set. We can also say:
$\mathbb{E}[y(t)-kt-m]=\mathbb{E}[\tilde{y}(t)]=0$
Inspired by the above equality, we can find $\hat{m}$ and $\hat{k}$ as the arguments of the minimum with respect to $m$ and $k$:
$(\hat{m},\hat{k})=\operatorname{argmin}_{m,k}\frac{1}{N} \sum_{t=1}^{N}\left(y(t)-kt-m\right)^2$
This is a classic least squares problem: it finds the straight line that best fits the data, where $k$ is the slope of the trend and $m$ its offset.
As with the LS cost function seen before, the expression is a paraboloid with a single global minimum, so the best linear trend estimate can be computed explicitly.
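A minimal sketch of trend estimation and removal (using `np.polyfit` for the LS line fit; names are illustrative):

```python
import numpy as np

def detrend(y):
    """Fit y(t) ~ k*t + m by least squares and return the detrended data."""
    y = np.asarray(y, float)
    t = np.arange(len(y))
    k_hat, m_hat = np.polyfit(t, y, deg=1)   # slope and intercept of the LS line
    y_tilde = y - (k_hat * t + m_hat)
    return y_tilde, k_hat, m_hat
```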
#### Seasonality
$y(t)=\tilde{y}(t)+s(t)$
where $s(t)=s(t+kT)$ for every integer $k$, $T$ being the period.
In the same way we need to estimate $s(t)$ .
The underlying idea is :
$\hat{s}(t)=\frac{1}{M}\sum_{h=0}^{M-1}y(t+hT)=\frac1M\sum_{h=0}^{M-1}\widetilde{y}(t+hT)+\frac1M\sum_{h=0}^{M-1}s(t+hT)$
Since $s(t+hT)=s(t)$, the last term is exactly $s(t)$, while the average of $\widetilde{y}$ over many periods tends to zero, so $\hat{s}(t)$ approaches $s(t)$ as $M$ grows.
But remember that the period $T$ itself also has to be estimated (or known).
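A sketch of the averaging idea, assuming the period `T` is known (or already estimated):

```python
import numpy as np

def estimate_seasonality(y, T):
    """Estimate s(t) for t = 0..T-1 by averaging y over the M complete periods."""
    y = np.asarray(y, float)
    M = len(y) // T                              # number of complete periods
    return y[:M * T].reshape(M, T).mean(axis=0)  # s_hat(t), one value per phase

def deseasonalize(y, T):
    """Remove the estimated seasonal component from y."""
    y = np.asarray(y, float)
    s_hat = estimate_seasonality(y, T)
    return y - np.tile(s_hat, len(y) // T + 1)[:len(y)], s_hat
```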