# Learning Theory
Training error and true error are two different types of errors:
- **Training error**: the error rate measured on the training data; it tells us how well the model **fits the training data**.
- **True error**: the expected prediction error over all possible inputs, i.e., over unseen data drawn from the true distribution.
We cannot know the true error because we don't know the true distribution of the data that the algorithm will work on. Instead, we can measure the model's performance on a known set of training data (the "empirical" risk) and use it to estimate the true error.
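As a small illustration of the gap between the two (a sketch of mine with made-up data, not from the course), we can fit a deliberately flexible model on a small training set and compare its empirical risk with the risk estimated on a large fresh sample from the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1D inputs with noisy linear targets.
def sample(n):
    x = rng.uniform(-1, 1, size=n)
    t = 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, t

x_tr, t_tr = sample(20)                # small training set
w = np.polyfit(x_tr, t_tr, deg=5)      # deliberately flexible polynomial model

def mse(x, t):
    return np.mean((np.polyval(w, x) - t) ** 2)

print("empirical (training) risk:", mse(x_tr, t_tr))
x_new, t_new = sample(100_000)         # large sample as a proxy for the true distribution
print("estimate of the true risk:", mse(x_new, t_new))
```

The training risk will typically come out lower than the second number, which is exactly the optimism we have to correct for.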
## No free lunch theorem
No Free Lunch theorems are a large class of theorems. Here we will analyze only the "main" NFL theorem, restricted to the binary classification problem:
> “Averaged over all possible problems, any learning technique will have an accuracy of 0.5, which in a binary classification problem is exactly the accuracy of random guessing. No learner can perform better than random guessing on average.”
Takeaways:
- The NFL theorem states that, **on average** over all possible problems, no algorithm consistently outperforms random guessing. However, this is based on the assumption that all possible target functions are equally likely to occur and there is no regularity in the world. In real-world applications some problems and samples are far more likely to occur than others, so specific algorithms tailored to specific tasks are needed and **possible**.
- There is no such thing as a winner-takes-all in ML! In ML we always operate under some assumptions! It is important to understand that there are no guarantees of success in all scenarios.
- While we can find algorithms that perform well on specific tasks, they may not generalize to other tasks.
- It is always possible to find data sets where an algorithm performs poorly.
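As a sanity check of the average-accuracy claim (a toy sketch of my own, using a deliberately simple "majority vote" learner), we can enumerate every possible target function on a tiny domain, train on part of it, and average the accuracy on the unseen part:

```python
import itertools

# Tiny brute-force check of the NFL claim: a 4-element input domain, the learner
# sees the labels of the first two inputs and is evaluated on the other two,
# averaged over ALL 2^4 possible target functions.
domain = [0, 1, 2, 3]
train_idx, test_idx = [0, 1], [2, 3]

def majority_learner(train_labels):
    """A concrete (hypothetical) learner: always predict the majority training label."""
    guess = 1 if sum(train_labels) * 2 >= len(train_labels) else 0
    return lambda x: guess                      # a constant hypothesis

accuracies = []
for target in itertools.product([0, 1], repeat=len(domain)):   # every possible "world"
    h = majority_learner([target[i] for i in train_idx])
    correct = sum(h(domain[i]) == target[i] for i in test_idx)
    accuracies.append(correct / len(test_idx))

print(sum(accuracies) / len(accuracies))        # 0.5: no better than random guessing
```

Whatever learner you substitute for `majority_learner`, the printed average stays 0.5: for any fixed prediction on the unseen points, exactly half of the possible targets agree with it.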
## VC Dimension
Some concepts:
- **Dichotomy**: given a set $S$ of points, a dichotomy is a partition of $S$ into two disjoint subsets.
- **Shattering**: we say that a set $S$ is shattered by a hypothesis space $H$ if and only if for every dichotomy of $S$ there exists some hypothesis in $H$ consistent with it. Basically, $H$ shatters $S$ if it is able to perfectly classify all the examples in $S$ regardless of how they are labelled.
- $VC(H)$: The Vapnik-Chervonenkis dimension is the cardinality of the largest set of instances $S$ shattered by $H$.
Formalism aside, the more complex the hypothesis space, the more likely it is to shatter a set of instances of higher cardinality, since it "can draw complex boundaries" to classify the points.
A linear classifier is a model that tries to draw a straight line (or plane, or hyperplane) to separate two groups of data points. In a 2D input space, this means we're trying to draw a line that separates one group of points from another.
The VC dimension of this linear classifier in 2D is 3. This means there exists a configuration of three points (any three non-collinear points) such that, for every possible way of splitting them into two groups, we can draw a line with all the points of one group on one side and all the points of the other group on the other side; no configuration of four points can be shattered (think of the XOR-like labelling).

It turns out that the VC dimension for a linear classifier in an $M$-dimensional input space is simply $M+1$.

So if we have an $M$-dimensional dataset and use a linear classifier model ($y = w_0 + w_1x_1 + w_2x_2 + \dots + w_Mx_M$), then according to VC theory there exists a set of $M+1$ data points that this model can perfectly classify into two groups for every possible labelling, while no set of $M+2$ points can be shattered.
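To make the shattering argument concrete, here is a small sketch (my own illustration, using a plain perceptron as the linear classifier) that enumerates all $2^3$ dichotomies of three non-collinear points in 2D and verifies that each one is linearly separable:

```python
import itertools
import numpy as np

def linearly_separable(X, y, epochs=1000):
    """Perceptron check: it stops making mistakes iff the labelling is separable
    (within the epoch budget)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, ti in zip(Xb, y):
            if ti * (w @ xi) <= 0:              # misclassified (or on the boundary)
                w += ti * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Three non-collinear points in 2D: every one of the 2^3 dichotomies is separable,
# which shows VC(linear classifiers in 2D) >= 3.
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(all(linearly_separable(S, np.array(labels))
          for labels in itertools.product([-1, 1], repeat=3)))   # True
```

The perceptron is used here only as a separability test: it stops making mistakes if and only if the labelling is linearly separable. Adding a fourth point and rerunning the check yields `False` for at least one dichotomy, consistent with a VC dimension of 3.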
Other examples of $VC(H)$ are:
- Neural Networks: $VC(H)$ is (roughly) the number of parameters
- 1-Nearest Neighbors: $VC(H)$ is $\infty$
- SVM with Gaussian Kernel: $VC(H)$ is $\infty$
In an exercise session the professor showed that to prove that the $VC(H)$ of a particular hypothesis space is at least a given number $v$, you have to find at **least one** set of instances $S$ of cardinality $v$ (one possible placement of the points) and show that it can be shattered by $H$, which means testing all possible dichotomies (i.e., all possible assignments of classes) and verifying that in each case the points can be correctly classified.
Vice versa, proving that $VC(H)$ is lower than $v$ is slightly more difficult, since we have to prove that **no** set of instances $S$ of cardinality $v$ can be shattered by $H$.
The VC dimension is at least k if the hypothesis space shatters at least one subset
of cardinality k in the instance space.
## PAC-Learning
**Probably Approximately Correct** (PAC) learning is a statistical tool to assess the quality/performance of a learning model given information from the samples (either a test set, the simplest scenario, or the training set).
Some definitions:
- A **concept class** is a set of possible concepts that our model can choose from when making predictions. It represents all possible functions or mappings that could explain the observed data.
- A class $C$ of possible target concepts $c$ is **PAC-learnable** by a learner $L$ using $H$ (the hypothesis space) if for all $c \in C$, for any distribution $P(X), \varepsilon$ (such that $0<\varepsilon<1 / 2$ ), and $\delta$ (such that $0<\delta<1 / 2$ ), $L$ will with a probability at least $(1-\delta)$ output a hypothesis $h \in H$ such that error $_{\text {true }}(h) \leq \varepsilon$, in time that is polynomial in $1 / \varepsilon, 1 / \delta, M$, and size(c).
- A sufficient condition to prove **PAC-learnability** is showing that the learner $L$ requires only a polynomial number of training examples and that the processing time for each example is polynomial. "If a concept class is PAC-learnable, it means that it is not hard to learn".
- The **version space** is the subset of the hypothesis space `H` that contains all hypotheses with 0 training error. In practice usually we need to work outside of the version space.
- We will use the **Hoeffding Bound**, which is a tool to build confidence intervals: for $N$ i.i.d. samples with values in $[0,1]$ and empirical mean $\bar{X}$, for any $\varepsilon > 0$ we have $P\left(\mathbb{E}[X] > \bar{X} + \varepsilon\right) \le e^{-2N\varepsilon^2}$.
- The **empirical risk minimizer** is $\hat{h} = \arg \min_{h \in \mathcal{H}} \hat{\mathcal{L}}(h)$, where $\hat{\mathcal{L}}(h) = \frac{1}{N} \sum_{n=1}^N \ell\left(h\left(\mathbf{x}_n\right), t_n\right)$ and $\ell$ is the loss function.
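As a minimal sketch of empirical risk minimization (my own toy example: a finite hypothesis space of 1D threshold classifiers and 0-1 loss, not something from the course), picking $\hat{h}$ just means scanning $H$ for the hypothesis with the lowest average loss on the sample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite hypothesis space: 1D threshold classifiers h_c(x) = sign(x - c)
# for a grid of candidate thresholds c.
thresholds = np.linspace(-1, 1, 21)

# Toy data: the target concept is "x > 0.3", with 10% label noise.
x = rng.uniform(-1, 1, size=200)
t = np.where(x > 0.3, 1, -1)
noise = rng.random(200) < 0.1
t[noise] *= -1

def empirical_risk(c):
    pred = np.where(x > c, 1, -1)
    return np.mean(pred != t)                  # average 0-1 loss on the sample

risks = [empirical_risk(c) for c in thresholds]
c_hat = thresholds[int(np.argmin(risks))]      # the empirical risk minimizer
print("ERM threshold:", c_hat, "with empirical risk:", min(risks))
```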
We exploit the Hoeffding Bound to evaluate the true loss of the empirical risk minimizer starting from:
- the **test set**: it is independent of the training set, so the test loss $\hat{\mathcal{L}}(\hat{h})$ is an unbiased estimator of the true loss.
- the **training set**: it obviously provides an optimistically (**negatively**) biased estimator of the true loss.
### Using the Test Set
Under the assumption of bounded loss $[0, L]$:
$\mathcal{L}(\hat{h}) \le \hat{\mathcal{L}}(\hat{h}) + L \sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2J}}$
with probability $1-\delta$ and where:
- $\mathcal{L}(\hat{h})$ is the true error rate (risk) of our model, which measures how well it performs on new, unseen data points.
- $\hat{\mathcal{L}}(\hat{h})$ is the empirical error rate of our model measured on the test set.
- $L$ is the upper bound on the loss values (the loss is assumed to lie in $[0, L]$), i.e., how much a single prediction can cost at most.
- $J$ represents the size of our test set, i.e., how many samples we have available to evaluate our model's performance on new data points.
- $\delta$ is the probability that the bound fails: the inequality holds with confidence $1-\delta$.
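A quick numeric sketch of this bound (the helper `test_set_bound` and the plugged-in numbers are mine, purely illustrative):

```python
import numpy as np

def test_set_bound(test_loss, L, J, delta):
    """Upper bound on the true loss, valid with probability 1 - delta,
    assuming the loss is bounded in [0, L] and the J test points are i.i.d."""
    return test_loss + L * np.sqrt(np.log(1.0 / delta) / (2.0 * J))

# Example: 0/1 loss (so L = 1), 10,000 test samples, 95% confidence.
print(test_set_bound(test_loss=0.08, L=1.0, J=10_000, delta=0.05))   # ~0.092
```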
### Using the Training Set
We distinguish:
- **consistent learners** which have zero training error
- **agnostic learners**: we do not assume that $c \in H$, so the learner $L$ may not be able to output a hypothesis $h$ with $\operatorname{error}_{\mathcal{D}}(h)=0$; instead, $L$ outputs the hypothesis with the smallest training error, which possibly has $\operatorname{error}_{\mathcal{D}}(h)>0$.
In case of **binary classification** (finite hypothesis space):
- Finite hypothesis space $(|\mathcal{H}|<+\infty)$ and **consistent** learning $(\hat{\mathcal{L}}(\hat{h})=0$ always):
$
\mathcal{L}(\hat{h}) \leq \frac{\log |\mathcal{H}|+\log \left(\frac{1}{\delta}\right)}{N} \quad \text { w.p. } \quad 1-\delta
$
- Finite hypothesis space $(|\mathcal{H}|<+\infty)$ and **agnostic** learning $(\hat{\mathcal{L}}(\hat{h})>0$ possibly):
$
\mathcal{L}(\hat{h}) \leq \hat{\mathcal{L}}(\hat{h})+\sqrt{\frac{\log |\mathcal{H}|+\log \left(\frac{1}{\delta}\right)}{2 N}} \quad \text { w.p. } \quad 1-\delta
$
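As a quick plug-in sketch of the two bounds above (the function names and the numbers are mine, purely illustrative):

```python
import numpy as np

def consistent_bound(H_size, N, delta):
    """Bound on the true loss for a consistent learner over a finite H (w.p. 1 - delta)."""
    return (np.log(H_size) + np.log(1.0 / delta)) / N

def agnostic_bound(train_loss, H_size, N, delta):
    """Bound on the true loss for an agnostic learner over a finite H (w.p. 1 - delta)."""
    return train_loss + np.sqrt((np.log(H_size) + np.log(1.0 / delta)) / (2.0 * N))

# Made-up numbers: |H| = 2^20 hypotheses, N = 5000 samples, delta = 0.05.
print(consistent_bound(2**20, 5000, 0.05))        # ~0.0034
print(agnostic_bound(0.10, 2**20, 5000, 0.05))    # ~0.141
```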
When we move to an **infinite** hypothesis space, as for example in the linear regression case (where the output is a real number), we have to use the notion of the **VC dimension**, which in some way accounts for the complexity of the hypothesis space.
So, in case of infinite hypothesis space $(|\mathcal{H}|=\infty)$ and agnostic learning $(\hat{\mathcal{L}}(\hat{h})>0$ possibly):
$
\mathcal{L}(\hat{h}) \leq \hat{\mathcal{L}}(\hat{h})+\sqrt{\frac{\operatorname{VC}(\mathcal{H}) \log \left(\frac{2 e N}{\operatorname{VC}(\mathcal{H})}\right)+\log \left(\frac{4}{\delta}\right)}{N}} \quad \text { w.p. } \quad 1-\delta
$
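A plug-in sketch of this VC bound (again with made-up numbers; `vc_bound` is my own helper, not course code):

```python
import numpy as np

def vc_bound(train_loss, vc_dim, N, delta):
    """VC generalization bound on the true loss (w.p. 1 - delta)."""
    return train_loss + np.sqrt(
        (vc_dim * np.log(2 * np.e * N / vc_dim) + np.log(4.0 / delta)) / N
    )

# Example: a linear classifier in M = 10 dimensions, so VC(H) = M + 1 = 11.
print(vc_bound(train_loss=0.10, vc_dim=11, N=10_000, delta=0.05))    # ~0.20
```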
### PAC takeaways
So basically, these theoretical formulas confirm what so far we had only stated empirically by looking at the different techniques:
- Larger hypotheses space implies larger bound (variance)
- Increasing $N$ implies reduced bound (variance)
- Large $|H|$: low bias, high variance
- Small $|H|$: high bias, low variance
- There is a relationship between the training error and the prediction error (true error):
  - the two are close when under-fitting
  - the relationship is lost when over-fitting
$L_{\text {true }}(h) \leq \underbrace{L_{\text {train }}(h)}_{\text {Bias }}+\underbrace{\sqrt{\frac{\ln |H|+\ln \frac{1}{\delta}}{2 N}}}_{\text {Variance }}$