Understanding Machine Learning 05

Posted on March 21, 2022 by Shiguang Wu
Tags: UML

VCdim

here we come to one of the most famous theories, the VC dimension

before going a little deeper, we will have a look at the motivation

motivation

first, finiteness of \(H\) is a sufficient but not a necessary condition for PAC learnability; exercise 3.3 gives a simple example. Moreover, in the last post (No-Free-Lunch theorem), we have seen that an \(H\) containing all possible functions is not PAC learnable. When we rethink the proof, we may notice that the construction of the set \(C\) is the key point: since \(H\) realizes every possible labeling of \(C\), the learner's errors over the different \(f\)'s average out to something large, so some \(f\) forces a large error.

borrowing that idea: if we can find a subset \(C\) of the domain \(\mathcal{X}\) such that \(H\), restricted to \(C\), contains all functions from \(C\) to \(\{0,1\}\), then the same proof yields a considerable risk

further, if such a \(C\) can be arbitrarily large, then \(H\) is not PAC learnable

the thoughts above give us the basic idea of where the VC dimension comes from

VC dimension


Def. restriction of H to C

here, \(C=\{c_1,c_2,\dots,c_m\}\subseteq\mathcal{X}\)

and \(\mathcal{H}_C=\{(h(c_1),\dots,h(c_m)) : h\in\mathcal{H}\}\), the set of label tuples \(\mathcal{H}\) induces on \(C\)


Def. Shattering

H shatters C \(\iff\) \(|H_C|=2^{|C|}\)


Def. VC-dimension

the VC-dimension of H \(\coloneqq\max\{|C| : C\subseteq\mathcal{X},\ \text{H shatters C}\}\), the size of the largest shattered set


a simple method to show \(\mathrm{VCdim}(H)=d\): we need to show that

  1. \(\exists C\) of size d that is shattered by H

  2. every C of size d+1 is not shattered by H
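to make the two steps concrete, here is a brute-force sketch of my own (not from the book) that enumerates \(H_C\) directly; the toy class is a finite grid of 1-D thresholds \(h_a(x)=\mathbb{1}[x\ge a]\):

```python
def restriction(hypotheses, C):
    """H_C: the set of distinct label tuples that H induces on the points of C."""
    return {tuple(h(c) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    """H shatters C iff H_C realizes all 2^|C| labelings."""
    return len(restriction(hypotheses, C)) == 2 ** len(C)

# toy class: 1-D thresholds h_a(x) = 1[x >= a] over a small grid of a's
H = [lambda x, a=a: int(x >= a) for a in (0.5, 1.5, 2.5, 3.5)]

print(shatters(H, [1]))     # True  -> a set of size 1 is shattered, so VCdim >= 1
print(shatters(H, [1, 2]))  # False -> the labeling (1, 0) is unreachable
```

every singleton here is shattered but no pair is, since a threshold can never label a smaller point 1 and a larger point 0; this matches the well-known \(\mathrm{VCdim}=1\) for thresholds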


since a shattered set of size \(\mathrm{VCdim}(H)\) requires \(2^{\mathrm{VCdim}(H)}\le |H|\), we have a loose upper bound on the VC dimension, namely \(\log_2|H|\); e.g. a finite class with \(|H|=1000\) has \(\mathrm{VCdim}(H)\le 9\)

fundamental theorem of PAC learning

the following are equivalent

  1. \(H\) has the uniform convergence property

  2. every ERM rule is a successful PAC learner for \(H\)

  3. every ERM rule is a successful agnostic PAC learner for \(H\)

  4. \(H\) has finite VC dimension


from 4. to 1. is non-trivial. here we will go deeper into that

we need to find \(m_H^{UC}\) s.t. for any \(S\) with \(|S|\ge m_H^{UC}\), with high probability \(S\) is \(\frac{\epsilon}{2}\)-representative (i.e. \(\forall h\in H,\ |L_D(h)-L_S(h)|\le \frac{\epsilon}{2}\))
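recall from the uniform convergence chapter why \(\frac{\epsilon}{2}\)-representativeness is exactly what ERM needs: for the ERM output \(h_S\) and any \(h\in H\),

\[ L_D(h_S)\le L_S(h_S)+\frac{\epsilon}{2}\le L_S(h)+\frac{\epsilon}{2}\le L_D(h)+\epsilon \]

so \(h_S\) is \(\epsilon\)-competitive with the best hypothesis in \(H\)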

intuitively, if \(m\) is much larger than \(d\), where \(d\) is the VC dimension, \(H_C\) takes up only a small part of \(2^C\) (in fact \(|H_C|\) is only polynomially large w.r.t. \(|C|\)), so the estimation error can be driven to \(0\) as \(m\) grows

Sauer’s Lemma


Def. Growth Function

\[\tau_H(m)=\max_{C\subseteq \mathcal{X}:|C|=m}|H_C|.\]


according to Sauer’s lemma, if \(\mathrm{VCdim}(H)\le d\), then \(\tau_H(m)\le \sum_{i=0}^d\tbinom{m}{i}\); if \(m> d+1\), we have a looser but neat form \(\tau_H(m)\le (em/d)^d\)
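a quick numeric sketch of my own (with an illustrative \(d=3\)) showing how polynomial the Sauer bound is next to \(2^m\):

```python
from math import comb, e

def sauer_bound(m, d):
    """Sauer's lemma: tau_H(m) <= sum of C(m, i) for i = 0..d."""
    return sum(comb(m, i) for i in range(d + 1))

def loose_bound(m, d):
    """The neat polynomial form (e * m / d)^d, valid for m > d + 1."""
    return (e * m / d) ** d

d = 3
for m in (5, 10, 50, 100):
    print(m, sauer_bound(m, d), round(loose_bound(m, d)), 2 ** m)
```

already at \(m=50\) the exact sum (20876) is many orders of magnitude below \(2^{50}\)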

proof

basic idea

we prove a stronger claim: \(\forall C=\{c_1,\dots,c_m\}\),

we have \(\forall H,\ |H_C|\le |\{B\subseteq C: \text{H shatters B}\}|\)

this implies Sauer’s lemma, since \(H\) (with \(\mathrm{VCdim}(H)\le d\)) can only shatter sets of size at most \(d\), so the RHS is at most \(\sum_{i=0}^{d}\tbinom{m}{i}\)

induction on m

when m=1, both sides equal either 1 or 2 (note the empty set always counts as shattered)

for the inductive step \(m=k+1\), suppose the claim holds for all sets of size at most \(k\).

to use the claim, we split \(C\) into \(\{c_1\}\) and \(C'=\{c_2,\dots,c_m\}\)

we also need a class \(H'\) that captures what is lost when passing from \(C\) down to \(C'\), while preserving the shattering structure

note:

\[ H_C=H_{C'}\oplus H'_{C'} \]

where \(H'=\{f\in H : f(c_1)=1\ \land\ \exists g\in H,\ g(c_1)=0\land f_{C'}=g_{C'}\}\); here \(\oplus\) is only a counting statement, \(|H_C|=|H_{C'}|+|H'_{C'}|\): each element of \(H_{C'}\) extends to at least one element of \(H_C\), and the elements of \(H'_{C'}\) are exactly those extending to two


if \(f\) and \(g\) agree on \(C'\) but differ on \(c_1\), they collapse to a single element of \(H_{C'}\) while remaining two distinct elements of \(H_C\); \(H'_{C'}\) supplies exactly this missing second count.


so we have that

\[ \begin{aligned} |H_C|&=|H_{C'}|+|H'_{C'}|\\ &\le |\{B\subseteq C':\text{H shatters B}\}|+|\{B\subseteq C':\text{H' shatters B}\}| \end{aligned} \]

where the inequality is the induction hypothesis applied on \(C'\) (of size \(k\)) to \(H\) and to \(H'\)

note that for \(B\subseteq C'\), \(H'\) shatters \(B\) \(\iff\) H shatters \(\{c_1\}\cup B\)

so the first term counts the shattered subsets of \(C\) avoiding \(c_1\) and the second counts those containing \(c_1\), giving RHS \(\le |\{B\subseteq C:\text{H shatters B}\}|\) \(\blacksquare\)
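not part of the proof, but here is a brute-force sanity check of the stronger claim on a random finite class (again a sketch of my own):

```python
from itertools import combinations, product
import random

def restriction(H, C):
    """H_C as a set of label tuples; each h is a dict point -> label."""
    return {tuple(h[c] for c in C) for h in H}

def shatters(H, C):
    return len(restriction(H, C)) == 2 ** len(C)

random.seed(0)
domain = range(4)
# all 2^4 possible hypotheses on a 4-point domain
all_h = [dict(zip(domain, bits)) for bits in product([0, 1], repeat=4)]
H = random.sample(all_h, 6)  # a random class of 6 hypotheses

lhs = len(restriction(H, domain))
rhs = sum(1 for r in range(len(domain) + 1)
            for B in combinations(domain, r) if shatters(H, B))
print(lhs, rhs, lhs <= rhs)  # the claim: |H_C| <= #shattered subsets of C
```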

uniform convergence

we will give the upper bound without proof :(

for every \(\delta\in(0,1)\), with probability at least \(1-\delta\) over the choice of \(S\sim D^m\), for every \(h\in H\)

\[ |L_D(h)-L_S(h)|\le \frac{4+\sqrt{\log(\tau_H(2m))}}{\delta \sqrt{2m}} \]

RHS \(\in o(1)\): by Sauer’s lemma \(\log\tau_H(2m)=O(d\log m)\), so the bound scales like \(\sqrt{d\log m/m}\) and vanishes as \(m\) grows
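plugging the Sauer bound for \(\tau_H\) into the RHS, a small sketch of my own (illustrative \(d=3\), \(\delta=0.1\)) shows the decay numerically:

```python
from math import comb, log, sqrt

def tau_bound(m, d):
    """Upper-bound the growth function tau_H(m) via Sauer's lemma."""
    return sum(comb(m, i) for i in range(d + 1))

def uc_bound(m, d, delta):
    """RHS of the bound above, with tau_H replaced by its Sauer upper bound."""
    return (4 + sqrt(log(tau_bound(2 * m, d)))) / (delta * sqrt(2 * m))

for m in (100, 1_000, 10_000, 100_000):
    print(m, round(uc_bound(m, d=3, delta=0.1), 3))
```

the printed values shrink roughly like \(\sqrt{\log m/m}\), the \(o(1)\) behavior claimed above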


I’ll add the proof when I master this math skill :(

exercise

btw, Dudley classes seem to be a fascinating and important topic