Understanding Machine Learning 02

Posted on March 8, 2022 by Shiguang Wu
Tags: UML

def PAC

def PAC learnability

hypothesis class \(\mathcal{H}\) is PAC learnable if the realizability assumption holds, and there exists \(m_{\mathcal{H}}:(0,1)^2\rightarrow\N\)

where, with the number of i.i.d. samples \(\ge m_{\mathcal{H}}(\epsilon,\delta)\), we always get a probably (w.p. \(\ge 1-\delta\)) approximately (gen risk \(\le\epsilon\)) correct \(h\) just using ERM.
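
a minimal sketch of ERM over a finite hypothesis class, just to make this concrete; the threshold hypotheses and function names below are my own toy choices, not from the book:

```python
# Minimal ERM sketch over a finite hypothesis class (toy example).
# Each hypothesis is a function X -> {0, 1}; the sample S is a list of (x, y) pairs.

def empirical_risk(h, S):
    """Fraction of sample points that h misclassifies (0-1 loss)."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(hypotheses, S):
    """Return a hypothesis in the class with minimal empirical risk on S."""
    return min(hypotheses, key=lambda h: empirical_risk(h, S))

# Toy class: threshold classifiers on the real line.
hypotheses = [lambda x, t=t: int(x >= t) for t in [0.0, 0.25, 0.5, 0.75, 1.0]]
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
h_S = erm(hypotheses, S)
print(empirical_risk(h_S, S))  # 0.0 here, since this sample is realizable by t=0.5
```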

removing realizability assumption

the realizability assumption is too strong and unrealistic: it assumes the existence of a perfect hypothesis in \(\mathcal{H}\) that reproduces the ground truth \(f\) with \(Pr=1\)

so, replace the labeling function \(f(x)\) with a joint distrib \(D(x,y)\), as most researchers would do

def generalized risk as \(L_D(h)\coloneqq \Pr_{(x,y)\sim D}[h(x)\neq y]= D(\{(x,y):h(x)\neq y\})\), just changed \(D\) into a joint distrib

empirical risk is the same

still, when we take \(D\) to be the uniform distrib on \(S\), the two are equal
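
a quick sanity check of this remark (the numbers and function names are mine): represent a finite joint distrib as a dict of probabilities and compare the two risks when \(D\) is uniform on \(S\):

```python
# Check: generalized risk under D = uniform distribution on S equals empirical risk on S.
# D is represented as a dict mapping (x, y) pairs to probabilities.

def generalized_risk(h, D):
    """L_D(h) = D({(x, y) : h(x) != y}) for a finite joint distribution D."""
    return sum(p for (x, y), p in D.items() if h(x) != y)

def empirical_risk(h, S):
    return sum(1 for x, y in S if h(x) != y) / len(S)

S = [(0.1, 0), (0.3, 1), (0.6, 1), (0.9, 1)]
D_uniform = {(x, y): 1 / len(S) for (x, y) in S}  # uniform distribution over S
h = lambda x: int(x >= 0.5)
print(generalized_risk(h, D_uniform), empirical_risk(h, S))  # both 0.25
```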

ideally, the func \(f_D(x)=1\text{ if }Pr[y=1|x]\ge 0.5\text{ and }0\text{ otherwise}\) is the optimal sol to the gen risk min problem when \(\mathcal{Y}=\{0,1\}\), w.r.t. the 0-1 loss: the Bayes optimal sol.


note: my stupid short proof about the above

we need to prove that for any classifier \(g\)

\[ [f_D(x)\neq 0]Pr(y=0|x)+[f_D(x)\neq 1]Pr(y=1|x)\\ \le [g(x)\neq 0]Pr(y=0|x)+[g(x)\neq 1]Pr(y=1|x) \]

just consider the case \(Pr(y=0|x)\gt 0.5\) (the other case is symmetric), where \(f_D(x)=0\), and it becomes

\[ LHS=Pr(y=1|x)\\ =\{[g(x)\neq 0]+[g(x)\neq 1]\}Pr(y=1|x)\\ \le RHS \]

since exactly one of the two indicators is \(1\), and \(Pr(y=1|x)\lt Pr(y=0|x)\) in this case

very ugly and not clever proof :/ \(\blacksquare\)
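
as a numeric sanity check of the claim (not a proof), brute force over all \(2^{|\mathcal{X}|}\) classifiers on a tiny made-up finite \(\mathcal{X}\) and compare against \(f_D\); the joint distrib below is invented for illustration:

```python
from itertools import product

# Toy joint distribution D over X x {0,1}: D[(x, y)] = Pr[X=x, Y=y] (made-up numbers).
D = {('a', 0): 0.30, ('a', 1): 0.10,
     ('b', 0): 0.15, ('b', 1): 0.25,
     ('c', 0): 0.05, ('c', 1): 0.15}
X = ['a', 'b', 'c']

def risk(predict, D):
    """0-1 generalized risk: probability mass of the pairs the predictor gets wrong."""
    return sum(p for (x, y), p in D.items() if predict[x] != y)

# Bayes predictor: f_D(x) = 1 iff Pr[y=1 | x] >= 0.5, i.e. iff D[(x,1)] >= D[(x,0)].
bayes = {x: int(D[(x, 1)] >= D[(x, 0)]) for x in X}

# Brute force over all 2^|X| classifiers.
best = min((dict(zip(X, labels)) for labels in product([0, 1], repeat=len(X))),
           key=lambda g: risk(g, D))
print(risk(bayes, D), risk(best, D))  # the two risks coincide (0.3 here)
```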


here, we still have the opt function \(f_D\), which minimizes the gen risk but in general does not bring it to zero

def Agnostic PAC learnability

before: \(L_{D,f}(h)\le \epsilon\); now: \(L_{D}(h)\le \min_{g\in\mathcal{H}}L_{D}(g)+\epsilon\)

so the \(f\) above will not be used any more; the joint distrib \(D\) takes its place

and we see that if the realizability assumption holds, \(\min_{g\in\mathcal{H}}L_D(g)=0\), so it's the same as the original PAC learnability

agnostic just means we can no longer guarantee an \(h\) with arbitrarily small gen risk

relative best instead of abs best
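
spelled out in full (my own phrasing, mirroring the PAC def above): \(\mathcal{H}\) is agnostic PAC learnable if there exist \(m_{\mathcal{H}}:(0,1)^2\rightarrow\N\) and a learning algorithm s.t. for every \(\epsilon,\delta\in(0,1)\) and every distrib \(D\) over \(\mathcal{X}\times\mathcal{Y}\), running the algorithm on \(\ge m_{\mathcal{H}}(\epsilon,\delta)\) i.i.d. samples from \(D\) returns, with probability \(\ge 1-\delta\), an \(h\) with

\[ L_D(h)\le\min_{g\in\mathcal{H}}L_{D}(g)+\epsilon \]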

other loss functions

we can actually use other measures in place of the 0-1 loss, especially in regression problems

naturally, we can extend the loss function into a more formal definition, \(l:\mathcal{H}\times\mathcal{Z}\rightarrow \R_+\), where \(\mathcal{Z}\) is the set of instances; in prediction tasks, it could be \(\mathcal{X}\times\mathcal{Y}\)
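
a tiny sketch of the general signature \(l:\mathcal{H}\times\mathcal{Z}\rightarrow\R_+\) with two familiar instances, taking \(z=(x,y)\) (the names are mine):

```python
# Two instances of the general loss l(h, z) with z = (x, y): 0-1 loss and squared loss.

def zero_one_loss(h, z):
    x, y = z
    return float(h(x) != y)       # classification: 1 if wrong, 0 if right

def squared_loss(h, z):
    x, y = z
    return (h(x) - y) ** 2        # regression: squared error, always >= 0

h = lambda x: 2 * x               # some hypothesis
print(zero_one_loss(h, (1.0, 2.0)))  # 0.0
print(squared_loss(h, (1.0, 2.5)))   # 0.25
```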

Agnostic PAC learnability for General Loss Functions

\(L_D(h)\le\min_{h'\in\mathcal{H}}L_D(h')+\epsilon\)

where \(L_D(h)=\mathbb{E}_{z\sim D}[l(h,z)]\)

note: \(l(h,\cdot)\) is a rand var (as a function of \(z\sim D\)), so it should be measurable.
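
a sketch of estimating \(L_D(h)=\mathbb{E}_{z\sim D}[l(h,z)]\) by the empirical average of \(l(h,z_i)\) over i.i.d. draws, i.e. the empirical risk for a general loss; the sampler below (a made-up \(D\)) is only for illustration:

```python
import random

# Estimate L_D(h) = E_{z~D}[l(h, z)] by averaging l(h, z_i) over i.i.d. draws from D.
# The sampler (y = 2x + Gaussian noise) is a made-up D for illustration.

def sample_z(rng):
    x = rng.uniform(0, 1)
    y = 2 * x + rng.gauss(0, 0.1)
    return (x, y)

def squared_loss(h, z):
    x, y = z
    return (h(x) - y) ** 2

def empirical_risk(h, loss, sample):
    return sum(loss(h, z) for z in sample) / len(sample)

rng = random.Random(0)
sample = [sample_z(rng) for _ in range(10_000)]
h = lambda x: 2 * x
print(empirical_risk(h, squared_loss, sample))  # close to the noise variance 0.01
```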

some exercises