The Elements of Statistical Learning 01
boosting
build a committee from a bunch of weak learners \(G_m\) (each only slightly better than random guessing)
\[ G(x)=\operatorname{sign}\left(\sum_{m=1}^Ma_mG_m(x)\right) \]
one generic method is the forward-stagewise method, where you compute one model \(G_m\) and its corresponding weight \(\beta_m\) at a time (minimize \(\sum_{i=1}^N L(y_i, f_{m-1}(x_i)+\beta G(x_i))\) over \((\beta, G)\)).
if squared error is used as the loss \(L\), each step is seeking a model \(\beta G\) that fits the current residuals \(y_i-f_{m-1}(x_i)\) (see the sketch below).
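a minimal numpy sketch of this residual-fitting view; the stump weak learner, the toy data, and the number of rounds are all mine, not from the book.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump on a 1-D feature: one threshold split,
    constant prediction on each side (a crude weak learner G)."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        cl, cr = left.mean(), right.mean()
        sse = ((left - cl) ** 2).sum() + ((right - cr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, cl, cr)
    _, t, cl, cr = best
    return lambda z: np.where(z <= t, cl, cr)

# toy data: y = sin(x) + noise
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + 0.1 * rng.normal(size=200)

f = np.zeros_like(y)            # f_0 = 0
for m in range(50):             # forward stagewise with squared-error loss
    r = y - f                   # current residuals
    G = fit_stump(x, r)         # weak learner fit to the residuals
    f = f + G(x)                # add it to the committee (a shrinkage factor is often used here)

print("train MSE:", np.mean((y - f) ** 2))
```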
AdaBoost.M1
iteratively fit \(G_m\) on a weighted dataset.
the method is derived by using the exponential loss instead of the usual squared error:
\[e^{-y_if(x_i)}\]
suppose each classifier outputs hard labels, i.e. \(G(x)\in\{-1,1\}\)
at stage m
we want to optimize the following
\[ \min_{(a,G)}\sum_{i=1}^N\exp\bigl(-y_i(f_{m-1}(x_i)+aG(x_i))\bigr)\\ =\min_{(a,G)}\sum_{i=1}^N w_i\exp(-y_iaG(x_i)),\quad w_i:=\exp(-y_if_{m-1}(x_i)) \]
this is just the forward-stagewise method; if we fix \(a>0\), the objective becomes
\[ e^{a}\sum_{y_i\neq G(x_i)}w_i+e^{-a}\sum_{y_i=G(x_i)}w_i\\ =e^{a}\sum_{i=1}^N w_i-e^{a}\sum_{y_i=G(x_i)}w_i+e^{-a}\sum_{y_i=G(x_i)}w_i\\ =(e^{-a}-e^{a})\sum_{y_i=G(x_i)}w_i+e^{a}\sum_{i=1}^N w_i\\ =(e^{a}-e^{-a})\sum_{i=1}^N w_i\,\mathbb{1}[y_i\neq G(x_i)]+e^{-a}\sum_{i=1}^N w_i \]
so for a fixed \(a>0\), solving for \(G\) amounts to minimizing the 0-1 loss on a weighted dataset
with this newly solved \(G\), we can then solve for \(a\)
\[ \frac{d}{da}\sum_i w_i\exp(-ay_iG(x_i))=-\sum_i w_iy_iG(x_i)\exp(-ay_iG(x_i))\\ =-(e^{-a}+e^{a})\sum_{y_i=G(x_i)}w_i+e^{a}\sum_{i=1}^Nw_i=0 \]
solving gives
\[ a=\frac{1}{2}\log\frac{1-\mathrm{err}}{\mathrm{err}},\qquad \mathrm{err}=\frac{\sum_i w_i\,\mathbb{1}[y_i\neq G(x_i)]}{\sum_i w_i} \]
in short, AdaBoost.M1 is directly obtained from the forward-stagewise method; with the exponential loss we get this neat closed-form solution.
further, since \(f_m=f_{m-1}+a_mG_m\), the weights can be updated iteratively: \(w_i^{(m+1)}=w_i^{(m)}e^{-a_my_iG_m(x_i)}\). a compact sketch of the resulting algorithm follows.
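a minimal numpy sketch of AdaBoost.M1 with decision stumps, written with the half log-odds \(a\) from the derivation above (the stump search, toy data, and number of rounds are all made up for illustration).

```python
import numpy as np

def stump_classifier(X, y, w):
    """Pick the stump (threshold + orientation) with the smallest weighted 0-1 loss."""
    best = None
    for t in np.unique(X):
        for s in (1, -1):                        # orientation of the split
            pred = np.where(X <= t, s, -s)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, s)
    _, t, s = best
    return lambda Z: np.where(Z <= t, s, -s)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 300)
y = np.where(X**2 > 0.3, 1, -1)                  # a target no single stump can fit

N, M = len(X), 30
w = np.full(N, 1.0 / N)                          # initial weights
learners, alphas = [], []
for m in range(M):
    G = stump_classifier(X, y, w)
    pred = G(X)
    err = w[pred != y].sum() / w.sum()           # weighted error rate
    a = 0.5 * np.log((1 - err) / err)            # half log-odds from the exp-loss derivation
    w = w * np.exp(-a * y * pred)                # iterative weight update
    w = w / w.sum()
    learners.append(G); alphas.append(a)

F = sum(a * G(X) for a, G in zip(alphas, learners))
print("train error:", np.mean(np.sign(F) != y))
```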
more about exp loss
\(f^\star=\arg\min_f\mathbf{E}_{Y|x}\bigl(e^{-Yf(x)}\bigr)=\frac{1}{2}\log\frac{\Pr(Y=1|x)}{\Pr(Y=-1|x)}\); it estimates one half the log-odds of \(\Pr(Y=1|x)\), so \(\Pr(Y=1|x)=\frac{1}{1+e^{-2f^\star(x)}}\)
let \(p(x)=\Pr(Y=1|x)=\frac{1}{1+e^{-2f(x)}}\) and \(Y^\prime=(Y+1)/2\in\{0,1\}\); we can see that cross-entropy (binomial deviance) and the exponential loss are actually estimating the same population quantity.
ce: \(-l(Y,f(x))=\log(1+e^{-2Yf(x)})\), where \(l(Y,f(x))=Y^\prime\log p(x)+(1-Y^\prime)\log(1-p(x))\) is the binomial log-likelihood (\(p\) is obtained from \(f\) through the sigmoid above)
exp: \(e^{-Yf}\)
both losses share the same population minimizer \(f^\star\), hence estimate the same \(\Pr(Y=1|x)\)
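a quick numeric sanity check of this claim (the value of \(p\) and the grid over \(f\) are arbitrary): both expected losses should be minimized at \(f=\frac{1}{2}\log\frac{p}{1-p}\).

```python
import numpy as np

p = 0.8                                     # assumed Pr(Y = 1 | x)
f = np.linspace(-3, 3, 20001)               # candidate values of f(x)

# expected exponential loss E[e^{-Yf}] and expected deviance E[log(1 + e^{-2Yf})]
exp_loss = p * np.exp(-f) + (1 - p) * np.exp(f)
deviance = p * np.log1p(np.exp(-2 * f)) + (1 - p) * np.log1p(np.exp(2 * f))

print("argmin exp loss :", f[exp_loss.argmin()])
print("argmin deviance :", f[deviance.argmin()])
print("half log-odds   :", 0.5 * np.log(p / (1 - p)))
```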
loss functions and robustness
here robustness means being disturbed less by outlier samples.
regression
squared-error -> \(f(x)=\mathbf{E}(Y|x)\); it puts more emphasis on observations with large absolute residuals during fitting, so it is far less robust and behaves badly with long-tailed error distributions and gross outliers
absolute loss -> \(f(x)=\operatorname{median}(Y|x)\), much more robust
Huber loss:
\[ L(y,f(x))=\begin{cases}[y-f(x)]^2 & \text{for } |y-f(x)|\leq\delta\\ 2\delta|y-f(x)|-\delta^2 & \text{otherwise}\end{cases} \]
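a small numpy version of the three regression losses to make the piecewise form concrete (the residual values and \(\delta\) below are arbitrary):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss on residuals r: quadratic near zero, linear in the tails."""
    r = np.abs(r)
    return np.where(r <= delta, r**2, 2 * delta * r - delta**2)

r = np.array([0.1, 0.5, 2.0, 10.0])          # residuals, the last one a gross outlier
print("squared :", r**2)                     # outlier contributes 100
print("absolute:", np.abs(r))                # outlier contributes 10
print("huber   :", huber(r))                 # outlier contributes 19 (linear, not quadratic)
```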
classification
we consider the two-class classification problem
in regression, the residual \(y-f(x)\) measures the quality of fit; in classification, the margin \(yf(x)\) plays the analogous role, where \(y\in\{-1,1\}\): negative margins are misclassified points, and a loss that grows very fast for large negative margins is heavily influenced by them (compare the losses in the sketch below).
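a quick comparison of how the classification losses penalize the margin \(yf(x)\) (the margin values below are arbitrary); the exponential loss blows up on badly misclassified points, which is why it is less robust than the deviance.

```python
import numpy as np

margin = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])       # yf(x); negative = misclassified

misclass = (margin < 0).astype(float)                 # 0-1 loss
exp_loss = np.exp(-margin)                            # exponential loss (AdaBoost)
deviance = np.log1p(np.exp(-2 * margin))              # binomial deviance, scaled as above

for m, z, e, d in zip(margin, misclass, exp_loss, deviance):
    print(f"margin {m:5.1f}  0-1 {z:.0f}  exp {e:8.2f}  deviance {d:6.2f}")
```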
conclusion
squared-error and exponential loss are not robust, but they give rise to simple boosting algorithms (least-squares fitting of residuals and AdaBoost, respectively).
specific boosting example
boosting tree
a tree partitions the input space into regions \(R_j,\, j=1,\dots,J\)
\(x\in R_j\Rightarrow f(x)=\gamma_j\) (a constant prediction per region)
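a tiny sketch of this piecewise-constant view, i.e. \(f(x)=\sum_j\gamma_j\,\mathbb{1}[x\in R_j]\) (the region boundaries and constants below are made up):

```python
import numpy as np

# made-up partition of the real line into J = 3 regions with one constant each
boundaries = [0.5, 1.5]                   # R_1 = (-inf, 0.5], R_2 = (0.5, 1.5], R_3 = (1.5, inf)
gammas = np.array([-1.0, 0.2, 2.3])       # gamma_j, the prediction inside each region

def tree_predict(x):
    """f(x) = gamma_j for the region R_j that contains x."""
    j = np.searchsorted(boundaries, x)    # index of the region containing x
    return gammas[j]

x = np.array([-2.0, 0.7, 3.1])
print(tree_predict(x))                    # -> [-1.   0.2  2.3]
```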