The Elements of Statistical Learning 01

Posted on March 7, 2022 by Shiguang Wu
Tags: ESL

boosting

a committee built from a bunch of weak learners \(G_m\) (each only slightly better than random guessing)

\[ G(x)=\operatorname{sign}\left(\sum_{m=1}^M a_mG_m(x)\right) \]

one generic approach is the forward-stagewise method, where you compute one model \(G_m\) and its corresponding weight \(a_m\) at a time, by minimizing \(\sum_{i=1}^N L\bigl(y_i,\,f_{m-1}(x_i)+\beta G(x_i)\bigr)\) over \((\beta,G)\).

if we use squared error as the loss \(L\), then at each stage we are looking for a model \(\beta G\) that fits the current residuals \(y_i-f_{m-1}(x_i)\).
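To make this concrete, here is a minimal sketch (my own illustration, not from the book) of least-squares forward stagewise, using depth-1 regression trees from scikit-learn as the weak learner; each stage simply fits the current residuals:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_ls(X, y, M=100):
    """Forward-stagewise boosting under squared-error loss.

    With L(y, f) = (y - f)^2, the stage-m subproblem
        min over (beta, G) of sum_i (y_i - f_{m-1}(x_i) - beta*G(x_i))^2
    is a least-squares fit of beta*G to the current residuals.
    """
    f = np.zeros(len(y))                   # f_0 = 0
    learners = []
    for _ in range(M):
        r = y - f                          # residuals y_i - f_{m-1}(x_i)
        G = DecisionTreeRegressor(max_depth=1)
        G.fit(X, r)                        # beta is absorbed into the leaf values
        f = f + G.predict(X)               # f_m = f_{m-1} + beta_m * G_m
        learners.append(G)
    return learners
```

For regression trees the scale \(\beta\) is absorbed into the fitted leaf values, so no separate line search is needed.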

AdaBoost.M1

iteratively fit \(G_m\) on a reweighted version of the dataset, where previously misclassified points receive larger weights.

the method can be derived by using the exponential loss instead of the usual squared error:

\[ L(y_i,f(x_i))=e^{-y_if(x_i)} \]

suppose each weak learner outputs a class label directly, i.e. \(G(x)\in\{-1,1\}\)

at stage \(m\) we want to solve

\[ \min_{(a,G)}\sum_{i=1}^N\exp\bigl(-y_i(f_{m-1}(x_i)+aG(x_i))\bigr)=\min_{(a,G)}\sum_{i=1}^N w_i\exp(-y_i\,a\,G(x_i)), \]

where \(w_i=\exp(-y_if_{m-1}(x_i))\) depends on neither \(a\) nor \(G\).

this is exactly the forward-stagewise method. If we fix \(a>0\), the objective can be rearranged:

\[ e^{a}\sum_{y_i\neq G(x_i)}w_i+e^{-a}\sum_{y_i=G(x_i)}w_i =\bigl(e^{a}-e^{-a}\bigr)\sum_{i=1}^N w_i\,[y_i\neq G(x_i)]+e^{-a}\sum_{i=1}^N w_i \]

since \(e^{a}-e^{-a}>0\), minimizing over \(G\) amounts to minimizing the 0-1 loss on the weighted dataset

with this newly solved \(G\) plugged back in, we can then solve for \(a\):

\[ \frac{d}{da}\sum_i w_ie^{-ay_iG(x_i)}=-\sum_i w_iy_iG(x_i)e^{-ay_iG(x_i)} =-\bigl(e^{-a}+e^{a}\bigr)\sum_{y_i=G(x_i)}w_i+e^{a}\sum_{i=1}^Nw_i=0, \]

which gives

\[ a=\frac{1}{2}\log\frac{1-\mathrm{err}}{\mathrm{err}},\qquad \mathrm{err}=\frac{\sum_iw_i\,[y_i\neq G(x_i)]}{\sum_iw_i}. \]

in short, AdaBoost.M1 is obtained directly from the forward-stagewise method; with exponential loss, we get this neat closed-form solution.

further, since \(f_m=f_{m-1}+a_mG_m\), the weights can be updated iteratively: \(w_i^{(m+1)}=w_i^{(m)}e^{-a_my_iG_m(x_i)}\).
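Putting the whole derivation together, here is a minimal AdaBoost.M1 sketch (my own illustration using scikit-learn decision stumps; I keep the factor \(\tfrac12\) in \(a_m\) from the derivation above, whereas the book's AdaBoost.M1 statement drops it, which only rescales \(f\) and leaves the sign unchanged):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    """AdaBoost.M1 via forward-stagewise exponential loss.

    X: (N, p) array; y: (N,) labels in {-1, +1}; M: number of boosting rounds.
    """
    N = len(y)
    w = np.full(N, 1.0 / N)                  # start with uniform weights
    alphas, stumps = [], []
    for _ in range(M):
        G = DecisionTreeClassifier(max_depth=1)
        G.fit(X, y, sample_weight=w)          # fit G_m on the weighted data
        pred = G.predict(X)
        err = w[pred != y].sum() / w.sum()    # weighted 0-1 error
        if err >= 0.5:                        # no longer better than random: stop
            break
        err = max(err, 1e-12)                 # guard against log(0)
        a = 0.5 * np.log((1 - err) / err)     # closed-form a_m from above
        w = w * np.exp(-a * y * pred)         # w_i <- w_i * exp(-a_m y_i G_m(x_i))
        w /= w.sum()                          # renormalize (doesn't change argmins)
        alphas.append(a)
        stumps.append(G)
    return alphas, stumps

def committee_predict(alphas, stumps, X):
    """G(x) = sign(sum_m a_m G_m(x))."""
    f = sum(a * G.predict(X) for a, G in zip(alphas, stumps))
    return np.sign(f)
```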

more about the exponential loss

loss functions and robustness

here robustness means being less disturbed by outlier samples.

regression

loss and robustness in regression problems: squared error grows quadratically in the residual, so a few outliers can dominate the fit; absolute error and the Huber loss grow more slowly and are therefore more robust.

classification

we consider the two-class classification problem

in regression, the residual \(y-f(x)\) measures how well a sample is fit; in classification with \(y\in\{-1,1\}\), the margin \(yf(x)\) plays the analogous role: a positive margin means a correct classification, a negative one means a mistake.

[Figure: loss functions for two-class classification, plotted as functions of the margin \(yf(x)\).]
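For intuition, a small sketch (the set of losses and the margin grid are my choice) that evaluates the standard two-class losses as functions of the margin; exponential and squared-error losses explode for large negative margins, which is exactly why they are not robust to outliers:

```python
import numpy as np

# Common two-class losses written as functions of the margin m = y * f(x).
def misclassification(m):
    return (m < 0).astype(float)      # 0-1 loss: penalizes only the sign

def exponential(m):
    return np.exp(-m)                 # AdaBoost's loss: grows fast for m < 0

def binomial_deviance(m):
    return np.log(1 + np.exp(-m))     # logistic loss: roughly linear for very negative m

def squared_error(m):
    return (1 - m) ** 2               # (y - f)^2 = (1 - yf)^2 when y in {-1, 1}

margins = np.linspace(-2, 2, 5)
for name, L in [("0-1", misclassification), ("exp", exponential),
                ("deviance", binomial_deviance), ("squared", squared_error)]:
    print(name, np.round(L(margins), 2))
```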

conclusion

squared-error and exponential loss are not robust, but they give rise to simple boosting algorithms.

specific boosting example

boosting trees

a tree partitions the input space into regions \(R_j,\ j=1,\dots,J\), and predicts a constant \(\gamma_j\) in each region:

\[ x\in R_j\Rightarrow f(x)=\gamma_j,\qquad\text{i.e.}\quad f(x)=\sum_{j=1}^J\gamma_j\,[x\in R_j] \]
\(x\in R_j\Rightarrow f(x)=y_j\)