MSR AI Seminar: Why Does Deep Learning Perform Deep Learning?

note
research
webinar
Author

Shiguang WU

Published

February 8, 2023

recorded video on Bilibili 👉 MSR AI Seminar: Why Does Deep Learning Perform Deep Learning?

main question

how do “deep layers” work?

Deep learning = hierarchical feature learning

observation

  • adding more layers and training the network holistically improves accuracy, even though the earlier layers have already fully converged.

  • You cannot predict what a layer will learn just from how you build the architecture

object of study

we only consider densenets with a quadratic activation function

(figure: densenet architecture)
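A minimal sketch (my own, not from the talk) of such a quadratic densenet in PyTorch: each layer sees the concatenation of the input and all previous layers’ outputs, applies a bias-free linear map followed by the activation \(\sigma(z)=z^2\), and the network output is \(\sum_i a_iL_i\). The class name, widths, and coefficient parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QuadraticDenseNet(nn.Module):
    """DenseNet-style network with quadratic activation sigma(z) = z^2.

    Layer i sees the concatenation of the input and all previous layers'
    outputs (dense connections); the network output is the combination
    sum_i a_i * L_i, where L_i is the sum of layer i's output nodes.
    """

    def __init__(self, d_in, widths):
        super().__init__()
        self.layers = nn.ModuleList()
        fan_in = d_in
        for w in widths:
            self.layers.append(nn.Linear(fan_in, w, bias=False))
            fan_in += w  # dense connectivity: later layers also see this output
        self.a = nn.Parameter(torch.ones(len(widths)))  # per-layer coefficients a_i

    def forward(self, x):
        feats = x
        out = 0.0
        for i, lin in enumerate(self.layers):
            h = lin(feats) ** 2                     # quadratic activation
            out = out + self.a[i] * h.sum(dim=-1)   # L_i = sum of layer-i outputs
            feats = torch.cat([feats, h], dim=-1)   # feed features to later layers
        return out

# toy usage: a 2-layer quadratic densenet on d = 10 inputs
net = QuadraticDenseNet(d_in=10, widths=[32, 32])
print(net(torch.randn(4, 10)).shape)  # torch.Size([4])
```

Since layer \(i\) squares features that already contain layer \(i-1\)’s outputs, layer \(i\)’s nodes are degree-\(2^i\) polynomials of the input, so the output has degree \(2^L\), matching the assumption below.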

proof target

a densenet with wider layers will efficiently learn a target densenet to arbitrary accuracy

Note
  • “wider” means over-parameterized

  • “efficient” means converging to arbitrary accuracy \(\epsilon\) using \(\text{poly}(d,\frac{1}{\epsilon})\) samples, where \(d\) is the input dimension

assumptions

  • weight matrices in the target net are well-conditioned (not degenerate); the output is then a degree-\(2^L\) polynomial

  • information gap: let \(a_i\) be the coefficient of layer \(i\) in the linear combination forming the output, with \(a_i \gg a_{i+1} \gg 1/d^{0.01}\). note: \(G(x)=\sum_i^L a_i L_i\), where \(L_i\) is the sum of the output nodes of layer \(i\)

  • \(L\approx O(\log\log d)\)

Note

a shallow model will not learn this efficiently, typically needing \(d^{2^L}\) samples, while the deep model needs only about \(2^{2^L}\) samples, which is \(\text{poly}(d)\) when \(L\approx O(\log\log d)\)
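For concreteness (my own arithmetic, taking logs base 2), if \(L \approx \log_2\log_2 d\) so that \(2^L \approx \log_2 d\), then

\[
d^{2^L} = d^{\log_2 d} = 2^{(\log_2 d)^2} \ \text{(quasi-polynomial)}, \qquad 2^{2^L} = 2^{\log_2 d} = d \ \text{(polynomial)}.
\]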

intuition of the proof

Step 1: about overparameterization

If we wish to learn \(G(x)=x_1^2+x_2^2+\alpha(x_1^4+x_2^4)\), \(\alpha=0.1\)

We hope the first layer learns \(x_1^2\) and \(x_2^2\), and the second layer learns \(\alpha(x_1^4+x_2^4)\)

but the first layer may give an output from which \(x_1^4+x_2^4\) cannot be reconstructed

solution: over-parameterization and Gaussian random initialization

conclusion

a rich representation for the next layer (not necessarily useful for the current layer)
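A toy numerical check of this point (my own sketch; the width \(m\), sample count, and use of plain least squares are arbitrary choices): draw an over-parameterized random first layer \(h_j=(w_j^\top x)^2\) with Gaussian \(w_j\), and test whether a second quadratic layer, i.e. a linear function of the products \(h_jh_k\), can reconstruct the degree-4 term \(x_1^4+x_2^4\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 2, 20, 2000          # input dim, over-parameterized first-layer width, samples

X = rng.standard_normal((n, d))
target = X[:, 0] ** 4 + X[:, 1] ** 4             # the x1^4 + x2^4 part of G

# over-parameterized, Gaussian-initialized first layer with quadratic activation
W = rng.standard_normal((m, d))
H = (X @ W.T) ** 2                               # h_j = (w_j^T x)^2, shape (n, m)

# a second quadratic layer computes quadratics of H, i.e. products h_j * h_k
pairs = np.einsum('ni,nj->nij', H, H).reshape(n, -1)

# can the second layer linearly recombine these degree-4 features into the target?
coef, *_ = np.linalg.lstsq(pairs, target, rcond=None)
pred = pairs @ coef
print('relative error:', np.linalg.norm(pred - target) / np.linalg.norm(target))
```

With \(m\) large enough the relative error should be essentially zero: the random over-parameterized layer already carries the features the second layer needs, even though nothing forced it to expose \(x_1^2\) and \(x_2^2\) individually.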

Step 2

If we wish to learn \(G(x)=x_1^2+x_2^2+\alpha((x_1^2+x_3)^2+(x_2^2+x_4)^2)\), \(\alpha=0.1\)

Chances are that the first layer learns \((x_1+\alpha x_3)^2+(x_2+\alpha x_4)^2\) from which the next layer cannot reconstruct the remaining terms

claim

layer-wise training overfits to higher-level signals, not noise

solution: train both layers together; the second layer will fix the first layer
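A sketch of how one might compare the two regimes on this target (my own toy setup in PyTorch; widths, optimizer settings, and the equal-weight output are simplifying assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha, d, n = 0.1, 4, 4096
X = torch.randn(n, d)
x1, x2, x3, x4 = X.T
G = x1**2 + x2**2 + alpha * ((x1**2 + x3)**2 + (x2**2 + x4)**2)

def make_net(width=64):
    # two quadratic layers with a dense connection; output = sum of both layers' nodes
    return nn.Linear(d, width, bias=False), nn.Linear(d + width, width, bias=False)

def forward(l1, l2, X):
    h1 = l1(X) ** 2
    h2 = l2(torch.cat([X, h1], dim=-1)) ** 2
    return h1.sum(-1) + h2.sum(-1)

def fit(params, loss_fn, steps=2000, lr=1e-2):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn()
        loss.backward()
        opt.step()
    return loss.item()

# (a) layer-wise: train layer 1 alone on G, freeze it, then train layer 2
l1, l2 = make_net()
fit(l1.parameters(), lambda: ((l1(X) ** 2).sum(-1) - G).pow(2).mean())
for p in l1.parameters():
    p.requires_grad_(False)
loss_layerwise = fit(l2.parameters(), lambda: (forward(l1, l2, X) - G).pow(2).mean())

# (b) joint: train both layers together, so layer 2's gradient can keep correcting layer 1
l1, l2 = make_net()
loss_joint = fit(list(l1.parameters()) + list(l2.parameters()),
                 lambda: (forward(l1, l2, X) - G).pow(2).mean())

print(f'layer-wise: {loss_layerwise:.6f}   joint: {loss_joint:.6f}')
```

Per the claim above, the expectation is that the joint run reaches a lower loss, because the second layer’s gradient keeps correcting the first layer’s features instead of freezing in the overfit ones.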

Step 3

first layer: \(\alpha\)-close (to the ground-truth poly)

\(\xrightarrow{\text{learn}}\) second layer (learns the residual): \(\alpha^2\)-close

\(\xrightarrow{\text{correction}}\) first layer (correction): \(\alpha^2\)-close

\(\xrightarrow{\text{re-learn}}\) second layer: \(\alpha^4\)-close

…

more layers means more corrections to the previous layers
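Reading off this schedule (my own arithmetic, assuming the closeness really squares once per round of correction):

\[
\alpha \;\to\; \alpha^{2} \;\to\; \alpha^{4} \;\to\; \cdots \;\to\; \alpha^{2^{t}}, \qquad \alpha^{2^{t}} \le \epsilon \iff t \ge \log_2\bigl(\log_{1/\alpha}(1/\epsilon)\bigr),
\]

so any fixed accuracy \(\epsilon\) is reached after only doubly-logarithmically many rounds of backward correction.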

claim

layers are learned simultaneously

important: backward feature correction

feature visualization:

find the input image that activates a specific neuron the most via gradient descent

weakness: relies on strong regularization to make the result look like a natural image; otherwise it yields a meaningless noise image
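A minimal sketch of this activation-maximization procedure (my own, assuming a recent torchvision; the model, layer index, channel, and regularization weight are arbitrary choices):

```python
import torch
from torchvision import models

# activation maximization: optimize the input image so a chosen neuron fires strongly
model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

layer, channel = model.features[10], 42          # hypothetical choice of neuron
activation = {}
layer.register_forward_hook(lambda m, i, o: activation.update(out=o))

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from pure noise
opt = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    opt.zero_grad()
    model(img)
    act = activation['out'][0, channel].mean()
    # the L2 penalty is the (weak) image prior; dropping it tends to give noise
    loss = -act + 1e-3 * img.pow(2).mean()
    loss.backward()
    opt.step()
```

Weakening or removing the regularizer reproduces the weakness above: the optimizer still maximizes the neuron, but the result looks like structured noise rather than a natural image.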

related

adversarial perturbation