Information Dropout: learning optimal representations through noisy computation, 2016.

Emergence of invariance and disentanglement in deep representations, arXiv, 2017.

Deep variational information bottleneck, International Conference on Learning Representations, 2017.

How (not) to train your neural network using the information bottleneck principle, 2018.

Information-theoretic analysis of generalization capability of learning algorithms, Advances in Neural Information Processing Systems, vol. 30, pp. 2524-2533, 2017.

Rademacher and Gaussian complexities: Risk bounds and structural results, Journal of Machine Learning Research, 2002.

Invariant scattering convolution networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, Advances in Neural Information Processing Systems, 2016.

Elements of information theory, 1991.

Stochastic pooling for regularization of deep convolutional neural networks, 2013.

Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, Proceedings of the International Conference on Machine Learning, 2016.

Unsupervised domain adaptation by backpropagation, ICML, pp. 1180-1189, 2015.

Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

DOI : 10.1109/cvpr.2016.90

URL : http://arxiv.org/pdf/1512.03385

Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proceedings of the International Conference on Machine Learning, 2015.

On large-batch training for deep learning: Generalization gap and sharp minima, International Conference on Learning Representations, 2017.

Auto-encoding variational Bayes, International Conference on Learning Representations, 2014.

Learning multiple layers of features from tiny images, 2009.

ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25, 2012.

DOI : 10.1145/3065386

URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

A simple weight decay can improve generalization, Advances in Neural Information Processing Systems, 1992.

Group invariant scattering, Communications on Pure and Applied Mathematics, 2012.

DOI : 10.1002/cpa.21413

URL : http://arxiv.org/pdf/1101.2286

On the uniform convergence of relative frequencies of events to their probabilities, Measures of Complexity, 1972.

Deep learning and the information bottleneck principle, Information Theory Workshop (ITW), 2015.

Estimation of entropy and mutual information, Neural Computation, 2003.

DOI : 10.1162/089976603321780272

URL : http://www.cns.nyu.edu/pub/eero/paninski03-reprint.pdf

Regularizing neural networks by penalizing confident output distributions, International Conference on Learning Representations Workshop, 2017.

Modeling by shortest data description, Automatica, 1978.

DOI : 10.1016/0005-1098(78)90005-5

ImageNet large scale visual recognition challenge, International Journal of Computer Vision, 2015.

DOI : 10.1007/s11263-015-0816-y

URL : http://dspace.mit.edu/bitstream/1721.1/104944/1/11263_2015_Article_816.pdf

Learning and generalization with the information bottleneck, Theoretical Computer Science, 2010.

DOI : 10.1007/978-3-540-87987-9_12

URL : http://www.cs.huji.ac.il/labs/learning/Papers/ibgen_full.pdf

Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, 2014.

Weakly supervised learning of deep convolutional neural networks, CVPR, 2016.

The information bottleneck method, Annual Allerton Conference on Communication, Control and Computing, 1999.