Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.

squared error between teacher and student outputs, averaged over all of input space. We will focus on student networks that have a larger number of hidden units K ≥ M than their teacher. This means that the student can express much more complex functions than the teacher function they have to learn; the students are thus over-parameterised with respect to the generative model of the training data in a way that is simple to quantify. We find this definition of over-parameterisation cleaner in our setting than the oft-used comparison of the number of parameters in the model with the number of samples in the training set, which is not well justified for non-linear functions. Furthermore, these two numbers surely cannot fully capture the complexity of the function learned in practical applications.
The teacher-student framework is also interesting in the wake of the need to understand the effectiveness of neural networks and the limitations of the classical approaches to generalisation 11 . Traditional approaches to learning and generalisation are data agnostic and seek worst-case type bounds 19 . On the other hand, there has been a considerable body of theoretical work calculating the generalisation ability of neural networks for data arising from a probabilistic model, particularly within the framework of statistical mechanics 17,18,[20][21][22] . Revisiting and extending the results that have emerged from this perspective is currently experiencing a surge of interest [23][24][25][26][27][28] .
In this work we consider two-layer networks with a large input layer and a finite, but arbitrary, number of hidden neurons. Other limits of two-layer neural networks have received a lot of attention recently. A series of papers 29-32 studied the mean-field limit of two-layer networks, where the number of neurons in the hidden layer is very large, and proved various general properties of SGD based on a description in terms of a limiting partial differential equation. Another set of works, operating in a different limit, have shown that infinitely wide over-parameterised neural networks trained with gradient-based methods effectively solve a kernel regression [33][34][35][36][37][38] , without any feature learning. Both the mean-field and the kernel regime crucially rely on having an infinite number of nodes in the hidden layer, and the performance of the networks strongly depends on the detailed scaling used 38,39 . Furthermore, a very wide hidden layer makes it hard to have a student that is larger than the teacher in a quantifiable way. This leads us to consider the opposite limit of large input dimension and finite number of hidden units.
Our main contributions are as follows: (i) The dynamics of SGD (online) learning by two-layer neural networks in the teacher-student setup was studied in a series of classic papers [40][41][42][43][44] from the statistical physics community, leading to a heuristic derivation of a set of coupled ordinary differential equations (ODE) that describe the typical time-evolution of the generalisation error. We provide a rigorous foundation of the ODE approach to analysing the generalisation dynamics in the limit of large input size by proving their correctness.
(ii) These works focused on training only the first layer, mainly in the case where the teacher network has the same number of hidden units as the student network, K = M . We generalise their analysis to the case where the student's expressivity is considerably larger than that of the teacher in order to investigate the over-parameterised regime K > M .
(iii) We provide a detailed analysis of the dynamics of learning and of the generalisation when only the first layer is trained. We derive a reduced set of coupled ODE that describes the generalisation dynamics for any K ≥ M and obtain analytical expressions for the asymptotic generalisation error of networks with linear and sigmoidal activation functions. Crucially, we find that with all other parameters equal, the final generalisation error increases with the size of the student network. In this case, SGD alone thus does not seem to be enough to regularise larger student networks.
(iv) We finally analyse the dynamics when learning both layers. We give an analytical expression for the final generalisation error of sigmoidal networks and find evidence that suggests that SGD finds solutions which amount to performing an effective model average, thus improving the generalisation error upon over-parameterisation. In linear and ReLU networks, we experimentally find that the generalisation error does not change as a function of K when training both layers. However, there exist student networks with better performance that are fixed points of the SGD dynamics, but are not reached when starting SGD from initial conditions with small, random weights.
Crucially, we find this range of different behaviours while keeping the training algorithm (SGD) the same, changing only the activation functions of the networks and the parts of the network that are trained. Our results clearly indicate that the implicit regularisation of neural networks in our setting goes beyond the properties of SGD alone. Instead, a full understanding of the generalisation properties of even very simple neural networks requires taking into account the interplay of at least the algorithm, the network architecture, and the data set used for training, setting up a formidable research programme for the future.
Reproducibility -We have packaged the implementation of our experiments and our ODE integrator into a user-friendly library with example programs at https://github.com/sgoldt/nn2pp. All plots were generated with these programs, and we give the necessary parameter values beneath each plot.
1 Online learning in teacher-student neural networks

We consider a supervised regression problem with training set $D = \{(x^\mu, y^\mu)\}$ with $\mu = 1, \ldots, P$. The components of the inputs $x^\mu \in \mathbb{R}^N$ are i.i.d. draws from the standard normal distribution $\mathcal{N}(0, 1)$. The scalar labels $y^\mu$ are given by the output of a network with $M$ hidden units, a non-linear activation function $g : \mathbb{R} \to \mathbb{R}$ and fixed weights $\theta^* = (v^* \in \mathbb{R}^M, w^* \in \mathbb{R}^{M \times N})$ with an additive output noise $\zeta^\mu \sim \mathcal{N}(0, 1)$, called the teacher (see also Fig. 1a):

$$y^\mu = \sum_{m=1}^{M} v^*_m \, g\!\left(\frac{w^*_m x^\mu}{\sqrt{N}}\right) + \sigma \zeta^\mu,$$

where $w^*_m$ is the $m$th row of $w^*$, and the local field of the $m$th teacher node is $\rho_m \equiv w^*_m x / \sqrt{N}$. We will analyse three different network types: sigmoidal with $g(x) = \mathrm{erf}(x/\sqrt{2})$, ReLU with $g(x) = \max(x, 0)$, and linear networks where $g(x) = x$.
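The data-generating process described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `teacher_labels`, the sizes `N`, `M`, `P` and the noise level are hypothetical choices, and the code is not taken from the authors' library.

```python
import math
import numpy as np

# Element-wise erf built from the standard library (avoids a SciPy dependency).
erf_vec = np.vectorize(math.erf)

def teacher_labels(x, w_star, v_star, sigma, rng):
    """Noisy teacher outputs y = sum_m v*_m g(rho_m) + sigma * zeta, g = erf(. / sqrt(2))."""
    n_inputs = x.shape[1]
    rho = x @ w_star.T / math.sqrt(n_inputs)   # teacher local fields, shape (P, M)
    zeta = rng.standard_normal(x.shape[0])     # output noise, zeta ~ N(0, 1)
    return erf_vec(rho / math.sqrt(2)) @ v_star + sigma * zeta

rng = np.random.default_rng(0)
N, M, P = 200, 2, 500                          # illustrative sizes (assumption)
w_star = rng.standard_normal((M, N))           # fixed teacher first-layer weights
v_star = np.ones(M)                            # fixed teacher second-layer weights
x = rng.standard_normal((P, N))                # i.i.d. standard normal inputs
y = teacher_labels(x, w_star, v_star, sigma=0.1, rng=rng)
```

Swapping `erf_vec(rho / math.sqrt(2))` for `np.maximum(rho, 0)` or `rho` would give the ReLU and linear teachers, respectively.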
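A second two-layer network with $K$ hidden units and weights $\theta = (v \in \mathbb{R}^K, w \in \mathbb{R}^{K \times N})$, called the student, is then trained using SGD on the quadratic training loss

$$E(\theta) = \frac{1}{2} \sum_{\mu=1}^{P} \left[ \sum_{k=1}^{K} v_k \, g\!\left(\frac{w_k x^\mu}{\sqrt{N}}\right) - y^\mu \right]^2.$$

We emphasise that the student network may have a larger number of hidden units $K \geq M$ than the teacher and thus be over-parameterised with respect to the generative model of its training data.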
The SGD algorithm defines a Markov process $X^\mu \equiv [v^*, w^*, v^\mu, w^\mu]$ with update rule given by the coupled SGD recursion relations

$$w_k^{\mu+1} = w_k^\mu - \frac{\eta_w}{\sqrt{N}} \, v_k^\mu \, g'(\lambda_k^\mu) \, \Delta^\mu x^\mu, \tag{2}$$

$$v_k^{\mu+1} = v_k^\mu - \frac{\eta_v}{N} \, g(\lambda_k^\mu) \, \Delta^\mu. \tag{3}$$

We can choose different learning rates $\eta_v$ and $\eta_w$ for the two layers and denote by $g'(\lambda_k^\mu)$ the derivative of the activation function evaluated at the local field of the student's $k$th hidden unit $\lambda_k^\mu \equiv w_k^\mu x^\mu / \sqrt{N}$, and we defined the error term $\Delta^\mu \equiv \sum_k v_k^\mu g(\lambda_k^\mu) - \sum_m v^*_m g(\rho_m^\mu) - \sigma \zeta^\mu$. We will use the indices $i, j, k, \ldots$ to refer to student nodes, and $n, m, \ldots$ to denote teacher nodes. We take initial weights at random from $\mathcal{N}(0, 1)$ for sigmoidal networks, while initial weights have variance $1/\sqrt{N}$ for ReLU and linear networks.
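A single online SGD step for a sigmoidal student can be sketched as follows. The sketch assumes that first-layer steps are scaled by $\eta_w/\sqrt{N}$ and second-layer steps by $\eta_v/N$ (the latter matches the $1/N$ rescaling mentioned in Sec. A.5); the function names are hypothetical and this is not the authors' implementation.

```python
import math
import numpy as np

erf_vec = np.vectorize(math.erf)

def g(z):
    """Sigmoidal activation g(z) = erf(z / sqrt(2))."""
    return erf_vec(z / math.sqrt(2))

def g_prime(z):
    """Derivative of erf(z / sqrt(2)), i.e. sqrt(2 / pi) * exp(-z^2 / 2)."""
    return math.sqrt(2 / math.pi) * np.exp(-z ** 2 / 2)

def sgd_step(w, v, x, y, eta_w, eta_v):
    """One online SGD step on the sample (x, y) for both layers of the student."""
    n = x.shape[0]
    lam = w @ x / math.sqrt(n)            # student local fields, shape (K,)
    delta = v @ g(lam) - y                # error term Delta (y already contains the noise)
    w_new = w - eta_w / math.sqrt(n) * delta * np.outer(v * g_prime(lam), x)
    v_new = v - eta_v / n * delta * g(lam)
    return w_new, v_new
```

Setting `eta_v = 0` freezes the second layer, which corresponds to training only the first layer.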
The key quantity in our approach is the generalisation error of the student with respect to the teacher:

$$\epsilon_g(\theta^*, \theta) \equiv \frac{1}{2} \left\langle \left[ \sum_{k} v_k \, g(\lambda_k) - \sum_{m} v^*_m \, g(\rho_m) \right]^2 \right\rangle,$$

where the angled brackets $\langle \cdot \rangle$ denote an average over the input distribution. We can make progress by realising that $\epsilon_g(\theta^*, \theta)$ can be expressed as a function of a set of macroscopic variables, called order parameters in statistical physics 21,40,41 ,

$$R_{in}^\mu \equiv \frac{w_i^\mu w_n^*}{N}, \qquad Q_{ik}^\mu \equiv \frac{w_i^\mu w_k^\mu}{N}, \qquad T_{nm} \equiv \frac{w_n^* w_m^*}{N},$$

together with the second-layer weights $v^*$ and $v^\mu$. Intuitively, the teacher-student overlaps $R^\mu = [R_{in}^\mu]$ measure the similarity between the weights of the $i$th student node and the $n$th teacher node. The matrix $Q_{ik}$ quantifies the overlap of the weights of different student nodes with each other, and the corresponding overlaps of the teacher nodes are collected in the matrix $T_{nm}$. We will find it convenient to collect all order parameters in a single vector $m^\mu$, and we write the full expression for $\epsilon_g(m^\mu)$ in Eq. (S31).
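To make the role of the order parameters concrete, the sketch below computes the overlaps from the weights and evaluates the generalisation error for $g(x) = \mathrm{erf}(x/\sqrt{2})$. It relies on the standard Gaussian identity $\langle g(a) g(b) \rangle = \frac{2}{\pi} \arcsin\!\big(C_{ab} / \sqrt{(1 + C_{aa})(1 + C_{bb})}\big)$ for jointly Gaussian fields with covariance $C$; the function names are hypothetical, and the result is assumed (not verified here) to agree with the sigmoidal case of Eq. (S31).

```python
import numpy as np

def order_parameters(w, w_star):
    """Overlap matrices R (student-teacher), Q (student-student), T (teacher-teacher)."""
    n = w.shape[1]
    R = w @ w_star.T / n
    Q = w @ w.T / n
    T = w_star @ w_star.T / n
    return R, Q, T

def eps_g_erf(R, Q, T, v, v_star):
    """Generalisation error for g(x) = erf(x / sqrt(2)) from order parameters only."""
    q = np.sqrt(1 + np.diag(Q))
    t = np.sqrt(1 + np.diag(T))
    student_student = np.arcsin(Q / np.outer(q, q))
    teacher_teacher = np.arcsin(T / np.outer(t, t))
    student_teacher = np.arcsin(R / np.outer(q, t))
    return (v @ student_student @ v
            + v_star @ teacher_teacher @ v_star
            - 2 * v @ student_teacher @ v_star) / np.pi

rng = np.random.default_rng(0)
w_star = rng.standard_normal((2, 300))          # illustrative teacher (assumption)
w = rng.standard_normal((4, 300))               # illustrative student with K = 4
eps_random = eps_g_erf(*order_parameters(w, w_star), np.ones(4), np.ones(2))
eps_perfect = eps_g_erf(*order_parameters(w_star, w_star), np.ones(2), np.ones(2))
```

A student that copies the teacher exactly gives `eps_perfect` equal to zero up to rounding, whereas the random student has a strictly positive error.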
In a series of classic papers, Biehl, Schwarze, Saad, Solla and Riegler 40-44 derived a closed set of ordinary differential equations for the time evolution of the order parameters $m$ (see SM Sec. B). Together with the expression for the generalisation error $\epsilon_g(m^\mu)$, these equations give a complete description of the generalisation dynamics of the student, which they analysed for the special case $K = M$ when only the first layer is trained 42,44 . Our first contribution is to provide a rigorous foundation for these results under the following assumptions:
(A1) Both the sequences $x^\mu$ and $\zeta^\mu$, $\mu = 1, 2, \ldots$, are i.i.d. random variables; $x^\mu$ is drawn from a normal distribution with mean 0 and covariance matrix $I_N$, while $\zeta^\mu$ is a Gaussian random variable with mean zero and unit variance;
(A2) The function $g(x)$ is bounded and its derivatives up to and including the second order exist and are bounded, too;
(A3) The initial macroscopic state $m^0$ is deterministic and bounded by a constant;
(A4) The constants $\sigma$, $K$, $M$, $\eta_w$ and $\eta_v$ are all finite.
The correctness of the ODE description is then established by the following theorem: Theorem 1.1. Choose T > 0 and define α ≡ µ/N . Under assumptions (A1) -(A4), and for any α > 0, the macroscopic state m µ satisfies where C(T ) is a constant depending on T , but not on N , and m(α) is the unique solution of the ODE d dt with initial condition m * . In particular, we have where all f (m(α)) are uniformly Lipschitz continuous in m(α). We are able to close the equations because we can express averages in Eq. (9) in terms of only m(α).
We prove Theorem 1.1 using the theory of convergence of stochastic processes and a coupling trick introduced recently by Wang et al. 45 in Sec. A of the SM. The content of the theorem is illustrated in Fig. 1b, where we plot g (α) obtained by numerically integrating (9) (solid) and from a single run of SGD (2) (crosses) for sigmoidal students and varying K, which are in very good agreement.
Given a set of non-linear, coupled ODE such as Eqns. (9), finding the asymptotic fixed points analytically to compute the generalisation error would seem to be impossible. In the following, we will therefore focus on analysing the asymptotic fixed points found by numerically integrating the equations of motion. The form of these fixed points will reveal a drastically different dependence of the test error on the over-parameterisation of neural networks with different activation functions in the different setups we consider, despite them all being trained by SGD. This highlights the fact that good generalisation goes beyond the properties of just the algorithm. Second, knowledge of these fixed points allows us to make analytical and quantitative predictions for the asymptotic performance of the networks which agree well with experiments. We also note that several recent theorems [29][30][31] about the global convergence of SGD do not apply in our setting because we have a finite number of hidden units.

Asymptotic generalisation error of Soft Committee machines
We will first study networks where the second layer weights are fixed at v * m = v k = 1. These networks are called a Soft Committee Machine (SCM) in the statistical physics literature 18,27,[40][41][42]44 . One notable feature of g (α) in SCMs is the existence of a long plateau with sub-optimal generalisation error during training. During this period, all student nodes have roughly the same overlap with all the teacher nodes, R in = const. (left inset in Fig. 1b). As training continues, the student nodes "specialise" and each of them becomes strongly correlated with a single teacher node (right inset), leading to a sharp decrease in g . This effect is well-known for both batch and online learning 18 and will be key for our analysis.
Let us now use the equations of motion (9) to analyse the asymptotic generalisation error of neural networks $\epsilon_g^*$ after training has converged and in particular its scaling with $L = K - M$. Our first contribution is to reduce the remaining $K(K + M)$ equations of motion to a set of eight coupled differential equations for any combination of $K$ and $M$ in Sec. C. This enables us to obtain a closed-form expression for $\epsilon_g^*$ as follows. In the absence of output noise ($\sigma = 0$), the generalisation error of a student with $K \geq M$ will asymptotically tend to zero as $\alpha \to \infty$. On the level of the order parameters, this corresponds to reaching a stable fixed point of (9) with $\epsilon_g = 0$. In the presence of small output noise $\sigma > 0$, this fixed point becomes unstable and the order parameters instead converge to another, nearby fixed point $m^*$ with $\epsilon_g(m^*) > 0$. The values of the order parameters at that fixed point can be obtained by perturbing Eqns. (9) to first order in $\sigma$, and the corresponding generalisation error $\epsilon_g(m^*)$ turns out to be in excellent agreement with the generalisation error obtained when training a neural network using (2) from random initial conditions, which we show in Fig. 2a.
Sigmoidal networks. We have performed this calculation for teacher and student networks with g(x) = erf(x/ √ 2). We relegate the details to Sec. C.2, and content us here to state the asymptotic value of the generalisation error to first order in σ 2 , * g = where f (M, L, η) is a lengthy rational function of its variables. We plot our result in Fig. 2a together with the final generalisation error obtained in a single run of SGD (2) for a neural network with initial weights drawn i.i.d. from N (0, 1) and find excellent agreement, which we confirmed for a range of values for η, σ, and L.
One notable feature of Fig. 2a is that with all else being equal, SGD alone fails to regularise the student networks of increasing size in our setup, instead yielding students whose generalisation error increases linearly with L. One might be tempted to mitigate this effect by simultaneously decreasing the learning rate η for larger students. However, lowering the learning rate incurs longer training times, which requires more data for online learning. This trade-off is also found in statistical learning theory, where models with more parameters (higher L) and thus a higher complexity class (e.g. VC dimension or Rademacher complexity 4 ) generalise just as well as smaller ones when given more data. In practice, however, more data might not be readily available, and we show in Fig. S2 of the SM that even when choosing η = 1/K, the generalisation error still increases with L before plateauing at a constant value.
We can gain some intuition for the scaling of * g by considering the asymptotic overlap matrices Q and R shown in the left half of Fig. 2b. In the over-parameterised case, L = K − M student nodes are effectively trying to specialise to teacher nodes which do not exist, or equivalently, have weights zero. These L student nodes do not carry any information about the teachers output, but they pick up fluctuations from output noise and thus increase * g . This intuition is borne out by an expansion of * g in the limit of small learning rate η, which yields * g = which is indeed the sum of the error of M independent hidden units that are specialised to a single teacher hidden unit, and L = K − M superfluous units contributing each the error of a hidden unit that is "learning" from a hidden unit with zero weights w * m = 0 (see also Sec. D of the SM).
Linear networks. Two possible explanations for the scaling $\epsilon_g^* \sim L$ in sigmoidal networks may be the specialisation of the hidden units or the fact that teacher and student network can implement functions of different range if $K \neq M$. To test these hypotheses, we calculated $\epsilon_g^*$ for linear neural networks 46,47 with $g(x) = x$. Linear networks lack a specialisation transition 27 and their output range is set by the magnitude of their weights, rather than their number of hidden units. Following the same steps as before, a perturbative calculation in the limit of small noise variance $\sigma^2$ yields $\epsilon_g^* =$ This result is again in perfect agreement with experiments, as we demonstrate in Fig. 2a. In the limit of small learning rates $\eta$, Eq. (10) simplifies to yield the same scaling as for sigmoidal networks. This shows that the scaling $\epsilon_g^* \sim L$ is not just a consequence of either specialisation or the mismatched range of the networks' output functions. The optimal number of hidden units for linear networks is $K = 1$ for all $M$, because linear networks implement an effective linear transformation with an effective matrix $W = \sum_k w_k$. Adding hidden units to a linear network hence does not augment the class of functions it can implement, but it adds redundant parameters which pick up fluctuations from the teacher's output noise, increasing $\epsilon_g$.
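The statement that a linear soft committee machine only implements an effective single-unit map can be checked directly. The snippet below is a minimal sketch with illustrative sizes; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 100, 4                                  # illustrative sizes (assumption)
w = rng.standard_normal((K, N))                # first-layer weights of a linear student
x = rng.standard_normal(N)                     # one input sample

# Output of a linear soft committee machine (v_k = 1, g(x) = x) ...
phi_committee = np.sum(w @ x / np.sqrt(N))

# ... equals the output of a single unit with effective weights W = sum_k w_k.
W_eff = w.sum(axis=0)
phi_single = W_eff @ x / np.sqrt(N)
```

Both outputs agree up to floating-point rounding for any `K`, which is why extra hidden units add parameters without enlarging the function class.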
ReLU networks. The analytical calculation of $\epsilon_g^*$, described above, poses some additional technical challenges for ReLU networks, so we resort to experiments to investigate this case. We found that the asymptotic generalisation error of a ReLU student learning from a ReLU teacher has the same scaling as the one we found analytically for networks with sigmoidal and linear activation functions: $\epsilon_g^* \sim \eta \sigma^2 L$ (see Fig. S3). Looking at the final overlap matrices $Q$ and $R$ for ReLU networks in the bottom half of Fig. 2b, we see that instead of the one-to-one specialisation of sigmoidal networks, all student nodes have a finite overlap with some teacher node. This is a consequence of the fact that it is much simpler to re-express the sum of $M$ ReLU units with $K \neq M$ ReLU units. However, there are still a lot of redundant degrees of freedom in the student, which all pick up fluctuations from the teacher's output noise and increase $\epsilon_g^*$.

Discussion.
The key result of this section has been that the generalisation error of SCMs scales as $\epsilon_g^* \sim \eta \sigma^2 L$. Before moving on to the full two-layer network, we discuss a number of experiments that we performed to check the robustness of this result (details can be found in Sec. G of the SM). A standard regularisation method is adding weight decay to the SGD updates (2). However, we did not find a scenario in our experiments where weight decay improved the performance of a student with $L > 0$.
We also made sure that our results persist when performing SGD with mini-batches. We investigated the impact of higher-order correlations in the inputs by replacing Gaussian inputs with MNIST images, keeping all other aspects of our setup the same, and found the same $\epsilon_g^*$-$L$ curve as for Gaussian inputs. Finally, we analysed the impact of having a finite training set. The behaviour of linear networks and of non-linear networks with large but finite training sets did not change qualitatively. However, as we reduced the size of the training set, we found that the lowest asymptotic generalisation error was obtained with networks that have $K > M$.

Training both layers: Asymptotic generalisation error of a neural network
We now study the performance of two-layer neural networks when both layers are trained according to the SGD updates (2) and (3). We set all the second-layer teacher weights equal to a constant value, $v^*_m = v^*$, to ensure comparability between experiments. However, we train all $K$ second-layer weights of the student independently and do not rely on the fact that all second-layer teacher weights have the same value. Note that learning the second layer is not needed from the point of view of statistical learning: the networks from the previous section are already expressive enough to capture the teacher, and we are thus slightly increasing the over-parameterisation even further. Yet, we will see that the generalisation properties will be significantly enhanced.
Sigmoidal networks. We plot the generalisation dynamics of students with increasing K trained on a teacher with M = 2 in Fig. 3a. Our first observation is that increasing the student size K ≥ M decreases the asymptotic generalisation error * g , with all other parameters being equal, in stark contrast to the SCMs of the previous section.
A look at the order parameters after convergence in the experiments from Fig. 3a reveals the intriguing pattern of specialisation of the student's hidden units behind this behaviour, shown for $K = 5$ in Fig. 3b. First, note that all the hidden units of the student have non-negligible weights ($Q_{ii} > 0$). Two student nodes ($k = 1, 2$) have specialised to the first teacher node, i.e. their weights are very close to the weights of the first teacher node ($R_{10} \approx R_{20} \approx 0.85$). The corresponding second-layer weights approximately fulfil $v_1 + v_2 \approx v^*$. Summing the output of these two student hidden units is thus approximately equivalent to an empirical average of two estimates of the output of the teacher node. The remaining three student nodes all specialised to the second teacher node, and their outgoing weights approximately sum to $v^*$. This pattern suggests that SGD has found a set of weights for both layers where the student's output is a weighted average of several estimates of the output of the teacher's nodes. We call this the denoising solution and note that it resembles the solutions found in the mean-field limit of an infinite hidden layer 29,31 where the neurons become redundant and follow a distribution dynamics (in our case, a simple one with few peaks, as e.g. Fig. 1 in 31 ).
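The averaging intuition behind the denoising solution can be illustrated with a toy experiment. The sketch below assumes a single teacher node ($M = 1$) and models each specialised student node as the teacher weights plus an independent Gaussian perturbation, with second-layer weights $v^*/Z$; this perturbation model, the sizes and all names are assumptions made for illustration and are not the weights SGD actually finds.

```python
import math
import numpy as np

erf_vec = np.vectorize(math.erf)

def g(z):
    return erf_vec(z / math.sqrt(2))

def mc_eps_g(w, v, w_star, v_star, x):
    """Monte Carlo estimate of eps_g = 0.5 * <(student - teacher)^2> over the inputs x."""
    n = x.shape[1]
    student = g(x @ w.T / math.sqrt(n)) @ v
    teacher = g(x @ w_star.T / math.sqrt(n)) @ v_star
    return 0.5 * np.mean((student - teacher) ** 2)

rng = np.random.default_rng(2)
N, P = 200, 2000                               # illustrative sizes (assumption)
w_star = rng.standard_normal((1, N))           # one teacher node
v_star = np.array([1.0])
x = rng.standard_normal((P, N))                # test inputs

def noisy_copies(Z, scale=0.3):
    """Z student nodes near the teacher node, sharing its outgoing weight equally."""
    w = w_star + scale * rng.standard_normal((Z, N))
    v = np.full(Z, v_star[0] / Z)
    return w, v

eps_one = mc_eps_g(*noisy_copies(1), w_star, v_star, x)
eps_five = mc_eps_g(*noisy_copies(5), w_star, v_star, x)
```

In this toy model the error of the five-node average is markedly smaller than that of a single perturbed node, because the independent perturbations partially cancel in the sum.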
We confirmed this intuition by using an ansatz for the order parameters that corresponds to a denoising solution to solve the equations of motion (9) perturbatively in the limit of small noise to calculate * g for sigmoidal networks after training both layers, similarly to the approach in Sec. 2. While this approach can be extended to any K and M , we focused on the case where K = ZM to obtain manageable expressions; see Sec. E of the SM for details on the derivation. While the final expression is again too long to be given here, we plot it with solid lines in Fig. 3c. The crosses in the same plot are the asymptotic generalisation error obtained by integration of the ODE (9) starting from random initial conditions, and show very good agreement.
While our result holds for any $M$, we note from Fig. 3c that the curves for different $M$ are qualitatively similar. We find a particularly simple result for $M = 1$ in the limit of small learning rates, where: * This result should be contrasted with the $\epsilon_g^* \sim K$ behaviour found for SCM.
Experimentally, we robustly observed that training both layers of the network yields better performance than training only the first layer with the second layer weights fixed to v * . However, convergence to the denoising solution can be difficult for large students which might get stuck on a long plateau where their nodes are not evenly distributed among the teacher nodes. While it is easy to check that such a network has a higher value of g than the denoising solution, the difference is small, and hence the driving force that pushes the student out of the corresponding plateaus is small, too. These observations demonstrate that in our setup, SGD does not always find the solution with the lowest generalisation error in finite time.
ReLU and linear networks. We found experimentally that $\epsilon_g^*$ remains constant with increasing $K$ in ReLU and in linear networks when training both layers. We plot a typical learning curve in green for linear networks in Fig. 4, but note that the figure shows qualitatively similar features for ReLU networks (Fig. S4). This behaviour was also observed in linear networks trained by batch gradient descent, starting from small initial weights 48 . While this scaling of $\epsilon_g^*$ with $K$ is an improvement over its increase with $K$ for the SCM (blue curve), this is not the $1/K$ decay that we observed for sigmoidal networks. A possible explanation is the lack of specialisation in linear and ReLU networks (see Sec. 2), without which the denoising solution found in sigmoidal networks is not possible. We also considered normalised SCM, where we train only the first layer and fix the second-layer weights at $v^*_m = 1/M$ and $v_k = 1/K$. The asymptotic error of normalised SCM decreases with $K$ (orange curve in Fig. 4), because the second-layer weights $v_k = 1/K$ effectively reduce the learning rate, as can be easily seen from the SGD updates (2), and we know from our analysis of linear SCM in Sec. 2 that $\epsilon_g \sim \eta$. In SM Sec. F we show analytically how imbalance in the norms of the first and second layer weights can lead to a larger effective learning rate. Normalised SCM also beat the performance of students where we trained both layers, starting from small initial weights in both cases. This is surprising because we checked experimentally that the weights of a normalised SCM after training are a fixed point of the SGD dynamics when training both layers. However, we confirmed experimentally that SGD does not find this fixed point when starting with random initial weights.

Discussion.
The qualitative difference between training both or only the first layer of neural networks is particularly striking for linear networks, where fixing one layer does not change the class of functions the model can implement, but makes a dramatic difference for their asymptotic performance. This observation highlights two important points: first, the performance of a network is not just determined by the number of additional parameters, but also by how the additional parameters are arranged in the model. Second, the non-linear dynamics of SGD means that changing which weights are trainable can alter the training dynamics in unexpected ways. We saw this for two-layer linear networks, where SGD did not find the optimal fixed point, and in the non-linear sigmoidal networks, where training the second layer allowed the student to decrease its final error with every additional hidden unit instead of increasing it like in the SCM.
We will prove Theorem 1.1 in two steps. First, we will show that the mean values of the order parameters $R_{in}$, $Q_{ik}$ and $v_k$ are given by the expressions used in the equations of motion (Lemma A.1) and that they concentrate, i.e. that their variance is bounded by a term of order $N^{-2}$. This ensures that the leading order of the average increment is captured by the ODE of Theorem 1.1, and that the stochastic part of the increment of the order parameters can be ignored in the thermodynamic limit $N \to \infty$. In other words, the two bounds ensure that the stochastic Markov process converges to a deterministic process. To complete the proof, we use a form of the coupling trick as described by Wang et al. 45 .

A.2 First moments of the increment m µ
Throughout this paper, we use the convention that E indicates an average over all the random variables that follow, while E µ denotes the conditional expectation of all the random variables that follow conditioned on the state of the Markov chain at step µ, m µ .
Lemma A.1. Under the same setting as Theorem 1.1, for all µ < N T , we have Proof. We first recall that m µ contains all time-dependent order parameters R µ , Q µ , and v µ , so we will prove the Lemma in turn for each of them. In fact, in each case we can prove a slightly stronger result which encompasses the required bound.
For the teacher-student overlaps $R_{in}^\mu$, we multiply the update (2) with $w_n^*/N$ on both sides and find that

$$R_{in}^{\mu+1} = R_{in}^\mu - \frac{\eta_w}{N} \, v_i^\mu \, g'(\lambda_i^\mu) \, \rho_n^\mu \, \Delta^\mu.$$

The local field of the teacher, $\rho_n^\mu \equiv w_n^* x^\mu / \sqrt{N}$, is a Gaussian random variable with mean zero and variance $T_{nn}$. Taking the conditional expectation, we find as required.
For the student-student overlaps $Q_{ik}^\mu$, we multiply the update (2) by $w_k^\mu/N$ and find that Using assumption (A1), we see that the term $(x^\mu)^2/N$ concentrates to yield 1 by the law of large numbers. Thus we find after taking the conditional expectation of both sides and using $E_\mu \zeta^\mu = 0$ that Finally, it is easy to convince oneself that taking the conditional expectation of the update for the second-layer weights (3) yields which completes the proof of Lemma A.1.
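The concentration of $(x^\mu)^2/N$ around 1 used in the proof is easy to observe numerically. The following sketch draws standard normal inputs of increasing dimension; the sample count and the dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
dims = (10, 100, 1000)                         # input dimensions N (illustrative)
means, stds = [], []
for n in dims:
    x = rng.standard_normal((1000, n))         # 1000 inputs with i.i.d. N(0, 1) components
    ratio = np.sum(x ** 2, axis=1) / n         # squared norm divided by N
    means.append(ratio.mean())
    stds.append(ratio.std())
    print(n, means[-1], stds[-1])
```

The sample mean stays close to 1 while the spread shrinks roughly like $\sqrt{2/N}$, the standard deviation of a chi-squared variable with $N$ degrees of freedom divided by $N$.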

A.3 Second moments of the increment m µ
We now proceed to bound the second-order moments of the increments of the time-dependent order parameters. We collect these bounds in the following lemma: Lemma A.2. Under the assumptions of Theorem 1.1, for all µ < N T , we have that Before proceeding with the proof, we state a simple technical lemma that will be helpful in the following; we relegate its proof to Sec. A.5. Lemma A.3. Under the same assumptions as Theorem 1.1, we have for all 0 ≤ µ ≤ N T that In the following, we will use q to denote any order-parameter that is varying in time, such as the teacher-student overlaps R µ in , while we keep m µ as the collection of all order parameters, including those that are static, such as the teacher-teacher overlaps T nm .
Proof of Lemma A.2. We first note all order parameters q ∈ {R in , Q ik , v k } obey update equations of the form where we have emphasised that the update function f q (·) may depend on all order parameters at time µ and the µth sample shown to the student x µ . For the variance σ 2 q = E (q − E q) 2 of the order parameter q, a little algebra yields the recursion relation (S10) We will now use complete induction to show that for any q, the update of the variance at every step is bounded by C(T )N −2 as required. In particular, this means showing that the term proportional to N −1 actually scales as N −2 .
For the induction start, we note that by Assumption A3, we have σ 0 q = 0. Hence the variance of any order parameter after a single step of SGD reads In going from the first to the second line, we have used assumption (A3) by which the initial macroscopic state is deterministic and therefore the average E is just an average over the first sample shown during training, which leads to the simplification of Eq. S12.
For the induction step, we assume that the variance after $\mu < NT$ steps is $(\sigma_q^\mu)^2 \leq C(T) \mu N^{-2} \leq C(T) \alpha N^{-1}$. By using the existence and boundedness of the derivatives of the activation function, we can write $m^\mu = E\, m^\mu + (m^\mu - E\, m^\mu)$ and expand the terms proportional to $N^{-1}$ using a multivariate Taylor expansion in $(m^\mu - E\, m^\mu)$. We find that (S13) We are justified in truncating the expansion since we assumed that $\sigma_q^2 \leq C(T) N^{-1}$. If the functions $f_q(m, x)$ are bounded by a constant, this completes the induction and shows that the variance of the increment of the order parameters is bounded by $C(T) N^{-2}$, as required.
It is easy to check that all three functions f v , f R and f Q fulfil this condition because of the boundedness of g(x) and its derivatives (A2) and of Lemma A.3, which completes the proof of Lemma A.2.

A.4 Putting it all together
Having proved both Lemmas A.1 and A.2, we can proceed to prove Theorem 1.1 by using the coupling trick in the form given by Wang et al. 45 for another online learning problem, namely the training of generative adversarial networks. We paraphrase the coupling trick as given by Wang et al. in the following to make the proof self-contained and refer to the supplemental material of their paper for additional details.
Proof of Theorem 1.1. We first define a stochastic process b µ that is coupled with the Markov process This process lives in the same space as m µ . Wang et al. 45 showed that for such a process, when Lemma A.1 holds, we have that for all µ ≤ N T . We then define a deterministic process which is a standard first-order finite difference approximation of the equations of motion (9), and also lives in the same space as m µ . Invoking a standard Euler argument for first-order finite differences gives Wang et al. 45 further showed that for such a process, using Lemma A.2, we have Finally, combining Eqs. (S15), (S18) and (S17), we have which completes the proof.

A.5 Additional proof details
To bound the value of v µ k after µ steps, we consider the three terms in the sum v µ k = µ s=1 v µ k each in turn. We first note that the sum of the output noise variables ζ µ is a simple sum over uncorrelated, (sub-) Gaussian random variables rescaled by 1/N and thus by Hoeffding's inequality almost surely smaller than a constant 49 .
For the first two terms, we can use an argument similar to the one used to prove the bound on the variance of the increment of the order parameters. We first note that g(·) is a bounded function by Assumption (A2) and that the initial conditions of the second-layer weights are bounded by a constant by Assumption (A3). Hence, after a first step, the weight has increased by a term bounded by C(T )N −1 . Actually, at every step where the weight is bounded by a constant, its increase will be bounded by C(T )N −1 . Hence the magnitude of v µ k ≤ C(T ) for 0 ≤ µ ≤ N T , as required.

B Derivation of the ODE description of the generalisation dynamics of online learning
Here we demonstrate how to evaluate the averages found in the equations of motion for the order parameters (9), following the classic work by Biehl and Schwarze 40 and Saad and Solla 41,42 . We repeat the two main technical assumptions of our work, namely having a large network (N → ∞) and a data set that is large enough that every sample is visited only once before training converges. Both will play a key role in the following computations.

B.1 Expressing the generalisation error in terms of order parameters
We first demonstrate how the assumptions stated above allow us to rewrite the generalisation error in terms of a number of order parameters. We have where we have used the local fields λ_k and ρ_m. Here and throughout this paper, we will use the indices i, j, k, . . . to refer to hidden units of the student, and indices n, m, . . . to denote hidden units of the teacher. Since the input x^µ enters the generalisation error only via products with the weights of the teacher and the student, we can replace the high-dimensional average ⟨·⟩ over the input distribution p(x) by an average over the K + M local fields λ^µ_k and ρ^µ_m. The assumption that the training set is large enough that we visit every sample only once guarantees that the inputs and the weights of the networks are uncorrelated. Taking the limit N → ∞ ensures that the local fields are jointly normally distributed with mean zero (⟨x_n⟩ = 0). Their covariance is also easily found: writing w_ka for the ath component of the kth weight vector, and using that the input components have unit variance and are uncorrelated, we have ⟨λ_i λ_k⟩ = Σ_a w_ia w_ka / N ≡ Q_ik, ⟨λ_i ρ_n⟩ = Σ_a w_ia w*_na / N ≡ R_in, and ⟨ρ_n ρ_m⟩ = Σ_a w*_na w*_ma / N ≡ T_nm. The variables R_in, Q_ik, and T_nm are called order parameters in statistical physics and measure the overlap between student and teacher weight vectors w_i and w*_n and their self-overlaps, respectively. Crucially, from Eq. (S22) we see that they are sufficient to determine the generalisation error ε_g.
We can thus write the generalisation error as where we have defined (S26) Assuming sigmoidal activation functions g(x) = erf(x/√2) allows us to evaluate the average I_2(i, k) analytically: The average in Eq. (S26) is taken over a normal distribution for the local fields λ_i and λ_k with mean (0, 0) and covariance matrix Since we are using the indices i, j, . . . for student units and n, m, . . . for teacher hidden units, we have where the covariance matrix of the joint distribution of λ_i and ρ_m is given by and likewise for I_2(n, m). We will use this convention to denote integrals throughout this section.
For the generalisation error, this means that it can be expressed in terms of the order parameters alone as
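As an illustration of how the generalisation error follows from the order parameters alone, here is a minimal Python sketch. It assumes a soft committee machine with unit second-layer weights, the convention ε_g = ½⟨(student − teacher)²⟩, and the standard Gaussian identity ⟨erf(a/√2) erf(b/√2)⟩ = (2/π) arcsin(C_ab / √((1 + C_aa)(1 + C_bb))). The function name `generalisation_error` is ours and is not part of the authors' library; prefactors may be grouped differently in the paper's own definitions.

```python
import numpy as np

def generalisation_error(Q, R, T):
    """eps_g of an erf soft committee machine from its order parameters.

    Assumptions (not taken from the paper's code): unit second-layer
    weights and eps_g = 0.5 * <(student output - teacher output)^2>.
    Q: (K, K) student-student, R: (K, M) student-teacher,
    T: (M, M) teacher-teacher overlaps.
    """
    Q, R, T = (np.atleast_2d(np.asarray(A, dtype=float)) for A in (Q, R, T))
    q = np.sqrt(1.0 + np.diag(Q))  # student normalisation, shape (K,)
    t = np.sqrt(1.0 + np.diag(T))  # teacher normalisation, shape (M,)
    student = np.arcsin(Q / np.outer(q, q)).sum()
    teacher = np.arcsin(T / np.outer(t, t)).sum()
    cross = np.arcsin(R / np.outer(q, t)).sum()
    return (student + teacher - 2.0 * cross) / np.pi

# A student that copies the teacher exactly has zero generalisation error.
print(generalisation_error(np.eye(2), np.eye(2), np.eye(2)))
```

Only the K² + KM + M² overlaps enter this function, never the N-dimensional weights themselves, which is what makes the reduced description possible.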

B.2 ODEs for the evolution of the order parameters
Expressing the generalisation error in terms of the order parameters as we have in Eq. (S31) is of course only useful if we can track the evolution of the order parameters over time. We can derive ODEs that allow us to do precisely that for the order parameters Q by squaring the weight update of w (2) and for R taking the inner product of (2) with w * n , respectively, which yields the equations of motion (9).
To make progress, however, i.e. to obtain a closed set of differential equations for Q and R, we need to evaluate the averages ⟨·⟩ over the local fields. In particular, we have to compute three types of averages: where a is one of the local fields of the student, while b and c can be local fields of either the student or the teacher; where a and b are local fields of the student, while c and d can be local fields of both; and finally where a and b are local fields of the teacher. In each of these integrals, the average is taken with respect to a multivariate normal distribution for the local fields with zero mean and a covariance matrix whose entries are chosen in the same way as discussed for I 2 .
We can re-write Eqns. (9) with these definitions in a more explicit form as 41–43 The explicit form of the integrals I 2 (·), I 3 (·), I 4 (·) and J 2 (·) is given in Sec. H for the case g(x) = erf(x/√2). Solving these equations numerically for Q and R and substituting their values into the expression for the generalisation error (S25) gives the full generalisation dynamics of the student. We show the resulting learning curves together with the result of a single simulation in Fig. 2 of the main text. We have bundled our simulation software and our ODE integrator as a user-friendly library with example programs at https://github.com/sgoldt/nn2pp. In Sec. C, we discuss how to extract information from them in an analytical way.
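To make the comparison between simulation and order parameters concrete, the following is a minimal sketch of a single online-learning run for an erf soft committee machine, written independently of the library linked above (it does not use or reproduce its API). It assumes i.i.d. standard normal inputs, local fields λ_k = w_k·x/√N, unit second-layer weights, and no weight decay; the function name `simulate` and all default values are illustrative choices of ours.

```python
import math
import numpy as np

erf = np.vectorize(math.erf)

def g(x):
    return erf(x / math.sqrt(2.0))

def g_prime(x):
    return math.sqrt(2.0 / math.pi) * np.exp(-0.5 * x**2)

def simulate(N=100, K=1, M=1, eta=1.0, sigma=0.0, alpha_max=100, seed=0):
    """Online SGD for an erf soft committee machine; returns (Q, R, T)."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal((M, N))    # teacher weights
    w = 1e-3 * rng.standard_normal((K, N))  # student starts near zero
    for _ in range(alpha_max * N):          # every sample is used only once
        x = rng.standard_normal(N)
        lam = w @ x / math.sqrt(N)          # student local fields
        rho = w_star @ x / math.sqrt(N)     # teacher local fields
        delta = g(rho).sum() + sigma * rng.standard_normal() - g(lam).sum()
        w += eta / math.sqrt(N) * delta * np.outer(g_prime(lam), x)
    Q = w @ w.T / N
    R = w @ w_star.T / N
    T = w_star @ w_star.T / N
    return Q, R, T

Q, R, T = simulate()
print(f"Q={Q[0, 0]:.3f}  R={R[0, 0]:.3f}  T={T[0, 0]:.3f}")
```

With the noiseless defaults (K = M = 1, σ = 0), the run should approach the fixed point Q = R = T; recording Q and R during the loop rather than only at the end would give learning curves of the kind that can be compared with the ODE solution.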

C Calculation of ε_g in the limit of small noise for Soft Committee Machines
Our aim is to understand the asymptotic value of the generalisation error ε*_g ≡ lim_{α→∞} ε_g(α). (S38) We focus on students that have more hidden units than the teacher, K ≥ M. These students are thus over-parameterised with respect to the generative model of the data and we define L ≡ K − M as the number of additional hidden units in the student network. In this section, we focus on the sigmoidal activation function unless stated otherwise.
Eqns. (S35ff) are a useful tool to analyse the generalisation dynamics and they allowed Saad and Solla to gain plenty of analytical insight into the special case K = M 41,42 . However, they are also a bit unwieldy. In particular, the number of ODEs that we need to solve grows with K and M as K² + KM. To gain some analytical insight, we make use of the symmetries in the problem, e.g. the permutation symmetry of the hidden units of the student, and re-parametrise the matrices Q ik and R in in terms of eight order parameters that obey a set of self-consistent ODEs for any K > M . We choose the following parameterisation with eight order parameters: which in matrix form for the case M = 3 and K = 5 read: We choose this number of order parameters and this particular setup for the overlap matrices Q and R for two reasons: it is the smallest number of variables for which we were able to self-consistently close the equations of motion (S35), and it agrees with numerical evidence obtained from integrating the full equations of motion (S35).
By substituting this ansatz into the equations of motion (S35), we find a set of eight ODEs for the order parameters. These equations are rather unwieldy and some of them do not even fit on one page, which is why we do not print them here in full; instead, we provide a Mathematica notebook where they can be found and interacted with together with the source at http://www.github.com/sgoldt/nn2pp. These equations allow for a detailed analysis of the effect of over-parameterisation on the asymptotic performance of the student, as we will discuss now.

C.1 Heavily over-parameterised students can learn perfectly from a noiseless teacher using online learning
For a teacher with T nm = δ nm and in the absence of noise in the teacher's outputs (σ = 0), there exists a fixed point of the ODEs with R = Q = 1, C = D = E = F = 0, and perfect generalisation ε_g = 0. Online learning will find this fixed point 41,42 . More precisely, for sigmoidal networks, after a plateau whose length depends on the size of the network, the generalisation error eventually begins an exponential decay to the optimal solution with zero generalisation error. The learning rates are chosen such that learning converges, but are not optimised otherwise.

C.2 Perturbative solution of the ODEs
We have calculated the asymptotic value of the generalisation error ε*_g for a teacher with T nm = δ nm to first order in the variance of the noise σ². To do so, we performed a perturbative expansion around the fixed point with the ansatz for all the order parameters. Writing the ODEs to first order in σ² and solving for their steady state where X′(α) = 0 yielded a fixed point with the asymptotic generalisation error of Eq. (S47), in which f(M, L, η) is an unwieldy rational function of its variables. Due to its length, we do not print it here in full; instead, we give the full function in a Mathematica notebook together with our source code at https://github.com/anon/.... Here, we plot the results in various forms in Fig. S1. We note in particular the following points:
Figure S1: The final generalisation error of over-parameterised sigmoidal networks scales linearly with the learning rate, the variance of the teacher's output noise, and L. We plot ε*_g/σ² in the limit of small noise, Eq. (S47), for M = 2 (red) and M = 16 (blue). It is clear that the generalisation error increases with the number of superfluous units L at fixed learning rate (left) and with the learning rate η (middle). Right: For K = M, the learning rate η_div at which our perturbative result diverges is precisely the maximum learning rate η_max at which the exponential convergence to the optimal solution is guaranteed for σ = 0, Eq. (S48).
ε*_g increases with L, η. The two plots on the left show that the generalisation error increases monotonically with both L and η while keeping the other fixed, respectively, for teachers with M = 2 (red) and M = 16 (blue).
The role of the learning rate η. Mitigating this effect by decreasing the learning rate η for larger students raises two problems: a lower learning rate yields longer training times, and longer training times imply that more data is required until convergence. This is in agreement with statistical learning theory, where models with more parameters generalise just as well as smaller ones given enough data, despite having a higher complexity class as measured by VC dimension or Rademacher complexity 4 , for example. Moreover, we show in Fig. S2 that the asymptotic generalisation error (S47) of a student trained using SGD with learning rate η = 1/K still increases with L before plateauing at a constant value that is independent of M.
Divergence at large η. Our perturbative result diverges for large L, or equivalently, for a large learning rate that depends on the number of hidden units L ∼ K. For the special case K = M, the learning rate η_div at which our perturbative result diverges is precisely the maximum learning rate η_max for which the exponential convergence to the optimal solution is still guaranteed for σ = 0 42 , as we show in the right-most plot of Fig. S1.
Figure S2: Asymptotic generalisation error for sigmoidal soft committee machines with learning rate η = 1/K. We plot the asymptotic generalisation error ε*_g (S47) over σ² of a student with a varying number of hidden units trained on data generated by teachers with M = 2, 4, 16 using SGD with learning rate 1/K. The generalisation error still increases with K, before plateauing at a constant value that is independent of M. Weight decay parameter κ = 0.
Expansion for small η. In the limit of small learning rates, which is the most relevant in practice and which, judging from the plots in Fig. S1, dominates the behaviour of ε*_g outside of the divergence, the generalisation error is linear in the learning rate. Expanding ε*_g to first order in the learning rate yields a particularly revealing form, ε*_g = with second-order corrections that are quadratic in L. This is actually the sum of the asymptotic generalisation errors of M continuous perceptrons that are learning from a teacher with T = 1 and L continuous perceptrons with T = 0, as we calculate in Sec. D. This neat result is a consequence of the specialisation that is typical of SCMs with sigmoidal activation functions, as we discussed in the main text.

D Asymptotic generalisation error of a noisy continuous perceptron
What is the asymptotic generalisation error for a continuous perceptron, i.e. a network with K = 1, in a teacher-student scenario when the teacher has some additive Gaussian output noise? In this section, we repeat a calculation by Biehl and Schwarze 40 where the teacher's outputs are given by where ζ is again a Gaussian r.v. with mean 0 and variance σ². We keep denoting the weights of the student by w and the weights of the teacher by w*. To analyse the generalisation dynamics, we introduce the order parameters and we explicitly do not fix T for the moment. For g(x) = erf(x/√2), they obey the following equations of motion: The equations of motion have a fixed point at Q = R = T which has perfect generalisation for σ = 0. We hence make a perturbative ansatz in σ² and find for the asymptotic generalisation error ε*_g = To first order in the learning rate, this reads which should be compared to the corresponding result for the full SCMs, Eq. (S49).
Figure S3: The final generalisation error of over-parametrised ReLU networks scales as ε*_g ∼ ησ²L. Simulations confirm that the asymptotic generalisation error ε*_g of a ReLU student learning from a ReLU teacher scales with the learning rate η, the variance of the teacher's output noise σ² and the number of additional hidden units as ε*_g ∼ ησ²L, which is the same scaling as the one found analytically for sigmoidal networks in Eq.

E Calculation of the asymptotic generalisation error in two-layer sigmoidal networks
In this section, we describe the ansatz we chose for the ODEs to compute the asymptotic generalisation error when training both layers with sigmoidal activation function. As we describe in the main text, the ansatz used for the Soft Committee Machine is not appropriate, since (i) all the hidden units of the student are used, and (ii) several nodes overlap with the same teacher node. Inspection of the overlaps after integration of the ODEs thus suggested the following ansatz when the number of nodes in the student is a multiple of the number of teacher nodes, K = ZM : which in matrix form for the case M = 2 and K = 4 read: Once this ansatz is found, the rest of the calculation follows along the same lines as that of Sec. C: we derive a reduced set of coupled ODEs for Q, C, R and S, expand around the noiseless fixed point where R = 1, S = 0, Q = 1, C = 0 and substitute the resulting fixed point into the expression for the generalisation error, yielding the formula plotted in Fig. 3c.
In Fig. S4 we show the asymptotic performance of linear and ReLU two-layer networks that we discuss at the end of Sec. 3 of the main text.

F Unbalanced weights rescale effective learning rate in two layer linear networks
Consider a linear, two-layer neural network of the form: where v ∈ R^{1×M}, w ∈ R^{M×N} and x ∈ R^{N×1}. The online SGD updates to the first and second layer weights will have the form: If we define the product of student weights as a vector u: it follows that Substituting the form for the update in first and second layer weights into the expression above we find: This suggests that the level of imbalance between the norm of weights at different layers may impact the noisy fluctuations in updates even at late training times. If we compare the update step of the network with another network which produces the same output but has a different scaling of the weights, we can see that the effective learning rate will be different. For instance, ṽ = av and w̃ = (1/a) w leads to an equivalent network, but updates which scale as: We can think of this scaling of the weights as impacting the effective learning rate, and we have provided an expression for how the learning rate impacts generalisation error in this paper. Our finding thus suggests that weights with more balanced norms across layers will tend to lead to lower generalisation error during online learning.
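The effect of rescaling can be checked numerically. The sketch below is our own illustration, not the authors' code: it assumes the plain squared loss ½(y − v w x)² without any 1/√N normalisation of the input, and the helper name `product_update` is hypothetical. It compares the change of the product u = v w after one SGD step for a network and for its rescaled twin ṽ = av, w̃ = w/a.

```python
import numpy as np

def product_update(v, w, x, y, eta):
    """Change of u = v @ w after one SGD step on 0.5 * (y - v @ w @ x)**2."""
    delta = y - v @ w @ x              # prediction error (scalar)
    dv = eta * delta * (w @ x)         # gradient step on the second layer
    dw = eta * delta * np.outer(v, x)  # gradient step on the first layer
    return (v + dv) @ (w + dw) - v @ w

rng = np.random.default_rng(0)
M, N, eta, a = 3, 20, 1e-3, 4.0
v = rng.standard_normal(M)
w = rng.standard_normal((M, N))
x = rng.standard_normal(N)
y = 1.0

du_original = product_update(v, w, x, y, eta)
du_rescaled = product_update(a * v, w / a, x, y, eta)

# Both networks compute the same function, yet u moves by different amounts.
print(np.linalg.norm(du_original), np.linalg.norm(du_rescaled))
```

To leading order in η, the change of u is ηΔ(wᵀw x + |v|² x)ᵀ with Δ the prediction error, so rescaling by a multiplies the first contribution by 1/a² and the second by a², which is the sense in which imbalance acts like a change of learning rate.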

G.1 Regularisation by weight decay does not help
A natural strategy to avoid the pitfalls of overfitting is to regularise the weights, for example by using explicit weight decay, i.e. by choosing κ > 0. We have not found a setup where adding weight decay improved the asymptotic generalisation error of a student compared to a student trained without weight decay. Consequently, in our experiments weight decay fails to mitigate the increase of ε*_g with L. We show the results of an illustrative experiment in Fig. S5.

G.2 SGD with mini-batches
One key characteristic of online learning is that we evaluate the gradient of the loss function using a single sample from the training set per step. In practice, it is more common to actually use a number of samples b > 1 to estimate the gradient at every step. To be more precise, the weight update equation for SGD with mini-batches would read: where x^{µ,ℓ} is the ℓth input from the mini-batch used in the µth step of SGD, λ^{µ,ℓ}_k is the local field of the kth student unit for the ℓth sample in the mini-batch, etc. Note that when we use every sample only once during training, using mini-batches of size b increases the amount of data required by a factor b when keeping the number of steps constant.
Figure S6: SGD with mini-batches shows the same qualitative behaviour as online learning. We show the asymptotic generalisation error ε*_g for students with sigmoidal (left) and ReLU activation function (right) for various K learning from a teacher with M = 4. Between the curves, we change the size of the mini-batch used at each step of SGD from 1 (online learning) to 20 000. Parameters: N = 500, η = 0.2, σ = 0.1, κ = 0.
Figure S7: Higher-order correlations in the input data do not play a role for the asymptotic generalisation. We plot the final generalisation error ε*_g after online learning of a student of various sizes from a teacher with M = 4 using Gaussian inputs (blue) and MNIST images (red) for training and testing. N = 784, η = 0.1, σ = 0.1, κ = 0.
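A mini-batch step of this kind can be sketched as follows. This is a hedged illustration rather than the update used for the figures: we assume an erf soft committee machine with unit second-layer weights, no weight decay (κ = 0), and that the per-sample online updates are averaged over the b samples (a 1/b prefactor); the function name `minibatch_step` is ours.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def minibatch_step(w, X, y, eta):
    """One SGD step on a mini-batch for an erf soft committee machine.

    w: (K, N) student weights, X: (b, N) inputs, y: (b,) teacher outputs.
    Assumption: the b per-sample updates are averaged, not summed.
    """
    b, N = X.shape
    lam = X @ w.T / math.sqrt(N)                         # local fields, (b, K)
    delta = y - _erf(lam / math.sqrt(2.0)).sum(axis=1)   # errors, (b,)
    g_prime = math.sqrt(2.0 / math.pi) * np.exp(-0.5 * lam**2)
    grad = (delta[:, None] * g_prime).T @ X / b          # averaged update, (K, N)
    return w + eta / math.sqrt(N) * grad

rng = np.random.default_rng(1)
w0 = rng.standard_normal((3, 10))
X = rng.standard_normal((5, 10))
y = rng.standard_normal(5)
w1 = minibatch_step(w0, X, y, eta=0.2)
```

For b = 1 this reduces to the online update, and for b > 1 the increment is the mean of the b single-sample increments evaluated at the same weights, so each step consumes b fresh samples.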
We show the asymptotic generalisation error of student networks of varying size trained using SGD with mini-batches and a teacher with M = 4 in Fig. S6. Two trends are visible: first, increasing the size of the mini-batches decreases the asymptotic generalisation error ε*_g up to a certain mini-batch size, after which the gains in generalisation error become minimal; and second, the shape of the curve of ε*_g against L is the same for all mini-batch sizes, with the minimal generalisation error attained by a network with K = M.
Figure S8: The scaling of ε*_g with L shows a similar dependence on the size of the training set for early-stopping (top) and final (bottom) generalisation error. We plot the asymptotic and the early-stopping generalisation error after SGD with a finite training set containing P N samples (linear, sigmoidal and ReLU networks from left to right). The results for online learning for linear and sigmoidal networks, Eqns. (10) and (12) of the main text, are plotted in violet. Error bars indicate one standard deviation over 10 simulations, each with a different training set; many of them are too small to be clearly visible. Parameters: N = 784, M = 4, η = 0.1, σ = 0.01.

G.3 Using MNIST images for training and testing
In the derivation of the ODE description of online learning for the main text, we noted that only the first two moments of the input distribution matter for the learning dynamics and for the final generalisation error. The reason for this is that the inputs only appear in the equations of motion for the order parameters as a product with the weights of either the teacher or the student. Since the inputs are, by assumption, uncorrelated with those weights, this product is the sum of a large number of random variables and hence normally distributed by the central limit theorem.
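The central-limit argument can be illustrated with a small numerical experiment. The sketch below is ours and uses synthetic binary (±1) inputs rather than MNIST: the inputs are clearly non-Gaussian but have zero mean and unit variance, and we check that the local field λ = w·x/√N of a fixed weight vector is nevertheless close to Gaussian with variance given by the self-overlap Q = w·w/N. The sizes N and n_samples are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_samples = 200, 10000

w = rng.standard_normal(N)                        # fixed weight vector
x = rng.choice([-1.0, 1.0], size=(n_samples, N))  # non-Gaussian inputs
lam = x @ w / np.sqrt(N)                          # one local field per input

Q = w @ w / N                                     # predicted variance of lam
mean, var = lam.mean(), lam.var()
excess_kurtosis = np.mean((lam - mean) ** 4) / var**2 - 3.0
print(f"mean={mean:.3f}  var/Q={var / Q:.3f}  excess kurtosis={excess_kurtosis:.3f}")
```

The excess kurtosis of a Gaussian is zero, so a value close to zero here indicates that the higher-order structure of the individual input components has been washed out in the local field.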
We have checked how our results change when this assumption breaks down in one example where we train a network on a finite data set with non-trivial higher order moments, namely the images of the MNIST data set. We studied the very same setup that we discuss throughout this work, namely the supervised learning of a regression task in the teacher-student scenario. We only replace the inputs, which would have been i.i.d. draws from the standard normal distribution, with the images of the MNIST data set. In particular, this means that we do not care about the labels of the images. Figure S7 shows a plot of the resulting final generalisation error against L for both the MNIST data set and a data set of the same size comprised of i.i.d. draws from the standard normal distribution; the two are in good agreement.

G.4 The scaling of ε*_g with L for finite training sets
In practice, a single sample of the training data set will be visited several times during training. After a first pass through the training set, the online assumption that an incoming sample (x, y) is uncorrelated to the weights of the network thus breaks down. A complete analytical treatment in this setting remains an open problem, so to study this practically relevant setup, we turn to simulations. We keep the setup described in Sec. 1, but simply reduce the number of samples in the training data set P. Our focus is again on the final generalisation error after convergence ε*_g for linear, sigmoidal and ReLU networks, which we plot from left to right as a function of L in Fig. S8.
Linear networks show a similar behaviour to the setup with a very large training set discussed in Sec. 2: the bigger the network, the worse the performance for both P = 4 and P = 50. Again,