Dynamic Programming for Mean-Field Type Control

We investigate a model problem for optimal resource management. The problem is a stochastic control problem of mean-field type. We compare a Hamilton–Jacobi–Bellman fixed-point algorithm to a steepest descent method derived from calculus of variations. For mean-field type control problems, stochastic dynamic programming requires adaptation. The problem is reformulated as a distributed control problem by using the Fokker–Planck equation for the probability distribution of the stochastic process; an extended Bellman principle is then derived by an argument different from the one used by P. L. Lions. Both algorithms are compared numerically.


Introduction
Stochastic control has been studied extensively over the past five decades [1,2,3,4,5], and yet there is renewed interest in economics and finance due to mean-field games [6,7,8,9]. Mean-field games give rise to mean-field type stochastic control problems [10], which involve not only the Markov process of the state of the system but also some statistics of the process, such as means and variances, in the cost function or in the stochastic differential equation (SDE). For these problems, optimality conditions are derived either by stochastic calculus of variations [11] or by stochastic dynamic programming [12,13]; they are justified in the quadratic case by classical arguments [14,15], but by less classical ones in the general case, for the fundamental reason that Bellman's principle does not apply in its original form [12,16].
Several authors have generalized dynamic programming using the Wasserstein distance to define derivatives with respect to measures. Others have studied the existence of a conceptual HJB equation [16,17]. These results certainly overlap and precede our analysis, but our point of view is different: it is pragmatic, so we sacrifice mathematical rigor for explicit expressions, and numerical, as we wish to compare solutions obtained by HJB to standard optimal control by calculus of variations. We have not tried to specify the regularity of the data needed for existence and differentiability of solutions. The results are stated formally only, but with the intuitive feeling that they could be justified later under appropriate assumptions, as in e.g. [18,19], if the behavior at infinity of the solution of the HJB equation is known, which is a major riddle.
Before proceeding further, note that a direct simulation of the problem, with the stochastic differential equation approximated by Monte-Carlo, is too costly and not competitive with the methods that we pursue below. Indeed, the cost function of the optimization problem involves means of stochastic quantities, and Monte-Carlo methods would require large numbers of evaluations of the SDE, embedded in forward-backward time loops. Faced with the same problem, Garnier et al. [7] and Chan et al. [20] came to the same conclusion.
In this article, pursuing a preliminary study published in [21], we apply the dynamic programming argument to the value functional as in [22], but instead of using the probability measure of the stochastic process, we use its probability density function (PDF). Hence, though less general, the mathematical argument will be simpler. Of course this comes at the cost of several regularity assumptions, such as the existence of a regular PDF at all times. However, our analysis here is strongly motivated by the numerical solutions of these control problems, and hence assuming regularity is not a real practical limitation.
Once the problem is reformulated with the Fokker-Planck equation [1,23], it becomes a somewhat standard exercise to find the necessary optimality conditions by a calculus of variations. So this article begins likewise, in Section 3. Then, in Section 4, a similar result is obtained by using dynamic programming, and the connection with the previous approach and with stochastic dynamic programming is established. In Section 5, a problem introduced in [24] involving profit-optimizing oil producers is defined and studied for existence and optimality, and two algorithms are proposed, together with a semi-analytical method based on a Riccati solution. The paper ends with a numerical section which implements the three methods and compares them.

The Problem
Consider a stochastic differential system of d equations,

dX_t = u(X_t, t) dt + σ(X_t, t, u(X_t, t)) dW_t,  X_0 given,   (1)

where u takes values in R^d, σ in R^{d×k}, and W_t is a k-vector of independent Brownian motions. Assumptions under which a strong solution is known to exist, once the distribution of X_0 is known to be in L^2 ∩ L^∞, are given in [25] (see also [26], Proposition 4, which applies when σσ^T is uniformly positive definite). With σ(x,t) := σ(x,t,u(x,t)), the PDF ρ of X_t exists and is the unique solution of the Fokker-Planck equation

∂_t ρ + ∇·(uρ) − (1/2) Σ_{i,j} ∂²_{x_i x_j}((σσ^T)_{ij} ρ) = 0.   (4)

Conversely, with (2)-(3), there is a unique solution to (4), which is the PDF of a Markov process satisfying (1).
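For intuition, dynamics of the form (1) can be simulated by an Euler–Maruyama scheme. The sketch below is a minimal one-dimensional version; the drift and volatility functions and all numerical values are illustrative placeholders, not data from this paper.

```python
import numpy as np

def euler_maruyama(u, sigma, x0, T, n_steps, n_paths, seed=0):
    """Simulate dX_t = u(X_t, t) dt + sigma(X_t, t) dW_t by Euler-Maruyama."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)  # Brownian increments
        x = x + u(x, t) * dt + sigma(x, t) * dw
    return x

# Zero drift and unit volatility: X_T is exactly N(x0, T),
# so the sample mean stays near x0 and the sample variance near T.
xT = euler_maruyama(lambda x, t: 0.0 * x, lambda x, t: 1.0 + 0.0 * x,
                    x0=0.0, T=1.0, n_steps=200, n_paths=20000)
```

As noted in the introduction, this kind of direct Monte-Carlo simulation is too costly once mean-field terms are embedded in forward-backward loops, which is precisely why the PDF reformulation below is preferred.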

Remark 2.1
The assumption σσ^T > 0 can be replaced by assuming the conditions (2) for u − (1/2)∇·(σσ^T) rather than for u, but this implies some regularity on the second derivatives of u.

Consider the stochastic optimization problem

min_{u ∈ U_d} J(u) := E[ ∫_0^T H(X_t, E[h(X_t)], u(X_t,t)) dt + G(X_T, E[g(X_T)]) ],

subject to (1) with ρ_0 given, and where h, g, H, G are C^1 functions taking values in R^r, R^s, R and R, respectively. Assume also that H and G are bounded from below. As a first approach to solving such problems, Andersson et al. [11] proposed to use a stochastic calculus of variations; the necessary optimality conditions form a forward-backward stochastic differential system, which is numerically very hard to solve because the backward volatility for the adjoint equation is one of the unknowns [11,27].
A second approach is to use Stochastic Dynamic Programming (SDP), but an important adaptation needs to be made. Usually SDP uses the remaining cost function, and the Bellman equation is derived by saying that u_t, t > τ, is a solution only if it is optimal together with (1). However, this is not true unless h = 0 and g = 0; as in [22], one has to work with (1), with ρ_τ given, where V is a pointwise function of τ ∈ [0, T] and has a functional dependence on ρ_τ(·), i.e., it depends on {x → ρ_τ(x), ∀x ∈ R^d}.
A third approach is to work with the deterministic version of the problem. With sufficient regularity (see [28] for weaker assumptions), and with the Fokker-Planck partial differential equation for ρ(x,t) := ρ_t(x), the problem is equivalent to the deterministic distributed control problem (6)-(7).

Remark 2.2 Note that the problem is equivalent to the stochastic control problem only if ρ_0 is in P, the set of positive real-valued functions with measure 1. However, the deterministic control problem still makes sense even when this is not the case and ρ ∈ L^2([0, T], H^1(R^d)) only. We will use this in the proof of Proposition 4.3.
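To illustrate how the Fokker-Planck reformulation can be treated numerically in one space dimension, here is a minimal conservative finite-difference sketch: fluxes are computed at cell interfaces and set to zero at the boundary, so the total probability mass is preserved exactly. The drift, volatility, grid, and initial Gaussian PDF below are illustrative assumptions.

```python
import numpy as np

def fokker_planck_step(rho, u, sig2, dx, dt):
    """One explicit conservative step of
       d_t rho = -d_x(u rho) + 0.5 d_xx(sig2 rho),
    with zero-flux boundaries, so total mass is conserved by construction."""
    q = sig2 * rho
    # interface flux F = u*rho - 0.5 * d_x(sig2 * rho), by central averages
    u_face = 0.5 * (u[1:] + u[:-1])
    rho_face = 0.5 * (rho[1:] + rho[:-1])
    flux = u_face * rho_face - 0.5 * (q[1:] - q[:-1]) / dx
    flux = np.concatenate(([0.0], flux, [0.0]))  # zero flux at both ends
    return rho - dt * (flux[1:] - flux[:-1]) / dx

# Gaussian initial PDF, constant drift 0.5 and sigma^2 = 0.25 on a wide grid.
x = np.linspace(-10.0, 10.0, 401); dx = x[1] - x[0]
rho = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
rho /= rho.sum() * dx                      # normalize to unit mass
u = np.full_like(x, 0.5); sig2 = np.full_like(x, 0.25)
for _ in range(400):                       # integrate to t = 0.2
    rho = fokker_planck_step(rho, u, sig2, dx, 0.0005)
mass = rho.sum() * dx                      # stays 1 up to round-off
```

With constant drift 0.5 the mean of the PDF moves by about 0.5 × 0.2 = 0.1, which can be checked on the computed solution.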
Remark 2.3 Existence of solutions for (6) or (7) requires more assumptions on H, h, G and g, to make sure that J is lower semi-continuous with respect to u in a norm such as that of L^2([0, T], H^1(R^d)), which implies (2), and that U_d is closed. However, since there is no term containing a gradient of u in J, a Tikhonov regularization seems to be needed for these problems to be well posed.

Calculus of Variations
Deriving optimality conditions by calculus of variations is fairly standard, as we shall show in this section.
where H_u, H_ρ, H_χ, h_u, h_ρ and G_ξ are partial derivatives in the classical sense.

Remark 3.1
The "boundary conditions at infinity" on ρ* in (9) are problematic. The PDE is to be understood in weak form in the dual of P, i.e., for all ν ∈ P. Then, the decay of ν at infinity will balance the potential growth of ρ*, and the integrations by parts in the proof above will have no terms at infinity in x. However, the uniqueness of ρ* is not guaranteed. We could have avoided the problem by working in [−L, L]^d × [0, T] rather than R^d × [0, T] and imposing ρ* = 0 on the boundary of [−L, L]^d, but then the solution depends strongly on L; this problem will be discussed again in the numerical section (Section 5).

Dynamic Programming
For notational clarity, consider the more general case where H, G are functionals of ρ_t(·). Let ρ be the solution of (7) initialized at τ by ρ_τ(·), and let J and V be defined accordingly. By the Markovian property, ρ_t(x) := ρ(x,t), for t > τ, is also the PDF of X_t given by (1), knowing its PDF ρ_τ at τ.
Remark 4.1 In this section, H is a pointwise function of u(x,t) ∈ R^d, but the theory can be extended to more general cases. Assuming that J is bounded from below, V is finite, and we prove the following version of Bellman's principle of optimality.

Proposition 4.1 If the problem is regular, then for any τ ∈ [0, T] and any ρ_τ ∈ P, the dynamic programming relation holds, subject to ρ_t given by (7) on [τ, τ + δτ], initialized by ρ_τ at time τ.
Proof Denote the infimum of the right-hand side by Ṽ(τ; ρ_τ). For any ε > 0, there exists u ∈ U_d such that, if ρ_t is the solution of (7) with control u, the corresponding cost is within ε of V. Conversely, given u ∈ U_d and ε > 0, there exists a control ũ ∈ U_d, which coincides with u on R^d × [τ, τ + δτ], whose cost is within ε of the infimum, where ρ̃_t is the solution of (7) at t with control ũ, starting from ρ_τ at time τ. To conclude, let ε → 0 and take the infimum over u ∈ U_d.
From now on, we assume that H and V are Fréchet differentiable with respect to ρ.

Remark 4.2
The correct mathematical tool for this differentiability is the Wasserstein distance and differentiability with respect to the probability measure rather than with respect to its density (see e.g. [13,22]). Our approach is more pragmatic.
We denote the Fréchet derivatives by H_ρ(x, τ; ρ) and V_ρ(τ; ρ). Thus, and similarly for V, H_ρ(x, τ; ρ) denotes the corresponding linear map L^2 → R, where L^2 := L^2(R^d). Moreover, we denote with a prime the Riesz representative of the Fréchet derivative with respect to ρ, for instance V′.

Proposition 4.2 (HJB minimum principle) Assuming that V is smooth enough, equation (13) holds. Note: as usual, ∇ is taken with respect to x.
Proof A first-order approximation of the time derivative in the Fokker-Planck equation gives (15). As V is assumed to be smooth, using (15) and the mean value theorem for the time integral, Bellman's principle yields (16), up to o(δτ). The terms V(τ; ρ_τ) cancel; dividing by δτ, combining with (14), and letting δτ → 0, (16) gives (17). To finalize the proof we need the following proposition relating V′ to V_ρ.
Proposition 4.3 Given τ ∈ [0, T] and an initial ρ_τ ∈ P, let û ∈ U_d and ρ̂ denote an optimal control and the corresponding solution of (7); then (18) holds.

Proof Note that the Fokker-Planck equation implies the existence of a semi-group operator G such that, for all τ ≤ t, ρ_t = G(t − τ) * ρ_τ. Let (û_t)_{t∈[0,T]} be the optimal control and (ρ̂_t)_{t∈[0,T]} the corresponding solution of (7) and (12). By the optimality of û and ρ̂, the cost can be Fréchet-differentiated with respect to ρ by computing, for a given ν ∈ L^2, the limit as λ → 0 of the corresponding difference quotient. Taking ν = ρ̂_τ leads to (18). One may object that such a choice of ν is not admissible because, being a variation of ρ_τ, it should have zero measure; but we discussed this in Remark 2.2.
End of proof of Proposition 4.2 Differentiating (18) with respect to τ leads to an identity in which û_τ is the optimal control at time τ. Now, let us use (17), rewritten in divergence form; integrating by parts the last two terms leads to (13).
Proposition 4.4 (Hamilton-Jacobi-Bellman equation) Denote by û an optimal control. Then the following HJB system holds, where the second equation is in fact the first-order optimality condition in (13).
Remark 4.5 When the Hamiltonian depends on the distribution ρ_t only through the local value ρ_t(x) and the average of a fixed function, we can make explicit the link with the calculus of variations (see Section 3). More precisely, let us assume that H has this structure; in particular, for ν = ρ_τ an explicit expression is obtained. Then, for the optimal û and ρ̂, (21) yields (25). The link with Section 3 is established: (9) and (25) coincide, with V′ = ρ*.
Remark 4.6 Note that the adjoint equation, (25), is set in R^d × [0, T] with a right-hand side which is unbounded as x → ±∞. Existence of solutions can be established in the finite case Ω × [0, T], with Ω a bounded open set and V|_∂Ω = 0, but it is a riddle otherwise. It is also a source of numerical difficulties because, as R^d is approximated by ]−L, L[^d, numerical boundary conditions compatible with the (unknown) behavior at infinity of V need to be provided.

An Academic Example: Production of an Exhaustible Resource
Following [24], we consider a continuum of producers exploiting an oil field. Each producer's goal is to maximize his profit, knowing the price of oil; this price, however, is influenced by the quantity of oil available on the market, which is the sum of what all the producers have decided to extract at a given time. Hence, while no single producer affects the price of oil, because each producer solves the same optimization problem, in the end the global problem must take the market price into account as a function of oil availability. For a better understanding of the relation between the individual game and the global game, the reader is referred to [10].

Notations
Let X_0 be the initial oil reserve and X_t be the quantity of oil left in the field at time t, as seen by a producer. It is modeled by

dX_t = −a_t dt + σ X_t dW_t,  X_0 given by its PDF,   (26)

where a_t dt is the quantity extracted by the producer in the time interval [t, t + dt] (so a_t is the extraction rate), and W is a standard real-valued Brownian motion reflecting the producer's uncertainty about the remaining reserve; σ > 0 is a volatility parameter, assumed constant. We suppose that a_t := a(X_t, t) is a deterministic function of t and X_t, meaning that the producers apply a feedback law to control their production.
We denote by C the cost of oil extraction, which is a function of a and assumed to be C(a) := αa + βa², for some positive constants α and β. The price of oil is assumed to be p_t := κe^{−bt}(E[a_t])^{−c}, with positive κ, b and c. This encodes the macro-economic assumption that p_t is a decreasing function of the mean production, because scarcity of oil increases its price and conversely. It also says that in the future oil will be cheaper, because it will slowly be replaced by renewable energy.
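The price and cost models can be written down directly; in the sketch below the parameter values for κ, b, c, α, β are placeholders chosen for illustration only, not calibrated values from the paper.

```python
import math

def price(mean_a, t, kappa=1.0, b=0.1, c=0.5):
    """Oil price p_t = kappa * exp(-b*t) * (E[a_t])^(-c); illustrative parameters."""
    return kappa * math.exp(-b * t) * mean_a ** (-c)

def extraction_cost(a, alpha=0.1, beta=0.05):
    """Extraction cost C(a) = alpha*a + beta*a^2."""
    return alpha * a + beta * a * a

# The price decreases when mean production rises, and decays in time:
p1 = price(1.0, 0.0)   # baseline
p2 = price(2.0, 0.0)   # higher mean production -> lower price
p3 = price(1.0, 1.0)   # later time -> lower price
```

These two monotonicity properties are exactly the macro-economic assumptions stated above.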
Note that by construction X t takes only positive values and ought to be bounded by, say L, the maximum estimate of the reservoir content.However, nothing in the model enforces these constraints.

Model
Each producer optimizes his integrated profit up to time T, discounted at the interest rate r; however, he also wishes to drain the oil field, i.e., achieve X_T = 0. Thus his goal is to maximize over a(·,·) ≥ 0 the functional (27), in which γ and η are penalization parameters.
Replacing p and C by their expressions gives the explicit form of the criterion; J, being the mean of a function of E[a_t], is obviously a mean-field type stochastic control problem.

Remarks on the Existence of Solutions
Denoting ā_t := E[a_t], J can be rewritten in terms of ā. Assume that c < 1. Then the maximum of the right-hand side is attained when ā is such that

κ(1 − c)e^{−bt} ā_t^{−c} = α + 2β ā_t,  ∀t.

Hence the maximum value provides an upper bound for J, as long as ā_t is bounded above on [0, T]. Furthermore, when the problem is converted into a deterministic optimal control problem with the Fokker-Planck equation, it is seen that the cost function is upper semi-continuous; so a maximum exists.
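Since the left-hand side of this first-order condition is decreasing in ā_t (from +∞ to 0) while the right-hand side is increasing, there is exactly one positive root, which can be found by bisection. A sketch with illustrative parameter values:

```python
import math

def optimal_mean_rate(t, kappa=1.0, b=0.1, c=0.5, alpha=0.1, beta=0.05):
    """Solve kappa*(1-c)*exp(-b*t)*a^(-c) = alpha + 2*beta*a for a > 0 by bisection.
    The left side decreases from +inf to 0, the right side increases, so the
    positive root is unique. All parameter values here are illustrative."""
    f = lambda a: kappa * (1 - c) * math.exp(-b * t) * a ** (-c) - alpha - 2 * beta * a
    lo, hi = 1e-9, 1e6          # f(lo) > 0 and f(hi) < 0 bracket the root
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a0 = optimal_mean_rate(0.0)     # optimal mean extraction rate at t = 0
```

The resulting t → ā_t then gives the upper bound for J mentioned above.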
A counter-example: assume c > 1. For simplicity, suppose that ā_t = |τ − t| for some τ ∈ ]0, T[. Then ∫_0^T ā_t^{1−c} dt = +∞; hence the problem is not well posed, an obvious consequence of the fact that the model makes the price blow up too fast when nobody extracts oil.
Linear feedback: if we search for a in the class of linear feedbacks a(x,t) = w(t)x, where w is a deterministic function of time only, then (26) has an analytical solution, and the first and second moments of a_t can be written in terms of w̄_t := w(t) e^{−∫_0^t w(τ)dτ}.
Then, for η = 2, the problem reduces to maximizing over w̄_t ≥ 0.
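Under the linear feedback a(x,t) = w(t)x, the SDE (26) gives E[a_t] = w̄_t E[X_0] with w̄_t as above, which can be checked by a small Monte-Carlo experiment. The constant feedback coefficient, volatility, and initial mean below are illustrative values.

```python
import numpy as np

def mean_rate_linear_feedback(w, x0_mean, t_grid, sigma=0.2, n_paths=50000, seed=1):
    """Monte-Carlo estimate of E[a_t] = w(t) E[X_t] under the linear feedback
    a(x,t) = w(t) x, i.e. dX = -w(t) X dt + sigma X dW (multiplicative Euler)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, x0_mean, dtype=float)
    means = []
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)
        x = x * (1.0 - w(t_grid[k]) * dt + sigma * dw)
        means.append(w(t_grid[k + 1]) * x.mean())
    return np.array(means)

w = lambda t: 0.3                          # constant feedback coefficient
t = np.linspace(0.0, 1.0, 201)
mc = mean_rate_linear_feedback(w, 5.0, t)
exact = 0.3 * 5.0 * np.exp(-0.3 * t[1:])   # w_bar_t * E[X_0]
```

The Monte-Carlo curve should track the closed-form mean w̄_t E[X_0] up to discretization and sampling noise.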

Dynamic Programming in Absence of Constraint
To connect with Section 2, let us work with u = −a, the reserve depletion rate. For the time being, we shall ignore the constraints L ≥ X_t ≥ 0 and u ≤ 0, so U_d = R. Moreover, we shall work with η = 2 and comment on η > 2 at the end.
Recall that ρ(·,t), the PDF of X_t, is given by the Fokker-Planck equation (30), with a given initial condition. The cost (31) includes the terminal term ∫_R γ|x|^η ρ(x,T) dx; the goal is now to minimize J with respect to u, subject to (30). Define also ū_t := ∫_R uρ dx.

Application of the results of Section 2. In this example, by Proposition 4.2 and Remark 4.5, we have V|_T = γ|x|^η, and V satisfies (34), because ∂_ρH = ∂_ρh = 0 and ∂_χH = cκe^{−bt}χ^{−c−1}u. Moreover, by (22), one obtains (35)-(36). Now, using (35) to eliminate ∂_xV in (34) leads to (37). Finally, using (36) in (37) and the definition of (−ū)_t yields (38). Note that this equation for V depends only on ū and not on u. Nevertheless, (36)-(38) is a rather complex integro-partial-differential system.
A fixed-point algorithm. We can now sketch a numerical method to solve the problem (Algorithm 1): guess ū, solve (38) for V, recover u from (36), update ū, and iterate.
Although it seems to work numerically in many situations, as we shall see below, nothing is known about the convergence of this fixed-point type algorithm; three points need to be clarified:
1. Equation (30) is nonlinear, and existence of a solution is unclear.
2. A relevant stopping criterion for Algorithm 1 is yet to be found.
3. Even if the Fokker-Planck equation (30) is set on R^+ × ]0, T[ instead of Q, as discussed below in the numerical section (Section 5), there are difficulties. Because the second-order term vanishes at x = 0, a weak formulation ought to be posed in a weighted Sobolev space. Theorem 2.2 in [29] asserts existence and uniqueness of ρ, provided there exists u_M such that u(x,t) < x u_M, ∀t. However, this oil-resource model does not impose u(0,t) = 0, and consequently there is a singularity at that point.
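Regarding the second point, a pragmatic stopping rule is to iterate until successive iterates stagnate, possibly with damping to help convergence. The skeleton below illustrates this on a toy scalar contraction; in Algorithm 1 itself, g would be the map ū → ū obtained by solving (38) for V and recovering u from (36). The damping parameter and tolerance are illustrative choices.

```python
import numpy as np

def damped_fixed_point(g, x0, theta=0.5, tol=1e-10, max_iter=500):
    """Damped fixed-point iteration x <- (1-theta)*x + theta*g(x).
    Stops when successive iterates differ by less than tol: the pragmatic
    stopping rule used in place of a convergence theory."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        x_new = (1.0 - theta) * x + theta * g(x)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Toy contraction: g(x) = cos(x) has a unique fixed point near 0.739.
x_star, n_iter = damped_fixed_point(np.cos, 0.0)
```

Nothing in this sketch addresses points 1 and 3, which concern the PDE itself rather than the outer iteration.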

Calculus of Variations on the Deterministic Control Problem
To find the optimality conditions for (31), let us introduce an adjoint ρ* satisfying (40)-(41); in other words, the gradient of J is given by (42).

Algorithm 2. We apply the steepest descent method with a varying step size:
Initialize a_0 and i = 0; choose 0 < ε ≪ 1;
DO
1. Compute ρ_i by (30) with ρ_i|_{t=0} given;
2. Compute ū_i = ∫_R u_i ρ_i dx;
3. Compute ρ*_i by (40);
4. Compute Grad_u J by (42);
5. Compute a feasible descent step µ_i ∈ R by the Armijo rule [30];
6. Set u_{i+1} := u_i − µ_i Grad_u J and i := i + 1;
WHILE the stopping test on ε is not met.

For a convergence analysis, the situation here is somewhat better: both the state and the adjoint equations (30), (40) are linear, and the only difficulty is the asymptotic behavior of u. Note also that one could use a conjugate-gradient algorithm at little additional computational cost.
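Steps 5-6 of Algorithm 2 amount to a backtracking (Armijo) line search along the negative gradient. A generic sketch, exercised on an illustrative quadratic test function rather than the oil model:

```python
import numpy as np

def armijo_step(J, grad, u, direction, mu0=1.0, c1=1e-4, shrink=0.5, max_tries=40):
    """Backtracking Armijo rule: shrink mu until
       J(u + mu*d) <= J(u) + c1 * mu * <grad, d>."""
    j0 = J(u)
    slope = float(np.dot(grad, direction))
    mu = mu0
    for _ in range(max_tries):
        if J(u + mu * direction) <= j0 + c1 * mu * slope:
            return mu
        mu *= shrink
    return mu

def steepest_descent(J, gradJ, u0, n_iter=100):
    """Steepest descent with Armijo step, as in steps 4-6 of Algorithm 2;
    here J and gradJ are black boxes (in the paper, gradJ comes from the
    adjoint state)."""
    u = np.asarray(u0, dtype=float)
    for _ in range(n_iter):
        g = gradJ(u)
        mu = armijo_step(J, g, u, -g)
        u = u - mu * g
    return u

# Quadratic test: J(u) = ||u - 1||^2 has minimizer u = (1, 1).
J = lambda u: float(np.sum((u - 1.0) ** 2))
gradJ = lambda u: 2.0 * (u - 1.0)
u_min = steepest_descent(J, gradJ, np.zeros(2))
```

Replacing the update in `steepest_descent` by a conjugate-gradient direction is the inexpensive variant mentioned above.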
After discretization by a variational method, such as finite elements, convergence to a local minimum could probably be established by the techniques of control theory (see e.g. [30]), because the discrete problem is finite dimensional and the gradient of the cost function is exact [29]. However, convergence of the solution of the discrete problem to the solution of the continuous problem is open, and even more difficult than existence.

The Riccati Equation when η = 2
Even though the problem is not linear-quadratic, when η = 2 we can still look for a solution V of (38) of the form V(x,t) = P_t x² + z_t x + s_t, where P_t, z_t, s_t are functions of time only.
For clarity, let Q_t := e^{rt} P_t and µ := σ² − r. Then, as long as 4βe^{rt}µ − Q_t > 0, the resulting Riccati equation can be solved for Q_t, and u is then found by (36). However, the Fokker-Planck equation must still be solved numerically to compute ū.
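Since the Riccati equation is a scalar ODE with a terminal condition (V(x,T) = γx² gives P_T = γ), it can be integrated backward in time by any standard scheme. The sketch below uses RK4 and checks itself on dP/dt = P², whose exact solution is known; the function f standing in for the actual Riccati right-hand side is an assumption of the example.

```python
def solve_terminal_ode(f, pT, T, n_steps=1000):
    """Integrate dP/dt = f(t, P) backward from P(T) = pT with RK4,
    as one would for the scalar Riccati equation of Section 5.6
    (whose exact coefficients depend on the model parameters)."""
    dt = -T / n_steps            # negative step: march from T down to 0
    t, p = T, pT
    traj = [(t, p)]
    for _ in range(n_steps):
        k1 = f(t, p)
        k2 = f(t + dt / 2, p + dt * k1 / 2)
        k3 = f(t + dt / 2, p + dt * k2 / 2)
        k4 = f(t + dt, p + dt * k3)
        p += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
        traj.append((t, p))
    return traj

# Sanity check: dP/dt = P^2 with P(T) = 1, T = 1, has P(t) = 1/(1 + T - t),
# so the backward solution at t = 0 is 1/2.
traj = solve_terminal_ode(lambda t, p: p * p, 1.0, 1.0)
p0 = traj[-1][1]
```

The feedback u is then assembled from the computed P_t (and z_t, s_t) via (36), while ū still requires a Fokker-Planck solve as noted above.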
Remark 5.1 By identification of the terms of order 1 and 0 in x, equations for z and s are found.

Remark 5.2 Note that u_t = 2xP_t + z_t is not a linear feedback function as in (28) above.
Remark 5.3 This explicit feedback solution is smooth and adapted to the stochastic process (26), so it should be a solution of (27) if one exists (recall that (27) is not convex). Furthermore, it has a behavior at infinity which is compatible with (2).

Numerical Implementation
To implement Algorithms 1 & 2, we need to localize the PDE. As x < 0 makes no sense for this application, we shall work on the truncated domain Q_L := ]0, L[ × ]0, T[; using a stopping time for the event X_t = 0 would be better, but too costly. At x = L, we set ρ(L,t) = 0, ∀t, which makes sense when L is large.
Assigning boundary conditions to (38) and (40) is problematic. Our numerical tests show that the computations depend strongly on L when this is not done correctly. When η = 2, we know that V and ρ* asymptotically behave like P_t x², giving (1/2)σ²x² ∂_xV = σ²x³P_t = σ²xV, a relation which can be used as a boundary condition in the weak form of the equation (and similarly for ρ*): find V ∈ H^1(Q_L), with V_T given, such that the weak formulation holds for all ν ∈ H^1(Q_L) with ν_T = 0.
To solve this nonlinear PDE we use the fact that it is embedded in an iterative loop in Algorithm 1, and semi-linearise it by evaluating the square term in the last integral as a product in which one factor is taken from the previous iteration.
To discretise it we have used a space-time finite element method of degree 1 over triangles covering Q_L. Admittedly, it is an unusual method! However, it is somewhat similar to a central difference method, and it is feasible because the problem is one-dimensional in space; it allows exact conservativity and exact duality with respect to time and space in the integrations by parts. It also automatically handles the storage of ρ, u, V, ū at all times, and solves the backward (and forward) equations at all time steps through a single linear system. The linear systems are solved with the library MUMPS, as implemented in FreeFem++ [31].

Results with Algorithm 1. Figure 1 shows the optimal control as a function of (x,t).
For each t the control is linear in x, as predicted by the Riccati equation; the quadratic part of the Riccati solution of Section 5.6 is also plotted, and only a small difference is seen on the plot (two surfaces close to each other are displayed). Figure 2 shows the evolution in time of the PDF ρ of the resource distribution X_t, for all x > 0. At time 0 it is a Gaussian distribution centered at x = 5; at time T the distribution is concentrated around x = 0.5, so most producers have pumped 90% of the oil available to them. All of the above is obtained with L = 10, but there is almost no difference with L = 40. Figures 3 and 4 present the evolution of the production (−ū_t) and of the price p_t = κe^{−bt}(−ū_t)^{−c}.

The performance of the descent method was disappointing. It generated many different solutions depending on the initial value for u. If u_0 = u_e, then the algorithm decreases the cost function by introducing small oscillations, a strategy which is clearly mesh dependent. If u_0 = −0.5, then the solution of Figure 5 is found after 10 iterations of steepest descent; the convergence history is given in Table 2, and the results are shown on Figure 6. Note the oscillations of ρ near t = T, to be compared with the solution shown on Figure 7.

Linear Feedback Solution
Using automatic differentiation of computer programs by operator overloading in C++, and initializing a steepest descent method with the linear part of the Riccati solution of Section 5.6 and the same parameters as above, we obtained the w(t) displayed on Figure 8, very close to the Riccati solution. To understand why the Riccati solution may not be the best solution, we plotted λ → J_d(λ) := J(w^d_t + λh_t), λ ∈ ]−0.5, +0.5[, where h_t is an approximation of w_t − ∇J(w^d_t). Figure 8 shows that there are three local minima and two local maxima, and that w^d_t is only a local minimum.
Fig. 8 A solution computed by a steepest descent method to maximize (29), showing the feedback coefficient w versus time (left figure), compared with the feedback coefficient of the Riccati solution (solid line). The plot in the center shows J_d(λ) as a function of λ ∈ ]−0.5, +0.5[; the Riccati solution is at λ = 0. At λ = ±0.12 lie the absolute minima of J_d(·), but they can be seen only by zooming in on the central figure (right plot). Observe that the absolute minimum is very shallow and hard to find; it appears also to be mesh dependent.

Conclusions
Stochastic mean-field type control is numerically hard: even a simple academic toy problem gives difficulties. The first difficulty which this paper had to deal with is the extension of the HJB dynamic programming approach. The second difficulty is related to the well-posedness of the HJB or adjoint equation, because it is set on a domain unbounded in space. The third difficulty is the lack of a proof of convergence for the two algorithms suggested here, an HJB-based fixed point and a steepest descent based on calculus of variations. When it converges, the fixed-point method is preferred, but there is no known way to guarantee convergence, even to a local minimum; as for the steepest descent, we found it somewhat hard to use, mainly because it generates irregular oscillating solutions; some bounds on the derivatives of u need to be added, in penalised form, to the criterion. Numerically, both algorithms are cursed by the asymptotic behaviour of the adjoint state at infinity. So, when possible, the Riccati semi-analytical feedback solution is the best. Finally, but this applies only to this toy problem, the pure feedback solution is nearly optimal, easy to compute and stable. Note also that this semi-analytical solution is a validation test, since it has been recovered by both algorithms.

Fig. 1 Optimal x,t → u(x,t) and the Riccati solution slightly below.

Fig. 2 PDF of the resource X_t: x,t → ρ(x,t).

Table 2
Algorithm 2 convergence history: values of J and Grad_u J versus iteration number k.