Convergence and optimality of neural networks for reinforcement learning
Recent groundbreaking results have established a convergence theory for wide neural networks in the supervised learning setting. Under an appropriate scaling of the parameters at initialization, the (stochastic) gradient descent dynamics of these models converge towards a so-called "mean-field" limit, identified as a Wasserstein gradient flow. In this talk, we extend some of these recent results to prototypical algorithms in reinforcement learning: Temporal-Difference learning and Policy Gradients. In the first case, we prove convergence and optimality of the training dynamics of wide neural networks, bypassing the lack of a gradient-flow structure in this setting by leveraging sufficient expressivity of the activation function. We further show that similar optimality results hold for wide, single-layer neural networks trained by entropy-regularized softmax Policy Gradients, despite the nonlinear and nonconvex nature of the risk function.