### Workshop B4 - Learning Theory

**Organizers:** Sébastien Bubeck (Microsoft Research, USA) - Lorenzo Rosasco (Massachusetts Institute of Technology, USA) - Alexandre Tsybakov (Université de Paris 6, France)

## Talks

No date set

## Random Moments for Compressive Statistical Learning

### Rémi Gribonval

### Inria - Rennes, France - remi.gribonval@inria.fr

We describe a general framework, {\em sketched statistical learning}, for memory-efficient large-scale learning. A low-dimensional {\em sketch} (a vector of random empirical generalized moments), that captures the information relevant to the considered learning task, is computed in one pass on the training collection. A near-minimizer of the risk is computed from the sketch using the generalized method of moments.

The framework is illustrated on sketched clustering, and sketched PCA, using empirical algorithms inspired by sparse recovery algorithms used in compressive sensing.

We provide theoretical guarantess to control the resulting generalization error, with sketch of size driven by an intrinsic measure of complexity of the learning task, which is independent of the volume of the training collection. Volume reductions of several orders of magnitude are demonstrated while preserving the overall learning quality.

Joint work with Nicolas Keriven (Université de Rennes 1, France), Yann Traonmilin (Inria - Rennes, France) and Gilles Blanchard (Universität Potsdam, Germany).

No date set

## The thresholding bandit problem : discrete and continuous setting

### Alexandra Carpentier

### Universitaet Potsdam, Germany - carpentier@math.uni-potsdam.de

In this talk we consider a specific combinatorial pure exploration stochastic bandit problem where the learner aims at finding actively in an online environment the set of actions whose means are above a given threshold, up to a given precision, and for a fixed time horizon. This problem is related to, yet different from, the so-called pure exploration bandit problem.

An optimal strategy for the discrete setting will be presented, and then an extension to a given non-parametric setting will be discussed - this is a natural setting for active level set detection, or active non parametric classification.

Joint work with Maurilio Gutzeit (Universitaet Potsdam), Samory Kpotufe (Princeton University) and Andrea Locatelli (Universitaet Potsdam).

No date set

## Online Learning, Martingale inequalities and Concentration

### Karthik Sridharan

### Cornell University, United States - sridharan@cs.cornell.edu

There is an inherent connection between martingale inequalities and online learning. We will see how to use this connection not only to establish adaptive bounds on regret but also to design efficient online learning algorithms that attain the said bounds on regret. On the other hand, this same connection between martingale inequalities and online learning shall be used to establish an equivalence between online learning and concentration inequalities for martingales.

For the first part of the talk of using martingale inequalities to design online learning algorithms, we use results first established by Donald Burkholder connecting martingale inequalities and existence of certain types of functions on the instance space which can then be used in the online learning algorithms to compute the predictions. We establish adaptive online learning algorithms for problems varying from ada-grad type bounds for online linear optimization to adaptive algorithms for matrix completion problems and for online regression. We will conclude the talk to showing that regret bounds imply martingale concentration statements which are often not easy to prove directly (for instance concentration in banach space).

Joint work with Dylan Foster, Alexander Rakhlin.

No date set

## Optimal algorithms for smooth and strongly convex distributed optimization in networks

### Francis Bach

### INRIA - ENS, France - francis.bach@ens.fr

In this work, we determine the optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications over a network. For centralized (\ie \emph{master/slave}) algorithms, we show that distributing Nesterov's accelerated gradient descent is optimal and achieves a precision $\varepsilon>0$ in time $O(\sqrt{\kappa_g}(1+\Delta\tau)\ln(1/\varepsilon))$, where $\kappa_g$ is the condition number of the (global) function to optimize, $\Delta$ is the diameter of the network, and $\tau$ (resp. $1$) is the time needed to communicate values between two neighbors (resp. perform local computations). For decentralized algorithms based on gossip, we provide the first optimal algorithm, called the multi-step dual accelerated (MSDA) method, that achieves a precision $\varepsilon>0$ in time $O(\sqrt{\kappa_l}(1+\frac{\tau}{\sqrt{\gamma}})\ln(1/\varepsilon))$, where $\kappa_l$ is the condition number of the local functions and $\gamma$ is the (normalized) eigengap of the gossip matrix used for communication between nodes. We then verify the efficiency of MSDA against state-of-the-art methods for two problems: least-squares regression and classification by logistic regression.

Joint work with Kevin Scaman (INRIA), Sébastien Bubeck (Microsoft Research), Yin-Tat Lee (Microsoft Research) and Laurent Massoulié (INRIA).

No date set

## Learning determinantal point processes

### Philippe Rigollet

### M.I.T - rigollet@math.mit.edu

Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. In this talk, I will present recent results related to this problem, specifically \begin{itemize} \item[(i)] Rates of convergence for the maximum likelihood estimator: by studying the local and global geometry of the expected log-likelihood function we are able to establish rates of convergence for the MLE and give a complete characterization of the cases where these are parametric. We also give a partial description of the critical points for the expected log-likelihood. \item[(ii)] Optimal rates of convergence for this problem: these are achievable by the method of moments and are governed by a combinatorial parameter, which we call the cycle sparsity. \item[(iii)] Fast combinatorial algorithm to implement the method of moments efficiently. \end{itemize} The necessary background on DPPs will be given in the talk.

Joint work with Victor-Emmanuel Brunel (M.I.T), Ankur Moitra (M.I.T) and John Urschel (M.I.T).

No date set

## Dictionary Learning: From Local to Global Convergence

### Karin Schnass

### University of Innsbruck, Austria - karin.schnass@uibk.ac.at

In this talk we first give conditions that guarantee one iteration of an alternating dictionary learning scheme to contract an estimate for the desired generating dictionary towards this generating dictionary. Conversely we will provide examples of dictionaries not equal to the generating dictionary that are stable fixed points of the alternating scheme. Based on these characterisations we then propose a (cheap) replacement strategy for alternate dictionary learning to avoid local minima. Time permitting we will finally discuss how the replacement strategy can be used to automatically determine the dictionary size and sparsity level.

No date set

## Matrix Completion, Saddlepoints, and Gradient Descent

### Jason Lee

### University of Southern California, USA - jasondlee88@gmail.com

Matrix completion is a fundamental machine learning problem with wide applications in collaborative filtering and recommender systems. Typically, matrix completion are solved by non-convex optimization procedures, which are empirically extremely successful. We prove that the symmetric matrix completion problem has no spurious local minima, meaning all local minima are also global. Thus the matrix completion objective has only saddlepoints an global minima.

Next, we show that saddlepoints are easy to avoid for even Gradient Descent -- arguably the simplest optimization procedure. We prove that with probability 1, randomly initialized Gradient Descent converges to a local minimizer. The same result holds for a large class of optimization algorithms including proximal point, mirror descent, and coordinate descent.

Joint work with Michael Jordan (UC Berkeley), Benjamin Recht (UC Berkeley), Max Simchowitz (UC Berkeley), Rong Ge (Duke University) and Tengyu Ma ( Princeton University).

No date set

## Projected gradient descent with nonconvex constraints

### Rina Barber

### University of Chicago, USA - rina@uchicago.edu

Nonconvex optimization arises in many applications of high-dimensional statistics and data analysis, where data models and regularization terms can both often exhibit nonconvexity. While convex programs for structured signal recovery have been widely studied, comparatively little is known about the theoretical properties of nonconvex optimization methods. In this talk I will discuss the problem of projected gradient descent over nonconvex constraints, where the local geometry of the constraint set is closely tied to its convergence behavior. By measuring the local concavity of the constraint set, we can give concrete guarantees for convergence of projected gradient descent. Furthermore, by relaxing these geometric conditions, we can allow for approximate calculation of the projection step to speed up the algorithm.

Joint work with Wooseok Ha.

No date set

## Subspace Clustering via Tangent Cones

### Rebecca Willett

### University of Wisconsin-Madison, USA - willett@discovery.wisc.edu

In this talk, I will describe a novel approach to subspace clustering called ``conic subspace clustering'' (CSC), which offers new insight into the geometry of subspace clustering. Like many other subspace clustering methods, it operates by first estimating the affinity between pairs of points, where the affinity is one if two points are in the same subspace and zero otherwise, and then applying an off-the-self clustering method to the learned affinities. The main novelty of the proposed approach lies in the estimation of pairwise affinities. Specifically, CSC considers the convex hull of a collection of data points and the tangent cones at each sample $x$. The union of subspaces underlying the data imposes a specific geometry on the convex hull, so that the tangent cone at a point $x$ has a strong association with the subspace containing $x$. This insight leads to a tangent cone membership test that can be used to estimate the affinity between $x$ and other points. In addition to describing this novel geometric perspective, this work provides deterministic and stochastic guarantees on the accuracy of the learned affinity matrix. More specifically, the test guarantees zero false positives and provides guarantees for any true positive rate. Furthermore, the true positive rate is potentially much larger than for competing methods that leverage sparsity regularization, which is particularly important in the presence of large numbers of samples from subspaces with complicated configurations. This increased true positive rate can be leveraged within an convex optimization formulation of the CSC affinity matrix estimation, leading to a computationally efficient algorithm.

Joint work with Amin Jalali (University of Wisconson-Madison).

No date set

## Estimation of Smooth Functionals of Covariance Operators

### Vladimir Koltchinskii

### School of Mathematics, Georgia Institute of Technology, USA - vlad@math.gatech.edu

We discuss a problem of estimation of a smooth functional of unknown covariance operator of a mean zero Gaussian random variable in a Hilbert space based on a sample of its i.i.d. observations in a setting when either the dimension of the space, or so called effective rank of the covariance operator are allowed to be large (although much smaller than the sample size). The main example is the problem of estimation of a linear functional of unknown eigenvector of the true covariance that corresponds to its largest eigenvalue (the top principal component). In this case, in a joint work with Matthias Loeffler and Richard Nickl, we proved a minimax lower bound on the risk of an arbitrary estimator of a linear functional, developed an estimator for which this bound is attained (which is not true for a standard estimator based on the top principal component of sample covariance due to its large bias) and also proved a Berry-Esseen type bound on the accuracy of approximation of its distribution by normal. This estimator is based on a further development of a bias reduction method initially introduced in our joint work with Karim Lounici. We will also discuss a more general approach to bias reduction and efficient estimation of smooth functionals of the unknown covariance (developed in our joint work with Mayya Zhilova).

No date set

## The loss surface of deep and wide neural networks

### Matthias Hein

### Saarland University, Germany - hein@math.uni-sb.de

While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is the case as all local minima are close to being globally optimal. We show that this is (almost) true, in fact almost all local minima are globally optimal, for a fully connected network with squared loss and analytic activation function given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.

Joint work with Quynh Nguyen Ngoc.

No date set

## Consistency and comparison of objective functionals in semi-supervised learning

### Dejan Slepcev

### Carnegie Mellon University, United States - slepcev@math.cmu.edu

We consider a regression problem of semi-supervised learning: given real-valued labels on a small subset of data recover the function on the whole data set while taking into account the information provided by a large number of unlabeled data points. Objective functionals modeling this regression problem involve terms rewarding the regularity of the function estimate while enforcing agreement with the labels provided. We will discuss and prove which of these functionals make sense when the number of data points goes to infinity. Furthermore we will discuss distinct qualitative properties of function estimates that different regularizations lead to. In particular we will discuss regularizations motivated by p-Laplace equation and higher order fractional Laplacians.

Joint work with Matthew Dunlop (Caltech), Andrew Stuart (Caltech) and Matthew Thorpe (CMU).

No date set

## On the Exponentially Weighted Aggregate with the Laplace Prior

### Arnak Dalalyan

### ENSAE/CREST, Universite Paris Saclay, France - arnak.dalalyan@ensae.fr

In this talk, we will present some results on the statistical behavior of the Exponentially Weighted Aggregate (EWA) in the problem of high-dimensional regression with fixed design. Under the assumption that the underlying regression vector is sparse, it is reasonable to use the Laplace distribution as a prior. The resulting estimator and, specifically, a particular instance of it referred to as the Bayesian lasso, was already used in the statistical literature because of its computational convenience, even though no thorough mathematical analysis of its statistical properties was carried out. We will present results that fill this gap by establishing sharp oracle inequalities for the EWA with the Laplace prior. These inequalities show that if the temperature parameter is small, the EWA with the Laplace prior satisfies the same type of oracle inequality as the lasso estimator does, as long as the quality of estimation is measured by the prediction loss. Extensions of the proposed methodology to the problem of prediction with low-rank matrices will be discussed as well.

Joint work with Edwin Grappin (ENSAE/CREST), Quentin Paris (HSE).

No date set

## Overlapping variable clustering with statistical guarantees and LOVE

### Florentina Bunea

### Cornell University , USA - fb238@cornell.edu

Variable clustering is one of the most important unsupervised learning methods, ubiquitous in most research areas. In the statistics and computer science literature, most of the clustering methods lead to non-overlapping partitions of the variables. However, in many applications, some variables may belong to multiple groups, yielding clusters with overlap. It is still largely unknown how to perform overlapping variable clustering with statistical guarantees. To bridge this gap, we propose a novel Latent model-based OVErlapping clustering method (LOVE) to recover overlapping sub-groups of a potentially very large group of variables. In our model-based formulation, a cluster is given by variables associated with the same latent factor, and can be determined from an allocation matrix $A$ that indexes our proposed latent model. We assume that some of the observed variables are pure, in that they are associated with only one latent factor, whereas the remaining majority has multiple allocations. We prove that the corresponding allocation matrix $A$, and the induced overlapping clusters, are identifiable, up to label switching. We estimate the clusters with LOVE, our newly developed algorithm, which consists in two steps. The first step estimates the set of pure variables, and the number of clusters. In the second step we estimate the allocation matrix $A$ and determine the overlapping clusters. Under minimal signal strength conditions, our algorithm recovers the population level clusters consistently. Our theoretical results are fully supported by our empirical studies, which include extensive simulation studies that compare LOVE with other existing methods, and the analysis of a RNA-seq dataset.

Joint work with Yang Ning (Cornell University) and Marten Wegkamp (Cornell University).

No date set

## Parallelizing Spectral Algorithms for Kernel Learning

### Gilles Blanchard

### University of Potsdam, Germany - gilles.blanchard@math.uni-potsdam.de

We consider a distributed learning approach in supervised learning for a large class of spectral regularization methods in a reproducing kernel Hilbert space framework. The data set of size $n$ is partitioned into $m=O(n^\alpha)$ disjoint subsets which can be sent to different machines. On each subset, some spectral regularization method (belonging to a large class, including in particular Kernel Ridge Regression, $L^2$-boosting and spectral cut-off) is applied. The regression function $f$ is then estimated via simple averaging. This leads to a substantial reduction in computation time with respect to using a single machine. We show that minimax optimal rates of convergence are preserved if $m$ grows sufficiently slowly with $n$, the upper limit on the exponent $\alpha$ depending on the smoothness assumptions on $f$ and the intrinsic dimensionality.

Joint work with Nicole Mücke.

No date set

## Optimal graphon estimation

### Olga Klopp

### University Paris Nanterre, France - kloppolga@math.cnrs.fr

In the last decade, network analysis has become an important research field driven by applications. We study the twin problems of estimating the connection probability matrix of an inhomogeneous random graph and the graphon of a $W$-random graph. We consider classes of block constant matrices and step function graphons and establish the minimax estimation rates with respect to different distances: $l_1$, $l_2$ and the cut metric. We also extend our study to the case of unbounded graphons.

Joint work with Nicolas Verzelen (INRA, France).

No date set

## Learning in Nature

### Nisheeth Vishnoi

### EPFL, Switzerland - nisheeth.vishnoi@epfl.ch

In this talk, I will present two algorithmic connections between nature and humans that we recently discovered. The first is an equivalence of the dynamics of the slime mold and a popular algorithm in machine learning -- the iteratively reweighted least squares (IRLS) method. The second is between a distributed learning dynamics observed among social animals (including humans) and the multiplicative weights algorithm.

Not only have similar learning algorithms been independently chosen by human intelligence and nature (via evolution) to solve problems in entirely different context, these connections lead us to the first proof of correctness of the damped IRLS method and a novel distributed and low-memory implementation of the classic MWU method.

Joint work with L. Elisa Celis (EPFL), Peter Krafft (MIT) and Damian Straszak (EPFL).

No date set

## The statistical foundations of learning to control

### Benjamin Recht

### University of California, Berkeley, USA - brecht@berkeley.edu

Given the dramatic successes in machine learning and reinforcement learning over the past half decade, there has been a resurgence of interest in applying these techniques to continuous control problems in robotics, self-driving cars, and unmanned aerial vehicles. Though such applications appear to be straightforward generalizations of standard reinforcement learning, few fundamental baselines have been established prescribing how well one must know a system in order to control it. In this talk, I will discuss how one might merge techniques from statistical learning theory with robust control to derive such baselines for such continuous control. I will explore several examples that balance parameter identification against controller design and demonstrate finite sample tradeoffs between estimation fidelity and desired control performance. I will then describe how these simple baselines give us insights into shortcomings of existing reinforcement learning methodology. I will close by listing several exciting open problems that must be solved before we can build robust, safe learning systems that interact with an uncertain physical environment.

Joint work with Ross Boczar (UC Berkeley), Sarah Dean (UC Berkeley), Moritz Hardt (Google Brain/UC Berkeley), Tengyu Ma (Princeton), Horia Mania (UC Berkeley), Andrew Packard (UC Berkeley), and Stephen Tu (UC Berkeley)..