The following describes work by Ching-An Cheng and Byron Boots, which was awarded Best Paper at the Further details and proofs are available at The 21st International Conference on Artificial Intelligence and Statistics (AISTATS). The paper can be found here: https://arxiv.org/abs/1801.07292.

Learning to make sequential decisions is a fundamental topic in designing automatic agents with artificial intelligence. A sequential decision-making task can be formulated as a reinforcement learning problem, which quantifies the performance of an agent in terms of accumulated rewards. The goal of reinforcement learning is to search for a *policy *(a decision making rule) such that the agent can gain high accumulated rewards. For example, in a car racing game, the reward can be used to encourage the agent to finish the track with minimal time, and therefore by solving the reinforcement learning problem the agent can learn to drive as fast as possible.

Despite its flexibility and generality, solving a reinforcement learning problem can be extremely difficult. One of its fundamental difficulty is because the agent has minimal knowledge about the the problem, it needs to randomly explore the environment while trying to figure out what the best policy is for the problem. This trade-off is classically known as the exploration-and-exploitation problem. As a result, the agent usually needs a tremendous amount of interactions with the environment to learn a decent policy.

The sample inefficiency of reinforcement learning is the core obstruction of deploying reinforcement learning algorithms to real-world agents, such as robots. For example, if to learn a policy requires 10000 interactions (which is largely underestimated number) and each interaction takes a minute of experiment. Then the agent (and the poor human supervisor) will need a week of 24-hour non-stopping experiments (provided no equipment breaks in between) in order to learn to solve a single decision making task.

## Speeding Up Policy Learning with Imitation

Is there a way to make policy learning faster? Fortunately, yes, because, most of the time, the decision problem we wish to solve is not a completely unknown problem. Instead, we usually have some prior knowledge. For the car racing example, even we don’t know what the best way to drive the car is, we know that stepping on the gas pedal is not a bad move. In the past, we have been using such rough ideas and intuitions to design many effective heuristics and engineered solutions. While these approaches may not be optimal, it turns out using them as guidance can greatly speed up the process of policy search.

The core idea of imitation learning is to treat this domain knowledge as *expert policies* and use them to provide a more informed policy search direction through *demonstrations*. For example, we can find a policy by minimizing the difference between the actions taken by the agent and that taken by the expert over the trajectories that the expert visits. Solving this surrogate supervised learning problem (also known as behavior cloning or batch imitation learning) is much simpler than the original reinforcement learning problem. Therefore, the agent can quickly pick up a policy and later on it may use this policy as a warm start to continue to solve the reinforcement learning of interest.

However, reducing an imitation learning to a *single *supervised learning problem as above may lead to a large performance gap between the expert and the agent. This is because the agent with the learned policy may encounter different states (situations) from the ones the expert visited in the demonstrations. This problem is called the covariate shift problem and due to accumulated error it is pronounced particularly in problems with a long decision horizon.

## Value Aggregation: An Interactive Approach

To mitigate the covariate shift problem, a recent technique of imitation learning is to learn the policy by solving a *sequence *of supervised learning problems. It works iteratively as follows:

- 0. we start with a policy which may be the policy learned using behavior cloning.
- 1. In iteration
*n*, we first let the agent execute its policy in the environment to check which states it visits. - 2. Then we ask the expert about which action the agent should have taken on those states and construct a cost function that penalizes the difference.
- 3. Finally, we update the agent’s policy by minimizing the sum of the cost function we just constructed and all the cost functions over the past iterations. We go back to repeat Step 1) with the updated policy until a fixed number of iterations is reached.

This simple idea is called value aggregation (Ross and Bagnell, 2014), because it combines all the costs based on the past experiences to update the policies. Unlike the traditional batch approach of behavior cloning, value aggregation interactively learns the policy by *online learning* as it defines a new cost function in each iteration. Because value aggregation considers the states that the agent visits with the learned policy, it can close the performance gap due to the covariate shift problem and let the agent learn a policy that performs consistently with the expert policy.

## Convergence of Value Aggregation

The performance of value aggregation can be analyzed by standard Follow-the-Leader analysis in the online learning literature. Theoretically, it guarantees that the policy sequence it generates *on average* can perform similarly to the expert policy. In other words, there exists a policy in that sequence which is close to the expert policy.

This theoretical property comes with some practical inconvenience however. Because the original analysis only guarantees on-average performance, it requires keeping the entire policy sequence in memory and performing another evaluation phase to check which policy is the best. In other words, it requires extra interactions in order to identify the best policy after the learning phase, which means additional experiments and time in practice.=

Consequently, practitioners of value aggregation usually just use the last policy in the sequence, disregarding what is suggested by the theory. But does the policy sequence generated by value aggregation actually improve monotonically and the last one is the best one? Empirically, there are mixed answers. Sometimes it does and the last-policy heuristic demonstrates good performance. On the contrary, people have also observed performance oscillation or degradation over the course of value aggregation.

The main contribution of our work is to answer this open question theoretically. We prove that whether performing value aggregation can lead to a convergent policy sequence is indeed problem dependent. Furthermore, we show that the behavior of value aggregation can be simply characterized by a single stability constant 𝛳.

At a high level, this constant 𝛳 measures how difficult a problem is in terms of how sensitive the change of visited states is to the policy change. For example, a problem is difficult when a tiny change in the agent’s policy would lead the agent to visit a completely different set of states. On the contrary, a problem is simple, if similar states are visited by different policies.

We prove that if 𝛳 < 1, then the last policy generated by value aggregation converges to the expert policy, and if 𝛳 > 1, there is a problem in which running value aggregation always leads to a degrading policy sequence. Under the same assumptions used in the original analysis, we provide non-asymptotic convergence rate of the performance of the last policy.

## Stabilizing an Unstable Problem

Based on this new insight, we provide modifications of value aggregation to ensure its stability. By adding certain regularization to the problem, we show that the modified problem can have an effective 𝛳 < 1, which means that running value aggregation with the modified problem would always converge. In a high level, if we can modify the problem such that the agent can at least perform well on the states visited by the expert, then running value aggregation is stable. For example, an admissible regularization can be using a mixture of the agent’s policy and the expert policy to explore states. This is a heuristic that has been used in the imitation learning literature, and here we prove that with a proper mixing rate it can indeed stabilize the learning.

## Imitation Learning and Beyond

Value aggregation is an intuitive idea to overcome the covariate shift problem in imitation learning. More generally, it captures our trial-and-error principle: make mistakes and find the remedy to the previous mistakes. Here we show that this process does not necessary lead to monotonic improvement. But fortunately we also show that if we can modify the problem to ensure the agent can perform well on some fixed baseline tests, then value aggregation can regain stability and ensure monotonic improvement during learning.

## Further Reading

Further details and proofs are available at https://arxiv.org/abs/1801.07292.

*[Post written by Ching-An Cheng and Byron Boots]*