When I initially read the title of this paper, Learning to Act By Predicting the Future, I had high hopes that it would offer a compelling solution to a problem I’d been wondering about for a while: whether it is possible to teach a reinforcement learning model to learn about the world without having a specific goal in mind. And it’s true: there is a sense in which this paper solves that problem, but I found it more limited than I was hoping for.
The core implementation detail of this paper, which teaches an agent to play the first-person shooter Doom, is that, instead of learning based on the value of a reward, the actual learning happens by forcing ourselves to build good predictions of a small set of visible measurements: health, stamina, and kills. How exactly does this work?
The output ultimately produced by the network is a vector, with length equal to the number of measurements you’re trying to predict multiplied by the number of possible actions you can take, and it represents our prediction of those measurements in the world in which we take each possible action. For example, if we imagine one of our actions is “step forward”, elements 0, 1, and 2 in that vector might be our predictions of health 3 steps from now, stamina 3 steps from now, and kills 3 steps from now, given that we take the action “step forward” from our current state. This vector serves two purposes:
- It helps us decide which action to take. In this paper, the agent’s goal function, which is used to select actions, is defined by weights that specify how much each of the three key measurements factors into our goal function. For example, we might place 0.5 weight on future health, 0.5 on future stamina, and 1.0 on future kills. In that case, we take the model’s predictions of health/stamina/kills, and take the action that causes that weighted combination to be highest, in expectation.
- It provides the source of loss that the model trains on. Once we choose which action to take, we then observe the actual values of our predicted measurements, and can calculate a loss based on the difference between predictions and reality. This — which is formulated as a simple mean squared loss — is what’s used to train our model.
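Both purposes can be sketched in a few lines of numpy. To be clear, the variable names, shapes, and numbers below are my own illustrative choices, not taken from the paper’s code:

```python
import numpy as np

n_actions = 3          # e.g. "step forward", "turn left", "turn right"
n_measurements = 3     # health, stamina, kills

# Network output: one predicted measurement vector per action,
# reshaped from a flat vector of length n_actions * n_measurements.
predictions = np.array([
    5.0, 88.0, 2.0,    # predicted (health, stamina, kills) if we step forward
    4.0, 90.0, 1.0,    # ... if we turn left
    3.0, 90.0, 0.0,    # ... if we turn right
]).reshape(n_actions, n_measurements)

# Goal vector: how much each measurement counts toward our objective.
goal = np.array([0.5, 0.5, 1.0])   # weights on health, stamina, kills

# Purpose 1: take the action whose predicted measurements score
# highest under the goal weighting.
scores = predictions @ goal
best_action = int(np.argmax(scores))   # -> 0, i.e. "step forward"

# Purpose 2: after acting, compare the prediction for the chosen
# action against the measurements we actually observe, using a
# mean squared error loss; this is the model's training signal.
observed = np.array([4.0, 87.0, 2.0])
loss = np.mean((predictions[best_action] - observed) ** 2)
```

The point to notice is that the same vector of predictions drives both action selection at test time and the supervised regression loss at training time.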
[The next paragraph focuses on the mechanics of how this vector of expected action-predictions is created; if you’re just looking for overall intuition of the approach, feel free to skip ahead]
Continuing to work backward, this vector of predictions corresponding to each action is learned via a composition of two vectors: an expectation vector, and a per-action vector. This simply means that we learn both
- a vector representing our average expectation, e.g. what our prediction would be if we imagined we were just drawing actions randomly from the action distribution, and
- a vector representing the “offset” from that expectation that corresponds to each action. For example, if our average expected value of our predictions is (4, 90, 1), and our prediction if we “step forward” is (5, 88, 2), then the offset for the “step forward” action would be (1, -2, 1)
You may notice that the expectation vector is much shorter than the full action-offset vector, since it only contains one set of predictions total, rather than one set per action. And, you would be correct: in order to make the vector math work, the expectation vector is “tiled”, i.e. repeated over and over again, until it’s of the same length as the action-offset vector. These “expectation” and “action” subnetworks build on a foundation that takes three sources of input: the pixels of the current state (which go into a convolutional network), the current measurements (which go into a set of densely connected layers), and the current goal vector, which is to say, the vector of weightings over measurements that was earlier discussed in the context of how to choose our next action. If you’re confused about why you add the goal vector as input, I was/am too; I’ll talk about that later.
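The tiling-and-adding step can be made concrete with a small sketch. Again, these names and numbers are illustrative rather than the paper’s; I’ve also included the mean-subtraction of the action offsets, which (as I understand it) is what makes the expectation stream an actual average over actions:

```python
import numpy as np

n_actions = 2
n_measurements = 3

# Expectation stream: one prediction, averaged over actions.
expectation = np.array([4.0, 90.0, 1.0])

# Action stream: one offset vector per action.
offsets = np.array([
    [1.0, -2.0, 1.0],     # offset for "step forward"
    [-1.0, 2.0, -1.0],    # offset for some other action
])

# Subtract the mean offset across actions, so the offsets are
# guaranteed to average to zero and the expectation stream really
# does represent the average prediction.
offsets = offsets - offsets.mean(axis=0, keepdims=True)

# "Tile" the expectation (repeat it once per action) so it matches
# the shape of the offsets, then add the two streams together.
tiled = np.tile(expectation, (n_actions, 1))
per_action_predictions = tiled + offsets
# Row 0 recovers the "step forward" prediction: (5, 88, 2).
```

By construction, averaging the per-action predictions over actions gets you back the expectation vector, which is the decomposition the two subnetworks are meant to learn.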
When you look at performance in terms of health and ultimate kills in an average game, this approach does do better than two very popular current reinforcement learning techniques: Deep Q-Networks (DQN) and Asynchronous Advantage Actor-Critic (A3C).
So. These are impressive results. Where does my lack of enthusiasm stem from? To explain that, I think it’s instructive to take a minor detour into the theory of reinforcement learning. Not theorems or equations, per se, but the question of what makes reinforcement learning problems unique and hard in the first place.
In games like chess and Go, you need to learn long-term strategies in order to bring yourself closer to winning the game. Crucially, however, there isn’t a human-known mapping between the visible position of your pieces at any given point in time and expected eventual reward, e.g. the probability of winning. You can’t just say that winning is inherently defined by having 3 pieces here and 4 here, and so on, and that if you meet those criteria, you’ve won. That mapping is something the model has to learn as part of its training process. By contrast, in this paper, we hard-code a mapping between observed variables and reward, so all the model needs to learn is how to predict observed variables.
I think this is at the core of my frustration with the framing of the paper: the “measurements” that the model predicts really aren’t that different from its goal. The goal is just a weighted combination of measurements. In the case of a (1, 1, 1) goal vector, this framing is equivalent to “predict the amount of reward you’ll have accumulated three steps from now”. Because these measurements are hand-chosen by a human, and hand-mapped to a reward, they don’t really seem to be delivering on the promise I was hoping for: learning from future happenings in a more unsupervised way.
All of that said, I think that if you forget the bit about measurements vs. reward, and just look at the case where they’re equivalent, the performance of this model compared to DQN and A3C does suggest strong competitive performance in environments where you get frequent rewards. In both DQN and A3C, the value (/Q function) of an action is inherently recursive: a prediction of the amount of reward that will be acquired between that action and the end of the game. This paper suggests that you can do quite well by just optimizing directly over what you’ll have a few* steps from now, rather than using a more complicated process, with long chains of gradient-passing to reach the end of the game. This is obviously going to be insufficient in environments that require long-term strategy, but in situations that are mostly short-term and situational, this lowered complexity may provide a lot of value.
All in all, one (indirect) contribution of this paper, that I quite appreciated, was the way it made me think more clearly and cleanly about the different kinds of problems that fall under the broad umbrella of “learning a set of behavior”, and how the differing reward structures of those problems can, and should, lead you to conceptualize those problems in fundamentally different ways.
*The paper actually uses measurements made at a series of future frames, but in terms of human game time it’s a relatively short window: 3–4 seconds.
Could Reinforcement Learning Use More Supervision? was originally published in Towards Data Science on Medium.