People affect change in the world all the time using natural language communication. Grounding such communication in real world actions is a well-studied and notoriously complex task, even the data gathering step is difficult.
So does there exist a platform on which we could more easily simulate such communication?
And the answer is … well this wouldn’t be much of a blog post if the answer wasn’t yes. And as the title would imply, the answer is text-adventure games.
So what is a text-adventure game? Now, the dozens of people who’ve actually played one of these before can skip ahead.
Still here? Thought so.
Alright, so text-adventure games are simulations in which the agent must interact with the world entirely through text. The agent “sees” and “talks” to the world through natural language.
So for 3 people who’ve played these games before and decided to read this section anyway — you’re probably wondering if its actually possible to simulate something so complex in the form of such a game and the answer is …
Of course not! Remember how a minute ago I said that natural language communication is basically part of everything we do and even data gathering was difficult? Well, simulating something like that computationally right off the bat is nearly impossible. Think of it this way, text adventures simulate grounded human communication about as well as Minecraft simulates the real world. They are simplifications at best, but still complex enough that they cover a vast variety of domains and situations and pose a challenge even to humans, thus providing us with a meaningful stepping stone towards modeling more complex interactions.
To figure out what makes these games challenging, lets first define them. Text-adventure games are POMDPs or Partially Observable Markov Decision Processes. This takes the form of a tuple of <S, T, A, Ω, O, R, γ> representing the set of environment states, conditional transition probabilities between states, words used to compose text commands, observations, observation conditional probabilities, reward function, and the discount factor respectively (Cote et al. , 2018).
So its basically an MDP but the agent never has access to the true underlying world state — in this case all it gets receives is a potentially incomplete textual description of the world around it.
So this definition directly leads us to our first challenge, partial observability — not having access to the world state means that all the agent knows about is what’s right in front of it and sometimes not even that due to the lack of complete descriptions. The agent thus has to reason about what actions to take with missing information.
The second issue is that of the combinatorially large state-action space, a consequence of everything being in natural language. Even if we know that the actions are in a certain format, “take ITEM from ITEM” still a large number of these to fill in. A game like Zork has about ~10 million actions possible. The agent thus needs to have a sense of previous context in order to be able to more effectively explore this space.
- We’re first going to talk about knowledge graphs and how they help with these challenges
- Then I’ll present a deep reinforcement learning architecture the KG-DQN that can be be trained to play these games
- We’ll talk about how to frame gameplay as a question-answering task and how we can use that to pre-train parts of this architecture
- Finally, we’ll see all the pieces in action and see some results!
Knowledge graphs are a very intuitive way of representing the world of a text adventure game. They are easier to understand and give us a structured memory aid. People have already been using graphs in game guides, etc. for some time now. In fact, what you are looking at is a hand drawn graph of the world of Zork, a popular commercial text adventure game. We’re going to see how this representation helps with both of the challenges we have talked about.
In terms of partial observability, the graph acts as a persistent memory and gives the agent a sense for what it has seen in the past in the form of a mental map of things it has seen. This lets us train a deep Q-network that is conditioned on previous states, something that previous approaches have not used.
The graph also lets us cut down the massive state-action space by giving the agent a strong prior on potential high utility actions given a state, letting us prune the action space and enabling more effective exploration.
The knowledge graphs that we use have two types of nodes:
The long term nodes which give the agent a mental map of the world and contain info about rooms, objects, etc.
And the short term nodes which provide current context and are linked to a “you” node representing the agent itself.
Putting everything together, we get something like this:
We see how the graph updates given two consecutive states where the agent explores a basement and chamber.
Well, we’ve seen how the graph gives us a memory that helps with partial observability but how about the ridiculously large state-action space?
We use a scoring function to rank potentially high utility actions and then keep the top k actions: +1 for each object in the graph and +1 for a path between the objects.
Take this graph for example. Based on this, we see that the agent has explored two rooms: a chamber and a basement and knows that there’s a cubical chest in the basement and a key in the chamber and an exit to the north of the chamber. This tells us that potential high utility action, one that is likely to affect the world state, would be to go north or pick up the key as opposed to trying to go west.
Knowledge Graph DQN
OK, we’ve kind of convinced ourselves that knowledge graphs are the right state representations to use. But, how do we actually solve the game? Well, DQNs have been shown to be good at solving games and maybe with our state representation, we can also use them.
The gist of our architecture is that we use two neural networks to separately embed the states and actions into different n-dimensional vectors.
The state side encodes the graph using graph embedding techniques in addition to using graph attention to learn which portions of the graph are useful and combines this with an encoding of the observation itself.
On the action side, we first start with the set of all possible actions, then prune using the graph and use an LSTM encoder to convert the rest into their vector representations.
So agents using DQNs are trained using some exploration strategy, e-greedy is a popular option. The agent basically picks a random action with e prob and follows the policy otherwise. This e varies so that the agent explores more at the beginning and less later on.
The rest of the training is done using experience replay and measuring temporal difference loss. All the agents interactions with game environment are stored in a buffer and every so often we sample almost randomly from this buffer and calculate how “surprised” our current deep q network is by the q-value that was initially calculated for that state action pair.
This won’t work for us because we can prune potentially useful actions out — such as at the beginning of a game, when the graph does not have enough information to effectively to give the agent a prior on actions. The agent explores using a modified version of the standard e greedy algorithm, that we call e1-e2 greedy — it explores with a probability e1 and when it does it randomly chooses from the full action set with probability e2 and from the set of pruned actions otherwise.
Gameplay as Question Answering
Despite all of this, RL is notoriously sample inefficient — especially in cases like this where there is a weak, sparse reward signal. The agent has very little feedback on whether its doing something right or not.
What if we could pre-train parts of the network to make life easier for the agent, decreasing initial embedding noise for the agent? Well, we can! Previous work has shown that multiple NL tasks can be framed as QA tasks. We do the same and train a question answering system that learns to answer the question of what action to best take next — and use this network to initialize portions of the deep q network architecture as shown here:
Testing things out
We use Textworld, a framework that lets us generate text-adventure games in a controlled setup in a given domain. We use the “home” theme, a textual simulation of a house.
We generate two sets of games — “small” and “large” — with different difficulties and specs and try ablation testing on these games.
OK, lets see how the KG-DQN actually does on these games, in this case the smaller one.
So what you’re looking at here is average reward gathered by the agent per episode during the training phase.
We see that KG-DQN in its full form converges to a policy that receives the maximum reward 40% faster than the baseline LSTM-DQN. In fact, this trend remains even when we remove the pre-training.
Upon convergence we run the agent and see how long it takes for it to finish the game.
Its also worth noting that the random command baseline is much stronger than randomly picking from the full action set as we can use Textworld to gain a list of exactly which actions can be performed given a state, something that we don’t use in any of the rest of the experiments.
The gist of this table is that when we do not prune the action space, it converges only slightly slower but the overall solution quality is much worse and similar results follow when we do not pre-train. This implies that both the pruning and pre-training help us optimize the control policy. LSTM DQN also reaches these many steps but takes significantly longer to converge.
So overall we see that KGDQN converges much faster without any loss in solution quality, and the difference in steps number of steps between KGDQN and LSTM DQN is statistically insignificant.
That was a lot to take in all at once, so here’s what you should remember:
Knowledge graphs help with:
- Partial observability by providing persistent memory
- Combinatorially large state-action space by pruning
This resulting graph is also interpretable and can grow arbitrarily large without forgetting unlike something like the cell state of an LSTM.
Thanks for reading!
This content was originally published at https://medium.com/@rajammanabrolu/kg-dqn-57d06004022