By Adrian Rivera Cardoso and He Wang

Machine learning (ML) is changing our lives. We can instantly translate from one language to another, search entire libraries in a matter of seconds, and even prevent credit card fraud. ML’s success is mostly due to three things: artificial neural networks (machine learning models inspired by how the brain works), massive datasets, and a lot of computational power. However, while these ML applications are making our lives easier, we have not really solved the problem of building a true artificial intelligence (AI) agent.

A true AI agent should be able to perform well in a broad range of tasks. It should be able to recommend movies, drive cars, provide medical diagnoses, reduce unnecessary electricity consumption, and perhaps even help humans make scientific breakthroughs. In all of these applications, the AI agent needs to make a sequence of decisions as it interacts with humans or the environment.

For example, suppose you have never listened to The Beatles and one day a recommender system suggests one of their songs. Since you will obviously enjoy the song, your tastes have changed, and from that day on the recommender system must take that fact into account. It is also important to notice that the environment does not change only because of the AI agent’s actions. It may also change because humans or other AI systems are making decisions at the same time. For example, if your best friend just introduced you to jazz, the recommender system must learn this fact and try to adapt.

Reinforcement Learning

A classical framework for solving sequential decision-making problems is reinforcement learning (RL). The main idea behind RL is to teach an AI agent to perform well by rewarding it whenever it makes a good decision, just as we would reward our dogs with a treat whenever they sit down at our command.

The pillars of RL were developed in the late 1950s by Richard Bellman under the name of Markov decision processes (MDPs). Although more than 60 years have passed since Bellman’s initial work, we have not been able to fully apply his ideas to solve all the challenges mentioned earlier. However, AI researchers have made significant advances! Using RL together with the power of artificial neural networks, we have been able to achieve superhuman performance in games such as chess, Atari, Go, and more recently StarCraft.

The Curse of Dimensionality and Nonstationarity

One of the reasons we have not been able to use RL to fully solve AI is that the environment is usually massive (in the eyes of the AI agent). For example, consider an AI agent whose goal is to play Atari games by looking only at pixels on the screen. If the screen is 210×160 pixels and each pixel can have red, green, and blue values varying from 0 to 255, then each of the 33,600 pixels can take about 16.8 million colors. That already gives roughly 564 billion pixel-color combinations, and the number of distinct screens the agent could see is vastly larger: a number with more than 240,000 digits! The real world is significantly more complicated than Atari games, so the massive number of states can really be a problem for AI agents. Bellman called this phenomenon the curse of dimensionality: as the problem complexity grows, the number of possible states becomes astronomically large.
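
The counting in the Atari example is easy to check. The short Python sketch below assumes, as in the text, a 210×160 screen with 256 levels per color channel, and additionally assumes that every pixel can take any color independently of the others (real Atari hardware is more restricted). It computes the product of pixels and colors, which is about 564 billion, and separately the number of digits in the count of distinct full-screen images, which is the quantity that really blows up.

```python
import math

# Assumption (from the text): a 210x160 screen, 256 levels per RGB channel,
# and every pixel free to take any color independently of the others.
WIDTH, HEIGHT, LEVELS = 160, 210, 256

pixels = WIDTH * HEIGHT                 # number of pixels on the screen
colors_per_pixel = LEVELS ** 3          # RGB combinations for one pixel
pixel_color_pairs = pixels * colors_per_pixel

# Distinct screens = colors_per_pixel ** pixels. The number is too long to
# print, so we count its decimal digits with logarithms instead.
screen_digits = math.floor(pixels * math.log10(colors_per_pixel)) + 1

print(f"pixels:                      {pixels:,}")
print(f"colors per pixel:            {colors_per_pixel:,}")
print(f"pixel-color pairs:           {pixel_color_pairs:,}")
print(f"digits in number of screens: {screen_digits:,}")
```

Adding a single extra pixel multiplies the number of screens by almost 17 million, which is the curse of dimensionality in miniature.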

Another challenge in RL is that there may be humans or other agents interacting with our AI agent and learning at the same time. This makes the environment nonstationary, meaning that actions that were good in the past are not necessarily good in the future. Back to the music example, imagine that after your friend introduces you to jazz, you enjoy it so much that you stop being interested in The Beatles. The recommender system should stop suggesting them for a while and instead recommend new jazz songs you haven’t heard.

Solving Large-Scale MDPs in a Changing Environment

In a recent paper to appear in NeurIPS (Rivera Cardoso et al. 2019), we try to tackle decision problems with a massive number of states in a nonstationary environment. We adopt the classical MDP model but allow the environment to be highly nonstationary. The way we create smart agents for this problem is very intuitive. At each decision step, the agent takes the best action according to all the data observed so far, while taking special care not to change its decisions too drastically. Without getting too much into the technical details, the agent works by exploiting tools from mathematical optimization and convex geometry.
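
To give a feel for the idea of "best action so far, but no drastic changes", here is a deliberately simplified Python sketch. It is not the algorithm from the paper: it ignores states entirely and uses a well-known stand-in, a leader rule regularized with entropy (also known as exponential weights), on a made-up two-action recommender. The function names, the step size `eta`, and the reward sequence are all hypothetical choices for illustration.

```python
import math

def regularized_leader(cumulative_rewards, eta):
    """Distribution maximizing expected cumulative reward plus entropy / eta.

    The maximizer is a softmax of eta * cumulative_rewards. A small eta keeps
    the distribution close to uniform, so decisions change slowly.
    """
    top = max(cumulative_rewards)
    weights = [math.exp(eta * (r - top)) for r in cumulative_rewards]
    total = sum(weights)
    return [w / total for w in weights]

def run(reward_rounds, eta=0.5):
    """Choose a distribution each round using only rewards seen before it."""
    cumulative = [0.0] * len(reward_rounds[0])
    policies = []
    for rewards in reward_rounds:
        policies.append(regularized_leader(cumulative, eta))
        cumulative = [c + r for c, r in zip(cumulative, rewards)]
    return policies

# Hypothetical listener: action 0 = The Beatles, action 1 = jazz.
# The listener likes The Beatles for 5 rounds, then switches to jazz.
rounds = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 10
policies = run(rounds)

for t, (p_beatles, p_jazz) in enumerate(policies):
    print(f"round {t:2d}: Beatles {p_beatles:.2f}, jazz {p_jazz:.2f}")
```

In this toy run the recommender drifts toward The Beatles, then gradually moves to jazz once the rewards change, and no single round shifts the probabilities by more than a small step.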

An issue with this approach is that the optimization problems can become too large for computers to solve when the environment has a huge number of states. To fix this issue, we force the agent to play similar actions whenever it is in similar states. We leverage tools from linear algebra, optimization, and a simple yet powerful algorithm in ML called “stochastic gradient descent” so that the agent can solve the modified optimization problems using a traditional computer. We believe our results may be enhanced further by using the power of artificial neural networks.
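
One common way to make similar states share decisions is to describe each state by a few features and learn one weight per feature, so the number of unknowns no longer grows with the number of states. The Python sketch below illustrates that idea with stochastic gradient descent on a toy problem. It is an assumption-laden illustration, not the construction used in the paper: the features, the made-up `target_score`, the learning rate, and the number of steps are all hypothetical.

```python
import random

NUM_STATES = 10_000  # hypothetical: far more states than parameters

def features(state):
    """Hypothetical two-number description of a state: a bias and a scaled index."""
    return [1.0, state / NUM_STATES]

def target_score(state):
    """Made-up quantity the agent should learn for each state."""
    return 2.0 + 3.0 * state / NUM_STATES

def predict(weights, state):
    return sum(w * x for w, x in zip(weights, features(state)))

def sgd_fit(steps=20_000, learning_rate=0.1, seed=0):
    """Fit two weights by looking at one randomly sampled state per step."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    for _ in range(steps):
        state = rng.randrange(NUM_STATES)
        error = predict(weights, state) - target_score(state)
        # The gradient of 0.5 * error**2 with respect to the weights is error * features.
        weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features(state))]
    return weights

weights = sgd_fit()
print("learned weights:", [round(w, 3) for w in weights])
print("state 4000 vs 4001:", predict(weights, 4000), predict(weights, 4001))
```

Only two numbers are learned for ten thousand states, and because neighboring states have nearly identical features, they automatically receive nearly identical scores.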

Read the paper at:

Rivera Cardoso, Adrian, He Wang, and Huan Xu. “Large Scale Markov Decision Processes with Changing Rewards.” Advances in Neural Information Processing Systems (NeurIPS) 2019.
