Let's-a go! The algorithm is motivated by a desire to explore rather than to score points (Image: Nintendo)
I wonder what will happen if I press this button? Algorithms armed with a sense of curiosity are teaching themselves to discover and solve problems they've never encountered before.
Faced with level one of Super Mario Bros, a curiosity-driven AI learned how to explore, avoid pits, and dodge and kill enemies. This might not sound impressive – algorithms have been thrashing humans at video games for a few years now – but this AI's skills were all learned thanks to an inbuilt desire to discover more about the game world.
Conventional AI algorithms are taught through positive reinforcement. They are rewarded for achieving some kind of external goal, like upping the score in a video game by one point. This encourages them to perform actions that increase their score – such as stomping on enemies in the case of Mario – and discourages them from performing actions that don't increase the score, like falling into a pit.
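To make this concrete, here is a minimal, purely illustrative sketch of score-based reinforcement learning (tabular Q-learning on a toy level; the pit positions, actions and reward numbers are all invented for illustration, not taken from the research described here):

```python
import random

# Toy level: the agent walks left to right and at each position chooses
# "advance" or "jump". Walking into a pit ends the episode with -1;
# jumping over it earns +1, mimicking how an external score shapes
# behaviour. All positions and numbers are illustrative assumptions.
PIT_POSITIONS = {2, 5}
LEVEL_LENGTH = 8
ACTIONS = ["advance", "jump"]

def step(pos, action):
    """Return (next_pos, reward, done) for one move."""
    nxt = pos + 1
    if nxt in PIT_POSITIONS:
        if action == "jump":
            return nxt, 1.0, nxt >= LEVEL_LENGTH  # cleared the pit: +1
        return nxt, -1.0, True                    # fell in: -1, episode over
    return nxt, 0.0, nxt >= LEVEL_LENGTH

def train(episodes=2000, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(p, a): 0.0 for p in range(LEVEL_LENGTH) for a in ACTIONS}
    for _ in range(episodes):
        pos, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the best-known action,
            # occasionally try a random one.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: q[(pos, x)])
            nxt, r, done = step(pos, a)
            target = r if done else r + gamma * max(q[(nxt, b)] for b in ACTIONS)
            q[(pos, a)] += alpha * (target - q[(pos, a)])
            pos = nxt
    return q

q = train()
```

After training, the table prefers "jump" just before each pit and "advance" elsewhere: the external score alone, with no notion of curiosity, is what teaches pit avoidance.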
This type of approach, called reinforcement learning, was used to create AlphaGo, the Go-playing computer from Google DeepMind that beat Korean master Lee Sedol by four games to one last year. Over thousands of real and simulated games, the AlphaGo algorithm learned to pursue strategies that led to the ultimate reward: a win.
But the real world isn't full of rewards, says Pathak, who led the work. "Instead, humans have an innate curiosity which helps them learn," he says, which may be why we are so good at mastering a wide range of skills without necessarily setting out to learn them.
So Pathak set out to give his own reinforcement learning algorithm a sense of curiosity to see whether that would be enough to let it learn a range of skills. Pathak's algorithm experienced a reward when it increased its understanding of its environment, particularly the parts that directly affected it. So, rather than looking for a reward in the game world, the algorithm was rewarded for exploring and mastering skills that led to it discovering more about the world.
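One simple way to turn "understanding the world" into a reward signal, sketched below, is to give the agent a small forward model that predicts the next observation from the current one and the action, and to pay it the model's prediction error as an intrinsic reward. Familiar transitions become predictable and stop paying out, pushing the agent toward the unfamiliar. This is a loose illustration of the idea, not Pathak's actual architecture; the class name, linear model and learning rate are all assumptions:

```python
import numpy as np

class CuriosityBonus:
    """Intrinsic reward from a learned forward model (illustrative sketch)."""

    def __init__(self, obs_dim, n_actions, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Linear forward model: predicts next_obs from [obs, one-hot action].
        self.W = rng.normal(scale=0.1, size=(obs_dim, obs_dim + n_actions))
        self.n_actions = n_actions
        self.lr = lr

    def _features(self, obs, action):
        one_hot = np.zeros(self.n_actions)
        one_hot[action] = 1.0
        return np.concatenate([obs, one_hot])

    def reward(self, obs, action, next_obs):
        """Return squared prediction error as the bonus, then update the model."""
        x = self._features(obs, action)
        pred = self.W @ x
        err = next_obs - pred
        bonus = float(err @ err)
        # Normalised gradient step: transitions seen often become
        # predictable, so their bonus shrinks over time.
        self.W += self.lr * np.outer(err, x) / (x @ x)
        return bonus

cb = CuriosityBonus(obs_dim=3, n_actions=2)
obs = np.array([1.0, 0.0, 0.0])
nxt = np.array([0.0, 1.0, 0.0])
first = cb.reward(obs, 0, nxt)
for _ in range(50):
    last = cb.reward(obs, 0, nxt)
```

Replaying the same transition repeatedly drives the bonus toward zero, while a transition the model has never seen yields a large one: novelty itself is the reward.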
This type of approach can speed up learning times and improve the efficiency of algorithms, says Jaderberg at Google's AI company DeepMind. The company used a similar approach last year to teach an AI to explore a virtual maze. Its algorithm learned much more quickly than conventional reinforcement learning approaches. "Our agent is far quicker and requires a lot less experience from the world to train, making it much more data efficient," he says.
Fast learner
Imbued with a sense of curiosity, Pathak's own AI learned to stomp on enemies and jump over pits in Mario and also learned to explore faraway rooms and walk down hallways in another game similar to Doom. It was also able to apply its newly acquired skills to further levels of Mario despite never having seen them before.
But curiosity could only take the algorithm so far in Mario. On average, it explored only 30 per cent of level one as it couldn't find a way past a series of pits that could only be overcome through a sequence of more than 15 button presses. Rather than jump to its death, the AI learned to turn back on itself and stop when it reached that point.
The AI may have been flummoxed because it had no idea that there was more of the level to explore beyond the pit, says Pathak. Nor did it learn to consistently take useful shortcuts in the game, since these led it to discover less of the level and so didn't satisfy its urge for exploration.
Pathak is now working on seeing whether robotic arms can learn through curiosity to grasp new objects. "Instead of it acting randomly, you could use this to help it move meaningfully," he says. He also plans to see whether a similar algorithm could be used in household robots similar to the Roomba vacuum cleaner.
But Jaderberg isn't so sure that this kind of algorithm is ready to be put to use just yet. "It's too early to talk about real-world applications," he says.