In the study, the UC Berkeley researchers used a video game called Overcooked, where two chefs divvy up tasks to prepare and serve meals, in this case soup, which earns them points. It’s a 2-D world, seen from above, filled with onions, tomatoes, dishes and a stove with pots. At each time step, each virtual chef can stand still, interact with whatever is in front of it, or move up, down, left or right.
The researchers first collected data from pairs of people playing the game. Then they trained AIs using offline RL or one of three other methods for comparison. (In all methods, the AIs were built on a neural network, a software architecture intended to roughly mimic how the brain works.) In the first method, the AI simply imitated the humans. In the second, it imitated only the best human performances. The third ignored the human data entirely and had AIs practice with each other. The fourth was offline RL, in which the AI does more than imitate: it stitches together the best bits of the behavior it observes, allowing it to outperform any single demonstration. It uses a kind of counterfactual reasoning, predicting what score it would have earned had it taken different actions in certain situations, then adapting accordingly.
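To see how offline RL can outperform the behavior it learns from, consider a minimal sketch. This is not the researchers' system: it is a hypothetical toy problem with tabular Q-learning, where two logged trajectories are each incomplete, yet replaying them lets the learner stitch the useful half of one onto the useful half of the other.

```python
from collections import defaultdict

# Hypothetical toy dataset: (state, action, reward, next_state, done).
# Trajectory A wanders from s0 to s1 and back, never scoring.
# Trajectory B starts at s1 and reaches the goal.
dataset = [
    ("s0", "right", 0.0, "s1", False),
    ("s1", "left",  0.0, "s0", False),
    ("s1", "right", 0.0, "s2", False),
    ("s2", "right", 1.0, "goal", True),
]

gamma, alpha = 0.9, 0.5
actions = ["left", "right"]
Q = defaultdict(float)  # value estimate for each (state, action) pair

# Offline Q-learning: repeatedly replay the fixed dataset.
# No new gameplay is ever collected -- learning is purely from logs.
for _ in range(200):
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

# The greedy policy stitches A's first step onto B's path:
# at s1 it prefers "right" (toward the goal), even though the
# trajectory that visited s0 went "left" from s1.
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in ["s0", "s1", "s2"]}
print(policy)
```

Pure imitation would reproduce trajectory A's dead-end loop from s0; the counterfactual value estimates let the learner ask what would have happened had it turned toward the goal instead, which is the "piecing together" the article describes.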