We propose S4WM, the first world model compatible with
parallelizable SSMs including S4 and its variants. Furthermore, we
present an extensive comparative study of prominent world model
backbones in a variety of memory-demanding environments with
sequence lengths of up to 2000. Our findings demonstrate the
superior long-term memory capabilities of S4WM compared with
TSSM (Transformer-based) and RSSM (RNN-based), two seminal
world models.
World models are a fundamental component in model-based reinforcement learning (MBRL). To perform temporally extended and consistent simulations of the future in partially observable environments, world models need to possess long-term memory. However, state-of-the-art MBRL agents, such as Dreamer, predominantly employ recurrent neural networks (RNNs) as their world model backbone, which have limited memory capacity. In this paper, we seek to explore alternative world model backbones for improving long-term memory. In particular, we investigate the effectiveness of Transformers and Structured State Space Sequence (S4) models, motivated by their remarkable ability to capture long-range dependencies in low-dimensional sequences and their complementary strengths. We propose S4WM, the first world model compatible with parallelizable SSMs including S4 and its variants. By incorporating latent variable modeling, S4WM can efficiently generate high-dimensional image sequences through latent imagination. Furthermore, we extensively compare RNN-, Transformer-, and S4-based world models across four sets of environments, which we have tailored to assess crucial memory capabilities of world models, including long-term imagination, context-dependent recall, reward prediction, and memory-based reasoning. Our findings demonstrate that S4WM outperforms Transformer-based world models in terms of long-term memory, while exhibiting greater efficiency during training and imagination. These results pave the way for the development of stronger MBRL agents.
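To make the latent imagination procedure concrete, the following is a minimal sketch, not the paper's implementation: the encoder, dynamics model, prior head, and decoder are simplified placeholders, and all sizes are assumptions (a GRU stands in only to keep the rollout loop short; the paper's backbone is an S4 stack, which also supports a recurrent mode for autoregressive imagination). Observations from a context prefix are encoded into compact latents, future latents are rolled out autoregressively without decoding intermediate images, and frames are decoded only from the imagined latents.

```python
import torch
import torch.nn as nn

B, T_context, T_imagine, D = 2, 100, 500, 32   # assumed batch, lengths, latent dim

encoder = nn.Linear(3 * 64 * 64, D)            # placeholder for a CNN encoder
dynamics = nn.GRU(D, D, batch_first=True)      # placeholder for the sequence backbone
prior = nn.Linear(D, D)                        # predicts the next latent from the state
decoder = nn.Linear(D, 3 * 64 * 64)            # placeholder for a CNN decoder

context_obs = torch.randn(B, T_context, 3 * 64 * 64)
z = encoder(context_obs)                       # latents inferred from observed context
out, h = dynamics(z)                           # absorb the context into the hidden state

z_t = prior(out[:, -1:])                       # first imagined latent
imagined = [z_t]
for _ in range(T_imagine - 1):                 # roll forward purely in latent space
    out, h = dynamics(z_t, h)
    z_t = prior(out)
    imagined.append(z_t)
imagined = torch.cat(imagined, dim=1)          # (B, T_imagine, D)
frames = decoder(imagined)                     # decode only for visualization/evaluation
```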
S4WM efficiently models the long-range dependencies of
environment dynamics in a compact latent space, using a stack of
S4 blocks. This crucially allows fully parallelized training as
well as fast imagination and planning. We also find that adding a
final MLP to each S4 block can improve generation quality. The S4
layer in each block can be replaced with other parallelizable
SSMs such as S5.
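As a minimal sketch of this block structure (a hypothetical simplification, not the paper's code: the sequence layer below is a basic diagonal SSM in the spirit of S4D, computed with a sequential scan for clarity rather than the parallel convolution or scan used in practice, and all module names and sizes are assumptions):

```python
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Simplified diagonal SSM layer (S4D-style stand-in for the S4 layer).

    Per-channel state update: x_k = A_bar * x_{k-1} + B_bar * u_k,
    output: y_k = Re(C x_k) + D * u_k, with zero-order-hold discretization.
    """

    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.d_model, self.d_state = d_model, d_state
        # Diagonal state matrix A = -exp(log_A_real) + i * A_imag (stable real part).
        self.log_A_real = nn.Parameter(torch.zeros(d_model, d_state))
        self.A_imag = nn.Parameter(torch.arange(d_state).float().repeat(d_model, 1))
        self.C = nn.Parameter(torch.randn(d_model, d_state, 2) * 0.5)  # complex C
        self.D = nn.Parameter(torch.ones(d_model))                     # skip term
        self.log_dt = nn.Parameter(torch.full((d_model,), -3.0))       # step size

    def forward(self, u):                                  # u: (B, L, d_model)
        dt = self.log_dt.exp().unsqueeze(-1)               # (d_model, 1)
        A = -self.log_A_real.exp() + 1j * self.A_imag      # (d_model, d_state)
        A_bar = torch.exp(A * dt)                          # ZOH discretization
        B_bar = (A_bar - 1) / A                            # with B fixed to 1
        C = torch.view_as_complex(self.C)                  # (d_model, d_state)

        x = torch.zeros(u.shape[0], self.d_model, self.d_state,
                        dtype=torch.cfloat, device=u.device)
        ys = []
        for k in range(u.shape[1]):                        # sequential scan for clarity
            x = A_bar * x + B_bar * u[:, k, :, None]
            ys.append((x * C).sum(-1).real + self.D * u[:, k])
        return torch.stack(ys, dim=1)                      # (B, L, d_model)


class S4Block(nn.Module):
    """Pre-norm residual block: SSM layer followed by a final MLP."""

    def __init__(self, d_model):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ssm = DiagonalSSM(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z):                                  # z: (B, L, d_model)
        z = z + self.ssm(self.norm1(z))
        return z + self.mlp(self.norm2(z))


# A stack of blocks models dynamics over the compact latent sequence.
backbone = nn.Sequential(*[S4Block(d_model=128) for _ in range(4)])
latents = torch.randn(2, 100, 128)                         # (batch, time, latent dim)
out = backbone(latents)
```

Because the rest of the block is agnostic to the sequence-mixing layer, the DiagonalSSM placeholder here could be swapped for any other parallelizable SSM layer, which is how variants such as S5WM arise.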
S4WM demonstrates superior generation quality for up to 500
steps, with only minor errors in object positions. RSSM and TSSM
make many mistakes, struggling to memorize the objects and wall
colors.
S4WM is able to accurately predict rewards within imagination.
TSSM has limited success when observing the full sequence, but
fails to imagine future rewards accurately. RSSM completely
fails, and its reward prediction is close to random guessing.
Our visualization of model imagination reveals that the failure
of TSSM and RSSM is mainly due to their inability to keep track
of the agent's position.
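As an illustration of how reward prediction is typically set up in latent-space world models (a hedged sketch; the head architecture, latent dimension, and training targets below are assumptions rather than the paper's exact configuration), a small MLP head maps each imagined latent state to a scalar reward:

```python
import torch
import torch.nn as nn

latent_dim = 128                                   # assumed latent size
reward_head = nn.Sequential(                       # small MLP reward head
    nn.Linear(latent_dim, 256), nn.ELU(),
    nn.Linear(256, 1),
)

imagined_latents = torch.randn(2, 500, latent_dim)             # (batch, steps, latent)
predicted_rewards = reward_head(imagined_latents).squeeze(-1)  # (batch, steps)

# Training regresses the head toward observed rewards, e.g. with an MSE loss.
target_rewards = torch.zeros_like(predicted_rewards)           # dummy targets
loss = nn.functional.mse_loss(predicted_rewards, target_rewards)
```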
S4WM is able to keep its memory up to date as keys are picked up
or consumed, leading to accurate predictions of future door
states (measured by generation MSE), even when there are many
keys.
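For reference, a per-step generation MSE of this kind can be computed directly from imagined and ground-truth frames (a minimal sketch; the shapes are illustrative):

```python
import torch

imagined_frames = torch.rand(2, 500, 3, 64, 64)    # (batch, steps, C, H, W)
ground_truth = torch.rand(2, 500, 3, 64, 64)

# Average over batch and pixels, keep the time axis to see how errors grow.
mse_per_step = ((imagined_frames - ground_truth) ** 2).mean(dim=(0, 2, 3, 4))
print(mse_per_step.shape)                          # torch.Size([500])
```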
S4WM is a general framework compatible with S4 and its variants.
When instantiated with S5 (denoted S5WM), it outperforms RSSM by
a large margin, achieving state-of-the-art results on the Memory
Maze offline probing benchmark.