We propose S4WM, the first world model compatible with
parallelizable SSMs including S4 and its variants. Furthermore, we
present an extensive comparative study of prominent world model
backbones in a variety of memory-demanding environments with
sequence lengths of up to 2000. Our findings demonstrate the
superior long-term memory capabilities of S4WM compared with
TSSM (Transformer-based) and RSSM (RNN-based), two seminal
world models.
World models are a fundamental component in model-based reinforcement learning (MBRL). To perform temporally extended and consistent simulations of the future in partially observable environments, world models need to possess long-term memory. However, state-of-the-art MBRL agents, such as Dreamer, predominantly employ recurrent neural networks (RNNs) as their world model backbone, which have limited memory capacity. In this paper, we seek to explore alternative world model backbones for improving long-term memory. In particular, we investigate the effectiveness of Transformers and Structured State Space Sequence (S4) models, motivated by their remarkable ability to capture long-range dependencies in low-dimensional sequences and their complementary strengths. We propose S4WM, the first world model compatible with parallelizable SSMs including S4 and its variants. By incorporating latent variable modeling, S4WM can efficiently generate high-dimensional image sequences through latent imagination. Furthermore, we extensively compare RNN-, Transformer-, and S4-based world models across four sets of environments, which we have tailored to assess crucial memory capabilities of world models, including long-term imagination, context-dependent recall, reward prediction, and memory-based reasoning. Our findings demonstrate that S4WM outperforms Transformer-based world models in terms of long-term memory, while exhibiting greater efficiency during training and imagination. These results pave the way for the development of stronger MBRL agents.
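To make the latent imagination procedure concrete, the following is a minimal sketch, not the paper's implementation: the encoder, dynamics model, prior head, and decoder are simplified placeholders, and all sizes are assumptions (a GRU stands in only to keep the rollout loop short; the paper's backbone is an S4 stack, which also supports a recurrent mode for autoregressive imagination). Observations from a context prefix are encoded into compact latents, future latents are rolled out autoregressively without decoding intermediate images, and frames are decoded only from the imagined latents.

```python
import torch
import torch.nn as nn

B, T_context, T_imagine, D = 2, 100, 500, 32   # assumed batch, lengths, latent dim

encoder = nn.Linear(3 * 64 * 64, D)            # placeholder for a CNN encoder
dynamics = nn.GRU(D, D, batch_first=True)      # placeholder for the sequence backbone
prior = nn.Linear(D, D)                        # predicts the next latent from the state
decoder = nn.Linear(D, 3 * 64 * 64)            # placeholder for a CNN decoder

context_obs = torch.randn(B, T_context, 3 * 64 * 64)
z = encoder(context_obs)                       # latents inferred from observed context
out, h = dynamics(z)                           # absorb the context into the hidden state

z_t = prior(out[:, -1:])                       # first imagined latent
imagined = [z_t]
for _ in range(T_imagine - 1):                 # roll forward purely in latent space
    out, h = dynamics(z_t, h)
    z_t = prior(out)
    imagined.append(z_t)
imagined = torch.cat(imagined, dim=1)          # (B, T_imagine, D)
frames = decoder(imagined)                     # decode only for visualization/evaluation
```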
S4WM efficiently models the long-range dependencies of
environment dynamics in a compact latent space, using a stack of
S4 blocks. This crucially allows fully parallelized training as
well as fast imagination and planning. We also find that adding a
final MLP to each S4 block can improve generation quality. The S4
layer in each block can be replaced with other parallelizable
SSMs such as S5.
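As a minimal sketch of this block structure (a hypothetical simplification, not the paper's code: the sequence layer below is a basic diagonal SSM in the spirit of S4D, computed with a sequential scan for clarity rather than the parallel convolution or scan used in practice, and all module names and sizes are assumptions):

```python
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Simplified diagonal SSM layer (S4D-style stand-in for the S4 layer).

    Per-channel state update: x_k = A_bar * x_{k-1} + B_bar * u_k,
    output: y_k = Re(C x_k) + D * u_k, with zero-order-hold discretization.
    """

    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.d_model, self.d_state = d_model, d_state
        # Diagonal state matrix A = -exp(log_A_real) + i * A_imag (stable real part).
        self.log_A_real = nn.Parameter(torch.zeros(d_model, d_state))
        self.A_imag = nn.Parameter(torch.arange(d_state).float().repeat(d_model, 1))
        self.C = nn.Parameter(torch.randn(d_model, d_state, 2) * 0.5)  # complex C
        self.D = nn.Parameter(torch.ones(d_model))                     # skip term
        self.log_dt = nn.Parameter(torch.full((d_model,), -3.0))       # step size

    def forward(self, u):                                  # u: (B, L, d_model)
        dt = self.log_dt.exp().unsqueeze(-1)               # (d_model, 1)
        A = -self.log_A_real.exp() + 1j * self.A_imag      # (d_model, d_state)
        A_bar = torch.exp(A * dt)                          # ZOH discretization
        B_bar = (A_bar - 1) / A                            # with B fixed to 1
        C = torch.view_as_complex(self.C)                  # (d_model, d_state)

        x = torch.zeros(u.shape[0], self.d_model, self.d_state,
                        dtype=torch.cfloat, device=u.device)
        ys = []
        for k in range(u.shape[1]):                        # sequential scan for clarity
            x = A_bar * x + B_bar * u[:, k, :, None]
            ys.append((x * C).sum(-1).real + self.D * u[:, k])
        return torch.stack(ys, dim=1)                      # (B, L, d_model)


class S4Block(nn.Module):
    """Pre-norm residual block: SSM layer followed by a final MLP."""

    def __init__(self, d_model):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ssm = DiagonalSSM(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z):                                  # z: (B, L, d_model)
        z = z + self.ssm(self.norm1(z))
        return z + self.mlp(self.norm2(z))


# A stack of blocks models dynamics over the compact latent sequence.
backbone = nn.Sequential(*[S4Block(d_model=128) for _ in range(4)])
latents = torch.randn(2, 100, 128)                         # (batch, time, latent dim)
out = backbone(latents)
```

Because the rest of the block is agnostic to the sequence-mixing layer, the DiagonalSSM placeholder here could be swapped for any other parallelizable SSM layer, which is how variants such as S5WM arise.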
S4WM demonstrates superior generation quality for up to 500
steps, with only minor errors in object positions. RSSM and TSSM
make many mistakes, struggling to memorize the objects and wall
colors.
S4WM is able to accurately predict rewards within imagination.
TSSM has limited success when observing the full sequence, but
fails to imagine future rewards accurately. RSSM completely
fails, and its reward prediction is close to random guessing.
Our visualization of model imagination reveals that the failure
of TSSM and RSSM is mainly due to their inability to keep track
of the agent's position.
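As an illustration of how reward prediction is typically set up in latent-space world models (a hedged sketch; the head architecture, latent dimension, and training targets below are assumptions rather than the paper's exact configuration), a small MLP head maps each imagined latent state to a scalar reward:

```python
import torch
import torch.nn as nn

latent_dim = 128                                   # assumed latent size
reward_head = nn.Sequential(                       # small MLP reward head
    nn.Linear(latent_dim, 256), nn.ELU(),
    nn.Linear(256, 1),
)

imagined_latents = torch.randn(2, 500, latent_dim)             # (batch, steps, latent)
predicted_rewards = reward_head(imagined_latents).squeeze(-1)  # (batch, steps)

# Training regresses the head toward observed rewards, e.g. with an MSE loss.
target_rewards = torch.zeros_like(predicted_rewards)           # dummy targets
loss = nn.functional.mse_loss(predicted_rewards, target_rewards)
```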
S4WM is able to keep its memory up to date as keys are picked up
or consumed, leading to accurate predictions of future door
states (measured by generation MSE), even when there are many
keys.
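For reference, a per-step generation MSE of this kind can be computed directly from imagined and ground-truth frames (a minimal sketch; the shapes are illustrative):

```python
import torch

imagined_frames = torch.rand(2, 500, 3, 64, 64)    # (batch, steps, C, H, W)
ground_truth = torch.rand(2, 500, 3, 64, 64)

# Average over batch and pixels, keep the time axis to see how errors grow.
mse_per_step = ((imagined_frames - ground_truth) ** 2).mean(dim=(0, 2, 3, 4))
print(mse_per_step.shape)                          # torch.Size([500])
```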
S4WM is a general framework compatible with S4 and its variants.
When instantiated with S5 (denoted S5WM), it outperforms RSSM by
a large margin, achieving state-of-the-art results on the Memory
Maze offline probing benchmark.