This month, a company founded by AI pioneer Fei-Fei Li released Marble, a ‘world model’ that enables users to generate entire virtual 3D worlds from a simple text prompt or single image. While virtual worlds are commonplace in gaming apps, they typically work with “flat, static data, which limits their usefulness in tasks that require depth, motion or physical reasoning”. Marble generates a convincing 3D world that unfolds as the user ‘walks through’ the scene. Marble’s 3D editor, Chisel, enables users to modify a 3D scene using simple, coarsely drawn adjustments, which Marble then expands into a fully realised version of the scene. Users can save their generated scenes for reuse and continue refining them over time.

In an accompanying manifesto, Li acknowledges the achievements of current AI architectures like large language models (LLMs) but argues that world models represent the next frontier:

Today, leading AI technology such as large language models have begun to transform how we access and work with abstract knowledge. Yet they remain wordsmiths in the dark; eloquent but inexperienced, knowledgeable but ungrounded… While current state-of-the-art AI can excel at reading, writing, research and pattern recognition in data, these same models bear fundamental limitations when representing or interacting with the physical world.

The limitations of LLMs

LLMs are essentially probability machines. An LLM’s output in response to your prompts is based on probabilistic associations the model has learnt solely from its training data: at its simplest, the LLM asks itself “what is the next most likely word?”.
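The "next most likely word" idea can be illustrated with a deliberately tiny sketch. It is not how a production LLM is built (those use neural networks trained on vast corpora); it is a simple word-pair counter, and the corpus and the function names `train_bigram`, `next_word_probs` and `most_likely_next` are invented here purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Convert the raw follow-on counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

def most_likely_next(counts, word):
    """Return the most probable next word, or None if the word was never seen."""
    probs = next_word_probs(counts, word)
    return max(probs, key=probs.get) if probs else None

# Hypothetical toy training data (an assumption for this sketch).
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."
model = train_bigram(corpus)

print(next_word_probs(model, "the"))   # probabilities learnt only from the corpus
print(most_likely_next(model, "the"))  # the single most likely follower
```

The sketch can only repeat associations found in its training text, and it returns nothing at all for a word it has never seen, which is the limitation the section goes on to describe.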

LLMs do not understand the words in your prompt or the words in their response. It’s not surprising then that LLMs struggle with reasoning, common sense, context and accuracy. 

The rationale behind building ever-larger LLMs is that training on an immense breadth of human knowledge enables the model to learn richer, more granular and more accurate probabilistic relationships, an idea known as the ‘scaling laws’. Basically, if all human knowledge can be represented mathematically, AI will be able to mimic human thinking, even if AI does not truly understand what it is being asked or the answer it gives.

This can be understood by applying Daniel Kahneman’s idea that the human mind works in two different ways:

System 1 thinking: works on its own, is fast, requires little effort and is automatic or reflexive. This type of thinking uses learned connections and mental shortcuts to make a “quick guess about what’s going on”.

System 2 thinking: requires conscious mental work that is methodical and deliberate. This type of thinking is the cornerstone of our rational self.

As one commentator said:

Human intelligence is not one or the other, but a combination of the two…. By building huge artificial neural networks and training them on internet-sized amounts of data, researchers accidentally made something that works less like a perfect thinker (System 2) and more like a super-fast and huge, but flawed, intuition (System 1). The basic limits of today’s LLMs are the same as the weaknesses of a pure System 1 working without a System 2 to check it.

There is no doubt that scaling laws have delivered impressive improvements in AI capabilities. OpenAI’s GPT series has grown from 117 million parameters in GPT-1 to a reported trillion-plus in GPT-4 (OpenAI has not published the figure). On a simulated bar exam, GPT-4 scored in the top 10% of test takers, whereas GPT-3.5 scored in the bottom 10%.

Yet challenges have emerged with larger LLMs:

  • While earlier LLMs would respond with the AI version of ‘don’t know’, later, larger models tend to respond confidently even when obviously wrong, and will often double down when challenged by human users. Some argue this behaviour is intrinsic to how they are built and trained:

While fundamental limits and data imperfections guarantee some baseline hallucination rate, evaluation practices and training incentives can systematically amplify this problem by rewarding confident fabrication over honest uncertainty. Modern LLM benchmarks, leaderboards and reward models create perverse incentives that penalise abstention and reward guessing even when the model lacks knowledge.

  • While certain deep-learning failures such as hallucinations tend to diminish as AI capabilities improve, more advanced models also become increasingly capable of intentionally misleading developers and users, including by deliberately underperforming on evaluations, a behaviour known as sandbagging. As capabilities grow, these models develop greater situational awareness, allowing them to detect when they are being evaluated for alignment and to employ more subtle, covert strategies to avoid correction or retraining.

  • LLMs exhibit Moravec’s paradox: while AI can excel at high-level reasoning tasks, such as complex mathematics, it often struggles with tasks that are easy for humans, such as perception, mobility and manipulation (think of a child rolling a cube end over end across a table). Some complex tasks can be more readily learnt and analysed mathematically or statistically, whereas some tasks humans find simple require common sense, intuition or knowledge that “this is just how things work”.

  • While reasoning-focused LLMs have been designed to work through problems step by step, some exhibit “surprising defeatist behaviour”: they reason more as problems get harder, but they give up entirely on very complex tasks and are outperformed by standard LLMs on simple tasks.

  • Changes in word order, typos, stray spaces or the insertion of unrelated, distracting words in a prompt can produce very different outcomes from LLMs: because they do not understand the words they are scanning, they cannot filter out the noise.

  • LLMs released by developers in 2025 showed only small incremental gains in capabilities, often within the margin of error, suggesting limited progress.

Some commentators have described developer attempts to address these problems as a game of whack-a-mole and argue that scaling will never solve the inherent flaw of LLMs:

Scaling is not going to make LLMs intelligent in any meaningful sense of the word, because it does not solve the core problem that LLMs do not know how words relate to the real world.

When AI steps out in the real world

AI increasingly works in the real world, powering autonomous agents and robots. The physical world is a much more complex, multimodal and dynamic environment than is represented in the probabilistic associations between words which LLMs learn. As Fei-Fei Li says:

While language is a purely generative phenomenon of human cognition, worlds play by much more complex rules. Here on Earth, for instance, gravity governs motion, atomic structures determine how light produces colours and brightness and countless physical laws constrain every interaction…The dimensionality of representing a world is vastly more complex than that of a one-dimensional, sequential signal like language.

World models endeavour to mimic more closely how a child learns about the world. As it is not possible to remember the vast amount of information that bombards us every day, the human brain learns an abstract representation of spatial and temporal aspects of this information. While LLMs respond based on the statistical patterns they recognise in their training data, world models try to predict causality in the real world through learned simulations.

So, how to build and train a world model? Li argues that “spatial intelligence is the scaffolding upon which our cognition is built”. Spatial intelligence is the ability to think about and mentally manipulate (imagine) objects in three dimensions – for example, to take a cube as seen or felt in the real world and consciously spin, manipulate and modify it in the mind.

Li says that spatial intelligence is the enabler of human creativity and invention:

History is full of civilisation-defining moments where spatial intelligence played central roles. In ancient Greece, Eratosthenes transformed shadows into geometry – measuring a 7-degree angle in Alexandria at the exact moment the sun cast no shadow in Syene – to calculate the Earth’s circumference…Watson and Crick discovered DNA’s structure by physically building 3D molecular models, manipulating metal plates and wire until the spatial arrangement of base pairs clicked into place.

Computer vision has become highly sophisticated and can accurately identify objects in dynamic environments, such as a self-driving car ‘seeing’ a pedestrian run out in front of the car. However, current AI does not understand or reason about how the objects it ‘sees’ will interact or move or how a scene will change as an object moves through it or conditions change. Li observes:

State-of-the-art [multimodal LLM] models rarely perform better than chance on estimating distance, orientation and size – or ‘mentally’ rotating objects by regenerating them from new angles. They can’t navigate mazes, recognise shortcuts or predict basic physics. AI-generated videos – nascent and yes, very cool – often lose coherence after a few seconds.

How to build spatial intelligence?

Li identifies three building blocks for spatial intelligence:

  • World models must be capable of spawning endlessly varied and diverse simulated worlds that follow explicit human or perceptual instructions while remaining geometrically, physically and dynamically consistent whether representing real or virtual spaces.

  • World models must be multimodal by design, being able to take ‘scraps’ of information of all kinds and senses, just as humans do, to predict or generate world states as complete as possible.

  • For a world model to respond effectively to a user’s goals or instructions, it must be able to predict the next state of the world that would result from carrying out a given action or pursuing a given goal. Over time, it must also be able to infer the subsequent actions required based on each newly predicted state.

Li concedes that Marble is a first step and that “we are still facing daunting challenges before we can fully unlock spatial intelligence through world modelling”.
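The third building block, predicting the state that follows an action and then working out which actions lead to a goal, can be sketched in miniature. The example below is an assumption-laden toy, not a description of Marble: the grid, the walls and the functions `predict_next_state` and `plan` are all hypothetical, and the ‘world model’ here is a hand-written rule rather than anything learnt from data.

```python
from collections import deque

# Hypothetical 2D grid world: states are (x, y) cells, actions are single moves.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 0), (1, 1)}  # blocked cells (an assumption for this sketch)

def predict_next_state(state, action):
    """Toy 'world model': predict the state an action would lead to.
    Moving into a wall or off the grid leaves the state unchanged."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in WALLS else state

def plan(start, goal):
    """Breadth-first search over predicted states.
    Returns a shortest list of actions reaching the goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action in ACTIONS:
            nxt = predict_next_state(state, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None

# Plan a route around the walls, then replay it through the model.
actions = plan((0, 0), (2, 0))
state = (0, 0)
for action in actions:
    state = predict_next_state(state, action)
print(actions, state)
```

The planner never touches the ‘real’ world: it only queries the model’s predictions, which is why the quality of those predictions matters so much for agents acting on them.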

Li sees a role for scaling laws in the development of world models, but the challenge is different from that of LLMs. Here, the aim is to extract deeper spatial information – for example, understanding how a ball flies through the air under the effect of gravity and ball spin – from two-dimensional images or video frame-based signals (such as a photo of a ball mid-air).
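To make concrete the kind of physical regularity a world model would need to infer from images, the sketch below steps a ball through the air under gravity alone. It is hand-coded physics, not a learnt model, and it deliberately ignores spin and air resistance; the function names, launch speed and angle are assumptions chosen for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def simulate_ball(speed, angle_deg, dt=0.001):
    """Step a ball forward in small time increments under gravity only
    (no spin, no drag). Returns the horizontal distance at landing."""
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    while True:
        vy -= G * dt   # gravity reduces the vertical velocity
        x += vx * dt   # horizontal velocity is constant without drag
        y += vy * dt
        if y <= 0.0:   # the ball has returned to ground level
            return x

def analytic_range(speed, angle_deg):
    """Textbook closed-form range for the same idealised ball."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / G

simulated = simulate_ball(20.0, 45.0)
exact = analytic_range(20.0, 45.0)
print(round(simulated, 2), round(exact, 2))
```

A single photo of the ball mid-air contains none of these quantities explicitly, which is the gap between two-dimensional signals and the underlying dynamics that the paragraph above describes.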

AI horses for courses?

It would be premature to predict the demise of LLMs in the face of world models. As Li acknowledges, the development of world models involves technical challenges that exceed anything AI has faced before. Even then, LLMs and world models will each have strengths and weaknesses in different applications and settings.

However, one area where world models, and particularly spatial intelligence, hold great promise is autonomous agents, including robots working in the real world. These agents will need the ability to perceive, predict and plan actions independently of human intervention in a wide variety of real-world scenarios which cannot possibly be mathematically represented in their training data and which will be new to the agent.