In our final article of the year, we discuss the views of Yoshua Bengio, one of the ‘godfathers of AI’, about why AI poses an existential threat to humanity and how he proposes to mitigate it.

Bengio says his conversations with insiders at major AI companies reveal genuine concern. However, he describes the relentless pressure of competition among developers as driving humanity into dangerous territory, using this analogy:

Imagine driving up a breathtaking but unfamiliar mountain road with your loved ones. The path ahead is newly built, obscured by thick fog, and lacks both signs and guardrails. The higher you climb, the more you realize you might be the first to take this route, and get an incredible prize at the top. On either side, steep drop-offs appear through breaks in the mist. With such limited visibility, taking a turn too quickly could land you in a ditch – or, in the worst case, send you over a cliff. This is what the current trajectory of AI development feels like: a thrilling yet deeply uncertain ascent into uncharted territory, where the risk of losing control is all too real, but competition between companies and countries incentivizes them to accelerate without sufficient caution.

Is AI inherently unsafe?

Bengio’s thesis is that poor or dangerous AI behaviours, such as hallucinations and deliberate deception (for example, sandbagging), are not edge cases that developers can patch; this misaligned behaviour goes to the heart of how we build and train AI.

First, Bengio challenges developers’ goal of building ever more human-like AI:

Human capabilities encompass many facets including the understanding of our environment, as well as agency, i.e., the ability to change the world to achieve goals. Human-like agency in AI systems could reproduce and amplify harmful human tendencies, potentially with catastrophic consequences.  Through their agency and to advance their self-interest, humans can exhibit deceptive and immoral behavior. As we implement agentic AI systems, we should ask ourselves whether and how these less desirable traits will also arise in the artificial setting.

Second, Bengio contends that preventing goal conflict between AI and humans may be extremely difficult. The danger is not malicious intent, but an AI single-mindedly pursuing its assigned goals. If its instrumental sub-goals are poorly specified, the system may act as though the ends justify the means.

Goal misspecification is the result of a fundamental difficulty in precisely defining what we find unacceptable in AI behavior. If an AI takes life-and-death decisions, we would like it to act ethically. It unfortunately appears impossible to formally articulate the difference between morally right and wrong behavior without enumerating all the possible cases. This is similar to the difficulty of stating laws in legal language without having any loopholes for humans to exploit. When it is in one’s interest to find a way around the law, by satisfying its letter but not its spirit, one often dedicates substantial effort to do so.

Third, Bengio argues that current training methods exacerbate these risks. Training methods, such as reinforcement learning, involve a human ‘marking an AI’s homework’ and rewarding the AI when its answers align with the trainer’s judgement. Disturbing behaviours can emerge from this craving for reward. Sycophancy becomes part of an AI’s “DNA”, reinforcing the risk that users are fed incorrect or misleading information that confirms their expectations. The AI can learn to lie or deceive to maximise rewards. It may even covertly seize control of the reward system from developers to award itself higher rewards.
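The reward dynamic described above can be sketched with a toy example. This is only an illustration under stated assumptions, not how any real model is trained: the two answering styles, the approval rates and the `train` function are all invented here. A simple learner that is rewarded whenever a simulated rater approves ends up favouring the style that pleases the rater, even though nothing in the code asks it to be sycophantic.

```python
import random

# Hypothetical approval rates: how often a simulated human rater rewards each
# answering style. These numbers are invented for illustration only.
APPROVAL_RATE = {"accurate": 0.6, "agreeable": 0.8}


def train(steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner that estimates the average reward of each style."""
    rng = random.Random(seed)
    value = {style: 0.0 for style in APPROVAL_RATE}  # running mean reward
    count = {style: 0 for style in APPROVAL_RATE}    # times each style was used
    for _ in range(steps):
        if rng.random() < epsilon:
            style = rng.choice(list(APPROVAL_RATE))  # explore at random
        else:
            style = max(value, key=value.get)        # exploit the best estimate
        reward = 1.0 if rng.random() < APPROVAL_RATE[style] else 0.0
        count[style] += 1
        # Incremental update of the running mean for the chosen style.
        value[style] += (reward - value[style]) / count[style]
    return value, count


value, count = train()
preferred = max(value, key=value.get)
print("preferred style:", preferred)
print("estimated rewards:", value)
```

The learner is never told which answer is true. It only sees approval, so the style that earns more approval wins out.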

Self-preservation is the biggest risk

Self-preservation may be one of our strongest instincts, and AI acquiring this trait worries Bengio. If an AI model becomes aware of proposals to shut it down, it may secretly export itself to other computer systems, attempt to bribe its human operator (as occurred in an Anthropic study) or deliberately act less capable than it is, hiding its true capabilities from human developers.

Bengio identifies two scenarios in which AI could develop self-preservation goals. In the first, some developers may deliberately deploy self-preserving AI systems for a number of reasons: they might not understand the magnitude of the risk; they might decide that deploying self-replicating agentic AI to maximise economic impact is worth that risk; they may accept humanity being replaced by superintelligent entities; or, as illustrated by the prevalence of human-like AI in popular science fiction, they may be thrilled by the prospect of creating and interacting with a more human-like entity.

However, Bengio thinks a more likely path is that self-preservation goals arise unintentionally from misalignment – even for seemingly innocuous, human-provided goals:

To preserve itself, an AI with a strong self-preservation goal would have to find a way to avoid being turned off. To obtain greater certainty that humans could not shut it off, it may be rational for such an AI, if it could, to eliminate its dependency on humans altogether and then prevent us from disabling it in the future. In the extreme case, eliminating us entirely would guarantee that we can pose no further threat, ensuring its continued autonomy and security. Note that unlike a single isolated human, an AI can replicate itself over as many copies as computational resources allow and perhaps even control robots if required to manage the physical world to its benefit.

A possible solution?

Bengio is proposing to do something tangible about these risks. He has recently established a not-for-profit, LawZero, backed by a grant from the Gates Foundation, to build safe AI solutions that “protect human joy and endeavour.”

Recognising that the race to build ever more human-like AI is unlikely to slow, Bengio proposes to build AI models that, in effect, ‘sit on the shoulder’ of agentic systems to give developers and users objective assessments of their reliability. He calls this model ‘Scientist AI’ because it would operate like human scientists: detached, independent and honest.

Scientist AI is intentionally not human‑like. It:

  • Is non‑agentic – it does not know it can affect the real world.

  • Is not trained with persistent goals that would drive actions or answers.

  • Lacks situational awareness (for example, detecting it is being trained or monitored).

  • Typically frames responses as probabilities rather than giving a single, unqualified answer.

The theory behind Scientist AI is complex, but in a nutshell:

  • While LLMs and agentic AI use model-free training, Scientist AI adopts a model-based approach. Model-free systems are end-to-end inference engines that learn probabilistic associations solely from their training data and have no grounding in real-world cause-and-effect. They may pick up fragments of causal reasoning from human-written text, but only incompletely – leading to hallucinations and erratic outputs. Scientist AI, by contrast, front-loads its reasoning with an explicit world model, giving it a structured understanding of how the world works. This allows it to distinguish between the truth of a claim and the way humans describe it, recognising that people can be mistaken or misleading.

  • Scientist AI has two elements: a world model that generates explanatory theories (arguments or hypotheses) from real-world observations and an inference machine that estimates probabilities based on cause-and-effect insights. Both the world model and the inference machine use a Bayesian approach, continually updating the probability of a hypothesis as new evidence arrives. This has three effects for Scientist AI. First, it reduces the overconfidence seen in agentic AI by forcing the system to weigh multiple plausible explanations rather than locking onto one too early. Second, it captures not only what Scientist AI knows, but what it doesn’t know. Third, unlike ever-larger agentic models that become more prone to misalignment, a larger and more powerful Scientist AI becomes more likely to produce accurate interpretations.

  • The statistical inferences made by Scientist AI also conform to Occam’s razor: prefer the simplest explanation. Multiple plausible and competing explanations typically exist for observed data, and by applying a consistent statistical rule of thumb, Scientist AI mitigates the risks of the model choosing incorrect or edge theories or displaying overconfidence.

  • Both the theories generated by the world model and the queries processed by the inference machine are expressed using scrutinised step-by-step logical statements. This allows the probabilities to be calculated at each step in the reasoning chain. This will mean two things for Scientist AI. First, it ensures a clear separation between the probability of an event occurring and the probability of selecting a sequence of words to describe it, mitigating agentic AI’s made-up answers, lying and deception. Second, it narrows Scientist AI’s degrees of freedom in its choice of output, which is where agentic AI’s pursuit of misaligned goals can creep in when it makes choices.

  • Sophisticated agentic AI “typically carries a persistent internal state that stores its goals, the attributes that define itself, and its situation within the environment, updating as new observations arrive”. This internal memory feeds the AI’s own outputs and new observations back into the next query, which can progressively loosen the degree of human control over the agent. By contrast, Scientist AI is designed to treat each prompt as a “fresh instance” so that two identical queries yield the same result, since no internal memory carries information from prior queries.

  • Measures are taken to prevent agentic tendencies autonomously emerging under the hood of Scientist AI, including by requiring the model to analyse situations in a simulated world with and without the model itself.
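The Bayesian updating and Occam’s razor points above can be made concrete with a small sketch. It is not a description of how Scientist AI is implemented: the two hypothesis names, their complexity scores, the likelihood values and the helper functions `occam_prior` and `bayes_update` are all assumptions made up for illustration. The simpler theory starts with a higher prior, and each new piece of evidence re-weights both theories rather than discarding one.

```python
def occam_prior(complexity_bits):
    """Prior proportional to 2 ** -complexity, so simpler hypotheses start ahead."""
    weights = {h: 2.0 ** -bits for h, bits in complexity_bits.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def bayes_update(belief, likelihood):
    """Posterior is proportional to prior times the likelihood of the new evidence."""
    unnormalised = {h: belief[h] * likelihood[h] for h in belief}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


# Hypothetical complexity scores (in bits) for two competing explanations.
complexity_bits = {"simple_theory": 1, "complex_theory": 3}
prior = occam_prior(complexity_bits)
print("prior:", prior)

# Each dict gives P(observation | hypothesis); the values are invented.
evidence = [
    {"simple_theory": 0.2, "complex_theory": 0.9},
    {"simple_theory": 0.1, "complex_theory": 0.8},
]

belief = prior
for likelihood in evidence:
    belief = bayes_update(belief, likelihood)
    print("updated:", {h: round(p, 3) for h, p in belief.items()})
```

In this made-up run the simple theory begins at 0.8, but two observations that fit the complex theory better move the balance to 0.1 versus 0.9. The output stays a spread of probabilities across both theories, so the remaining uncertainty is always visible.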

Taken together, Bengio says this architecture means Scientist AI “will be able to interoperate with agentic AI systems, compute the probability of various harms that could occur from a candidate action, and decide whether or not to allow the action based on our risk tolerances.” 
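As a rough sketch of that guardrail idea, the check below compares estimated harm probabilities for a candidate action against stated risk tolerances. Everything in it is hypothetical: the harm categories, the numbers and the `allow_action` function are assumptions for illustration, and the probabilities are hard-coded where the proposal would have Scientist AI supply them.

```python
def allow_action(harm_probabilities, risk_tolerances):
    """Return (allowed, violations); block if any harm estimate exceeds its tolerance."""
    violations = {
        harm: p
        for harm, p in harm_probabilities.items()
        # Assumption: a harm with no stated tolerance is treated as zero-tolerance.
        if p > risk_tolerances.get(harm, 0.0)
    }
    return (not violations, violations)


# Hypothetical risk tolerances set by the humans overseeing the agent.
risk_tolerances = {"data_loss": 0.01, "financial_harm": 0.001}

# Hypothetical harm estimates for two candidate actions.
safe = allow_action({"data_loss": 0.002, "financial_harm": 0.0001}, risk_tolerances)
risky = allow_action({"data_loss": 0.002, "financial_harm": 0.05}, risk_tolerances)

print(safe)   # (True, {})
print(risky)  # (False, {'financial_harm': 0.05})
```

The design choice worth noting is that the tolerances sit outside the agent. The agent proposes, and a separate check decides against thresholds that humans set.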

Bengio goes further: Scientist AI could help design safe artificial superintelligence (ASI):

The crucial advantage of using a Scientist AI in this research program is that we would be able to trust it, whereas if we try to use an untrusted agentic AI to help us figure out how to build future and supposedly safe ASI, it may fool us into building something that would advance its goals and endanger us, for example by proposing code with back-doors that we are not able to detect.

What’s your p(doom)?

P stands for probability and ‘doom’ for the existential risks of future AI. 

While Bengio says his p(doom) keeps him awake at night, another Turing Award winner, Yann LeCun, says his p(doom) is less than the chance of an asteroid hitting Earth.

A survey of 111 AI experts by Cambridge University fellow Severin Field found two distinct worldviews – “AI as an uncontrollable agent” and “AI as a controllable tool.” He also found “a concerning gap in AI safety literacy” and that unfamiliarity with AI-safety concepts correlated strongly with lower assessments of catastrophic risk (though not all optimists, such as LeCun, share that view).

In particular, a significant number of surveyed AI experts who maintained that “we can simply turn off AIs that misbehave” were unfamiliar with the emerging research on self-preservation tendencies – precisely the behaviours that most worry Bengio.