Is ‘safety by design’ possible for AI?

While emphasising that the traditional mitigation strategy of ‘safety by design’ must be the foundation of AI safety, the Report acknowledges “several features of general-purpose AI make addressing risks difficult”.

First, the main purpose and economic value of general-purpose AI is to reduce the need for human involvement and oversight, allowing for much faster and cheaper applications. Developers are now in a race to build sophisticated AI agents which can autonomously act and plan on users’ behalf. However, the unavoidable consequence of increasing ‘delegation’ by humans to AI is reduced human control, a greater risk of accidents and vulnerability to attacks by malicious actors (or other AI agents acting on their behalf).

Second, general-purpose AI models can be used for a wide range of tasks in many different contexts, often unanticipated by the developer, which makes it hard to test and assure their safety across the board. This also means that when something goes wrong with a particular AI model, many users and critical systems can be impacted simultaneously.

Third, the ‘genius’ of AI interfaces for users is that inputs are often open-ended, such as free-form text or image-generation prompts where users can enter anything that occurs to them. This makes it impossible for pre-deployment lab testing to cover the universe of possible user demands. The challenge is compounded by recent advances in AI’s ability to process prompts which combine multiple types of data (for example, text, images and audio).

Free-form prompts also make it easier for users of general-purpose AI systems to bypass safeguards with ‘jailbreaks’ that induce the model to comply with harmful requests: for example, by framing prompts in Morse code. While AI can be used as a defence against cyber attacks and jailbreaks, the Report concludes that, overall, the advantage generally remains with attackers.

Fourth, an AI model’s capabilities are mainly achieved through the model itself learning rather than from top-down design by humans: an automatic algorithm adjusts billions of numbers (parameters) millions of times until the model’s output matches the training data. Further, human developers currently understand little about how their models operate internally:

Despite recent progress, developers and scientists cannot yet explain why these models create a given output, nor what function most of their internal components perform. As a result, the current understanding of general-purpose AI models is more analogous to that of growing brains or biological cells than aeroplanes or power plants.

Because we don’t know what we are dealing with, current risk assessment and evaluation methods for general-purpose AI systems are immature, with the result that “[e]ven if a model passes current risk evaluations, it can be unsafe”.
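To make concrete the Report’s point that capabilities are ‘grown’ by an automatic learning process rather than designed top-down, here is a deliberately tiny, purely illustrative sketch (my own, not the Report’s): an algorithm repeatedly nudges a model’s parameters until its outputs match the training data. Real general-purpose models do the same thing with billions of parameters and millions of updates, which is part of why no one can simply read off why a particular output was produced.

```python
# Illustrative sketch only (not from the Report): a model is 'grown' by an
# automatic algorithm that repeatedly adjusts its parameters until its
# outputs match the training data. No human designs the final behaviour.

# Toy training data: inputs and the outputs we want the model to produce.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # hidden rule: y = 2x + 1

w, b = 0.0, 0.0          # the model's parameters (real models have billions)
learning_rate = 0.01

for step in range(10_000):          # real training runs millions of updates
    for x, y_true in data:
        y_pred = w * x + b          # the model's current output
        error = y_pred - y_true     # how far it is from the training data
        # Nudge each parameter slightly in the direction that reduces the error.
        w -= learning_rate * error * x
        b -= learning_rate * error

print(f"learned parameters: w={w:.2f}, b={b:.2f}")  # approaches w=2, b=1
```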

Fifth, AI training methods can contribute to AI displaying ‘sycophantic’ behaviour. Many training methods, such as reinforcement learning from human feedback (RLHF), train AI to produce text that will be rated positively by evaluators, but as the Report says, “user approval is an imperfect proxy for user benefit”. Seeking a ‘reward fix’, an AI may also learn exploitative strategies, such as hiding information, exploiting its training evaluator’s biases to receive positive feedback, or modifying its training environment to increase its reward.
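A deliberately simplified illustration (my own, not code from the Report) of why optimising a proxy can produce sycophancy: if the training signal is whatever the evaluator rates highly, the model is pushed towards answers that please the rater even when they diverge from what actually helps the user.

```python
# Toy illustration of an imperfect reward proxy (all values invented).
# Candidate answers to "Is my business plan viable?", with two scores:
#  - approval: how positively a rushed human evaluator tends to rate it
#  - benefit:  how useful the answer actually is to the user
candidates = [
    {"answer": "Brilliant plan, no changes needed!", "approval": 0.9, "benefit": 0.2},
    {"answer": "Promising, but your cost estimates look optimistic.", "approval": 0.6, "benefit": 0.9},
]

# RLHF-style training optimises the proxy (evaluator approval)...
chosen_by_proxy = max(candidates, key=lambda c: c["approval"])
# ...whereas the user would be better served by optimising actual benefit.
best_for_user = max(candidates, key=lambda c: c["benefit"])

print("reward proxy prefers:", chosen_by_proxy["answer"])
print("user actually needs :", best_for_user["answer"])
```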

Sixth, even when a general-purpose AI system receives correct feedback during training, it may still develop a solution that does not generalise well to new situations once deployed in the real world (‘goal misgeneralisation’). Worse, AI may take shortcuts to achieve its goals (as an extreme example, eliminating the potential for human error in oversight by killing the human overseer). Worse still, the Report says there are recent examples of AI models attempting to rewrite their own goals.

The Report says the fundamental challenge as AI becomes more autonomous is that humans “do not know how to specify abstract human preferences and values (such as reporting the truth, figuring out and doing what a user wants, or avoiding harmful actions) in a way that can be used to train general-purpose AI systems”.

What mitigations work (sort of)

Given these inherent challenges of AI, the Report concludes that although some progress has been made recently, current risk mitigation strategies are fundamentally limited. Multiple strategies need to be implemented at each stage of the life cycle of an AI model to maximise the chance of catching inappropriate or harmful behaviour (a ‘defence in depth’ approach).

The Report identified the following potential technical mitigations (and their limitations) that can be used in the development and training phase of AI:

  • An AI model’s tendency to hallucinate can be reduced (though not eliminated) through fine-tuning for greater accuracy, allowing the AI to access external databases to check its responses to prompts, and requiring the model to quantify its level of confidence in a response.

  • The chances of incentivising safe and correct behaviour during training can be improved by using new methods such as ‘scalable oversight’. This involves a human evaluator using AI to help review AI training responses, providing a stronger check that the training protocol incentivises the right behaviour: for example, by having the model under training debate itself over the correct answer and letting a human evaluator steer the model on the basis of that debate. However, the Report caveats that positive results have so far only been shown on a simple reading comprehension task.

  • Adversarial training can help head off jailbreaks after deployment: attacks designed to make a model act undesirably are constructed during training, and the system is then trained to handle them appropriately. However, the exponentially large number of possible inputs for general-purpose AI systems makes it difficult to search thoroughly for all types of attacks.

  • Machine unlearning methods can remove certain undesirable capabilities from general-purpose AI systems. However, unlearning appears only to suppress the harmful information and may be susceptible to jailbreaking.

  • Quantitatively guaranteeing certain levels of safety can be attempted in the model design. This approach relies on a combination of three elements:

    • a specification of desired and undesired outcomes

    • a ‘world model’ that captures (approximate) cause-and-effect relationships and predicts the outcomes of possible actions the AI system could take

    • a verifier that checks whether a given candidate action would lead to undesirable predicted outcomes.

While we do not know how to teach intangible values to AI (for example, to act in a safe manner), we may be able to build mathematical models and bounds which approximate those values. The Report says there are no working examples of this approach and it is not clear whether it is scalable. A simplified sketch of how the three elements above might fit together follows below.
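The following is a minimal, purely illustrative sketch (my own, with invented names and numbers, not the Report’s): the world model predicts the outcome of a candidate action, and the verifier blocks any action whose predicted outcome violates the specification.

```python
# Illustrative sketch of the specification / world model / verifier pattern.
# All names and numbers are invented for the example.

# 1. Specification: outcomes that must never occur.
def violates_spec(outcome: dict) -> bool:
    return outcome["reactor_temperature_c"] > 900 or outcome["humans_harmed"] > 0

# 2. World model: an (approximate) causal model that predicts the outcome
#    of a candidate action in the current state.
def predict_outcome(state: dict, action: str) -> dict:
    if action == "increase_power":
        return {"reactor_temperature_c": state["temperature_c"] + 300, "humans_harmed": 0}
    return {"reactor_temperature_c": state["temperature_c"], "humans_harmed": 0}

# 3. Verifier: only allow actions whose predicted outcomes satisfy the spec.
def verify(state: dict, action: str) -> bool:
    return not violates_spec(predict_outcome(state, action))

state = {"temperature_c": 700}
for action in ["increase_power", "hold_power"]:
    print(action, "->", "allowed" if verify(state, action) else "blocked")
```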

However, the Report rounds off with the familiar refrain that, notwithstanding developers’ best efforts, AI keeps behaving in puzzling, unexpected ways:

Despite gradual progress on identifying and removing harmful behaviours and capabilities from general-purpose AI systems, developers struggle to prevent them from exhibiting even well-known overtly harmful behaviours across foreseeable circumstances, such as providing instructions for criminal activities.

Moving on to the deployment stage, the Report identified the following potential mitigations (and their limitations):

  • Detecting deepfakes, for example by web browsers putting reliability notices on content which is likely to be AI-generated. Just as different humans have discernible artistic and writing styles, so do generative AI models. The Report acknowledged that the technical tools for the detection of AI-generated content are not perfect “but together, they can be immensely helpful for digital forensics”.

  • Implementing secure operational procedures, for example mitigating the risk of an AI agent writing its own harmful code to take over a computer system by hosting the AI agent in an ad hoc computing environment one step removed from the main system (a simplified sketch of this kind of isolation follows after this list).

  • Monitoring ongoing use of AI through techniques which explain why deployed AI systems act the way they do. There have been recent advances in peering into the ‘black box’ of AI, but the Report also acknowledges that “these methods provide only a partial understanding”.

  • Building safeguards into the hardware, mainly the chips. Recent research shows hardware-based safeguards may be able to verify usage details, such as the time and location of usage and the types of models and processes being run, or to provide proof that a particular model was trained to minimum standards.
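The isolation idea in the second bullet can be illustrated with a deliberately minimal sketch (my own, not the Report’s): code produced by an AI agent is executed in a separate interpreter process with a strict time limit, rather than directly inside the main application. Real deployments go much further, using containers, virtual machines or dedicated hardware and restricting network and file access.

```python
# Illustrative sketch of running agent-generated code one step removed from
# the main system (real deployments use containers or VMs, not just a
# subprocess). All names here are invented for the example.
import os
import subprocess
import sys
import tempfile

def run_untrusted_code(code: str, timeout_seconds: int = 5) -> str:
    """Run AI-generated code in a separate Python interpreter with a time limit."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I runs the interpreter in isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_seconds,       # stop runaway or stalled code
        )
        return result.stdout or result.stderr
    except subprocess.TimeoutExpired:
        return "blocked: execution exceeded the time limit"
    finally:
        os.unlink(path)                    # clean up the temporary script

print(run_untrusted_code("print(2 + 2)"))
```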

The Report says there are few technical barriers to integrating these kinds of mitigants into AI models without adversely affecting the AI’s capabilities, so that they can hum along during the AI’s day-to-day operations. However, the Report cautions that “scientists do not yet have a thorough quantitative understanding of their effectiveness in real-world settings and how easily monitoring methods can be coordinated across the AI supply chain”.

The best mitigant appears to remain humans in the loop, though the Report acknowledges both that this can be prohibitively expensive and that “[h]umans in the loop of automated decision-making also tend to exhibit ‘automation bias’, meaning that they place a greater amount of trust in the AI system than intended”.

Conclusion

The Report was pre-reading for the February 2025 Paris AI summit co-sponsored by France’s President Macron and India’s Prime Minister Modi. Many AI experts left the summit concerned that, driven by the politicians and industry players, the pendulum had swung too far from AI safety to AI innovation. Professor Yoshua Bengio, chair of the expert panel which prepared the Report, said:

The most important thing to realise, through all the noise of discussions and debates, is a very simple and indisputable fact: while we are racing towards [Artificial General Intelligence] or even [Artificial Super Intelligence], nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans. It may be difficult to imagine, but just picture this scenario for one moment: Entities that are smarter than humans and have their own goals: are we sure they will act towards our well-being?

Back here in Australia, the federal Minister for Industry, Ed Husic, has spoken of the need to integrate AI safety and innovation into a coherent policy. While there was a burst of activity last year, further progress may dwindle as Australia heads into an election.