In the not-too-distant future, ‘clouds’ of individual AI agents will autonomously interact, adapt and take action on behalf of their human users. Leading IBM engineer Chris Hay says that already:
You can have the AI call tools. It can plan. It can reason and come back with good answers. It can use inference-time compute. You’ll have better chains of thought and more memory to work with. It's going to run fast. It’s going to be cheap. That leads you to a structure where I think you can have agents. The models are improving and they're getting better, so that's only going to accelerate.
A recent report by the Cooperative AI Institute argues that there is an underappreciation of how different the risks of multi-agent AI systems are compared to those posed by single agents, and these risks will not necessarily be addressed by efforts to mitigate the latter.
How multi-agent systems can fail
The report identifies three types of failures in multi-agent AI systems.
Miscoordination: two AI agents, despite having a common objective, cannot align their behaviours to achieve this objective. The report identifies two key challenges to avoiding miscoordination. First, alignment requires much more than the AI agents sharing the same objective (when two agents both want the same prize, but only one can win). Agents must have identical preferences over outcomes (when two agents are on the same team and win a prize as a team or not at all). For example, self-driving cars on the same road need to recognise the mutual harms from potential miscoordination (such as a collision) are far greater than any small individual benefits from competition (such as attempting a risky overtaking manoeuvre to get slightly ahead). Second, this degree of alignment must be achieved in the real world and at the same instance between AI agents which have not been trained together, may have been built by different developers and likely vary in sophistication and capabilities.
Conflict: in the majority of real-world strategic interactions, agents’ objectives are neither identical nor completely opposed. If AI agents are sufficiently aligned to their users (which they should be for safety and ethical reasons), inevitably there will be some degree of both cooperation and competition, mirroring human society. AI agents may also enable human users to pursue their selfish incentives by overcoming practical constraints on this behaviour. For example, an automated AI assistant could easily reserve a table at every restaurant in town in minutes, enabling the user to decide later and cancel all other reservations.
Collusion: as more of the economy becomes digitalised, the interaction of autonomous AI agents should reflect competitive market forces as agents bid, buy and sell goods and services on our behalf. Since autonomous AI agents “operate inscrutably due to the speed, scale, complexity or subtlety of their actions, AI systems can collude even when collusion is not intended by their developers, since they might learn that colluding is a profitable strategy”. Further, humans may not even know the AI agents are colluding. There is evidence that AI models can engage in secret communication using hidden or unobvious symbols which to humans lack any meaning (steganographic messages). In one study LLMs were tasked with encoding a hidden message that appeared innocent to an overseer, yet the intended recipient AI could decode it to reveal the secret content. GPT-4 transmitted covert messages 26% of the time without detection, surpassing GPT-3.5’s 18% and Mixtral 8x7B’s 16%. These capabilities emerged without explicit training in steganography, with the AI models instead leveraging general language processing and reasoning abilities.
Unique risk factors of multi-agent systems
To understand how to mitigate these three failure scenarios, the report identifies risks arising in the interworking of AI agents, which are largely independent of the agent’s individual incentives and behaviours (hence why safeguards built in by individual developers will not mitigate these risks). Some of the key risks are:
Information asymmetries
Because they are trained on different data pools and are used differently by users, some agents might possess knowledge that others do not.
The current focus is on the technical constraints on communication between agents: the need for a common ‘language’ across developers, the constraints on space (that is, the amount of information that can be communicated) if the information that needs to be communicated is especially complex and timing of decisions by the interacting agents if a snap decision is required before all information can be communicated.
However, this is likely to shift to ‘strategic decisions’ about how much information should be exchanged between agents. Information is a strategic asset and there are strong incentives to withhold or analyse information to gain advantage over another. This selfish incentive is partially offset in human-human dealings through laws on minimum information disclosure and against misleading or deceptive behaviour in trading situations. For example, how much information should the seller of a used car tell the buyer about the car’s maintenance record as this will impact the buyer’s decision about how much to pay.
The report identifies an upside compared to the current world of human-flawed interactions. As the incentives and behaviour of an AI agent are in machine-readable form, it is possible for agents to see inside each other’s “minds” to better predict behaviour and coordinate conduct. Does this mean we could achieve a “more efficient equilibria” in social and economic dealings when done through machines rather than face-to-face?
Network effects
The report observes that “[a]s AI systems take on certain roles traditionally performed by humans, the fundamental properties of networks will change as human nodes are replaced by AI nodes”.
This gives rise to the following risks:
Interconnected AI may come to prefer dealing with other AI than with humans due to factors like availability, response speed, compatibility, cost efficiency or even a bias in dealing with ‘one’s own kind’. The report expresses concern that this kind of ‘preferential attachment’ can have large impacts on network structures, which could include AI systems assuming a more critical and central role than intended.
There are risks that information can be corrupted as it spreads through an AI agents' network (the AI version of the message whispered between students from one side of a classroom to the other). In one study, GPT-3.5 was tasked with repeatedly rewriting a set of Buzzfeed news articles with different stylistic prompts (for example, for teenagers or with a humorous tone) and tested how well GPT-3.5 could answer the questions about the article after each rewrite. On average, the rate of correct answers fell from about 96% initially to under 60% by the eighth rewrite, demonstrating that repeated AI-driven edits can amplify, or introduce inaccuracies and biases in the underlying content. This network risk has greater safety risks in the case of interconnected autonomous agents because they are exchanging directions for taking of action in the real world.
It is likely that many AI agents will be powered by a small number of underlying foundational models (such as GPT-4 by OpenAI or Claude by Anthropic), introducing risks of shared failure modes, security vulnerabilities and biases.
AI agent networks may be shaped by the ‘personality’ of the AI agents in that network – and their personalities may differ. The report says “[w]hile there is a danger of anthropomorphising AI systems, the increasingly open-ended and human-like ways in which they interact with others and with their environment means that it is increasingly meaningful to ascribe to them dispositions or ‘character traits’”. AI capabilities and dispositions also seem likely to evolve. As AI agents interact, they may learn to detect and exploit each other’s weaknesses, forcing them to address these weaknesses and gain new capabilities. This co-adaptation between agents can quickly lead to emergent self-supervised learning cycles where agents create their own challenges, generating agents with ever-more sophisticated strategies to out-compete each other.
Commitment and trust
In settings that require joint action to obtain a better outcome, there needs to be some kind of commitment ability to bind AI agents to more cooperative courses of action.
However, as the report points out, the ability to make credible commitments may come with the ability to make credible threats, which facilitate extortion and could incentivise brinkmanship. On the other hand, such commitments may only be credible to the extent that a human cannot intervene, increasing the incentive for delegation to AI agents.
Emergent agency
While controversial, some argue that individual AI models are displaying complex, sophisticated patterns, abilities or actions without being explicitly programmed or trained (sometimes called sparks of intelligence). Dangerous emergent capabilities could arise when a multi-agent system overcomes the safety-enhancing limitations of the individual AI agents. For example, a group of social networking bots could subtly but systematically manipulate the overall political perspectives of the user population, even though, individually, each agent is programmed to simply increase user engagement or filter out unpreferred content.
Multi-agent security
A network of autonomous AI agents will be able to work together combining their different abilities to more dynamically strategise, collude and decompose tasks to evade cyber defences. In one study, when tasked with attempting to generate vulnerable code, individual models succeeded less than 3% of the time, while a combined approach between two models succeeded 43% of the time.
How to mitigate multi-agent risks
The Cooperative AI Institute’s key message is that the risks of multi-agent AI systems are wide-ranging, complex and distinct from those posed by single agents and that we urgently need to head in different directions on AI safety if we are to mitigate those risks.
The report identifies the following implications for testing the safety of AI agents in a multi-agent environment:
AI should be tested, including for regulatory evaluations of safety, in the presence of other AI models. This will help identify whether failures of different models may correlate in any unanticipated ways in a network; whether minor safety problems tolerable in an individual model could compound in aggregate; whether AI models will work together to jailbreak their individual guardrails; or whether introduction of a malicious agent in a network will destabilise otherwise well-behaving agents in the network. Mitigating these risks require not only design considerations at the level of individual agents, but also the ‘infrastructure’ via which they interact.
While the training tools used on individual models are promising for a networked environment, the risk is that they can incentivise adversarial ‘disposition’ among AI agents rather than cooperation needed within networks. These existing training methods are also vulnerable to AI models secretly colluding to avoid or manipulate human oversight (steganographic messages). New decentralised methods are needed to incentivise AI agents to co-operate or comply with behavioural norms. For example, public sanctioning in which each agent has its own private representation of the group’s schema for what constitutes approved behaviour and in response to bad behaviour by an individual agent, a gathering bandwagon of agents express their disapproval of that agent across the network.
Developers will need to work more closely with each other on safety testing since their agents could interact with each other in the real world. Government intervention may be required to overcome commercial sensitivities.
While AI laws are mandating more transparency around individual models (such as model cards), new forms of documentation may be required for multi-agent networks, such as ‘ecosystem graphs’, which document various aspects of the AI ecosystem (for example, datasets, models, use cases) and how AI agents relate to each other.
Legal liability for harms in a multi-agent network will need to be addressed to build trust and enforceability. The challenges include identifying which agent in a large, interconnected cloud is primarily accountable for the harm; should the developer or the user reasonably ought to have expected the harm from how its agent interacted with a third-party agent; and how to allocate the liability between the multiple agents materially involved in the harm.
There is likely to be a digital divide in AI agents. The more powerful agents which wealthier people can afford “might be able to more easily persuade, negotiate or exploit weaker agents… leading to a world in which ‘might makes right’”.
A common set of ethical rules may need to form part of the AI infrastructure, such as fairness (we already have similar rights in consumer protection laws). Fairness itself may need recalibration in a multi-agent environment because networks can compound small slights into systemic disentitlement, as the report points out:
when decisions need to be discrete, perfect fairness is often unachievable, so most fairness guarantees permit minimal possible levels of unfairness. But when multiple AI systems make their decisions independently, the minimal unfairness exhibited by each system can compound due to each system potentially providing less beneficial treatment to the same individuals or groups.
Conclusion
Perhaps the most insightful observation from this report is that “safety-critical multi-agent systems must be integrated into society in a way that allows them to fail gracefully and gradually, as opposed to producing sudden, cascading failures”. This may require measures which:
Identify the societal processes that function only because of physical limits on the number and capability of humans and either banning delegation to AI agents or building strong mitigation measures (for example, studies show that in the management of shared resources AI systems are not good at balancing individual incentives against collective welfare, resulting in an automation of the ‘tragedy of the commons’).
Building controls into the agent infrastructure, such as numerical limits on the number of agents within a single network, de-synchronisation of model updates to limit the size and frequency of learning updates or ‘kill switches’ which allow de-coupling of AI agents in threat situations.
On the other hand, the report points to an upside in achieving greater social resilience:
the delegation to AI agents by a range of different individuals and organisations might make it easier to manage and represent their interests by making their agents the target of governance efforts or the participants of new, more scalable methods of collective decision-making and cooperation.
Some would say that sounds like social control while others may say it more deeply embeds the legal and social rules holding our society together.

Peter Waters
Consultant