The media is ablaze with the marvels of the next generation of AI: writing university essays that earn a passing grade, winning art competitions, and composing a song “in the style of Nick Cave” (although Nick says the song sucks).

In a phrase coined by Stanford University’s Institute for Human-Centered Artificial Intelligence (HAI), these new AIs are classified as foundation models (also referred to here as foundational models): 'A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks'.

Why are foundational models different to the AI of the past?

Actually, they are not so different. Foundational models use deep neural networks and self-supervised learning, which have existed for decades. What is different is the sheer scale and scope of foundation models – basically the vastness of the data they learn (and importantly, self-learn) on.

In a sense, foundational models are an example of ‘bigger is better’. AI model performance scales with the amount of computing power used in training, and in turn, the amount of computing power used in training the largest AI models has been doubling every 3.4 months and seems to be accelerating (far faster than the roughly two-year doubling of the venerable Moore’s Law). In less than four years, the number of parameters used in the largest AI models jumped by over 5 times.
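To put the doubling figure in perspective, the arithmetic can be sketched in a few lines. This is only an illustration: the 3.4-month figure is the one quoted above, the 24-month period for Moore’s Law is an assumption (it is commonly stated as a doubling roughly every two years), and the function name `growth_factor` is invented for this example.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Factor by which a quantity grows over `months` if it doubles
    once every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)

# Training compute for the largest models: doubling every 3.4 months (figure quoted above).
ai_per_year = growth_factor(12, 3.4)

# Moore's Law: assumed here to be a doubling roughly every 24 months.
moore_per_year = growth_factor(12, 24)

print(f"AI training compute grows about {ai_per_year:.1f}x per year")
print(f"Moore's Law grows about {moore_per_year:.2f}x per year")
```

On these assumptions, a 3.4-month doubling compounds to more than a tenfold increase each year, while a two-year doubling compounds to well under a twofold increase over the same period.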

Size matters because AI models with larger numbers of parameters, more data, and more training time develop a richer, more nuanced understanding of language. This means that a foundational model can generalise: it can do a wide range of tasks despite not being trained explicitly to do many of those tasks. It can produce results that equal or better the so-called ‘few-shot learners’, which are more task-specific AI, with more focused algorithms learning on smaller topic-specific data sets.

What makes a foundational AI model?

The HAI authors identified two defining characteristics of foundational models: emergence and homogenization, which they defined as follows:

“Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences. Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications; it provides strong leverage towards many tasks but also creates single points of failure.”

How these two features make foundational AI models what they are today is more easily understood by looking at how machine learning has evolved, which the HAI authors depicted as follows:

The first big step in machine learning was in the 1990s when, rather than specifying how to solve a task, a learning algorithm would induce it based on data – i.e., the how emerges from the dynamics of learning. However, AI still needed help in ‘joining the dots’ – human domain experts had to write domain-specific logic to convert raw data into higher-level features (called ‘feature engineering’).

The next big step was ‘deep learning’ around 2010: deep neural networks would be trained on the raw inputs (e.g., pixels), and higher-level features would emerge through training (called “representation learning”). However, as big as this step in machine learning was, AI models performed only on a single task and future tasks required a new set of data points, more training and more resources.

Then in late 2018 came ‘transfer learning’ at scale: this is the ‘secret sauce’ of foundational AI models. As the HAI authors explain it, the idea of transfer learning is to take the ‘knowledge’ learned from one task (e.g. object recognition in images) and apply it to another task (e.g. activity recognition in videos).
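The mechanics of transfer learning can be illustrated with a deliberately tiny sketch. It is a toy under stated assumptions, not how any real foundation model is built: `pretrained_features` is a hypothetical stand-in for a network that has already been trained on some broad source task and is now frozen, and only a small new ‘head’ is trained on a handful of labelled examples for the new task. All names and data are invented for illustration.

```python
def pretrained_features(x):
    # Hypothetical stand-in for a frozen, pretrained network. In a real system
    # these features would have been learned on a large source task; here they
    # are hand-written so the example stays self-contained.
    a, b = x
    return (a + b, a - b)

def train_head(examples, epochs=20, lr=0.1):
    # Train only a small linear classifier (a perceptron) on top of the frozen
    # features. The feature extractor itself is never updated.
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, label in examples:
            f = pretrained_features(x)
            score = weights[0] * f[0] + weights[1] * f[1] + bias
            predicted = 1 if score > 0 else 0
            error = label - predicted  # zero when the prediction is right
            weights[0] += lr * error * f[0]
            weights[1] += lr * error * f[1]
            bias += lr * error
    return weights, bias

def predict(head, x):
    weights, bias = head
    f = pretrained_features(x)
    return 1 if weights[0] * f[0] + weights[1] * f[1] + bias > 0 else 0

# New downstream task with only four labelled examples:
# label 1 when the first number is larger than the second, otherwise 0.
labelled = [((3, 1), 1), ((1, 3), 0), ((5, 2), 1), ((2, 5), 0)]
head = train_head(labelled)
print(predict(head, (10, 1)), predict(head, (1, 10)))
```

The point of the design is the division of labour: the expensive, general-purpose part is reused unchanged, and only a cheap task-specific part is learned from a small amount of new data.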

As the HAI authors say, “transfer learning is what makes foundation models possible, but scale is what makes them powerful.” AI already had shifted to ‘self-supervised training’ in which the AI could learn from unannotated data. The development of Transformer architecture for AI combined with the huge scale of foundational models means that these models can now undertake a form of self-learning called in-context learning, in which the model can be adapted to a downstream task simply by providing it with a prompt (a natural language description of the task), an emergent property that was neither specifically trained for nor anticipated to arise.
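In practice, in-context learning means the ‘adaptation’ happens entirely in the text of the prompt. The sketch below only assembles such a prompt; it does not call any model. The helper `build_prompt`, the ‘Review:’/‘Sentiment:’ layout and the sample reviews are all assumptions made for illustration, not a format prescribed by the HAI authors or by any particular model.

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt containing a task description, a few worked
    examples, and a final unanswered case for the model to complete."""
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # left blank: the model is expected to continue from here
    return "\n".join(lines)

prompt = build_prompt(
    "Classify each movie review as positive or negative.",
    [
        ("A moving, beautifully shot film.", "positive"),
        ("Two hours I will never get back.", "negative"),
    ],
    "The plot dragged but the acting was superb.",
)
print(prompt)
```

No weights are changed at any point: the same underlying model could be pointed at a different task simply by swapping the description and examples.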

The ‘wow factor’ of foundational AI – even for the experts – is that the results of in-context learning at this scale have been more impressive than many anticipated.

The success and scale of transfer learning (i.e. the ‘emergence’ characteristic of foundational AI) feeds into the second characteristic, homogenization (or ‘universality’), because foundational models can ‘turn their digital hands to anything’. The same model can be used to provide recommendations on health issues, to undertake analysis of insurance issues, etc.

What are the capabilities of foundational AI?


Advances in machine learning have largely been driven by natural language processing (NLP), and foundational AI has had the most immediate and dramatic impact there, as illustrated by media reports of the ‘panic’ in schools and universities over students using AI to write essays. And there is something to this: in 2018, the best AI system for answering open-ended science questions scored 73.1% on the New York 8th grade science exam, but barely a year later a foundation model scored 91.6%.

The power of foundational AI in NLP goes beyond its sheer computational power. As the HAI authors explain, in the past, NLP was broken down into specific tasks, including classification tasks for a whole sentence or document (e.g., sentiment classification, like predicting whether a movie review is positive or negative), and sequence labelling tasks, in which we classify each word or phrase in a sentence or document (e.g., predicting if each word is a verb or a noun, or which spans of words refer to a person or an organization). Each task often had its own distinct research community, which developed task-specific architectures, and these were assembled, ‘Humpty Dumpty’ style, into an NLP AI. Now, a single foundation model can be used across the range of NLP tasks by adapting it slightly using relatively small amounts of annotated data specific to each task.

But the really startling change has been in language generation. As the HAI authors explain:

“Until around 2018, the problem of generating general-purpose language was considered very difficult and essentially unapproachable except through other linguistic sub-tasks. Instead, NLP research was mostly focused on linguistically analyzing and understanding text. Now, it is possible to train highly coherent foundation models with a simple language generation objective, like ‘predict the next word in this sentence’.”
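The ‘predict the next word’ objective can be illustrated with the simplest possible language model, one that just counts which word most often follows each word. This is a minimal sketch and a deliberate simplification: foundation models learn the same kind of objective with deep neural networks over vast corpora, not with a lookup table. The function names and the toy corpus are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the word seen most often after `word`, or None if `word` is unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "the cat sat on the mat . the cat chased the mouse . the dog sat on the rug ."
model = train_bigram(corpus)

print(predict_next(model, "sat"))    # "on" follows "sat" both times it appears
print(predict_next(model, "zebra"))  # never seen in the corpus, so None
```

Note that no one labelled this text: the ‘answers’ come from the text itself, which is what makes the objective self-supervised and lets it scale to very large amounts of data.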

In an AI version of the Tower of Babel, foundational AI also could be used to digitally capture the 6,000 languages on the planet. Many languages are spoken by a small (and diminishing) number of native speakers or, even if widely spoken, do not have an extensive recorded or text baseline, in each case providing insufficient data to train AI on that language on a standalone basis.

Multilingual foundation models address this by training on multiple languages jointly, relying on the assumption (which the HAI authors say has been proven by studies) that “the shared structures and patterns between languages can lead to sharing and transfer from the high-resource languages to the low-resource ones.”

However, the HAI authors also caution that, as AI models are usually initially trained on English and other European languages, it is not clear how far this approach can stretch to languages that are more distant from these in structure, syntax, etc. There might even be a risk of AI reshaping the lesser-resourced languages into a kind of AI patois or pidgin.

Yet even with all of the NLP capabilities of foundational AI, humans still outmatch them in language acquisition. AI has to be trained on three to four orders of magnitude more language data than most humans will ever hear or read in their lifetime, let alone the few words from which babies begin to assemble language.


Similarly, “while visual acuity is learned and executed effortlessly by even simple living creatures, transferring the same abilities to machines has proved remarkably challenging.”

Why are we so much better at visual processing than AI?

First, it’s very complex. A human walking down a busy street has to simultaneously perform multiple tasks in order to understand the visual environment around them: semantic tasks, discovering the properties and relations among entities within visual scenes; geometric, motion and 3D tasks, seeking to represent the geometry, pose and structure of still and moving objects within the field of vision; and multimodal integration tasks, combining semantic and geometric understanding with other modalities such as natural language and visual question answering.

Second, humans also are very good at processing a continuing visual stream of objects, scenes and events (e.g. spotting things out a car window traveling at speed). The HAI authors acknowledge that matching this human capability is “a daunting endeavor”.

The traditional AI learning approach for visual content involved a laborious process of labelling huge volumes of images for a fully supervised training task (‘this is a cat’ but ‘this is a dog’) until the AI ‘got it’.

Foundation models can go that next step and, more like humans, “translate raw perceptual information from diverse sources and sensors into visual knowledge that may be adapted to a multitude of downstream settings.” However, the HAI authors acknowledge that while there are early promising signs, “current foundation models for computer vision are nascent relative to their NLP counterparts”.

The key to advancing the visual acuity of AI lies in how to solve the problem that visual-question answering requires ‘commonsense understanding’, since these questions often require external knowledge beyond what is present in the pixels alone. So far foundational AI is good at image synthesis (‘produce a painting of a pig flying in the style of Picasso’), but as the HAI authors note, “these models still struggle to generalize to compositions of simple shapes and colors.”

Benefits and risks

There are areas where foundational AI has immediate, substantial benefits over pre-existing AI models, such as health care:

“Today’s medical AI models often make use of a single input modality, such as medical images, clinical notes, or structured data like [international classification of disease] codes. However, health records are inherently multimodal, containing a mix of provider’s notes, billing codes, laboratory data, images, vital signs, and increasingly genomic sequencing, wearables, and more…. Foundation models can combine multiple modalities during training. Many of the amazing, sci-fi abilities of models like Stable Diffusion are the product of learning from both language and images. The ability to represent multiple modalities from medical data not only leads to better representations of patient state for use in downstream applications, but also opens up more paths for interacting with AI. Clinicians can query databases of medical imaging using natural language descriptions of abnormalities or use descriptions to generate synthetic medical images with counterfactual pathologies.”

At this year’s Davos conference, Microsoft CEO Satya Nadella described a project using ChatGPT to make government information available to villages across India in a range of local languages.

Yet a recent New York Times opinion piece argues that the very same capabilities could allow foundational models to ‘hijack democracy’:

“ChatGPT could automatically compose comments submitted in regulatory processes. It could write letters to the editor for publication in local newspapers. It could comment on news articles, blog entries and social media posts millions of times every day. It could mimic the work that the Russian Internet Research Agency did in its attempt to influence our 2016 elections, but without the agency’s reported multimillion dollar budget and hundreds of employees.”

Concerns with foundational models identified by HAI included:

  • “If the same model is used across a variety of domains with minimal adaptation, the strengths, weaknesses, biases, and idiosyncrasies of the original model will be amplified, such as gender and race bias.” While regulation increasingly emphasises the responsibilities of AI developers (e.g. the product liability approach of the new EU regulations), it is difficult to assess and mitigate the risk of foundational models at the development stage because foundation models are unfinished intermediate objects that can be adapted to many downstream applications, sometimes by an entirely different entity for unforeseen purposes;
  • “Although the absolute cost of computation has become dramatically cheaper over time, the training of the largest foundation models currently requires computational resources that put their development beyond the reach of all but a few institutions and organizations”;
  • “Although concerns over the spread of automated are not specific to foundation models, the generative abilities of models such as GPT-3, as well as the impressive performance on benchmark tasks have the potential to prompt a less-than-careful adoption of this technology by, for example, administrative agencies, many of which lack the expertise necessary to understand sophisticated ML systems”;
  • “Whereas the traditional AI approach of collecting a labelled dataset typically requires working with domain experts and understanding the problems with and limitations of that data, the need for exceptionally large amounts of data in training foundation models has encouraged some researchers to emphasize quantity rather than quality”. This could mean more data collection and more surveillance;
  • AI ‘trust’ frameworks, and increasingly regulation, emphasise ‘explainability’ of AI (i.e. the battle to open the ‘black box’), but foundational AI presents a challenge:

    “The space of tasks that the model is able to perform is generally large and unknown, the input and output domains are often high-dimensional and vast (e.g., language or vision), and the models are less restricted to domain-specific behaviors or failure modes. Consider, for example, the surprising ability of GPT-3 to be trained on large language corpora and to subsequently develop the ability to generate mostly-functional snippets of computer programs. A key challenge for characterizing the behavior of foundation models is therefore to identify the capabilities that it has.”

Yet the HAI authors also recognised the ‘raw potential’ of foundational models, and identified significant benefits in education, health and law, which they described as the ‘pillars of society’. Their following observation zeroes in on the transformative value of foundational models:

“Foundation models show clear potential to transform the developer and user experience for AI systems: foundation models lower the difficulty threshold for prototyping and building AI applications due to their sample efficiency in adaptation, and raise the ceiling for novel user interaction due to their multimodal and generative capabilities.”

Maybe AI itself should have the last word on risks. When ABC reporter Ange Lavoipierre in her Background Briefing report (‘Has the Age of AI already begun?’) asked GPT-3 whether we had reached the point where AI capabilities were beyond human understanding, the AI responded (entirely off its own bat, as it were):

“I don’t know but I guess it is a valid concern. But I don’t think that there is anything sinister or worrying about it. We just need to be aware that AI is capable of creating things we do not fully understand and we need to be careful about how we use AI tools.”

Read more: On the Opportunities and Risks of Foundation Models