Air Street Capital (Nathan Benaich) has just released its sixth annual report on the state of AI.

In a nutshell, while consumer-facing AI capabilities continue to improve in leaps and bounds, we could be approaching the limits of what large language models (LLMs) can achieve without the next, as yet unknown, step change in AI architecture.

The current state of AI

OpenAI’s GPT-4 continues to hold the mantle as the most capable LLM across a wide range of tests, not only on established natural language processing (NLP) benchmarks but also on exams designed to evaluate human ‘talent’:

  • on the Uniform Bar Exam, GPT-4 scores around the 90th percentile of test takers, compared to around the 10th percentile for GPT-3.5.
  • OpenAI reports that although GPT-4 still suffers from hallucinations, it is factually correct 40% more often than the previous-best ChatGPT model on an adversarial truthfulness dataset.
  • GPT-4 is also ‘safer’ to use. OpenAI says that GPT-4 is 82% less likely to "respond to requests for inappropriate content".

The Holy Grail of AI research is to build a single model that can solve any task specified by the user, whether the request or the requested output is in the form of text, a fixed or moving image, sound or any combination thereof. Understandably, compared to NLP tasks, vision-language tasks are more diverse and complicated in nature due to the additional types of visual inputs from various domains.

GPT-4 is a step towards this single multimodal language model. GPT-4 can accept images as requests and has the ability to answer questions about an image:

  • an OpenAI video shows how GPT-4 can use an uploaded drawing to build a website. 
  • the New York Times showed GPT-4 a photo of the inside of a fridge and asked it to write a menu for a meal from the available ingredients. 

However, there is still a way to go on AI’s visual capabilities, with GPT-4’s performance highly variable across different visual tests. While GPT-4 is superior overall, other models come close to matching its visual capabilities, including LLaMa-Adapter-v2 and InstructBLIP.

What is OpenAI’s ‘secret sauce’?

Air Street puts GPT-4’s impressive improvements in capabilities down to two factors, both of which carry inherent challenges for the future (see below):

  • GPT-4 was trained on vastly more data than previous models. Air Street quotes industry speculation that “GPT-4 has 220B parameters and is a 16-way mixture model with 8 sets of weights.” While the sheer volume of training data is impressive enough, the architecture multiplies the number of parameters that a given amount of compute power (itself still pretty vast) can support. In a ‘mixture of experts’ model, the network is split into several sets of weights, referred to as ‘experts’. Each expert comes to specialise in particular computations and skills, and a routing layer sends each input to only a few of them. Different experts can be hosted on different GPUs, scaling up the number of GPUs used for a model. But impressive as all this computing ‘black magic’ is, Air Street comments: “Neither the total size of the model nor using a Mixture of Experts model is unheard of. If the rumours are to be believed, no fundamental innovation underpins GPT-4’s success.”
  • OpenAI utilises Reinforcement Learning from Human Feedback (RLHF) in training its AI products. This involves human ‘teachers’ ranking LLM outputs sampled for a set of ‘control’ requests (essentially ‘marking the LLM’s homework’), using these rankings to learn a reward model of human preferences, and then using this as a reward signal to fine-tune the language model. As Air Street explains:

RLHF requires hiring humans to evaluate and rank model outputs, and then models their preferences. This makes this technique hard, expensive, and biased. This motivated researchers to look for alternatives.
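
To make the reward-modelling step more concrete, here is a minimal Python sketch of how one human ranking can be turned into a training signal. It assumes a Bradley-Terry style pairwise loss, which is a common choice in published RLHF work but is not confirmed by the report as OpenAI's exact method. The function names, answer labels and scores are all hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn one human ranking (best first) into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: small when the preferred output scores higher."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical reward-model scores for three outputs to the same request.
scores = {"answer_a": 2.0, "answer_b": 0.5, "answer_c": -1.0}
human_ranking = ["answer_a", "answer_b", "answer_c"]  # best to worst

pairs = ranking_to_pairs(human_ranking)
mean_loss = sum(pairwise_loss(scores[p], scores[r]) for p, r in pairs) / len(pairs)
```

Training the reward model would mean adjusting it so that this loss falls across many such rankings. The cost Air Street describes comes from the fact that every ranking in the dataset has to be produced by a person.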

Will we run out of data to feed AI?

Epoch AI predicts that “we will have exhausted the stock of low-quality language data by 2030 to 2050, high-quality language data before 2026, and vision data by 2030 to 2060.” Air Street notes that new speech recognition and optical character recognition technologies should open up more audio and visual content, but given the current LLM architecture and training methods, the challenge of ‘feeding the beast’ remains.

One potential solution is to train AI on synthetic data (data generated by AI itself), whose volume should only be bounded by the available compute capacity. While Google has had some success using synthetic data to fine-tune Imagen, other researchers have identified the risk of ‘model collapse’, in which generated data end up polluting the training set of the next generation of models. Synthetic data, therefore, is most helpful when used as a proportionally small uplift to an already large real data set.
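
As a rough illustration of keeping synthetic data to a ‘proportionally small uplift’, the Python sketch below caps the synthetic share of a mixed training set. The function name and the 10% default are assumptions made for illustration; the report does not specify a ratio.

```python
import random


def mix_with_synthetic(real, synthetic, max_synthetic_share=0.1, seed=0):
    """Return a shuffled training set in which synthetic examples make up
    at most `max_synthetic_share` of the total."""
    if not 0.0 <= max_synthetic_share < 1.0:
        raise ValueError("max_synthetic_share must be in [0, 1)")
    # Solve s / (len(real) + s) <= share for s, the synthetic count.
    cap = int(len(real) * max_synthetic_share / (1.0 - max_synthetic_share))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(cap, len(synthetic)))
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed
```

The design choice here is that the amount of real data sets the ceiling: adding more synthetic examples to the pool does not change the mix unless the real data set also grows.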

An alternative approach is to train more times on the same data set: an ‘epoch’ in machine learning means one complete pass of the training dataset through the algorithm. Training for one or two epochs is considered ideal. Training for too many epochs risks ‘overfitting’, which means that the AI cannot generalise predictions to new data because it has become too tied to the training data (a case of familiarity breeds laziness). The problem of overfitting worsens with an increase in the size of the model and the training data.
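
One common guard against overfitting, not discussed in the report, is ‘early stopping’: training halts once performance on held-out data stops improving. The Python sketch below shows the idea under simple assumptions. The function name, the patience value and the loss figures are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop.

    Training stops once validation loss has failed to improve for
    `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: improvement stalls after epoch 3.
val_losses = [2.10, 1.60, 1.45, 1.50, 1.58, 1.70]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

In this made-up run the loss on unseen data starts rising after the third epoch, which is the usual sign that further passes over the same data are teaching the model the training set rather than the task.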

Another potential barrier to continued AI development is ‘context length’, which refers to the maximum number of tokens the model can remember when generating output in response to a request. Context length determines how far apart the AI can build connections between ideas in the text to generate coherent outputs. Air Street explains the constraints of current context lengths as follows:

One of the most alluring promises of LLMs is their few-shot capabilities, i.e. the ability of an LLM to answer a request on a given input without further training on the user’s specific use case. But that’s hindered by a limited context length due to the resulting compute and memory bottleneck.

Therefore, on the hypothesis that a larger context length will result in improved performance, the race is on between developers over context length. But Stanford researchers have found that the larger the context, the more poorly AI models perform. Worse still, with large contexts, AI models perform much better on the beginning and the end of the context, with performance dropping dramatically in the middle.

As context is essentially the ‘working memory’ of AI, this suggests that there might be a goldfish bowl effect: if AI has to ‘swim around too large a fish bowl’ in processing each request, it could forget the request it is processing. Therefore, increased context length and large datasets will require architectural innovations.
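
The compute and memory bottleneck that Air Street refers to can be illustrated with a back-of-the-envelope calculation. The Python sketch below assumes standard self-attention that stores a full score matrix (context length by context length) for every attention head. The head count and the 2 bytes per value are hypothetical settings, not GPT-4's actual configuration.

```python
def attention_score_bytes(context_len, n_heads=32, bytes_per_value=2):
    """Memory for one layer's attention score matrices, in bytes.

    Assumes standard self-attention that materialises a
    context_len x context_len matrix for every head.
    """
    return n_heads * context_len * context_len * bytes_per_value


for tokens in (2_048, 8_192, 32_768):
    gib = attention_score_bytes(tokens) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB per layer")
```

Because the cost grows with the square of the context length, doubling the context quadruples this figure. Memory-efficient attention implementations avoid storing the full matrix, but the underlying quadratic amount of computation is one reason longer contexts call for architectural changes rather than simply more hardware.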

Can open and small compete against big and closed?

There are concerns that AI competition will be constrained because the huge costs involved in developing and training LLMs will limit the number of upstream AI providers.

Potentially countervailing this concern, Air Street notes that there are emerging techniques which provide alternatives to a high level of human involvement in training:

  • small LLMs can be fine-tuned on the outputs of larger, more capable LLMs. However, Berkeley researchers found that while the resulting models are stylistically impressive, they often produce inaccurate text.
  • Meta has developed a ‘less is more’ approach in which AI can be trained on a carefully curated set of prompts and responses (e.g. 1,000) and, even at this early stage of development, researchers have found it competitive with GPT-4 in 43% of cases.
  • Google has shown that LLMs can self-improve by training on their own outputs. 

But Air Street concludes that, as things currently stand, the resource-intensive RLHF remains king.

Air Street also notes that AI competition could be impacted by a growing trend among AI developers to withhold details of their models:

As the economic stakes and the safety concerns are getting higher (you can choose what to believe), traditionally open companies have embraced a culture of opacity about their most cutting edge research.

However, while Air Street says there is a capability chasm between GPT-4 and open source models, it also reports on the accelerating dynamism of the open source sector.

In February 2023, Meta released the LLaMa model series. By September 2023, LLaMa-2, which is available for commercial use by third-party developers, had 32 million downloads. LLaMa-2 can be trained using publicly available data over a 21-day period, substantially reducing development costs. LLaMa-2 70B is competitive with ChatGPT on most tasks except for coding, where it lags significantly behind. But Air Street says that CodeLLaMa, a fine-tuned version for code, beats all non-GPT-4 models. Other AI developers have responded with their own open-source models, including MosaicML’s MPT-30B, TII UAE’s Falcon-40B, Together’s RedPajama and Eleuther’s Pythia.

Air Street reports that a striking development downstream from the release of these models is that the open-source community is fine-tuning the smallest versions of LLaMa and other AI models on specialized datasets to create dozens of new AI applications.

Big is also not necessarily always better. Research by Microsoft has shown that when small language models (SLMs) are trained with very specialized and curated datasets, the outputs these SLMs produce can rival models which are 50x larger. Phi-1, with 1.3 billion parameters, “achieved an accuracy score of 50.6%, surpassing GPT-3.5’s performance of 47% with a staggering 175 billion parameters.”

A problem with SLMs is that they can be overwhelmed if force-fed too much data, but too little data runs higher risks of bias etc. Air Street reports the development of new training techniques which can produce GPT-beating outputs:

Assisted by GPT-3.5 and GPT-4, researchers generated TinyStories, a synthetic dataset of very simple short stories but that capture English grammar and general reasoning rules. They then trained SLMs on TinyStories and showed that GPT-4 (which was used as an evaluation tool) preferred stories generated by a 28M SLM to those generated by GPT-XL 1.5B.

Crystal-ball gazing

Air Street’s predictions for AI in 2024 include:

  • a Hollywood-grade production makes use of generative AI for visual effects.
  • an AI-generated song breaks into the Billboard Hot 100 Top 10 or the Spotify Top Hits 2024.
  • the GenAI scaling craze sees a group spend >US$1B to train a single large-scale model.
  • a generative AI media company is investigated for its misuse during the 2024 US election circuit.
  • self-improving AI agents crush state of the art (SOTA) in a complex environment.

Read more: State of AI Report 2023