As the UK feels that it is emerging from the COVID tunnel, the Alan Turing Institute convened workshops of experts “to capture the successes and challenges experienced by the UK’s data science and AI community during the COVID-19 pandemic.”
The report of the workshops observed:
While pandemics appear to have occurred throughout human history, the COVID-19 pandemic is unique in one important respect. It is the first pandemic to occur in the age of data science and AI: the first pandemic in a world of deep learning, ubiquitous computing, smartphones, wearable technology and social media.
Data successes, data fails
The workshop participants attributed much of the success of data analytics in the COVID response “to the permissive nature of the regulatory environment necessitated by the pandemic.” In March 2020, the UK government issued a direction to the National Health Service, called a Control of Patient Information (COPI) notice, requiring confidential patient data to be made available to researchers and policy makers for COVID-specific purposes.
For example, this allowed the development of a new statistical analytics platform called OpenSAFELY, which provides secure access to a database of over 58 million NHS patient records, allowing researchers to answer urgent clinical and public health questions related to COVID. To maximise collaboration, OpenSAFELY requires all researchers to archive and publish their analytic code: this is the only way they are allowed to run code against real data. To maximise transparency, every time a researcher wants to run this code against real data, the event is logged on a public register.
COVID data analytics were also enhanced by data from non-health and private sources, such as social media platforms and mobile operators. The workshops called out the work of the Office for National Statistics in supplying data during the pandemic, including its work on deaths stratified by ethnicity, the social impacts of COVID-19, and how people spent their time during lockdown.
But the workshops also concluded that insights from the data science and AI community were not as informative and robust as they could have been, for three reasons:
- Limited or slow access to data: e.g. some hospitalisation data were not available early in the pandemic, and geographically disaggregated data for local analysis and crafting of solutions (e.g. local lockdowns) were also not available to all relevant academic groups;
- Some data had never previously been systematically collected: e.g. on non-pharmaceutical interventions (social distancing, mask wearing, lockdowns), and particularly compliance with such interventions, making it difficult to measure the impact of these policies on behaviour; and
- Lack of standardisation: different data standards and codification of metadata, and lack of dataset documentation, meant that data were difficult to find, link and assess in terms of "missingness" and biases, limiting the scope of and confidence in analyses.
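As a rough illustration of what assessing "missingness" can look like in practice, the sketch below computes the fraction of records with a missing value for a field, both overall and broken down by group. It is not taken from the Turing report: the function names, the field names and the sample rows are hypothetical, and real datasets encode missing values in many more ways than `None` or an empty string.

```python
from collections import defaultdict

# Values treated as "missing" in this sketch (an assumption; real data varies).
MISSING = (None, "")


def missing_fraction(rows, field):
    """Fraction of rows in which `field` is absent or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in MISSING)
    return missing / len(rows)


def missing_fraction_by_group(rows, field, group_field):
    """Missing fraction of `field`, computed separately for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(group_field)].append(row)
    return {group: missing_fraction(members, field)
            for group, members in groups.items()}


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    rows = [
        {"region": "A", "ethnicity": "X"},
        {"region": "A", "ethnicity": "Y"},
        {"region": "B", "ethnicity": None},
        {"region": "B", "ethnicity": ""},
        {"region": "B", "ethnicity": "X"},
    ]
    print(missing_fraction(rows, "ethnicity"))
    print(missing_fraction_by_group(rows, "ethnicity", "region"))
```

A large gap between groups is one simple signal that an analysis built on the field may be biased, because the records that are missing are not missing at random.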
The recommendations to improve data for future pandemic responses included:
- aspire to a research culture in which data are shared as openly as legal and ethical obligations permit, with central repositories, or ‘data lakes’, for cleaned and anonymised data;
- investigate ways in which access to sensitive data (e.g. from the NHS) may be enabled while respecting professional, ethical and legal obligations surrounding the data. This will require developing new ways to allow researchers to securely access personal data, perhaps using differential privacy or federated learning techniques;
- automate data collection; and
- stress test new standardisation specifications for the next pandemic.
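To give a sense of what the differential privacy mentioned in the recommendations involves, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not a method described in the report: the function names, the `admitted` field and the sample records are hypothetical, and a production system would also need to track a privacy budget across queries.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    query's sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [{"admitted": True}] * 30 + [{"admitted": False}] * 70
    noisy = dp_count(records, lambda r: r["admitted"], epsilon=0.5)
    print(f"noisy count of admissions: {noisy:.1f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy protection, at the cost of a less accurate answer; choosing that trade-off is a policy decision as much as a technical one.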
The workshop participants carefully wove privacy concerns through these recommendations, but implementing them would still involve a significant shift in the approach to the use of personal medical data.
The workshop participants noted that “the COVID pandemic has brought societal inequality into sharp focus, with the disease having a much greater impact on some groups than others”. Disadvantaged groups suffered a ‘double disadvantage’ because they are often ‘invisible’ in data sets.
Some public data holders, such as the NHS, had robust and rigorous systems for obtaining epidemiological data at a sufficiently granular level to identify trends linked to social and economic disadvantage. However, in the ‘scramble’ to respond to COVID, there was much repurposing of existing data sets which had ‘baked in’ under-representation of disadvantage.
Now, in the headlong rush to ‘normalise’, there is a risk that tools which could have an ongoing exclusionary impact, such as vaccine passports, will draw on these flawed data sets.
The workshops recommended developing clear protocols for generating anonymised and synthetic data, so that important demographic information can be included in open datasets, and pointed to projects such as the University of Nottingham’s OpenPseudonymiser.
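As a rough sketch of the general idea behind pseudonymisation, the example below replaces a direct identifier with a keyed hash, so that records can still be linked across datasets without exposing the identifier itself. This is not OpenPseudonymiser's actual implementation: the function name, the normalisation step, the key and the sample identifier are all assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Return a stable pseudonym for `identifier` using HMAC-SHA256.

    The same identifier and key always give the same pseudonym, which
    allows linkage; without the key, the pseudonym cannot be recomputed
    from a guessed identifier.
    """
    # Normalise so that formatting differences do not break linkage
    # (an assumption; real schemes define their own normalisation rules).
    normalised = identifier.replace(" ", "").upper()
    digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical key and identifier, for illustration only.
    key = b"example-project-key"
    print(pseudonymise("943 476 5919", key))
    print(pseudonymise("9434765919", key))  # same pseudonym as the line above
```

Pseudonymised data are not anonymous: whoever holds the key can regenerate the pseudonyms, and the remaining fields in a record may still identify a person, which is why the workshops paired the idea with protocols for anonymised and synthetic data.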
Policy led by science, but made in a black box
While the workshops lauded the politicians and policy makers proclaiming they would be ‘led by the science’, they also identified problems in communications:
- there needs to be more emphasis on developing collaborative working relationships between data scientists and clinicians: “data scientists can provide insights about the collection and storage of data that clinicians may not be aware of, while clinicians can provide valuable insights about the multi-dimensional nature of health and social data collect”; and
- because Government processes are not very transparent, researchers have little idea about how data analytics is driving policy: “it was difficult to know which studies had ‘cut through’ and been considered by government and advisory groups when making policy interventions, and which data policy makers were using to inform their decisions.”
However, the workshops also recognised that the data analytics community itself must be better at communicating directly with the public:
Throughout the pandemic, data science and AI have been in the public eye as never before. Every day presents a new swathe of statistics about COVID-19, and data scientists have been increasingly called upon to communicate their research to non-specialists. Meanwhile, … there have been vigorous debates on the ethics of AI and algorithmic bias…Although there had been successful examples of public engagement from the community during the pandemic, there were also shortcomings in communication, particularly around the limitations and uncertainties of research. Helping the public to understand the findings and caveats of modelling studies, for example, could enhance trust in the research and increase support for and compliance with policies.
The point was made that data scientists have an unprecedented array of cool tools at their fingertips, such as data visualisation, and they need to get more creative at using them in public communications. The example given was Harry Stevens’ early article on how rapidly COVID can spread (as it was in the “Amazon Washington Post”, Donald Trump obviously missed it).
The Turing report concludes:
Navigating our way through the pandemic without the knowledge and resources of the data science and AI community would have been markedly more difficult. These are transformational times for the community as its research becomes ever more embedded in everyday life.
This seems a polite way of saying that, in the scramble to respond to COVID, there were some surprising successes in using AI and data analytics, but in the digital age we should have done much, much better.
Read more: Data science and AI in the age of COVID-19