Big Data: Big bias? | Catalog of Bias

Ami Banerjee blogs about the Catalogue of Bias team’s work to document the potential sources of bias in Big Data and Artificial Intelligence.

Last week, the UK government published a code of conduct for artificial intelligence (AI) and data-driven technologies in health. Major investments in big data research and AI are part of the strategy for key stakeholders across the healthcare landscape, from governments to the largest research funders.

No two terms have captured the zeitgeist of growing mountains of medical information and related analytics like “big data” and “artificial intelligence” (AI). Barely a day goes by without another journal article or media story promising new progress in healthcare, fuelling the expectation that my job as a health professional is in real danger from robots.

When terms enter the vernacular, we start to lose control of what the terms mean. Problems may arise as “big data” and “AI” mean different things to different people in different situations. I have previously defined big data by seven V’s (volume, velocity, veracity, variety, volatility, validity and value).

Big data health research includes linking millions of records to better reflect and understand the health needs of populations such as refugees and migrants, as well as linking genomic information to better treat rare diseases and the use of wearables to screen for heart rhythm problems.

AI “aims to mimic human cognitive functions”, and has been used to describe improvements in predicting long-term survival in heart failure patients as well as diagnosing retinal disease.

However, there are potential pitfalls in the use of big data. For example, how we define normal values for laboratory tests is central to how we diagnose and treat diseases, and large datasets are increasingly used to assess clinical outcomes across a range of test values. Sample size is not usually an issue, but there may be other problems. For example, haemoglobin A1c, commonly used to diagnose and monitor diabetes, has been shown to systematically underestimate past glycaemia in African American patients with sickle cell trait. If certain values and certain individuals in large datasets are repeatedly sampled, then a selective reporting bias can suggest differences in normal values which do not exist.
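To make the repeated-sampling point concrete, here is a minimal simulation sketch. It is not drawn from any real dataset: the distribution, the threshold of 6.0, the fivefold retesting, and the function names `simulate_group` and `compare_groups` are all assumptions chosen for illustration. Both groups share the same underlying distribution of test values (in arbitrary units), but in one group high values are recorded several times. For simplicity, repeat records are exact copies of the first value, with no measurement noise.

```python
import random
import statistics


def simulate_group(n_patients, retest_high_values, seed):
    """Return (patient_level_values, record_level_values) for one group."""
    rng = random.Random(seed)
    patients, records = [], []
    for _ in range(n_patients):
        # Same underlying distribution for every group (hypothetical values).
        value = rng.gauss(5.5, 0.5)
        # Assumption: in the retested group, values of 6.0 or more
        # appear five times in the dataset instead of once.
        n_records = 5 if (retest_high_values and value >= 6.0) else 1
        patients.append(value)
        records.extend([value] * n_records)
    return patients, records


def compare_groups(n_patients=10_000):
    """Compare group means using one value per patient versus every record."""
    a_patients, a_records = simulate_group(n_patients, retest_high_values=False, seed=1)
    b_patients, b_records = simulate_group(n_patients, retest_high_values=True, seed=2)
    return {
        "patient_level_gap": statistics.mean(b_patients) - statistics.mean(a_patients),
        "record_level_gap": statistics.mean(b_records) - statistics.mean(a_records),
    }


if __name__ == "__main__":
    gaps = compare_groups()
    print(f"gap using one value per patient: {gaps['patient_level_gap']:.3f}")
    print(f"gap using every record:          {gaps['record_level_gap']:.3f}")
```

Under these assumptions, the gap between the groups is close to zero when each patient counts once, but clearly positive when every record counts, even though the two groups do not truly differ. Counting each individual once, or otherwise accounting for how often each person was tested, is one way to guard against this.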

Big data and AI are fertile new areas for research with lots of industry involvement and funding. There is inevitably a risk of “hot stuff bias”, where fashionable topics may attract less critical scrutiny, and of confirmation bias.

Bias has long been recognised as a problem in health research and its application. Our Catalogue of Bias project is compiling and continuously updating the various types of bias affecting health research. The advent of evidence-based medicine was accompanied by validated tools and checklists which are used to separate the wheat from the chaff in terms of the risk of bias in evidence used to underpin health care decisions. However, aside from commentary pieces, no article or project has looked at all the biases that might affect big data or AI research in healthcare.

Is research using information from big data or AI any different to other methods with respect to vulnerability to biases? More importantly, is the potential for bias increased in this brave new world? Confucius said, “Real knowledge is to know the extent of your own ignorance.” If we know which biases can occur and how we might reduce them, then we can make more sense of evidence in healthcare, whether in treatment or diagnosis.

In the Catalogue of Bias Collaboration, we are all working to document the potential sources of bias in Big Data and AI. I’d be interested in your thoughts.

Ami Banerjee, Associate Professor in Clinical Data Science and Honorary Consultant Cardiologist, UCL

Conflicts of interest: Advisory boards for Pfizer, AstraZeneca and Boehringer Ingelheim and Trustee for South Asian Health Foundation.