Seminar Recap: Ground Truth and Biology?

AI + Life Sciences Director Sam Scarpino warns us to question every data point as we struggle to solve disease outbreaks with AI systems.

By: Tyler Wells Lynch

8 FEB, BOSTON—"We are in a lot of trouble," said Sam Scarpino, the AI + Life Sciences director of Northeastern University’s Institute for Experiential AI (EAI). “But,” he added, “I wouldn’t be here if we were in that much trouble.”

He explained: COVID-19, the flu, and SARS are living systems, and to understand them we need their “ground truth,” meaning the essential nature of a problem or, more specifically, the data used to model it. But is that realistic? Will we ever be able to predict the behavior of complex living systems? Speaking before a full house at the kickoff of EAI’s Expeditions in Experiential AI spring seminar series, Scarpino argued that this is perhaps the biggest open question in science. The challenge, he went on, is immense, but it’s not insurmountable.

What Are Data?

Merriam-Webster defines data as “factual information used as a basis for reasoning, discussion, or calculation.” But, as Scarpino warned, “as soon as you say something is data you’ve stopped worrying about whether or not it’s true. You’ve already decided that it’s essentially indistinguishable from reality.”

So how can one find ground truth in a complex system? Scarpino noted that most scientists will tell you that the predictability of a system depends only on the quality and quantity of the data: the better the data, the better the predictability. “How do we know that the data we’re collecting and operating on are actually meaningful and representative of the thing that we care about?” Scarpino asked. “In short,” he answered, “we don’t.”

Can We Predict the Flu?

For many, the Holy Grail of the life sciences is the ability to predict disease outbreaks. But scientists are people. Because the process of collecting data is fraught with bias and inaccuracy, so, too, are the machine learning models scientists use to predict complex systems.

To get around the problem, Scarpino and one of his research partners decided to go back to the drawing board. Instead of debating which model to choose, they began with a more fundamental question: How much information is contained within a time series? Does a model exist that could theoretically be used to predict what’s happening during, for example, an influenza outbreak in Texas?

When in Doubt, Mathematize!

Scarpino and his research partner turned to a tool called permutation entropy, which measures how unpredictable the ordering of values in a time series is. With it, they were able to quantify the overall predictability of a known disease outbreak, measuring it against a completely unpredictable, random set (white noise) and an entirely predictable set (a sine wave). Their conclusion? The more data you have, the less information you have in terms of prediction. “This,” he said, “is a problem.”

“I have yet to find a model that you can write down on a piece of paper that generates this kind of behavior,” Scarpino said, “but it's all over the place in living systems.”
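For readers curious about what a permutation entropy calculation looks like, here is a minimal Python sketch. It is not the code Scarpino and his research partner used. The function name `permutation_entropy` and the `order` and `delay` parameters are illustrative assumptions, and the sketch uses only the two synthetic reference series mentioned in the talk (white noise and a sine wave), not real outbreak data.

```python
import math
import random
from collections import Counter


def permutation_entropy(series, order=3, delay=1):
    """Normalized permutation entropy, between 0 (predictable) and 1 (random).

    `order` (window length) and `delay` (spacing between sampled points)
    are illustrative defaults, not values taken from the talk.
    """
    n_windows = len(series) - (order - 1) * delay
    if order < 2 or n_windows < 1:
        raise ValueError("series is too short for this order and delay")

    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # Ordinal pattern: window positions sorted by value
        # (ties are broken by position because sorted() is stable).
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1

    # Shannon entropy of the pattern frequencies, in bits.
    probs = [count / n_windows for count in patterns.values()]
    entropy = -sum(p * math.log2(p) for p in probs)

    # Divide by the maximum possible entropy, log2(order!).
    return entropy / math.log2(math.factorial(order))


# Synthetic reference series (assumed lengths and period, for illustration).
rng = random.Random(0)
n = 2000
white_noise = [rng.random() for _ in range(n)]
sine_wave = [math.sin(2 * math.pi * t / 50) for t in range(n)]

pe_noise = permutation_entropy(white_noise, order=4)
pe_sine = permutation_entropy(sine_wave, order=4)
print(f"white noise: {pe_noise:.3f}")
print(f"sine wave:   {pe_sine:.3f}")
```

In this sketch the white noise scores close to 1 and the sine wave scores far lower, which gives the two ends of the scale against which a real time series, such as weekly case counts, could be placed.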

Why does nature behave like this? Evolution. The mandate for life scientists is not only to model ground truth but also to make their models robust to evolution. For Scarpino, the only way we’re going to make advances in core AI is “by improving the way we mathematize living systems.”

The Way Forward: Humans in the Loop

The complexity of living systems results in ground truths that are context-dependent and always evolving. Pathogens interact with other pathogens, driving infection rates in unpredictable ways. People change their behavior in response to outbreaks. Mutations occur that bestow evolutionary advantages. And on top of all that, traditional modeling approaches like neural networks are only capable of memorizing resistance mutations within the evolutionary tree. They are unable to account for the clustering of mutations that occur over evolutionary time scales. Throw in bias, mislabeling, information silos, and transformation errors, and you begin to see how the project of getting at the ground truth is a mighty task.

To put it in Scarpino’s terms, the only approach to understanding and predicting complex biological systems is to “keep humans in the loop.” Recontextualize data. De-silo information. Couple human or laboratory research systems with AI systems in a way that iterates and adapts to the ground truth in real time. And acknowledge the fact that a lot of what we ultimately care about is subjective in nature—things like fairness, transparency, justice, reliability, accountability.

“We know that when you have humans or labs in the loop you can dramatically improve the process of evaluating models, suggesting new experiments, and trying to understand bias,” Scarpino said.

Scarpino believes EAI and Northeastern are uniquely positioned to help advance both AI and the life sciences. Why? Because the experiential approach is about gathering key experts from ethics, life science, healthcare, data science, and other disciplines under one roof to work together in a collaborative setting. By reimagining the traditional framework as one of teaming AI systems with human supervisors, we stand the best chance of pulling the sword from the stone—and gaining that elusive ground truth.

You can watch a replay of Sam’s talk and flip through his slides here. Learn about Scarpino’s research vision for AI and life sciences at the Institute for Experiential AI, which seeks to bridge the gap between wet and dry labs and advance the state of the art in AI.