
Q&A with Sam Scarpino and Giulia Menichetti on AI, Drug Discovery, and Network Science

March 28, 2024

Giulia Menichetti and Sam Scarpino, experts in the life sciences, data sets, AI, and network science, recently shared their perspectives on advances at the intersection of AI and drug discovery, the complexity of living systems, and much more in a special webinar hosted by the Institute for Experiential AI. Here, the two leaders answer a selection of the many attendee questions.

When building datasets for AI, how do you select or narrow down which ones to include? With the immense variety and volume of publicly available datasets, one could easily spend millions of dollars training models and still not end up with a useful AI tool. Ideally, the data included would be limited in size yet carry enough information to represent the breadth of relevant living systems.

One of the biggest “open secrets” in AI is that data quality almost always ends up being the biggest barrier to entry and to improved model performance. For example, de-duplication of documents has been shown to reduce both the reliance on memorized text and training costs for large language models [1], and recent estimates suggest that data labeling is already a multi-billion-dollar industry [2]. In the context of living systems, most data labeling is done as a byproduct of observation and/or experimentation, which can increase costs and complicate reusability. Additionally, the complexity of the data and a lack of aligned incentives can create substantial barriers to storing and sharing life sciences data. Even when usable, labeled data exist, it is still not obvious how to select data effectively for model training. This is where methods like the network-based sampling approaches presented in AI-Bind come into play [3]. Essentially, the goal is to use information contained in the data, along with expert knowledge, to select the elements that will most efficiently and effectively inform the AI. Additionally, we can use approaches from network science both to make smarter investment decisions around gathering/compiling data and to select existing data for training in a way that minimizes the effects of class imbalance and label homogeneity.
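To make the idea of network-aware data selection concrete, here is a minimal sketch (not the actual AI-Bind method) of degree-capped sampling on hypothetical protein-ligand annotation pairs. Public binding databases are heavily skewed toward a few well-studied proteins; capping the number of annotations per node reduces the degree bias a model could exploit as a shortcut. All names and data below are illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical toy data: (protein, ligand) annotation pairs, as might be
# drawn from a public binding database. Note the skew: protein "P1"
# accounts for most of the annotations.
pairs = [("P1", "L1"), ("P1", "L2"), ("P1", "L3"), ("P1", "L4"),
         ("P2", "L1"), ("P2", "L5"), ("P3", "L6")]

def degree_capped_sample(pairs, max_per_node, seed=0):
    """Select training pairs so that no protein contributes more than
    max_per_node annotations, flattening the degree distribution of the
    resulting training set."""
    random.seed(seed)
    shuffled = pairs[:]
    random.shuffle(shuffled)
    counts = defaultdict(int)
    selected = []
    for protein, ligand in shuffled:
        if counts[protein] < max_per_node:
            counts[protein] += 1
            selected.append((protein, ligand))
    return selected

sample = degree_capped_sample(pairs, max_per_node=2)
```

With a cap of 2, the over-annotated protein contributes only two pairs, while the sparsely annotated proteins keep everything they have.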

What are the key components (e.g., math-theories or data structures) that network science has to help AI models to generalize for future data?

The field of Network Science shares many overlapping mathematical/computational approaches with artificial intelligence. These include methods like message passing [4,5], sparsification [6,7], and spectral analysis [8,9]. Network Science as a field is concerned with the structure and function of interconnected systems, which of course includes AI approaches like neural networks and reservoir computing. Both fields also leverage concepts from statistical physics, e.g., symmetry [10,11] and scaling [12,13]. In particular, the universality [14] implied by the existence of scaling laws can be exploited to create better training data (e.g., AI-Bind corrects for biases introduced by the degree distribution of annotations in the original databases when selecting data for the training set, which improves the generalizability of the resulting AI algorithm) and to prove theorems about the performance of AI systems [15]. Despite the overlap, there is not a lot of cross-talk between the two fields, which is something we’re trying to change at Northeastern University (home to both the Network Science Institute and the Institute for Experiential AI). Whether any of these approaches will improve generalizability to future data depends entirely on the system/model generating the data, and this is one of the key open questions we’re studying.
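Message passing, the first shared method named above, can be illustrated in a few lines. The sketch below (not any particular library's API) runs one round of message passing on a small undirected graph: each node mixes its own state with the mean of its neighbors' states, the core operation shared by graph neural networks and classical network-science algorithms such as belief propagation and label spreading. The graph, features, and mixing weight are illustrative assumptions.

```python
import numpy as np

# Adjacency matrix of a 4-node path graph: 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# One scalar feature per node; a unit of "signal" sits on node 0.
X = np.array([[1.0], [0.0], [0.0], [0.0]])

def message_pass(A, X, alpha=0.5):
    """One step: X' = (1 - alpha) * X + alpha * D^{-1} A X,
    i.e., blend each node's state with the mean of its neighbors'."""
    deg = A.sum(axis=1, keepdims=True)   # node degrees
    neighbor_mean = (A @ X) / deg        # average over neighbors
    return (1 - alpha) * X + alpha * neighbor_mean

X1 = message_pass(A, X)  # signal begins to diffuse from node 0 to node 1
```

Iterating this step propagates information along the graph's structure, which is exactly why the same primitive appears in both spreading-process models on networks and in graph neural network layers.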

How do you see AI for life sciences and biochemical research becoming less data limited? Where do you predict data will come from? For example, large consortium style projects, or aggregate databanks like the PDB?

It’s not clear that life sciences/biochemical research will ever become less data limited. However, that doesn’t mean we can’t do a better job of building data systems capable of storing and sharing data more effectively, funding the basic science research that generates data, and linking AI to experimental design. There will certainly be a role for existing and new large consortium projects, but also for all the exciting research happening in labs around the world. Perhaps what’s really needed is a life sciences data moonshot from the federal government!

Can AI generate a genome and then predict body structure? What about adding compounds to the simulation to understand the effects of that molecule on the individual?

There is no AI (nor any technology, for that matter) capable of generating a genome de novo and using that genome to make accurate predictions about complex phenotypes like body structure [16]. Even for simple traits like sickle cell in humans [17] or Mendel’s wrinkled seeds [18], predicting phenotype from genotype sometimes fails due to environmental factors and/or rare mutations. Books have been written on what we do and don’t know about predicting phenotype from genotype, but, if you’re new to these ideas, a great place to start is Ogbunugafor and Edge’s review published on the 25th anniversary of the movie Gattaca [19]. Setting aside the complexity of predicting phenotypes from genotypes, we still don’t know the function of most genes (much less what’s going on with all that genomic dark matter [20]) in any genome, including those of humans [21], flies [22], and E. coli [23]. Consider that in the field of synthetic biology, researchers only recently succeeded in generating novel genes and introducing them into living organisms [24], and that for some minimal viable bacterial genomes we don’t know the function of 30% of the remaining genes [25]. Finally, for many small molecules, biologics, etc., we are able to make coarse predictions about how an organism will respond, e.g., to pharmaceuticals, but this understanding barely scratches the surface of what’s actually going on inside the organism. As a result, one of the biggest opportunities for AI in areas like drug discovery and development is models able to predict toxicity and other side effects [26]. If you’re looking for a PhD thesis idea, using AI to study genotype-to-phenotype maps for complex traits would be a great one to consider.

What are the main directions that you are working towards that utilize graph machine learning for further mining the phenotypes and pathways associated with network dynamics in living systems?

Graph machine learning has shown great promise for studying living systems [27], especially in areas like drug discovery [28]. Two directions seem particularly promising: 1) incorporating the kinds of symmetries typically associated with living systems into the architecture of graph neural networks, and 2) assembling the kind of representative, diverse data sets likely needed to train more generalizable models.
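The simplest symmetry in direction 1) is permutation invariance: relabeling the nodes of, say, a protein interaction network should not change a graph-level prediction. The hedged sketch below shows the standard way this is built into a graph neural network readout, using a sum over node embeddings; the embeddings here are random placeholders, not outputs of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 3))  # placeholder embeddings: 5 nodes, 3 features each

def readout(H):
    """Permutation-invariant graph-level representation: summing over the
    node axis makes the output independent of node ordering."""
    return H.sum(axis=0)

# Relabel (permute) the nodes and confirm the graph-level vector is unchanged.
perm = rng.permutation(5)
same = np.allclose(readout(H), readout(H[perm]))
```

Sum, mean, and max readouts all share this property; richer symmetries of living systems (e.g., repeated motifs or module structure) require correspondingly richer equivariant architectures, which is what makes this direction a research question rather than a solved problem.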

Thanks for the great discussion. I'm curious how you reconcile the astonishing complexity of biological networks with, for example, the seemingly straightforward impact of dietary sugar and salt alone on diseases like diabetes and hypertension.

While it’s true that dietary sugar and salt influence the probability of developing conditions like diabetes and hypertension, the relationship between these factors and disease is far from straightforward [29,30]. Even the broader effects of diet on these conditions aren’t settled science [31,32]. As with essentially all phenotypes, diabetes and hypertension are caused by interactions between the genome and the environment. One hope for AI systems is that they can extract meaningful relationships from high-dimensional, but sparse, data on living systems, leading to an improved understanding of diseases like diabetes and hypertension. Such understanding, and the treatments it may lead to, is of course the holy grail when it comes to AI and health, but we’ve got a long way to go!


1 Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2021). Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.

2 Grand View Research, Industry Analysis: Data Collection and Labeling Market.

3 Chatterjee, A., Walters, R., Shafi, Z., Ahmed, O. S., Sebek, M., Gysi, D., ... & Menichetti, G. (2023). Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nature Communications, 14(1), 1989.

4 Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., ... & Battaglia, P. (2018). Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830.

5 Newman, M. E. J. (2023). Message passing methods on complex networks. Proceedings of the Royal Society A, 479(2270), 20220774.

6 Zhou, X., Zhang, W., Xu, H., & Zhang, T. (2021). Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3599-3608).

7 Zhou, X., Zhang, W., Xu, H., & Zhang, T. (2021). Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3599-3608).

8 Bo, D., Wang, X., Liu, Y., Fang, Y., Li, Y., & Shi, C. (2023). A survey on spectral graph neural networks. arXiv preprint arXiv:2302.05631.

9 Barrat, A., Barthelemy, M., Pastor-Satorras, R., & Vespignani, A. (2004). The architecture of complex weighted networks. Proceedings of the National Academy of Sciences, 101(11), 3747-3752.

10 Aguirre, L. A., Lopes, R. A., Amaral, G. F., & Letellier, C. (2004). Constraining the topology of neural networks to ensure dynamics with symmetry properties. Physical Review E, 69(2), 026701.

11 MacArthur, B. D., Sánchez-García, R. J., & Anderson, J. W. (2008). Symmetry in complex networks. Discrete Applied Mathematics, 156(18), 3525-3531.

12 Barabási, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509-512.

13 Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.

14 Dodds, P. S., & Rothman, D. H. (2000). Scaling, universality, and geomorphology. Annual Review of Earth and Planetary Sciences, 28(1), 571-610.

15 Zhou, D. X. (2020). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2), 787-794.

16 Although people are trying, e.g., Guo, T., & Li, X. (2023). Machine learning for predicting phenotype from genotype and environment. Current Opinion in Biotechnology, 79, 102853.

17 Royal, C. D., Babyak, M., Shah, N., Srivatsa, S., Stewart, K. A., Tanabe, P., ... & Asnani, M. (2021). Sickle cell disease is a global prototype for integrative research and healthcare. Advanced Genetics, 2(1), e10037.

18 Rayner, T., Moreau, C., Ambrose, M., Isaac, P. G., Ellis, N., & Domoney, C. (2017). Genetic variation controlling wrinkled seed phenotypes in Pisum: how lucky was Mendel?. International Journal of Molecular Sciences, 18(6), 1205.

19 Ogbunugafor, C. B., & Edge, M. D. (2022). Gattaca as a lens on contemporary genetics: marking 25 years into the film’s “not-too-distant” future. Genetics, 222(4), iyac142.

20 Pavlopoulos, G. A., Baltoumas, F. A., Liu, S., Selvitopi, O., Camargo, A. P., Nayfach, S., ... & Kyrpides, N. C. (2023). Unraveling the functional dark matter through global metagenomics. Nature, 622(7983), 594-602.

21 Amaral, P., Carbonell-Sala, S., De La Vega, F. M., Faial, T., Frankish, A., Gingeras, T., ... & Salzberg, S. L. (2023). The status of the human gene catalogue. Nature, 622(7981), 41-47.

22 Mohr, S. E., Kim, A. R., Hu, Y., & Perrimon, N. (2023). Finding information about uncharacterized Drosophila melanogaster genes. Genetics, 225(4), iyad187.

23 Fang, X., Lloyd, C. J., & Palsson, B. O. (2020). Reconstructing organisms in silico: genome-scale models and their emerging applications. Nature Reviews Microbiology, 18(12), 731-743.

24 Camellato, B. R., Brosh, R., Ashe, H. J., Maurano, M. T., & Boeke, J. D. (2024). Synthetic reversed sequences reveal default genomic states. Nature, 1-8.

25 Hutchison III, C. A., Chuang, R. Y., Noskov, V. N., Assad-Garcia, N., Deerinck, T. J., Ellisman, M. H., ... & Venter, J. C. (2016). Design and synthesis of a minimal bacterial genome. Science, 351(6280), aad6253.

26 Perez Santin, E., Rodríguez Solana, R., González García, M., García Suárez, M. D. M., Blanco Díaz, G. D., Cima Cabal, M. D., ... & Lopez Sanchez, J. I. (2021). Toxicity prediction based on artificial intelligence: A multidisciplinary overview. Wiley Interdisciplinary Reviews: Computational Molecular Science, 11(5), e1516.

27 Wang, Z., Ioannidis, V. N., Rangwala, H., Arai, T., Brand, R., Li, M., & Nakayama, Y. (2022, August). Graph neural networks in life sciences: Opportunities and solutions. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 4834-4835).

28 Chen, D., O’Bray, L., & Borgwardt, K. (2022, June). Structure-aware transformer for graph representation learning. In International Conference on Machine Learning (pp. 3469-3489). PMLR.

29 Cobelli, C., Dalla Man, C., Sparacino, G., Magni, L., De Nicolao, G., & Kovatchev, B. P. (2009). Diabetes: models, signals, and control. IEEE Reviews in Biomedical Engineering, 2, 54.

30 Narkiewicz, K. (2006). Obesity and hypertension—the issue is more complex than we thought. Nephrology Dialysis Transplantation, 21(2), 264-267.

31 Forouhi, N. G. (2023). Embracing complexity: making sense of diet, nutrition, obesity and type 2 diabetes. Diabetologia, 66(5), 786-799.

32 Williams, S. M. (2010). Endophenotypes, heritability, and underlying complexity in hypertension. American journal of hypertension, 23(8), 819-819.

33 Dowell, R. D., Ryan, O., Jansen, A., Cheung, D., Agarwala, S., Danford, T., ... & Boone, C. (2010). Genotype to phenotype: a complex problem. Science, 328(5977), 469-469.