Stability AI, the venture-backed startup behind the Stable Diffusion text-to-image AI system, is funding a massive effort to apply AI to the frontiers of biotechnology. Called OpenBioML, the effort’s first projects will focus on machine learning-based approaches to DNA sequencing, protein folding and computational biochemistry.
The company describes OpenBioML as an “open research lab” that aims to explore the intersection of AI and biology in a setting where students, professionals and researchers can participate and collaborate, according to Stability AI CEO Emad Mostaque.
“OpenBioML is one of the independent research communities supported by Stability,” Mostaque told TechCrunch in an email interview. “Stability seeks to develop and democratize AI, and through OpenBioML, we see an opportunity to advance the state of the art in science, health and medicine.”
Given the controversy surrounding Stable Diffusion – Stability AI’s AI system that generates art from textual descriptions, similar to OpenAI’s DALL-E 2 – one might understandably be wary of Stability’s first foray into AI in healthcare. The startup has taken a laissez-faire approach to governance, allowing developers to use the system however they wish, including for celebrity deepfakes and pornography.
Stability AI’s ethically questionable decisions to date aside, machine learning in medicine is a minefield. Although the technology has been successfully applied to diagnose conditions such as skin and eye diseases, among others, research has shown that the algorithms can develop biases leading to poorer care for some patients. An April 2021 study, for example, found that statistical models used to predict suicide risk in mental health patients worked well for white and Asian patients, but poorly for black patients.
OpenBioML is starting with safer territory, for good reason. Its first projects are:
BioLM, which seeks to apply natural language processing (NLP) techniques to the fields of computational biology and chemistry
DNA-Diffusion, which aims to develop an AI capable of generating DNA sequences from text prompts
LibreFold, which seeks to increase access to AI protein structure prediction systems similar to DeepMind’s AlphaFold 2
Each project is led by independent researchers, but Stability AI provides support in the form of access to its AWS-hosted cluster of over 5,000 Nvidia A100 GPUs to train the AI systems. According to Niccolò Zanichelli, an undergraduate computer science student at the University of Parma and one of OpenBioML’s lead researchers, that is enough processing and storage capacity to train up to 10 different AlphaFold 2-like systems in parallel.
“A lot of computational biology research is already open source. However, much of it happens at the level of a single lab and is therefore usually constrained by insufficient computing resources,” Zanichelli told TechCrunch via email. “We want to change that by encouraging large-scale collaborations and, with the support of Stability AI, backing those collaborations with resources that only the largest industrial labs have access to.”
Generating DNA sequences
Of OpenBioML’s current projects, DNA-Diffusion – led by the lab of pathology professor Luca Pinello at Massachusetts General Hospital & Harvard Medical School – is perhaps the most ambitious. The goal is to use generative AI systems to learn and apply the rules of “regulatory” DNA sequences, the segments of nucleic acid molecules that influence the expression of specific genes within an organism. Many diseases and disorders are the result of misregulated genes, but science has yet to discover a reliable process for identifying – let alone altering – these regulatory sequences.
DNA-Diffusion proposes to use a type of AI system known as a diffusion model to generate cell type-specific regulatory DNA sequences. Diffusion models – which underpin image generators like Stable Diffusion and OpenAI’s DALL-E 2 – create new data (e.g. DNA sequences) by learning how to destroy and recover many samples of existing data. As they are fed samples, the models get better at recovering the data they previously destroyed, and can then apply that denoising process to generate entirely new data.
Picture credits: OpenBioML
“Diffusion has had widespread success in multimodal generative models, and it is now beginning to be applied to computational biology, for example for the generation of novel protein structures,” Zanichelli said. “With DNA-Diffusion, we are now exploring its application to genomic sequences.”
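The “destroy” half of that training recipe can be illustrated with a toy sketch. This is not DNA-Diffusion’s actual code – the DNA string, the one-hot encoding, and the noise schedule below are illustrative assumptions – but it shows the forward process a diffusion model learns to reverse:

```python
import numpy as np

# Toy forward-diffusion process: repeatedly corrupt a one-hot-encoded DNA
# string with Gaussian noise. A diffusion model is trained to reverse these
# steps; this sketch shows only the corruption (forward) side.

rng = np.random.default_rng(0)

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array over A, C, G, T."""
    alphabet = "ACGT"
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, alphabet.index(base)] = 1.0
    return out

def forward_diffuse(x0, timesteps=50, beta=0.05):
    """Progressively mix the clean signal with Gaussian noise."""
    x = x0.copy()
    for _ in range(timesteps):
        noise = rng.normal(size=x.shape)
        x = np.sqrt(1 - beta) * x + np.sqrt(beta) * noise
    return x

x0 = one_hot("ACGTACGT")  # a made-up 8-base sequence
xT = forward_diffuse(x0)
# After enough steps, xT is close to pure noise: the original one-hot
# pattern is largely destroyed, and a trained model would learn to undo this.
print(xT.shape)  # (8, 4)
```

A generator then runs the learned reversal starting from fresh noise, which is how a prompt-conditioned model could emit a novel regulatory sequence.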
If all goes as planned, the DNA-Diffusion project will produce a diffusion model capable of generating regulatory DNA sequences from text instructions such as “A sequence that will activate a gene to its maximum expression level in cell type X” and “A sequence that activates a gene in the liver and heart, but not in the brain.” Such a model could also help interpret the components of regulatory sequences, Zanichelli says, improving the scientific community’s understanding of the role of regulatory sequences in different diseases.
It should be noted that this is largely theoretical. While preliminary research on applying diffusion to protein folding looks promising, it’s still in its infancy, admits Zanichelli – hence the drive to involve the wider AI community.
Predicting protein structures
OpenBioML’s LibreFold, although smaller in scope, is more likely to bear immediate fruit. The project aims to better understand machine learning systems that predict protein structures, as well as ways to improve them.
As my colleague Devin Coldewey explained in his article about DeepMind’s work on AlphaFold 2, AI systems that accurately predict protein shape are relatively new to the scene but transformative in terms of their potential. Proteins comprise sequences of amino acids that fold into shapes to perform different tasks within living organisms. The process of determining the shape an amino acid sequence will fold into was once an arduous, error-prone endeavor. AI systems like AlphaFold 2 changed that; thanks to them, more than 98% of protein structures in the human body are known to science today, along with hundreds of thousands of other structures in organisms like E. coli and yeast.
However, few groups have the engineering expertise and resources to develop this type of AI. DeepMind spent days training AlphaFold 2 on Tensor Processing Units (TPUs), Google’s expensive AI acceleration hardware. And acid sequence training datasets are often proprietary or released under non-commercial licenses.
Proteins folding into their three-dimensional structure. Picture credits: Christoph Burgstedt/Scientific Photo Library/Getty Images
“It’s a shame, because if you look at what the community was able to build on top of the AlphaFold 2 checkpoint published by DeepMind, it’s just amazing,” Zanichelli said, referring to the trained AlphaFold 2 model that DeepMind released last year. “For example, just days after its publication, Minkyung Baek, a professor at Seoul National University, reported a trick on Twitter that allowed the model to predict quaternary structures – something few, if anyone, expected the model to be capable of. There are many more examples like this, so who knows what the wider scientific community might build if it had the ability to train entirely new AlphaFold-like protein structure prediction methods?”
Building on the work of RoseTTAFold and OpenFold, two ongoing community efforts to replicate AlphaFold 2, LibreFold will facilitate “large scale” experiments with various protein folding prediction systems. Led by researchers from University College London, Harvard and Stockholm, LibreFold’s goal will be to better understand what systems can accomplish and why, according to Zanichelli.
“LibreFold is at its heart a project for the community, by the community. We may soon start releasing the first deliverables, or it could take a lot longer,” he said. “That said, my hunch is that the former is more likely.”
Applying NLP to biochemistry
Longer-term is OpenBioML’s BioLM project, which has the vaguer mission of “applying language modeling techniques derived from NLP to biochemical sequences.” In collaboration with EleutherAI, a research group that has published several open-source text generation models, BioLM hopes to train and publish new “biochemical language models” for a range of tasks, including protein sequence generation.
Zanichelli cites Salesforce’s ProGen as an example of the kind of work BioLM could undertake. ProGen treats amino acid sequences like words in a sentence. Trained on a dataset of over 280 million protein sequences and associated metadata, the model predicts the next set of amino acids from the ones that precede it, the way a language model predicts the end of a sentence from its beginning.
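That next-residue idea can be sketched in miniature. This is emphatically not ProGen, which is a large transformer trained on hundreds of millions of sequences; the toy fragments and the first-order Markov model below are purely illustrative assumptions:

```python
from collections import defaultdict, Counter

# Minimal sketch of "protein sequence as language": a first-order Markov
# model that predicts the next amino acid from the previous one, the same
# way a language model predicts the next word from context.

def train(sequences):
    """Count amino-acid bigrams to estimate next-residue frequencies."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the residue most frequently observed after `prev`."""
    return counts[prev].most_common(1)[0][0]

toy_data = ["MKTAYIAK", "MKTLLVAA", "MKVAYIAK"]  # made-up fragments
model = train(toy_data)
print(predict_next(model, "M"))  # 'K' – M is followed by K in all three
```

Large protein language models replace the bigram counts with a neural network conditioned on the entire preceding sequence, but the prediction target is the same.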
Nvidia earlier this year released a language model, MegaMolBART, which was trained on a dataset of millions of molecules to search for potential drug targets and predict chemical reactions. Meta also recently trained an NLP model, ESM-2, on protein sequences, an approach the company says allowed it to predict structures for more than 600 million proteins in just two weeks.
Protein structures predicted by the Meta system. Picture credits: Meta
While OpenBioML’s interests are broad (and expanding), Mostaque says they are united by a desire to “maximize the positive potential of machine learning and AI in biology,” in the tradition of open research in science and medicine.
“We seek to give researchers more control over their experimental pipeline for the purposes of active learning or model validation,” Mostaque continued. “We also seek to push the state of the art with increasingly general biotechnology models, as opposed to the specialized architectures and learning goals that currently characterize much of computational biology.”
But — as you’d expect from a VC-backed startup that recently raised over $100 million — Stability AI doesn’t view OpenBioML as a purely philanthropic effort. Mostaque says the company is open to exploring the commercialization of OpenBioML’s technology “when it’s advanced enough and secure enough and the time is right.”