The cost of drug discovery—the process of R&D that brings new drugs to patients to address unmet medical needs—has grown exponentially over the past few decades. It takes 10-12 years and ~$2 billion to successfully discover and develop a new drug when taking into account all of the failures along this long and complex development pathway. At the same time the returns on drug sales are under pressure.
The days of blockbuster drugs that generate more than $1 billion in sales per year seem to be receding as pressure on drug pricing increases and as personalized medicine (i.e., the better understanding of which patients will benefit most from which drugs) is, appropriately, reducing market sizes. There is a continuing need for novel technology to help streamline drug discovery, reduce time and cost, and also, importantly, to accelerate the delivery of new therapies to patients who sometimes desperately need them.
To achieve significant improvement, every step in the drug discovery process needs to be innovated. For the purposes of this article, I’ll focus on the preclinical stages of drug discovery, from drug target discovery and validation to the search for, and optimization of, molecules that can interact with drug targets to modulate the diseases they enable. Testing in humans in clinical trials to demonstrate safety and efficacy, and even the regulatory approval process itself, are also candidates for technology solutions where artificial intelligence (AI) can play, and is already playing, a role.
Where data are available, AI has a huge role to play here. In the past 20 years, the development of high throughput biology techniques has created large data sets of biological information through genomics, proteomics, metabolomics, etc. (AKA the multi-“omics”). Although heralded as a breakthrough at the time, these data sets in fact generated a glut of information that was almost impossible to decipher in a systematic way. Then modern AI came onto the stage and, with billions of investment dollars, has started to accelerate our ability to collapse the dimensionality of multi-omics into understandable correlations that identify novel drug targets for diseases at will and at an unprecedented rate.
But that creates a new bottleneck: AI-discovered drug targets are at this point largely identified as long strings of symbols that describe the sequence of building blocks that make up proteins (that is “letters” that define their nucleic acid bases and corresponding amino acid sequences). Modern drug discovery demands that we know how those long strings fold to create three dimensional structures that mediate the function of those drug targets, and how best to design drugs to interact with them.
Physical experimental techniques to determine structure are slow and expensive. X-ray crystallography, NMR, and more recently cryo-EM are the gold standards for determining structure, but they take an average of six months and $50-$250K+ to determine just one structure at a time. However, they have been around for a long time, and progressive structure determination by these physical methods, combined with careful forethought in curating publicly accessible databases, has ultimately created a large repository of data that correlates protein sequence to 3D structure.
Enter AI again. AI-based structure prediction has progressed rapidly over the past few years, with incredible results: DeepMind’s AlphaFold 2 now routinely produces virtual structures of all known proteins with a high degree of accuracy. Although more work is still needed here, in principle the bottleneck shifts again.
Now that we have the structures of the drug target proteins, whether physical or virtual, the next step is to find chemical compounds that bind to them and modulate the disease states they cause. High throughput screening of large compound libraries, using expensive automation that would fill a large lab, and chemical combinatorial libraries of ~2 million compounds were considered a breakthrough just before the turn of the century. They have since evolved into DNA-encoded library technology, where one technician can physically screen billions of molecules in a few days.
Even so, this is still only screening a very small portion of the chemical diversity of drug-like molecules, which is estimated to be ~10^60 molecules. AI knocks at the door again. New technology platforms using various AI tools have been created that can screen trillions of commercially available and/or virtually generated compounds to greatly expand our ability to find novel compounds against drug target structures.
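To put that scale gap in numbers, here is a minimal arithmetic sketch using the round figures quoted above. The library sizes are simplified to single representative values (2 million, one billion, one trillion), which is an assumption made purely for illustration, and the 10^60 figure is itself only an order-of-magnitude estimate.

```python
import math

# Order-of-magnitude estimate of drug-like chemical space quoted in the text.
CHEMICAL_SPACE = 10**60

# Representative library sizes; the exact values are illustrative assumptions.
LIBRARY_SIZES = {
    "combinatorial library (~2 million)": 2 * 10**6,
    "DNA-encoded library (billions)": 10**9,
    "AI virtual screen (trillions)": 10**12,
}


def log10_coverage(n_screened: int, space: int = CHEMICAL_SPACE) -> float:
    """Return log10 of the fraction of chemical space that was screened."""
    return math.log10(n_screened) - math.log10(space)


for name, size in LIBRARY_SIZES.items():
    print(f"{name}: about 10^{log10_coverage(size):.1f} of chemical space")
```

Even the trillion-compound screen covers a fraction of the estimated space that is dozens of orders of magnitude below one, which is why the choice of what to screen matters so much.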
Notice I have not been calling these compounds “drugs” or “drug candidates.” The search for compounds that bind drug targets is just the beginning of the next stage in drug discovery, where these compound “hits” have to be synthesized and validated in multiple biological assays in a lab. Then begins a long and expensive process of optimizing these hits into leads and ultimately into drug candidates ready for preclinical and human testing.
Binding to the drug target is just one of 15–20 chemical parameters that need to be optimized in a final drug candidate (e.g., potency, selectivity, solubility, permeability, toxicity, pharmacokinetics, etc.). This process generally takes 3–5 years (~26% of the development timeline of the full drug discovery process mentioned above).
It’s also the starting point of significant failure in the process of drug discovery where repetitive design-make-test-analyze (DMTA) cycles produce compounds that have one or more of many potential fatal flaws that may ultimately result in not being able to identify a drug candidate at all. This happens in 69% of drug programs at the early discovery stage.
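The point that a compound must clear many parameters at once, and that a single miss can be fatal, can be made concrete with a small sketch. Everything here is hypothetical: the parameter names, the acceptance windows, and the example profile are placeholders chosen for illustration, not real project criteria.

```python
from typing import Dict, List, Tuple

# Hypothetical acceptance windows (min, max) for three of the parameters
# named in the text. The numbers are placeholders, not real criteria.
CRITERIA: Dict[str, Tuple[float, float]] = {
    "potency_nM": (0.0, 100.0),
    "selectivity_fold": (50.0, float("inf")),
    "solubility_uM": (10.0, float("inf")),
}


def fatal_flaws(profile: Dict[str, float]) -> List[str]:
    """Return the parameters that are unmeasured or outside their window."""
    flaws = []
    for name, (low, high) in CRITERIA.items():
        value = profile.get(name)
        if value is None or not (low <= value <= high):
            flaws.append(name)
    return flaws


# A made-up hit: potent and soluble, but not selective enough.
hit = {"potency_nM": 12.0, "selectivity_fold": 8.0, "solubility_uM": 40.0}
print(fatal_flaws(hit))
```

A real program would track 15 to 20 such parameters, so the chance that at least one falls outside its window in any given DMTA cycle is correspondingly higher.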
The iterative DMTA cycle in drug discovery is still very much a manual process of synthesis and testing and, unlike all of the steps described earlier, does not have access to large, well-curated publicly available databases to build AI models on. This is due to a mix of reasons ranging from data confidentiality to a historic lack of consistency in how manual synthesis and testing are conducted and reported, leading to a lack of reproducibility. Lastly, these data points are expensive to produce by current physical methods.
So now we are at the next major bottleneck in the discovery process. AI isn’t as well positioned here as it was for the previous bottlenecks, because the data aren’t readily available, and generative AI techniques are challenged by the sheer size of the chemical diversity that needs to be searched without good starting points.
For AI to play a significant role in breaking the molecular discovery bottleneck, it will have to be combined with new ways of rapidly generating smaller, highly accurate and supervised data sets. The “lab of the future” has been envisioned extensively over the past few years as ranging from high throughput experimental tools that augment scientists to fully automated robotic “human-out-of-the-loop” laboratories. Whichever paradigm is pursued, for small molecule discovery, chemistry and biology automation will need to be integrated so that rapid synthesis and testing create data that feed directly into AI systems, which predict in real time the next generation of analogs to make and test.
At its most basic, this requires three core integrated components: AI that predicts which analogs to make next, automated chemistry to synthesize them, and automated biology to test them.
So where are we with these components? Novel tools that use interactive human knowledge as “rules” in neurosymbolic AI are already showing promise in highly accurate molecular property prediction, and in more or less real time to facilitate rapid design cycles. The biology automation requirement has largely been addressed by the development of high throughput systems over the past three decades.
Robust and reliable automated chemistry has, however, lagged due to the various material states of chemicals and reactions (e.g., liquids, solids, gases, viscous gels, harsh corrosives, hazardous reagents, etc.). General automated synthesis platforms are only now beginning to emerge, and they offer the potential of greatly expediting chemical synthesis over the classically labor-intensive manual methods that are used today.
High throughput experimentation (HTE) using commercial liquid handlers and novel inkjet printing is accelerating reaction screening and optimization at the microscale, while automated multistep synthesizers can produce compounds in the larger quantities required for escalating experimental validation. Successfully achieving this cutting-edge combination of AI and automation will create an exciting opportunity to greatly accelerate the process of molecular discovery, effectively removing the next big bottleneck in drug discovery: the chemistry bottleneck.
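The closed loop described above, in which each round of synthesis and testing feeds a model that proposes the next analog, can be sketched in a few lines. This is a toy under loud assumptions: the “molecule” is a single hypothetical design variable, the assay is a made-up function standing in for automated synthesis and testing, and the model is a simple nearest-neighbor lookup rather than any real property-prediction method.

```python
from typing import Callable, Dict, List


def run_dmta(
    candidates: List[float],
    assay: Callable[[float], float],
    seed: List[float],
    cycles: int,
) -> Dict[float, float]:
    """Toy design-make-test-analyze loop over a 1-D design variable."""
    # Make and test the starting points so the model has some data.
    data = {x: assay(x) for x in seed}
    for _ in range(cycles):
        untested = [x for x in candidates if x not in data]
        if not untested:
            break

        # Design: predict each untested candidate from its nearest tested
        # neighbor (a stand-in for a real property-prediction model).
        def predict(x: float) -> float:
            nearest = min(data, key=lambda t: abs(t - x))
            return data[nearest]

        pick = max(untested, key=predict)
        # Make and test: in a real lab, automated synthesis and assay.
        data[pick] = assay(pick)
        # Analyze: the result is now in `data`, so the next round uses it.
    return data


# Hypothetical assay whose score peaks at a design value of 7.
results = run_dmta(
    candidates=[float(i) for i in range(11)],
    assay=lambda x: -((x - 7.0) ** 2),
    seed=[0.0, 10.0],
    cycles=4,
)
best = max(results, key=results.get)
print(f"best design value so far: {best} (score {results[best]})")
```

The sketch is purely greedy; a practical system would also need to balance exploring untested regions against exploiting the current best guess, and would optimize many properties at once.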
Parallel advances in computational chemistry may well help this process along. Increasing access to high performance computing is making quantum mechanics and physics-based simulation of virtual drug candidates faster and cheaper. As the accuracy of these virtual simulations approaches that observed in real-world experiments, an intriguing possibility may be in our future: complementing wet lab experiments with simulated data (similar to so-called “deep fake” data in the image processing industry) to help AI models make predictions.
Another area in which AI will have an impact on the bottleneck of molecular discovery is the application of large language models (LLMs). An AI-driven automated lab will require extensive experimental planning, from the synthetic design of target molecules to designing and implementing automated procedures for making and testing them.
This requires a significant amount of coordinated expertise in AI, computational chemistry and informatics, medicinal chemistry, and biology, all working together in rapid iterative sprints toward a drug candidate. While LLMs such as ChatGPT have shown their potential in using natural language to access data from the internet, large chemistry models (LCMs) are now being envisioned that will be capable of taking huge amounts of complex historical information on drug target-specific experimental procedures, combined with the AI and automated data generation of the lab of the future, to facilitate automated complex planning and execution based on simple requests.
In a similar way to interacting with ChatGPT, a scientist would be able to ask an integrated LCM/AI/lab automation platform: “I want a drug candidate that binds to target A with potency X but does not bind to target B which would cause toxicity.” The LCM then designs and implements the complex set of synthesis and testing experiments that will attempt to answer that request, or at least take an iterative step in that direction and make a recommendation as to what to do next.
Given the current excitement around LLMs and the rate at which this technology is progressing, together with a new future of AI and automation-driven research, one could envision molecular discovery becoming accessible in the relatively near future to a much wider audience of molecular “designers” asking increasingly challenging questions. Highly interactive AI and data generation tools like these will exponentially leverage human ingenuity.
Imagine how that would drive innovation in the future.
About the Author
Nathan Collins is the cofounder and head of Strategic Alliances and Development at Synfini, Inc., a pioneer in agile chemistry whose advanced, automated molecular discovery platform is transforming drug discovery.