AI for Science: Standing at the Turning Point of a Research Paradigm | FreeS Fund Report

FreeS Fund · November 14, 2024

What is AI for Science, and why now?

In the second week of October 2024, the Nobel Prize winners were announced one by one. Notably, the Nobel Prizes in Physics and Chemistry were awarded to scientists who had made groundbreaking contributions at the intersection of artificial intelligence and fundamental science.

For a moment, the internet erupted with quips—

"Does physics even exist anymore?"

"Physics should be spelled PhysiCS!"

...

But the initial shock soon subsided. The general public gradually came to realize that foundational disciplines like physics and mathematics have consistently provided the theoretical groundwork and methodological support for AI's development, while AI's powerful data processing and pattern recognition capabilities are being applied ever more deeply in scientific research. With all the low-hanging fruit already picked, AI is now helping scientists tackle far more challenging problems.

AI for Science (AI4S) has become the norm, and the Nobel Prizes served as an unambiguous signal—humanity has once again arrived at an inflection point in how research is conducted.

In this article, we will explore the following topics:

  • How have human scientific paradigms evolved over time?
  • GPT is still an error-prone guessing machine—can it really be trusted to assist scientific research?
  • What is AI4S, and why is it emerging now?
  • What are the main stages of the research process, and how is AI being embedded into each?
  • Why is AI4S said to have already driven AI-powered drug discovery into its 2.0 era? How does AI pharma 1.0 differ from 2.0?
  • What are the prospects for AI4S? We'll look at chemistry, biology, materials science, and other fields.

We hope this offers fresh perspectives. If you're a researcher, practitioner, or entrepreneur in the AI4S space, you're welcome to reach out to the author, Rui Ma, Partner at FreeS Fund (marui@freesvc.com).


Why Can AI Greatly Accelerate Scientific Research?

The Fifth Paradigm of Science

The evolution of human scientific paradigms follows a spiral of ascent—beginning with the empirical paradigm based on inductive reasoning from observational data, exemplified by Kepler, who discovered the laws of planetary motion through observation and simple mathematical calculation. This was followed by the first-principles-driven theoretical paradigm, represented by Newton, who sought to uncover nature's laws from fundamental truths and describe them with equations. As data volumes grew, we moved into computational paradigms and data-driven paradigms.

While data-driven methods can effectively uncover facts from data, they don't help us understand the underlying causes. And the mathematical equations derived from first principles are often intractable. AI4S, this fifth paradigm that fuses first-principles-driven and data-driven approaches, emerged to address this gap.

A simple equation can help you grasp the fifth paradigm:

𝑋 = 𝑋(𝜃) + 𝜖

Here 𝑋(𝜃) (shown in blue in the original figure) is a theoretical equation describing some physical system. But any theory built from experimental observation necessarily has limits: it cannot perfectly capture physical reality, which is the 𝑋 on the left-hand side (shown in green). Hence there is a term 𝜖, the residual between theory and reality.

This is where AI shines. Simply put, AI can not only help compute this residual but also help solve the theoretical equation 𝑋(𝜃) itself.
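The decomposition above can be illustrated with a minimal sketch. It assumes a toy system that is not in the original article: the theory 𝑋(𝜃) is drag-free free fall, "reality" is free fall with linear air drag plus a little measurement noise, and a low-degree polynomial stands in for the AI model that learns the residual 𝜖. All function names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def theory(t, g=9.81):
    """X(theta): idealized free-fall distance, ignoring air resistance."""
    return 0.5 * g * t**2

def reality(t, g=9.81, k=0.3):
    """Stand-in for the real system: free fall with linear drag (assumed ground truth)."""
    return (g / k) * (t - (1.0 - np.exp(-k * t)) / k)

# Noisy "measurements" of reality
t_train = np.linspace(0.0, 3.0, 40)
x_obs = reality(t_train) + rng.normal(scale=0.01, size=t_train.size)

# epsilon = X - X(theta): the part of reality that the theory misses
residual = x_obs - theory(t_train)

# Learn the residual from data; a degree-4 polynomial stands in for a neural network
coeffs = np.polyfit(t_train, residual, deg=4)

def hybrid(t):
    """Theory plus learned residual."""
    return theory(t) + np.polyval(coeffs, t)

t_test = np.linspace(0.1, 2.9, 25)
err_theory = np.max(np.abs(theory(t_test) - reality(t_test)))
err_hybrid = np.max(np.abs(hybrid(t_test) - reality(t_test)))
print(f"max error, theory only: {err_theory:.3f}")
print(f"max error, theory + learned residual: {err_hybrid:.3f}")
```

In this toy setting the theory alone drifts away from reality as drag accumulates, while the theory plus the learned residual tracks it closely. The point is only the division of labor, not the specific model choice.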

Some might object: "I've used ChatGPT, and sometimes it says things that are completely unreliable." It's true—despite AI's impressive generative capabilities, it remains fundamentally a guessing machine. ChatGPT's language outputs are probabilistic computations, and guessing machines make mistakes. Is that really okay for scientific research?

It is a real concern, but one that can be managed. As we use AI tools to augment our scientific capabilities, we simply need to combine them with scientific validation to filter out the garbage and keep what's useful.

In fact, AI is already being widely applied across scientific disciplines. AI-based algorithms can dramatically improve the efficiency and accuracy of first-principles modeling. By enabling new experimental designs, more accurate and efficient characterization algorithms, and even new experimental instruments, AI can also transform how we conduct experiments.

In mathematics, for instance, mathematicians use computers to assist with calculations, conjecture generation, and proofs. In physics, AI can bridge quantum mechanics and classical coarse-grained models, effectively connecting physical models across different scales. In chemistry, AI is used to design chemical molecules and reactions. In biology, it's used to design biomolecules and drugs. In materials science, AI accelerates the exploration, design, synthesis, and optimization of new materials...

AI4S is becoming one of the core drivers of the technological revolution and the development of new quality productive forces.

The Next Technological Revolution Will Be the Deep Fusion of Digital and Physical Worlds

Looking back at the Industrial Revolution and the Electrical Revolution, most innovations followed from the establishment of macroscopic physical laws—Newtonian mechanics, thermodynamics, Maxwell's equations for electromagnetic fields.

But everything changes when we enter the microscopic world, where macroscopic physical laws may no longer apply. Quantum mechanics, developed specifically to explain microscopic physical behavior, ushered in the third technological revolution. With the birth of quantum mechanics, humanity formally entered the microscopic paradigm. Semiconductor technology advanced rapidly, computers became ubiquitous, internet and mobile internet technologies evolved at breakneck speed, AI kept breaking new ground—innovation gradually shifted from the physical world to the digital world.

However, following Kondratiev waves or the logic of spiral ascent, the next technological revolution may swing back from the digital world to the physical world, or more likely, represent a deep fusion of both. Once innovation needs to happen in the physical world, the measurement, computation, control, and fabrication of microscopic particles—electrons, atoms, molecules—become critically important.

This is precisely where AI can demonstrate its full potential. In the narrow sense, AI4S can study microscopic particles and their interactions—investigating the fundamental laws of the microscopic world, which form the crucial foundation of the physical world. AI4S will propel the next technological revolution.

What do we mean by the microscopic world?

Microscopic, as opposed to macroscopic, generally refers to scales invisible to the naked eye. In physics, the microscopic means atomic scales below a few tenths of a nanometer. In life sciences, it corresponds to the scale of biological macromolecules—roughly several to tens of nanometers. In materials science, it refers to material diameters below 10 nanometers (one nanometer equals one-millionth of a millimeter).

Consider an example.

When carbon atoms arrange themselves in flat sheets with a honeycomb lattice, you get graphene. When they connect through tetrahedral bonds into an extended three-dimensional framework, you get diamond. The atoms are the same, but different arrangements and interaction patterns yield vastly different properties. Building on carbon, add hydrogen, oxygen, nitrogen, and phosphorus in specific arrangements, and you get the double-helix structure of DNA, the very foundation of all biology.

So when we study the microscopic, we're really studying the molecular composition (or sequences) of different substances. We care about molecular structure, dynamics, and the functions that emerge from structure and dynamics.

Using traditional physics-based computation for molecular simulation faces the "curse of dimensionality"—as variables increase, problem complexity grows exponentially. Particularly for large systems and long timescales, simulations become prohibitively time-consuming, costly, and inaccurate.

When quantum mechanics was established, British physicist Paul Dirac optimistically predicted that the task of finding fundamental principles was largely complete. But the resulting mathematical problems are extremely complex: the functions involved depend on too many variables, and computational cost grows exponentially with the number of variables, so solving practical problems from first principles is extraordinarily difficult.

Many-body problems, drug and materials design, protein folding, turbulence, plasticity, and non-Newtonian fluid dynamics are extremely difficult to solve precisely even with supercomputers. For a long time, people assumed some scientific problems were simply not computable, because the number of dimensions explodes.

AI is particularly adept at solving high-dimensional mathematical problems. As Academician Weinan E noted in his May 2022 report Revisiting AI for Science, solving high-dimensional mathematical problems is precisely what deep learning excels at: deep neural networks provide effective approximation methods for high-dimensional functions. For suitable classes of functions, the number of parameters a neural network needs to reach a given accuracy does not blow up exponentially with dimensionality.

Consider a simple example. AI excels at image recognition, which is a high-dimensional problem. A 32×32 resolution image has 32×32 pixels, with three color channels per pixel—roughly 32×32×3 = 3,072 dimensions. By contrast, one of the classical equations humans can solve is the Boltzmann equation, formally a seven-dimensional integro-differential equation involving seven independent variables: three spatial coordinates, three velocity coordinates, and time.
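To make the gap between these two dimension counts concrete, here is a small sketch of what a naive grid-based method would face. The choice of 10 grid points per axis is an arbitrary assumption for illustration, and `grid_points` is a hypothetical helper.

```python
def grid_points(points_per_axis: int, dims: int) -> int:
    """Number of cells in a uniform grid: grows exponentially with dimension."""
    return points_per_axis ** dims

# Boltzmann equation: 3 spatial + 3 velocity coordinates + time = 7 dimensions
boltzmann_cells = grid_points(10, 7)

# 32x32 RGB image: one dimension per pixel per color channel
image_dims = 32 * 32 * 3
image_cells = grid_points(10, image_dims)

print(f"7-D grid:    {boltzmann_cells:,} cells")
print(f"{image_dims}-D grid: 10^{image_dims} cells")
```

Ten million cells is demanding but feasible. A number with more than three thousand digits is not, which is why grid-style discretization is hopeless for image-sized inputs and function approximators such as neural networks are used instead.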

AI's Breakthrough Development Is Driving AI4S

Why are we increasingly feeling the importance of AI4S? This largely stems from recent breakthroughs in AI and their spillover effects.

The Infrastructure Revolution

As a leader in AI infrastructure, NVIDIA has in recent years pushed GPU iteration speeds beyond the constraints of Moore's Law, fueling explosive growth in computing power. At Computex in June 2024, NVIDIA CEO Jensen Huang noted that over the past eight years, AI processing speed has increased 1,000-fold while energy consumption has dropped to 1/350th of its previous level. This AI-driven expansion has dramatically broadened the horizons of technological innovation.

The Algorithm Revolution

  • Self-supervised learning: Self-supervised learning marks a significant advance in how AI learns. Previous-generation AI required data labeling for many learning tasks — a limitation that prevented it from truly processing big data and producing large models. Self-supervised learning, by contrast, needs no human-provided labels or answers; it can autonomously learn from massive amounts of unlabeled data. By leveraging the inherent structure and properties of data, self-supervised learning extracts data features to serve as supervisory signals for training models.

  • Transformer: The Transformer is a feature extractor widely used in natural language processing. Its attention mechanism lets it process all positions of a sequence in parallel rather than one step at a time. As one of the best-performing feature extractors, the Transformer has become the architecture of choice for many deep learning models.

  • Large models / Pre-training: Models are first pre-trained on large volumes of unlabeled data, then fine-tuned through supervised learning on labeled data according to specific tasks and scenarios.

  • Generative AI: Analyzes the distribution of existing data to generate diverse designs — for example, generating small molecules or proteins.

  • Geometric deep learning: Particularly suited for processing graphs or manifolds with geometric shapes, such as atoms and molecules. These deep learning methods preserve topological features (geometric invariance) during feature extraction, better capturing the geometric structure of data.

  • Reinforcement learning: Driven by a reward function, an agent learns optimal behavioral strategies through interaction with its environment to maximize rewards.

  • Physics-informed AI: Incorporates physical models as prior knowledge into AI algorithms — a deep integration of physical models with AI methods.

  • Active learning: A strategy for choosing which unlabeled data points to label next, giving priority to the points most worth exploring. It identifies the data that will have the greatest impact on training a supervised model.
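As one concrete illustration from the list above, the attention mechanism at the heart of the Transformer can be sketched in a few lines of NumPy. This is a single-head, unmasked version with random weights, written only to show how every position attends to every other position in one matrix operation; the sizes and variable names are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a whole sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise similarity of positions
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                       # toy sizes
x = rng.normal(size=(seq_len, d_model))       # one embedding per sequence position
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because the whole sequence is handled by matrix products, no position has to wait for the previous one to be processed, which is the parallelism referred to above.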

To summarize, at this moment in time, a critical precondition for AI4S's rise is that both algorithms and computing power have achieved enormous breakthroughs. On this foundation, more and more researchers are beginning to apply AI to every stage of scientific research.

How Is AI Embedded Across the Full Research Workflow?

Generally speaking, the full research workflow consists of several steps: first, proposing a scientific hypothesis; then, obtaining data through experiments, analyzing that data, and checking whether it aligns with the proposed hypothesis. If not, the hypothesis is revised, and the cycle of experimentation, analysis, and adjustment continues until the hypothesis is validated.

AI can play an important role at every step in this process. It has already been widely used for learning representations from experimental data, refining measurement results, generating scientific hypotheses, guiding experiments, automating processes through agents, and exploring theoretical spaces.

AI4S can solve problems across numerous fields. It can be used for weather forecasting, battery design, high-throughput virtual screening in pharmaceutical research, and more — addressing both extremely macroscopic and highly microscopic problems, as shown in the figure below.

AI4S can be roughly divided into three types.

▎Data-Driven (AI + Data)

The representative case is AlphaFold2, the protein structure prediction algorithm developed by DeepMind. AlphaFold2 is entirely data-driven — it uses no physical models whatsoever. Input a protein sequence (more precisely, a Multiple Sequence Alignment, or MSA), and it outputs the protein structure.

When this year's Nobel Prize in Physics was awarded to scientists working on artificial intelligence, we joked internally at FreeS Fund that AlphaFold2 would win the Nobel Prize in Chemistry.

Why? For one thing, structure determines function, and structure is critically important — protein structure prediction is the holy grail of structural biology, drug discovery, and related fields.

For another, this was the first time computational methods achieved experimental-level precision. Moreover, over the past 60 years, humans have experimentally determined the structures of 200,000 proteins; AlphaFold2 successfully predicted the structures of hundreds of millions of proteins in less than three years — representing an efficiency improvement of more than 10,000-fold.
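The "more than 10,000-fold" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes "hundreds of millions" means roughly 200 million predicted structures and takes the full three years as the prediction period; both are simplifying assumptions.

```python
experimental_structures = 200_000
experimental_years = 60

predicted_structures = 200_000_000   # assumption: "hundreds of millions" taken as 200 million
prediction_years = 3                 # assumption: upper bound of "less than three years"

experimental_rate = experimental_structures / experimental_years   # structures per year
prediction_rate = predicted_structures / prediction_years

speedup = prediction_rate / experimental_rate
print(f"experimental: {experimental_rate:,.0f} structures/year")
print(f"predicted:    {prediction_rate:,.0f} structures/year")
print(f"speedup:      {speedup:,.0f}x")
```

Under these assumptions the ratio comes out at about 20,000, consistent with the article's conservative "more than 10,000-fold".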

One reason AlphaFold2 succeeded and achieved accurate predictions is its incorporation of Multiple Sequence Alignment (MSA) data. Over the past several decades, as biotechnology has continuously advanced, humanity has accumulated massive amounts of metagenomic data. This enables us to perform MSAs on a given protein: analyzing and comparing sequence similarities and differences of the same protein across different species (human, pig, chicken, fish, fungi, bacteria, and so on). The underlying idea is that structure is more conserved than sequence, so patterns of sequence variation themselves carry structural information: positions that mutate together across species tend to be in contact in the folded protein.
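A toy example shows how sequence variation in an MSA can carry structural hints. The sketch below scores every pair of alignment columns by mutual information, a classical co-variation measure; it is not how AlphaFold2 works internally, and the six aligned sequences are invented for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical alignment of one protein across six species
msa = [
    "MKAEL",
    "MKGEL",
    "MEAKL",
    "MEGKI",
    "MKAEI",
    "MEGKL",
]

def column(alignment, i):
    return [seq[i] for seq in alignment]

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

length = len(msa[0])
scores = {
    (i, j): mutual_information(column(msa, i), column(msa, j))
    for i in range(length)
    for j in range(i + 1, length)
}
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

In this alignment, columns 1 and 3 always change together (K pairs with E, E pairs with K), so they receive the highest score, the kind of signal that suggests two residues may sit next to each other in the folded structure.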

In a sense, AlphaFold2 can be described as a fully data-based protein structure generation model conditioned on multiple sequence alignment. Specifically, users need only input protein sequence data, and AlphaFold2, through its powerful algorithms and models, computes highly accurate three-dimensional structures — like using an advanced statistical machine to efficiently complete protein structure prediction.

Previously, there was frequent skepticism about whether AI-driven models could solve problems with precision. In my view, AlphaFold2 gives us tremendous confidence, because it is an outstanding example of using AI to excel at protein structure prediction.

▎Model-Driven (AI + Physical Models)

Model-driven AI4S uses AI to connect and process physical models or fundamental principles across various scales.

These physical models and fundamental principles are often difficult to solve through conventional methods, or current data volumes are insufficient for effective observation and computation — examples include the Schrödinger equation, Boltzmann equation, density functional theory, molecular dynamics, quantum mechanics, and so on.

As we noted above, a precondition for the success of data-driven AlphaFold2 was the availability of massive relevant data. Yet in many fields, a typical challenge is precisely the scarcity of data. In such cases, the task of AI4S is to help solve physical models, thereby solving the problem.

Take DP Technology's Deep Potential energy surface calculations as an example:

Using density functional theory or other quantum calculations to compute potential energy is an O(N³) problem: the computational load quickly becomes unmanageable as the number of particles N increases. DP Technology uses AI to efficiently sample high-dimensional potential energy surfaces; by combining AI with quantum calculations, it reduces the complexity to O(N).

Specifically, the three blue spheres in the lower left of the figure represent three points on the potential energy surface. The energy at these three points can be calculated with relatively high accuracy using fundamental physical methods. A neural network is then trained to learn from these precise physical calculations, producing a deep potential neural network. The next time energy at some point on the potential energy surface needs to be calculated, there is no need to invoke quantum calculations — the AI can complete the computation and output the answer directly, achieving quantum calculation accuracy with empirical force field speed: both accurate and fast.
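The "compute a few points accurately, then let a fitted model answer the rest" workflow can be sketched on a one-dimensional toy problem. This is not the Deep Potential method itself: a Morse potential stands in for the expensive quantum calculation, and a Chebyshev polynomial fit stands in for the neural network. All names and parameter values are assumptions.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def reference_energy(r, depth=1.0, width=1.5, r_eq=1.2):
    """Stand-in for an expensive quantum calculation: a Morse potential."""
    return depth * (1.0 - np.exp(-width * (r - r_eq))) ** 2

# Step 1: run the "expensive" method at a limited number of sample points
r_train = np.linspace(0.8, 3.0, 60)
e_train = reference_energy(r_train)

# Step 2: fit a cheap surrogate to those reference energies
surrogate = Chebyshev.fit(r_train, e_train, deg=10)

# Step 3: query the surrogate at new points without calling the expensive method
r_test = np.linspace(0.85, 2.95, 200)
max_error = np.max(np.abs(surrogate(r_test) - reference_energy(r_test)))
print(f"max surrogate error on unseen points: {max_error:.2e}")
```

The real systems are vastly higher-dimensional, which is where neural networks earn their place, but the division of labor is the same: accurate physics generates the training data, and the fitted model delivers fast evaluations.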

▎Deep Integration of Models and Data (AI + Physical Models + Data)

The third type involves deep integration of observed and measured data with (physics + AI) models, commonly used in drug design, weather forecasting, controlled nuclear fusion, and other fields.

Take METiS Pharmaceuticals, an innovative company in FreeS Fund's portfolio, as an example.

METiS uses AI to design LNPs (lipid nanoparticles). An LNP is a lipid vesicle with a uniform lipid core, used to deliver nucleic acid drugs while preventing their degradation and premature release during delivery. The mRNA vaccines for COVID-19 were delivered using LNPs.

LNPs deliver active molecules at appropriate concentrations, at the appropriate time, to the correct location. This is a cross-scale complex process involving multiple different scale ranges: molecular and nanoscale, cellular scale, and organ scale.

At the molecular and nanoscale, considerations include the composition of cationic lipids and how tens of thousands of molecules assemble into LNP particles. At the cellular scale, the focus is on how LNPs enter cells and whether they undergo endosomal escape within the cell, preventing drug degradation and loss of efficacy. At the organ scale, simulations of interactions between LNPs and plasma proteins are needed to predict vascular extravasation and organ-targeting properties.

In studying and analyzing this process, AI can rapidly generate million-scale lipid libraries for molecular design. AI can also predict delivery effects, providing guidance for experimental design. Physical models can offer micro-level physical mechanism explanations — for example, predicting whether a given LNP can achieve endosomal escape. Real experimental data serves as the ultimate criterion and basis for iteration, continuously refining and optimizing the models. AI + physical models + data together drive the advancement of LNP delivery technology.

Specific Applications of AI4S in Chemistry, Biology, and Materials

In mathematics and physics, AI4S is primarily suited to solving foundational problems; in chemistry, biology, and materials, using AI to discover new drugs, invent new materials, and generate novel molecules carries tremendous industrial promise and commercial potential.

▎AI Drug Discovery Has Entered the 2.0 Era

AI-driven drug discovery is a critical application area and subfield of AI4S. It refers to the use of AI technologies to innovate and optimize every stage of pharmaceutical R&D — drug design, screening, clinical trials, and manufacturing. In our view, after nearly a decade of development, AI drug discovery has now entered the 2.0 era.

Since 2016, "IT + BT (biocomputing)" has been one of FreeS Fund's investment themes, and we have thus participated fully in the investment boom and industry evolution of China's AI drug discovery 1.0 era.

What's the difference between the 1.0 and 2.0 eras of AI drug discovery?

The dividing line is mainly algorithmic. AI 1.0 was discriminative AI; AI 2.0 is generative AI. Mapping this onto AI drug discovery, we can draw a somewhat loose boundary: AI drug discovery companies founded before 2022 were primarily based on discriminative AI, placing them in the 1.0 era; those founded after 2022 are mainly based on generative AI, making them 2.0-era enterprises.

Most 1.0-era companies targeted the preclinical stage of drug R&D and concentrated on small-molecule drug discovery. In biomedicine, small molecules typically refer to organic compounds with molecular weights below 500 Daltons. Aspirin, for instance, consists of a benzene ring, a carboxyl group, and an acetyl group. Large molecules generally refer to biomolecules exceeding 1,000 Daltons, including proteins, nucleic acids, and polysaccharides.
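The 500-Dalton threshold is easy to check for aspirin, whose molecular formula is C9H8O4. The sketch below uses standard atomic masses rounded to three decimals; the tiny formula parser is a hypothetical helper that handles only simple formulas without parentheses.

```python
import re

# Standard atomic masses in Daltons (rounded)
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def molecular_weight(formula: str) -> float:
    """Sum atomic masses for a simple formula such as 'C9H8O4'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[element] * (int(count) if count else 1)
    return total

aspirin = molecular_weight("C9H8O4")
is_small_molecule = aspirin < 500
print(f"aspirin: {aspirin:.2f} Da, small molecule: {is_small_molecule}")
```

Aspirin comes out at roughly 180 Daltons, comfortably inside the small-molecule range, whereas a typical protein weighs tens of thousands of Daltons.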

In fact, using AI for small-molecule drug discovery was choosing an extraordinarily difficult problem. At the scale of 10^-10 meters, precisely characterizing small-molecule and protein interactions is extremely challenging. There wasn't enough high-quality data, and the AI was still first-generation discriminative AI. Many teams essentially used "physics + AI" approaches to compensate for relatively weak technical foundations.

Immature tools plus the hardest possible problem — truly "Hard" mode.

Starting from "Hard" mode wasn't unique to AI drug discovery. Early AI applications in medical imaging aimed to directly replace doctors; first-generation autonomous driving targeted Level 4 (high driving automation) from the outset... But over many years of development, after bubbles burst and technology matured, goals actually became more constrained.

Back to AI drug discovery. The mainstream commercialization paths for 1.0-era companies included software services, CRO, and drug pipeline development. The entire AI drug discovery industry cooled after reaching a funding peak in 2022. Still, the leading companies have fared reasonably well. Beyond entrepreneurial spirit and team capability, these top players benefited from relatively abundant liquidity in the previous capital cycle, with massive funding flows concentrating toward them — XtalPi raised $732 million before its IPO, and Insilico Medicine secured over $400 million.

If the capital retreat continues, perhaps 80% of companies could fail because they cannot raise sufficient funds. Yet just as the industry seemed to be facing a value recalibration, new technological breakthroughs may once again open a way forward.

Over the past few years, technology has advanced at a breathtaking pace:

  • First, in December 2020, AlphaFold2 demonstrated protein structure prediction capabilities rivaling laboratory standards at CASP14;
  • In November 2022, ChatGPT burst onto the scene;
  • In July 2023, David Baker's team unveiled RFdiffusion, shifting protein design from physics-based computation to AI with significant improvements in success rates and design efficiency;
  • In May 2024, AlphaFold3 was released. Unlike AlphaFold2, which could only predict protein structures, AlphaFold3 can predict the structures and interactions of nearly all of life's molecules, including proteins, peptides, nucleic acids, and small-molecule ligands, with unprecedented accuracy;
  • In June 2024, ESM3, a large model for the life sciences capable of generating novel proteins, was released by the startup EvolutionaryScale...

We have an interesting observation: over the past few years, AI drug discovery companies' PMF (product-market-fit) has shifted along a small molecule – large molecule – small molecule trajectory, a spiraling upward process.

Many 1.0-era companies worked on small molecules, while the advances listed above occurred mainly in large molecules, marked by the ability to use AI to predict monomeric protein structures and to design proteins de novo. The arrival of AlphaFold3, which uses diffusion to learn interactions among biomolecules at all-atom resolution, particularly the binding of small molecules to proteins, has brought attention back to small molecules.

Moreover, technological progress has extended from studying molecular structures to investigating how biomolecules interact, assemble into molecular machines, and generate function. This is precisely the core concern of structural biology.

Today, the tools available to AI drug discovery companies are clearly more numerous and better than in the 1.0 era. As a rough and aggressive estimate, the underlying technology of the 1.0 era might amount to only 1/5 to 1/10 of what is available now. And technology continues to iterate rapidly. How could one not be filled with anticipation for the next decade?

According to incomplete statistics, there are currently around one hundred AI drug discovery companies in China; the number that will ultimately go public is likely to be very small. Among them, XtalPi, an early FreeS Fund portfolio company, listed on the Hong Kong Stock Exchange in June 2024 under Chapter 18C, becoming the first AI drug discovery stock. XtalPi was also included as a constituent of the Hang Seng Index.

If 5-10 companies from this cohort eventually succeed in going public, then given the pace of current technological development, the next decade should see even more AI drug discovery companies born in the AI 2.0 era reach the public markets — perhaps three to five times as many.

This is why FreeS Fund has continued to pay close attention to this space. Major technological shifts tend to create commercial opportunities. We are optimistic that 1.0-era AI drug discovery companies will apply the latest models to application scenarios where they have accumulated expertise and advantages, and we are equally optimistic about pharmaceutical companies riding the AI 2.0 wave, leveraging more cutting-edge technology to drive innovation.

▎Opportunities for AI Applications in Biology

Overall, AI applications in biotech can be divided into three layers:

First, GPT-driven advances in natural language processing, directly applied to extracting biomedical knowledge. We have vast amounts of biological and drug development-related knowledge. Large language models like BioGPT and BioLLM, which excel at understanding biological concepts, can very effectively extract knowledge and key points from scientific data and literature. For example, we can have a large language model read abstracts from 30 million papers and discover previously unnoticed knowledge connections.

Second, computation on biological macromolecules themselves: following the DNA-RNA-protein pathway to predict and design the sequence, structure, and function of these biomolecules.

Third, the computation of biomolecular interactions, including protein-protein interactions, protein-small molecule interactions, and protein folding processes. This is precisely what AlphaFold3 (AF3) addresses.

Having clarified the problems to be solved, we can now examine the three main technological frontiers of AI for bio: David Baker's protein design tool RFdiffusion, DeepMind's AlphaFold2 and AlphaFold3, and the multimodal generative large model ESM3. Overall, all three technological trajectories have evolved from structure prediction alone to the ability to design biomolecules.

Let's examine each in turn.

  • Represented by American biochemist and 2024 Nobel Prize in Chemistry laureate David Baker: the diffusion model-based protein design tool RoseTTAFold Diffusion (hereinafter RFdiffusion)

Simply put, RFdiffusion uses a denoising diffusion probabilistic model to design proteins through a step-by-step noise reduction process.

Denoising diffusion probabilistic models were originally developed for audio or image generation. As shown in the figure below, progressively adding Gaussian noise to a cat image eventually turns it into pure Gaussian noise. The AI is trained to predict the result of removing the noise at each step. Once it has learned to denoise step by step, you can feed it pure noise and, through progressive denoising, generate new data whose distribution resembles that of the original images.

Interestingly, denoising diffusion probabilistic models were inspired by nonequilibrium thermodynamics.

For example, when a drop of ink is placed in water, it forms a blot that gradually disperses. Directly simulating the probability density distribution of the ink's initial state before diffusion is extremely difficult. But as the ink fully diffuses throughout the water and the distribution becomes uniform, its probability density distribution becomes tractable. Nonequilibrium thermodynamics can describe the probability distribution at each step of the ink diffusion process.

Because each step of the diffusion process is reversible, as long as the "steps" are sufficiently small, one can work backward from a simple distribution to infer the original complex distribution.
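The forward, noise-adding half of this process is simple enough to sketch directly. The example below assumes a one-dimensional, two-peaked "data" distribution and a linear noise schedule of 1,000 steps, both arbitrary illustrative choices, and uses the standard closed form for jumping straight to step t. The learned reverse (denoising) network is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data": a two-peaked distribution, playing the role of the ink drop
x0 = np.concatenate([rng.normal(-2.0, 0.2, 5000), rng.normal(3.0, 0.2, 5000)])

# Linear noise schedule: a small amount of Gaussian noise is added at every step
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)   # fraction of original signal variance left at step t

def noised(x_start, t):
    """Sample x_t given x_0 in one shot: sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*noise."""
    eps = rng.normal(size=x_start.shape)
    return np.sqrt(alpha_bar[t]) * x_start + np.sqrt(1.0 - alpha_bar[t]) * eps

x_mid = noised(x0, 300)               # partially diffused
x_end = noised(x0, len(betas) - 1)    # fully diffused

print(f"start: mean={x0.mean():.2f}, std={x0.std():.2f}")
print(f"end:   mean={x_end.mean():.2f}, std={x_end.std():.2f}")
```

By the last step the two peaks have dissolved into something statistically indistinguishable from a standard Gaussian, the "simple distribution" from which the reverse process starts.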

Denoising diffusion probabilistic models are exceptionally well-suited for protein design. David Baker's team cleverly retrained and fine-tuned the existing folding algorithm RoseTTAFold using a diffusion model approach, training it on large amounts of real protein structure data from the Protein Data Bank (PDB). Generation starts from pure random "noise"; then, through reverse progressive "denoising," RFdiffusion can generate a variety of proteins that resemble existing ones but are in fact entirely novel.

This is what's marvelous about AI4S: the physics of diffusion inspired the diffusion model in AI, and this algorithm was then applied to protein design. AI and science act as engines that push each other upward.

The protein design workflow includes backbone design, sequence design, computational screening, and experimental validation — each step with its own computational tools.

David Baker's group has produced a series of breakthroughs. Beyond using RFdiffusion for backbone design, they developed ProteinMPNN for sequence design, followed by computational filtering with AlphaFold2 or RoseTTAFold before experimental screening. Designs that pass AlphaFold2 screening see dramatically higher experimental validation rates, greatly improving the efficiency of protein design.

The images below show proteins designed with RFdiffusion and RoseTTAFold against several important targets in cancer immunology and virology. Without such tools, these proteins capable of performing specific tasks might remain undiscovered despite enormous effort, yet each represents a potential drug molecule.

Leveraging RFdiffusion and ProteinMPNN, David Baker launched the startup Xaira Therapeutics in 2023, recruiting Marc Tessier-Lavigne — former Chief Scientific Officer at Genentech (often called "the birthplace of the biotech industry") and former President of Stanford University — as CEO. Xaira raised $1 billion in its seed round, among the largest financings in biotech history.

  • DeepMind's AlphaFold2 and AlphaFold3: From computing proteins alone to predicting the structures and interactions of all biomolecules, AlphaFold3 substantially expanded AlphaFold2's capabilities, taking a major step toward commercial application.

AlphaFold2 architecture: MSA + Transformer

In large language models, we use RAG (Retrieval-Augmented Generation), a technique that supplies external knowledge sources to a large model so that it can generate accurate, contextually appropriate answers with fewer hallucinations. When we pose a question, the system does not rely on the query alone: it retrieves contextually related material from external data sources and feeds it to the large language model as part of the prompt, effectively giving the model more knowledge to produce better answers.

MSA (Multiple Sequence Alignment) works similarly: the model uses alignments of homologous protein sequences as additional inputs.
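The retrieval analogy can be made concrete with a toy example: given a query sequence, look up the most similar sequences in a database and hand them to the model as extra context. This is a sketch under loud assumptions. The sequences are made up, and k-mer Jaccard similarity is a crude stand-in for the homology search and alignment tools that real MSA pipelines use.

```python
def kmers(seq, k=3):
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query, database, top_n=2, k=3):
    """Names of the top_n database sequences most similar to the query."""
    q = kmers(query, k)
    scored = [(jaccard(q, kmers(seq, k)), name) for name, seq in database.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_n] if score > 0]

# Hypothetical toy sequences, not real proteins.
database = {
    "homolog_1": "MKTAYIAKQK",
    "homolog_2": "MKSAYIAKQR",
    "unrelated": "GGGGSGGGGS",
}
query = "MKTAYIAKQR"
context = retrieve(query, database, top_n=3)
print(context)
```

In RAG the retrieved items are text passages added to the prompt; in the MSA case they are related sequences whose patterns of variation hint at which residues sit close together in the folded structure.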

Furthermore, AlphaFold2 leveraged the biggest innovation in this wave of AI — the transformer architecture — to achieve end-to-end prediction, better handling the sequence-structure relationships implied in long sequences.

AlphaFold3: AlphaFold2 + Diffusion

AlphaFold3 built upon AlphaFold2 by replacing AlphaFold2's structure module with a diffusion module.

So we can say AlphaFold3 is a structure generation model conditioned on sequence (MSA), combining a Transformer, RAG-like retrieval (through the MSA), and diffusion.

What made AlphaFold3's arrival particularly exciting was the discovery that its performance in predicting protein-small molecule complex structures might surpass physics-based molecular docking methods.

In the AI drug discovery 1.0 era, AI was generally considered unreliable; physics-based methods were deemed more accurate. Even now, molecular docking remains the mainstream approach for finding small molecules that bind to specific targets. But AlphaFold3 may change this. Given a protein sequence and a small molecule's SMILES string, the model can output a co-folded structure in seconds.

Precisely because of this, AlphaFold3 demonstrates formidable commercial potential. In early 2024, Isomorphic Labs — a DeepMind-incubated company targeting drug development — announced two major deals with Eli Lilly and Company and Novartis worth nearly $3 billion combined.

  • Multimodal generative large models: From prediction to design and generation.

The third frontier is the direct use of multimodal generative large models to "brute force" computations. The representative here is ESM3, a protein language model from EvolutionaryScale.

ESM3 can flexibly prompt across sequence, structure, and function to achieve protein molecule generation. Its training dataset is extraordinarily large, encompassing over 2.78 billion natural proteins, augmented with synthetic data to 3.15 billion sequences, plus 236 million structures (experimentally determined plus AlphaFold2-predicted), and 539 million proteins with functional annotations — totaling 771 billion tokens.

The development team trained ESM3 at three scales: 1.4 billion, 7 billion, and 98 billion parameters. They found that as model parameter scale increased, performance improved, validating the effectiveness of scaling laws.
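A scaling law is usually summarized as a power law, loss ≈ a · N^(−b), which becomes a straight line on log-log axes. The sketch below shows how such a fit is computed. The loss values are synthetic, generated from an assumed power law (a = 10, b = 0.07) at the three parameter counts mentioned above; they are not ESM3 results, and the function name is illustrative.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

sizes = [1.4e9, 7e9, 98e9]                         # parameter counts from the text
synthetic_losses = [10.0 * n ** -0.07 for n in sizes]  # assumed, not measured
a, b = fit_power_law(sizes, synthetic_losses)
print(a, b)
```

With real measurements the points would scatter around the line, and the fitted exponent b is what indicates how much each additional order of magnitude of parameters is expected to buy.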

This also illustrates how critical datafication is. In fact, a prerequisite for AlphaFold's success was the advancement of sequencing technology, which accumulated massive sequencing data for multiple sequence alignment, compensating for our lack of structural information. Additionally, AI outputs still require experimental validation.

In summary, all three technical approaches are iterating rapidly, competing while also inspiring one another. RFdiffusion was fine-tuned from RoseTTAFold (a structure-prediction network comparable to AlphaFold2) using diffusion training. AlphaFold3 built upon AlphaFold2 by adding a diffusion module and reducing dependence on MSA. Ultimately, they may all converge toward a similar path: a unified biological foundation model.

FreeS Fund also has investments in the biotech + generative AI space. Among them, Hengyu Biotech is dedicated to using generative AI to design RNA molecules. In June 2024, Hengyu Biotech unveiled GEMORNA, the world's first generative AI-designed mRNA drug technology platform. The related paper is under review at Science. Hengyu Biotech is also the first Chinese company to publish an mRNA paper in Nature.

AI Applications in Materials Science

Materials are the foundation of the physical world. Every major technological revolution has depended on materials innovation. Historically important materials include iron, copper, cement, and steel; today's pillar materials comprise semiconductors and polymers built from silicon, carbon, hydrogen, nitrogen, and other elements, plus biomolecules. Looking ahead, nanomaterials, bio-based polymers, and quantum materials may become equally important.

The discovery and simulation of new materials increasingly depend on AI. GNoME exemplifies this.

At the end of 2023, Google DeepMind's AI tool GNoME, combining graph neural networks with active learning, successfully predicted 2.2 million crystal structures. Among these, 380,000 stable crystal structures are candidates for experimental synthesis, powering future innovations from superconductors to advanced computing.

Unlike biomolecules, which are naturally described as sequences, materials and crystals are better represented as graphs. GNoME applies Graph Neural Network (GNN) models to this representation to predict the energy of candidate crystals, then screens them in an active learning loop with Density Functional Theory (DFT): DFT computes the energy of the most promising candidates, and those results feed back into training. Because the AI narrows the field before the expensive DFT step, discovery becomes significantly faster and more efficient.
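The interplay between a cheap learned model and an expensive calculation can be sketched as a small active learning loop. Everything here is a toy stand-in: `expensive_energy` plays the role of a DFT calculation, a nearest-neighbor lookup plays the role of the GNN, and candidates are single numbers rather than crystal graphs. The acquisition rule (always evaluate the candidates with the lowest predicted energy) is one simple choice among many and is not claimed to be GNoME's.

```python
def expensive_energy(x):
    """Stand-in for a costly DFT calculation (hypothetical toy function)."""
    return (x - 0.7) ** 2

def surrogate_predict(x, labeled):
    """Stand-in for the learned model: reuse the energy of the nearest labeled point."""
    nearest = min(labeled, key=lambda p: abs(p - x))
    return labeled[nearest]

def active_learning(pool, seed_points, rounds=5, batch=2):
    """Alternate between ranking with the surrogate and labeling with the oracle."""
    labeled = {x: expensive_energy(x) for x in seed_points}
    for _ in range(rounds):
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        # Rank cheaply, then spend the expensive budget on the top candidates.
        ranked = sorted(unlabeled, key=lambda x: surrogate_predict(x, labeled))
        for x in ranked[:batch]:
            labeled[x] = expensive_energy(x)  # new data improves the next round
    return labeled

pool = [i / 20 for i in range(21)]  # 21 candidate "materials"
labeled = active_learning(pool, seed_points=[0.0, 1.0])
best = min(labeled, key=labeled.get)
print(best, len(labeled), "of", len(pool), "candidates evaluated")
```

The point of the loop is the budget: the lowest-energy candidate is found while the expensive function is called on only part of the pool, and every new label sharpens the surrogate for the next round.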

Thanks to GNoME's powerful capabilities, the number of known stable materials has grown nearly tenfold, reaching 421,000.

DeepMind also noted that GNoME has identified 528 promising lithium ion conductors, some of which could help improve electric vehicle battery efficiency.

From new energy vehicle batteries to solar cells to computer chips and beyond, new materials discovery will dramatically accelerate technological breakthroughs.

Autonomous materials discovery and synthesis systems (self-driving labs) represent a crucial direction in materials science. These labs aim to automate scientific workflows, combining robotics with ab initio databases, machine learning-driven data interpretation, synthesis heuristics learned from text-mined literature data, and active learning to optimize synthesis of novel inorganic materials in powder form.

For instance, the U.S. Lawrence Berkeley National Laboratory collaborated with Google DeepMind to develop the autonomous lab system A-Lab, where AI-guided robots manufacture new materials. Over 17 days, it conducted 355 consecutive experiments, synthesizing 41 of 58 target compounds — a 71% success rate, far exceeding manual experiments.

AI Applications in Chemistry

A representative case is ChemCrow.

We previously noted that large language models inherently lack external knowledge sources, which is where RAG (Retrieval-Augmented Generation) proves valuable: before answering, the system retrieves contextually related material from external data sources and passes it to the large language model in the prompt, effectively giving it more knowledge to produce better answers.

Based on similar logic, research teams from EPFL (École polytechnique fédérale de Lausanne) and the University of Rochester developed ChemCrow, a language model agent capable of completing diverse chemistry tasks including organic synthesis, drug discovery, and materials design.

ChemCrow builds on GPT-4 and integrates expert-designed tools (13 in the initial preprint, 18 in the later published version), some for synthesis, some for planning, some for measurement, and so on. The results show that this combination of GPT-4 plus specialized tools not only enhances the large language model's performance in chemistry but also enables it to autonomously execute chemical synthesis tasks, dramatically accelerating research in chemistry and materials science. The team has also received funding from former Google CEO Eric Schmidt.
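The "language model plus specialized tools" pattern can be illustrated with a tiny dispatcher. This is not ChemCrow's code: the two tools are hypothetical examples, and the tool-selection step, which an LLM performs in a real agent, is replaced here by simple keyword rules so the sketch runs on its own.

```python
import re

# Standard atomic weights for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no parentheses)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def molecular_weight(formula):
    """Tool 1: molar mass in g/mol."""
    total = sum(ATOMIC_MASS[s] * n for s, n in parse_formula(formula).items())
    return round(total, 3)

def atom_count(formula):
    """Tool 2: total number of atoms."""
    return sum(parse_formula(formula).values())

TOOLS = {"molecular_weight": molecular_weight, "atom_count": atom_count}

def choose_tool(question):
    """Stand-in for the LLM's tool-selection step (keyword rules, not a model)."""
    q = question.lower()
    if "weight" in q or "mass" in q:
        return "molecular_weight"
    if "how many atoms" in q:
        return "atom_count"
    return None

def answer(question, formula):
    """Route the question to a tool and report the tool's result."""
    tool = choose_tool(question)
    if tool is None:
        return "no suitable tool"
    return f"{tool}({formula}) = {TOOLS[tool](formula)}"

print(answer("What is the molecular weight of ethanol?", "C2H6O"))
print(answer("How many atoms are in water?", "H2O"))
```

The design choice worth noting is the division of labor: the language model decides what to do, while deterministic tools do the chemistry, so numerical answers do not depend on the model's recall.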

We can see that AI is being applied vigorously across biology, materials science, chemistry, and many other fields. But broadly speaking, AI's progress in biology is far ahead — the first generation of commercial companies has already gone public, and a wave of newcomers is following in their footsteps.

Finally, as humanity once again stands at an inflection point in the paradigm of scientific research, with powerful new waves surging forth, challenges and opportunities will coexist. Embrace change, integrate with change, drive change, define change. The future is promising, and it is up to our generation to make it so.
