A Conversation with Tsinghua University Professor Chen Wenguang: What If Large Models Stop Competing on "Largeness"?

FreeS Fund · September 12, 2024

Are Large Models Cooling Down, Fei-Fei Li's New Project, and AI's Sunshine for All

What happens when large models stop chasing "bigness"? That was one of the topics in a recent in-depth conversation between Li Feng and Professor Chen Wenguang. Chen Wenguang is a professor in the Department of Computer Science and Technology at Tsinghua University, and also serves as Vice President of Ant Group and Dean of Ant Research. He has long researched programming models and compiler systems for high-performance computing, and in recent years has made progress on next-generation big data processing systems, notably graph computing systems. He has substantial experience in both academia and industry.

One backdrop to their conversation: over the past two years, AI — especially large models — seems to have become the hope of the entire venture capital industry. Discussions around large models all seem to revolve around "big," or to put it exaggeratedly, only "big." As models have grown larger, we've witnessed the battle for computing power, NVIDIA's soaring stock price, and endless imagination about AGI (Artificial General Intelligence).

Another context: recently, in international capital markets, stock prices of leading large model-related companies have fluctuated, and different voices have emerged in discussions around large models.

So, is "bigness" in models necessary? Is scaling up model size the only path forward for AI?

Around this core topic, Li Feng and Professor Chen explored:

How did large language models develop?
Returning to the question of whether 9.3 or 9.11 is larger — how much intelligence can we actually learn from language?
What directions can large models achieve significant breakthroughs in next?
If we stop pursuing model "bigness" alone, will growth for companies like NVIDIA face challenges?
What opportunities exist for AI applications? Can today's large model capabilities really support emotional companionship?
How do the economics of large models work? For what types of applications can large models both exceed user expectations and motivate users to pay, with that payment covering the costs of such massive resource consumption?
When processing multimodal data, what technical challenges do current large models face, and will our interaction methods with machines be further transformed?
The large model wave is blowing toward embodied intelligence. Can a foundation large model for embodied intelligent robots be realized — in other words, is Professor Fei-Fei Li's entrepreneurial direction sound?

We hope this brings fresh perspectives and food for thought. To listen to this episode, search for and subscribe to "High Energy" on Xiaoyuzhou, Apple Podcasts, or Ximalaya.

Also follow our other AI and large model content from this year:

▲ Conversation with Ji Yu: Understanding NVIDIA, Deconstructing NVIDIA, Challenging NVIDIA

▲ Conversation with Zhang Wei, Founder of LimX Dynamics: Humanoid? Robot?

▲ Conversation with Lian Wenzhao: Imagination and Bubbles of Large Models, the "Impossible Triangle" of Robots and the Future

▲ How Far Is Embodied Intelligence from Reality: Boom, Bubbles, Tech Frontiers, Commercialization

▲ The Private Cloud Era Arrives: How AI NAS Reshapes Your Digital Life

▲ The Road to Embodied Intelligence | FreeS Fund Report

/ 01 / Two Evolutionary Threads of Large Language Models

Li Feng: Today we've invited Professor Chen to discuss the current development of large models and the deeper reasons behind recent market fluctuations. Of course, today's discussion represents mainly my personal views and Professor Chen's — he's a relatively senior and professional practitioner in the computer industry, while I'm more of an interested outsider on AI. Let's first have Professor Chen briefly introduce himself.

Chen Wenguang: I mainly work on high-performance computing, that is, research on large-scale parallel computing. In the large model field, I've participated in training several large models, including one with over a hundred trillion parameters called "BaGuaLu." Even so, I'm not specifically focused on NLP (natural language processing) or algorithms. If Li Feng is looking at this from an outsider's perspective, I'm probably half insider, half outsider.

Li Feng: Before we begin, let's review the evolution of large models. Large models are essentially a form and technological advance of AI. How do you view this wave of investment and entrepreneurship triggered by large models? Does this boom have a clear developmental thread?

Chen Wenguang: Although both are Transformer-based pre-trained language models, and BERT, released in 2018, wasn't very different from GPT, it didn't generate much of a response at the time. In November 2022, with the launch of GPT-3.5, and especially ChatGPT, everyone felt a technological leap: ChatGPT's conversational interface brought it widespread attention. GPT-4's release was another milestone. GPT-4 solved many of GPT-3.5's problems, particularly in factual accuracy, and reached new heights in capability.

Next, Sora during this year's Spring Festival further pushed multimodal technology development. Building on GPT's ability to process text and images, Sora achieved relatively long-duration video generation with a certain degree of consistency, though its later development didn't meet people's expectations. Sora still hasn't opened its service to date, while some domestic projects have already provided practical applications.

To summarize, I think the general thread is: large language models evolved from natural language dialogue, to high-accuracy and multimodal GPT-4, to long-duration multimodal video generation with a certain degree of consistency.

Another developmental thread is open-source projects. Taking Meta's LLaMA as an example, the latest LLaMA 3 uses over 10T tokens and larger model scale, with good results.

Domestic model development has also been rapid. From recent performance, models like Qwen and Zhipu AI's GLM not only perform excellently on machine-evaluated benchmarks but also achieved good performance on the human-evaluated Chatbot Arena organized by LMSYS, approaching GPT-4's level. For these first-tier domestic models, the overall gap may be very small — it's hard to say who's better than whom.

Li Feng: From a linguistic perspective, compared to English-based large language models, what differences or advantages do Chinese-based large language models have in terms of corpus, results, computation, and technology? Are these differences sufficient to form lasting advantages?

Chen Wenguang: It may not count as a lasting advantage, but high-quality Chinese corpora are scarce, and publicly available ones are even scarcer. We've seen that open-source models like LLaMA are weaker than some domestic models at processing Chinese. The reason may lie less in training methods or technical architecture than in data. Domestic companies have invested more effort in collecting and processing high-quality Chinese corpora, and so currently achieve better results.

/ 02 / From Big Data to Image Recognition, Then to ChatGPT: The Rise of the AI Boom

Li Feng: From an investment perspective, there may be a longer trajectory.

Both China and the US experienced a wave of entrepreneurship related to big data. The US started during the mobile internet era, while China started slightly later, around 2009 or 2010. After three to five years of development, relatively good data infrastructure was in place, with significantly improved capabilities in data processing, storage, and retrieval.

Next came the first wave of AI, which might be called the computer vision (CV) boom. The representative companies of this wave were China's "AI Four Little Dragons." CV mainly processed image features and was particularly good at facial feature extraction.

For the current wave of large model technology, the introduction of the Transformer was an important milestone. The Transformer is a powerful feature extractor. How did the technology evolve?

Chen Wenguang: Transformer was introduced in a 2017 paper titled "Attention Is All You Need." Initially, it was designed primarily for translation tasks rather than conversational applications, so its architecture centered on an encoder-decoder pattern.

To put it simply, Transformer can simultaneously consider all correlations between two sequences, and through continuous weight training, it ultimately determines which relationships are tighter. Previous NLP models tended to lose state when processing long sequences, but Transformer flattens the entire sequence, so even tokens far apart from each other can have their relationships considered.

This characteristic makes Transformer particularly well-suited for parallel processing. Through parallel processing, GPU computational efficiency improves dramatically, and it can execute tasks well in parallel across multiple machines.
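As an illustrative aside, the idea that every token can weigh its relationship to every other token is usually written as scaled dot-product attention, the formula from the 2017 paper: softmax(QK^T / sqrt(d_k)) V. The sketch below is a minimal pure-Python version, not a production implementation; the function names and the toy vectors are assumptions made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, however far apart they are.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy example (made-up numbers): two identical keys get equal weight,
# so the output is the plain average of the two value vectors.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
result = attention(queries, keys, values)
print(result)  # [[2.0, 4.0]]
```

Each query row is computed independently of the others, which is one reason this operation maps well onto GPUs.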

Li Feng: There's another milestone after that — the proposal of the Diffusion model. My personal understanding is that Diffusion allows for some generalization and extension in reasonable directions during the traditional process of convergent abstraction. Could you explain what role Diffusion plays in the development trajectory of large models, and when it started to make an impact?

Chen Wenguang: Diffusion and language models developed along two parallel tracks. Previously, the main technology for generating content was GAN (Generative Adversarial Network).

The mainstream view now is that Diffusion is the successor to GAN and produces richer output, so more research has concentrated on Diffusion. However, GAN has its own advantages: it computes faster, and at the same precision and resolution its computational cost is lower while its image realism is superior.

/ 03 / Do Large Models Have to Keep Getting "Larger"?

Li Feng: Recently I've observed two phenomena. One is that the AI wave has concentrated mainly on large models. Over the past two years, models have primarily been competing on "size" — parameters keep growing, and text input lengths keep increasing. This has produced two consequences:

First, increased resource consumption, which has driven the compute competition fueled by companies like NVIDIA, because you need more computational resources to support larger models.

Second, the "size" of models has also produced beyond-expected generation results — for example, generated text has become more complete, natural, and even more logically coherent.

These two aspects have given people room for imagination in investment and entrepreneurship: one related to compute power, the other related to AGI. How do you view this trend toward "largeness"? Will this process of competing on size continue?

Chen Wenguang: I can approach this topic from the perspective of supercomputers. Supercomputing development has always followed a pattern: peak performance increases 1,000-fold every ten years. Of this, roughly 100x comes from algorithm optimization and single-chip scaling, while the other 10x depends on scale expansion. Consequently, supercomputing costs keep rising — a machine's price goes from tens of millions to hundreds of millions, then to billions.

But once costs reach the billions, further growth becomes very difficult. Because if you're going to invest tens of billions to build a supercomputer, naturally someone will question what actual returns those tens of billions can deliver.

The scale expansion of AI clusters actually faces similar problems. First, the investment cost is simply too high. Second, there are many engineering challenges, such as power supply and cooling.

Take a 10,000-card cluster: its mean time between failures (MTBF) is only a few hours. For a 100,000-card cluster, MTBF might shrink to around ten minutes. You still need to employ various fault-tolerance techniques, and if each recovery takes five minutes, only about half of your time is actually usable for training.
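The arithmetic behind this can be sketched in a few lines. The model below assumes failures are independent across cards, so cluster MTBF shrinks in proportion to card count, and it follows the back-of-envelope reasoning above that each failure costs a fixed recovery time out of every MTBF interval. The per-card MTBF of 20,000 hours is a hypothetical figure chosen only so the numbers land near those quoted.

```python
def cluster_mtbf_minutes(per_card_mtbf_hours, num_cards):
    """Cluster MTBF in minutes, assuming independent, identical cards."""
    return per_card_mtbf_hours * 60 / num_cards


def usable_fraction(mtbf_minutes, recovery_minutes):
    """Share of each MTBF interval left for training after one recovery."""
    return max(0.0, 1.0 - recovery_minutes / mtbf_minutes)


PER_CARD_MTBF_HOURS = 20_000  # hypothetical, not a measured value

for cards in (10_000, 100_000):
    mtbf = cluster_mtbf_minutes(PER_CARD_MTBF_HOURS, cards)
    share = usable_fraction(mtbf, recovery_minutes=5)
    print(f"{cards:>7} cards: MTBF {mtbf:.0f} min, usable {share:.0%}")

# The case quoted above: a 10-minute MTBF with a 5-minute recovery.
print(usable_fraction(10, 5))  # 0.5
```

Under these assumptions, a tenfold increase in cards cuts the time between failures tenfold, while recovery time stays fixed, so the usable share falls quickly.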

When scaling further, various engineering and cost constraints cause training efficiency to drop substantially. Additionally, communication across 100,000 cards itself is a major challenge.

Therefore, from the perspective of performance and cost-benefit, the marginal returns of further expanding machine scale may not be that significant.

Another issue is data. Currently, mainstream high-quality models train on roughly 10T tokens, and at this scale, you need tens of thousands of cards. But if we don't have enough data, then we don't need larger machines for computation either.

So my view is: 10,000-card scale is definitely fine, 100,000 cards is possible, but from a cost-benefit perspective, a million cards doesn't have very high practical returns.

Li Feng: If we stop blindly pursuing "largeness," many ancillary problems may follow. For instance, if scale no longer expands, can GPT continue to exceed expectations as everyone hopes?

Chen Wenguang: I'm not well-positioned to make specific predictions. After all, ChatGPT was a big surprise to all of us. OpenAI spent eight years preparing a major release. Now, although GPT-5 has been slow to launch, we're not inside the company, so we can't be certain of the specific reasons — whether the model didn't meet revolutionary expectations, or whether GPT-4o has received good market response so the major release can wait.

As for what else large models might suddenly do very well that we couldn't do before, or couldn't do well: that's hard for me to imagine right now. And it's hard to say whether that reflects a limit of human imagination or simply the reality of this field.

/ 04 / 9.3 or 9.11: Which Is Bigger? How Much Intelligence Can Actually Be Learned from Language?

Li Feng: This wave of large models all started from natural language — developing from learning and processing text. This raises a first-principles question: how much intelligence does language itself actually represent? Is language primarily a mode of expression, or a mode of thinking? Would a person who never received language training still possess complex abstract thinking capabilities in their brain?

I've seen different discussions, and the view I tend to agree with is that language itself is mainly a form of expression. It can be used for abstract thinking and organizing induction, but it's not necessarily a tool our brains must rely on when thinking through problems. So how much intelligence can we actually learn from language?

Chen Wenguang: This question may involve several layers.

First, if we're talking about language in the narrow sense, it certainly cannot encompass all knowledge. For example, expressing a chemical equation in natural language is very difficult. But if we generalize the concept of "language" to include not just natural language but also image-based content, and even certain process-based expressions like programming, this brings us back to an old topic in AI research — knowledge representation. In the field of knowledge representation, there are various representation methods.

From one perspective, current large models actually represent certain knowledge from natural language through neural networks, and then we can retrieve this knowledge or use it to complete certain tasks.

So, returning to your question: as I understand it, your point is that learning from natural language may have limitations, and that some aspects of intelligence indeed cannot be clearly expressed in language or learned by reading documents.

I've also been thinking about a question recently: can a large language model read a physics textbook and then solve the exercises at the back? This still can't be done right now.

Li Feng: This is similar to that 9.3 and 9.11 question.

Chen Wenguang: Right. This can't be done now; there's a gap in between. To close this gap, certain knowledge in physics textbooks needs to be expressed by other means, not just in natural language. However, this problem is still at the discussion and exploration stage.

Li Feng: The core of this wave of large model enthusiasm is language models, based on digitizing the text accumulated over the past 40 years, inputting it into computers, and computing it. If we could learn and compute all this textual content, how much intelligence could we gain?

Chen Wenguang: This involves the definition of intelligence. What current large models can do is: from the understanding perspective, after inputting an article, they can help summarize and answer questions; from the generation perspective, they can generate documents or themes based on certain prompts.

Li Feng: Yes, from the knowledge perspective, because large models learn from textual data, they already involve data that is diverse, complex, and of varying quality to a certain degree. So text-to-text has reached a certain level. But beyond text-to-text, I have a question: can emotional companion applications like Character AI truly achieve the progression from text to emotion, then to empathy and high-EQ feedback? Or rather, do today's large models actually have the capability for emotional companionship?

Chen Wenguang: On this question, I may be more optimistic than you. First, setting aside EQ for the moment, large models can certainly simulate a person's speaking manner and style to converse with people. Speaking in some emotional mode — angry or cheerful — this can also be done. The real challenge is understanding the other person's state, then empathizing, and then influencing the other person's emotions through some path — for example, making them feel better. Based on my current understanding of AI, given enough examples, AI can learn this.

To use an analogy, putting privacy issues aside for the moment, if we could obtain dialogue data between psychologists and patients, and the data volume reached a certain level, it wouldn't be entirely impossible for AI to learn these emotional paths.

Li Feng: From expression to emotion, then to empathy — this process involves multiple layers of abstraction. Perhaps abstracting emotion from text itself is relatively easy, but understanding the true feelings behind emotions is much more complex?

Chen Wenguang: This actually involves an important debate across the entire AI field: end-to-end learning, or functional decomposition? In fact, we haven't fully figured out the reasoning mechanisms of language models. Sometimes, it doesn't reason step by step but jumps directly to the result, yet it can still make reasonable inferences along logical chains.

Take face recognition with a CNN (Convolutional Neural Network) as an example: we don't need to explicitly tell the computer how many feature points a human face has, because neural networks learn end to end. Given enough data, the model can learn and integrate this information, even though we may not know how it processes it internally.

Li Feng: In other words, if I consistently express certain emotions, the model can learn this pattern and make corresponding expected reactions in similar situations. AI doesn't improvise — it still makes responses based on data it has learned.

Chen Wenguang: This is precisely where the magic of models lies.

/ 05 / How to Calculate the Economics of Large Models, and Where Lies the Room for Imagination Going Forward?

Li Feng: From an investor's external perspective, if large models no longer continue getting "larger," the stock prices of companies like NVIDIA may face some challenges. Because if people no longer engage in fierce compute competition, the room for imagination around large models will be somewhat constrained.

Furthermore, if large models don't achieve continuous technological breakthroughs, people will start doing the economic math: for what types of applications can large models both exceed user expectations and make users willing to pay, with those payments covering the cost of such massive resource consumption?

Regarding model "largeness," or competing on size as a strategy, I have two questions.

The first is: if existing large models no longer advance, can they provide a user experience that users are willing to pay for and that can cover costs?

The other is: by continuously expanding model scale, can we ultimately achieve AGI (artificial general intelligence)? This is what people imagine "largeness" will deliver.

Chen Wenguang: There are two issues here.

First is: to what extent can "largeness" improve model performance? In pure language processing, this depends on whether we can acquire more large-scale corpora through different methods. Existing corpora have already reached a certain level, and synthetic corpora are still not widely used in natural language pre-training. It's like a child learning a language — once they've reached the equivalent of a 15-year-old's level, pushing them to 16 won't yield particularly significant results.

Another area worth watching is multimodal data. The introduction of multimodal data, particularly video and images, could become key to the next step in improving model capabilities. Video data is extremely voluminous, even surpassing natural language in some respects. The combination of natural language and vision, with the potential future addition of touch, will further drive multimodal development and place even higher demands on model capabilities.

Multimodal capabilities can be enhanced through more data and larger-scale models, but machine scale has to match data scale. If the data can't grow by an order of magnitude, the machines don't need to grow 10x either. From this perspective, I believe model scale can still expand further, but 10x jumps like those of the past are hard to imagine.

Li Feng: I'd like to interject — the "Scaling Law" that people in the industry often mention, what exactly does it refer to? Could you give everyone a quick explainer?

Chen Wenguang: Scaling Law roughly means that when we use models with larger parameter scales, say reaching hundreds of billions of parameters, the data volume required might reach trillions of tokens, and the corresponding computing needs might require thousands or even tens of thousands of GPUs.

In this process, parameter scale, data volume, and computing power are matched. People hope to achieve better results by expanding model scale, but this also requires more data and stronger computing power.
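To make the matching of parameters, data, and compute concrete, a widely used rule of thumb from the scaling-law literature estimates training compute as roughly 6 × parameters × tokens floating-point operations. That rule is not something stated in this conversation, and the hardware figures below (peak throughput, utilization, training duration) are hypothetical placeholders, so treat this as a rough sketch rather than a sizing method.

```python
import math


def training_flops(params, tokens):
    """Approximate training compute using the common 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def gpus_needed(total_flops, peak_flops_per_gpu, utilization, days):
    """GPUs required to finish within `days` at a given sustained utilization."""
    seconds = days * 24 * 3600
    flops_per_gpu = peak_flops_per_gpu * utilization * seconds
    return math.ceil(total_flops / flops_per_gpu)


# Illustrative inputs only: a 200B-parameter model trained on 10T tokens.
PARAMS = 2e11
TOKENS = 1e13
PEAK_FLOPS_PER_GPU = 4e14  # hypothetical accelerator peak, FLOP/s
UTILIZATION = 0.4          # hypothetical sustained fraction of peak
TRAINING_DAYS = 90         # hypothetical schedule

total = training_flops(PARAMS, TOKENS)
count = gpus_needed(total, PEAK_FLOPS_PER_GPU, UTILIZATION, TRAINING_DAYS)
print(f"compute: {total:.2e} FLOPs, GPUs needed: {count}")
```

With these placeholder inputs the estimate comes out at roughly ten thousand GPUs, in line with the "thousands or even tens of thousands" mentioned above; changing any of the assumed figures shifts the result proportionally.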

Li Feng: It's perhaps not a simple linear derivation process?

Chen Wenguang: Some research indicates that given a certain amount of computing or data scale, we can choose the most suitable model construction approach, allowing us to achieve expected training results within limited time. These studies are typically validated on small models first, then extrapolated to large models — mainly to help optimize the training process.

For example, by predicting the Loss (Loss Function) decline curve, we can determine whether training has encountered problems, making the entire training process more controllable.

Based on my limited experience participating in training several models, indeed the larger the model scale, the faster the Loss decreases. However, from 200B to 500B to 1 trillion, exactly how fast Loss decreases and how much Loss there will be — this remains largely experimental science. Even if it's not necessarily linear, nonlinear functions can still be fitted to some degree.
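One common way to do this kind of fitting is to assume the final loss follows a power law in model size, L ≈ a · N^(-b), which becomes a straight line in log-log space. The power-law form is an assumption borrowed from published scaling-law studies rather than something stated here, and the data points below are synthetic, generated from known constants purely to show the mechanics of fitting on small models and extrapolating to a larger one.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, size):
    """Extrapolate the fitted curve to a new model size."""
    return a * size ** (-b)


# Synthetic "small model" runs generated from a = 10, b = 0.1 (made up).
small_sizes = [1e8, 1e9, 1e10]
small_losses = [10.0 * s ** -0.1 for s in small_sizes]

a_fit, b_fit = fit_power_law(small_sizes, small_losses)
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}")
print(f"predicted loss at 200B parameters: {predict_loss(a_fit, b_fit, 2e11):.3f}")
```

Real training runs are noisy and may not follow a single power law across scales, which fits the point that this remains largely experimental science.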

Li Feng: That makes sense. Let's return to whether scale can bring greater intelligence.

Chen Wenguang: I think there's still some room for growth, but it won't be as rapid as in the past. On one hand, people probably won't be willing to invest that many resources. On the other hand, from a cost-performance perspective, scale isn't that important. Although Moore's Law has slowed down, it hasn't completely failed. We can also reduce costs through other technical means.

So if we're considering short-term returns, say three to five years, these issues are critical. But from a long-term research perspective, this problem may not be that urgent. In the long run, as long as intelligence can be achieved, future technological breakthroughs will make computing costs more controllable. Perhaps one day we'll have portable fusion reactors, and then power consumption won't be an issue. These new technologies will certainly be expensive at first, but costs tend to decline over time, becoming economically viable and profitable.

Li Feng: From an investment perspective, if people stop competing on "largeness," they'll be constrained by cost-performance, and models may start to converge and pull back, shifting toward driving application deployment. In other words, only when the focus shifts away from "largeness" will application deployment truly begin to accelerate.

Chen Wenguang: There are already some trends in this direction. Companies like Baidu, Google, and Microsoft are all working on smaller-scale models. For a specific domain, these small models don't need encyclopedic knowledge; instead, they combine reasonably general capabilities with domain-specific ones. Some in the industry are even starting to go against the Scaling Law. This may not be the consensus, but in certain application scenarios, using a smaller model to match a large model's capabilities does have cost-performance advantages.

Li Feng: If model scale no longer continues to expand, people might look back and wonder whether simply expanding model scale can really lead to artificial general intelligence. In other words, will people's imagination about large models hit a ceiling because of this?

Chen Wenguang: First, the belief that "largeness" can lead to artificial general intelligence is itself a conjecture without much basis. Many people may feel that large models are more capable, but whether they can reach AGI — this hasn't achieved broad consensus.

Second, the "largeness" direction of large models may slow down, but it won't necessarily stop. A slowdown in the "largeness" trend also doesn't mean that demand for computing power will necessarily decline, because both model training and inference require computing power. Even if training demands less computing power, inference demands will inevitably increase substantially.

Li Feng: Agreed. This is also why we've invested in inference chip companies. (Related content: Conversation with Ji Yu: Understanding NVIDIA, Deconstructing NVIDIA, Challenging NVIDIA)

/ 06 / The Logic and Challenges of Large Models: From Text to Multimodal Input

Li Feng: Everyone has their own internal language model. From an observer's perspective, natural language expression has several characteristics:

First, language is exhaustible in theory, but because the number of possible combinations is astronomically large, it is inexhaustible in practice. For example, our commonly used vocabulary is roughly 50,000 to 60,000 words; enumerating every way these words can be organized and connected is possible in theory but practically out of reach.

Chen Wenguang: This is the classic combinatorial explosion problem. Its complexity is exponential — however capable computers become, it's hard to exhaust.
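A quick calculation shows how fast the count grows. With a vocabulary of V words there are V^n possible sequences of length n; the sketch below uses the 50,000-word figure mentioned above and a hypothetical machine that checks 10^18 sequences per second. It counts all sequences, grammatical or not, so it is an upper bound meant only to convey the scale.

```python
def count_sequences(vocab_size, length):
    """Number of distinct word sequences of a given length."""
    return vocab_size ** length


VOCAB = 50_000
CHECKS_PER_SECOND = 10**18  # hypothetical machine speed
SECONDS_PER_YEAR = 365 * 24 * 3600

for length in (2, 5, 10):
    total = count_sequences(VOCAB, length)
    years = total / CHECKS_PER_SECOND / SECONDS_PER_YEAR
    print(f"length {length:>2}: {len(str(total))} digits, "
          f"about {years:.1e} years to enumerate")
```

Even a ten-word sentence yields a 47-digit count, which is why faster hardware alone does not make exhaustive enumeration practical.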

Li Feng: Yes, that's the first point. Second, language has rough rule constraints. Whether we're expressing ourselves or listening to others, we theoretically have a large model for absorbing, understanding, and evaluating information — otherwise we couldn't understand or evaluate the quality of others' speech or writing.

Of course, everyone's expressive ability differs, but in most cases it can be improved through more reading, speaking, learning, and practice. In other words, we may all be expressing the same content, but if my internal language model is slightly better at choosing words and forming sentences, I'll be more expressive than others. That's roughly the characteristic of language itself.

Additionally, there's a gap between language and thought — the more concrete the expression, the more information may be lost.

If we look at these characteristics from the perspective of large language models, when computers learn enough characters and their combinations, combined with some objective evaluation data input, they can to some degree simulate the language model in our brains.

The premise here is that text is one-dimensional, linear — one character, one word following another. Generation models based on language models essentially process enough relationships between vocabulary or text, constrain these relationships in certain dimensions, and then output — generating new text.

Chen Wenguang: Broadly speaking, I understand it should be like this.

Li Feng: Like the iterations of AlphaGo playing Go back then — theoretically it's also a combinatorial explosion problem. AlphaGo initially learned from human game records, then was able to find optimal playing paths on its own, after which top human players could no longer defeat it.

I bring up this example because from an investor's perspective, we're thinking about what logic applies to large language models. For example, programming has similarities with large language models.

Chen Wenguang: Programming languages have one advantage over natural language — they can be compiled and executed. If compilation fails or execution fails, you know the code is wrong. Therefore, programming has a stronger self-feedback mechanism; it doesn't necessarily require human involvement, and can achieve better alignment and optimization through this mechanism.
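The feedback loop described here can be sketched as an automatic checker: compile the candidate, run it, and compare against known answers, with no human in the loop. The function name `check_candidate`, the convention that a candidate defines `solve`, and the sample candidates are all hypothetical; a real system would also run untrusted code in a sandbox, which this sketch does not do.

```python
def check_candidate(source, tests):
    """Return (ok, message) for candidate source that should define `solve`."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError as err:
        return False, f"compile error: {err.msg}"

    namespace = {}
    try:
        exec(code, namespace)  # unsafe for untrusted code; sketch only
        for args, expected in tests:
            got = namespace["solve"](*args)
            if got != expected:
                return False, f"wrong answer for {args}: {got!r}"
    except Exception as err:
        return False, f"runtime error: {type(err).__name__}"
    return True, "all tests passed"


# Hypothetical task: add two numbers.
TESTS = [((1, 2), 3), ((0, 0), 0)]
CANDIDATES = [
    "def solve(a, b): return a +",       # does not compile
    "def solve(a, b): return a - b",     # compiles, wrong result
    "def solve(a, b): return a + b",     # correct
]

for source in CANDIDATES:
    ok, message = check_candidate(source, TESTS)
    print(ok, message)
```

The pass/fail signal comes entirely from the compiler and the test run, which is the kind of self-feedback that natural language output lacks.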

Li Feng: Then let's next discuss issues related to multimodal data. Today's large models have developed to the level of processing one-dimensional text and word relationships. I understand that the "text-to-image" process is essentially still expressing a sentence, just transforming the content in that sentence into some element in an image. Therefore, with assistive tools similar to Copilot, text-to-image is feasible.

But if images serve as input, used to parse element relationships or scenes within them, since images are at least two-dimensional information and may contain even more complex visual elements — intuitively, extending from the logic mentioned earlier to this step seems difficult to achieve.

Chen Wenguang: Because directly processing pixels involves too much computation, the current mainstream approach for multimodal large models, as in models like DALL-E or Sora, is to cut an image into small pieces called "patches" and feed them into the model as a one-dimensional sequence.

In natural language processing, we form tokens through tokenization; in image processing, we generate tokens by cutting images into patches. Of course, the upper and lower parts of images are related — when we cut images into one-dimensional sequences, it seems like some two-dimensional information is lost, but actually it isn't. Because while natural language appears one-dimensional on the surface, through the Transformer model it becomes an extremely high-dimensional relationship matrix.
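A minimal sketch of the patch idea: split a 2-D grid into fixed-size patches and lay them out as a one-dimensional sequence. The function name `patchify` is hypothetical, the "image" is a toy 4×4 grid of integers, and carrying each patch's grid position alongside it is an assumption based on how vision Transformers commonly attach position information, not a description of DALL-E or Sora internals.

```python
def patchify(image, patch):
    """Split a 2-D grid (list of rows) into a row-major sequence of patches.

    Each item is ((patch_row, patch_col), flattened_pixels).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")

    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            sequence.append(((top // patch, left // patch), flat))
    return sequence


# Toy 4x4 "image" whose pixel values are simply 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]

for position, pixels in patchify(image, 2):
    print(position, pixels)
# (0, 0) [0, 1, 4, 5]
# (0, 1) [2, 3, 6, 7]
# (1, 0) [8, 9, 12, 13]
# (1, 1) [10, 11, 14, 15]
```

The sequence is one-dimensional, but because each patch keeps its grid position, a model can still relate a patch to the ones above and below it.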

Li Feng: What I mean is, whether in Chinese or English, characters and words can exist independently, but as you said, the upper and lower parts of images may be related.

Chen Wenguang: I take it you mean the case where an image contains a person's eye or nose, and after cutting, a patch contains only part of it. In fact, the patch's embedding still carries content after cutting: the model will try to recognize that it resembles an eye and may be part of a human eye.

Li Feng: If we look purely from the perspective of abstract thinking, is the process by which we humans process visual input information the same as the abstract process of processing language input?

Chen Wenguang: Probably not. Visual processing likely isn't a linear sequence the way language processing is.

There's a view in machine learning: AI only needs to achieve the brain's functionality, not necessarily replicate its mechanisms. Perhaps we can find a path different from the human brain, one that is more computationally feasible and achieves similar effects. So the fact that AI's working mechanisms don't resemble the human brain doesn't mean it can't reach a certain level of intelligence; the two can't simply be equated.

Li Feng: Agreed. We've looked at some neuromorphic computing projects, but their efficiency isn't high enough yet. The main bottleneck is that we know too little about the brain, particularly its complex cognitive processes and mechanisms.

Chen Wenguang: Our methods for observing the human brain are limited — we can't conduct invasive experiments, which makes it difficult to deeply study language mechanisms. Visual research is actually somewhat better off, because experiments can be done on lower animals like fruit flies or mice. But when it comes to human language mechanisms, there are more experimental constraints, and ethical concerns represent a major barrier.

The current logic is that AI will continue advancing along its own development path, while brain-inspired computing represents an alternative route. There are many researchers working on artificial neural networks with diverse approaches, so breakthroughs will inevitably emerge. If brain-inspired computing achieves some good results, it too will find wide application.

**/ 07/ ** The Logic and Challenges of Large Models: From Images to Video

Li Feng: Returning to our earlier topic, text-to-image generation and image input are two different things. If we expand one level further and add a time dimension to create three-dimensional data, it becomes video input — would this be too complex for existing models?

Chen Wenguang: Not necessarily. Sora's white paper, though brief, basically shows that it works by training on a sequence of images, using a vision model to describe these images, thereby forming a sequence.

The key question you raise is how to understand the relationships between these sequences, and how to maintain consistency when generating content. These have been hot topics in the visual research community for some time.

Let me briefly mention the work of Ant Group's research institute. While Ant hasn't built a model that directly generates very long videos, it has done research on editable multimedia: you give it an image, mark a region, and say "put a scarf here" — and the system generates a realistic scarf in that area.

Another project involves stylized video editing, such as changing a person's hair to white in a video. This involves temporal consistency issues — first transforming the image, then applying constraints along the timeline, and finally generating the complete video.
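To make the idea of "applying constraints along the timeline" concrete, here is a deliberately simplified sketch. It is not Ant Group's method, which the conversation does not describe in detail. It assumes each frame has already been edited on its own and that the edit can be summarised by a single number per frame (hypothetically, the brightness of the recoloured hair). Blending each frame with the previous smoothed value stands in for a temporal constraint; `smooth_over_time` and `max_jump` are illustrative names.

```python
def smooth_over_time(per_frame_values, weight=0.5):
    """Blend each frame's edited value with the previous smoothed value."""
    smoothed = [per_frame_values[0]]
    for value in per_frame_values[1:]:
        smoothed.append(weight * smoothed[-1] + (1 - weight) * value)
    return smoothed


def max_jump(values):
    """Largest change between consecutive frames, a crude flicker measure."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical per-frame results: hair brightness after editing each frame alone.
edited = [0.90, 0.70, 0.95, 0.65, 0.92]
stable = smooth_over_time(edited)
print("flicker before:", max_jump(edited))
print("flicker after: ", max_jump(stable))
```

Editing frames independently tends to produce flicker; the smoothing pass trades a little per-frame fidelity for a steadier result across time.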

Of course, none of this research is perfect yet, but people are indeed continually exploring. I don't think there are insurmountable technical constraints or chasms.

Li Feng: Large language models essentially operate by analyzing absolute relationships between words. So for the images or video data used for training and input today, is most of it unlabeled?

Chen Wenguang: Most data is indeed unlabeled, though some is labeled — for example, some open-source datasets from Meta. The approach people commonly use now is Meta's Segment Anything, which can automatically segment elements in images and identify them. After identification, it can generate a very complex long caption. With these descriptions, combined with images for multimodal model training, more data can be obtained.

Li Feng: This method can identify nouns or objects, but what about verbs?

Chen Wenguang: Current research still mainly focuses on spatial relationships — what things are in the image and their relative positions. If it involves actions or verbs, that may require video for comparison, inferring actions by observing changes between images.
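A toy version of "inferring actions by observing changes between images" might look like the following. It assumes the object has already been identified in every frame (here it is simply the letter `B` in a small text grid), so the only remaining step is to compare positions over time. The functions `centroid` and `describe_motion` and the direction labels are hypothetical, chosen for illustration.

```python
def centroid(frame, label):
    """Average (row, col) of all cells in a frame grid that carry the label."""
    cells = [(r, c) for r, row in enumerate(frame)
             for c, value in enumerate(row) if value == label]
    if not cells:
        return None
    return (sum(r for r, _ in cells) / len(cells),
            sum(c for _, c in cells) / len(cells))


def describe_motion(frames, label):
    """Name the dominant direction an object moves between first and last frame."""
    start, end = centroid(frames[0], label), centroid(frames[-1], label)
    if start is None or end is None:
        return "unknown"
    d_row, d_col = end[0] - start[0], end[1] - start[1]
    if d_row == 0 and d_col == 0:
        return "stays still"
    if abs(d_col) >= abs(d_row):
        return "moves right" if d_col > 0 else "moves left"
    return "moves down" if d_row > 0 else "moves up"


# Three hypothetical frames in which object "B" slides along the top row.
frames = [
    ["B..", "...", "..."],
    [".B.", "...", "..."],
    ["..B", "...", "..."],
]
print(describe_motion(frames, "B"))
```

A single frame can only say where the object is; the verb emerges from the difference between frames.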

**/ 08/ ** Foundation Models for Embodied Intelligence: Is Fei-Fei Li's Startup Direction Sound?

Li Feng: We've covered the historical evolution of large models, their past and present. Currently, large models have ignited another track: embodied intelligent robots, or humanoid robots. We've invested in quite a few companies in this direction.

Recently, a new startup in Silicon Valley attracted attention — its founder is Fei-Fei Li. She noted that large models face many challenges in the physical world because they were never designed for it. Her company is dedicated to solving large model problems in the physical world, particularly foundation models for embodied intelligent robots.

I very much agree with her view, but there are two challenges. First, new models for future intelligent robots need to account for physical problems like interaction and deformation, and in the near term these don't seem solvable through pure physical computation. Second, data in robotics is scarce, and labeled data is even scarcer, so building foundation models here looks even harder than it is with video data.

My question is: within the next two to three years, can an effective foundation model for embodied intelligent robots really be developed?

Chen Wenguang: I think it depends on what function it needs to achieve. For vision and scene understanding, it's basically feasible. Autonomous driving, for instance, has largely solved these two problems.

Similarly, in robotics, whether indoors or outdoors, what's being observed differs somewhat but there's no fundamental distinction in essence — seeing an object, identifying its depth, knowing what it is, then understanding its positional relationships. These are very reasonable extensions of current computer vision technology.

Li Feng: Actually, the massive amounts of data, algorithms, and experience accumulated in the autonomous driving industry can all transfer to the robotics field. But compared to manipulation robots, autonomous driving has one significant difference:

In autonomous driving, avoiding collision under any circumstances is the absolute prerequisite. But for manipulation robots, all operations are designed for physical contact — this is completely different logic. (Related: Dialogue with Lian Wenzhao: Imagination and Bubbles of Large Models, the "Impossible Triangle" of Robots and the Future)

Chen Wenguang: Yes, what I mentioned just now was more about environmental understanding and locomotion capabilities. These technologies are immature but present no fundamental technical challenges. Manipulation is another story — its difficulty lies in there being too many manipulation objects. We can categorize manipulation objects into rigid and non-rigid objects, and manipulation into precise and imprecise operations. Currently, the first step is tackling imprecise manipulation of rigid objects. The mainstream approach now is imitation learning — for example, having humans operate thousands of times first, then the robot learns through imitation. Because these are rigid objects, current operations don't involve much tactile sensing yet; the focus is mainly on precision.

Of course, the future definitely moves toward precise manipulation and non-rigid objects — like shaving someone, turning someone over, folding clothes. These are non-rigid object operations with higher accuracy requirements.
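The imitation-learning idea Chen mentions, where humans demonstrate a task many times and the robot copies them, can be sketched as follows. This is a minimal illustration under strong assumptions: the state is just a gripper position and a box position on a line, the demonstrations are generated by a scripted "human", and a nearest-neighbour lookup stands in for the neural network a real system would train. All names (`nearest_neighbor_policy`, `demonstrations`) are hypothetical.

```python
import math


def nearest_neighbor_policy(demonstrations, k=3):
    """Build a policy that averages the actions of the k closest demonstrated states."""
    def policy(state):
        ranked = sorted(demonstrations, key=lambda d: math.dist(d[0], state))
        chosen = ranked[:k]
        dims = len(chosen[0][1])
        return tuple(sum(action[i] for _, action in chosen) / len(chosen)
                     for i in range(dims))
    return policy


# Hypothetical demonstrations: (gripper position, box position) -> move command.
# The scripted "human" always moves the gripper straight toward the box.
demonstrations = []
for gx in range(0, 5):
    for bx in range(0, 5):
        state = (float(gx), float(bx))
        action = (float(bx - gx),)
        demonstrations.append((state, action))

policy = nearest_neighbor_policy(demonstrations, k=1)
print(policy((1.0, 4.0)))   # a state seen in the demonstrations
print(policy((1.2, 3.9)))   # an unseen state close to a demonstrated one
```

The policy never receives an explicit rule such as "move toward the box"; it only reproduces what was demonstrated in similar situations. That is also why coverage of the demonstrations matters so much.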

Li Feng: So I'm somewhat uncertain — not questioning Fei-Fei Li's view, but unsure about this direction, because the challenges in manipulation are substantial. There may emerge some specialized, generalizable models for specific scenarios, but achieving a comprehensive foundation model for embodied intelligence still has a long way to go.

Chen Wenguang: People's definitions of foundation models may not be identical. Because there are too many manipulation objects and manipulation methods are too complex, the current tendency is toward task-specific approaches: for moving boxes, train a model specifically for moving boxes. People rarely discuss a general manipulation foundation model yet; there's currently no mature methodology to solve this problem.

Li Feng: I feel the same way. In the short term, there may be some specialized small models for specific tasks.

Chen Wenguang: Technology development typically starts from point solutions, then gradually expands. It's hard to say a general manipulation foundation model won't emerge in the future — we just don't know how to approach building it yet. Future progress may gradually merge similar manipulation types, eventually expanding to more scenarios. For example, transporting and grasping objects have different requirements, but we might find ways to integrate these operations.

Li Feng: For now, it seems the lower-body locomotion capabilities of robots can be generalized. Regarding lower body or locomotion — we're talking about cross-terrain movement, like going up and down steps, stairs, or climbing over obstacles and traversing ditches. (Related: Dialogue with LimX Dynamics Founder Zhang Wei: Humanoid? Robot?)

This type of movement — do you think a separate foundation model will emerge for it?

Chen Wenguang: It's possible, because this kind of movement essentially combines vision, tactile sensing, and motor control.

Additionally, existing large language models have some utility for task understanding and planning. This is one reason why embodied intelligence has become so hot after the emergence of large language models.

In the past, it was very difficult to directly tell robots what to do, but now through natural language, direct communication with robots has become much easier — this breakthrough is quite crucial. (Related: How Far is Embodied Intelligence from Reality: Boom, Bubbles, Technological Frontiers, Commercialization)

Li Feng: And the robotics field can borrow a lot of technology and experience from autonomous driving. Autonomous driving, as one of the previous AI booms, went through roughly ten years of development, compounded by algorithmic changes, and that is also a driving force for robotics.

Is there anything else you think is worth paying attention to or adding on the topic of large models? Large model development seems to have reached a milestone for this phase, and next it may shift more toward applications and robotics.

Chen Wenguang: I'm still somewhat looking forward to seeing what GPT-5 will actually look like. Maybe once GPT-5 comes out, the first half of what we discussed today will basically become obsolete.

Li Feng: Then I might be slightly more conservative than you. Currently, large models demonstrate reasonably good intelligence on factual, text-based tasks. What needs to be solved next is the 9.11 versus 9.3 problem I mentioned earlier.
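As a reminder of why the 9.11 versus 9.3 question is tricky, the sketch below reads the same two strings in two different ways. It does not claim to explain what happens inside a language model. It only shows that the text is ambiguous between a decimal reading and a "version number" or "chapter number" reading, which is one commonly suggested source of the confusion. The function names are illustrative.

```python
from decimal import Decimal


def compare_as_decimals(a, b):
    """Return the larger value when both strings are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def compare_as_versions(a, b):
    """Return the larger value when each dot-separated part is read as an integer."""
    def parts(text):
        return tuple(int(piece) for piece in text.split("."))
    return a if parts(a) > parts(b) else b


print(compare_as_decimals("9.11", "9.3"))   # decimal reading: 9.3 is larger
print(compare_as_versions("9.11", "9.3"))   # version reading: 9.11 is larger
```

Both answers are "correct" under their own reading, so getting the question right depends on recognising which reading the asker intends.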

Chen Wenguang: Existing large models can basically handle office clerk work — basic reading and writing. But when it comes to more complex interactions, like having it help plan travel or book tickets, the results are still unsatisfactory. Additionally, when we narrow models down to specific professional domains, they usually require retraining on specialized data to form a more professional model. But what level of depth these specialized models can achieve remains unknown.

Taking science as an example, models like AlphaFold keep improving and are starting to be applied in actual scientific research. Some people are even discussing whether large models can propose hypotheses rather than just summarize existing information. If they can combine computational and experimental methods for closed-loop verification, you could perhaps imagine an automated science factory. Of course, this would require the hypotheses to have a reasonably high success rate; otherwise it would seem rather unreliable.

As for whether GPT-5 can help us explore these relatively professional domains better, we don't know yet, but the imaginative space is there.

Li Feng: To echo your point, something similar has happened before. The AI boom around 2015–2016 included not just autonomous driving but also technological and algorithmic shifts, such as breakthroughs in reinforcement learning and CNNs. In the wave that followed, AI gradually began integrating with science, entering the so-called "AI for Science" phase.

The current wave of large models is also an advancement in AI technology. Embodied intelligence and robotics is the domain that first received attention, or was first illuminated by sunlight, in this wave — just as autonomous driving once led that era's AI boom. If people no longer pursue merely "bigger" models, AI progress will shine like sunlight onto other important industries and more domains.
