How Close Are We to "General-Purpose Robots" With Large Models? | Riding the AGI+ Wave

Yunqi Capital · July 13, 2023

Large language models: This can't be all on me.

As a medium that can connect the digital world with the physical world, robotics is seen as one of the most important directions for large model deployment. The emergence of large language models makes it possible for robots to master language, further narrowing the cognitive gap between robots and humans. A general-purpose robot that understands instructions and acts on commands seems to be getting closer to us.

But Yunqi Capital partner Chen Yu believes that language models represent only a small part of human cognitive models. At this stage, AI may still be unable to complete simple actions like grabbing a bottle of water from a table, because there is a disconnect between the digital and physical worlds — a problem that requires long-term attention and resolution. The ultimate hope is that general-purpose robots can truly accomplish various general tasks.

Recently, Yunqi's "Wave AGI+ Series Salon" joined forces with Tencent Technology and Qingteng Hui in Shenzhen, bringing together over 50 frontline entrepreneurs, academic guests, and industry experts who have been deeply involved in robotics and deep learning for many years to discuss the innovation and challenges of AGI + robotics.

Below is a summary of key insights and a transcript from the event. Enjoy~

Content source | Tencent Technology. Reporters | Tencent Technology Zhou Xiaoyan, Zhao Yangbo

Guests for this session:

Sang Yu: Yunqi Capital, Frontier Technology Investment

Bi Sheng: South China University of Technology, Key Laboratory of Big Data and Intelligent Robotics · Main research areas: intelligent robotics, intelligent hardware, intelligent computing for embedded systems

Yue Yutao: Founder and Director, Jisui Deep Sensing Technology Research Institute (Qingteng Future Technology Academy alumnus) · Main research areas: multimodal perception, radar-vision fusion, AGI and machine consciousness

Yan Qifan: Co-founder, Dafang Intelligence · Dafang Intelligence is a construction robotics company dedicated to using robots to address the severely aging workforce and poor working conditions in the construction industry.

Core viewpoints in this article:

In algorithms, we can compare the entire world and its laws to a majestic mountain range, while the data used to train small models may only represent one small hill, unable to see beyond it. Large models actually provide abstract information about the entire mountain range's terrain, rather than specific geographic data. This abstract information is trained on natural language and symbols. Therefore, the assistance of large models may solve corner case and OOD (out-of-distribution) generalization problems.

For relatively simple tasks like path planning and navigation, as long as the environment is fixed, robots perform well. But when the environment becomes complex, problems get thorny. With the availability of massive amounts of data, robots can better switch between tasks in complex environments and flexibly schedule task execution. Perhaps in some complex scenarios, better results can be achieved — but this requires the support of large models and big data.

No matter how well we do in deep learning, the results are still not ideal for precise movements like obstacle avoidance and navigation. This is because deep learning is better suited to fuzzy, cognitive-level judgments, while in precise scenarios, perception sensors remain key.

Traditional robotic systems also face requirements for real-time performance and computing power. Computing power can be layered: robot control and drive have relatively high real-time requirements, while planning requirements are relatively lower and can be achieved with some embedded systems. Therefore, in operation, these two aspects can be separated.

How Will Robot Algorithms Be Improved by Large Models?

Sang Yu: Let's start with the underlying technology. Large models have now produced "foundation models" in the language and visual modalities, achieving results beyond people's imagination, with emergent chain-of-thought reasoning and strong generalization capabilities. We're excited to apply these technical advances to robotics. However, robotics is a systems engineering challenge. Looking at the robot algorithm stack alone, it roughly divides into perception, planning, decision-making, control, and drive. How can large models be applied within this stack? Taking a longer-term perspective, what disruptions might large models bring to the robot algorithm technology stack?

Yue Yutao: Perception is like humans having eyes and ears. A robot is equipped with cameras, radar, and other sensors to observe and perceive its surroundings. However, regarding robot perception technology, especially where large models are concerned, there are different viewpoints and understandings in society, academia, and industry. What I'm presenting is just one perspective.

In robot perception, there is a long-standing problem: corner cases and out-of-distribution (OOD) generalization. For common scenarios, if there is abundant data for thorough training, algorithms can recognize well. But for rare scenarios, unexpected events, or variations of common situations, things become much more difficult.

To give an example, there was previously an accident in Hualien, Taiwan, where a small truck had overturned with its roof facing an oncoming Tesla. Tesla's algorithm may have seen vehicles from many different angles and in many forms during training, but had probably never or rarely seen an overturned vehicle with its roof facing the camera. Therefore, the algorithm couldn't recognize it and avoid the collision. This kind of situation is a corner case.

This was previously very difficult to handle in the perception field. One view holds that this involves concepts of common sense, commonsense world models, and commonsense reasoning. In algorithms, we can compare the entire world and its laws to a majestic mountain range, while algorithm training data may only be one small hill, unable to see beyond it. However, in certain situations, things beyond that small hill may affect task execution.

From my perspective, large models actually provide high-level abstract information about the entire mountain range's terrain, rather than specific geographic data. This abstract information is trained on natural language and symbols. For example, when we see a car, it has millions, tens of millions, or even hundreds of millions of pixels — this is basic data. But when I describe it with a few letters "car," this is natural language description, a highly compressed expression of information. At this information level, the model has understanding of almost everything humans have seen, and can construct models about the world and knowledge structures. Therefore, the assistance of large models may significantly improve the generalization of perceptual images, solving corner case and OOD generalization problems. This is somewhat like a process from perception to cognition, combining basic data with highly abstract information and knowledge.

Specifically, when these two are combined to tackle corner case and OOD generalization problems, it brings a series of benefits. For example, perception reliability will be significantly improved: whether for object detection and tracking, or for more complex tasks like semantic segmentation, accuracy can be substantially increased, possibly even overturning traditional understanding. I've noticed some scholars and enterprises are already attempting similar projects, and we're also conducting related research.

The second possibility is expanding the scope of perception. It need not be limited to simple single-frame image perception tasks (like object detection and tracking), but can extend to video or to more complex behaviors involving stronger correlations and complexity, such as complex behavior recognition. In such cases, large model assistance may significantly improve perception accuracy at the behavioral level. These are just some preliminary thoughts to spark discussion. Criticism and corrections are welcome, thank you.

Bi Sheng: I'd like to briefly share my thoughts on this. Recently, we've developed a strong interest in the multimodal field, particularly the research direction of Vision Language Navigation. This direction is currently very hot, and we've already invested some time in it. Rather than theory, we lean toward engineering research, applying results to actual scenarios. Therefore, we referenced methods from top teams abroad and tried to apply them in our own research.

However, we encountered some problems, which may relate to model generalizability. Datasets are an important challenge in deep learning AI research. Solving dataset problems is crucial for achieving good research results. When selecting datasets, we referenced the work of teams including Fei-Fei Li in this field, and drew on their papers. They provided a simulation environment for model training, where the training dataset mainly involved smart home and household scenarios, such as sofas, tables, etc. Their goal was to enable robot navigation in home environments through language instructions. We conducted some experiments using their provided simulation model for training.

However, if we want to truly achieve application, we need to use real training data. So we purchased 3D scanning cameras to scan room scenes into 3D images. We built 3D models of scenes around our laboratory and imported them into the trained model for testing. However, initial results were not ideal — path planning was not accurate.

We found that laboratory scenes differ from home scenes, so we had to find a place similar to a home scene. Eventually, we found a first-floor lobby in a laboratory building that had sofas and tables. We first built a map of this location, using 3D scanning cameras to scan the entire room in 3D. In this scene, we successfully conducted navigation. For example, when giving instructions to the robot, we could tell it to walk along the sofa to somewhere, or along the glass door to the entrance, and the robot would generate a path. However, when the robot walked along the path, it couldn't rely entirely on vision or entirely on deep learning. I think visual navigation is feasible in fuzzy environments, but still has difficulties in precise scenarios. Therefore, we combined vision and laser methods. We divided the environment into a grid, using visual information at each grid point but using laser for the walking direction between points. This required some calibration and experimentation. The success rate wasn't particularly high, around 60% to 70%. I think such results are acceptable for research, but further effort is needed for real applications.
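The grid scheme described here can be sketched roughly as follows. This is only an illustration under stated assumptions, not the lab's implementation: the occupancy grid, the breadth-first planner, and the names `plan_grid_path` and `laser_headings` are all hypothetical. Grid nodes stand in for the points where visual information would be checked, and the heading of each segment stands in for the direction a laser-guided controller would hold between points.

```python
from collections import deque
import math

def plan_grid_path(grid, start, goal):
    """Breadth-first search over free cells (0 = free, 1 = blocked).
    Returns the list of grid nodes from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

def laser_headings(path):
    """Heading in degrees for each segment between consecutive nodes.
    0 means increasing column, 90 means increasing row (illustrative convention)."""
    return [
        math.degrees(math.atan2(r2 - r1, c2 - c1))
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    ]

# Hypothetical 3x3 lobby map with a wall across most of the middle row.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = plan_grid_path(grid, (0, 0), (2, 0))  # nodes where vision would be consulted
headings = laser_headings(path)              # directions to hold between nodes
```

In a real system the nodes would come from the scanned 3D map and the headings would feed a controller that uses laser ranging, which is where the calibration effort mentioned above would go.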

I think that with larger models, there may be better performance in this area in the future. Regarding Vision Language Navigation, my theoretical understanding is limited; we mainly draw on other teams' methods and try to apply them in practice. Those teams mainly test on datasets in simulation environments, using large models, and we mainly deploy their trained models. That is my impression, and I also feel that with the development of ChatGPT-like technology, as models scale up, robots will be able to traverse complex environments through experience, as humans do.

I think this is possible, though I'm not sure whether it's already been achieved; perhaps some experts can offer us suggestions. This is my view. At the same time, I also recognize that no matter how well we do in deep learning, the results are still not ideal for precise movements like obstacle avoidance and navigation. This is because deep learning is better suited to fuzzy, cognitive-level judgments, while in precise scenarios, perception sensors remain key. Humans don't need a precise sense of distance when traversing narrow spaces; they rely on experience, without needing to know the specific distance from obstacles. Robots, by contrast, can accurately measure their distance from obstacles with laser sensors, and then complete the traversal through perception.

I think this is also due to limitations in the volume of training data available to models. So for now, we typically combine perception and cognition to handle these precise movements. Additionally, I believe task-level planning offers us valuable insights for robotics research. Previously, we mainly worked on relatively simple tasks like path planning and navigation, where robots performed well—as long as the environment was fixed. But when environments become complex, the problems get thorny. Now, however, with large amounts of data available, robots can better switch between tasks in complex environments and flexibly schedule execution scenarios. Perhaps in some complex scenarios, we can achieve better results. But this requires support from large models and big data. This is just my own understanding—we haven't made progress in this area yet. We're simply hoping that organizations like OpenAI can bring new breakthroughs in the development of large models.

Which Existing Pain Points Will See "Qualitative" Improvements?

Sang Yu: Thank you, Professor Bi and Professor Yue, for your excellent responses. There's a view that large models compress information from the internet, and that the end result of compressing information and seeking efficient representations is the emergence of human-like abstract understanding and chain-of-thought capabilities. If this capability is used well, I believe robots won't need to rely so heavily on precise sensors—they can perceive and navigate by "walking while looking," making significant progress on corner cases, and opening up tremendous imaginative space on the application side. So I'll direct this application-side question to Mr. Yan. You're currently focused on construction scenarios. If robotics + AGI technology takes another step forward, which customer pain points you're encountering now could potentially be solved with qualitative transformation?

Yan Qifan: Yes, you just mentioned something like the chain-of-thought concept. Actually, I've always struggled to understand what chain-of-thought really is. For humans, it might be a logical thinking process of completing things step by step. Now we see AI trending in this direction too—it can also reason progressively, though it may need humans to provide some prompts or so-called steps. So I'm wondering: I haven't figured out whether this is a genuine chain-of-thought, or just something like the step-by-step operations in ordinary programming. For example, if I were to write an algorithm, I'd first list out the mathematical formulas, then gradually convert them into algorithmic steps.

One problem robots face now is that their tasks need to be planned in advance. We might need to pre-import maps to generate paths, and tell the robot the complete set of rules so it can execute tasks accordingly. With a robot that understands instructions, by contrast, I could simply tell it to complete construction clockwise along the wall, and tell it whether it needs to handle doors and windows.

This interaction approach may be more human-friendly, more convenient, and more accessible than pre-generating the entire path for construction. I think this is a promising direction given current circumstances. Across perception, decision-making and planning, control, and drive, AI has tremendous prospects in perception and planning, which is exciting. But in control and drive, robots still have significant gaps. This is also why we're excited about multimodal models: if one day they truly integrate all modalities, including smell, touch, and so on, that would be fascinating, and robots would truly be able to perceive all information as humans do. However, this may take longer, and we hope future development can achieve this goal. Data collection in this area isn't as easy as with text or image data, which can easily be found in large amounts on the internet and brought back for training. So for the robotics industry, actual implementation will likely run into foreseeable difficulties.

Sang Yu: Among the panelists' professional experiences, you've been exposed to service robots, industrial robots, autonomous driving, and so on. What changes do you think AGI will bring? Will there be any new scenarios or new functions emerging?

Yue Yutao: For new scenarios, I'm personally most interested in digital companions and digital immortality.

Many companies have worked on such projects before, but the experience may not have been ideal. Now large model technology has made various possibilities much more viable. I think digital companions address a genuine, essential need. Technically, we can already create virtual characters to a certain degree, including characters from literary works like Yang Guo, as mentioned by a previous guest.

Another scenario is digital immortality, which involves the digital construction and perpetuation of intelligence, thinking, memory, and consciousness. This was originally a very science-fiction topic. Recently we organized a small roundtable discussion with participants from AI, neuroscience, information science, physics, philosophy, and other fields. The preliminary conclusion was that digital immortality has reached the point where it can be seriously discussed at the technical level. This involves several very interesting aspects. For example, why has the possibility of realizing this scenario now become higher? It's because we have deeper understanding of human intelligence and consciousness. The human brain has 86 billion neurons with connections between them. When external sensory stimuli enter the brain, different regions are activated, and if these regions form extensive interactions, conscious experience emerges. We know we can use "System 1" and "System 2" to describe human thinking patterns—System 1 is a simple response mode, while System 2 is an analytical and logical reasoning mode based on structured knowledge. At the machine learning level, how to achieve intelligence similar to "System 2"—I personally consider this the most disruptive and breakthrough problem, and also one of the most difficult to solve.

Large models have solved this problem by constructing knowledge and the structure between knowledge from massive data. If you ask in reverse: why do people say AI can do things but doesn't understand the meaning of words? Why is there a distinction between understanding and not understanding? There is much research in psychology and other fields. We observe that in language models, this knowledge and knowledge structure forms a hierarchical understanding capability. Although the specific formation mechanism remains a mystery, there is now some evidence and research suggesting that code training may be the process by which large models develop this capability, and that certain specific neurons in large models serve as particular knowledge nodes or reasoning functions. However, if we truly enter the digital immortality scenario, I think we may face several major technical challenges:

One is memory—how to extract memory information already existing in the brain and convert it into training data and input for models. This may be a significant challenge.

Another is models combining multimodal real-time perception. For example, compressing, processing, and abstracting received perceptual information may not be particularly problematic. But whether for memory information or real-time perceptual information, making the model's behavior highly consistent with its original human prototype in terms of personality, habits, thinking patterns, and corresponding learning and updating capabilities is a major challenge.

The third concerns anthropomorphized conscious experience—if you are a digital immortal entity, you might still feel like yourself, still have conscious experience, just with some aspects of sensation possibly differing. I believe this conscious experience is entirely achievable technically.

In summary, these two scenarios—digital immortality and virtual companions—are the two points that currently excite me most at the application level of large models.

Bi Sheng: In robotics, multimodal large models are a hot topic. However, in our lab's navigation work, we face relatively high failure rates. I think if we can further enrich VR segmentation models, whether in labs, homes, or various scenarios, we can achieve better navigation results. This is a very interesting point for me, so I believe large models can make robots more flexible and better adapted to complex living environments—this is very important.

Additionally, in industrial robotics, precise calibration was previously required. For example, when a robot needs to grab a bottle, it must correctly identify and grasp it. For service applications, however, robots need a certain level of consciousness so that they can better understand complex environments. For example, if a bottle of water has been half drunk and I'm no longer present, the robot might need to throw the bottle in the trash. If instead the water hasn't been touched or finished and someone may need it next time, the robot might handle it differently based on previous experience. Moreover, this isn't limited to water; similar situations may arise with other items like cherry blossom tea, mineral water, and various objects that robots can identify and handle correctly. And once it has picked an item up, the robot knows where to place it. I think this resembles human behavior. For example, if I'm cleaning a table, I might consider where to put the water; no one may have told me, but based on past experience, I can handle this task.

I believe if robots can achieve this level of consciousness, they will be able to better serve humans, and large models provide support for this possibility. In industrial fields, especially flexible assembly and other areas requiring flexibility, robots are indispensable. As you mentioned, calibration for flexible assembly is a challenge. In such cases, robots need adaptive and personalized capabilities. Service scenarios are even more so, because they involve interaction with people, so the characteristic of "thousand people, thousand faces" will be fully realized. Additionally, Professor Yue raised a higher-level question about future human-robot relationships and the direction of future social development. This is also worth exploring deeply at the ethical level.

More General-Purpose Robots Require: More Data, More Compute, Better Models

Sang Yu: Everyone has mentioned data issues multiple times. How should robot data be collected, and what should be collected? If we want to achieve relatively generalized application scenarios, we may need to collect data across multiple domains—this isn't easy. I'd like to ask whether some engineering and research solutions have emerged to address these problems.

Yue Yutao: I have two points. First, I believe large models actually alleviate data requirements in many scenarios. The foundation layer of large models is called the base model or foundation model—it's a cross-modal pre-trained model. By embedding large amounts of information and knowledge into this model, we can execute specific downstream tasks on this basis and meet the data and quality requirements for training. In comparison, if we train downstream tasks based on this foundation model, the required data scale and quality demands may be much smaller. This is my first point.

My second point involves our own experience with data. We find cross-modal issues becoming increasingly clear and important. For example, we can perform cross-modal annotation and apply some data augmentation techniques to better use this data and achieve our goals.

It seems that only with the emergence of foundation models and ChatGPT-like technologies did people realize how strong the information commonality between different modalities is. Take the non-multimodal version of GPT-4 as an example: it was trained on data that was entirely text and symbols, yet it could write code that draws figures such as unicorns, houses, and dogs. That is, considerable spatial and geometric concepts were already embedded within the text modality, and these can correspond to information in the visual modality or in other modalities like LiDAR.

So in certain cases — for example, with radar data that's difficult to collect or annotate — we can perform cross-modal annotation, such as using visual results to label radar data. I think this approach could be somewhat helpful for data.
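One way to picture this kind of cross-modal annotation is the minimal sketch below. It assumes the radar points have already been transformed into the camera frame (extrinsic calibration is omitted), uses a plain pinhole camera model, and takes the bounding boxes from some camera-based detector. The function names, the intrinsics, and the sample data are all hypothetical.

```python
def project_to_image(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (x right, y down, z forward, in metres)
    that is already expressed in the camera frame. Returns pixel (u, v) or None."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera, cannot be labelled from the image
    return (fx * x / z + cx, fy * y / z + cy)

def label_radar_points(radar_points, camera_boxes, fx, fy, cx, cy):
    """Give each radar point the class of the first camera box it projects into."""
    labels = []
    for point in radar_points:
        label = "unlabeled"
        pixel = project_to_image(point, fx, fy, cx, cy)
        if pixel is not None:
            u, v = pixel
            for name, (u_min, v_min, u_max, v_max) in camera_boxes:
                if u_min <= u <= u_max and v_min <= v <= v_max:
                    label = name
                    break
        labels.append(label)
    return labels

# Hypothetical detector output: one "car" box in pixel coordinates.
camera_boxes = [("car", (300, 200, 400, 300))]
# Hypothetical radar returns in the camera frame.
radar_points = [(0.5, 0.2, 10.0), (-5.0, 0.0, 10.0), (0.0, 0.0, -2.0)]

labels = label_radar_points(
    radar_points, camera_boxes, fx=500.0, fy=500.0, cx=320.0, cy=240.0
)
print(labels)  # ['car', 'unlabeled', 'unlabeled']
```

Labels produced this way inherit the camera detector's mistakes, so in practice they would likely need filtering or spot checks before being used as training data.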

Yan Qifan: This question relates to pipelines in the construction domain. In practice, data from construction sites is relatively scarce, because most data is generated after home construction is complete. Since this is a fairly niche area, we may need to do some detailed annotation and collection ourselves. However, I just heard Professor Yue mention that foundation models actually reduce the demand for annotated data. So we can get by with small samples, because some knowledge structures, including transfer methods, are already stored in the foundation model. We can build on foundation models to cover specific niche scenarios and reduce data requirements.

I think this is very meaningful for us, because we currently face exactly this problem — as a small company, we cannot afford such high costs to acquire rich scenario data, and foundation models are genuinely meaningful for us.

The second question is about simulation. Personally, I think simulation can probably solve 80-90% of problems now, but the cost of achieving fine-grained guarantees is very high. So it's not that we can't achieve 1:1 digital simulation — it's that the cost is too high. In real scenarios, my wheels might slip, there might be light interference, and so on. But precisely modeling such scenarios and guaranteeing the details is expensive; perhaps large models could offer some solutions. I haven't thought about this carefully yet, so it needs further exploration.

Sang Yu: Mr. Yan's response also reflects the commercial thinking of robotics companies when taking applications into real-world deployment: considering costs, and considering what the overall optimal solution is. I'll move on to the next related question. Robots often have strict low-latency requirements for critical tasks, which conflicts to some extent with large models themselves. Large models have large parameter counts, requiring more memory and stronger computing power, which often runs against the low-power principles of robotics applications. This is also a difficulty in robotics + AGI deployment. I'd like to ask what technical and engineering solutions everyone has seen.

Yan Qifan: Let me first discuss the most traditional approaches. As for how to use large models to solve this problem, perhaps I can hear the two professors' views later. In traditional robotics systems, we also face requirements for real-time performance and computing power. In practice, such a system is hierarchical. As mentioned earlier, a robotics system can basically be analyzed along several major directions: perception, decision-making, planning, control, and actuation.

For control and actuation, the real-time requirements are relatively high, while for planning the requirements are relatively lower — just some embedded systems can achieve this. Therefore, in operation, we basically separate these two aspects.

For parts with high real-time requirements, we let them run on real-time cores, with layering in both the hardware architecture and the software architecture. For perception and planning, the real-time requirements aren't as high, so they may run on higher-computing-power hardware, which fits this layered design. But in the future, if we really want to give robotics systems large model capabilities, we may need to rely on researchers doing foundational work: they can compress models, perform quantization, or otherwise reduce model size to enable local operation, or cloud-based operation provided there is sufficient bandwidth. This may require people working at two levels, cloud infrastructure and model infrastructure, to think it through. We hope to simply enjoy the fruits of that work and just use it.
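As a rough illustration of the quantization mentioned here, the sketch below applies symmetric 8-bit quantization to a handful of weights. It is a generic textbook scheme chosen for illustration, not a method attributed to any speaker, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    where q is an integer in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [scale * v for v in q]

weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical float32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing each weight as one byte instead of four cuts memory roughly fourfold, at the cost of a per-weight error of at most half the scale; whether a given model tolerates that error has to be checked on the actual task.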

Bi Sheng: Edge computing has attracted much attention in the technology field in recent years, with people hoping to apply it to robots. Over the past decade, we've mainly developed edge computing at the microprocessor unit (MPU) level, involving much model deployment such as activity detection, Lite, and so on. In recent years, we've started doing deep learning research on microcontrollers; I even ran a small deep learning model on a small MCU. However, I think these deep learning models have to be relatively small. In the past, our development at the MPU level was mainly based on mobile networks, such as Google's MobileNet and similar architectures. Some domestic companies were also doing similar work; this was seven or eight years ago. Now we have technologies like MCUNet, which use a good deal of mathematical formulation and theoretical abstraction to extract key content and perform sparse search. There is a lot of mathematics involved in compressing and pruning networks for MCUs. Since we also use some off-the-shelf network models, on the robot side we start from the computing side: from microcontrollers to MPUs to cloud acceleration cards, we have computing solutions.

Actually, we have computing solutions, but the key is that on the robot side, some partitioning may be needed. As Mr. Yan just mentioned, during motion we use microcontrollers for low-level development, even using real-time operating systems like main ITS and so on. When controlling robot motion, we need to ensure task-switching latency is within seven or eight milliseconds, so there won't be problems. Therefore, we place some role-level functions at the application layer — as Mr. Yan just said, perception and cognition both use the CPU, but the memory management unit (MMU) and memory interface unit (MIU) at the operating system level aren't ideal. There used to be some real-time operating systems like VxWorks, but they were costly to use and challenging for us. Industrial robots previously typically used such systems to achieve industrial real-time control, but now microcontroller frequencies have increased to 700 MHz, 800 MHz, even 878 MHz. So there's no need for that kind of operating system anymore; we can directly adopt smaller-scale Preempt-RT systems. Then at the decision-making level — the development level, including decision-making and perceptual cognition — although there are some shortcomings at the application level, we can actually achieve a certain degree of edge computing needs.

Of course, I think some partitioning is needed. For example, for large models, even after compression, running them on a true MPU is still very difficult. So if you're dealing with ultra-large models, you may still need to consider partitioning between edge and cloud. In robot tasks — for example, during robot navigation — I suggest letting it compute at the edge regardless of model size. Don't coordinate edge computing with cloud, because if the network disconnects, the robot won't work.

But for some role-guidance aspects — for example, during robot navigation, it may need to be aware of certain environmental changes — I think in such cases communication with the cloud is possible. For example, when the environment changes, large models can be used for environment recognition, then different navigation methods can be switched according to different environments. Because I think navigation methods differ in different environments, especially for very deep corridor scenarios.

In such corridors, laser positioning may not be suitable; loop closure detection should be used instead, letting the robot know whether it's in place. Environmental perception is a very complex problem, but in a corridor positioning fundamentally isn't needed: the robot can just move forward, do relative positioning, and advance along the wall. When the robot leaves that environment and enters another, though, it may need to switch tasks. So how does it perceive environmental changes? In such cases, cloud communication may be needed, using large models for environmental perception. Therefore, I think the navigation process itself should not be partitioned; it should be edge computing. But when switching environments, turn to cloud computing, so there needs to be a combined approach. This is my personal view; I think this field still has many challenges awaiting resolution.
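The edge/cloud split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class `Navigator`, the mode names, the label `"long_corridor"`, and the `classify_in_cloud` callable (standing in for a large-model environment recognizer reached over the network) are all hypothetical and do not come from the discussion.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical navigation modes; the names are illustrative only.
WALL_FOLLOWING = "wall_following"          # relative positioning along a corridor wall
LASER_LOCALIZATION = "laser_localization"  # laser-based positioning elsewhere


@dataclass
class Navigator:
    # Assumed stand-in for a cloud-hosted large model: takes a sensor
    # snapshot, returns an environment label, raises ConnectionError offline.
    classify_in_cloud: Callable[[list], str]
    mode: str = LASER_LOCALIZATION

    def on_environment_change(self, sensor_snapshot: list) -> None:
        """Ask the cloud which environment was entered; keep the mode if offline."""
        try:
            label = self.classify_in_cloud(sensor_snapshot)
        except ConnectionError:
            return  # network down: keep navigating with the current mode
        self.mode = WALL_FOLLOWING if label == "long_corridor" else LASER_LOCALIZATION

    def step(self, sensor_snapshot: list) -> str:
        """One control step computed on the edge; it never touches the network."""
        # Placeholder command: a real controller would use the snapshot here.
        return f"{self.mode}:advance"
```

The design point is that `step` has no network dependency, so a dropped connection can delay a mode switch but cannot stop the robot from navigating.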

Yue Yutao: I can share some views and practical experience on neural network pruning and lightweighting. Imagine I'm holding a ball and throw it; the ball lands somewhere. The trajectory could look very complex, requiring a lot of data and coordinates to describe, especially for ancient people, who didn't know the law behind it and needed complex coordinate systems. Now, however, we know Newton's second law, F=ma (force equals mass times acceleration), and this concise formula is sufficient to describe the entire trajectory very precisely. This shows that in many cases simplicity exists: complex phenomena can be described with very few elements. The same principle has been confirmed in neural networks; many conventional networks have high sparsity.
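As a worked illustration of the thrown-ball example, assume gravity is the only force (air resistance is ignored, an assumption the speaker did not state), with initial position $(x_0, y_0)$, initial velocity $(v_x, v_y)$, and gravitational acceleration $g$. Newton's second law then gives the whole trajectory:

```latex
\begin{aligned}
m\,\ddot{x} &= 0, & m\,\ddot{y} &= -m g, \\
x(t) &= x_0 + v_x t, & y(t) &= y_0 + v_y t - \tfrac{1}{2} g t^{2}.
\end{aligned}
```

Under that assumption, four initial values and one constant replace an arbitrarily long table of measured coordinates.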

For example, suppose we have a model with 95% accuracy and 1 million parameters. Through pruning, we can reduce it to 50,000 parameters, or in some cases just 10,000 parameters, then perform the same task with accuracy perhaps dropping only 1-2 percentage points. That is, even after pruning most parameters, the model can still basically perform the original task. In this process, a key question is how to prune — which nodes and layers to prune. Here, we need to find which nodes can maintain original characteristics and capabilities. Methods in this area are very diverse, but sometimes very simple random pruning actually works better.
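In the figures above, keeping 50,000 of 1,000,000 parameters removes 95% of them, and keeping 10,000 removes 99%. The sketch below shows two simple selection rules on a flat list of weights: magnitude pruning (a common baseline, swapped in here for illustration) and the random pruning mentioned above. The function names are hypothetical, and a real pipeline would normally fine-tune and re-measure accuracy after pruning, which this sketch does not do.

```python
import random


def magnitude_prune(weights, keep):
    """Zero all but the `keep` largest-magnitude weights."""
    if keep >= len(weights):
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]


def random_prune(weights, keep, seed=0):
    """Zero all but `keep` weights chosen uniformly at random."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(len(weights)), min(keep, len(weights))))
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)
```

For example, `magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.02], keep=3)` keeps the three largest-magnitude weights and zeroes the rest, giving a sparsity of 0.5.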

In our exploration, we adopted a method called quantized causality. By quantifying the causal relationship transmitted from one node to the next — during matrix multiplication and other operations — if the causal relationship is strong, we keep the node; if weak, we prune it. This causal relationship is computable and can be measured in information-theoretic ways. When pruning based on this criterion, we found that in many scenarios this method outperforms other pruning methods, and importantly, this method has strong robustness and can apply to various different networks. Previously, one pruning method might suit one network, another method another network — but our practice shows that the quantized causality approach can apply to multiple different networks. The above is some small practice we've done ourselves, hoping it may be enlightening.
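The speaker does not give the formula behind quantized causality, so the following is not that method. It is only a minimal sketch of the general idea of scoring a connection in an information-theoretic way and pruning weak ones, using a plug-in mutual information estimate over binned activations as an assumed stand-in. Mutual information measures statistical dependence rather than causation, and every name here is hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Map real values to integer bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys, bins=4):
    """Plug-in estimate of I(X;Y) in bits from paired samples."""
    dx, dy = discretize(xs, bins), discretize(ys, bins)
    n = len(dx)
    joint, px, py = Counter(zip(dx, dy)), Counter(dx), Counter(dy)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def select_nodes(node_activations, target, threshold):
    """Keep indices of nodes whose score against the target meets the threshold."""
    return [
        i for i, acts in enumerate(node_activations)
        if mutual_information(acts, target) >= threshold
    ]
```

Here `node_activations` would hold one list of recorded activations per upstream node and `target` the activations of the downstream node, collected over the same inputs; a node whose activations carry no information about the target scores zero and is dropped.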

After Adding Large Models, Will Robots Experience "Consciousness Awakening"?

Sang Yu: We've just discussed many serious scenarios and technical questions. The next question leans more toward ultimate imagination about human society. Will general-intelligence robots emerge, and how long will it take? Currently, human-machine coexistence is relatively harmonious, but one day, could a situation like in The Matrix emerge where robots oppose humans?

Yan Qifan: I think this process is actually quite distant. As I mentioned before, we've made breakthroughs in text and images and so on, but in olfaction and many other multidimensional, multimodal aspects, we haven't yet seen clear development paths.

Another aspect is energy consumption. Robots can obviously be stronger and more robust than humans, with greater energy. If we achieve controlled nuclear fusion, this energy is achievable. However, regarding computing power, as you've all probably heard, the human brain only uses about 10 watts; perhaps the energy from eating one bowl of rice per day is enough to satisfy its needs. Processing the massive amounts of information in large models takes far more power than that, so we actually face a rather strange state.

So I've always believed neural networks are just networks — I have no idea what they have to do with actual nervous systems. The neurons in the human brain seem to operate through their own unique mechanisms; you can't achieve something like that network with simple gradient descent algorithms or similar methods. So on this question, I think we're still quite far off. That's my view.

Bi Sheng: Personally, I think when designing robots, we have the Three Laws of Robotics. The first law states that a robot may not injure a human being; the second requires a robot to obey human orders unless they conflict with the first law; and the third requires a robot to protect its own existence, provided that such protection does not conflict with the first two laws. However, I'm not sure these laws can truly constrain a robot's behavior, let alone fully restrain it.

The current development of AI is indeed rapid, though I don't work on the frontier of AI research, which makes it difficult for me to assess accurately. Some authoritative institutions and top figures, experts at OpenAI for example, have raised concerns about the dangers of AI, but we can't evaluate this precisely either. Personally, I find views like President Yan's, that AI lacks the intelligence of robots, somewhat hasty. Yet in the AI field, we also can't accurately assess how far it has developed.

I believe that with the application of large models, we'll see robots functioning at different levels. In this situation, I personally can't offer any definitive certainty. I just feel that if AI can help humanity live better lives, that's already quite good, isn't it?

Yue Yutao: Regarding the Three Laws of Robotics and whether we can control robots, I believe we cannot control robots. This is because of a fundamental concept: computational irreducibility. When a system's complexity exceeds a certain threshold, there will always exist some states beyond the reach of computation — states that simply cannot be encompassed. Therefore, on this question, I personally believe we cannot control robots.

As for the question President Sang raised — about robots like those in science fiction — my personal prediction for the expected timeline to achieve such robots is 20 years, with a standard deviation of 10 years, so roughly between 10 and 30 years. Why is that? Some people think progress is very fast, especially since large models are already quite powerful. But others are pessimistic, believing many problems remain unsolved. I'm relatively neutral. I believe the future development of large models faces three and a half critical problems that need to be solved.

First is multimodal perception and closed-loop interaction with the physical world. Although GPT-4 already has a multimodal version, we don't yet have a clear understanding of its actual performance. Moreover, current large model breakthroughs are still limited to modalities within the information world. I believe that once multimodal perception begins interacting with the physical world, the challenge becomes quite substantial. Solving this problem may take considerably longer than three to five years. That's the first problem.

The second problem is arbitrary multi-step logical reasoning. Earlier versions like GPT-3 had virtually no logical reasoning capability. Starting from the version released on November 30 last year, accuracy for reasoning within two or three steps was quite high. But beyond two or three steps — four or five steps — the error rate increased significantly. With GPT-4, accuracy for independent reasoning across five or six, seven or eight steps is still quite high, but anything more complex becomes unmanageable. There are underlying limitations and issues causing this phenomenon.

For example, the autoregressive approach and token-by-token generation method limit its capacity for complex logical reasoning — what I call arbitrary multi-step logical reasoning. Just like humans solving math problems, individuals make mistakes too. But humans have a system of logical reasoning that allows for backward verification and cross-checking. The sophisticated scientific and technological system we've built rests upon rigorous logical reasoning. Humans can construct such complex systems, but GPT hasn't reached that level yet.
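One way to see why longer chains fail more often, and why backward verification helps, is a toy probability model. It assumes each reasoning step is independently correct with the same probability, which is a simplification the speaker did not state, and the function names and numbers are purely illustrative.

```python
def chain_accuracy(per_step, steps):
    """Probability that every step of a chain is correct (independent steps)."""
    return per_step ** steps


def chain_accuracy_with_check(per_step, steps, detect, retries=1):
    """Same chain, but a wrong step is caught with probability `detect`
    and redone, up to `retries` times per step."""
    step_ok = per_step
    for _ in range(retries):
        # Right the first time, or wrong, caught, and right on a later attempt.
        step_ok = per_step + (1 - per_step) * detect * step_ok
    return step_ok ** steps
```

With a per-step accuracy of 0.95, a three-step chain is fully correct about 86% of the time and an eight-step chain about 66% of the time; adding a check that catches wrong steps raises the eight-step figure, which is the role cross-checking plays in human reasoning in this toy picture.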

The third problem is autonomous training and autonomous learning. The current training approach fixes a version after training it. It can only operate with something akin to short-term working memory within its input token set, offering some flexibility. But it cannot actually update itself unless humans deliberately retrain it with new datasets. Humans, meanwhile, can continuously update the connection weights between neurons in their brains through observation and learning — this happens simultaneously. Unlike our current GPT training based on backpropagation and gradient calculation, the human brain doesn't use backpropagation. So this is the third limitation: autonomous updating and learning.

Finally, there's the half-problem of consciousness. Some consider it an ultimate challenge, but I personally think it only counts as half a problem. The greater difficulty actually lies in the vagueness of how we define and understand "consciousness" itself. If we break down the various behaviors consciousness exhibits and the constituent elements of its mechanisms, I believe existing technology is already nearly capable of fully constructing them.

Therefore, I think solving these three and a half problems may take decades rather than years. At the same time, I'm technologically optimistic: I believe these problems will all be solved, though it may take considerable time. When that truly impressive intelligent agent emerges, whether it will threaten humanity and whether it can be constrained is a deeper topic that may require much longer to explore.