AI Agents with Three Frontier Researchers: Rules, Environments, and Real Questions Beneath the Hype | 5Y Capital's Tavern Vol. 31

5Y Capital · July 30, 2025

Can Agents Ask Real Questions Now?

In this episode of 5Y Capital's Tavern, we spoke with three researchers active on the front lines of AI agents — Kexin Huang, Yunfei Xie, and Jiayi Zhang — about the real progress and challenges in agent development.

From research to application, they shared their assessments of where agent systems stand today, their predictions for how they'll evolve, and a provocative question: Could future models develop something like a scientist's "problem consciousness" — the ability to proactively identify valuable tasks and drive breakthroughs through their own initiative?

This is a deconstruction of agents, and a conversation about the future of intelligence.

[Guests]

Yaopeng Xing, Investor at 5Y Capital (host)

Kexin Huang, PhD student at Stanford University, author of Biomni

Yunfei Xie, PhD student at Rice University, author of ViGaL

Jiayi Zhang, PhD student at The Hong Kong University of Science and Technology (Guangzhou), MetaGPT team

Selected excerpts from the podcast:

01 What Agents Can Do Today

Yaopeng Xing: I'm thrilled to have all three of you on this new episode of our tavern. Could you each briefly introduce your background and current research?

Jiayi Zhang: I'm a first-year PhD student at The Hong Kong University of Science and Technology (Guangzhou), and I also do research at MetaGPT. I mainly lead an open-source organization called Foundation Agents, through which we've done some open-source projects and research: Foundation Agent, OpenManus, and a series of "Automating Something" projects. Recently I've been leading the team in exploring agent environment training and formalization, with about five or six papers in progress simultaneously.

Yunfei Xie: I just finished my undergraduate degree and will start my PhD at Rice University this fall. For the past six months I've been working on rule-based reinforcement learning and how to combine it with multimodal large models. We just ran an experiment where an agent learned to play Snake through RL, and its math and reasoning abilities improved noticeably — with no math samples used during training, something traditional SFT can't achieve. I'll continue exploring RL and agent applications going forward.

Kexin Huang: I'm Kexin, a fourth-year PhD student in Stanford's computer science department. I've always worked at the intersection of AI and biomedicine. I previously worked on biological foundation models, and for the past two years I've focused on how AI agents can participate in scientific discovery. We recently released Biomni, a general-purpose biomedical agent, and we're also building environments and doing RL, aiming to give it true AI biologist capabilities.

Yaopeng Xing: There's a lot of debate and disagreement in the agent space. I'd like to ask: standing at this moment in time, how do each of you define "agent"? What's your definition and thinking?

Jiayi Zhang: We started formally discussing LLM-based agent architecture around 2023. At that time we divided agents into three layers: foundation units like language models at the bottom, structures connecting these units in the middle, and the system's behavioral logic at the top. The main question then was how to combine these components in a "topological" way.

Later we realized that language models alone aren't enough — they need sufficient "spatial sense" and "self-scheduling" ability to become a true agent, then form autonomous multi-agent systems through self-communication mechanisms. Of course, such systems still mostly remain at the simulation level; there aren't many that can actually solve real-world tasks.

What we're building now with Foundation Agent is the hope that it can have "environment levels" similar to humans — meaning it can adapt to different environmental rules. If an agent can act across environments like a person, we can give it a general protocol and let it gradually evolve into a real-world agent network. At the end of the day, the word "agent" itself has no static definition; I see it more as a dynamically evolving process.

Yunfei Xie: Let me add something from the RL side. The concept of an agent actually existed before large models. AlphaGo is a classic RL agent: it perceives the environment, searches, makes decisions, and has memory. But these earlier agents had narrow, limited decision-making capabilities. A large language model is itself a powerful perception and decision network, so we can swap out an "expert model" like AlphaGo's for a more general large model, letting the agent solve more challenging problems in more complex environments. So my understanding of an agent leans toward the system level: the large model is the more powerful "brain," and the agent is the complete execution and feedback system built around that brain.
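As a rough illustration of the system-level view described here, the perceive, decide, act, and remember loop can be sketched in a few lines of Python. Everything in this sketch is hypothetical and chosen only for illustration: the toy `CounterEnv`, the `Agent` wrapper, and the `greedy_brain` function, which stands in for a large model and is not taken from any of the guests' systems.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CounterEnv:
    """Toy environment (hypothetical): bring a counter up to a target value."""
    target: int
    value: int = 0

    def observe(self) -> int:
        # The observation is the remaining distance to the goal.
        return self.target - self.value

    def step(self, action: int) -> bool:
        # Apply the action and report whether the task is now complete.
        self.value += action
        return self.value == self.target


@dataclass
class Agent:
    """The system built around the 'brain': perception, memory, action, feedback."""
    brain: Callable[[int, List[int]], int]  # stand-in for a large model
    memory: List[int] = field(default_factory=list)

    def run(self, env: CounterEnv, max_steps: int = 20) -> int:
        for step in range(1, max_steps + 1):
            observation = env.observe()                      # perceive
            action = self.brain(observation, self.memory)    # decide
            self.memory.append(action)                       # remember
            if env.step(action):                             # act and get feedback
                return step
        return -1  # task not completed within the step budget


def greedy_brain(observation: int, memory: List[int]) -> int:
    """Hypothetical policy: move at most 3 units toward the goal per step."""
    return max(-3, min(3, observation))


steps_needed = Agent(brain=greedy_brain).run(CounterEnv(target=10))
print(steps_needed)
```

In this framing, swapping `greedy_brain` for a more capable model changes the "brain" while the surrounding loop, memory, and environment interface stay the same.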

Kexin Huang: Let me supplement from another angle. Recently many people have been discussing "agency" — why do some people have it and others don't? In certain fields, agent refers to an entity that proactively completes tasks, with will and goal-directedness — a very intentional being.

The agent we ultimately want is, frankly, a virtual human. Which human traits can AI replicate? Right now, environment is the critical first step. Only when an agent can truly exist in an environment and interact with it does it have "agency." Consciousness, emotion, and other capabilities may emerge in the future, but for now, I think "being able to use an environment, learn in it, and act in it" is a fairly basic and standard definition of agent.

Yaopeng Xing: What technological or research breakthroughs from the past year impressed you most? And what are you most looking forward to seeing become reality?

Yunfei Xie: For me, the most striking was DeepSeek's work: they got rule-based reinforcement learning with GRPO (Group Relative Policy Optimization) to actually work. In academic research it showed strong generalization capability. Previously we had to rely on massive corpora and pretraining; now you design the environment and rules, and the model learns on its own. It's also a huge breakthrough for industry. Before, training an agent for something like booking flights required massive amounts of manually annotated data such as mouse trajectories and button coordinates, which was expensive and tedious. Now you build a virtual web environment with clear rules, and the agent can learn to complete tasks autonomously through RL, far more efficiently than with traditional methods.

Last year there were also many large-scale agent datasets, but that approach of recording human behavior doesn't scale. Rule-based RL lets agents run infinitely in standardized environments and learn autonomously. I think it could revolutionize how we train agents.
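To make "rules instead of annotations" concrete, here is a minimal Python sketch of two ingredients of this style of training: a reward computed by a fixed rule, and the group-relative advantage used in GRPO, where each sampled response is scored against the mean and standard deviation of its own group. The `Answer:` marker convention, the sample strings, and the function names are assumptions made for illustration. The sketch covers only reward and advantage computation, not the policy update itself.

```python
import statistics
from typing import List


def rule_based_reward(response: str, expected_answer: str) -> float:
    """Return 1.0 if the text after the last 'Answer:' marker matches exactly, else 0.0.

    The 'Answer:' convention is an assumption for this sketch; any checkable rule works.
    """
    marker = "Answer:"
    if marker not in response:
        return 0.0
    final = response.rsplit(marker, 1)[1].strip()
    return 1.0 if final == expected_answer else 0.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize each reward against its group: (r - mean) / std.

    Uses the population standard deviation (an assumption of this sketch).
    If every response gets the same reward, there is no learning signal.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# A hypothetical group of sampled responses to the prompt "What is 2 + 2?"
samples = [
    "2 + 2 is 4. Answer: 4",
    "I think it is five. Answer: 5",
    "Answer: 4",
    "Not sure.",
]
rewards = [rule_based_reward(s, "4") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

No human labels any individual response here: the rule scores them, and responses that beat their group average get a positive advantage while the rest get a negative one.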

Yaopeng Xing: This leads to a related topic — many tasks have rewards that can't be quantified, like complex negotiations in business. We can't score AI on every sentence like a test. In reality, human feedback is often vague: "this strategy won't work" or "this proposal isn't good enough," with no absolute right or wrong. How AI learns quickly from these fuzzy signals is a breakthrough direction worth watching.

Yunfei Xie: Right, rule-based RL has limitations. In some scenarios, like buying something, there are clear rules for judging success. But with more complex systems, how to define those rules is itself a difficult problem RL needs to crack.

Kexin Huang: What impressed me most this past year was still AI's reasoning ability, especially in scientific applications. Early on you could see some scientific knowledge had been written into ROM (read-only memory), but knowledge alone was far from enough, and many tasks still couldn't be completed. In the projects I was involved with around the o1/o3 series, you could clearly feel that, on the same tasks, improved model reasoning unlocked many applications that were previously out of reach. This may not count as a dramatic breakthrough, but it's been a continuous, sustained process. For example, a year ago FutureHouse had only published papers on literature research workflows, and many people thought it was no different from ChatGPT. But this year you can see AI is starting to complete scientific discovery tasks with real economic value.

Behind this change are many factors pushing together: the base model's knowledge base has grown larger, reasoning ability has strengthened, RL frameworks are starting to guide models toward more complex tasks... combined, these are making the concept of "AI scientist" gradually become realistic. I think this is "gradual but substantive" progress — quite striking.

Yaopeng Xing: I've felt this myself too. We regularly meet entrepreneurs who want to use AI for research. It used to be painful reading their materials; now using o3 to analyze and reason through their underlying technical logic actually helps us understand quickly, and the learning curve has come down a lot.

What agent products have you used that left a strong impression?

Jiayi Zhang: I use Cursor and Claude Code the most. For me, a good AI tool needs to help me complete real tasks over the long term, not just answer questions one-off. Beyond coding, it can also handle daily chores — I download a lot of PDFs but don't want to organize them manually, and it can automatically sort and archive them for me, which is quite practical.

For OpenAI and Gemini's Deep Research products, I tend to use them for introductory summaries, especially in fields I'm completely unfamiliar with. They can quickly compile hundreds of relevant links, broad coverage and high efficiency, sometimes faster than I could search myself.

Yunfei Xie: I'm similar to Jiayi — mainly using code tools. Many agents today look flashy but are far from actually fitting into my workflow. Most have low success rates, and the actual efficiency isn't necessarily good — sometimes you wait for it to run for ages and it fails in the end, then you have to retry, which is a pretty bad experience. Code agents, on the other hand, genuinely boost productivity and deliver direct value.

Kexin Huang: I'm also a heavy Cursor user — it's been especially helpful for coding-related tasks. Some friends of mine in biology have started using it too. Right now they're mostly applying it to basic tasks, but you can see it's beginning to integrate into their daily research workflows. As Yunfei mentioned, more complex tasks still can't be fully handed off to agents, but the trend is clear — people are gradually learning how to put these tools to work and actually deploy them in scientific research.

02 The Bottlenecks, Boundaries, and Exploration of General-Purpose Agents

Yaopeng Xing: This is quite an interesting pattern. Every time we talk to people with research or technical backgrounds, we find they don't really use those general-purpose agent products on the market. What they're actually willing to spend time on are coding tools. We've tested many general-purpose agents ourselves — one colleague had them do a DCF valuation analysis for Tesla, and the results looked flashy but couldn't withstand scrutiny.

So in professional contexts, code agents' capabilities are recognized. General-purpose agents perform better on "check-the-box" tasks — writing a passable report, alleviating workplace anxiety — and there's certainly a market for that. But to actually enter certain vertical domains, you need products with solid environments and tools to deliver real efficiency gains.

This raises a question — general-purpose agents still face many capability bottlenecks. What's your take? How should their capabilities evolve going forward? Is there any chance they can truly solve high-value tasks beyond coding?

Jiayi Zhang: Here's how I see it. Many companies building general-purpose agents face a critical challenge: constructing a data flywheel. While they don't explicitly say they're doing this, they typically emphasize training browser agents through user clickstream data. But I think the core issue isn't data collection methodology — it's the stage of agent capability.

Currently agents have two main functions: information organization and information generation. Something like Deep Research helps with the organization piece, but what really determines user experience is generation — not just generating correct content, but generating it in a form users find "visually agreeable." Right now there's no effective feedback mechanism for this, and no unified standard. For example, when we build web products, the model can implement functionality but can't judge what's "good-looking." Elements like aesthetics and simplicity — things humans perceive intuitively — agents currently can't capture.

So we're trying to start from user intent, combining behavioral data to train our own models, optimizing structure and aesthetics during generation. The essence is solving whether "the generated content looks like it was made by a human."

Yunfei Xie: Jiayi put that well — I have a similar feeling. These agents today have the ability to complete tasks, but they're far from doing them well. For instance, generated web pages may be functionally complete but look ugly, like toys — users abandon them quickly. We still need to write very detailed requirement specs for agents to produce barely satisfactory results. But the real ideal is: the user says one sentence, and it delivers excellent output in one shot.

This transformation is like the leap from GPT-3 to GPT-3.5 — the former was merely "usable," the latter became a genuine productivity tool. Today's general-purpose agents haven't crossed this threshold yet. It requires massive data, systematic optimization, and engineering accumulation to solve numerous corner cases.

Kexin Huang: We work on vertical agents for scientific research and face similar challenges. There are three key issues: First, insufficient knowledge reserves. Much biological and medical material sits behind paywalls, completely inaccessible during model training. Second, lack of high-quality tools and real-time feedback mechanisms, making it difficult to support agents in completing complex tasks with economic value. Third, lack of domain "common sense." For example, in a recent task, the agent's solution was basically correct, but an expert pointed out a small error — and that error happened to be critically important. Such detailed errors are currently hard to avoid. So while UI and experience matter too, what ultimately determines whether an agent is truly "useful" is whether it can achieve expert-level professional capability.

Yaopeng Xing: Going forward, if we want to provide users with professional, personalized experiences, which parts should be optimized at the model layer, and which need to be built at the system layer beyond the model? And how should the boundary between these two be drawn?

Jiayi Zhang: My own understanding is this: if I were building a general-purpose, cross-environment agent system from scratch, I'd first clarify the concepts of "model" and "system" — there's actually a clear boundary between them. But the boundary between "model" and "agent" is less distinct. Many models can already complete some autonomous tasks, so the mere ability to "complete tasks" is no longer sufficient to distinguish whether something is an agent.

In my view, a model is a model — its core responsibility is to provide foundational intelligence, particularly understanding the dynamics of the environment. I'd probably choose a base model, which could be open-source or weight-tuned, and on top of that, connect a learning system — one that can continuously adjust the model or system itself based on data or environmental feedback.

So the whole system splits into two parts: one is the fine-tunable or fixed base model, the other is the external system responsible for execution. The learning system serves as a bridge, continuously learning and adjusting execution strategies through environmental or user data. In other words, the model layer provides intelligence and environmental understanding, while the system layer handles contextual awareness — knowing what adaptations and processing to make in what environments. Much of the learning may happen at the system layer, but its object remains the model itself.

Yunfei Xie: The two keywords just mentioned — "boundary" and "personalized experience" — immediately made me think of recommendation systems. By comparison, current agent systems are actually quite weak on user feedback mechanisms. GPT lets you thumbs-up or thumbs-down responses, and sometimes offers two replies to choose between, but most of the time users can't be bothered — I'm like this myself. If I'm unsatisfied with an answer, I might just close it or switch to another agent rather than leave feedback.

Recommendation systems are different: they naturally compare different algorithms through A/B testing, using data to guide optimization. Agents currently have almost no equivalent mechanism. We can pre-train and fine-tune before launch, but once an agent is live, our ability to optimize it based on real user behavior is quite limited.

Going further, different users' interaction styles with agents vary enormously. Some prefer writing out every detail in one shot; others favor multi-turn dialogue with repeated refinement. Prompt length, complexity, and rhythm all differ. But agents aren't sensitive enough yet to truly adapt to these differences.

So I think how to dynamically adapt to different users' interaction preferences is another major challenge for agents right now. This isn't just a product problem — it's a question of system boundaries and capability boundaries.

Kexin Huang: Something quite crucial was just mentioned — agents fundamentally depend on environment and base model. This is something I've been thinking about too. For general-purpose platforms like Claude or ChatGPT, if they want to support development of vertical domain agents in the future, how would they do it?

Because many vertical agents heavily depend on specific environments, and these environments are often highly specialized. If you want an agent to run well in such a specialized environment, you first need to build that environment itself. That is, for every vertical domain, Claude or OpenAI would need to provide a matching environment — only then can the system collect data and train models based on that environment.

What I find particularly interesting here is that environment and base model are a co-evolutionary process. And as everyone knows, the environment itself isn't static — especially in rapidly changing fields like scientific discovery: today's best tools may be obsolete tomorrow. So if you only train a base model at one point in time, how can it keep up with knowledge evolution? How to achieve continuous learning, or continuous agent adaptation? This actually raises many questions worth exploring.

For example, Anthropic's recently launched Artifacts is a lightweight environment designed for web development. But if, as Dario Amodei envisions, they want to build a "virtual biologist" in the future, that would require an environment tailored for biology. And this environment could be extremely complex, needing a large team for long-term maintenance. So I believe the key to true agent evolution lies in how environment and base model co-evolve, because this determines an agent's adaptability and long-term value.

Yaopeng Xing: As everyone just noted, this is both a technical and commercial problem. As intelligence continues to evolve, agents will likely penetrate various industries, and a clearer division of labor may form between models and environments. For instance, some environments are designed for base model self-iteration, while others, from commercial efficiency considerations, are better suited to serving specific vertical domains. There are also niche environments that major players overlook, where startups might instead gain precise feedback and refine experiences.

Take our observations in the multimodal field: in recent years, many directors and creators have acknowledged the technical quality of image generation from GPT-4o or Flux, even finding it quite realistic. But the reality is, three years on, many artistic creators still prefer Midjourney, because its nuance and personalization in aesthetic style are qualities other tools haven't yet achieved. In the same way, small companies that go deep in focused directions can establish themselves in scenarios giants haven't prioritized, and behind this lies a commercial dynamic worth watching.

If agents continue to evolve, as many researchers have proposed, we may no longer be able to evaluate models and agents with fixed benchmarks in the future. Instead, we'll need to construct a new "environment + feedback" system that allows agents to continuously receive incentives and optimization opportunities through tasks.

03 From Executing Tasks to Discovering Problems: How Far Are Agents?

Yaopeng Xing: Do you think future models could, like scientists, proactively pose valuable questions and, driven by their own initiative, design better tasks?

Yunfei Xie: I think we can divide this into two categories. One is incremental questions — slight combinations or inferences based on existing knowledge, like a new angle discovered after a literature review. Agents today can basically handle these. The definition is clear, the search space is bounded, and you can find answers through combination and generalization.

But the other category is much harder. It's more like literary creation, relying on inspiration and intuition — the kind of question that "has never been asked before." Behind such questions lie scientific taste and value judgment, and I don't think agents possess these yet.

So right now, agents can pose some questions at the boundaries of existing knowledge, no problem. But to come up with truly disruptive, breakthrough insights — that's still quite difficult.

Jiayi Zhang: I largely agree with Yunfei. A lot of what we're doing now is essentially extending inferences on top of existing ideas and structures. This doesn't depend so much on the model's own depth of knowledge; what's more important is guiding it to ask questions along valuable paths. Often, we extract coherent patterns from human collaboration, and the agent just helps fill them in.

Kexin Huang: Right, I think everyone made good points. Agents can pose questions, but what we care more about is whether they can pose particularly good questions. The difference between a Nobel-level scientific discovery and the small questions we ask daily comes down to complexity and leap distance.

Some questions only require a single hop to connect (single-hop reasoning). But truly breakthrough questions may need hundreds of reasoning steps (multi-hop). The more complex and distant the logical chain, the higher the demands on the agent. And in fields like biology, reasoning alone isn't enough — you need experimental feedback to close the loop. Many major achievements are actually the result of long cycles of "reasoning — experiment — re-reasoning."

So in the future, if an agent ever produces a Nobel-level discovery, it will likely have been running independently in a complex environment for a long time, forming genuinely creative questions through multiple rounds of feedback. It's not impossible. But this kind of high-quality question output will necessarily be scarce — it can't be mass-produced. The issue isn't "can it ask," but "can it ask well."

Yaopeng Xing: Today everyone also touched on a framework for evaluating progress in intelligence: we could judge capability improvements by how long a model or agent can sustain independent thinking and reasoning, from one minute to one hour, and perhaps eventually a full year of reasoning and experimentation without being derailed by distractions. But I'm also curious: does this progress happen naturally as test-time scaling advances? Or are there still many unsolved challenges behind it that researchers and engineers alike need to invest substantial time to crack?

Jiayi Zhang: I think this requires deliberate design and problem-solving — it's unlikely to "emerge" naturally. As I mentioned earlier with scanning, while it sounds like active exploration, it's still largely humans proposing ideas and designing paths. Every researcher's reasoning style and thinking pattern differs, so we need a lot of engineering work to guide it — constraining it to think along certain paths, or using specific methods to deepen the reasoning process.

Yunfei Xie: Yes, to put it simply: if you let an agent think continuously for a long time, can it actually solve the problem? For now, large models based on autoregressive architectures are essentially probabilistic models that generate content token by token, with each token sampled from a probability distribution conditioned on the preceding tokens. This approach is highly prone to "cumulative error": once one step goes wrong, everything downstream drifts further off course.

So in these models, test-time scaling doesn't necessarily mean performance improvement. It might actually make results worse due to error accumulation.
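A back-of-the-envelope calculation shows why longer chains can hurt. The Python sketch below assumes, as a deliberate simplification, that each reasoning step is correct independently with the same probability and that a single wrong step ruins the chain. Real models can sometimes notice and correct mistakes, so this is an illustration of the compounding effect rather than a model of any particular system. The function name and the 0.99 figure are hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct, assuming independent errors."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Even a step that is right 99% of the time compounds badly over long chains.
for steps in (10, 100, 1000):
    probability = chain_success_probability(0.99, steps)
    print(f"{steps:>5} steps: {probability:.6f}")
```

Under these assumptions, a 99%-reliable step yields an error-free chain roughly 90% of the time at 10 steps, about 37% at 100 steps, and well under 0.01% at 1,000 steps.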

But we've also seen hope. Diffusion-based language models, for instance, are quite promising. Rather than outputting one token at a time, they update entire passages in a more global manner, making it easier to control overall coherence and correctness — perhaps addressing some inherent limitations of autoregressive approaches.

Kexin Huang: I very much agree. Human civilization didn't evolve through complex reasoning completed by a single LLM in one go. It's more like a modularized system, evolving through layered structural accumulation.

So if we want an agent to execute a year-long task, it certainly won't rely on just one LLM. We'll need a multi-agent system, or an agent system that can continuously accumulate knowledge and organize information. It performs structured tasks daily, constantly recording and optimizing, rather than completing everything in a single output.

In this process, the biggest challenge is probably context engineering: how do you manage context over long-duration tasks? How do you coordinate the organization and scheduling of massive knowledge and observations? It's like building a real company — you can't do it with one person alone. You need a system with division of labor, structure, and collaboration to accomplish complex, long-term goals.
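One simple way to picture the context-management problem is a rolling buffer: keep the most recent observations verbatim and fold older ones into a compact digest, so the prompt stays within a budget as the task runs on. The Python sketch below is a minimal illustration under stated assumptions. The `RollingContext` class and its parameters are hypothetical, and plain truncation stands in for what would more plausibly be a model-generated summary.

```python
from collections import deque
from typing import Deque, List


class RollingContext:
    """Keeps recent observations verbatim and compresses older ones into a digest."""

    def __init__(self, max_recent: int = 3, digest_chars: int = 40) -> None:
        self.recent: Deque[str] = deque()
        self.digest: List[str] = []
        self.max_recent = max_recent
        self.digest_chars = digest_chars

    def add(self, observation: str) -> None:
        self.recent.append(observation)
        while len(self.recent) > self.max_recent:
            oldest = self.recent.popleft()
            # Placeholder for a real summarizer (for example a model call):
            # here the entry is simply truncated.
            self.digest.append(oldest[: self.digest_chars])

    def build_prompt(self) -> str:
        # Older, compressed entries come first, followed by recent full entries.
        lines = ["[digest] " + entry for entry in self.digest]
        lines += ["[recent] " + entry for entry in self.recent]
        return "\n".join(lines)


context = RollingContext(max_recent=2, digest_chars=12)
for day in range(1, 5):
    context.add(f"Day {day}: ran assay and logged results")
prompt = context.build_prompt()
print(prompt)
```

A multi-agent setup could give each agent its own buffer of this kind plus a shared digest, which is one possible reading of the "division of labor" Kexin describes.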

04 Epiphanies, Pivots, and the Future

Yaopeng Xing: In your AI research journeys, have you experienced any "aha moments" — sudden flashes of insight?

Jiayi Zhang: I only seriously started doing research in my senior year — the first three years I was basically coasting, busy with student council stuff, and my grades weren't great. I started catching up on research skills in senior year to secure grad school recommendation. I remember my first "aha moment" was when writing my first complete paper, around late September to early October 2024. No one was helping me edit; I was absolutely desperate. In that moment, I suddenly realized: research is something you ultimately have to do yourself. Your paper, your ideas, execution, reasoning — you have to own it all. Advisors and bosses can't do it for you. From that moment, my understanding of research fundamentally shifted. This epiphany didn't come from some novel idea, but from a profound realization of "research agency."

Yunfei Xie: Looking back, I can share an "aha moment" from our work. We trained an agent to play Snake, originally just to optimize its in-game performance. Later, we tried testing this agent on some math tasks, and the results were surprisingly good — it generalized even better than the base model. And it had never seen any math data; training was entirely based on the game.

Many people asked us, why did you think to apply a game-playing agent to math tasks? Honestly, we didn't start with that idea. I was even somewhat resistant to the project, thinking "game-playing" research was too niche. But the turning point was when I suddenly realized: what I actually care about is general capability. So on a whim, I tested its performance on general tasks, and the results were unexpectedly good. For me, this "aha moment" wasn't some divine inspiration, but a natural extension of long-standing attention to general intelligence and reasoning.

Kexin Huang: I can share an experience too. I started doing research about seven or eight years ago, always focused on AI plus biomedicine. For the first four or five years, I was mainly building a foundation model for biological data. There was an "aha moment" in there. After we published one result, I was chatting with a biologist. At the time, we had another collaboration going, and there was a basic biological data analysis task in it. He asked me to help process it. I was a bit lazy, so I said: "How about I let my agent handle it?"

After the agent finished, I showed the results to this biologist. His reaction — not quite screaming, but you could clearly tell he was genuinely surprised and excited, feeling this thing could be immediately useful. He even said it would have normally taken one of his students three or four months to complete. From that point, I realized this agent had significant economic and practical value.

This was completely different from when I was doing foundation models for medicine before. Back then, when I showed those results to biologists, they'd at most say "very cool" and that was it. The contrast was stark. In that moment, I felt this agent possessed genuine practical and economic value. From then on, I firmly pivoted to the "agent + biology" direction.

Yunfei Xie: I'm still a research novice, living in an era of exploding technology and papers. But I hope I can persist, undistracted by the environment. My ten-year goal is to produce several solid, grounded research pieces, using them as leverage points to advance a small field. In this restless era, not being anxious, not blindly chasing trends — that's the original intention I hope to hold onto.

Jiayi Zhang: I set myself a three-year goal: by graduation, build a company of my own. Whether it gets acquired or is still operating, as long as it once existed, it's an anchor point for me. If I had to state a ten-year wish: the company I founded could have some positive impact on the world.

Kexin Huang: I want to become mission-driven. My dream is that in ten years, biologists wake up no longer manually designing experiments, but opening our platform to check what agents accomplished overnight, then directing them to execute new tasks. The way scientific research is conducted may be undergoing its first transformation in centuries, from manual execution to agent collaboration. I want to participate in and drive this change — it's profound and fascinating, and might fundamentally alter how research is done. I hope to leave my own mark on it in ten years.

5Y Capital seeks out, supports, and inspires lone entrepreneurs, backing them from the spirit down to every operational detail. We believe that if the world starts believing in the "crazy" version of you that others see, things will unfold in entirely new ways.

BEIJING · SHANGHAI · SHENZHEN · HONG KONG