"Xiaosu Tech" Du Zhiheng: Search Is 10x Cheaper Than Reasoning, But 90% of People Don't Know | Yunqi Portfolio

Yunqi Partners · May 15, 2026

How to Spend Your Tokens Without Regret

How should Tokens actually be spent to avoid waste? As Agents continue to drive Token consumption in 2026, this question is becoming increasingly practical.

In complex task chains, a single user request can trigger multiple rounds of search, reasoning, and tool calls. If search results are too long, repetitive, unstructured, or repeatedly stuffed into long context windows, costs quickly balloon.

Recently, Du Zhiheng, CEO and co-founder of Xiaosu Technology, an early Yunqi portfolio company, spoke with Silicon Star about search, reasoning, and Token economics in the Agent era. His view: for many Agent teams looking to cut costs and improve efficiency, the first step may not be switching to a cheaper model, but rather getting clear visibility into inputs and chain design.

When we invested in Xiaosu Technology in 2025, what we valued was its platform position in the Agent era: as Agents become a major product form beyond large models themselves, real-time, accurate, traceable information retrieval will become foundational infrastructure for AI applications' sustained use of reasoning capabilities. (See Yunqi podcast Attent!on Episode 9)

In this edition of "Yunqi Partners," we share the full conversation.

The following is excerpted from "Silicon Star Pro"

Author: Yoky

After OpenClaw went viral, many developers saw their Token bills clearly for the first time: a single user request triggers multiple rounds of tool calls, each carrying ultra-long context, and actual API costs far exceed expectations, sometimes running dozens of times higher than subscription fees. Token bills are increasingly trapping the people who build Agents.

This is not a healthy business model. If an Agent application can't even calculate its own Token cost structure, it can't face the market. The problem isn't that Tokens are too expensive; it's that massive amounts of Tokens are spent where they shouldn't be: redundant searches, low-quality context, information at the wrong granularity, wrong model choices.

Luo Fuli, head of Xiaomi's MiMo large model, made a similar point in an earlier post: the Agent era belongs not to those who consume the most compute, but to those who use compute most effectively. Everyone building AI should establish their own Token economics.

Luo Fuli's tweet:

https://x.com/_LuoFuli/status/2040825059342721520?s=20

We took this question to Du Zhiheng, CEO and co-founder of Xiaosu Technology, who lived through the entire migration of search engines from PC to mobile. Xiaosu Technology's core business is intelligent search: building search engines for AI Agents. Unlike search built for humans, this is search infrastructure that Agent products such as Moonshot AI, DeepSeek, and Manus call directly. Currently, over 50% of China's top Agent companies use its search API, with hundreds of millions of calls per month.

Du Zhiheng, CEO and co-founder of Xiaosu Technology

Having worked in intelligent search for years, Du Zhiheng has extensive hands-on experience with "how to spend Tokens without waste." We spoke with him about how to design search to save Tokens, how search and reasoning should coordinate, and how to choose models cost-effectively.

Intelligent Search: Getting the Information Right from the Source

Silicon Star:

We often say AI "searches the web," but what's fundamentally different between an Agent calling a search engine and a human opening a browser to search?


Du Zhiheng:

The difference is enormous. When humans use search engines, they're "browsing information" — attracted by headlines, using summaries to judge whether to click, reading one by one. So traditional search has long optimized for relevance and click efficiency, with CTR (click-through rate) as the core metric.

But when an Agent calls search, the fundamental purpose isn't browsing, but acquiring information necessary to execute a task. It may be doing research, writing reports, making plans based on search, or passing results to the next tool for further processing. So in an Agent's chain, search results aren't an "entry point" but "raw material" for the task chain.

That difference, entry point versus raw material, leads to completely different optimization goals. You no longer need to put the most clickable links first; instead, you need to deliver a set of content that is sufficiently complete, credible, traceable, and efficiently readable by the model. A concrete example: if you ask an Agent to plan a Singapore family vacation, the Agent won't click through and compare results one by one like a human would. Instead, it quickly gathers everything it needs (visa requirements, flights, hotel prices, child facilities, weather, safety) and then produces an executable itinerary. The role of search here is to supply the raw materials needed for execution in batches, quickly, and precisely.

Silicon Star:

Now there's more and more AI-generated content, some of it confidently wrong. Can search engines tell?


Du Zhiheng:

Our quality control operates at multiple layers.

The first layer is basic screening of sources and content quality. This includes cross-referencing between webpages, whether they come from official media or authoritative institutions, structural quality of language expression — the model does an overall assessment of these factors.

The second layer is information density and originality judgment. Does a piece of content have real information density? Does it have an original source? Or is it just a reprocessing of earlier content? Timestamps matter here: if a piece of content closely matches an earlier source and was published after it, it is likely just a paraphrase.

The third layer is cross-validation. We compare content needing judgment against original publishing sources — official documents, papers, databases, credible media. If an entire chain consists of paraphrases, it's basically unusable.

Beyond this, we also control complementarity between results. For humans, having 7 out of 10 results be repetitive is acceptable — you just click one. But for an Agent, repetition is waste. It needs information coverage from different angles and different sources, so every result has incremental value.
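As an editorial illustration of the complementarity idea above (not Xiaosu Technology's actual method), the following minimal sketch greedily drops results that overlap too heavily with ones already kept. The word-set Jaccard similarity, the function names, and the `max_overlap` threshold of 0.6 are all assumptions chosen for brevity; a production system would likely use stronger similarity signals.

```python
def word_set(text: str) -> set[str]:
    """Lowercased set of distinct words; a crude stand-in for real similarity features."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two results have in common (0.0 = disjoint, 1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_complementary(results: list[str], max_overlap: float = 0.6) -> list[str]:
    """Greedily keep a result only if it is not too similar to anything already kept."""
    kept: list[str] = []
    kept_sets: list[set[str]] = []
    for text in results:
        words = word_set(text)
        if all(jaccard(words, seen) <= max_overlap for seen in kept_sets):
            kept.append(text)
            kept_sets.append(words)
    return kept


results = [
    "Singapore visa rules for family travellers",
    "singapore visa rules for family travellers",  # same content, different casing
    "Hotel prices near Sentosa in June",
]
# The second entry duplicates the first and is dropped; the third adds new coverage.
print(select_complementary(results))
```

The greedy pass keeps the original ranking order, so the highest-ranked copy of any repeated content is the one that survives.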

Silicon Star:

There's a problem here: traditional search engines iterate based on click-through rates, but Agents don't click. How do you know if search results are good?


Du Zhiheng:

This is one of the core challenges of building search in the Agent era. With human search, you can see click behavior for every query type in real time — higher CTR means better results, A/B testing is straightforward. But with Agents, regardless of whether results are good or bad, the customer just takes 10 or 20 results; you see zero click signals.

So feedback comes from the customers themselves. A customer's Agent can perceive when quality is poor in a certain scenario — users follow up, give negative feedback, or the Agent repeatedly handles a question it already answered. Though not as binary as CTR, these are all leverageable signals for reinforcement learning.

Silicon Star:

This means you have to be deeply bound to customers. Are customers willing to share the feedback information you need for optimization?

Du Zhiheng:

This is fundamentally a trust issue, and it is also where the real moat in this sector lies. To improve the quality of their Agents, customers need to make more specific suggestions and demands of the search API they call. But feedback signals are their most valuable data, so they will only co-build with you if they trust you enough.

The prerequisite for trust is solid foundational capabilities. You need to reach near-Bing level before customers will seriously work with you. On that foundation, they'll tell you when top results in a certain vertical have quality issues, or when a query type consistently produces unsatisfying results. It's more like a long-term relationship between a market vendor and a regular customer: you deal with each other daily, and one day the customer tells you Thursday's fish wasn't fresh enough, so you go and fix your supply chain.

The number of high-quality, deep-cooperation customers is necessarily limited, and we're selective too. Taking in every signal indiscriminately is as good as doing nothing: you need customers whose needs are broadly representative and whose feedback can genuinely improve foundational capabilities. The relationship is interdependent and valuable to both sides.

Search and Reasoning Decoupled: Look It Up When You Can

Silicon Star:

Many developers now use models' built-in search capabilities directly, like GPT with web browsing enabled. What's the benefit of separating out search as its own layer?


Du Zhiheng:

At the most abstract level, humans facing a problem have only two options: one is to rack their brains reasoning and calculating, like solving a math problem; the other is to look it up — using a dictionary, a search engine, seeing if there's a direct result. For Agents it's exactly the same: either use the model to reason, or go to the internet to see if there's a native result.

In most cases, looking it up in a dictionary is more reliable and cheaper than reasoning. Reasoning produces hallucinations; while search can't guarantee 100% accuracy either, its error rate is far lower than reasoning from scratch. More importantly, reasoning consumes vastly more Tokens. So for any question with a definite answer, calling search is far more cost-effective than having the model reason it out itself.

Currently many Agents haven't developed the habit of "prioritizing search." Many problems that could be solved with a quick lookup instead go through reasoning chains — neither accurate nor economical.

Silicon Star:

At the execution level, where does search sit within a complex task?


Du Zhiheng:

Search isn't a single point trigger, but embedded in the middle layer of the task chain. Using travel planning as an example again: after receiving the task, the Agent first uses reasoning to break it into sub-problems — destination information, visa requirements, flight options, hotels, child facilities, etc. Then for each sub-problem layer, it calls the most suitable tools: some call search engines, some call Ctrip's API directly, some call weather services. Finally, it uses reasoning again to integrate all results into an executable plan.

So the structure of a complete task is: reasoning decomposition → multi-layer search and tool calls → reasoning integration. The first reasoning segment handles decomposition, the final reasoning handles synthesis; the middle execution chain should let search and specialized tools carry as much as possible. This is the most cost-effective design.
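The three-stage structure described here can be sketched as follows. This is an illustrative outline only: `decompose` and `integrate` are stubs standing in for the two model reasoning calls, and the `TOOLS` registry, its tool names, and the `CallLog` counter are hypothetical. The point it shows is that the model is invoked only at the two ends, while the middle of the chain is carried by search and specialized tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CallLog:
    """Counts how often the expensive model and the cheaper tools are invoked."""
    model_calls: int = 0
    tool_calls: int = 0


def decompose(task: str, log: CallLog) -> list[tuple[str, str]]:
    """Stub for the first reasoning step: returns (tool_name, sub_question) pairs."""
    log.model_calls += 1
    return [
        ("search", f"visa requirements for {task}"),
        ("weather", f"weather forecast for {task}"),
        ("search", f"child-friendly facilities for {task}"),
    ]


def integrate(task: str, findings: list[str], log: CallLog) -> str:
    """Stub for the final reasoning step: merges tool outputs into one plan."""
    log.model_calls += 1
    return f"Plan for {task}: " + "; ".join(findings)


# Hypothetical tool registry; real entries would wrap a search API, a weather service, etc.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[search result for '{q}']",
    "weather": lambda q: f"[weather data for '{q}']",
}


def run_task(task: str) -> tuple[str, CallLog]:
    """Reasoning decomposition -> search and tool calls -> reasoning integration."""
    log = CallLog()
    findings: list[str] = []
    for tool_name, question in decompose(task, log):
        findings.append(TOOLS[tool_name](question))
        log.tool_calls += 1
    return integrate(task, findings, log), log


plan, log = run_task("a Singapore family trip")
print(log)  # two model calls bracket three tool calls
```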

Silicon Star:

How do you determine the output format of search results? When to deliver long text, when to deliver short summaries?


Du Zhiheng:

This depends on the priority of the customer's scenario. Some scenarios are latency-priority, like chatbot real-time responses where users won't wait long — here you should give short summaries so the Agent can quickly integrate an answer. Some scenarios are quality-priority, like academic research or generating deep reports — here you need to read out complete webpage or even PDF content, giving the Agent a clean, complete long text with sufficient raw materials to work with.

This isn't unilaterally decided by us, but configured based on the customer's specific scenario. Fundamentally it's all real-time data retrieval, just with different delivery formats. Search results are an input to the customer's Agent, and different scenarios have completely different requirements for that input.
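One way to picture this per-scenario configuration is a small mapping from scenario priority to delivery format. The sketch below is an assumption-laden illustration: the `Priority` and `Delivery` types, the `summary` and `full_text` field names, and the 300-character cap are invented for the example and do not describe any real search API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    LATENCY = "latency"   # e.g. real-time chatbot replies
    QUALITY = "quality"   # e.g. academic research, deep reports


@dataclass(frozen=True)
class Delivery:
    source_field: str          # which field of a search result to hand to the Agent
    max_chars: Optional[int]   # None means deliver the text uncut


def delivery_for(priority: Priority) -> Delivery:
    """Short summaries for latency-priority scenarios, full text for quality-priority ones."""
    if priority is Priority.LATENCY:
        return Delivery(source_field="summary", max_chars=300)
    return Delivery(source_field="full_text", max_chars=None)


def render(result: dict[str, str], delivery: Delivery) -> str:
    """Pick the configured field and apply the length cap, if any."""
    text = result[delivery.source_field]
    return text if delivery.max_chars is None else text[: delivery.max_chars]


result = {"summary": "s" * 500, "full_text": "f" * 5000}
print(len(render(result, delivery_for(Priority.LATENCY))))  # capped summary
print(len(render(result, delivery_for(Priority.QUALITY))))  # complete long text
```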

Saving Tokens: Choices Matter Enormously

Silicon Star:

With more and more models available, how should developers choose? Can different scenarios within the same product use different models?


Du Zhiheng:

I think this is a practical problem many developers face today. As models proliferate, the easiest trap is to oversimplify the problem as "which single strongest model to choose."


But real business doesn't work this way. An Agent completing a task typically involves multiple simultaneous links: data retrieval, information processing, context organization, model reasoning, and engineering orchestration.

These links aren't independent. Many problems that look like model performance issues are actually caused by poor data quality, overly long context, or unreasonable chain design. Likewise, high call costs often don't mean the model itself is expensive; they mean tasks of different complexity are being pushed through the same processing path.

From our perspective, developers can and should use models at different capability tiers for different scenarios within the same product.

Because within the same product, there naturally exist many different types of tasks simultaneously: some are relatively standardized tasks like classification, extraction, translation, rewriting; some are more reasoning-intensive tasks like complex comprehension, long-chain decision-making, multi-tool coordination. Their requirements for model capability, stability, latency, and cost were never the same to begin with.

If all scenarios use the same highest configuration, results may not be optimal and costs are usually unreasonable; but if you only chase low prices and push all tasks down to low-tier models, you easily run into stability and output quality issues.

What really matters isn't asking "which model is strongest" first, but breaking tasks apart first, getting clear on what capability, quality requirement, response speed, and cost structure each link actually needs.

When these questions are thought through, model selection becomes more natural: not building products around models, but configuring capabilities around scenarios.
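A minimal sketch of "configuring capabilities around scenarios" is a routing table from task type to model tier. The task types below come from the examples in this answer; the tier labels `small` and `large`, the identifier spellings, and the choice to default unknown tasks to the larger tier are assumptions for illustration, not a recommendation of specific models.

```python
# Task types taken from the discussion above; tier labels are hypothetical.
TIER_BY_TASK: dict[str, str] = {
    # Relatively standardized tasks
    "classification": "small",
    "extraction": "small",
    "translation": "small",
    "rewriting": "small",
    # Reasoning-intensive tasks
    "complex_comprehension": "large",
    "long_chain_decision": "large",
    "multi_tool_coordination": "large",
}


def pick_tier(task_type: str, default: str = "large") -> str:
    """Route each task to a capability tier; unknown tasks fall back to the default."""
    return TIER_BY_TASK.get(task_type, default)


for task in ("extraction", "multi_tool_coordination", "unlisted_task"):
    print(task, "->", pick_tier(task))
```

Defaulting unknown task types to the larger tier trades some cost for stability; a team that has measured its own quality requirements might choose the opposite default.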

Silicon Star:

You previously mentioned that built-in model search costs 5 to 10 times more than independent search APIs. Luo Fuli also pointed out that many harnesses frequently compress search results, which invalidates the cache. Where specifically does this 5-10x come from? If developers decouple search from the model and procure it separately, how much can they actually save?

Du Zhiheng:

This 5 to 10 times comes from several layers of costs stacking up.

First layer: search results become persistent context baggage. Normally, a search call ends when its results are returned. But when search is bound inside the model, the results enter the long context and are carried again in every subsequent reasoning round, so the cost grows from that of a single query to that of the same content billed across multiple rounds.

Second layer: secondary processing of search results itself burns Tokens. Many systems summarize, compress, and rewrite results before feeding them back to the model, intending to save money. But with a poorly designed strategy, this step consumes extra Tokens and may lose key information, so it ends up neither saving money nor improving results.

Third layer: cache hit rates are dramatically pulled down. Search results are highly dynamic; once they are in the context, every input differs from the last, and cache reuse largely stops working.

Fourth layer: things that should happen outside the model are all handed to the model. Crawling, body extraction, deduplication, ranking, structuring: all of these can be done efficiently outside the model. If they are all thrown at the model, you are using the most expensive layer of the system for the tasks where it is least cost-effective.

These layers stack up, and the multiplier comes out.
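The first layer alone can be made concrete with simple arithmetic. The sketch below compares search results that stay in context for every reasoning round against results that are read once, with only a short extract carried afterwards. The figures (8,000 result tokens, 6 rounds, a 500-token extract) are invented for illustration and are not measurements from the interview; the function names are likewise hypothetical.

```python
def carried_input_tokens(result_tokens: int, rounds: int) -> int:
    """Input tokens billed when the full results ride along in every round."""
    return result_tokens * rounds


def decoupled_input_tokens(result_tokens: int, rounds: int, extract_tokens: int) -> int:
    """Full results are read once; later rounds carry only a short extract
    that was prepared outside the model."""
    return result_tokens + extract_tokens * max(rounds - 1, 0)


# Illustrative numbers only.
carried = carried_input_tokens(result_tokens=8_000, rounds=6)
decoupled = decoupled_input_tokens(result_tokens=8_000, rounds=6, extract_tokens=500)
print(carried, decoupled, round(carried / decoupled, 1))
```

Under these assumed numbers the carried design bills several times more input tokens before the other three layers are even counted, which is consistent with the way the multiplier is described as stacking.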

Our approach is to front-load this work, getting the information into the right form before it enters the model. But there's a real tension here: compress too aggressively and you lose details; feed in the full text and costs can't be controlled.

This is why we built Chunks: it extracts and reorganizes the fragments of the original content that are most relevant to the current question, rather than stuffing in whole articles. For example, when doing investment research, an Agent needs to analyze a company. If it directly reads 20 full articles at ~1,000 words each, total input is roughly 20,000 words. By extracting and reorganizing key fragments with Chunks, input can drop to about 70% of the original (roughly 14,000 words), so Token costs fall by about 30% while key details are preserved and information coverage stays at 95%.
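To make the general idea of query-focused extraction concrete, here is a toy sketch. It is emphatically not the Chunks implementation, whose internals the interview does not describe: it simply splits documents into paragraphs, scores each by how many query words it contains, and keeps the best-scoring paragraphs within a word budget. All function names and the scoring rule are assumptions.

```python
def split_paragraphs(doc: str) -> list[str]:
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]


def score(paragraph: str, query: str) -> int:
    """Count the words in the paragraph that also appear in the query."""
    query_words = set(query.lower().split())
    return sum(1 for word in paragraph.lower().split() if word in query_words)


def extract_fragments(docs: list[str], query: str, word_budget: int) -> list[str]:
    """Keep the most query-relevant paragraphs that fit within the word budget."""
    candidates = [p for doc in docs for p in split_paragraphs(doc)]
    # sorted() is stable, so equally scored paragraphs keep their original order.
    ranked = sorted(candidates, key=lambda p: score(p, query), reverse=True)
    picked: list[str] = []
    used = 0
    for paragraph in ranked:
        if score(paragraph, query) == 0:
            break  # everything after this point is irrelevant to the query
        length = len(paragraph.split())
        if used + length <= word_budget:
            picked.append(paragraph)
            used += length
    return picked


docs = [
    "Acme revenue grew 20% in 2025.\n\nThe office has a nice cafeteria.",
    "Acme revenue guidance was raised.\n\nUnrelated sports news.",
]
print(extract_fragments(docs, query="Acme revenue", word_budget=50))
```

In this toy run the two paragraphs that mention the query terms are kept and the two off-topic ones never reach the model, which is the same trade the example in the text describes at a larger scale.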

Back to your question: how much can decoupling actually save? Hard to give a unified number; different business chains vary greatly. But if the original architecture was "model directly connected to search internally + large amounts of results repeatedly entering long context," after decoupling plus frontloaded structuring, costs, latency, and stability usually all improve noticeably.

What you actually save isn't just money on a single call, but massive amounts of unnecessary Token consumption throughout the entire Agent chain.

Silicon Star:

How do you become someone who uses compute smartly? If an Agent team wants to bring Token costs down, would you advise them to optimize search first or model selection first? Which link has more room for savings?

Du Zhiheng:

If I could only give one suggestion: don't rush to switch models — first get clear visibility into inputs and chain design.

The reason is straightforward. From most teams we've worked with, what's most easily overlooked yet most likely to amplify costs isn't the model itself, but search and context organization methods.

The logic is simple: if the search results themselves are long, repetitive, and unstructured, or if the same material gets repeatedly concatenated, summarized, and fed into the model throughout the chain, then whichever model you switch to later, you're fundamentally still paying for wasted Tokens.

So often, the first cut should land on this front layer: are search results too long? Is there duplicate content? Are webpage bodies, summaries, and historical context all indiscriminately stuffed in? Which information doesn't need to enter the model at all? Which content can be reused, and which is recalculated every time?

After getting these issues straightened out, the value of model selection optimization can emerge more stably. Because then you're allocating capabilities on a cleaner, more disciplined input foundation, rather than doing patchwork on already out-of-control context — in that state, switching models likely just means finding a more or less expensive way to continue burning money.

So if I had to prioritize: in the short term, the most obvious cost reduction usually comes from search and context governance; in the medium to long term, the most stable, systematic optimization comes from doing both front-end information governance and back-end reasoning capability allocation together. The former solves "too much unnecessary stuff being fed to the model"; the latter solves "too many places getting over-configured."

These two things combined are what truly constitute Token efficiency optimization.