The "AGI OS" Era Unleashed by ChatGPT: How Entrepreneurs Should Approach Application Development | 5Y View

五源资本·March 10, 2023·7·5

AI is the most important tool ever invented by humanity.

Yunfeng Shi, 5Y Capital

What makes humans unique is our capacity to develop scientific methods and tools — the evolution of tools marks the milestones of human civilization. I believe AI is the most important tool humanity will create in the 21st century.

Back in 2021, we argued that "the maturation of front-facing phone cameras around 2010 dramatically lowered the barrier to video creation, giving rise to the Douyins and Kuaishous we scroll through daily. We've been asking ourselves: what is the technological variable today that could lower the barrier to creation by 100x? AIGC, we believe, is one answer."

Two years later, with GPT and diffusion models as new tools for a new generation of developers, we're especially excited to see what unique user experiences developers can build with them. We'd love to chat.

Article republished from TimeDomain Tech

Author: Jing Guo, Founder of TimeDomain Tech

Over the years as entrepreneurs, we've lived through countless tech themes: new energy vehicles, autonomous driving, the metaverse, web3, VR/AR...

But never have we seen anything like ChatGPT — something that, in such a short time, convinced so many frontline pioneers (founders, scientists, investors) that this was the defining event of the next decade, while simultaneously letting everyday users experience it firsthand, immerse themselves in it, and integrate it into their daily workflows.

ChatGPT's emergence has been described by many as "another electricity revolution," "the next-generation operating system."

As tech entrepreneurs, we're of course incredibly excited — but with that excitement comes serious anxiety.

Excited because entrepreneurial passion is reigniting globally. In Beijing, it feels like the garage-coffee era of Zhongguancun's Startup Street ten years ago, when people would throw up their hands and declare "I'm gonna start a company!" is back.

Anxious because: what do we actually do? After all, not everyone has the means — or should — try to build another ChatGPT.

So for most entrepreneurs and tech practitioners, how should we engage with ChatGPT?

In other words, what will be the core paradigm for application innovation and development on AGI platforms like ChatGPT?

This article combines our own experiments with ChatGPT to offer a simple, non-technical framework for building on ChatGPT.

Here's our conclusion upfront: we believe the opportunity for application innovation in the AGI era will be enormous. And if this opportunity won't belong to everyone, it certainly won't belong to just a handful of people either.

What we're working on

Our company is called TimeDomain Tech. We're building next-generation AI voice technology — AI voice with genuine emotional expressiveness and full-stack capabilities (human voice doesn't just speak, after all; it sings, cries, laughs, shouts). Our singing synthesis product, ACE Studio, enables AI to perform studio-quality vocals that surpass human capabilities. Global music creators have already used ACE to produce millions of AI-sung tracks, with cumulative plays nearing one billion across all platforms.

Our long-term vision is to use emotionally rich AI voice technology to build the emotional bridge between humans and AI in the AGI era.

How we used ChatGPT: one example

Behind ACE Studio's AI voice technology lies a technique called "timbre blending." Through a multi-speaker architecture and the model's transfer learning capabilities, singers in the model can be blended in specific proportions to generate infinite new timbres that don't exist anywhere in the world.

For example: say I have a singer modeled on Jacky Cheung — a rich, soulful mature male voice. And another modeled on Faye Wong — an ethereal, resonant female voice with beautiful head voice. Using this technique, I can create "50% Cheung + 50% Wong," birthing an entirely new gender-neutral singer that's simultaneously ethereal and warm. In practice, since our model contains hundreds of singers and supports blending across multiple dimensions at any ratio (even negative ratios), the possibilities are virtually limitless.

This process is like mixing oil paints — with enough base colors, you blend by feel until you get the shade you want.

The problem: this still feels too abstract! What if users could describe their desired timbre in natural language ("give me a gentle but slightly girlish female voice") and get back a sensible blend? That would be pretty cool.

Our first instinct was to train a text-to-timbre AI model: input text description, output timbre recipe. But this would require collecting massive amounts of user timbre descriptions and corresponding recipes as training data.

And this approach lacks flexibility — if we update our model with new singers, we'd need to rewrite training data and retrain the entire text-to-timbre model.

So we asked ourselves: how could we leverage ChatGPT to achieve this? (For instance, when a user says "give me a gentle but slightly girlish female voice," we return a matching blend.)

If you simply ask ChatGPT directly, you'll get nothing. First, it has no idea we're talking about a timbre-blending technique inside ACE Studio. Second, even if it did, it wouldn't know how to mix to get what you want.

The solution is straightforward: tell ChatGPT what it doesn't know.

Step one, we describe the timbre-blending technique in words:

Step two, we describe our "base color" singers' timbres in language for ChatGPT.

At this point, we've basically given ChatGPT enough context. Now we ask it to blend a timbre for us:

Here's ChatGPT's response:

It follows naturally that if we replace the red-boxed content in the prompt above with the user's timbre description, keep everything else as a preset prompt, feed it all to ChatGPT, then extract the blue-boxed content from its response to drive our timbre-blending parameter system — we've just used a general-purpose AI dialogue system to fulfill a highly business-specific need: "letting users create blended timbres through natural language."

Let's hear how ChatGPT's blend sounds:

Not bad — pretty close to what we wanted.

And naturally, we can guide users to give feedback on ChatGPT's blend (mapping their subjective impressions and adjustment requests into ChatGPT's information space through natural language), continuing the conversation to iteratively refine the recipe like a client directing a vendor, until satisfied.

The adjusted timbre comes closer to my vision while avoiding making this invented singer too similar to any original voice:

In the second round, ChatGPT not only grasped my subtle adjustment needs but explained the rationale behind its changes, noting that "blend ratios are highly subjective" — implying this is a matter of taste and welcoming further fine-tuning.

This illustrates the benefit of using a general-purpose AGI natural language interface: it's perceptive enough to understand what users really want, maintains context well, and brings its own reasoning to the process. Our "timbre-blending model" serves as the downstream task executor — it needs to understand (or rather, extract) the AGI's instructions and execute them with precision.

The basic paradigm for building on ChatGPT

Generalizing from this example, we believe we've glimpsed a fundamental paradigm for developing on AGI:

Encoding: Mapping product-specific knowledge into AGI's general language space

By "encoding," we mean using preset prompts to "set the scene" for ChatGPT.

In the timbre-blending example, through the technical explanation in step one and the singer timbre descriptions in step two, we translated our product's specific logic and knowledge — including audio modalities that ChatGPT currently cannot directly "hear" — into general natural language descriptions.

This is essentially mapping our highly business-specific information and multimodal data into ChatGPT's natural language information space. ChatGPT excels at understanding human questions, commonsense information, and universal logic embedded in language.

So by setting the scene through preset prompts and cleverly embedding user requests, we can guide it to handle domain-specific problems.

Below is the preset prompt that "hypnotized" New Bing revealed. You can get a sense of how to set ChatGPT's scene for search tasks:

Decoding: Parsing AGI's natural language instructions to execute specific tasks

Once ChatGPT understands our product logic and the user's request, it responds with solutions in natural language. "Decoding" means parsing these natural language instructions, extracting actionable information, and driving downstream task execution.

In one text-adventure game using ChatGPT, the developer fed ChatGPT's scene descriptions into Midjourney to generate illustrations — a relatively straightforward approach.

Brainstorm: Designing a ChatGPT-powered PC recommendation app

In practice, this encoding-decoding process isn't limited to a single iteration — they can nest within each other. Let's brainstorm together:

Suppose we want to use ChatGPT to build a pre-sales bot that recommends computers and answers questions for prospective buyers.

Assume our website has thousands of PC models, each with its own page featuring specs, pricing, sales figures, user reviews, etc. Due to input length constraints, it's impractical to encode all this information into ChatGPT at once. Plus, inventory, pricing, and availability update in real time.

As we've noted, ChatGPT's strength lies in its perceptiveness, commonsense knowledge, and reasoning. For such a "person," sometimes all they need is a manual, a dictionary, or a queryable knowledge base to accomplish much more.

So we might design it like this:

When a user asks a question, we don't have ChatGPT answer directly. Instead, we set the scene: "your job right now is to extract query tags from the user's question." Leveraging ChatGPT's language understanding, we have it pull out relevant tags.

With these tags, we use fixed business logic to search for the top 10 best-selling PCs under that tag. Using certain rules, we convert these 10 product pages into text (e.g., turning "sales: 100" into "sold 100 units," or a user review with 10 likes into "User xxx, Level 3, said xxxxx, 10 people found this helpful").

If the textified descriptions are too long, we can feed them into a second ChatGPT instance to extract 10 summaries. Then we feed these summaries along with the scene-setting "the user wants to know about xx, here are 10 PCs for reference, please help answer their question" into a third ChatGPT, which now has enough information to chat knowledgeably and recommend products.

In real application design, we can flexibly use encoding and decoding iterations. The core idea: treat ChatGPT as a very smart person. Tell it what it doesn't know. Leverage its analytical abilities and natural interaction with users. Ultimately, have it issue commands to downstream applications that execute the tasks.

Speculations on the AGI OS and application ecosystem landscape

First, all the examples, brainstorms, and design tricks above are based on ChatGPT's current capabilities. As AGI evolves, many things will change significantly.

For example, the input text length limit — currently 4,000 tokens, soon expanding to 32,000. In the future, this could reach hundreds of thousands or millions. Introducing domain knowledge into applications will become much simpler and more direct.

Or consider multimodal input — images, audio, and eventually physical world interaction. AGI that can see images, hear audio, interact with the physical world...

For AGI's future, here are some angles for speculation and discussion:

Open source v.s. API

We believe AGI will likely become the operating system of the new era, hosted in the cloud by a few massive institutions and accessed via API.

First, the underlying large models will trend ever larger. While research shows comparable effects with fewer parameters, improving model "density" isn't contradictory to the overall trend toward scale (simple logic: if half the parameters achieve the same effect, wouldn't the same parameter count achieve even better results?).

With multimodal integration and generative AI becoming increasingly embedded in human creative workflows, accelerating total information growth itself, it's hard to imagine future large models not continuing to grow.

Second, AGI should become increasingly general-purpose. It masters universal, general capabilities that help millions of application developers build unique applications serving billions of users across every aspect of life — just as personal computers, smartphones, and mobile internet did before.

Open source's value lies in letting developers take general base models, connect specialized models, and fine-tune for specific tasks.

It's hard to imagine millions of developers from diverse backgrounds and directions all being able to fine-tune and host such massive models. This would dramatically raise the barrier to application development on the AGI OS, preventing the kind of revolution smartphones brought.

As an operating system, the first function is user interaction; the second is resource scheduling and task execution. Developers tell AGI their specific business logic in natural language; AGI interacts with users in natural language and schedules downstream applications through natural language to help users complete tasks.

Natural language is all you need! A continuously updated super-brain in the cloud, where millions of developers simply teach it in "human language" to realize various applications — this may be the cleaner solution.

General v.s. Specialized

As AGI grows more powerful, does specialized AI still matter? Our guess: yes. But the boundary between general and specialized will keep shifting.

Specialized AI's value lies in greater professionalism and efficiency. Even if AGI someday gains driving capability, a specialized autonomous driving AI will likely still exist for higher robustness and lower latency. Specialized AI targets vertical domains with specific engineering optimizations and more focused data flywheels.

In the future, when a user tells an in-car AGI "I have elderly and children in the car, please drive slower," rather than AGI taking the wheel directly, it's more likely to understand the statement's meaning — grasping that "slower" here implies smoothness and safety — translate this into instructions, and drive a downstream specialized autonomous driving AI to adjust its strategy.

But which tasks are general, which are specialized? The boundary will keep moving, with general capabilities expanding outward to encompass more previously specialized domains.

Tasks like machine translation, text summarization, and style transfer — once specialized NLP tasks — are now easily handled by ChatGPT.

Which tasks will see general encompass and replace specialized? Which will remain specialized, with general calling specialized? Consider human learning processes for intuition.

If becoming more broadly knowledgeable helps with a task, it's likely a general task AGI will eventually cover. Take "translation" — truly brilliant translations often come from great scholars. When someone possesses broad, deep understanding of world languages, cultures, and ideas, their translations naturally excel.

If a task requires relentless practice in a specific domain to master, it's more likely one AGI shouldn't try to encompass directly. Take "simultaneous interpretation" — while seemingly translation, it's fundamentally different. It requires listening while translating, with sufficient speed and accuracy. This isn't solved by erudition alone; it requires "training." A great scholar's translation may be exquisite, but in simultaneous interpretation, they'll never beat someone who's trained for years (similar examples: autonomous driving, AI art, singing synthesis...).

Whether a task needs more "learning" or more "drilling" — this may be the core distinction between general and specialized tasks.

Plugin v.s. Entry point

In product form, will AGI become the new entry point, or an omnipresent plugin?

Entry point — the first thing people do when starting a device is open an AGI dialogue interface, then handle everything through AGI interaction and downstream application calls.

Plugin — users still go to different apps for different needs, but all apps integrate AGI capabilities.

My guesses:

Before major terminal device changes, AGI will become an omnipresent plugin embedded in all products
Bots will quickly become one of the mainstream product forms, with IM becoming the new App Store
Once XR matures, AGI will become the new entry point
When using computers and phones, typing often has higher operational cost than tapping or swiping, and speaking carries higher psychological barriers. If smartphones remain the dominant terminal, it's hard to imagine scenarios like hailing rides, ordering food, or scrolling short videos being fundamentally transformed.

However, when using AGI costs significantly less than previous methods, introducing AGI as a plugin will notably improve user experience. This is especially true for productivity tools — during productive learning and work, there are numerous high-cost operations where AGI empowerment will dramatically reduce task difficulty.

Additionally, bots may become a mainstream product form. Many products will develop their own bot endpoints (just as they have web endpoints and mini-program endpoints today), and many bot-native products will emerge.

As early as Facebook's 2016 F8 developer conference, the Messenger Platform for chatbots was announced, letting developers build chatbots integrated into Messenger for information services, customer support, and more (many Chinese practitioners viewed this as Facebook's attempt to build an upgraded WeChat Official Accounts platform). It caused huge excitement initially, but limited by NLP technology at the time, subsequent development fell short of expectations.

Setting aside technical limitations, integrating bots into IM for natural language interaction and service delivery is intuitively natural. In Slack and Discord, numerous functional and customized bots play important roles in their ecosystems.

With ChatGPT's API opening up, bots as a product form can do far more. Users subscribing to bots on IM and conversing with them to execute tasks may become a mainstream interaction pattern, making IM platforms the new App Store.

Finally, if new terminals like XR glasses replace phones, it's hard to imagine anything but AGI as the primary interaction choice with such devices.

Final thoughts

From Kurzweil's The Singularity Is Near, to Singularity University's founding, to 2015 when tech leaders anxious about AI domination founded OpenAI — the "singularity theory" then seemed more philosophical reasoning. The logic sounded correct: Moore's Law, exponential effects, machines growing ever more powerful at accelerating speed, eventually surpassing humans. But how? At the time, few concrete paths to AI general intelligence were visible.

ChatGPT's emergence marks the first time people broadly realized that the technological path to the AI singularity has become concrete.

Some still argue ChatGPT isn't a disruptive technical revolution, just simple fine-tuning on top of large language models. But this "simple" fine-tuning worked. It demonstrated that the large language model + instruction-based paradigm is effective for building general natural language interfaces and, by extension, general artificial intelligence.

If AGI application development follows our hypothesized paradigm — encoding business logic in natural language, decoding natural language to execute business logic — it will be remarkably simple and natural. Building applications on AGI APIs combined with downstream specialized AI, particularly in productivity scenarios, represents a highly promising entrepreneurial direction.

5Y Capital seeks out, supports, and inspires lonely entrepreneurs, providing support from spirit to all business operations. We believe that if the crazy you in others' eyes begins to be believed in, the world will become a different place.

BEIJING · SHANGHAI · SHENZHEN · HONG KONG