Hive Technology's Xia Yongfeng: Are AI Glasses the Most Ideal Form of Smart Hardware? | Yunqi Capital

Yunqi Capital · August 16, 2024

Getting a Head Start Before the "Battle of a Hundred Smart Glasses"

Sales of Meta's Ray-Ban smart glasses breaking one million units made AI glasses one of the hottest smart hardware categories of 2024. Recent smart glasses announcements from Apple, Meta, and others have also ignited a fervent AI glasses rally in the secondary market. But beyond the stock market fluctuations, entrepreneurs have long been working to define and create AI glasses.

Recently, Hive Technology, a Yunqi Capital Pre-A portfolio company and Xiaomi ecosystem enterprise, launched its new product: the Jiehuan AI Audio Glasses. Since its founding, Hive Technology has focused on R&D innovation in head-worn smart hardware, and this release represents its embrace of AI. The company's founder and CEO, Xia Yongfeng, said he expects audio glasses to replace traditional glasses, with sales volume growing two to three times every year.

Of course, the path of AI hardware innovation is hardly smooth. Xia Yongfeng sat down with Zhang Peng, founder and president of tech media outlet GeekPark, to talk about his entrepreneurial journey, AI glasses, and the challenges and future of smart hardware innovation. This edition of "Yunqi Capital" shares the full transcript of their dialogue.

This article is republished with permission from Founder Park.

Interviewer: Zhang Peng

Original title: "Zhang Peng in Conversation with Xia Yongfeng: AI Hardware That Gets More Than 5 Hours of Daily Use Stays in the Game"

01 In the Near Term, AR Hardware Will Struggle to Become a Mass Consumer Product

Zhang Peng: Let's start with how Hive Technology came to be, including some of your own background.

Xia Yongfeng: I'll start from my time at GeekPark. In the early days, I was one of GeekPark's founding employees. Later, to fulfill my dream of building hardware, I joined Xiaomi with Zhang Peng's blessing.

At Xiaomi, I started out working on the Xiaomi Router, then took part in setting up the Xiaomi ecosystem. At that time, the Xiaomi ecosystem was just getting off the ground with only three employees, including the boss. The ecosystem's founding date is actually marked by my start date of January 8, 2014. I was involved in the launches of products like the Mi Band and the robot vacuum, and helped found the Mijia brand. In the early days, I actually ran all the Mijia brand launch events.

By 2018, the Xiaomi ecosystem had invested in over 70 companies and was producing more than 300 products annually. At that point, I felt myself drifting further and further from my original vision of being a hardware product manager. My daily work had become management, and I was no longer close to the actual product work.

Zhang Peng knows I'm a relatively idealistic person. For instance, in my early years I only wanted to be a reporter, not an editor, and insisted on staying on the front lines. So in 2018, after speaking with CEO Lei Jun, I left the ecosystem and joined the Xiaomi phone division to work on phones. By 2020, I felt I had contributed about as much value to Xiaomi as I could, and it was time to pursue some of my own dreams.

So in 2020, I founded Hive Technology. We defined ourselves around a new category: head-worn wearable smart hardware, and I wanted to make products in this direction. Perhaps for the next ten, twenty, thirty years, all our products would revolve around this direction. Later we launched a glasses camera, the Mijia audio glasses, and other products, including the "Jiehuan" AI audio glasses just released today. That is our product lineup.

Zhang Peng: When you founded Hive in 2020, your first-generation product was a glasses camera, your second was a pair of Bluetooth audio glasses, and now you're working on "Jiehuan." There seems to be some evolution in thinking here. Can you look back and walk us through how you got here over these past few years?

Xia Yongfeng: Three years is short, but for me, it felt incredibly long.

We initially wanted to work on head-worn smart hardware because after finishing up with phones, I noticed a major trend: global phone sales were declining every year. Phone quality kept improving, performance became more than sufficient, so people's replacement cycles lengthened. At the time, we spoke with many key technology partners for phones—screen makers, chip makers, camera makers—and many of them shared a sense of long-term anxiety. When an industry grows to a certain scale, it actually struggles to find new footholds.

I thought there might be an opportunity for phone technology to spill over. Besides cars, phones should be the crown jewel of human hardware, driving continuous improvement across a range of technologies—low power consumption, high performance, structural stacking, and so on.

I started thinking: after phone technology spills over, what new hardware product could carry this trend?

Intuitively speaking, if technology keeps advancing, the current interaction between handheld devices and the digital and physical worlds still isn't good enough. In the long term, interaction should run through the five senses, with some hardware, like a ring, interacting directly with the real world while your hands do what they need to do. So I believed that in the future, there would inevitably be a head-worn device that could replace phones as the most important personal smart terminal. Head-worn smart hardware would present enormous opportunities, and I set this as my company's ultra-long-term goal.

At the time, the metaverse was going through another wave of ups and downs, so naturally I thought of AR as my first phase. That sounds arrogant now, since AR still doesn't have a really good product. But I was relatively optimistic then—through various channels and information, I knew several major tech companies were researching AR hardware, some had projects in the works, and various OEMs were making attempts. So I estimated that in 5-10 years, AR could become mainstream hardware.

The technology wasn't mature enough yet, so I started with a glasses camera, which I saw as the minimum viable model for head-worn devices. It had a minimal near-eye display and a sufficiently good camera. I built a relatively complete all-in-one system that could connect to the internet and support extensive developer work. The system was based on Android but with many modifications.

This glasses camera did bring value to some people, but it had a major problem—one that I think future AR glasses will struggle to solve in the short term: the cost users pay far exceeds what they get in return. Probably only people who are both wealthy and tech-enthusiastic, willing to pay for the belief, would become early adopters.

Around 2022, I realized this: AR will struggle to become a mass-market consumer electronics product in the relatively near term.

Zhang Peng: I still believe AR has definite long-term value today, but exactly how long that "long term" is remains a question worth exploring.

You mentioned a key point just now—that true AR devices still aren't mature enough. But with the rise of AI, many people think AR may be heating up again. Yet your glasses today have neither cameras nor screens; they access AI through audio. Why not add a screen? Why not stick with the AR path? What factors influenced your decision over the past two years?

Xia Yongfeng: I mentioned this at the launch event too. Some very cool smart glasses have taken giant leaps beyond traditional glasses. But I believe that even a small step could bring revolutionary change to the eyewear industry. If that small step isn't taken steadily, those bigger steps may not materialize in the near term.

For example, we mentioned Ray-Ban Meta, which is a pretty good pair of sunglasses. We're also striving to make excellent glasses. You can see that our product is very close to the traditional eyewear industry. It's this closeness that satisfies people's fundamental needs and gives us a relatively large market foundation.

If at this point we rush out an overly innovative product, we might actually undermine the basic needs that traditional glasses fulfill. For instance, it might not be comfortable enough to wear, or convenient enough to fit with prescription lenses, and then we'd have to create entirely new needs and tell people: even though your basic eyewear needs aren't well met, we can satisfy other needs. In fact, at least at this stage, I haven't found any truly viable needs of that sort.

Zhang Peng: I quite agree with your point: wearing these glasses, I'm not paying too high a price. But your choice not to add a display is quite interesting. Your first-generation camera glasses did have a screen. What's the logic behind dropping it this time?

Xia Yongfeng: The core logic for not adding a screen is: what practical use would adding a screen bring? First, monochrome Micro LED technology is already quite mature, but the key question is: what can you do with it? For example, if adding a display in front raises the price by 1,500 to 2,000 yuan, what would you use it for that would make you willingly pay that extra amount?

Zhang Peng: What the market has validated so far is that people use it as a display, for watching videos and so on. Basically it's a big screen.

Xia Yongfeng: A large-screen monochrome display isn't enough, and what you're describing costs more; 1,500 yuan won't cover it. Large-screen Micro LED technology isn't fully mature yet. Most well-known companies with commercial large-screen products today use the Birdbath solution, which has relatively controllable costs.

What ultimate AR devices need is actually very good Micro LED technology combined with diffractive waveguide lenses, but that technology isn't mature yet. Last year there were 640x480 resolution versions; supposedly scrolling Douyin on them is pretty nice, like a small TV. By next year, I estimate it could reach 1080p, but conservatively, costs would at least double. Even then, I don't think it can properly replace phones or laptops.

Zhang Peng: The core issue is that today I'd pay a very high cost to add a screen to glasses, yet it still can't fully replace phones or laptops, so trying to capture their screen usage scenarios at this point isn't wise. So adding a screen isn't a viable decision, right?

Xia Yongfeng: Right, it can only serve as an extended screen for laptops or phones. It's basically an accessory.

Zhang Peng: If you force someone to stop using laptops and phones from now on, the cost to the user would be quite high. So today you've actually moved away from the AR path you were on before toward audio glasses instead.

02 Ray-Ban Meta's Core Achievement: Making a Great Pair of Sunglasses

Zhang Peng: I'm quite curious: when you saw the emergence of large language models, what was your reaction? How did you find the point of connection with this technology wave?

Xia Yongfeng: First, I believe the closer a product is to traditional glasses, the more it can replace them. This brings obvious benefits because users' fundamental needs are met, and product wear time becomes very long. If you make a pair of glasses that can replace traditional sunglasses, you've basically anchored to however long people wear sunglasses each day. If your product is very close to traditional glasses, like prescription glasses, then usage time will approach that of traditional glasses.

Our backend data shows that users wear our products for very long periods. The top 25% of users wear them more than 10 hours daily, and the average usage per person per day exceeds 7 hours. That's a very long time.

After AI arrived, if users are near their phones or laptops, they'll definitely use AI on those devices—for making presentations, writing drafts, adding subtitles, foreign language translation, and so on. Since users will directly use AI on existing hardware, when we develop new AI hardware, we need to capture the time outside of phone and laptop usage. The hardware we develop needs to stay with the user and interact with them.

Zhang Peng: So what you're saying is, today we shouldn't try to compete for phone and laptop usage time—that's impossible. Instead, we should find things outside of those devices that deliver value to users, and that value needs to be compelling enough for users to want to wear it for extended periods. VR glasses don't work, for example—you typically only wear them to watch movies, then you take them off. So was this product derived through reasoning?

Xia Yongfeng: Not exactly. We started by developing audio glasses, and then AI technology emerged. We realized AI was a natural fit for this product, especially during moments when users aren't on their phones or laptops—like driving, cycling, or running. Using AI through audio glasses is a more natural approach. If they need AI in these moments, there may be very few hardware options available. Beyond glasses, I think there are two other devices well-suited for AI integration: cars, and watches or high-endurance fitness bands. In special scenarios where people can't use phones or laptops, these devices can step in.

Zhang Peng: The value delivery for bands and watches is probably limited to vibration—making sounds doesn't seem very practical. The positioning of glasses feels better because they can whisper, which makes sense.

One question everyone probably cares about: many people think Ray-Ban Meta glasses are pretty good, with those two prominent cameras on the front. We discussed earlier why you're not adding a screen—so do you think your glasses will add cameras in the future? No camera this generation, but will there be one later? How are you thinking about this?

Xia Yongfeng: I know there are actually many people in the market doing similar things. No offense to anyone—I'll just share my personal view. I think Ray-Ban Meta's core achievement was making a good pair of sunglasses.

Sunglasses are worn at a relatively large social distance from others. In wide-open, sparsely populated places, social distance is greater, so putting a camera on sunglasses, whatever it's used for, is at least reasonable. From a distance, someone wearing a camera doesn't feel intrusive to others. But in densely populated areas, wearing something with a camera all day, every day is different. I tried it, and it felt extremely awkward. The social pressure it puts on others comes back to you.

Zhang Peng: You can see in other people's eyes that they find it somewhat unsettling.

Xia Yongfeng: Right, so I believe camera-equipped AI glasses and the AI audio glasses we're making now are fundamentally two completely different products.

Camera AI glasses are better suited for delivering higher value in short-duration scenarios, but exactly how much value they can provide remains to be seen. They can replace sunglasses, but most Chinese people don't actually wear sunglasses—foreigners do. This isn't arbitrary; it's percentage-based. China has a relatively large myopic population. If they want to wear sunglasses, they either need contacts, prescription sunglasses, or clip-on lenses. Overall, the total percentage of sunglasses wearers is far lower than in Europe and America. In many parts of the US, sunglasses are essential—something everyone must buy. In that context, AI glasses replacing sunglasses has a better mass-market foundation.

Sunglasses are naturally suited for adding cameras, and AI needs cameras—that story closes the loop.

Zhang Peng: If we follow this logic, the path you've chosen today isn't starting from sunglasses, but from the glasses we wear daily, right? It's essentially our everyday frames, just switched to prescription glasses. That does effectively solve the problem.

Since we're on the topic of AI: everyone's paying close attention to AI now. Since last year, AI has evolved from pure language models toward multimodal models. That means that if you have a camera, you gain multimodal capabilities. You can understand many things visually, and thereby help solve many problems, like seeing something and translating it into English, or identifying what something is. The camera essentially becomes a sensor. But adding an obvious, video-recording camera to the product already makes it another category.

Your current product is smart audio glasses. If you want to apply AI intelligence in the future, achieve multimodality, is adding sensors necessary? Under what circumstances would you consider adding them?

Xia Yongfeng: I have indeed considered this, even discussing demo possibilities with some large companies. I think there may be two approaches.

The first: the lens in front of the camera is typically glass. You could make it electrochromic, controlled by AI to determine when it needs to activate. Normally it's like "eyes closed," and when needed it "opens its eyes"—this might alleviate some privacy concerns. The other scenario: you include a camera, AI can interpret images, but it can't take photos, and image quality isn't emphasized. And everyone needs to know this isn't for photography, but a sensor. Because AI doesn't need high resolution—600x400 is basically sufficient. You don't need high-resolution cameras and advanced chips like Ray-Ban Meta, or all those photography algorithms loaded in.

Zhang Peng: So from an intelligence perspective, it is indeed necessary to add effective sensors, but this generation doesn't have them yet. I'm sure you've built up technical reserves in this area. What you mentioned about adding "eyelids," or clearly telling people this is a sensor that only recognizes and cannot take photos, makes sense for solving privacy issues. I think that's very reasonable.

Xia Yongfeng: We need to consider whether it can be used normally and without barriers in high-population-density scenarios. This may be as important as the functions it implements.

Zhang Peng: So if this kind of sensor is added, would these glasses enable capabilities beyond our imagination?

Xia Yongfeng: I can share a bit. Rather than what specific functions it can achieve, it's more about enabling AI to better understand human intention. We discussed earlier that in the future, AI might take over everything, and you just need to be yourself. Because AI machines are very smart—when you're just being yourself, AI can basically know what you want to do, what your intention is. It may replace some existing GUI functions. I think it can help machines better recognize human intention.

Zhang Peng: So it's actually an intention-recognition sensor. We shouldn't understand it as a traditional camera's photography function. That may be the core point.

03 First Make a Great Pair of Audio Glasses, Then Add AI

Zhang Peng: After adding AI, what are the characteristics of this generation's product, and what different experiences will it bring compared to previous audio glasses?

Xia Yongfeng: We actually set goals. First, to make the best-looking glasses in the world—looks are everything. Second, the most comfortable glasses to wear in the world. Third, the smartest glasses in the world.

First goal: best-looking. Our glasses have many frame styles like traditional glasses—8 frames, 14 colors. They're quite useful for enhancing your appearance. When you go out wearing them, they make a good impression.

Second goal: best to use. We've developed three generations of audio glasses. This generation has noticeable improvements in actual usage experience, ergonomic design, and frames. The lightest is only 30.7 grams, with overall comfort significantly improved. To make quality glasses, we also provide custom lens services with pretty good value. If you're unsure what lenses to buy, you can purchase glasses with lenses directly from us—the value is decent.

Our photochromic lenses are particularly good. Indoors they block blue light; outdoors they automatically become sunglasses that block UV rays. Of course, they're relatively more expensive. Lenses are now formally part of our glasses business, a proper undertaking. We'll provide unified service to everyone.

Third, smartest—this means AI.

Zhang Peng: Tell me about the AI. Your AI audio glasses are called smart glasses. Where does this intelligence manifest?

Xia Yongfeng: Our AI, you can simply understand it as an upgraded version of voice assistants. From ChatGPT to now, our domestic AI models haven't yet produced very core, significantly valuable applications for the mass market. But they have infinite possibilities, basically capable of text-to-text, voice-to-voice conversion. This is what AI from 1.0 to now can bring people.

For us, what is an upgraded voice assistant?

After you ask it a question, it can recognize your intention and different needs, and assign these needs to different AI agents to execute. After execution, it summarizes and answers you. We call these different AI agents "AI little people"—they work continuously behind your glasses for you. For example, you say: "Tomorrow I want to hear French media's evaluation of the Olympics. Please give me a summary before 8 PM tomorrow." The listening AI little person tells the working little person: "You need to produce this content tomorrow." At 8 PM the next day, the working little person hands the summary to the little person responsible for talking with you, and that little person reads it to you.

The little person that talks with you, we call VUI—Voice User Interface. When voice-based interaction can generate more and more value because of AI, it becomes an interface. This is part of our core goal with AI: we hope to give users a unified VUI experience across platforms.
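The handoff between the listening, working, and talking "little people" described above can be sketched roughly as follows. This is a minimal illustration, not Hive Technology's implementation: the `Task` and `AgentTeam` names, the `tick` method, and the placeholder summary string are all assumptions introduced here, and a real system would call a large model where the sketch fabricates a result.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Task:
    request: str                  # what the user asked for
    due: datetime                 # when the answer should be read out
    result: Optional[str] = None  # filled in by the working agent


@dataclass
class AgentTeam:
    """Hypothetical stand-in for the three cooperating agents."""
    pending: List[Task] = field(default_factory=list)
    spoken: List[str] = field(default_factory=list)

    def listen(self, request: str, due: datetime) -> None:
        # Listening agent: record the request and its deadline for the worker.
        self.pending.append(Task(request, due))

    def work(self, task: Task) -> None:
        # Working agent: placeholder for a large-model call (assumption).
        task.result = f"Summary for: {task.request}"

    def tick(self, now: datetime) -> None:
        # Talking agent (the VUI): read out every task that has come due.
        for task in [t for t in self.pending if t.due <= now]:
            if task.result is None:
                self.work(task)
            self.spoken.append(task.result)
            self.pending.remove(task)
```

In this sketch, a request registered with `listen` stays silent until `tick` is called with a time at or after its deadline, which mirrors the "summary before 8 PM tomorrow" example.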

Zhang Peng: Let me try to understand. If we're talking about the fundamental change these glasses bring, it's that through them, you can invoke omnipotent AI to solve problems around your goals, completing appropriate tasks for you through voice via these glasses.

The underlying technology is large models—that is, you can mobilize AI with voice. This is the core interaction node. We don't have to hold our phones like before, messaging with them. I think this is the key change.

Xia Yongfeng: Looking at the underlying architecture, we do first processing through the glasses and the phone app that's continuously connected to them. After processing, we match the user's daily habits with needs on the server.

On our server, there are preset prompts and an AI Hub, which connects to many services, and of course many AI agents. We package everything together, instructing it to find the appropriate large model to handle the matter. After the large model completes the work, results return to the server, then back to the glasses through the phone. That's basically the working logic.

Zhang Peng: Some say this resembles the first-generation Xiao Ai. What's the advantage of using large models now?

Xia Yongfeng: The key is, when the first-generation Xiao Ai launched, there were no large models at all.

Zhang Peng: Xiao Ai didn't actually have this omnipotent capability we're talking about today, right?

Xia Yongfeng: For example, we have a feature called AI notification broadcast. Say I receive an image from a colleague on Lark. With traditional notification broadcast, it has no ability to tell you the specific content—you have to pull out your phone to check Lark.

Now with AI notification broadcast, before I even pull out my phone, I hear: "So-and-so just sent you an image that may need your confirmation." That's one example of AI notification broadcast. Of course, our colleagues are also considering whether to add image recognition, but we're not planning to do that yet.

Here's another real example I encountered. Sometimes you get suddenly pulled into a WeChat group, and before you mute it, the group keeps pinging with notifications. If you're wearing these glasses, you don't have to check and respond immediately, or even pull out your phone — the AI automatically gives you a brief summary first. If you see it's actually relevant to you, then you reply. The AI provides a filtering layer, making sure you don't miss important messages while also not getting overwhelmed by information. These days everyone has to constantly check their phones, unlocking them the moment they see a WeChat notification. With AI, that kind of action is greatly reduced.

Some people might not see this as a hard necessity, but based on our previous data, the percentage of users who turn on notification broadcast is very high. We had nearly 100,000 users, and over 36% of them enabled notification broadcast. Even though notification broadcast bombards them with plenty of junk, they still turned it on. This really is a fairly high-frequency need.

Zhang Peng: I used the beta version before launch, and right away I noticed there was notification broadcast, so I turned it on for WeChat and Lark. I found it wasn't simply reading the notification aloud to me — it actually does some summarization, and I later realized that's quite good.

This feature actually made me receive information more promptly. Before, I'd check my phone every hour to see what was up; notifications were useless to me because I wouldn't even have my phone nearby, I wasn't looking at it. But now there might be something urgent, and I can respond quickly.

People often send me conversation screenshots — they send images to give me background context, to show me how they were chatting. In the future, if it could help me recognize and directly summarize those images, I wouldn't have to read through the conversations in the screenshots. I'd really look forward to that.

Xia Yongfeng: But image recognition brings new problems — it falls under multimodal recognition. We're still discussing it, haven't decided to add it yet. But we've been optimizing AI notification broadcast for two months now, and its usability is actually quite high.

Zhang Peng: If I had to recommend one feature, it would definitely be notification broadcast. I've used it for about a week, and the feeling is — in this kind of intelligent summary broadcast, I can clearly sense there's AI at work inside. I feel like in the future I could even have it do briefings for me, summarizing everything I need to look at first.

AI notification broadcast is a feature I use quite a lot now, and there might be new features in the future. Are there any upcoming features you could reveal to us?

Xia Yongfeng: At year-end we'll launch "Jiehuan Aiting" (界环爱听), the AI Cast feature. Because many of our users are heavy headphone users, or heavy audio content consumers. For example, among our users, the proportion who listen to Xiaoyuzhou and Ximalaya is very high, far above the internet average — they're heavy podcast users, often listening to podcasts while doing other things.

We've built a short-form audio app. But the biggest difference from short-video apps is that it doesn't need so many creators, so many real people — every creator is an AI avatar. For instance, there's an AI avatar that specifically tells you about Eastern Zhou history, one that covers the Three Kingdoms, one that tells jokes, one that summarizes news for you — there will be many like this. We'll pay attention to what content is popular in podcasts.

Zhang Peng: So you're using agents to replace so-called creators, right?

Xia Yongfeng: Right. The difference from short video is that our short audio will be slightly longer. After content is generated, there's an AI avatar that acts as a content quality reviewer — poor quality gets sent back for revision, good quality gets released.

The initial quantity might be smaller. Currently in the version I'm testing, there are only about 20 short audio pieces per day, but there will be many more in the future. Users interact with it like listening to podcasts — not interested, swipe forward, it jumps to the next; still not interested, swipe again. When you swipe fast enough, you'll find the content you dislike gradually stops appearing in your "Jiehuan Aiting," and it gradually learns what to play for you. This is a feature we'll launch this winter.
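The two mechanisms described above, a reviewer that sends weak drafts back for revision and a feed that learns from swipes, can be sketched as follows. This is only an illustration under stated assumptions: `review_loop`, `Feed`, the round limit, and the plus or minus one scoring are hypothetical and are not taken from the actual "Jiehuan Aiting" product.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def review_loop(draft: str,
                passes: Callable[[str], bool],
                revise: Callable[[str], str],
                max_rounds: int = 3) -> Optional[str]:
    """Return the first draft that passes review, or None if none does."""
    for _ in range(max_rounds):
        if passes(draft):
            return draft          # good quality: release it
        draft = revise(draft)     # poor quality: send back for revision
    return None


class Feed:
    """Demotes topics the listener keeps swiping past (assumed scoring)."""

    def __init__(self) -> None:
        self.score: Dict[str, float] = defaultdict(float)

    def swipe(self, topic: str) -> None:
        self.score[topic] -= 1.0  # skipped: play less of this topic

    def finish(self, topic: str) -> None:
        self.score[topic] += 1.0  # listened to the end: play more

    def rank(self, topics: List[str]) -> List[str]:
        # Highest score first; sorted() is stable, so ties keep input order.
        return sorted(topics, key=lambda t: -self.score[t])
```

The reviewer and the reviser are passed in as plain functions so that either could be backed by a model call; here they are left abstract on purpose.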

Zhang Peng: Someone asked, is it that there aren't enough creators, or is AI better than humans? Let me share my understanding. These agents are essentially about infinitely matching what users want — it's not that creators aren't enough, you could say there aren't enough creators who perfectly match you, but it's not necessarily that AI is better than humans. Because no matter how many creators there are, you're still searching for ones that match your needs, and not every piece from a real creator is great to you — they won't create content just for you. But this AI creates content only for you.

So I think, theoretically if these platforms have APIs, you could actually directly listen to that creator's content — creator content could be integrated too. But if it's not matching enough, AI can generate content around your needs, and that would truly be mass personalization.

Xia Yongfeng: If you don't like this thing, you can still listen to Xiaoyuzhou — we'll just be an audio glasses device. If you like this feature, you'll use it. We'll gradually make it better and better, because after all I only need AI avatars, I don't need UGC.

04 Slapping on an API Doesn't Make AI Hardware

Zhang Peng: Actually stuffing a large model into hardware still requires some foundational work, like how to architect it, how to use it. Behind the AI voice broadcast feature, how is the architecture built? Is there a model on the device side? Or is everything in the cloud? Why not just call the Xiao Ai large model directly?

Xia Yongfeng: Our glasses don't have an on-device large model right now, and there's actually no need. To some extent we referenced some of Meta's approaches, using the glasses plus the phone app together to do the first layer of processing. If in the future phones open up some large model voice capabilities, including sharing, notification permissions, and even the NPU so we can run a small model on it, we'll definitely deploy on-device to speed up local processing and enable more features.

But right now that's not available. Currently we connect directly from the app to the server. First, on the app side we do all voice-based analysis, including TTS, ASR, voice timbre. On the server side, first, the agent that talks with you — its personality and emotions are set by the user themselves. Second is RAG. Third is AI Hub — we're connected to over a dozen large models. We also have an AI long-term memory function, meaning longer historical context, to more accurately judge your intent.

After implementing these, we also built a content quality review AI, so content only gets sent to users after passing quality checks. Meanwhile, on the server side we also built large model scheduling prompts that decide which large model works best for each application. The task is handed to that large model, the result comes back to the server for similar processing, and finally it is pushed to the phone and then to the glasses.
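As a rough sketch of the server-side flow just described, the following routes each request to a model chosen by application type and only releases answers that pass a quality check. Note the substitution: the interview says scheduling is done with prompts, whereas this sketch uses a static routing table for simplicity. The `AIHub` class, its `routes` table, and the reviewer function are hypothetical names, not the company's API.

```python
from typing import Callable, Dict, Optional

# A "model" here is any function from prompt text to answer text (assumption).
Model = Callable[[str], str]


class AIHub:
    """Hypothetical hub: pick a model per application, then review its output."""

    def __init__(self,
                 routes: Dict[str, Model],
                 default: Model,
                 reviewer: Callable[[str], bool]) -> None:
        self.routes = routes      # application name -> preferred model
        self.default = default    # fallback when no route is configured
        self.reviewer = reviewer  # quality gate applied before delivery

    def handle(self, app: str, prompt: str) -> Optional[str]:
        model = self.routes.get(app, self.default)
        answer = model(prompt)
        # Only answers that pass review are pushed on to the phone and glasses.
        return answer if self.reviewer(answer) else None
```

Returning `None` for a rejected answer leaves the retry or fallback policy to the caller, which the source does not specify.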

Zhang Peng: So-called AI intelligent hardware — if you're just slapping on an API, you can't deliver good enough results. It really needs to stand on user value, build a relatively complete, reasonable architecture, and then call things reasonably.

You've already gotten into calling different models, even into how longer-term memory is stored, called, coordinated. In different scenarios, delivering different value — you have to consider how to use AI more reasonably. I think this will become very important in future AI intelligent hardware; it's really not something solved by just slapping on an API.

We understand GUI, and you mentioned VUI and NUI earlier — these are essentially future interactions. Future delivery isn't about replacing phones and computers, but it might add a more user-proximate natural experience for interaction and delivery. Both interaction and delivery become different because of it. How do you define and understand GUI, VUI, NUI?

Xia Yongfeng: GUI is the graphical user interface we all know. It was a very important revolution for computers, because getting machines to understand what humans want to do is actually very difficult. Early assembly languages were highly standardized, and that established a paradigm. Later we reached the smartphone era, but that is still a paradigm: with a screen, for example, you have to tell the machine where you tapped. That's why smartphones had a high barrier for some elderly people early on. You still had to learn. The learning cost was lower than before, but learning text input or a QWERTY keyboard was still a threshold.

For future interaction overall, many people believe we'll enter the era of NUI, the natural user interface. You just need to be yourself: say what you normally say and do what you normally do, while machines get smarter and smarter.

You be yourself, the machine knows what you want to do, and then it gives you the corresponding service. This is what we think future interaction might look like, especially now that general-purpose large models have emerged. Maybe we can't get to artificial general intelligence quickly, but making machines recognize your natural behavior has become much easier.

I think this is also a goal our future glasses need to achieve, not least because typing on a keyboard or using touch controls on glasses is basically impossible.

Zhang Peng: For example, say I have an assistant next to me. I point here with my finger and tell the assistant "turn this off" — that's normal for us. But today if I want to interact with Xiao AI, I probably have to say "Xiao AI, turn down the air conditioner in my bedroom by a few degrees."

But theoretically, if you have such a device in the future, and assuming it has a sensor and can open its eyes to see the world, I could say "turn this off, it's a bit cold" and it should be able to identify which space I'm in and know I'm talking about the air conditioner. I wouldn't need to spell out the exact request. That's me making the scenario you described concrete. Following this reasoning, I keep wondering: are you really not considering adding camera input for interaction?

Xia Yongfeng: If you still have to give the machine commands yourself after the camera has seen things, it's still VUI. But a camera can reduce your input cost.

Zhang Peng: Like me saying "turn this off" instead of "turn down the bedroom air conditioner by a few degrees" — this actually lowers my input cost. This camera is essentially an intent sensor. I guess you'll definitely add it in the future.

Xia Yongfeng: I'll get VUI done first. For me VUI is actually a precursor stage to NUI. The value of getting VUI right is already enormous.

05 AI Doesn't Create Demand Out of Thin Air — It Only Infinitely Improves Experience

Zhang Peng: A while back ByteDance bought a headphone company. Headphones seem like they could also go the VUI route. Why didn't you make headphones, but glasses instead?

Xia Yongfeng: I think headphones and sunglasses are basically the same category of product. Users don't wear them unconsciously, or from morning to night — it's something you put on when there's a need, and take off immediately when the need ends. VUI requires lots of interaction, but when you need this interaction, it might not even be on your ears.

Zhang Peng: Essentially it's still not enough user wear time.

Xia Yongfeng: You can't wear headphones all day — otherwise walking is dangerous, and you can't use them while driving either.

Someone asked us, why don't we make a charging case? Because we don't have a "take it off when you're done using it" scenario. With headphones, when you're done, you take them off and put them in the charging case — it keeps charging, that's natural. But glasses need to be worn all day, so our battery life has to last a full day. It's not like a nearsighted user would take them off at 2 p.m. — that's not realistic.

For AI, there are scenarios where you actively issue commands, and scenarios where you passively receive AI notifications. Take notifications, or proactive summaries — when your headphones are in the charging case, they're useless. But if you're wearing glasses, you're basically using them seamlessly and unconsciously every day. That's the biggest difference.

Zhang Peng: That perspective makes sense. I think we need to start from getting users to wear something and keep wearing it long-term, so that AI can actually deliver when it needs to. You can't turn it into a "destination" product that you only put on when you need AI.

Xia Yongfeng: We have a slogan: "Technology reinvents traditional glasses." Among our previous users, 89% were prescription users, either nearsighted or presbyopic. After switching to our glasses, long-term glasses wearers used them in place of their traditional glasses almost 100% of the time. Their traditional glasses became their backup pair. So the substitution effect is quite obvious.

But we haven't really replaced headphones, because headphone needs are so varied. When you want quiet, there's noise cancellation, which we can't do. For gaming, there's low latency, which we might achieve in the future, but right now we're definitely not as good as specialized gaming headphones. Or there are customer service headsets that need a microphone very close to the mouth. The point is, headphones are enormously diverse. It's impossible for an audio product like ours to cover all headphone needs. But we basically cover users' glasses needs, and that's what the data shows.

Zhang Peng: Makes sense. And if you really want technology to reinvent traditional glasses, I can understand why you need to consider many different styles. You can't have everyone walking around wearing the same style.

Xia Yongfeng: The history of traditional glasses has already proven that a single style doesn't work.

Zhang Peng: So it actually needs more personalized choices, while simultaneously getting people to wear them first, and wear them for long periods.

I think this point is quite important. Our judgment of whether an internet product can build more and more capabilities in the future comes down to one core thing: the user needs to have usage time.

So there might be a threshold here — I may be speaking a bit definitively — but future AI hardware probably needs at least three to five hours of wear time as a baseline. Otherwise, this AI hardware might just end up selling an AI gimmick without real long-term growth potential.

Xia Yongfeng: AI hardware is hardware for using AI. Overall, people use two giants from morning to night — computers and phones. It first needs to find its own space to survive outside these two giants. In that space, it needs to become everyone's first priority for using AI. That's probably its most basic survival condition.

Zhang Peng: One insight I got today about how to evaluate AI hardware: there's an important dimension of whether you're using AI as a selling point to sell hardware, or whether the hardware can continuously grow alongside AI's capabilities and unlock greater value. The core dividing line is: how much usage time does the user actually have? To achieve this, you first need to solve some definite problem for the user, solve it well — only then do you earn that qualification.

Xia Yongfeng: I've made hundreds of different hardware products, and increasingly I feel that human needs can continuously be satisfied better, but creating a need from scratch is extremely difficult. I don't think AI will conjure up a need that didn't previously exist out of thin air. It will only allow certain human needs to be satisfied better, or improve certain efficiencies, or make listening to something more enjoyable, or viewing something more enjoyable.

Needs that previously couldn't be met in certain scenarios can now be met because of AI — but basically this can only be further satisfaction of needs, not something that didn't exist before. Like needing an external brain, like God, analyzing everything you do all day every day — I think that's too broad, too idealized. It has to get down to the concrete needs of actual people.

Maybe I'm being a bit definitive here too, but I think needs can only be satisfied better, not created from nothing. I've noticed some AI hardware that thinks AI can create previously nonexistent needs out of thin air — I'm quite cautious about that view.

Zhang Peng: With Ray-Ban Meta glasses, the reason everyone's so focused on them is that this is supposedly the first "tech glasses" product to sell over a million units. Apart from previous VR products that were heavily pushed by big platforms, this is a device users actively chose to buy themselves. How would you assess its possible future trajectory?

Xia Yongfeng: There are some overseas user research reports on Ray-Ban Meta. Over 40% of users' top demand is actually first-person video shooting quality being acceptable — not AI.

Zhang Peng: Definitely not, because when it came out there wasn't really much AI-related stuff.

Xia Yongfeng: Then we also noticed that many users actually need sunglasses first, and Ray-Ban is a good enough brand. A friend told me he walked into a Ray-Ban store needing to buy sunglasses. He found regular sunglasses and Ray-Ban Meta, with very little price difference between them — something like a few dozen dollars. One was traditional, one came with lots of features. So he chose Ray-Ban Meta.

Zhang Peng: So the tech is basically a freebie, right? A freebie that makes you feel like you're getting a good deal — something like that?

Xia Yongfeng: I do feel that people who bought Ray-Ban Meta because of AI are probably a relative minority in its current user base. Without AI, if they had made a very good smart sunglasses product, I think they could have sold a number not far from today's.

Zhang Peng: So it didn't actually become popular because of AI. Essentially it's a good brand, plus tech appeal and some interesting features, and people see the price difference isn't much — "I would've spent this much on regular Ray-Bans anyway" — so they bought it.

Xia Yongfeng: But having said that, if AI keeps upgrading, it could rise from being the second selling point until one day, based on its user base getting AI for free (the glasses' AI usage is free, no subscription fee), it becomes a very high-value point. At that point it might complete the transformation from a good enough smart sunglasses product to a true AI glasses product. That's possible.

06 The More Infinite Possibilities, The More You Must Constrain Your Imagination

Zhang Peng: We used to say if you want to do hardware, you have to look at China. In recent years, which overseas team has actually made hardware that truly took off? It's basically all Chinese teams. But how do you see this wave? Combined with large models, is it possible that overseas there will be some hardware teams that ride this new technology wave to create a super-category product? Is that possible? Or does this future still belong to Chinese teams?

Xia Yongfeng: From an empiricist perspective, I'm not optimistic about American startup teams creating a globally popular AI-based hardware product.

Back when I was at Xiaomi's ecosystem, I actually met a huge number of American hardware teams, both East Coast and West Coast. Like Lily Drone — that was an extremely popular supposedly innovative drone back in the day. Most of the founding members of these teams had no hardware background, not even basic concepts about hardware. Most American hardware entrepreneurs were previously algorithm engineers or software product managers. Of course it's not absolute — some came out of Tesla or Apple's hardware teams, some were designers.

Zhang Peng: What mistakes do they tend to make?

Xia Yongfeng: I don't think it's about specific mistakes. They think hardware is simpler than it is. The wall between hardware and software has never been fully broken down, unless you're already a massively successful company. They think that once they've conceived a product, they just need to find a factory in China that can make the whole thing for them.

But hardware involves places where you need to modify the product definition or make compromises, and they generally don't pay much attention to those. Recently a very popular foreign team found one of our Fortune Global 500 partners to ODM a hardware product for them. The whole process seemed to take about 10 emails in total. They don't talk through hardware details. I think that's a fairly significant gap. Their vision for hardware also tends to be relatively idealistic. Of course, because their large models and AI, including their AI applications and agents, are more advanced, their vision for how AI models will be applied on hardware is more advanced than ours.

There's a term that may not be quite appropriate, but I feel they're more like AI fundamentalists. They believe AI can change everything, create everything — that's the feeling I got communicating with some of them. So I think Chinese teams still have opportunity, but teams that are only good at Chinese hardware or supply chain probably won't cut it. Especially on AI — probably not. Our weaknesses are also very obvious.

Zhang Peng: To follow up, what was hot at the beginning of this year was AI Pin, and later it encountered a lot of criticism. Where exactly did it go wrong?

Xia Yongfeng: It couldn't deliver the experience it was aiming for. If they'd had a better grasp of hardware themselves, they would have known that, whether you look at the projection or at how it attaches, it fundamentally couldn't serve even as a backup for a phone. Projection requires lumens, so you'd have to think about outdoor versus indoor use. With some very simple data and parameter deduction, you'd know this product would have problems.
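The kind of back-of-the-envelope parameter check alluded to here can be illustrated with a short calculation. Illuminance on a surface is luminous flux divided by area (1 lux is 1 lumen per square metre). The projector output and palm-sized image area below are hypothetical figures chosen for illustration, not AI Pin specifications, and the ambient values are only rough orders of magnitude for office lighting and direct sunlight.

```python
def projected_lux(lumens: float, area_m2: float) -> float:
    """Illuminance (lux) of a projection: luminous flux spread over an area."""
    return lumens / area_m2


def contrast_ratio(projected: float, ambient: float) -> float:
    """Lit-pixel vs. dark-pixel brightness on a surface that also gets ambient light."""
    return (projected + ambient) / ambient


# Hypothetical inputs, for illustration only.
LUMENS = 10.0          # assumed projector output
AREA_M2 = 0.1 * 0.1    # assumed palm-sized image, 10 cm x 10 cm

# Rough orders of magnitude for ambient light.
AMBIENT_LUX = {"office": 500.0, "direct sunlight": 100_000.0}

image_lux = projected_lux(LUMENS, AREA_M2)
for place, ambient in AMBIENT_LUX.items():
    print(f"{place}: contrast {contrast_ratio(image_lux, ambient):.2f}:1")
```

Under these assumed numbers the image has roughly 3:1 contrast in an office but only about 1.01:1 in direct sunlight, which is the indoor-versus-outdoor gap the answer points to.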

Zhang Peng: As far as I know, a large number of Chinese teams have already gathered in this so-called smart glasses track. How do you see it? Will there soon be a "hundred glasses war"? Here, to ultimately survive and create greater value, what do you think will be the core test?

Xia Yongfeng: The startup teams and company projects I know of are mostly similar to Ray-Ban Meta. There are also some that take existing smart glasses and add cameras directly on top. It can help you identify things, but most people basically just ask "what's this," get an answer, and try it out for the novelty.

Maybe I don't know enough, but I haven't seen deeper applications yet. The future "hundred glasses war" — I think it will emerge quickly, quickly educate the market, then see very rapid elimination and iteration.

Zhang Peng: Fast rising tide, fast receding tide.

Xia Yongfeng: If something truly innovative appears, its rise will also be very fast. It won't leave time or space for slow movers. Either something suddenly grows out, or it dies very fast — basically that's the situation. So I think the tests are: one, resource capability; two, understanding of AI and hardware; three, organizational efficiency. Basically these are the tests. Based on what we currently know, a situation like this should arrive in the near future.

Zhang Peng: We're calling it the "hundred glasses war" — it's already underway, but it could also be a very fast campaign. I remember the drone boom years ago, when the hype eventually died down and nobody wanted to compete with DJI anymore. You can kind of feel that same dynamic today: lots of people are watching, lots of people are trying, but getting this right and doing it properly — that's what really matters.

Let me press you on one final question. A company like Superhexa, under a brand like Jiehuan: what kind of future are you actually trying to build? Maybe it's not about transforming personal computing overnight, but what's the interim goal? How long do you think it'll take to get there, and what might it look like when you do?

Xia Yongfeng: Jiehuan literally means "the ring of the world." What does that mean? It's about the boundary between self and world — others are the world, the line between you and everything else. The idea we want to promote is: "engage with the world without losing yourself; please yourself without closing yourself off."

Early on, I explained why we weren't doing VR — I saw VR as something that completely shuts you in. I wanted to build something open, something that helps you in your daily movements, in the process of accomplishing your everyday goals — just something additional alongside you. That was the product I wanted to make. So the glasses camera and the current audio glasses actually share the same goal: you have your own objectives, don't isolate yourself, but don't lose yourself either.

That's the philosophy Jiehuan wants to communicate. For this first phase, the goal I hope to achieve is for audio glasses to replace traditional glasses at a growth rate of two to three times in sales every year. We already achieved that this year, and I think we'll likely do it again next year. In three to four years, with over a million units annually, I think it'll hit a qualitative inflection point. Right now my focus is on meeting traditional glasses needs while gradually creating substitution effects against traditional glasses, and at the same time capturing more of each person's headphone-wearing time compared to other headphone types. That's what I'm working toward at this stage — a fairly difficult goal.