Even Siri Took 14 Years — What's So Hard About Voice AI, and Why Is AI Hardware Suddenly Hot? | A Conversation with Xiaoliang Chen of SoundAI

FreeS Fund (峰瑞资本) · September 30, 2024

From Submarine Anti-Sonar, to Siri, to AI Earbuds

Right now, you may already be on your way to your National Day holiday destination. Imagine you're abroad, and you put on a pair of AI headphones. Not only do they translate menus for you, but they also teach you how to ask for directions in a foreign language in real time... Would you pay for headphones like that?

From wartime submarine sonar and radios to wildly popular consumer products (the Walkman, CD players, MP3 players, the iPod, Siri, smart speakers, and now AI phones and AI headphones), acoustic technology has long been one of the critical drivers of consumer electronics. Today, with the rapid advancement of AI, sound is becoming the new frontier of human-machine interaction, reshaping how we engage with the world through listening and conversation.

Not long ago, Feng Shu (Li Feng) invited Dr. Chen Xiaoliang, founder of SoundAI, for a conversation. Dr. Chen was previously an acoustics researcher at the Chinese Academy of Sciences Institute of Acoustics and founded SoundAI in 2016. The company has been exploring the intersection of acoustics and AI ever since. During the heyday of smart speakers, most of the well-known brands on the market were powered by SoundAI's far-field acoustic interaction technology.

Yet for a long time, smart speakers left people with the impression of being "not very smart." The emergence of large AI models has finally solved this problem: voice interaction is no longer a marginal feature of little real use. Chen has always believed that the validation and eventual large-scale deployment of AI would happen in the consumer space. In 2024, SoundAI launched its own AI headphones. The first thing users notice is the built-in real-time translation and transcription features. So the scenario we described at the beginning is already becoming reality.

After the release of the iPhone 16 (Apple's first AI phone), Dr. Chen and Feng Shu had an in-depth discussion on the historical evolution of acoustics and the current state and future of AI hardware entrepreneurship. Topics included:

Why must military technologies like submarine sonar and torpedoes use sound signals?
Why does our own voice sound different when we sing versus when we hear it played back from a recording?
How did acoustic technology, originally serving national defense and military industries, gradually shift toward consumer electronics?
How did WeChat, starting from voice messages, rise to become China's top instant messaging app?
Siri launched back in 2010 and spent nearly 14 years in relative obscurity — why did it become the centerpiece of Apple's first AI phone launch event?
Do you also have a smart speaker gathering dust at home? Why were smart speakers "not smart"? What technical challenges and user experience difficulties lay behind this?
What new changes and possibilities can GPT-4o bring to speech recognition?
What developmental path have wearable headphones taken? AirPods vs. bone conduction vs. ear-clip headphones
How will AI empower headphones?
In China, hearing aid penetration is very low. Will this change? How can hearing aids win over the new generation of elderly users?

We hope this offers a fresh perspective. Head to the Xiaoyuzhou app, Apple Podcasts, or Ximalaya, search for and subscribe to "High Energy" (高能量) to listen to this episode.

Giveaway

Do you have a habit of wearing headphones? What needs do you have that current headphones on the market fail to meet well? Leave a comment below — we'll randomly select 2 readers to receive SoundAI's AI headphones.

/ 01 / Acoustics may be niche, but it's the critical point of every technological breakthrough

Li Feng: Our guest today is Dr. Chen Xiaoliang. Before founding his company, he was a researcher at the Chinese Academy of Sciences Institute of Acoustics. In 2016, he left the academy system to start his own business. Over eight years of entrepreneurship, he has ridden multiple waves related to acoustics and artificial intelligence. I've invited Dr. Chen to share his views on the new entrepreneurial direction of AI-hardware integration, and how this wave of enthusiasm will affect the tech consumer products we'll experience in the future.

Chen Xiaoliang: I'm very glad to chat with everyone today about AI + acoustics + hardware entrepreneurship. Acoustics is a relatively niche discipline globally. In the past, it mainly served national defense and military fields — things like submarine sonar and torpedo-related technologies.

Li Feng: Let me interject with a fun fact. A key military application of acoustics is how to use sonar to better detect targets, and how to avoid being detected by others' sonar.

Chen Xiaoliang: Why must sound signals be used? Because in underwater environments, sound waves are the only means of communication and sensing that can maintain long-distance transmission without rapid attenuation. Optical signals and electromagnetic waves are the primary methods in air, but in water, their signals rapidly attenuate due to absorption and scattering by the water itself. That's why when you dive, you find that below about 10 meters, it's basically pitch black — because light attenuates so quickly.

In the consumer space, we also use many acoustic technologies, including the microphones used for podcast recording, speakers, the sound effects you hear in movie theaters, the MP3 format for listening to music, and so on. These are all typical applications in the acoustics field.

However, most of these underlying algorithms, such as codecs, come from overseas. We've tried developing some codec algorithms independently, but because codecs depend on globally unified standards, getting a new one adopted is quite difficult.

Looking back at the 1980s and 90s, during Japan's rapid economic growth, people may remember the rise of the Walkman and CD players.

Later, Apple launched the iPod. Consumer acoustic hardware has been continuously evolving.

Going further back, many overseas acoustic brands were essentially companies that produced or repaired radios during World War II — such as Siemens, Philips, and Bose. At that time, the development of aircraft carriers, submarines, and related industries drove the application of acoustics in national defense and military fields.

Even further back, there was the invention of the telephone in the 19th century, which was also a very important technological advance.

Li Feng: This was a crucial part of the Second Industrial Revolution.

Chen Xiaoliang: So acoustics is one of the critical points where many technological developments achieve breakthroughs.

The rise of AI, in a sense, also started from sound. Deep learning was first validated on audio. In 2011, Apple officially launched Siri at its product event — the first voice assistant brought to market, which suddenly gave people a sense of the algorithmic changes brought by deep learning. This launch greatly drove the upgrade of acoustic algorithms.

/ 02 / In the early days, using WeChat felt like using a walkie-talkie

Li Feng: You just mentioned the importance of Siri. Actually, if we look back at mobile internet, almost all successful Chinese mobile internet ventures since 2010 — Meituan, Douyin, Kuaishou, WeChat — were business models that would have been unimaginable before.

These newly risen mobile internet giants share some common characteristics.

First is the change in interaction modality. We shifted information input from physical keyboards on PCs to dragging and swiping on smartphone touchscreens — that is, no longer relying on keyboards. This change in interaction modality enabled many business models, such as the big data recommendation systems of Toutiao and Douyin.

Second, new sensors brought new data. The arrival of high-definition cameras made image- and video-based mobile apps popular, such as Douyin and Kuaishou. The addition of GPS to smartphones created location data, which in turn spawned location-based mobile apps including DiDi ride-hailing and food delivery services.

The rise of WeChat was also closely tied to this shift.

While many people now type on WeChat, we can recall that when it first launched, it was primarily voice-based. At that time, many users treated WeChat like a walkie-talkie.

This was possible because, as we moved from Nokia-era phones to smartphones such as the iPhone, microphones evolved from simple recording devices into microphone arrays, improving sound quality and signal-to-noise ratio. Even in noisy environments, voice messages came through clearly.

Additionally, mobile networks at the time could support voice transmission well, and voice reproduction quality was decent. The user experience was also distinctive — smartphones had become full-screen devices, but the screens were small, with no physical keyboards and virtual keyboards that weren't very user-friendly. Voice communication was far more efficient than typing.

Chen Xiaoliang: When WeChat was born, the acoustic technology it used was relatively simple, mainly recording and playback, without involving complex AI technology. Later, WeChat incorporated some speech recognition technology, and AI gradually found application.

It's worth noting that the popularization of WeChat voice messages relied heavily on improvements in underlying codec technology, which dramatically reduced the bandwidth requirements for voice transmission while maintaining clarity. Recall that early voice calls often suffered from network stuttering and intermittent connections.

Li Feng: We just discussed how humans interact with smart devices. Typing on a keyboard is an acquired skill — no one is born knowing how to type. But swiping and dragging are completely natural; a child can pick up an iPad and start playing. Cameras are like extensions of the human eye, and voice communication is a natural input-output method for humans.

So in the evolution of smartphones, the biggest change was moving from physical keyboards to no physical keyboards. This led users to adopt new sensors and interaction methods — sound, location data, HD cameras, swiping and dragging — forming entirely new interaction paradigms.

Chen Xiaoliang: Actually, since Siri's birth, the industry reached a consensus that the next generation of interaction would be multimodal, voice-based interaction. But why, after more than a decade, has voice interaction still not become mainstream? It's because the underlying acoustic-related technologies haven't reached a sufficiently mature commercialization stage.

Sound involves many problems, including different voice characteristics, multilingual processing, semantic understanding — these are all issues that large language models are currently addressing. However, acoustic computing hasn't seen new breakthroughs in a long time.

For example, the invention of the radio was based on early vacuum tube technology, which processed sound signals mainly through analog circuits. Later, devices like the Walkman and CD player emerged, and acoustic technology gradually shifted from analog signal processing to digital, but still followed a signal-processing approach.

Then Apple launched the iPod music player and rose again on the back of this product. I feel that Steve Jobs had an obsession with acoustic technology. Around 2009, people began experimenting with deep learning methods to process speech problems. However, at that time it hadn't yet penetrated into language or acoustics — it was only being used for speech processing.

From analog signals to digital signals, and then beyond traditional signal processing into deep learning: in effect, acoustic computing has entered its third era.

Deep learning excels at handling nonlinear problems that traditional signal processing methods cannot solve, though it falls short in precision. In speech recognition, it's difficult to achieve 100% accuracy, but often 100% accuracy isn't necessary. Everyone's voice characteristics differ. Humans also make mistakes when listening to sound, especially when they haven't heard clearly — they habitually fill in content through association. Fortunately, large models can compensate for this. In speech recognition, machines have already surpassed humans.

In the voice interaction chain that Steve Jobs promoted, there are two key components: acoustics and NLP (Natural Language Processing). Around 2010, speech processing technology was still limited to use inside phones.

At the end of 2014, Amazon released the Echo smart speaker, pushing voice interaction forward. Amazon had been developing Echo since 2011, and at that time, to solve voice recognition problems, it introduced microphone array technology.

A single microphone can only receive the amplitude of a sound signal. Through the combination of multiple microphones, we can also capture phase information. Using time differences to calculate phase differences, we can more precisely determine the location of the sound source, further improving speech recognition accuracy and sound signal quality.
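To make the time-difference idea concrete, here is a minimal sketch. It is not SoundAI's algorithm; it assumes just two microphones 10 cm apart, a 16 kHz sample rate, and a source far enough away that the wavefront is flat. The names `best_lag` and `arrival_angle` are made up for illustration. The code finds the delay between the two channels by brute-force cross-correlation and converts it into an arrival angle.

```python
import math

SPEED_OF_SOUND = 345.0  # m/s in air, the figure quoted later in the conversation
SAMPLE_RATE = 16_000    # Hz (assumed)
MIC_SPACING = 0.10      # metres between the two microphones (assumed)


def best_lag(a, b, max_lag):
    """Return the lag in samples by which channel b trails channel a.

    Tries every lag in [-max_lag, max_lag] and keeps the one with the
    highest cross-correlation score.
    """
    def score(lag):
        return sum(a[i] * b[i + lag]
                   for i in range(len(a)) if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=score)


def arrival_angle(lag_samples):
    """Convert a lag into an angle from broadside, in degrees (far-field)."""
    delay = lag_samples / SAMPLE_RATE
    ratio = max(-1.0, min(1.0, delay * SPEED_OF_SOUND / MIC_SPACING))
    return math.degrees(math.asin(ratio))


# Synthetic example: a short 1 kHz burst reaches mic 2 three samples late.
burst = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
         * math.exp(-((n - 40) / 10) ** 2) for n in range(80)]
mic1 = burst + [0.0] * 10
mic2 = [0.0] * 3 + burst + [0.0] * 7

lag = best_lag(mic1, mic2, max_lag=5)
angle = arrival_angle(lag)
print(lag, round(angle, 1))
```

With these assumed numbers, a three-sample lag corresponds to roughly 40 degrees off broadside. A real array would use more microphones and a more robust correlation method, but the principle of turning time differences into direction is the same.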

/ 03 / Distance, Latency, and Noise: How Are Three Common Acoustic Problems Solved?

Li Feng: Many listeners may not know much about acoustic technology. When you mention microphone arrays, there are actually several common scenarios and problems involved.

The first is like what we're doing now, recording a podcast: everyone is very close to the microphone, and the recording quality is excellent. This is the ideal environment.

Another is during meetings, where people are at different distances from the microphone on the table. Someone sitting close might speak very clearly, but someone far away might be hard to hear, with sound that sometimes cuts in and out.

Additionally, if you're outdoors, phones pick up a lot of background noise — subway trains, wind, and so on.

So when microphone arrays address these problems, what are the current and future solutions?

Chen Xiaoliang: Human-device interaction is very natural. For example, interacting with a phone is at arm's length — this is called "near-field" interaction. In the future, when intelligent robots become widespread, we won't be able to chase after them to press buttons or touchscreens like we do with phones. So we must solve the far-field interaction problem.

Smart speakers use array technology primarily to solve the far-field problem first. In 2016, our main task was to remove distance as a boundary condition, ensuring that sound could still be heard clearly at a distance.

In military sonar confrontation, this is the core problem. The ocean environment is extremely complex — there are no ideal boundary conditions. But in consumer scenarios, there are often cost constraints. We initially used six-microphone arrays, later reduced to three, and now one microphone can achieve the effect — showing that technology continues to advance.

The meeting scenario is a typical multi-person scenario. In multi-person scenarios, there's a phenomenon called the "cocktail party effect": humans can focus on certain sounds in a noisy environment while ignoring others. Beyond external noise, there's also self-noise.

This is because humans hear sound in two ways: air conduction and bone conduction. The sound of our own voice is actually a combination of these two methods. When you speak, you inevitably cause bone vibrations, and these vibrations are transmitted to your ears through bone conduction — this is self-noise. AI hardware, including robots, must effectively suppress self-noise.

Li Feng: That really is a little-known fact.

Chen Xiaoliang: So when humans sing, the voice they hear themselves is different from the playback voice, or from what others hear. Many people go off-key when singing because they don't accurately hear their own voice. And to precisely control pitch, singers typically wear in-ear monitors, constantly adjusting their singing rhythm. This places extremely high demands on the latency of the acoustic system.

Continuing with boundary conditions: sound travels at different speeds in air and solids. Sound travels at about 345 meters per second in air, while in steel it's more than ten times faster. If you tap a radiator, the sound immediately travels throughout the entire floor. If sound is delayed or misaligned, sounds that should cancel out instead amplify each other and become noise.

So latency is a major technical challenge. Just as satellite positioning depends on precise time synchronization — once time is wrong, accuracy degrades — acoustics has similar requirements. The second boundary condition is latency.
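The point about misaligned sounds amplifying instead of cancelling can be checked numerically. The sketch below is only an illustration under assumed values (a 1 kHz test tone sampled at 48 kHz); `residual_rms` is a hypothetical helper, not part of any product. It adds a tone to an inverted copy of itself and measures what is left when the inverted copy arrives late.

```python
import math

SAMPLE_RATE = 48_000  # Hz (assumed)
TONE_HZ = 1_000       # test tone frequency (assumed)


def residual_rms(delay_samples, n=4800):
    """RMS of a tone plus an inverted copy that is `delay_samples` late."""
    total = 0.0
    for i in range(n):
        original = math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE)
        anti = -math.sin(2 * math.pi * TONE_HZ * (i - delay_samples) / SAMPLE_RATE)
        total += (original + anti) ** 2
    return math.sqrt(total / n)


aligned = residual_rms(0)       # inverted copy arrives on time
half_period = residual_rms(24)  # 24 samples = 0.5 ms = half a period at 1 kHz
print(aligned, half_period)
```

When the copy is on time, the residual is zero. When it is half a period late (only 0.5 ms here), the two signals add up and the residual is twice the amplitude of the original tone, which is why timing errors turn cancellation into extra noise.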

Acoustic processing is very different from speech recognition. In acoustic processing, latency must be controlled within a range acceptable to humans. Generally, the time for a human to utter a word is about 200 to 300 milliseconds, while our perception of sound reverberation and echo is around 80 to 100 milliseconds. But sensitive people can perceive latency as low as 30-something milliseconds. Therefore, sound processing must compress latency to under 30 milliseconds.

When we process speech, we divide it into very small frames, each generally no longer than 10 milliseconds. The data is highly fragmented, and it must be predicted and processed in real time.
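As a rough illustration of the framing step, the sketch below cuts audio into 10-millisecond frames and counts how many frames of buffering fit under the 30-millisecond ceiling mentioned above. The 16 kHz sample rate and the helper names are assumptions for the example.

```python
SAMPLE_RATE = 16_000  # Hz (assumed)
FRAME_MS = 10         # frame length quoted in the text
BUDGET_MS = 30        # latency ceiling quoted in the text


def split_into_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut samples into consecutive, non-overlapping frames.

    A trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def frames_within_budget(frame_ms=FRAME_MS, budget_ms=BUDGET_MS):
    """Number of whole frames that can be buffered inside the latency budget."""
    return budget_ms // frame_ms


one_second = [0.0] * SAMPLE_RATE
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]), frames_within_budget())
```

At these settings one second of audio becomes 100 frames of 160 samples each, and an algorithm can wait for at most three frames before it has used up the whole budget.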

In smart speakers, when using microphone arrays, we mainly solved two problems: first, solving the far-field problem; second, solving the latency problem. We need to ensure that conversational service latency is around 1.5 seconds — for example, after a user gives a command, the time for the speaker to start playing music shouldn't exceed 2 seconds, otherwise users will perceive obvious latency, affecting the experience.

Li Feng: So through acoustic control methods, could you add natural language processing in advance, rather than waiting until later?

Chen Xiaoliang: It can't be added in advance. In 2016, we had to add more than ten types of acoustic algorithms to all our arrays. This was mainly to deal with the third boundary condition: noise.

Additionally, endpoint detection is also very important. It accounts for the largest delay in the entire conversational interaction process. If not well controlled, latency could reach one to two seconds, seriously affecting subsequent user experience.

Li Feng: Is the concept of endpoint detection the same as handling "breath gaps" when editing podcast audio?

Chen Xiaoliang: Yes. Every pause in your speech needs to be detected. Some people speak very fast, so endpoint detection is needed to ensure correct segmentation. The segmentation done in acoustic processing must leave room for the subsequent speech and language processing; if the earlier stage is inaccurate, everything downstream goes wrong.
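One simple way to picture endpoint detection, and why it adds delay, is an energy threshold with a waiting period. The sketch below is an assumption-laden toy, not the method used in any shipping product: `find_endpoint`, the threshold, and the 30-frame wait are all hypothetical. It declares the utterance finished only after a run of quiet frames.

```python
FRAME_MS = 10  # frame length, matching the figure quoted earlier


def find_endpoint(frame_energies, threshold=0.01, hangover_frames=30):
    """Return the frame index at which the utterance is declared over.

    The utterance ends once speech has been heard and `hangover_frames`
    consecutive frames then fall below `threshold`. Returns None if that
    never happens.
    """
    heard_speech = False
    quiet_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return index
    return None


# 0.5 s of speech, a 0.1 s pause, 0.5 s of speech, then silence.
energies = [0.2] * 50 + [0.0] * 10 + [0.2] * 50 + [0.0] * 100
last_speech_frame = 109

end = find_endpoint(energies)
added_delay_ms = (end - last_speech_frame) * FRAME_MS
too_eager = find_endpoint(energies, hangover_frames=5)
print(end, added_delay_ms, too_eager)
```

With a 30-frame wait, the short pause is ignored and the end is found 300 ms after the last speech frame. With a 5-frame wait, the detector reacts faster but cuts the sentence at the pause. That trade-off between delay and correct segmentation is the tension described above.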

So in array processing, we actually need to solve many boundary condition problems and undertake a lot of work. This is also why we needed to find a new carrier — because phones at the time didn't have sufficient computing power to support this complex processing. Because of this, we needed to develop a small acoustic chip.

At the same time, we also needed to reduce costs. Echo at that time used a very high-end digital signal processor, a TI DSP chip. We later moved all algorithms to ARM architecture, with the microphones feeding directly into the chip, bringing smart speaker prices down to around 200 yuan. Xiaomi's smart speaker also became a hit product around that time.

/ 04 / Why Weren't Early Smart Speakers Smart? Has Technology Progressed Now?

Li Feng: Smart speakers were particularly hot from 2016 to 2018. For you, what goals were ultimately achieved?

Chen Xiaoliang: I think three goals were achieved, and one goal was not achieved.

The first is that we solved acoustic problems in complex scenarios, successfully freeing smart speakers from the constraint of arm's length and making them true far-field interaction devices.

Second, by using microphone arrays and acoustic structures, we built a complete AI acoustic processing architecture — from acoustic processing to speech recognition, language processing, content services, and TTS (Text-to-Speech) synthesis. The entire chain was connected, laying the groundwork for subsequent iterations of smart devices.

At the same time, we improved speech recognition accuracy. In complex scenarios, our far-field speech recognition accuracy basically reached over 85%, which was sufficient for understanding and executing user commands.

Third, after combining multiple algorithms, we successfully kept latency within an acceptable range for users, ensuring a balance between accuracy, latency, and distance.

Li Feng: How many of these technologies were related to your work?

Chen Xiaoliang: All the acoustic algorithms mentioned earlier, plus the wake-word algorithms. In one year, the smart speakers powered by our technology sold 20 to 30 million units.

But there was one problem we couldn't fully solve: NLP processing wasn't mature enough at the time. Many people said smart speakers "weren't smart enough," and that was indeed the case. After the smart speaker boom, starting from 2019, we spent a lot of effort improving NLP technology.

Li Feng: Let me briefly summarize. The development of acoustic technology in smartphones enabled high-definition voice calling applications like WeChat, especially walkie-talkie-style voice calls. But due to the hardware limitations of phones themselves, Amazon started looking for new form factors from 2011, and eventually, between 2014 and 2019, facilitated the rise of smart speakers.

With more space and higher computing power, the acoustic performance of smart speakers improved. However, even though the hardware was already well done, during that period, smart speakers still had some difficulties with NLP capabilities and interactive abilities.

Chen Xiaoliang: From 2010 to 2015, deep learning brought a leap in voice technology, and products like Apple's Siri, Google's Assistant, and Microsoft's Cortana gradually rose. However, their speech recognition accuracy on phones wasn't high, and they weren't smart enough, so they ended up being of limited use.

From 2015 to 2020, microphone array technology solved key acoustic problems, especially in complex scenarios. But at that time, language processing technology didn't significantly improve. Although a lot of data was accumulated, how to efficiently process and use this data remained a challenge.

As a result, even now, smart speakers remain the most widely accessible AI device for global users.

Technology after 2020 is very different from the previous ten years, especially with the emergence of large models. Now we're seeing opportunities for new wearable devices, and Apple is starting to launch AI phones — this is because the combination of language technology and AI has become more mature.

/ 05 / In the GPT-4o Era, Voice Interaction Will See Greater Opportunities

Li Feng: In AI and acoustics-related fields, people have reacted enthusiastically to GPT-4o, especially regarding voice interaction. I'd like to hear your perspective on GPT-4o and its subsequent development.

Chen Xiaoliang: The next very important step is combining acoustics with large models and deploying them in devices. GPT-4o can currently demonstrate voice and language capabilities based on phones, with relatively low requirements for acoustics. The performance of large models in conversation is already much better than the smart speaker era, and user experience has reached a usable level.

However, voice interaction still faces challenges, especially in natural conversation. Current smart speakers still use a "one-to-one" interaction pattern: you finish speaking, it listens, then responds. But in multi-person chat scenarios, voices need to be segmented to identify who is speaking and the contextual connections between different speakers.

This depends on voiceprint technology — quickly recognizing how many people are speaking and who said what. Otherwise, without accurate context, large models may misinterpret the entire conversation. Voiceprint technology has never been widely commercialized, but it plays an important role in solving these problems.

Li Feng: This sounds like fingerprint recognition.

Chen Xiaoliang: Exactly. Especially in complex conversation scenarios, voiceprint plays a key role. Once the technology matures, combined with previous accumulated capabilities, the interactive experience in complex scenarios will become very good. At that point, you'll feel that it can not only understand individual speakers but truly comprehend conversations among multiple different people.

Li Feng: Large language models basically involve two scenarios: one is writing, the other is voice interaction. At the GPT-4o stage, we'll see more interaction forms like "speaking" and "listening."

AI hardware has suddenly become hot in the past half year, including smart glasses, AI earphones, and many companion devices with voice interaction — for example, adding voice functions to toys. In the future, perhaps voice can be used for elderly companion and status monitoring devices.

GPT-4o has brought changes in input and output forms. What attempts have you made in this regard?

Chen Xiaoliang: I think GPT-4o can be compared to Siri — GPT-4o is the next generation of Siri. Since Apple launched Siri in 2010, it has endured for nearly 14 years, and now finally welcomes a major upgrade. Apple's AI phone is the result of Siri upgrading to GPT-4o. Siri evolved from originally "can't hear clearly, can't understand" to now being able to recognize multi-person conversations and understand them, which is achieved through large language models.

Actually, the fact that Apple has started combining GPT-4o with search functions indicates that the combination of voice and large language models is already relatively mature and ready for commercial use. GPT-4o is a critical milestone for the combination of voice and large models. This technological upgrade will soon be applied to various new smart devices, such as PCs, earphones, and glasses. Next, as more devices join, combined with improvements in acoustics, the entire AI device market will see explosive growth.

Li Feng: Because voice is the natural form of language-based interaction, once language technology partially matures, the shift in interaction methods will increasingly rely on voice.

Chen Xiaoliang: Yes, if you want large models to be better applied, or to let hardware play to its strengths, these two definitely need to be combined.

/ 06 / The Development and Iteration of Wearable Earphones

Li Feng: So based on this understanding, plus your past accumulation, you launched a new AI earphone?

Chen Xiaoliang: Yes, it became a hit product in a very short time. We originally thought selling 50,000 units a month and 600,000 units for the full year would already be quite good, but actual demand far exceeded expectations. We've been continuously adding production capacity.

Li Feng: Consumer enthusiasm exceeded your inventory. Specifically, how are sales of this earphone on different platforms?

Chen Xiaoliang: We're currently only doing pre-sales on Douyin, and it exploded with orders as soon as it launched — and this is Douyin's own definition of "explosive orders." Currently, our earphone basically ranks in the top ten in Douyin's organic traffic, sometimes even number one. The weekly add-to-cart numbers are also doubling, and this rhythm reminds me of the smart speaker explosion back then.

Li Feng: Being able to sell number one on Douyin is impressive, because Douyin is almost the most cutthroat sales market. What do you think are the main reasons this earphone became a Douyin hit?

Chen Xiaoliang: Mainly because of AI. Many users want to know what AI can do, but they don't understand how AI can help them. Our AI earphone added translation functionality, letting users intuitively experience the capabilities of an AI earphone.

For example, when traveling abroad to Belt and Road countries where less widely spoken languages are used, the need for translation is very clear. The market education cost for translation functionality is low: users buy the earphone and can immediately experience AI functions.

Li Feng: It's like buying an earphone and getting a translation device as a bonus.

Chen Xiaoliang: Right. The second reason is large model applications. Although large models have limitations in many scenarios — such as needing prompts and having "hallucination" problems — we've made some optimizations for these issues to help users better utilize AI.

Li Feng: What adaptations have you made on the base model?

Chen Xiaoliang: Our base model doesn't have a large parameter count; it adopts a mixture-of-experts architecture, in which each expert model focuses on specific types of tasks or data. This base model is particularly well suited to conversation scenarios: the content it generates is very concise, usually short exchanges that quickly help users solve problems. Short exchanges have another benefit. Since the AI needs to read the translated speech aloud, a long answer keeps users listening for a while, whereas a brief one reduces waiting time and makes communication smoother.
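For readers unfamiliar with the term, the routing idea behind a mixture-of-experts model can be sketched in a few lines. Everything here is a toy assumption: the gate weights, the two expert labels, and the `route` function are invented for illustration and say nothing about how SoundAI's model is actually built. A gate scores each expert for a given input, and only the top-scoring expert is used.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical gate weights: one row per expert, one column per input feature.
GATE_WEIGHTS = [
    [2.0, 0.0],  # expert 0 responds to feature 0
    [0.0, 2.0],  # expert 1 responds to feature 1
]
EXPERT_NAMES = ["translation", "small_talk"]  # illustrative labels only


def route(features):
    """Return (expert_name, gate_probability) for the top-scoring expert."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in GATE_WEIGHTS]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EXPERT_NAMES[best], probs[best]


name, prob = route([1.0, 0.0])
print(name, round(prob, 3))
```

Because only the selected expert runs for each input, the total parameter count can be split across specialists while the work done per request stays small, which fits the emphasis on short, fast responses.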

So, while the overall consumer market is currently sluggish and it's hard to stimulate users to replace devices, the situation is completely different after adding AI. AI has stimulated user consumption demand — they're willing to try new technologies and products. So consumption demand isn't nonexistent; it just needs a new trigger point to activate it.

Li Feng: Next let's talk about the development of portable earphones. In 2019, after several iterations, Apple's AirPods became very popular.

Chen Xiaoliang: TWS earphones, what we commonly call true wireless earphones.

Li Feng: Yes, previously mainstream earphones were either wired or bulky. After TWS earphones came out, people started getting used to wearing wireless earphones for extended periods. This was also a process of cultivating the market.

Chen Xiaoliang: It basically raised earphone penetration rates.

Li Feng: Then, because of the pandemic, after people had stayed home for a long time, outdoor activities, especially exercise, increased, and bone conduction earphones became popular.

The advantage of bone conduction earphones is that during outdoor exercise they don't block environmental sounds such as vehicle horns. But their sound quality is comparatively poor, and if you sweat during exercise, the earphones also produce "crackling" interference.

Overall, today's earphone market is relatively mature. People are accustomed to wearing wireless earphones for long periods, and there are corresponding products for different indoor and outdoor usage scenarios. Your earphones are neither fully in-ear nor completely external — they adopt an ear-clip design. How did you think about this?

Chen Xiaoliang: This was also an innovation in hardware form factor for us.

Li Feng: How much does your earphone sell for?

Chen Xiaoliang: The current price is 399 yuan, but during the pre-sale period there's a promotional price of 199 yuan — roughly the same as our previous smart speaker pricing.

Li Feng: That's a very clever pricing strategy. It happens to hit on a "consumption pattern": Chinese consumers have high acceptance for new electronic products priced under 200 yuan. This means if you price below 200 yuan, people are willing to tolerate some minor flaws for the novelty of the experience. If the product quality is good, it exceeds their expectations. In the US, a similar price range would be around $300, equivalent to roughly 2,000 yuan.

I recommended your earphones to some friends who didn't know much about you, and the overall feedback was that it exceeded expectations. First, they found the packaging really cool and the build quality excellent. Additionally, the ear-clip design combines the advantages of in-ear and bone conduction — you can hear ambient sound without too much interference, and the voice pickup works well. These basic functions alone surpassed what you'd expect for 199 yuan. When they then discovered the AI features, they realized the earphone could do that too, and felt they got more than their money's worth.

This product will probably appeal to elderly people as well. After a certain age, seeing things becomes tiring, and listening becomes an easier way to take in information. Plus, older people like to walk around, especially outdoors — strolling or jogging — and they rely more on hearing.

Chen Xiaoliang: Yes, we originally assumed these devices would be purchased mainly by students, but in reality, many elderly people have significant demand too. They also want to understand and use AI — something we hadn't anticipated. We have plans to launch earphone models specifically designed for seniors, which we'll be announcing soon.

07 What Makes Top-Tier Aviation Headphones So Expensive? What Will Next-Gen AI Earphones Look Like?

Li Feng: Will you launch more premium earphones in the future? What typically makes the most expensive top-tier headphones so costly?

Chen Xiaoliang: We're also developing more premium headphones, similar to aviation headsets used by pilots. The demands on acoustic hardware and algorithms are extremely high — they need to maintain comfortable hearing even in environments with constant artillery fire. Abroad, such top-tier headphones cost over $10,000.

Li Feng: These can't be fully wireless, right? Don't they need to stay connected to power?

Chen Xiaoliang: Early versions required a power connection, but technology has advanced — now they can last four or five hours like AirPods.

Li Feng: These must be fully over-ear, right?

Chen Xiaoliang: Yes, though semi-open designs are also being developed now. As the technology iterates, such headsets will be used by pilots and in low-altitude aviation scenarios.

Li Feng: Many people use noise-canceling headphones on planes, like Bose.

Chen Xiaoliang: Those are relatively mature noise-canceling headphones, mainly suppressing external steady-state noise.

Li Feng: Teenagers today almost all wear earphones, partly so their parents can't talk to them. Just like when we were kids — put on a Walkman and you could immerse yourself in your own world. So if the earphone's isolation is excellent and voice pickup is also very good, they'll listen even more.

Chen Xiaoliang: This could really become as ubiquitous as the Walkman.

Li Feng: In future R&D, what new AI features would you want the next mass-market earphone to carry?

Chen Xiaoliang: There are two main directions now.

First, for earphones in the current price range of about 200 yuan, we're targeting three features. The first is real-time translation: we've currently upgraded to mutual translation among 66 languages and real-time simultaneous interpretation for 8 languages. The second is optimizing real-time multi-speaker transcription for business users. The third is dialogue generation and applications based on large language models.

The second major direction is deep integration of earphones with AI — this is the area we'll focus on breaking through next. We previously succeeded in making smart speakers operate independently from phones; now, we plan to enable earphones to provide richer AI functions completely independently from phones as well.

Li Feng: When people abroad are asking for directions or ordering food, could they use this earphone for real-time translation and conversation?

Chen Xiaoliang: That is exactly what we want to achieve.

Li Feng: Then let me set you a challenging scenario. Suppose a young person goes to a bar — in that extremely noisy environment, could AI earphones help them hear what someone is saying in a foreign language, generate appropriate responses, and even teach them how to respond in that foreign language? Could this be realized in the future?

Chen Xiaoliang: The complex scenarios Uncle Feng mentioned are something we're exploring. We found during user testing that user activity is very high during late night hours, with strong demand for translation functions — languages like Korean and Ukrainian show high activity in our system.

Li Feng: If someone buys your earphones now, will future product iterations and upgrades be pushed to existing users?

Chen Xiaoliang: Yes. With the current earphone purchase, we include a one-year membership. During the membership period, users can enjoy continuously iterated functions and performance.

08 The New Elderly's Consumption Demand: Don't Let People See That I'm Old

Li Feng: Why are hearing aids so difficult to make? They lean toward social welfare — products with high social value.

Chen Xiaoliang: There are several core challenges to making good hearing aids.

First, everyone's hearing loss is different, and so is their perception of loudness. Hearing aids need to amplify sound substantially, far more than ordinary earphones do, and the hardware must be capable of handling that high-gain amplification.

Second, when sound is amplified many times over, not everything should be amplified. The device needs to precisely amplify the sounds the user cares about while suppressing noise, because excessive noise can seriously affect the user's mental health.

Third, hearing aids consume a lot of power — how to control power consumption is another critical issue.

Additionally, many hearing aids produced by medical device manufacturers, while functionally powerful, clearly look like medical equipment. They're inconvenient to wear and offer poor user experience.
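
Editor's sketch: the first challenge above, fitting amplification to each person's hearing loss, can be illustrated with the classic half-gain rule from audiology, which prescribes roughly half of the measured hearing loss as gain in each frequency band. This is only an assumed illustration, not SoundAI's fitting method; the audiogram values, the `MAX_GAIN_DB` cap, and the function names are all hypothetical.

```python
# Hypothetical audiogram: hearing loss in dB HL per frequency band (Hz).
# The numbers are illustrative only and do not come from the interview.
AUDIOGRAM_DB_HL = {250: 20, 500: 30, 1000: 40, 2000: 55, 4000: 70}

# Assumed upper limit on the gain the device can safely deliver.
MAX_GAIN_DB = 30.0


def half_gain_rule(audiogram, max_gain_db=MAX_GAIN_DB):
    """Return per-band gain in dB: half the hearing loss, capped at max_gain_db."""
    return {freq: min(loss / 2.0, max_gain_db) for freq, loss in audiogram.items()}


def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20.0)


gains = half_gain_rule(AUDIOGRAM_DB_HL)
for freq, gain_db in sorted(gains.items()):
    print(f"{freq} Hz: +{gain_db:.1f} dB (x{db_to_amplitude(gain_db):.1f} amplitude)")
```

Even this simplified rule shows why high-frequency bands, where loss is often greatest, need the most gain and hit the device's limit first. Real fitting formulas are considerably more involved and also account for loud-sound compression and noise.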

Li Feng: Most elderly people are afraid of others seeing that they've aged. They don't like products like hearing aids that directly expose their elderly identity.

Chen Xiaoliang: So our goal is to design hearing aids to look as fashionable as earphones. When elderly people wear them, they look like they're wearing ordinary earphones. We plan to release an AI hearing aid for people with severe hearing loss, with AI functions built in from the factory.

The number of deaf people in our country exceeds 20 million, and those with hearing loss reach 200 million. Yet many would rather not hear clearly than wear hearing aids. Besides some hearing aids not being aesthetically pleasing, some people also have certain prejudices against using them and are unwilling to wear them. Another issue is that many elderly people's needs haven't been seriously listened to.

Substantially increasing hearing aid penetration in China faces major challenges. US hearing aid penetration is roughly 35%, while China's is less than 5%. As a result, the domestic market is dominated by foreign brands, especially in the mid-to-high-end segment. However, precisely because the penetration rate is low, domestic brands with price advantages have an opening, and we're narrowing the market gap.

Li Feng: If it looks like an earphone and is fashionable, more elderly people would be willing to use it. For example, about a year or two ago, bifocal lenses started becoming popular in the consumer market — a single lens that enables seeing both far and near. Such products are very popular among the new generation of "post-70s" new elderly.

What the new elderly dislike most is using items that are obvious markers of old age. Once you put on reading glasses, everyone knows you're old. With bifocal lenses, there's no need to switch glasses and you can see clearly both near and far, so the new middle-aged really like them: the design not only meets their practical needs but also helps them stay consistent with the identity they want. Hearing devices will eventually go in this direction too.

09 Looking Forward, Looking Back: Voice Will Eventually Become a New Entry Point for Interaction

Li Feng: Looking back, you mentioned that Siri was 14 years in the making. As a startup, you went through technology integration, hardware evolution, and the fusion of software algorithms, finally connecting hardware, software, and algorithms into one whole and adding large language model and AI applications to create today's hit earphone. How would you summarize the twists and turns of this 8-year journey?

Chen Xiaoliang: Over these 8 years, we happened to catch several important developments.

First, we seized the technology dividend. Siri started on phones, then voice interaction shifted from phones to smart devices like speakers. The key technical hurdle at the time was solving acoustic problems. We caught this opportunity, but technology isn't developed overnight; it takes long refinement. For example, data accumulation can't be completed in a day or two, and mathematical cognitive analysis and large model development both depend on it. Without the internet, large models couldn't have developed.

We went through a long period of hardship, exploring and experimenting with many technical routes, most of which ended in failure. But we firmly believed in reaching consumers. At first, we reached them together with partners: our intelligent voice solution was standard equipment for the vast majority of domestic smart speaker brands. So we had some understanding of consumers.

Later, the emergence of large AI models finally solved the "not smart" problem of smart speakers. Voice interaction is no longer a "chicken rib" (鸡肋, something of little use yet hard to discard) and can truly help users solve problems. So at least at the technical level, we've reached a usable standard. We've always looked forward to reaching consumers directly through our own brand, so we made AI earphones.

From seizing the dividend and enduring hardship to the technology maturing and the product taking off, much of this we hadn't anticipated. Entrepreneurship always brings challenges, but I firmly believe that AI's large-scale deployment will ultimately happen in the consumer space. Apple's AI phone release is a good example. It doesn't look very different, but there is a lot of AI-related groundwork inside.

Additionally, we're also thinking about and verifying many of the points Uncle Feng mentioned. For example, by combining our manufacturing advantages with technological innovation, an understanding of consumer demand, and good product design, there's an opportunity to create entirely new product categories.

I've always believed that acoustic interaction will definitely be a very important interaction method in the future, and we'll continue exploring in the direction of AI combined with hardware, letting consumers feel the charm of AI and the efficiency improvements AI brings.

Li Feng: As the old saying goes, "No road in life is walked in vain; every step counts." I hope SoundAI's earphones continue to sell well and become a true blockbuster. Products with such cross-language capabilities and great social value should be sold more widely in international markets.

Looking forward, what you said makes very good sense. First, just like in the last round of electronic product revolution, the evolution of smartphones made touch a new interaction method; similarly, the development and evolution of language models has also made voice a new input-output method, an important entry point. I'm certain of this.

Second, as you mentioned, with the popularization of earphones, more AI functions can be realized on earphones. Equally worth anticipating is earphones becoming a new entry point for interaction.

For Dr. Chen, this journey has run from submarine anti-sonar technology, to smart speakers, and finally to hit AI earphones. I hope SoundAI can overcome more challenges and continue to grow.

Reader Giveaway

Do you have a habit of wearing earphones? What needs do you have that current earphones on the market can't satisfy well? Feel free to leave a comment, and we'll randomly select 2 readers to receive SoundAI Technology's AI earphones.

Wishing you a happy holiday!
