Vozo AI's Changyin Zhou: Early Google X Engineer, Video Product Goes Viral on Product Hunt, 6 Million Global Users | Z Talk

ZhenFund · November 13, 2024

The market for video creation is a $100 billion opportunity that's severely underestimated.

Z Talk is ZhenFund's column for sharing insights.

This July, Vozo AI hit #1 on Product Hunt immediately after launch, earning a vote even from Product Hunt's CEO. Its core Rewrite & Redub feature lets users regenerate video scripts from prompts, clone the original speaker's voice, and synthesize lip movements for new dubbing — winning over nearly ten million users in the video creation space.

Vozo AI CEO Changyin Zhou earned his bachelor's and master's degrees at Fudan University and his PhD at Columbia University, where he researched computational photography. During his doctoral studies, he received an invitation from Stanford computational photography luminary Professor Levoy to join the Google X Gcam project team.

In 2015, Zhou left Google to found VISBIT. He returned to China in 2021 for his second entrepreneurial venture. After deeply exploring industry needs and technological boundaries, he pivoted from VR applications to video production, ultimately launching Vozo AI in July 2024. ZhenFund made an angel investment in VISBIT in 2016 and invested again in Vozo AI in 2020.

In a recent interview, Zhou — who has long specialized in computer vision — discussed his experience finding product-market fit across two startups, and explained the deep understanding of video production industry technology and demand dynamics that underpins Vozo AI. Here is the full interview:

Every technological revolution brings transformation to content creation. Vozo was founded against this backdrop, dedicated to enabling everyone to express their creativity and stories through AI.

Vozo provides powerful AI video creation tools, including smart editing, automatic effects, and a rich template library, allowing users to produce high-quality video content without professional skills. Whether short video creators, KOLs, or enterprise users, anyone can leverage Vozo's tools to communicate and share more freely through video, accelerating both the efficiency and quality of content creation.

In this interview, we spoke extensively with Vozo founder Changyin Zhou about his entrepreneurial journey from computational photography to video creation tools, and his understanding of the industry's future and product philosophy. Throughout the conversation, Zhou smiled constantly, repeatedly mentioning "it's still quite interesting" — giving us a real sense of his passion for and enjoyment of entrepreneurship. Let's step into his story and explore the vision and future behind Vozo. Enjoy!

01

Early Google X scientist, serial entrepreneur, previously focused on using computation plus photography to deliver better visual experiences for users

Q: Welcome, Changyin. Please introduce yourself to everyone!

Changyin Zhou: Hi everyone, I'm Changyin Zhou, Founder and CEO of Vozo. Friends call me CY. I did my undergrad at Fudan University's School of Management and my graduate studies in Fudan's Computer Science department. During my master's, I visited Microsoft Research, where I had the fortune of encountering some of the world's top computer vision research and scientists, finding my research direction: computational photography, meaning "computation plus photography" — helping users get better images or video, or better understand images or video.

In 2007, a Microsoft Research advisor recommended I pursue my PhD at Columbia University. My doctoral advisor was a principal founder of computational photography and a member of all three US national academies.

In 2011, during my fourth year of PhD studies while interning at NVIDIA, I received an invitation from Stanford computational photography legend Professor Levoy. He wanted me to join him at Google X to start a new project team, later named Gcam (for Google Camera).

At the time, I had about a year left before graduation, and my original plan was to pursue an academic career after completing my PhD. Many things happened in between, and I ultimately accepted the invitation to join Google X, transforming from a pure researcher into a researcher-plus-engineer. I'm deeply grateful for my PhD advisor's active support and flexibility, which allowed me to work full-time at Google X while still completing my dissertation defense a year later. So I entered Google X without a normal interview process and received full PhD-level compensation without having actually graduated yet, working four days a week with full-time pay. Professor Levoy sat to my left, with renowned AI professor Sebastian Thrun behind me, and I'd frequently run into Sergey Brin wandering around, leading to many interesting discussions. It felt somewhat surreal at the time. This experience gave me lots of inspiration about taking unconventional paths.

Over four years at Google X, I gradually shifted toward industry. I worked on the Google Glass project, where the biggest problem was that the camera was tiny and the processor weak, resulting in very poor video and image quality. Building vision algorithms on top of that was difficult, so we needed an entire technology stack. We essentially redefined the whole image pipeline — a very typical engineering and research combination. We'd capture 6-10 photos at once, then quickly fuse them together, reducing noise and improving quality by a whole level.

But we had to do this without users knowing — they thought they took one photo, but actually captured 6-10, with lots of vision algorithms, processing algorithms, and camera control in between. Later, all Android phone manufacturers had to follow our Google standard, which became the Android Camera2 API, one of the foundational Android system components.
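The noise benefit of fusing a burst can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the Gcam pipeline: the frames are assumed to be perfectly aligned, the noise is assumed to be independent Gaussian noise, and the fusion is a plain per-pixel average. The function names (`capture_burst`, `fuse_frames`, `residual_noise`) are hypothetical. Under these assumptions, averaging N frames should cut the noise standard deviation by roughly a factor of the square root of N.

```python
import random
import statistics


def capture_burst(true_signal, num_frames, noise_std, rng):
    """Simulate a burst: each frame is the true signal plus independent Gaussian noise."""
    return [
        [pixel + rng.gauss(0.0, noise_std) for pixel in true_signal]
        for _ in range(num_frames)
    ]


def fuse_frames(frames):
    """Fuse frames that are assumed to be aligned already, by per-pixel averaging."""
    count = len(frames)
    return [sum(pixels) / count for pixels in zip(*frames)]


def residual_noise(image, true_signal):
    """Standard deviation of the per-pixel error against the noise-free signal."""
    return statistics.pstdev([a - b for a, b in zip(image, true_signal)])


if __name__ == "__main__":
    rng = random.Random(0)
    scene = [100.0] * 10_000          # a flat gray "scene" of 10,000 pixels
    burst = capture_burst(scene, num_frames=8, noise_std=10.0, rng=rng)
    print("single frame noise:", residual_noise(burst[0], scene))
    print("fused burst noise: ", residual_noise(fuse_frames(burst), scene))
```

A real handheld burst also needs frame alignment and a merge that tolerates motion, which this sketch leaves out entirely.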

What's interesting about this: an engineering effort could improve something's efficiency by ten or even a hundred times, without introducing major new hardware — we achieved this through pure software. That gave me enormous inspiration.

I left Google in 2015 to start my first company, working on VR applications with the core goal of solving teleportation — experiencing Shanghai while physically in Beijing, for instance.

This required solving data transmission, generation, and rendering, with data volumes 10 to 100 times larger than typical video. It later became a B2B SaaS serving carriers like AT&T and Verizon. There were many technical breakthroughs, but commercial success was limited, constrained by upstream and downstream VR hardware and content. The market simply didn't have that much VR demand. In retrospect, it was a bit too technology-driven, not truly starting from user needs.

I returned to China in 2021 for my second startup, Vozo. I felt this company would be more balanced — valuing business opportunities more and truly starting from demand. We're very careful to verify whether demand is real or just something we imagined.

Q: Looking back, what do you see as the reasons your first startup didn't meet expectations?

Changyin Zhou: We were doing video processing — taking camera-captured video data, processing, generating, and streaming it, then decoding, rendering, and displaying it on the viewing end for an end-to-end experience. I think the core issue was our position in the ecosystem. What we were doing required dependence on upstream and downstream players. Upstream, we needed good capture devices to collect VR raw data for us to process. Downstream, we needed to deliver to headsets, which depended on headset installed base. So we were stuck in the middle.

We also had lots of wishful thinking — believing that since our experience was so great, surely many companies would make great upstream cameras and many people would buy headsets. We'd even cite interesting curves predicting headset installed base growth. But looking back, neither of those things happened (rapidly growing VR content and VR headset installed base).

So you can't be too idealistic. Just because we do computation and rendering well doesn't mean we can push upstream and downstream to improve. The industry has its own commercial logic, and one company's influence is generally very limited.

But for entrepreneurship, you can't wait ten years for upstream and downstream to get better. The entrepreneurial window is roughly two to four years, five at most. So my lesson was: don't be too far ahead. It's better to address market demands that are already more or less formed — don't lead the market by too much.

Q: What motivated you to start your second venture in 2021?

Changyin Zhou: First, I find entrepreneurship very interesting. Second, I felt there were some regrets from my first startup, particularly the failure to start from user needs, and I hoped to do better the second time. So I've carried that idea through this venture, and I genuinely enjoy the entrepreneurial process.

When I returned to China in 2021, I went to Hangzhou and spoke with all the major MCN players there. I found they had all kinds of needs, mostly around image and video.

This was a stark contrast to doing VR. With VR, you'd build the technology and product, then go around begging people to use it. But with short video and live streaming, there was massive demand with no suitable technology or products to meet it. So I thought we could do something here, and started my second company.

02

Starting from user needs, Vozo exploded upon launch with 6 million global users, using AI to make video expression easier for everyone

Q: What universal needs did you see at the time?

Changyin Zhou: We saw several categories of needs. One was short video production, another was live streaming. Live streaming seemed to have stronger demand at first — several well-known MCN companies were building massive live streaming centers, with hundreds of streaming rooms in one building, each with three Sony cameras, each camera needing a cameraman behind it, plus a director, everyone wearing headsets, with camera one, camera two, cables all over the floor. This scenario sounded very hard to scale.

I thought we could automate the cameras, with just one person controlling everything. We spent about half a year building an interesting live streaming camera — including a wide-angle camera and two lenses, with control algorithms behind it. When one lens zoomed in on your face, the second lens would zoom in on your hands as backup, automatically cutting between them based on understanding when to show hands versus face. But we later realized that having demand alone wasn't enough — we had to consider commercial viability. Top streamers already had lots of people serving them and enjoyed being the center of attention. Long-tail streamers could just use their phones. That only left mid-tier streamers, not enough and not stable enough, so we canceled this direction.

But we gained lots of insights. Because we had the streaming camera, we could have very close exchanges with MCN organizations. When you put your streaming camera in their room and discuss needs, you learn what else they normally do, deepening your understanding of the entire short video industry. So we later decided not to do hardware, took just the software out, and that became our subsequent product, which we've continued building to this day.

Q: What was the company's positioning after the pivot?

Changyin Zhou: After the streaming camera, we saw lots of short video production needs. We realized this was different from before — previously, video creators were professional editors. In the short video era, many KOLs, KOCs, e-commerce sellers weren't video production professionals. Their video skills were about the same as ours, normal people's video skills. So they had lots of problems making short videos.

We thought about how to let ordinary people naturally tell stories, express emotions, introduce products, or other content through short video. I found this very interesting, and something that could impact many people. So we decided to stop purely pursuing technical coolness, and instead build a tool that everyone could use easily, and make it the best.

We did extensive user research and discovered many interesting needs. One was "can't remember lines." It sounds small, but unless you're a professional host or announcer, almost everyone struggles to remember lines. This requires take after take, which is crushing because each take needs full emotional investment. You finally get the emotion right but forget the lines, and you have to start over.

We built a mobile teleprompter based on speech recognition models that would scroll according to speaking speed. We finished it in a month. But users weren't very satisfied — they found many issues: accents, environmental noise, the teleprompter freezing, etc.
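A speech-following teleprompter of this kind can be sketched as a pointer into the script that advances as recognized words arrive. The code below is a minimal illustration, not Vozo's implementation: the matching rule (look a few words ahead for the word just heard, and hold still if nothing matches) and the names `advance` and `follow` are assumptions made for this example. It does show why accents and noise matter, since a misrecognized word simply fails to move the pointer.

```python
def advance(script_words, position, heard_word, window=5):
    """Return the new script position after one recognized word.

    Searches the next `window` script words for the heard word. A match moves
    the pointer just past it, so skipped words are tolerated. No match is
    treated as a misrecognition or filler word and the pointer stays put.
    """
    target = heard_word.lower()
    for offset in range(window):
        index = position + offset
        if index >= len(script_words):
            break
        if script_words[index].lower() == target:
            return index + 1
    return position


def follow(script_words, heard_words, window=5):
    """Feed a stream of recognized words and return the final scroll position."""
    position = 0
    for word in heard_words:
        position = advance(script_words, position, word, window)
    return position


if __name__ == "__main__":
    script = "welcome to our channel today we review a new camera".split()
    # "two" is a misrecognition of "to" and "um" is a filler word.
    heard = ["welcome", "two", "our", "channel", "um", "today", "we", "review"]
    print("scroll position:", follow(script, heard), "of", len(script))
```

Exact word matching is the weakest part of this sketch: repeated common words can cause false jumps, and a production system would more plausibly align on recognizer confidence or phonetic similarity.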

Q: What were the main problems, and how did we address them?

Changyin Zhou: Mainly technical issues. Users' Mandarin wasn't standard, noise was heavy, and echo was strong, all of which caused speech recognition problems. Because users weren't professionals, we discovered that what users considered "no noise" and what we considered "no noise" were completely different. So we had to solve these problems: collecting data, building models, and making the models small enough and the latency short enough, below 100 milliseconds.

This is one example; there were many other problems ordinary people might encounter. It started somewhat experimental, but after optimization, user satisfaction, payment rates, and renewal rates were all very good. We built more features around the teleprompter, including subtitles and auto-editing, attracting more and more users. We currently have over 6 million global users, with excellent payment rates, renewal rates, and user feedback. Our user ratings are extremely high both domestically and internationally — we often share user reviews with our team.

Beyond technical issues, we also had to think about what to build. As mentioned, this demographic wants many things — which to do, which not to do, which to do first, which later, how to design the UI. This also gradually evolved into our later product.

We have private community groups domestically, roughly 50,000 people. Their feedback lets us gradually explore more possibilities. For example, users wanted to fix things they said wrong, even make their voices sound better, improve their image, change story A to story B, convert Chinese to English — slowly extending. We eventually found that SaaS was the best format, since frequent users preferred SaaS. So we started the SaaS project last year, and Vozo officially launched this July. This product carries our accumulated understanding of this demographic's needs over the past few years.

Q: What's Vozo's product positioning, and what value does it provide users?

Changyin Zhou: Vozo aims to let everyone express themselves through video more easily with AI. Internally we call it "video freedom": the hope that everyone can conveniently use video to tell stories.

When choosing what to build or not build, we have three core criteria. First: the demand must be real, with market scale. Second: it must be significantly different from mainstream existing products, like Adobe Premiere or CapCut. Third: it must align with our main direction — expression tools for non-video professionals. With clear criteria, we defined our direction, developed and iterated, until launching this July.

Q: Who are Vozo's main target users?

Changyin Zhou: Initially some SMBs and prosumers, but now we're finding more enterprises. Our product positioning is Vozo Rewrite. Rewrite has many different scenarios: you have an ad you want to change to different styles, with different openings and endings; or you have a corporate promotional video that was formally and professionally presented, and you want to make it more upbeat; or you discover your company logo changed in the final second; or you spoke Chinese and want to change it to English.

But after launch, we found two categories dominated: explainer videos and translation. Our experience is decent now, at least passing our own threshold, so we're focusing on several scenarios. So the profile is more enterprise marketing, advertising departments, and some content and education training companies. This isn't a traditional vertical — it's similar departments across various industries with similar needs.

Q: Why launch Vozo this year, and what drove that decision?

Changyin Zhou: I think market changes over these three years have been quite significant. Video translate, or video rewrite, technically couldn't have been done well even a year earlier. And we somewhat went down some detours. Our teleprompter got us lots of data, so we did lots of training. Right when DALL-E came out at the end of 2022, we found diffusion and video generation very exciting, so we went down another fork, throwing user needs out the window (laughs). But it was also during that year that we somewhat accidentally developed capabilities in generative models, speech generation, and lip synthesis.

In 2022 we actually built something similar, but after internal evaluation, it couldn't meet user expectations. A year later, when we returned to our main path, generative AI could solve our original problems. The two paths diverged and then came back to the original problem, which is quite interesting.

The technology breakthroughs feel quite comprehensive. Video translate is a very integrated thing. First, speech recognition has been revolutionized in the past two years, and current solutions are vastly better than before. Second, translation, where large language models matter greatly. Previous translation wasn't very smart and required human calibration. By then, LLMs had solved this, and with some of our own fine-tuning we achieved fairly good overall results. Third, voice cloning and voice generation, which is actually quite difficult. In the industry, we, ElevenLabs, and several major companies are working on this. Emotional realism also only saw major breakthroughs in roughly the past year. Fourth, lip sync: we published a paper in 2022, but the gap to actual productization was still quite far. After about half a year to a year, it slowly became a productized project.

So from speech recognition, voice cloning, TTS, generating lip movements, to generating facial movements — a series of problems — major breakthroughs happened in roughly half a year to a year. Quite magical.
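The "very integrated" nature of video translation can be pictured as a chain of stages, each consuming the previous stage's output. The sketch below is purely structural and makes no claim about Vozo's architecture: every stage is a placeholder that only records that it ran, and the names (`Clip`, `recognize_speech`, `translate_text`, `clone_voice_and_speak`, `sync_lips`, `translate_video`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Clip:
    """Working state passed from stage to stage (all fields are illustrative)."""
    source_speech: str              # stands in for the original audio track
    target_language: str
    transcript: str = ""
    translation: str = ""
    dubbed_audio: str = ""
    lip_synced: bool = False
    history: List[str] = field(default_factory=list)


Stage = Callable[[Clip], Clip]


def recognize_speech(clip: Clip) -> Clip:
    # Placeholder: a real system would run a speech recognition model here.
    clip.transcript = clip.source_speech
    clip.history.append("speech_recognition")
    return clip


def translate_text(clip: Clip) -> Clip:
    # Placeholder: a real system would call a translation model or LLM here.
    clip.translation = f"[{clip.target_language}] {clip.transcript}"
    clip.history.append("translation")
    return clip


def clone_voice_and_speak(clip: Clip) -> Clip:
    # Placeholder: a real system would synthesize speech in the cloned voice.
    clip.dubbed_audio = f"cloned-voice({clip.translation})"
    clip.history.append("voice_cloning_tts")
    return clip


def sync_lips(clip: Clip) -> Clip:
    # Placeholder: a real system would regenerate mouth movements to match.
    clip.lip_synced = True
    clip.history.append("lip_sync")
    return clip


PIPELINE: List[Stage] = [recognize_speech, translate_text, clone_voice_and_speak, sync_lips]


def translate_video(clip: Clip, stages: List[Stage] = PIPELINE) -> Clip:
    """Run every stage in order; a weak stage degrades everything after it."""
    for stage in stages:
        clip = stage(clip)
    return clip


if __name__ == "__main__":
    result = translate_video(Clip(source_speech="hello everyone", target_language="zh"))
    print(result.history)
```

The point of the chain is the dependency: the final result is only as good as the weakest stage, which is consistent with the interview's observation that the product became feasible only once all four areas had improved.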

Q: As technology continues developing, how will you expand product functionality, and what's the main thread of product iteration?

Changyin Zhou: This requires connecting technology and product. On one hand, you need to predict when technologies on the tech tree truly reach productization timing, assessing your own R&D capabilities and industry advancement speed — this requires strong frontline R&D capabilities. On the other hand, you need to consider existing product needs. For example, we sometimes wonder whether when translating speech we should also translate the face, making it look like an Indian person. But is this demand real? What percentage of users would pay for this? We need to judge this. Or, for the first three seconds of video, we could generate different visuals, but what exactly do users want for visuals — generating an effect from a library, or what? We feel there's something to do, but exactly what still requires talking with customers.

User needs plus the earlier technology prediction — we only develop products when we judge they can overlap. So the path forward is these two things constantly colliding.

Q: Evaluating the authenticity of user needs has always been difficult. Practically, what do you do to get as close as possible to real user needs?

Changyin Zhou: This is a PMF exploration process. In Silicon Valley there are many systematic theories. Actually there's a book I highly recommend called The Right It — excellent book.

I think the most important thing is to have no ego, don't think your own ideas are particularly important. Second, you need to know the industry well enough — for example, you need to know how marketing people work, what their KPIs are, what their daily work is like. So you need to be very familiar with the video production industry. Finally, much of it still relies on subjective judgment. There are also some tactical things, like those in these books — how to do small-scale testing, how to do interviews — these are very tactical things.

Q: After product launch, what exceeded or fell short of expectations?

Changyin Zhou: Quite surprising — after launching July 20, we didn't do any promotion, but many users started using our product. To this day we still don't know how many of them found out about us. What exceeded expectations was that most users are fairly satisfied with our product. For example, our translation is relatively fast and accurate, so it seems Vozo is currently one of the more satisfactory video translation products on the market. This was an interesting unexpected gain. We hadn't originally anticipated they'd use our Rewrite for translation — Vozo can use prompts to rewrite video. I had originally imagined prompts like "Rewrite to something..." but many users directly say "Translate to something." I hope through our product iteration, more and more users can use our product. Currently, users produce 2-3 million complete, quality videos monthly using our products (Vozo APP + SaaS). I think this is quite remarkable. While our retention data isn't public, our retention is very good.

Q: As CEO, what are the three most important things for you in the next 1-2 years?

Changyin Zhou: First, I hope to attract some more interesting, excellent people to join. Second, I hope to ensure our product and business direction is correct, not going wrong, having no ego, and following how the market and product should evolve. Third, ensuring our company's cash flow or revenue growth is fast enough. Of course these three have causal relationships — do the first well and the second follows, do the second well and the third follows.

03

With rapid technology driving development, the Video Creation track remains early-stage, with opportunities for multiple companies across segments to build tools 100x better than Adobe

Q: Over the past 2 years, what major trends and changes have you seen in this industry?

Changyin Zhou: I think the biggest change is technology evolution, which enables experiences that previously couldn't be improved and thus creates many opportunities. Looking at Adobe, you'll find many features where you can build a very different, much better experience. Basically every Video Creation scenario can be redone, and platforms that completely surpass Adobe can be built.

Take comics-style video, the anime type of film. Previously you might draw in Adobe and then edit, but if you start from the end need, cartoonists or publishers who need to make anime video, you can build a completely different software system with very high anime production efficiency. The same goes for advertising video or PPT video: very different things can be made. You can take every video category in turn and imagine what could be built.

Q: You mentioned technology is developing very fast. From a competitive perspective, how do you view the industry's current stage?

Changyin Zhou: I personally think it's quite early. Real market competition doesn't really exist yet. Looking at current industry research reports on Video Production, it's roughly a $100B market — hiring people to make video, buying software to make video. And this $100B I think is very small. There are so many video creation scenarios, so this $100B is seriously underestimated — it'll be 10-100x larger in the future. In 3-5 years, many people like us, students, and people from all industries will use video to tell stories. The market will be much larger than before.

But currently the industry basically still uses Adobe Premiere (PR) and CapCut. Whether you do translation or advertising, these few programs are used for everything. They can indeed do anything, but everything is particularly difficult to do, with poor results. For example, I count as a semi-professional video producer, but if I haven't used PR for two weeks, I don't know how to use it...

I don't think the future is like this. In video expression, every scenario is a fairly large market — translation, digital humans, advertising, etc. Now is a time of transformation. In the coming 3-5 years, many different video tools will likely emerge serving each scenario, with experiences 10-100x better than before, and markets 10-100x larger. I don't think competition really exists yet — everyone is going after blue ocean markets, not yet at red ocean.

Q: In the Video Creation track, how do you see the long-term competitive landscape?

Changyin Zhou: This is actually hard to say. After segmentation, many possible forms exist — will digital generation and translation be separate tracks, or combine into one? Will one company eventually have a product matrix covering all video scenarios, or will one or two companies dominate each scenario? But my current observation is that different product types may be hard to combine into one product, because different products have very different user experiences, and the underlying logic of the entire product service is quite different.

So I'd bet that in the future there will be different products in different tracks — just whether these products are provided by the same company or different companies is the question. At this point, what's more important is paying attention to who develops faster over the next two to three years.

Q: What's the current AI penetration rate?

Changyin Zhou: Basically zero, very small. What we do with teleprompters or AI subtitle addition is very basic functionality. You might think these basic AI features should already be quite widespread, but in video production most people haven't heard of AI teleprompters, don't know AI can help add various dynamic subtitles, and don't know you can edit video by editing subtitles.

Q: Why is penetration still relatively low?

Changyin Zhou: I think it's because there are hardly any decent products. Take our translation: the technology only reached a passing grade around the second half of last year, maybe 60-70 points at most. HeyGen's explosion last year might count as one major push, but things like that probably need to happen many more times. I think ordinary people are quite "stubborn." Don't overestimate AI technology's influence; it needs a very long time to develop.

Q: Over the next 3 years, what are your expectations for AI technology progress? What technological changes might significantly impact video editing?

Changyin Zhou: I think it divides into two parts: research and engineering. Based on existing research, there's still much optimization possible on the engineering side, especially non-framework optimizations — there's actually quite large room for improvement here, possibly with continuous improvements for three to five years, hoping to lead other companies by one to two years.

In the next three years, I think there may be some breakthroughs in AI's underlying technology. However, these breakthroughs probably won't be led by our team, but by companies like OpenAI or Google. Currently, multimodal system design still has some obvious deficiencies and problems, so I hope in two to three years these areas can see major progress. For the engineering work we're doing now, I hope we won't encounter fundamental research "ceilings" in the next two to three years, and can continue pushing forward.

Q: Finally, a few small questions about you. What did you expect of yourself 10 years ago, and have you achieved it? Standing today, what kind of person do you hope to be in 10 years?

Changyin Zhou: Ten years ago at Google X I worked on quite an interesting project, thinking at the time about enabling people to become some kind of "superhuman." For example, seeing things that are invisible, or when someone asks you a question and you don't know the answer, Google Glass could tell you. This is equivalent to giving people "superpowers" through technology. Obviously it didn't succeed, or perhaps we achieved it product-wise but didn't make it a product everyone uses.

As I continue starting companies, I still hope to make technological or product innovations that expand human capabilities and push human boundaries. For example, originally you couldn't tell stories with video, now you can; originally something took long to explain, now you can make it clear in a minute. I feel I'm still constantly exploring the boundaries of human capability.

Q: What hobbies do you have?

Changyin Zhou: I quite like sports — badminton, rollerblading, skiing, etc.

Q: As an entrepreneur, what channels do you typically use for continuous learning?

Changyin Zhou: Two things. First, engaging with people better than yourself, whether from previous companies or encountered during entrepreneurship. Second, ChatGPT — you can discuss many questions with it (laughs).

For more information about Vozo, please visit its official website: https://www.vozo.ai/
