Recommendation: Medeo Should Rebrand as CapCut (Lite)

葬AI·December 26, 2025

To be specific: Medeo is a trimmed-down version of CapCut.

"CapCut Lite, basically"

Medeo is the AI video agent that's piqued my curiosity most lately.

That curiosity comes entirely from their marketing pitch: "your words become reality," "make videos just by chatting." Influencers try it and they're shocked, they're losing their minds, like bionic Columbuses discovering some electronic New World. But I have to ask: which AI startup product on the market right now isn't "your words become reality"? Which one doesn't let you get things done through natural language conversation? If it can't do that, is it even AI? Yet Medeo is genuinely hot: they raised nearly $23 million, and Xiaohongshu comment sections are full of people begging for invite codes. So I wondered whether the product itself is solid and the marketing team simply botched the pitch. To test this hypothesis, I made a few videos with Medeo and chatted with their team.

I clicked into Medeo's homepage and saw their showcase wall filled with animated short film examples, so I thought I'd make a satirical Rick and Morty episode about U.S. visa policy, to see how it stacked up against OiiOii, which I'd discussed before. But I felt something was off the moment it generated its task list:

We're making an animated short, and you generate the voiceover first, then create the characters, then run image-to-video on every shot simultaneously? How much of a mess is the final cut going to be? Sure enough. Let's set aside video quality and narrative logic for a moment. The character voiceover was pre-made TTS: completely flat, no emotional inflection, no verbal fillers. Rick's signature burp was read aloud in newscaster diction as the word "burp." You'd think this was a tutorial on mastering the uncanny valley of human-machine interaction. And probably because of this voice-first, visuals-second logic, not only did the lip-syncing fail, sometimes the lips didn't even belong to the right face. Ventriloquism, is it?

I fed this back to Medeo, and their team's suggestion was: use the recipes on the homepage — that is, templates — and this happens less often.

After carefully studying all the recipes on their homepage, I discovered they operate on two distinct video generation logics. For purely story-driven videos, like the Rick and Morty I just attempted, if you use a recipe, Medeo directly calls Sora2 to generate multiple clips and stitches them together, with all voiceovers also generated wholesale by Sora2. But when you want to make explainer videos, tutorial videos, or commercials, Medeo switches to another logic: first generate copy and voiceover, then generate corresponding still images, animate those images, and finally stitch the video together.
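For the engineers in the audience, here is roughly how I picture those two logics. This is only a sketch of my reading of the behavior, not Medeo's actual code: every step name below is a placeholder I made up, and the functions just return the order of steps so the difference is visible.

```python
from typing import List

# Hypothetical step names for illustration only; none of these are real Medeo APIs.

def story_pipeline(n_clips: int) -> List[str]:
    """Story recipe: the video model returns clips with audio baked in, then stitch."""
    steps = [f"sora2_clip_with_audio[{i}]" for i in range(n_clips)]
    steps.append("stitch")
    return steps


def explainer_pipeline(n_shots: int) -> List[str]:
    """Explainer recipe: copy and voiceover first, then stills, then animation, then stitch."""
    steps = ["write_copy", "tts_voiceover"]
    steps += [f"still_image[{i}]" for i in range(n_shots)]
    steps += [f"animate_image[{i}]" for i in range(n_shots)]
    steps.append("stitch")
    return steps


if __name__ == "__main__":
    print(story_pipeline(2))
    print(explainer_pipeline(2))
```

In the first pipeline the audio and the picture come out of the same model call; in the second the voiceover exists before a single frame does. That ordering is my guess at why lip-sync breaks when the two get mixed.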

Since the product is in a transitional phase, these two logics tend to bleed into each other, so recipes are needed to separate them. I reluctantly accepted this explanation, followed the great guidance, and selected their newest, front-and-center recipe: Realistic Film. After clicking, I received a fill-in-the-blank exercise.

I entered: Create a short film in the style of [Tarantino] about [the CCTV Spring Festival Gala]. Let's see what surprise I'd get when a Western auteur meets a patriotic brief. The film generated quickly. No aesthetics of violence, no Tarantino-esque plot design. But at least it used high-saturation color grading and included a five-second close-up of feet. It tried, I'll give it that.

Some film studies experts hold differing views

And indeed, as their team claimed, using a specific recipe at least solved the voiceover and lip-sync issues. After all, it uses Sora2, which can do lip-sync natively. But Sora2 has side effects too. The generated video has incoherent plotlines, the footage falls apart under scrutiny, and the subtitles are a mess. So I think the Medeo team is pretty clever. Their own showcase example for this recipe is a film in what looks like a hybrid of Wong Kar-wai and Lou Ye: all empty shots, people opening their mouths without speaking, shaky handheld camera. Vibe editing: what we're mainly generating is vibe, atmosphere. You can't spot any product flaws from the film itself. Take it up with the director if you have complaints.

For narrative shorts, I have to say it was somewhat of a fail. I couldn't see any difference from generating with Sora2 myself and stitching clips together. So I decided to try other recipes — explainer videos. Since this is an overseas product, I generated a video for rednecks: explaining why the U.S. moon landing was faked by lizard people.

This time there was something of a pleasant surprise. Though there were still minor issues like made-up numbers, overall the video felt highly finished. It works. I sent just one sentence, and Medeo not only auto-generated conspiracy theory copy but also thoughtfully paired it with a suspenseful voice. After exporting, it felt ready to post straight to Douyin to launch a content farm account; post for a month straight and my relatives back in Texas would probably see it and hit like. If this could be mass-produced, you could get rich off mid-length video revenue sharing alone. So why haven't the Douyin content farm operators jumped on this windfall yet? One look at my account balance and I understood.

Generating a one-minute video with Medeo costs about 200 credits, roughly 80 RMB. And that's just for one generation — you need to understand, Medeo emphasizes it's not a "one-click video generation" product, but one where you iterate through conversation. So that 80 RMB video needs revisions, and revisions cost money too. AI products charging money isn't the problem. But when I spend 80 RMB making a video, where do I post it to earn 80 RMB back? That's the problem.

More to the point, after seeing this price I decided to go back to the old ways: manual non-linear editing. But when I opened CapCut, AI Text-to-Video stared back at me in bold letters.

This really isn't a CapCut ad

I sent the same copy to this AI Text-to-Video feature. The resulting video was as follows.

I'll admit the video wasn't as polished as Medeo's, but for stick-figure explainer videos, the visuals themselves matter less than the text content — as long as it moves and is watchable, it's fine. And CapCut generates such a video for just 4 RMB. What can I say?
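To put numbers on it, here is the back-of-the-envelope math using only the prices quoted in this article. Two things are assumptions on my part: that the CapCut video is comparable in length to the one-minute Medeo video, and the `revision_fraction` parameter, since I don't know what a revision actually costs relative to a full generation.

```python
MEDEO_CREDITS_PER_MINUTE = 200  # from this article, approximate
MEDEO_RMB_PER_MINUTE = 80.0     # from this article, approximate
CAPCUT_RMB_PER_VIDEO = 4.0      # from this article


def medeo_total_rmb(revisions: int, revision_fraction: float = 1.0) -> float:
    """One generation plus paid revisions.

    revision_fraction is an assumption: the cost of one revision
    relative to a full generation (1.0 means a revision costs the same).
    """
    return MEDEO_RMB_PER_MINUTE * (1 + revisions * revision_fraction)


rmb_per_credit = MEDEO_RMB_PER_MINUTE / MEDEO_CREDITS_PER_MINUTE
price_ratio = MEDEO_RMB_PER_MINUTE / CAPCUT_RMB_PER_VIDEO

if __name__ == "__main__":
    print(f"RMB per credit: {rmb_per_credit:.2f}")
    print(f"Medeo vs CapCut, single generation: {price_ratio:.0f}x")
    print(f"Medeo with 2 half-price revisions: {medeo_total_rmb(2, 0.5):.0f} RMB")
```

That works out to about 0.4 RMB per credit and a 20x gap on a single generation, before any revisions.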

Of course, Medeo's selling point isn't using AI to generate video, but "your words become reality" and "make videos just by chatting" — something CapCut can't yet match. But I don't see the point. For example, when a video is done and the user finds the BGM too loud, completely drowning out the voiceover, they have to specifically type "the BGM is a bit loud" for it to slowly adjust.

But dragging a volume slider with a mouse takes less time than typing that sentence, no? Can anyone tell me which editing function in Medeo currently can't be accomplished with a single mouse click or drag? Must we be gentlemen who move our mouths but not our hands?

Stepping back, if you're really doing conversational editing, you need to demonstrate automation. For instance, if I lengthen or shorten a shot, the copy and voiceover should lengthen or shorten accordingly. If I change a line of copy, the visual content should adjust too. If I swap out one shot, other shots should change accordingly, or at least suggest changes. None of this exists now.
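What I mean by automation is something like the sketch below: every edit returns the follow-up work it implies, instead of leaving the rest of the timeline untouched. This is a toy model of my own, not anything Medeo ships; the `Shot` class, the function names, and the speaking rate of 2.5 words per second are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

WORDS_PER_SECOND = 2.5  # assumed narration speed, purely illustrative


@dataclass
class Shot:
    copy: str          # narration text for this shot
    duration_s: float  # shot length in seconds


def resize_shot(shots: List[Shot], index: int, new_duration_s: float) -> List[str]:
    """Change a shot's length and return the follow-up tasks that change implies."""
    shot = shots[index]
    shot.duration_s = new_duration_s
    budget = int(new_duration_s * WORDS_PER_SECOND)
    n_words = len(shot.copy.split())
    tasks: List[str] = []
    if n_words != budget:
        verb = "shorten" if n_words > budget else "lengthen"
        tasks.append(f"{verb} copy of shot {index} to ~{budget} words")
        tasks.append(f"regenerate voiceover for shot {index}")
    return tasks


def edit_copy(shots: List[Shot], index: int, new_copy: str) -> List[str]:
    """Change a line of copy; visuals and voiceover follow, neighbours get a suggestion."""
    shots[index].copy = new_copy
    tasks = [
        f"regenerate voiceover for shot {index}",
        f"regenerate visuals for shot {index}",
    ]
    for j in (index - 1, index + 1):
        if 0 <= j < len(shots):
            tasks.append(f"review shot {j} for continuity")
    return tasks
```

For example, stretching a two-second shot with five words of copy to four seconds would come back with "lengthen the copy" and "regenerate the voiceover" as pending tasks, which is exactly the kind of thing a conversational editor could then offer to do for me.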

The wildest part: when I told Medeo "make the shot cooler in tone," it turned out to have no color grading or white balance tools at all, so it just solved the problem by generating an entirely new video.

Fam this isn't editing at all 😭

But I still think Medeo is salvageable, so let me end with some patronizing dad advice.

First, abandon the current all-category video generation agent positioning. Otherwise, people will discover your animated shorts don't work, your live-action films don't work — you're just digging your own grave.

Second, admit that what you want to do and can do is a vertical explainer video generation agent. When I talked with the team, they also said explainer videos are their stronger suit. Medeo even has a stock videos feature — pulling from over a million existing clips to match your copy. This barely has anything to do with AI; it's pure explainer video editing demand. I suggest you develop this further and return to your original calling.

Third, watch some purchasing power videos before setting prices. At the current price, I suspect even Americans can't afford it.

Fourth, if you can't integrate more editing functions, users will definitely drag their project files to CapCut halfway through. And once they see CapCut's AI Text-to-Video is that cheap, won't they just collapse? So either iterate the product fast, or immediately grab a Didi to Dazhongsi and play some corporate promo videos made with Medeo for the ByteDance bosses, asking which is more suitable: acquisition or joining the team.

Fifth, Identity V.

Originally wanted to use Medeo to generate an Identity V video to wrap up but ran out of credits 👋😭👋

(Images in this article generated by ChatGPT; text purely human-written)