Diving Headfirst into the North American Music Scene: A Guide to Taking AI Music Global | 5Y Capital's Tavern x ACE Studio's Guo Jing [Podcast]

5Y Capital · July 30, 2024

Humans have always needed music.

For this episode of 5Y Tavern, we invited Jing Guo, founder of ACE Studio. In early 2024, ACE Studio began its push into North America. Within six months, as a pro-sumer music workstation, it had already achieved notable traction overseas. Jing shared his journey and some lessons from going global.

For entrepreneurs eyeing overseas expansion, ACE Studio's story is inspiring. Facing entirely new markets and unfamiliar environments, strategy and methodology may matter less than the audacity to dive in headfirst, the courage to confront challenges directly, and the price you're willing to pay.

Jing also shared his insights on the AI music industry — how AI lowers the barrier to creative tools, allowing more people to participate and spend more time perceiving and expressing art. We've excerpted portions of the interview below; the full podcast is available on Xiaoyuzhou.





Excerpts from the interview:

5Y Tavern: Let's start with introductions — Jing, tell us about yourself and ACE Studio.

Jing Guo: I'm Jing Guo. Our product is ACE Studio, a workstation for professional consumers. Our primary users are music industry professionals and producers who use it to create vocal performances for their music. Recording vocals is expensive and difficult to replicate electronically — if you want a hundred-person choir, you still need real people in a real space doing heavy production work. But today you can open ACE Studio and accomplish that.

We've also found that beyond traditional musicians, many content creators use ACE Studio to write lyrics and create an entirely new content format called "sung commentary" or "sung film reviews." These people are what we call professional consumers — they approach the work with a professional mindset, but make individual purchase decisions. ACE Studio is a classic pro-sumer product. Going forward, we'll integrate more AI music capabilities to gradually evolve it into an all-in-one workstation.

5Y Tavern: Six months ago you were still figuring out overseas expansion from scratch. Yet in just half a year, you've reached several hundred thousand dollars in monthly overseas revenue. That's rapid progress. Where are your paying users coming from now? How did you earn their trust starting from zero?

Jing Guo: Right. The majority of our revenue is now overseas, with 90% in USD. Of that, 70% comes from the US and Canada, with the rest mainly from the UK, Germany, and France in Europe, plus Brazil, Southeast Asia, and other regions.

We initially assumed going global meant succeeding domestically first, then replicating that success overseas as a global product. Since we were building a functional tool, we figured you just translate the interface into multiple languages and that's that. But we've since realized that because we're a pro-sumer product, "even good wine fears a dark alley": a good product still has to be actively put in front of people. Comparing the period before we really pushed overseas with the period after we went all-in, the same product could generate a 100x difference in revenue.

A key insight: whether your product is good or has product-market fit isn't immediately apparent at launch or upon first user contact. You need to keep pushing in a market until you reach a certain threshold to validate that. If you push hard for a long time and still can't get traction, maybe there's no PMF. But initial failure doesn't mean the product doesn't work — there may simply be things you haven't done yet.

5Y Tavern: What was a critical turning point for you?

Jing Guo: We went to the NAMM Show in the Los Angeles area, the world's largest music industry trade show. There we met many industry professionals and Grammy-winning musicians. Before going, our sense was that we hadn't completely ignored overseas: we had an English website and payment systems, and overseas users could download the product. So why were there so few new users and so little revenue? The occasional new revenue still came from the Chinese-speaking world. We lacked confidence about how Westerners thought about this; we assumed they didn't have this need.

But at the show we discovered they absolutely did — we just hadn't formally introduced our product to them, hadn't explained its benefits and user experience. They likely knew nothing about us, or saw us as some random product that their minds would filter out.

When we set up our booth overseas and properly introduced what we did, they found it extremely impressive. Word spread that this was amazing, that no previous product had solved this need. They felt we were disrupting the industry. Americans can be exaggerators, prone to excessive praise, but at least we saw positive signals.

Before that, we'd tried reaching out to many YouTubers hoping for product review videos. I distinctly remember sending 100 emails with zero responses. But after the trade show, some YouTubers started responding — because we were no longer strangers. They'd seen us firsthand or heard from friends that we were decent. We formally introduced ourselves: we're from Beijing, China, and we're committed to entering the US market. People may not instantly trust you, but they at least have a sense that you're real, that you're earnestly doing this. Then we started collaborating, funded some YouTube videos, and growth took off.

5Y Tavern: You've mentioned before that you never passed CET-4 [China's College English Test Band 4] and haven't lived overseas for extended periods. When you first started going global, did you have worries or concerns?

Jing Guo: Definitely. Initially the interpersonal and emotional pressure is far greater than when communicating with Chinese people. You constantly feel your language is inadequate, and you feel this instinct to please. But my experience has been that Americans don't all have great language skills either. America is full of immigrants with accents far worse than mine.

Honestly, a lot of things are paper tigers. Many people, lacking confidence in their language, naturally lack sincerity in communication — that inner drive to genuinely want to do something doesn't come through. If you're just a bit thick-skinned, showing your sincerity with broken English, the results are often quite good. And because you're a non-native speaker, sometimes when you ask very direct questions, people forgive you. My new goal is to reach native speaker level — still extremely far, but I've set that target.

5Y Tavern: Any memorable stories from your global expansion — moments of positive feedback or being trusted?

Jing Guo: At the NAMM Show, people would constantly come chat with us. Some unassuming guys in beat-up straw hats would say, "This is interesting, let me introduce you to my studio partners. Can you come by tomorrow?" We'd agree. It turned out the studio was a legendary Hollywood recording studio under Jay-Z, the East Coast rapper. Quite a few people left detailed contact info wanting to enter our annual membership raffle, and over a dozen of them were Grammy nominees.

But what struck me most wasn't meeting impressive people. It was how much closer everything felt. Los Angeles is essentially global music's central headquarters; the energy density you absorb there is incredibly high.

Many Chinese founders probably feel this way: we toil away domestically without really understanding our gap versus global top-tier standards, and may swing between two extremes. One is blind confidence — thinking we're excellent when our conceptual gap with the world's best may actually be huge. The other is blind insecurity — believing our product lacks competitiveness. I'm gradually realizing that as founders or engineers, our capabilities are often no worse than theirs, sometimes even superior. Our gap lies in judgment — knowing what's right, what should be done. Their vision is stronger because they're exposed to higher information density.

For example, we'd been grinding on ACE Studio for so long with strong engineering across the board. But companies like Suno knew that fundamentally, you build an end-to-end thing — text-to-music is the future — and they were first to do it. This actually pains me: why, after doing AI music for so many years with so many talented engineers, weren't we the first globally to do text-to-music?

I've reflected on this. When Suno was initially brewing this, the world didn't believe in them. I remember months after Suno V2 launched, I was telling people AI music might genuinely not work. I felt Suno V2 had reached its "Midjourney moment," yet six months post-launch, Suno's Twitter had only 4,000-some followers, mostly tech people. I wondered if there was simply no demand for AI music. Later I realized no — it was just slightly below the threshold to break through the market. Slightly better quality, and it would flip.

But why didn't we do this during Suno's lonely accumulation phase? We knew this paradigm, studied their open-source code, evaluated compute costs — and chose not to pursue it. What were we doing then? We were immersed in the domestic atmosphere talking metaverse, virtual avatars — looking back, massive wastes of time and resources. Fundamentally our cognition wasn't strong enough, our technical vision wasn't clear enough. We saw different things, leading us to lose on many detailed decisions.

5Y Tavern: How do you develop market-breaking cognition, especially when you're constantly surrounded by noise? How do you make correct decisions?

Jing Guo: I now believe the simple, plain truth is whether you can access more correct information — whether you can feed your brain reliable training data. Many obstacles block information access. You need to step out of many comfort zones, let the world humiliate you many times, to obtain real information. Often we use seemingly reasonable arguments to convince ourselves and our teams, getting everyone to accept a comfortable rather than correct decision.

The harder thing is this: I don't think there's any such thing as effort or talent. It comes down to one word: price. The price you're willing to pay determines the results you get. We ask, why can he do it and you can't? Maybe you see him coding every day and think you can too. But can you accept the prices he pays? Maybe he's balding. You have a decent physique and a good job now; are you willing to make work your absolute priority if the price is becoming overweight? Or your family falling apart, getting divorced: are you willing to pay that price?

Of course this is an extreme thinking mode; think extremely, but act flexibly. First push the logic to the limit, then see how to maneuver within what you can accept. Take going global: maybe you're doing fine domestically, finances improving, but overseas you're marginal, squeezed in a hacker house, language barriers, American white guys don't take you seriously — are you willing to accept that? Ultimately no one is smarter or works harder than anyone else; maybe they've just chewed more glass shards than you.

5Y Tavern: Did you clearly think through what prices you'd accept before going global?

Jing Guo: You can't think it all through beforehand; sometimes you just dive in headfirst. But we were always going to go global. Even if we're 90 meters from the starting line domestically, and going overseas sets us back to 30 meters, I'll run from 30 meters. Getting clear on this actually makes it less scary.

5Y Tavern: Have you hired locally? Any team configuration adjustments?

Jing Guo: We're still all Chinese, mainly in China, some in Japan. Going fully remote was our first step. Our company has a strong anonymous plus open-source collaboration culture, so it's naturally suited for remote — our own testing shows efficiency is even higher. How to hire there, manage local employees, evaluate people there — that's still unknown territory for me.

5Y Tavern: You've previously held the view that nothing in the world is that hard; the gap between geniuses and ordinary people isn't that large. Including when you first started a company around age 23-24, you taught yourself to code. After these years of entrepreneurship and external changes, has this view shifted?

Jing Guo: It's shifted somewhat. My feeling then was that many things aren't that different, so why do people end up with huge differences? Objectively, humans are self-booting systems — many may never cross certain gaps, even if they're small steps. Their thinking has already fossilized at that point. I often reflect on myself: what are such things for me? Steps that could transform everything, not inherently difficult, but because of my limitations I don't see them or always feel they're hard.

Broadly I still believe many things in the world aren't that hard. But I don't think you can stop at a passionate slogan. You need to break down why many things aren't that hard and what your strategy is. Because many things are low-hanging fruit, first chart a path for yourself. After you've grabbed all the low-hanging fruit, the higher fruit becomes low-hanging. Maybe in a couple of years you'll see you've accomplished many tasks others see as impossible, but each step was achievable for you. Again, it's price: what prices can you accept, or can you change yourself to accept more prices? This ultimately determines your ceiling.

5Y Tavern: So it comes down to your conviction in a vision and your belief in the change it will bring? Why can some people accept greater prices or endure greater pain, while others can't?

Jing Guo: I think it's whether you see the lighthouse. When climbing a mountain, some people encounter a blizzard that obliterates all visibility, and they freeze to death 10 meters from camp. It's shocking — they were just steps away. But once the blizzard blinds them, they don't know if each step forward is in the right direction. You can imagine that feeling — utterly desperate.

The essence of vision is seeing that something will truly happen, and not through logical deduction. In the early years of entrepreneurship, what we called direction and vision were really logical deductions — pretty words written on paper. Now I'm deeply wary of such complex logical deduction. I prefer to rely on intuitive, first-step innovation: there are 10 million or 100 million people with this need — this is something I see, not something I derive. I just need one hypothesis: can what I'm making satisfy this need? This is extremely simple, System 1 intuition. When you can intuitively see this, you gain confidence. You feel that achieving this goal will definitely change many people's lives.

I think good entrepreneurs don't rely on step-by-step logical deduction to plan distant futures. They're more like neural networks — they abstract something seemingly long-cycle and complex into a denser space. What originally required many steps now needs just one, but that one step is massive.

5Y Tavern: For AI-generated music specifically, what's your lighthouse?

Jing Guo: Music is a long-lasting, humble content need with its own special characteristics — not logically derived, but a pattern across centuries of human history. People always need music, and always need people to create it.

We believe that when humans truly create content, the most important part remains the human element, because only people understand people's emotional needs. The only difference or variable is: what method to use? Previously music creation had high barriers, achievable only by select professionals. But going forward, creation becomes more modular, more people can participate and exercise their creativity.

I'm often asked: why would ordinary people create music? My answer: as long as people need to listen to music, there will always be people who need to create it. This is our foundational principle, though with certain risks and uncertainties. We're essentially betting on two things. First, in the foreseeable future, humans remain the mainstream supply for human content consumption. AI is a tool — though its modules and granularity will grow increasingly large, perhaps so large it functions as a vendor, like a music agent you can talk to who helps create what you imagine. But it's still what you imagine; we still believe in the human element. Second, we believe music won't disappear within our lifetimes — though there's uncertainty. Like Peking opera: if humanity no longer needs to hear Peking opera, it ceases to exist.

5Y Tavern: When AI enters any creative field, a common debate is whether it will put many original creators out of work, whether it will replace humans. What's your view?

Jing Guo: Today's AIGC and AGI are two different things. AGI needs to solve AI having all parts of human nature — self-motivation, doubt, reflection, even insight; the ability to form hypotheses and verify them. When these problems get solved, and what AI looks like afterward — many unknowns to slowly reveal.

Before these are solved, AI is merely a tool. I think today's AIGC can be analogized to granularity of operation. Take games: early games like The King of Fighters required pressing seven buttons for a special move; later games like Honor of Kings need just one button. Initially designers instinctively felt things should be difficult to create skill gaps between players — we all went through a process of hand-training ourselves into machines. Later, by encapsulating complex operations into coarser-grained modules — one-button special moves — the tool barrier was lowered, letting us compete on a more equal footing.

Painting initially operated at pixel-level granularity, but may gradually become modular or semantic — simply describing "a person in the middle, sun on the right, lake below," or circling an area to add a fruit basket. Previously creators needed prolonged practice to develop basic capabilities. But AI doing these things means creation will rely less on refined technique and more on higher-level elements like theme, composition, and style. So I don't think we'll lose creators; rather, more may emerge. As operational granularity increases and technical barriers lower, more people can participate in creation, giving creators more space for expression.

You'll find many talents originally buried by their era. Before hip-hop, many current rappers and electronic producers might not have been considered talented because they couldn't enter mainstream music. When rock emerged, The Beatles were initially criticized by traditional musicians as a desecration of music — how dare they attract mass audiences with a few instruments and simple melodies? Because prior jazz was more complex, demanding virtuosic technique; jazz drummers were expected to play different rhythms with all four limbs. Rock broke that. Yet today rock itself has become a high-barrier form; punk, hip-hop, and internet music successively lowered barriers. Is this a degradation of creation, a tragedy for art, or good news for humanity? I firmly believe it's the latter.

I think people shouldn't be arrogant, feeling superior because "I can do this and you can't." Some people were skilled with an abacus, but calculators let everyone reach the same level. This doesn't mean calculators desecrate finance work. What matters is the purpose of your calculation. It's similar with creation: what matters is what you're expressing; today's technology just lets more people express it. This naturally upsets those who've spent years accumulating abacus skills. These vested interests want to erect barriers and say "you're desecrating." That's human nature, but fundamentally I don't agree with it.

5Y Tavern: Regarding creation, the 10,000-hour theory is often cited — I type faster not because I'm more authoritative or a vested interest, but because I put in more blood, sweat, and deliberate practice. Will AI deconstruct this theory?

Jing Guo: When we cite the 10,000-hour theory, it only proposes that genius isn't innate but shaped by environment — it doesn't mean 10,000 hours of blind practice makes you a genius. If those 10,000 hours are blind practice, your skills won't improve; you'll just become a mediocre old hand.

Go to a pickup basketball court and you'll see people who may have played their whole lives, started at thirteen, play every day — but still can't dribble with their left hand. People who actually play well don't just show up and play, mastering one hook shot. They spend time daily on fundamentals: left-hand dribbling, right-hand dribbling, crossovers — all the uncomfortable things. This process is called encoding.

The huge insight for us: what should our 10,000 hours be spent on? If you need 7,000 hours practicing finger dexterity just to operate an abacus, that compresses time you could spend actually thinking about how to do finance. The essence of creation is contemplating human needs, or introspecting your own emotions, finding what you want to express. If everyone can enter the threshold of creation, and their 10,000 hours go toward perceiving art and expressing emotion, then what gets created in the future should be greater.

5Y Tavern: AI lowers the tool barrier, but that doesn't make creation itself easier.

Jing Guo: Because creation is a relative problem, not absolute. Any random earworm today would have been an impressive work in Chinese pop 20 years ago. People's tastes improve. When everyone can casually make things, we'll still mainly hear that small subset of works infused with humanity, with differentiated ideas.


