From "Heretical" Tactics to a Multimodal Future: How Nano Banana Reads Users' Minds | Yunqi Capital Tech π

Yunqi Capital · September 28, 2025

How Much Higher Can Image Generation Models Go?

When it comes to the most talked-about AI image models lately, Nano Banana is undoubtedly the center of attention. Its "character consistency" feature went viral across social platforms, even triggering a new wave of disruption in the image generation race.

But Nano Banana's real value goes beyond simply "working well." How did the team refine the model? What challenges did they face in productization? And where is the real ceiling for image generation models — higher resolution, or better understanding of user intent?

In this episode of "Yunqi Tech π," we bring you firsthand insights and deep reflections from Nicole Brichtova and Oliver Wang, core researchers on the Nano Banana team.

This article is republished from "Founder Park" (excerpted)

Original title: Nano Banana Core Team: Image Generation Quality Has Nearly Peaked, Next Step Is Teaching Models to Read User Intention

Podcast source: Unsupervised Learning: Redpoint's AI Podcast

01 Higher Resolution Is the Most Requested Feature

Host: Before Nano Banana's official release, which use cases did you internally think would take off and excite you the most? Now that it's out in the market, does reality match your expectations?

Nicole: For me, the most exciting thing was actually "character consistency" — seeing yourself in different scenarios. I literally made an entire slide deck with me on a "wanted" poster, me as an archaeologist, basically covering every career I dreamed of as a child. We even created an email template with images of my face, plus other team members' faces, that we could reference when developing new models.

Host: In AI, that's definitely the highest honor.

Nicole: And these images were so personalized. I was incredibly excited. So I had high hopes for "character consistency" from the start, because it lets people "imagine themselves" in entirely new ways that were previously very difficult to achieve. And it did become one of the features users were most interested in.

We found that people would turn their images into "figurines" — that was a very popular use case. But one scenario surprised me: many people used it to colorize old photos, which carries deep emotional significance for users. Things like "now I can finally see what I actually looked like as a baby," or "I can see what my parents looked like back then from these black-and-white photos." That feedback was really heartwarming.

Host: Once a product blows up, you inevitably get flooded with feature requests. What do users ask for most? And what do you think the next milestone for image models will be?

Nicole: The most frequent request we got on Twitter was "higher resolution" — a lot of professional users asked for this. The current model resolution is 1K, and users want it higher. "Transparent background" was another high-frequency request, because it's extremely practical for professional use cases. Those two were probably the most common requests I saw. Beyond that, there was "better text rendering."

Host: Everyone's curious — what drove such a massive performance improvement in the model?

Oliver: I think the reality is actually quite "unremarkable": no single factor determines everything. The key is getting all the details right and continuously optimizing the "technical approach," and the team has been working on this problem for a long time. Honestly, we were a bit surprised ourselves that this model became so successful. We knew it was a good model and were excited to launch it, but we didn't expect the response to be this intense. For example, after we released it on LM Arena, not only did the ratings go very high, but what really caught my attention was how many users flooded the platform specifically to use this model, to the point where we had to keep raising our queries-per-second (QPS) capacity to handle the load. This completely exceeded our expectations, and it was the first time we realized: "Oh, this thing is really special, and a lot of people need it."

Host: I think this is one of the most fascinating things about the entire AI ecosystem: as developers, you have some understanding of what you've built, but only when you push it to market and let the masses test it do you truly grasp its "real potential."

Host: "Nano Banana" has already taken over the internet. Beyond that, what dynamics in AI imaging do you think are worth paying attention to but haven't gotten much notice yet?

Nicole: I think it's the "factual dimension of images." For example, people use Nano Banana to make infographics, or upload photos of Niagara Falls and have the model annotate information. As demos, they look pretty good, but look closely and you'll find: garbled text, inaccurate information, repetitions. Not many people are focused on this direction yet, but I believe it will keep improving.

Oliver: This is actually very similar to how LLMs developed. When GPT-1 and GPT-2 first came out, people thought they were "interesting" and used them for creative content — tasks where the "acceptable answer range" is very broad. But now, people rarely use LLMs for creativity anymore; they use them more for "information queries," "conversations," and even "emotional companionship." I think image models may undergo a similar transformation: from "creative tool" to "information query tool," and in the future people might even have conversations with video models "when they need companionship." This kind of trend could very well emerge.

Nicole: And models should become more "proactive." Right now, users have to actively say "I want to generate an image." But what if the query itself "needs visual assistance"? We're already used to this kind of "proactive adaptation" in search engines — when you search, the system automatically returns "text + image" or pure image results based on your needs. I look forward to future models being more proactive and intelligent: flexibly using different modalities (text, image, etc.) to interact based on the user's question.

Host: Is there a story behind the name "Nano Banana"?

Nicole: We have a PM on our team named Nana. She was working overtime until 2:30 AM for this launch, and that's when she came up with the name. It sounded fun, so everyone just kept using it. Now it's even become a "semi-official name" — after all, "Gemini 2.5 Flash Image Model" is a bit of a mouthful.

Host: Yeah, the name is a huge success. Even Google executives are posting banana emojis on Twitter — shows how much it's "stuck in people's minds."

Nicole: If there's one branding insight, it's that "the name should ideally pair with a good emoji" — that makes it much more memorable.

02 From Toy to Productivity Tool: Integrating LLM "World Knowledge"

Host: You've solved the major challenge of "character consistency." In your view, what will be the next frontier breakthrough for image models?

Oliver: I think the most exciting thing about this model is that you can start making "more complex requests" to it. Before, you might need to describe the image details you wanted very clearly, but now you can "ask for help" like you're talking to an LLM. For example, someone might use it like this: "I want to rearrange my room but don't know how, give me some advice" — and the model can come back with reasonable suggestions like "based on your room's color scheme, these furniture pieces would work well."

What's really interesting to me is how to incorporate the "world knowledge" from LLMs into image models, so that generated images can actually help users — showing them solutions they hadn't thought of, or answering their "information query" needs. Like when a user asks "how does this thing work," the model can directly generate a diagram annotating "this is how its mechanism works." I think this will be a very important application direction for these models going forward.

Host: How much can image models benefit from advances in LLMs? And as LLMs continue to evolve, will this benefit persist?

Oliver: They can definitely benefit, and it's almost 100% thanks to LLMs' "world knowledge." Actually, the formal name of this model is "Gemini 2.5 Flash Image Model" — "Nano Banana" is just a more fun nickname.

Oliver: I even wonder how much of our success is because "Nano Banana" is so catchy. But it is indeed a Gemini family model, so you can talk to it like you talk to Gemini, and it understands everything Gemini understands. I think integrating image models with language models is a crucial step for improving the model's practicality and functionality.

Nicole: You might remember that two or three years ago, if you wanted a model to generate an image, you had to describe it very specifically — "a cat sitting on a table, the background looks like this, the colors are these." But now you don't need to go through all that trouble, largely because language model performance has improved dramatically.

Host: Yes, you don't need those "sneaky prompt conversions" anymore. The old "trick" was: you input one sentence, and the system would convert it into a 10-sentence detailed prompt to ensure the model generated accurately. But now the model is complex enough to directly understand simple prompts. That's really exciting.

03 Future Interaction Will Be Multimodal — Recognizing User Intent Is Key

Host: From a product perspective, Nano Banana's user base is actually very diverse. There are experts who know exactly what they want to do, and many ordinary users facing the "blank canvas problem." How do you design for these two completely different types of users?

Nicole: First, users on LM Arena, including developers, are very professional — they know how to use these tools and can come up with new scenarios we didn't anticipate. For example, someone turned objects in photos into "holograms" — we never trained for this scenario, nor did we expect the model to be good at it, but it performed well.

For ordinary consumers, "simplifying operations" is important. For instance, when you open the Gemini app now, you'll see banana emojis everywhere — we did this because we found that many people, after hearing about "banana" (the model), went to the app but couldn't find it because there wasn't an obvious entry point before. We also collaborated with creators to showcase some use cases in advance, providing examples that link directly to the Gemini app, where clicking automatically populates the prompt. I think there's still a lot we can do in terms of "onboarding guidance," like providing visual instructions.

Also, when editing images, maybe we could add "gesture controls" so users don't have to rely entirely on prompts. Sometimes even when you want a specific effect, you need to write a long prompt — but that's not natural for most consumers. I use the "parent test" to evaluate products: if my parents can use it easily, then it passes. But we're not there yet, so we still have a long way to go. The core philosophy is really "show more, lecture less": give users examples they can easily replicate, and make sharing simple. As Oliver often says, there's no "magic single solution" — it takes effort on multiple fronts.

Oliver: "Social sharing" is actually the key to solving the "blank canvas problem." When people see what others have created with the model, because the model supports "personalization" by default, they naturally think, "I could try putting myself, my friends, or my pet in there too." This "imitative creation" is how "Nano Banana" spreads.

Host: Currently the interaction is still mainly text-based. In the long term, what other "design interfaces" could make it easier for people to interact with models? What ideas in this space excite you?

Nicole: I think we've only just scratched the surface of "interaction possibilities." Ultimately, I want all "modalities" (text, image, voice, etc.) to blend together into an "intelligent interface" that automatically chooses the most appropriate interaction method based on the task you want to accomplish.

For example, we're already moving toward a future where "LLMs don't just output text, but can output images or visual explanations when users need them." Voice interaction also has huge potential because it's such a natural way for humans to communicate, but no one has really solved "how to integrate voice interaction into user interfaces" yet. We still mainly rely on "text input" — maybe we could combine it with "gestures," like if you want to remove an object from an image, it's as simple as scribbling over it on a sketchpad. How to seamlessly switch between different interaction modalities based on task requirements — that's a direction I'm very interested in, and there's still a lot of room to explore there.

Also, I think the idea that "a short prompt can generate a 'ready-to-use finished product'" is overhyped. In reality, after generating content, you need extensive iteration and refinement. Even the content people share on social platforms requires significant effort behind the scenes to polish into the final result. So this expectation of "one-step completion" is somewhat unrealistic. Future interaction interfaces (UI) are currently undervalued. How to integrate various modalities (text, image, voice, etc.) to make it easier for ordinary people to use these models, understand their capabilities, and adapt the models to specific workflows — the value of this direction hasn't been fully appreciated.

Host: What are the current limitations of "voice interaction interfaces"?

Nicole: I think part of it may be "priority sequencing" — we're still fully focused on improving the model's core capabilities. But voice technology has also made tremendous progress in recent years, so I think someone will soon start exploring "combining voice with image models," and our team may do work in this area too.

I've been thinking about what this interface might look like. I think the core challenge is how to recognize user intent, and how to switch between different interaction modes based on user intent and what they actually want to accomplish, because users' needs are often unclear. And then the interface might end up back at the "blank canvas" state — so how do you show users "what operations are possible"? That's a huge challenge in itself.

We've found that when users interact with chatbots, they always assume it "can do everything," since you can talk to it like you would with a person. But actually, it's very difficult to explain to users "what the bot can't do"; especially when the tool's capabilities are already very powerful, it's not easy to clearly demonstrate "what it can do." So I think the key is to define clear boundaries, to let users know "what operations are feasible" through interface design, and ultimately help them accomplish almost everything they want to do.

04 Real User-Driven Testing Is the Best Way to Evaluate Models

Host: Let's talk about "model evaluation." Beyond putting it on the LM Arena platform for public testing, how do you do routine evaluation in practice? What insights do you have on "how to judge and measure whether a model is good or not"?

Oliver: Actually, the advances in language models and vision-language models have brought one benefit: a "feedback loop" has now formed where we can use the intelligence of language models to evaluate the content they generate themselves. This creates a virtuous cycle that can drive progress in both language models and image models simultaneously — that's very exciting. At the end of the day, users themselves are the ultimate standard for "judging whether an image meets their needs." So platforms like LM Arena, where users input their own prompts to use the model, are actually the best way to evaluate models.

Nicole: "Aesthetics" matter too. Oliver is being modest: he's actually someone on the team with "extremely high sensitivity to image details," able to tell at a glance whether an image looks good and what its flaws are. We have several team members like this. After model training is complete, we first do extensive "manual preliminary screening" to judge whether the model's outputs are acceptable.

Returning to your question about "evaluation methods" — we receive user feedback from many channels (including X) about "which features work well and which don't." We then adjust our evaluation criteria, on one hand ensuring that "features that work well don't regress," and on the other hand focusing our efforts on optimizing the "features that don't work well" that the community wants improved.

05 Aesthetic Demands Are Hard to Meet; Deep Contextual Interaction at the Prompt Level Is Needed

Host: Among the "power users" you've seen, are there any particularly impressive use cases?

Oliver: My personal favorite power user scenario? Most of my career has been in video-related work, so I'm particularly interested in video tools and creative tools. I've found that "Nano Banana" combined with video models like Veo 3 can be a practical tool for making AI-generated videos: it helps you brainstorm ideas and plan shots faster. Interestingly, this mirrors the production workflow in the film industry, where "storyboards" are used first to map out the story and shots; users are now taking the same approach to create more coherent, longer video content.

Nicole: I was surprised that some people use it in "actual architectural workflows." For example, starting from blueprints, first generating something like a 3D model effect (without actually building a 3D model), then further iterating into design drawings. This greatly shortens the "tedious, repetitive parts" of the workflow, letting people focus on the "creative, interesting parts that they genuinely enjoy." And I didn't expect it to work so well "out of the box" for these kinds of scenarios.

Host: So it's like using image models to quickly build a "basic framework" across various domains.

Nicole: Another scenario is "generating website UI through code." Previously, the process from "inputting a prompt" to "generating website code" always felt abrupt to me — there was a missing "iterative design" step in between, with no way to quickly modify the design. But now, we can finally iterate on the design before generating code, and only generate code once we're satisfied.

Host: This is basically the workflow of the future. After all, if the generated code doesn't match your aesthetic or completely misses your expectations, then the compute spent on "generating code" was wasted. This approach makes much more sense.

Nicole: And it's more fun too. As Oliver said, people integrate new technologies into existing workflows, and this process is actually quite natural. Although LLMs have advanced quickly and can now "generate websites directly from prompts," which is impressive, I think spending more time on that "design iteration" intermediate step, ensuring the final result matches one's aesthetic, is more enjoyable for users.

Host: How far along are we in this direction?

Oliver: "Aesthetic" demands are actually quite hard to satisfy, because they require deep personalization to provide useful suggestions. And I think "personalization" itself is still being continuously optimized at the technical level. So we're still some distance from "precisely understanding user needs." But I think through "a few clarifying questions" and "conversation with the model" — which is actually one of the features I most look forward to in models — things will get better and better. You can talk with the model in a chat thread, gradually refining your needs, and eventually get the image you want.

Host: Do you think "personalization" will stay at the "prompt level"? For example, achieved through conversation and context? Or in the future, will everyone have their own dedicated "aesthetic model"?

Oliver: I think it will mostly stay at the "prompt level." For instance, based on personal preferences the user has previously told you about, the model can make decisions that better fit their needs. At least I hope so. After all, if everyone had to have their own model and be responsible for maintaining it, that sounds like a hassle. So this might be the direction things develop.

Nicole: But I do think different people have radically different "aesthetic preferences," and at this level, some degree of "personalization" is essential. For example, when you search for sweaters on Google's "Shopping" tab, you get many recommendations, but you actually want them to "match your aesthetic," and even to "combine with clothes already in your wardrobe" to see which new pieces would go well. I hope this need can be achieved through "the model's context window" — for example, feeding the model images of clothes in your wardrobe and having it recommend styles that would match. I'm very excited about this direction and hope we can achieve it. Of course, perhaps some additional "aesthetic controls" at the "model level" will be needed, but I suspect this will be more applicable to "professional workflows."

Host: So do you think the future will be a general-purpose model that handles all scenarios through precise prompts? Or will there be more specialized models, like ones specifically for "futuristic style" or certain particular styles?

Nicole: I've always been surprised by how broad a range of use cases "off-the-shelf models" can support. As you said, in some "consumer-facing scenarios," like quickly sketching what an item might look like in a room, they already perform very well. But once you get into "more advanced functional needs," like creating final deliverables for marketing or design workflows, you need to combine them with other tools to make the model truly effective and practical.

06 The Key Going Forward Is Improving Models' "Expressiveness" to Raise the Floor on Capabilities

Host: Let's zoom out and talk about the entire "image model field." Since Stable Diffusion and Midjourney emerged, development in this space has been like a rocket ride. What do you think have been the "key milestones" in image generation models over the past two or three years?

Oliver: It really has been a "rocket ride." When I first started working in this field, generative adversarial networks (GANs) were still the dominant approach to image generation. At the time, we were all amazed by what GANs could do, but they could only generate images within a very narrow scope. For example, they could produce decent-looking faces, but only frontal-facing ones. Later, models capable of "generalized generation" and "fully text-controlled" output began to emerge. Initially, they were small in scale and produced blurry images. But even then, we realized: "Wow, this is going to change everything," and everyone started pouring energy into research. Yet no one could have predicted how rapidly it would advance.

I think there are two reasons behind this. First, many top-tier teams were tackling these hard problems. Second, "healthy competition" provided momentum. When you saw other teams release impressive models, it was motivating. For instance, Midjourney was far ahead for a while, with stunningly good results, and we'd wonder, "How did they do that? Why does it look so good?"

Additionally, Stable Diffusion's emergence as an open-source model showed us the "potential of the developer community" — so many people wanted to build new things on top of these models. That was undoubtedly another "inflection point." But honestly, working in this field is both exciting and somewhat "frustrating": on one hand, models are improving at breakneck speed; on the other, user expectations keep rising. Now users complain about "minor issues," and you think to yourself, "My god, do you know how much effort we put into optimizing this model? A year ago, the generated images were completely unrealistic, and people were still amazed." I have to say, human "aesthetic fatigue" with new technology sets in remarkably fast.

Host: Why was Midjourney able to stay "far ahead" for so long in this field? It felt like the industry benchmark for quite a while.

Oliver: I think Midjourney figured out "how to do post-training on models" earlier than other teams, particularly "how to use post-training to generate stylized, artistic images." That was precisely their core advantage: focusing on "letting users control image style" and ensuring that "whatever content was generated, the visual quality would be excellent." At the time, this was crucial: if you could narrow the generation scope to the small domain of "good-looking images," you could achieve much better results within that domain. Starting from "a focus on high-quality stylized images" was an excellent strategy for them. Later, all models, including Midjourney as well as others such as Flux and the GPT image models, began to "broaden their generation scope." Now they can generate a much wider variety of image types while maintaining high quality.

Host: What enabled models to "broaden their generation scope" and no longer be limited to generating curated, high-quality images?

Oliver: There are many reasons. First, we all figured out "what training data should look like." Second, model scale and compute power have grown naturally — things that were previously impossible became achievable simply because "the scale got bigger."

Host: Image models have improved so much, but I'm not sure whether we're "only 10% away from the ceiling" or whether "looking back three years from now, we'll find it laughable that we thought those models were good." What do you think? And the images generated now are already quite good — I can't even imagine what the "next 10x improvement" would look like.

Oliver: I think we still have a long way to go. Setting aside other application scenarios, "image quality" alone has enormous room for improvement. I believe the key advances will come in "model expressiveness": right now, we can perfectly generate certain content, with generated images nearly indistinguishable from real ones. But once you step outside the range of "commonly generated content," image quality drops precipitously. For prompts requiring "more imagination" or "combining multiple concepts," the results are often poor.

So I think future models might follow this trend: "the best image quality now and the best image quality a few years from now may not differ much, but the worst image quality a few years from now will be far better than the worst image quality today."

We'll make models more practical and applicable across broader scenarios. And we've found that the wider a model's applicability, the more use cases users can discover, and the more useful the model itself becomes.

07 In Future Workflows, Traditional Tools and AI Models Will Coexist Long-Term

Host: You provide both models and APIs. How do you determine which capabilities belong in a general-purpose chat tool like Gemini, and which are better left to other specialized products?

Nicole: I think these two types of scenarios serve completely different purposes. We've found that users use Gemini for "rapid iteration." For example, someone on our team wanted to redesign their garden, so they first generated concept images in Gemini to imagine the possibilities, then collaborated with a landscape designer to refine and execute the idea. So Gemini is more like "the first step in creative ideation" — it rarely becomes "the tool for producing the final product."

Advanced users like developers, by contrast, build more complex tools, chaining multiple models together. This is a more sophisticated, intricate "multi-tool collaborative workflow." The advantage of chatbots lies in "helping you kickstart creativity and providing inspiration," while also supporting many "fun, easily shareable" scenarios, like sharing creative results with family and friends. I think this positioning will remain, because users with higher demands will always gravitate toward "more visual" or "more professional" tools.

Host: How should "editing workflows" fit into this? Using AI to generate initial ideas is great, but to take a piece from 95% to 100%, do you think we'll still need to rely on traditional editing tools in the future? Or will the entire workflow transform?

Oliver: I think this largely depends on the user type. Some users have "pixel-level precision requirements." For these needs, we must integrate models with existing tools (like Adobe's various products). Other users need more "inspiration and ideation," with less strict requirements for output quality — for them, quickly generating ideas in a chatbot is sufficient. So both application scenarios are important.

Nicole: Regarding "pixel-level control," I just learned about a case two days ago: when creating ads for different products or brands, the "direction of the model's gaze" significantly affects the message conveyed, because viewers' attention is guided by where the model is looking. This kind of fine-grained control is very difficult to achieve with a chatbot. So for these users and scenarios, "professional tools" and "extremely high-precision control capabilities" will still be needed going forward.

Oliver: Ultimately, the key is "which needs can be clearly described in language, and which cannot." Language is well-suited for conveying "macro-level ideas," but if you want an element to "shift 3 pixels to the left," describing that in language feels awkward. So I believe "traditional tools" and "AI models" will coexist long-term.

Host: Right. If we observe the complete workflow of professional artists or creators, we find that they often struggle to precisely describe their operations in language — much of it is done "by feel." Within Google, which products or business lines are you most excited to see this image model deployed in?

Nicole: I think there are many directions. First, creative domains like "photo apps" — being able to edit directly within the photo library would be very convenient. For example, I have this need a few times every year: turning family photos into birthday cards. If I could do that directly in a photo app, it would be incredibly handy.

Additionally, "knowledge-oriented scenarios" have great potential. Across Google's various products, if a 5-year-old wants to learn about "photosynthesis" but no suitable visual materials exist online, the model could generate custom explanatory diagrams. This would open up many new scenarios and opportunities for "personalized visual learning," since many people are "visual learners."

Oliver: I think "productivity and collaboration (Workspace)" is also a great direction. For PowerPoint and Google Slides, perhaps people will be able to create "more engaging presentations" in the future, rather than the same old "walls of text."

Host: I worked in consulting when I first started my career — I wish I'd had this feature back then. I know all too well the pain of "spending enormous amounts of time adjusting formatting."

Nicole: In the past, making slides meant first drawing storyboards on a whiteboard, determining titles, chart placements ("put this dataset's chart on the left"). If you could feed these requirements to an LLM and have it handle all the tedious work, that would be incredibly exciting.

Oliver: You could even just "take a photo of the whiteboard" and have the model recognize the content.

08 In the Future, All Teams Will Move Toward "Omni Models"

Host: What's the relationship between image models and video models? Are they developed independently, or do they learn from each other? How much interaction is there between these two fields?

Oliver: They're very closely connected. I think in the future, all teams will move toward "Omni Models" — models capable of handling multiple tasks. These models have many advantages and may become mainstream in the long term, though I'm not certain. But what's certain is that many techniques we've learned in image generation will be applied to video generation models, and vice versa. This is one reason why the video generation field has been able to advance so quickly. The industry as a whole has already grasped the approach to solving this class of problems. So I think they're like "close companions" — they'll share many techniques, and may even "merge together" in the future.

Host: By "techniques," do you mean the core technical frameworks behind image and video models are similar?

Nicole: From a workflow perspective, people also frequently "use these two types of models in complementary ways." For example, if you're a filmmaker, early-stage creative iteration often starts with outlining ideas in an LLM, then quickly generating frame images in an image model — faster and more cost-effective — before finally moving into video production. So even from the perspective of "workflow and usability," these two model types are complementary. Additionally, many of the problems they need to solve are similar, such as "consistency" — both images and videos need to ensure consistency in characters, objects, and scenes. It's just more complex in video because this consistency must be maintained across multiple frames.

Host: What do you think are the core problems that the video model field needs to solve next?

Oliver: I think first, video models need to achieve "the same level of controllability as the latest image models." This will have a significant impact on video's development and is a direction worth watching. Second, video teams are continuously optimizing "resolution" and "long-term consistency." Of course, "having the same character appear in multiple scenes" is also one of users' most pressing needs. So the future direction is clear: moving toward "longer, more coherent video content."

Host: Will the market structure of the image model field eventually converge toward what we see in the LLM space, dominated by a few major players?

Oliver: That's a great question. So far, I think there's still room in the image space for "small teams building top-tier models." We've seen quite a few small labs develop really impressive models. I hope this continues, because small-team participation keeps the field more dynamic.

But as I mentioned earlier, the "world knowledge" and "practical utility" of image models really depend on scale effects — especially the scale of LLMs. So my guess is that teams with the capability to train LLMs, or teams that can endow image models with rich world knowledge, may end up dominating the image space as well. We're already seeing some of China's large labs roll out excellent image models, which mirrors the trend in the LLM space. So I do think we'll see a similar concentration of leading players in images going forward.

Host: For image models, would using "the most advanced open-source models" put you at a significant disadvantage compared to using "frontier closed-source LLMs"?

Oliver: That's an excellent question. I think the answer depends heavily on how open-source models evolve — and that landscape is changing very quickly. About a year ago, "going with open source" looked like a pretty safe bet, but now the picture is less clear. That said, I'm not sure about the future trajectory of open-source models either. There's still a very real possibility that open source continues to advance and enables more small labs to train high-quality image models.