Exploring Forward in a World with Open-Source Communities: There Will Be Challenges, But It Will Be Worth It | Riding the AGI+ Wave

Yunqi Capital · July 26, 2023

More, Bigger, Better, More Valuable Open-Source Communities

Over the past year, open-source AI has surged in power. From Stable Diffusion to the just-released Llama 2, open-source forces have been constantly shaking up the GenAI space.

Yunqi Capital has long tracked the open-source direction, leading early investments in open-source companies including PingCAP, Zilliz, Jina AI, RisingWave, and TabbyML. At this year's AGI Playground conference, Yunqi Capital partner Chen Yu was invited to moderate a roundtable, where he joined several industry experts from the open-source world for an in-depth discussion on building open-source ecosystems, the role of open-source in the large model space, and the commercialization and globalization of open-source.

Content source: GeekPark

Chen Yu

Yunqi Capital Partner

Open-source companies with different positioning vary greatly in their operational approach and team building. Open-source companies need to clarify their positioning before commercialization, maintain openness, respect community input, and execute differentiated regional operations.

Below is a summary of key insights plus a transcript of the conversation. Enjoy~

Roundtable Guests:

Chen Yu: Partner at Yunqi Capital, roundtable moderator.

Liu Cong: APAC Lead at BentoML.

Luo Xuan: Co-founder of RWKV and Syrius (炬星).

Yin Yifeng: Machine Learning Engineer at Hugging Face.

Zheng Yizhou: Director of Technical Product at Stability AI.

Zhang Meng: Founder of TabbyML.

"The Llama Moment"

How should we view open-source models?

Chen Yu:

Whether it's RWKV or Stable Diffusion, both have their own open-source models. Big tech companies like Meta just released Llama 2 last week. What's your take on open-source models, or open models in general? Will they pose any challenge to closed-source models from OpenAI or Google?

Liu Cong:

We're a company that helps deploy large models, and recently we've seen overseas customer demand shift rapidly: first deployments based on Llama, then on Falcon, and lately many clients have asked us to help deploy Llama 2. We feel that open-source models are getting increasingly capable, but their use is still mostly limited to private deployments and internal enterprise use cases. General-purpose applications likely still depend on the capabilities of large models like OpenAI's.

Luo Xuan:

I once asked Dr. Qi Lu why OpenAI doesn't open-source, and he was concerned that open-sourcing could lead to misuse. Closed-source might be safer, but we believe you can't keep things closed-source because there's actually no real barrier to entry. Now Llama has open-sourced even better models. Open-source may be the true path toward democratizing access to future AI technology.

Our motivation for open-sourcing was simple: we believed OpenAI's closed-source approach betrayed its original mission. So we started open-sourcing in 2020, and received support from companies like Stability AI and Hugging Face.

Actually, many commercial companies are starting to open-source now, but they're mostly releasing weaker models — their best models won't be open-sourced, and the data won't be open-sourced either. Data is what everyone should be paying attention to. Currently the open-source community is using ChatGPT conversation data, which is a major problem. I think the open-source community should focus more on building datasets, and I hope the entire open-source community can work on this together.

Yin Yifeng:

From BERT to Llama, Falcon, and now Llama 2 — every time such a powerful foundation model emerges, large numbers of people fine-tune it with their own data. This creates a Cambrian explosion of activity every time a strong base model is released to the open-source community. This phenomenon has been named "The Llama Moment."

First, this will definitely impact OpenAI and big tech companies, because open-source is steadily closing the capability gap with closed-source large models. And the biggest advantage of open-source is transparency — the papers are published, the code is released, you can use it with confidence, you know what's inside. With closed-source large models, there are things you simply can't trust.

Second, the most important thing is still data. Because when Llama 1 came out you fine-tuned it, then Llama 2 came out and you fine-tuned it again — you can actually use the same dataset. Models change every day, constantly getting updated and replaced, but building a high-quality dataset serves you well for a long time to come.

Zheng Yizhou:

I'll approach this from two angles.

First, foundation models can be viewed as production tools, and who controls these tools, and what different effects they have in different hands, matters enormously.

In the hands of large companies, closed-source foundation models let large numbers of people build on the tool themselves, which in turn makes the tool more efficient, as we're seeing with GPT-4. Such models will continue to exist, and they are very meaningful for direct-to-consumer applications or applications that aren't particularly critical. This level of performance is something people will always pursue.

Open-source models are public production tools that everyone gets to hold. They're relatively decentralized, without such concentrated resources. People do all kinds of different things with them, pushing in different directions. The benefit is that the ecosystem becomes incredibly vibrant, and within this flourishing ecosystem you see tremendous diversity.

Models in this era aren't merely production tools; they can be a channel for expression, a reflection of your thoughts. If it's a closed-source model, that model doesn't belong to you. Open-source lets you have a model that belongs to you. That holds for Stable Diffusion, where creators customize models with their own artistic style to match their expressive habits, and for text, where I feed my thoughts, my chat history, and all kinds of information into the model. Because this model belongs to me, I don't worry about data ownership or whether my ideas might leak out. A model customized this way becomes a reflection of my own thoughts.

Imagine a fully closed-source world. If you don't own your model, when we truly reach the AGI era, how will you own your thoughts? How will you have a model that embodies you?

Zhang Meng:

Right now there's a quality gap between open-source and closed-source models, but the AI programming use case is particularly interesting.

It's one of the few scenarios where communities and research institutions face no significant data disadvantage against large, well-funded labs such as OpenAI or Google.

That's why over the past 3 to 6 months, beyond general language models, the coding scenario has developed extremely rapidly at the community level. Models like CodeGen2.5, WizardCoder, and Phi-1 have actually surpassed these closed-source models in coding quality.

This is why TabbyML decided to go open-source from day one. As the ecosystem diversifies and models themselves quickly become commoditized, we expect a pluralistic future in which developer tools can be deployed with many model options. In developer scenarios, open-source is also an ideal customer acquisition channel. Looking ahead, especially in coding, downstream use cases are so varied that we believe open-source models will come to dominate, and closed-source models will struggle to catch up.

Open-Source Models:

Smaller and Better

Chen Yu:

Open-source AI has only really heated up over the past year. Are there any projects or landmark events that left a particularly strong impression on you?

Liu Cong:

An important landmark was Falcon's first release. They called it an open-source model, yet the license asked for a 10% royalty. The community and public reaction was strong, and Falcon eventually dropped the royalty and switched to the Apache 2.0 license.

But the recently released Llama 2 license also contains a commercial clause that nobody seems to be discussing: the terms explicitly state that if your monthly active users exceed 700 million, you need to request a separate license from Meta, and they don't specify what that license actually entails. I think this is something the open-source community urgently needs to address: open-source licenses for large models.

Luo Xuan:

Regarding licenses — since we've always been Apache 2.0 open-source and commercially usable — I think Llama still left itself some wiggle room. Meta is ultimately a commercial company, and from what I understand, they hope Llama 2 can attract more development resources and developer ecosystem toward their metaverse efforts.

What I'd rather discuss is another topic: I hope people pay more attention to on-device large models that run on end devices such as phones, computers, robots, and XR. Overseas developers have built things like llama.cpp, and others have built rwkv.cpp for us. This is more relevant to developers and entrepreneurs. As long as large models can run on end devices, the demand for compute power and the barrier to entry drop dramatically. This is a very good thing.

On the other hand, I've noticed that many open-source communities have become much more focused in their goals recently, which is a very positive development. In the open-source ecosystem, if you want to compete against closed-source commercial ecosystems, having clear goals, clear paths, and strong execution is extremely important.

Yin Yifeng:

The hottest project recently has been Llama 2. What's interesting about Llama 2 is the trend we're seeing: a ~70B model can now compete with OpenAI's 175B closed-source model in many areas. This is likely a trend that will continue.

First, OpenAI's model was trained back in 2021, so it hasn't incorporated many of the new techniques and architectures that have emerged in the past two years.

Second, models like Llama have benefited from years of accumulated technical expertise, enabling smaller models to achieve what previously required larger ones. I think the trend going forward is this: take a model that's powerful enough to score 100 points — maybe that was 70B, but it could drop to 50B and still score 100, then eventually 13B could score 100. Models keep getting smaller while hardware keeps getting more powerful, and soon you'll be able to run them on-device. Once they reach the edge, consumer-facing applications can really take off. This is the trend I'm seeing both commercially and technically.
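The link between parameter count and on-device feasibility can be made concrete with rough arithmetic. The sketch below is a back-of-the-envelope estimate only: it assumes memory is dominated by the weights (parameters times bits per parameter) and ignores activations, the KV cache, and runtime overhead. The function names and the chosen precisions are illustrative assumptions, not figures from the discussion.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Model sizes mentioned in the discussion above (70B, 50B, 13B parameters).
MODEL_SIZES = {"70B": 70e9, "50B": 50e9, "13B": 13e9}


def footprint_table(bits_options=(16, 8, 4)):
    """Return {model_name: {bits: GB}} for every size and precision (assumed precisions)."""
    return {
        name: {bits: weight_memory_gb(n, bits) for bits in bits_options}
        for name, n in MODEL_SIZES.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        cells = ", ".join(f"{bits}-bit: {gb:.1f} GB" for bits, gb in row.items())
        print(f"{name}: {cells}")
```

Under these assumptions, a 70B model at 16-bit precision needs roughly 140 GB for weights alone, while a 13B model quantized to 4 bits needs about 6.5 GB, which is within reach of a laptop or a high-end phone.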

Zheng Yizhou:

One very interesting phenomenon I'm observing is that starting with Stable Diffusion, the participant profile in open-source communities has shifted. Previously, open-source participants — especially in ML-related communities — were mostly ML engineers or very technically-oriented developers.

But SD (Stable Diffusion) may have been an inflection point. The open-source community started seeing a large influx of interest-driven individuals and grassroots researchers: people with research capabilities who weren't necessarily from ML backgrounds. This richer, more diverse community began producing things like the edge deployments we just mentioned, llama.cpp and ExLlama, all built by the open-source community itself. This cross-pollination and broadening of community scope is an interesting pattern we're seeing in the AGI era, or the era leading toward AGI.

Zhang Meng:

As an application layer for language models, we're particularly focused on the serving layer for open-source large language models. I'll share two projects we're watching closely. One is Hugging Face's Text Generation Inference, which is now very well-engineered with excellent support and observability. I'd say it's approaching the de facto standard for open-source LLM serving, and it attracts a lot of attention.

The other is a newer one called vLLM (vllm.ai), from Berkeley's Sky Computing Lab. What's surprising is that they seem intent on competing across the full serving layer. Their distinguishing feature is applying memory paging concepts to the attention key-value cache (PagedAttention), which makes continuous batching easier and improves throughput. We hope this serving layer competition stays healthy. From the application layer perspective, that means a better developer experience for us.
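The paging idea can be illustrated with a toy allocator. The sketch below is not vLLM's implementation or API; `PagedKVCache` and its methods are hypothetical names invented for illustration. It only shows the bookkeeping: each sequence's key-value cache lives in fixed-size blocks taken from a shared pool, tracked through a per-sequence block table, so a finished request returns its blocks immediately and a new request can join the batch without needing one large contiguous region.

```python
from typing import Dict, List, Tuple


class PagedKVCache:
    """Toy block allocator (hypothetical): maps each sequence's tokens to fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))  # shared pool of block ids
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: Dict[str, int] = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for one new token and return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse by other requests."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        cache.append_token("a")  # 3 tokens with block_size 2 occupy 2 blocks
    cache.append_token("b")  # a second request shares the same pool
    cache.free("a")  # "a" finishes and its blocks become available again
    print(len(cache.free_blocks))
```

Because at most one partially filled block per sequence is wasted, more concurrent sequences fit in the same memory, which is the property that helps continuous batching.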

How to commercialize? Can it be done well?

Chen Yu:

You all mentioned many commercialization points just now. In the large model era, if you really want to do commercialization well, what are the prerequisites? What are good business models? And as an open-source company, how do you internally balance your commercial version against your open-source version?

Liu Cong:

I'll answer briefly from BentoML's perspective. BentoML is actually a very typical open-source 3.0 company. Version 1.0 was probably Red Hat — selling support and licenses. 2.0 was more the OpenCore model, selling premium features. 3.0 is more like Databricks, tightly integrated with cloud platforms where revenue and usage can be shared with the cloud provider.

BentoML has an open-source framework that helps developers build AI applications and deploy large models. We launched our commercial product last month with very good cloud platform partnerships. You build AI applications with the open-source framework, then deploy to the cloud platform where we help you with serving and scaling. This model is more friendly to smaller companies like us because we can share customer revenue with the cloud platform.

Returning to the fundamental question of commercialization for open-source companies, we believe open-source products need to help developers solve very thorny problems, and you can integrate with cloud platforms along these functional paths — that may be a better development trajectory.

Luo Xuan:

For RWKV, the base model will always be open-source and free for commercial use. We've also established a commercial company as part of the broader open-source ecosystem, which will do vertical-specific optimizations.

In the current large model space, people are still paying for results. For language models that's ChatGPT, for image generation it's mostly Midjourney. Right now the paying customers are basically individuals or enterprises looking for efficiency gains. There will be incremental opportunities going forward, and the incremental piece will come from new computing platforms and new internets emerging. I think there's much more room for imagination there. For now, it's still about efficiency improvement.

Yin Yifeng:

Large models are getting smaller and more powerful. Eventually everyone may want their own large model. But the problem is, you may not have your own hardware to run models on-device.

One business model is: I show you how powerful my model is, you use my model, and I host it for you, essentially Infrastructure as a Service. Hugging Face is doing this too: we'll host your model, and once training is done it simply stays on the platform. So you have the model, the database, and the infrastructure in one place: a one-stop service, with no need to go anywhere else.

If we draw an analogy between large models and the early internet, the next wave of entrepreneurship will be like "internet plus" — internet plus food delivery gave us Meituan, plus shopping gave us Taobao.

Because the internet was a disruptive technology that could disrupt food delivery, could disrupt shopping. I think there's a sharp question now: what exactly can large models disrupt? If large models can disrupt a particular industry, that's where giants can emerge. If you can't find something to disrupt, but find something incremental, you can at least make money.

Zheng Yizhou:

What's the foundation for open-source to do commercialization?

I want to raise something about whether the open-source community follows the rules of the game.

We've observed some patterns recently. For example, our models are eventually released as open source with commercial use allowed, but there may be a research-only release period first. SDXL 0.9, which people are seeing now, is currently research-only: not yet open-source, and not available for commercial use.

Yet many companies, both domestic and international, have directly taken it and built commercial APIs around it. But this model isn't actually ready for commercialization. This kind of rule-breaking behavior may cause certain damage to the overall open-source commercial environment.

Zhang Meng:

In the developer tools open-source ecosystem, commercialization is a relatively proven model. Everyone basically charges per seat, per year — this is a very smooth business model overseas.

For us, the core question is how to differentiate functionality between the open-source and commercial versions. TabbyML is fundamentally a developer productivity tool. Under our open-core model, all developer productivity features (completion, Q&A, basic analysis) are covered by the open-source version as permanently free capabilities.

When we commercialize for enterprises, targeting CTOs or engineering managers, we provide visibility into your team's overall productivity gains from using Tabby and into your entire workflow, along with language-model-based analysis showing how long each issue took and where it got stuck. These collaboration and insight capabilities become our commercial offering, charged separately to enterprise customers.

There are problems, but it's really worth doing

Chen Yu:

What role do you all see the open-source community playing in this wave of AI open-source entrepreneurship?

Liu Cong:

Open source is very important. Right now there are many new projects emerging for both large models and toolchains. From a startup perspective, we don't have enough engineering capacity to cover all use cases.

For example, in our community, support for the Baichuan model came from community developer contributions. From the open-source large model perspective, this is a very important capability that requires fairly transparent collaboration. From the toolchain perspective, OpenLLM is a very vibrant process where a hundred flowers bloom: many people use different tools for different functions. From an open-source collaboration standpoint, this makes the ecosystem develop better and more openly, which also facilitates progress down the line.

Luo Xuan:

RWKV has always emphasized a global developer ecosystem: global from day one, born global. Why do developers use RWKV, and why do they join an open-source community? The motivation is very simple: they find your project interesting, promising, and worth investing in. That's a very straightforward starting point.

We hope to democratize AI more. Recently we've been organizing online closed-door sessions, including hackathon projects, hoping to help more developers find more resources — we provide some resource connections for developers.

Yin Yifeng:

I think the open-source community should serve as a catalyst.

Getting from 0 to 1 might require a room of exceptionally smart people working behind closed doors. But getting from 1 to 100? Throw it to the open-source community and it happens fast. When Llama 2 came out, everyone was amazed, and we figured this model would sit at the top of the leaderboard for a good while. It lasted only a few days before being surpassed (by Stability AI).

From another angle, even if you're doing closed-source work, the open-source community helps you enormously, because open source sets the floor for closed source. If Company A builds a closed model and it scores 50 points below Llama 2, you just go download Llama 2 from Hugging Face.

Whether in terms of innovation or impact on commercial companies, it acts as an accelerant. So even though open-source communities currently face various commercialization challenges, this is absolutely worth doing.

Zheng Yizhou:

Has anyone here watched videos of slime molds searching for food? I have an analogy: the open-source community is kind of like a slime mold. Though a slime mold is technically a single organism, we can think of it as a collective. Initially, its direction is completely diffuse — the collective explores in every direction, gradually spreading out. At this stage, there's no clear unifying direction. But the moment one point touches food, reaches the ultimate goal, the other paths quickly atrophy, and one very thick trunk directly connects to that target.

The open-source community plays the role of this exploration process. Even after the slime mold reaches a food source and thick trunks have formed, countless branches continue exploring elsewhere to find more food.

The open-source community can keep us out of local-optimum traps. Whether the Transformer is only a local optimum, we still don't have an answer; whether the RNN is the next answer, we also don't know. But because the open-source community exists, multiple branches are exploring different directions simultaneously. Meaningful branches each develop momentum, enabling better progress along that path. This is the greatest significance I see for open-source communities in this era: maintaining technological diversity and preventing us from being trapped in a local optimum and ultimately getting stuck.

Zhang Meng:

The existence of an open-source community is the core point that fundamentally distinguishes open-source projects from all other business models. An open-source community allows potential users — even those unwilling to pay — to become community contributors and generate value.

For example, many of us have done business with China's major internet companies. They're generally not clients with much willingness to pay; it's hard to make money from them. But objectively, these companies have technical capabilities and the desire to use advanced open-source productivity tools.

Our strategy from the start was not to expect revenue from internet giants, but to onboard them through usage — get them participating in the community, actually using a product like Tabby internally, and create opportunities for them to become contributors. This fundamentally broadens the business model.

So in open-source commercialization, one essential judgment call in how we engage customers is this: when a customer clearly won't pay, our primary goal becomes turning them into a community contributor.

Missing Out Because of Language Is a Real Shame

Chen Yu:

For our final topic — everyone knows open source has no borders, and all our panelists here have global ambitions for their open-source communities. How do you globalize an open-source project? What's different between China's open-source culture and overseas?

Liu Cong:

I completely agree that open source is borderless.

Open-source software probably splits into two categories: infrastructure-related open-source software, and transactional open-source software. This panel has focused more on infrastructure open-source software.

For infrastructure development software, Chinese entrepreneurs and developers have certain advantages. From the internet company perspective, we have more users, more concurrency, and encounter more complex scenarios than overseas open-source projects just getting started.

I actually strongly recommend that domestic developers and entrepreneurs build global developer communities from day one, rather than focusing solely on Chinese developer communities. Overseas developers want to use foundational software created by Chinese entrepreneurs and infrastructure developers too — missing out because of language barriers is, I think, a real shame.

Luo Xuan:

After Stable Diffusion went open source, enthusiasm for open source in China surged dramatically. I think China has tremendous passion for open source — there just hasn't been a good closed loop or product, or ecosystem-level business model, in the past. Now RWKV has many domestic developers too; our QQ group alone exceeds 10,000 developers.

Commercial companies' current approach to open source is a different path — they won't open-source their best models, or some companies open-source models when they find themselves falling behind. I think this is what we'll see more of going forward. We need to think beyond temporal and spatial constraints, and consider what changes AI will bring in three to five years.

Yin Yifeng:

People who build open-source models can easily upload them, and people who want them can just as easily pull them down, so a community forms naturally. But communities also have barriers and dividing lines. After Stable Diffusion went open source, it spread worldwide largely because images are universally understandable.

With language models, there may be language barriers — English-speaking communities tend to build English models, Chinese-speaking communities tend to build Chinese models. Llama 2 became so popular perhaps partly because it benefited from English, since everyone speaks it to some degree. I think this can also lead to factionalism. If you want to internationalize, I think the biggest issue is breaking through language barriers. First option: get others to learn Chinese. Second option: put more languages into your models.

Zheng Yizhou:

China is actually an especially important contributor to open-source communities. For example, the DPM++ sampling algorithm for Stable Diffusion came from a Tsinghua University team — arguably one of the most important sampling algorithms. And the ResNet layers we use in our model came from Chinese researchers at Microsoft Research Asia. These are core contributions to the open-source community.

Domestic developers are doing lots of things that, because of language barriers, haven't really reached the global community; with language models this may be even more pronounced, since the underlying languages of the models themselves differ.

If we set aside all geopolitical topics and just ask how long language barriers will persist, I think that within the next two to three years this problem will be solved by various tools and current models. The open-source community has cultivated a host of open-source models that can help us rebuild the Tower of Babel and enable true cross-language collaboration, which is something I'm particularly excited about. Over the next six months to a year, we'll see cross-language development grow more unified.

Zhang Meng:

From our perspective, the biggest problem between Chinese and overseas communities still stems from the internet environment, which forces necessary adaptations that create additional obstacles specifically for the Chinese community.

Many overseas open-source projects simply aren't interested in solving problems unique to the Chinese community. These issues can only be resolved by Chinese developers themselves, and only once such problems are solved can China truly keep pace with overseas toolchains. Once language issues are resolved, I believe the domestic community will develop with tremendous momentum.