Yunqi Capital Year-End Roundup | Annual Deep-Dive Report: AI + Open Source + Commercialization = ?

Yunqi Capital · February 8, 2024

Three Years Running: The New Trends in Open Source + AI

In 2023, the AI field witnessed an explosion in pre-trained large model technology, with the open-source ecosystem playing a major catalyzing role. Numerous open-source models and AI open-source projects began actively pursuing commercialization, but due to their unique product characteristics, their business models differ significantly from traditional open-source software.

This "Yunqi Select" excerpt is drawn from the 2023 China Open Source Annual Report · Commercialization Edition, aiming to clarify potential new paths in this emerging ecosystem by analyzing the reasons behind these differences. In the final week before Lunar New Year, we're sharing with you a "hardcore" holiday gift (there's a surprise at the end 🎉).

This marks the third year of our ongoing collaboration with Kaiyuanshe to produce the China Open Source Annual Report and author the commercialization chapter. Over the past two years, the report has garnered broad attention from open-source practitioners across mainland China, Taiwan, and Singapore.

As one of the earliest institutions to focus on and consistently cultivate open source, we have long tracked developments in AI and open source at close range, successfully identifying and investing in open-source companies such as PingCAP, Zilliz, Jina AI, RisingWave Lab, and TabbyML at early stages, while continuously contributing to the open-source ecosystem.

In 2023, beyond continuing to produce the "China Open Source Annual Conference · Commercialization Forum," we also co-hosted a series of closed-door discussion meetups with Kaiyuanshe, engaging in deep discussions on the direction of open-source ecosystem development with major domestic and international tech companies including Microsoft, Google, Apple, Meta, Huawei, and Baidu; research institutions such as Stanford University, Shanghai Jiao Tong University, USTC, and UCSD; and numerous frontline entrepreneurs at home and abroad. The highlights from these exchanges have been incorporated into this report, and we thank all the entrepreneurs and industry experts for their generous contributions.


Key topics explored in this report include:

➤ How do the open-source ecosystem and AI reinforce each other and develop rapidly together?

➤ What security challenges is open source currently facing?

➤ What does the funding landscape look like for open-source projects globally and domestically?

Below is a curated selection of content from the report. To present our research approach in full, the main text includes the complete chapter outline; for the full version, please reply to our account with "开源2023". For cutting-edge themes that are still changing rapidly, the report includes more open-ended discussion, and we welcome open-source entrepreneurs to reach out to us at any time.

I. The Open-Source Ecosystem Accelerates Rapid AI Development

1. The Rapid Development of Pre-Trained Large Models

Open Source Deserves Major Credit

The development of pre-trained large models has become a defining feature of the artificial intelligence field, with the power of the open-source ecosystem playing an important role. Meanwhile, open-source foundation large models have rapidly advanced in performance, gradually matching their closed-source counterparts.

The continuous emergence of open-source large models

Elo ratings of large models based on user feedback; positions 4-9 are all open-source large models. Source: LMSYS, compiled by Yunqi

Thanks to the open-source nature of these models, users can more conveniently fine-tune large models to adapt to different vertical application scenarios. Fine-tuned large models gain more industry-specific characteristics and become better suited for particular industry applications compared to general-purpose large models — an advantage that closed-source models do not possess.

2. Open Source as a Second Engine Driving Foundation Model Development

1. Supply Side: Concentrating Resources to Accelerate R&D

➤ Reducing developer headcount needs, concentrating R&D capacity

According to relevant data released by the Ministry of Industry and Information Technology, the talent supply-to-demand ratio for AI technical positions across different specializations is below 0.4.

➤ Saving computing power, avoiding redundant efforts

➤ Exploring broader technical possibilities

Whether the Transformer architecture is the optimal solution remains unanswered; whether RNN represents the next better direction is also still uncertain. But it is precisely the open-source ecosystem that enables developers to experiment on different branches of this AI tree, ensuring technological diversity, avoiding fixation on local optima, and truly driving the possibility of AI development in multiple directions.

2. Demand Side: Lowering Barriers, Capturing Market Share

➤ Open-source models significantly reduce costs for model users

Cost comparison between calling OpenAI APIs and deploying open-source models on AWS cloud, compiled by Yunqi

Taking direct API calls to OpenAI versus deploying the Flan UL2 model on public cloud as an example:

According to the latest data from OpenAI's official website, the GPT-4 model costs $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens. Considering the typical mix of input and output, we assume an average cost of $0.04 per 1,000 tokens. Each token is approximately 3/4 of an English word, and the token count of a single request equals the prompt tokens plus the generated output tokens. Assuming a text block of 500 words, or roughly 670 tokens, the cost per text block would be 670 × 0.04 / 1000 = $0.0268.

If deploying an open-source model on AWS cloud instead, using the 20-billion-parameter Flan UL2 model mentioned in AWS's published tutorials as an example, costs break down into three components:

1) Fixed cost of deploying the model as an endpoint using AWS SageMaker: approximately $5-6 per hour, or about $150 per day

2) Connecting the SageMaker endpoint to AWS Lambda: assuming a 5-second response time to users, using 128MB memory, the price per request is: 5000 × 0.0000000021 (128MB per-millisecond unit price) = $0.00001

3) Exposing this Lambda function as an API through API Gateway: Gateway pricing is approximately $1 per 1 million requests, or $0.000001 per request.

Based on the above data, we can calculate:
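As a minimal sketch of that calculation, the Python snippet below plugs the figures quoted above into a daily cost comparison and solves for the break-even request volume. It assumes every request is one 670-token text block, that a single SageMaker endpoint can absorb the traffic, and that the $150 daily endpoint cost is fixed; the function names and the sample volumes are illustrative and do not come from the report.

```python
# All prices are the figures quoted in the text (USD).
OPENAI_COST_PER_1K_TOKENS = 0.04           # assumed blended GPT-4 rate
TOKENS_PER_REQUEST = 670                   # about 500 English words
SAGEMAKER_FIXED_PER_DAY = 150.0            # endpoint cost, treated as fixed
LAMBDA_PER_REQUEST = 5000 * 0.0000000021   # 5 s at the 128 MB per-ms price
GATEWAY_PER_REQUEST = 1.0 / 1_000_000      # $1 per million requests


def openai_daily_cost(requests_per_day: int) -> float:
    """Daily cost of sending every request to the hosted API."""
    return requests_per_day * TOKENS_PER_REQUEST * OPENAI_COST_PER_1K_TOKENS / 1000


def self_hosted_daily_cost(requests_per_day: int) -> float:
    """Daily cost of the SageMaker + Lambda + API Gateway setup."""
    variable = LAMBDA_PER_REQUEST + GATEWAY_PER_REQUEST
    return SAGEMAKER_FIXED_PER_DAY + requests_per_day * variable


def break_even_requests_per_day() -> float:
    """Request volume at which both options cost the same per day."""
    api_per_request = TOKENS_PER_REQUEST * OPENAI_COST_PER_1K_TOKENS / 1000
    hosted_per_request = LAMBDA_PER_REQUEST + GATEWAY_PER_REQUEST
    return SAGEMAKER_FIXED_PER_DAY / (api_per_request - hosted_per_request)


print(f"break-even: {break_even_requests_per_day():.0f} requests/day")
for volume in (1_000, 10_000, 100_000):  # hypothetical daily volumes
    print(volume, round(openai_daily_cost(volume), 2),
          round(self_hosted_daily_cost(volume), 2))
```

Under these assumptions the break-even point lands at roughly 5,600 requests per day: below that volume the API is cheaper, and above it the self-hosted endpoint costs less, at least until traffic outgrows a single instance.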

➤ Open source improves model interpretability and transparency, lowering barriers to technology adoption

➤ Enterprise users can address specific needs through open-source foundation models

When enterprises call closed-source large models, those models remain deployed on servers at companies like OpenAI. The ability to deploy open-source large models locally substantially protects enterprise privacy.

➤ Open-source models benefit long-term user experience

With the boost of open-source community R&D strength, open-source models update rapidly. LLaMA2 itself lacked Chinese language corpora, leading to unsatisfactory Chinese comprehension; yet merely one day after LLaMA2's open-source release, the community produced the first downloadable, runnable open-source Chinese LLaMA2 model, "Chinese LLaMA2 7B."

➤ Open source helps capture market opportunities first

Due to their low barrier to entry, open-source models are more accessible to users and can rapidly expand market reach. Stable Diffusion, an open-source image generation model, has become a significant competitor to the closed-source text-to-image model Midjourney thanks to its massive developer community and diverse application scenarios.

3. Ecosystem Side: Gathering Diversity for Sustained Growth

➤ Open source helps large model companies rapidly capture ecosystem resources

Number of keyword-related projects on GitHub

➤ Open source helps large model vendors leverage markets and gain commercial allies

After Llama 2's release under a license permitting commercial use, Meta quickly reached partnerships with Microsoft and Qualcomm. Meta indicated that Microsoft Azure cloud service users could fine-tune and deploy Llama 2 directly in the cloud; Microsoft stated that Llama 2 had been optimized for Windows and could run locally on it. This pairing demonstrates the natural basis for collaboration between open-source large models and cloud vendors.

Meta's partnership with Qualcomm also signals its expansion into mobile phones. Given their broad audience and local deployment advantages, open-source large models make mobile phones an important future vehicle for convenient large model usage. This also attracts mobile chip manufacturers to partner with open-source model vendors.

➤ Open source can mobilize extensive community forces, gathering diverse development power

Changes in the number of generative AI-related projects on the open-source community GitHub (Source: GitHub, compiled by Yunqi)

Open-source large models receive contributions from developers across different regions, cultures, and technical backgrounds. Their participation makes open-source large models more adaptable to local customs and conditions in various regions: such as fine-tuning for corresponding languages, fine-tuning for corresponding industries, and fine-tuning for different usage habits — thereby expanding the audience reach of open-source large models.

Top 10 Regions by Generative AI Contributor Distribution (Source: GitHub, compiled by Yunqi)

4. Paths to Commercializing Open-Source Large Models

This section draws on conversations with industry practitioners and case studies to outline current commercialization directions:

➤ Offering Support Services

➤ Providing Cloud Hosting Services

➤ Developing Commercial Applications Based on Foundation Models

➤ "Model-as-a-Service" Business Models

➤ The Need for Bold Exploration in Large Model Business Models

Numerous companies are actively exploring diverse business models rather than locking themselves into a single pricing strategy. To date, no model has emerged that effectively covers the steep costs of training. The rise and evolution of open-source large models signals the birth of a new industry ecosystem, with all participants — research institutions, enterprises, developers, and users — actively probing for approaches that can balance technological innovation with economic returns. This ecosystem carries distinctive value and potential, offering unprecedented technical support and innovation possibilities across industries.

2. Open-Source AI Developer Tools

An Emerging Industry Consensus

1. The Critical Role of Developer Tools in the AI Value Chain

The developer tools layer is a crucial link in the AI large model development chain. As shown below, this layer serves as a bridge, connecting and mediating between upstream and midstream elements:

Positioning of Developer Tools in the AI Large Model Industry Chain

The Growing Ranks of AI Large Model Developers

A Wide Array of Developer Tools Covering Different Layers of Large Model Development

The Significance of Open-Source Developer Tools

Supply-Side Benefits

Open-source developer tools demonstrate high user stickiness. Source: Clouded Judgement, compiled by Yunqi

The chart shows net revenue retention rates across major SaaS products. Developer products consistently exceed the median, with Snowflake leading at 174%, followed by HashiCorp, GitLab, and Confluent, all above 120%. Against this backdrop of high stickiness, faster customer acquisition translates directly into higher future revenue. Open source lowers the barrier to trying and adopting new tools, which is essential for building brand awareness and a user base.

Demand-Side Benefits

Thanks to the ecosystem effects of open-source developer tools, the pace of technical iteration typically outstrips that of closed-source alternatives. The latest research from labs can be quickly integrated and shared, ensuring rapid technology updates and dissemination.

Building the Ecosystem Is Critical for Open-Source Developer Tools

Sustaining Open-Source Developer Tools Requires Stable Technical Support for the Community

Open-Source Developer Tools Need Complementary Strengths with Cloud Providers to Expand Market Reach and User Base

Take MongoDB as an example. MongoDB pivoted to the cloud early, launching its SaaS offering Atlas. Though Atlas accounted for just 1% of revenue at MongoDB's 2017 IPO, the company invested heavily in building out SaaS products and marketing infrastructure. Subsequently, Atlas revenue grew at a compound annual growth rate exceeding 40%.

MongoDB Revenue by Product

Building an Ecosystem Helps Establish Industry Standards for Open Source

MongoDB leveraged its community to become the industry standard for NoSQL databases. This active community not only delivered high-quality, low-cost licenses for MongoDB's early commercial versions, but also formed the foundation for Atlas (its managed service) down the road.

Similarly, through open-source community collaboration, Milvus launched VectorDBBench, which measures key metrics to evaluate vector database performance and help users get the most out of these systems. This has gradually established an industry standard for vector databases, enabling users to choose a product based on their specific needs.

Vector Database Benchmark Results

Exploring Commercialization Paths for Open-Source Developer Tools

AI developer tools share commercialization parallels with traditional software developer tools, though overall commercialization remains in early exploration. Based on research and analysis of open-source developer tool projects that have already attempted commercialization, several business models have emerged:

Cloud Hosting Managed Service — Pay-as-You-Go

Cloud Hosting Managed Service — Tiered Subscription

Private Cloud / Dedicated Cloud / Customized Deployment

Success Stories on the Developer Tools Side

Zilliz develops next-generation data processing and analytics platforms for artificial intelligence, primarily providing underlying technology for application-oriented enterprises. It is now used by over 1,000 enterprises globally.

Zilliz Global Users

Zilliz's main product is a vector database — a critical component in the developer tools stack. This database system, purpose-built for storing, indexing, and querying embedding vectors, enables large models to store and retrieve knowledge bases more efficiently and fine-tune models at lower cost. It will also play an increasingly important role in the evolution of AI-native applications. Zilliz's commercial product is Zilliz Cloud, which uses a monthly subscription model and SaaS deployment. Monthly fees are determined based on vector count, vector dimension, compute unit (CU) type, and average data length. Zilliz also offers PaaS-based dedicated deployment services for scenarios with stringent data privacy and compliance requirements, priced on a custom basis.
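To see why vector count and dimension are natural pricing inputs, here is a rough sketch of the raw storage footprint they imply. It assumes 32-bit float components and ignores index overhead, replication, and compression; it is not Zilliz Cloud's pricing formula, and the function name and sample numbers are hypothetical.

```python
def raw_footprint_gib(vector_count: int, dimension: int,
                      avg_payload_bytes: int = 0,
                      bytes_per_component: int = 4) -> float:
    """Estimate raw storage in GiB for embeddings plus attached data.

    bytes_per_component=4 assumes float32 vectors (an assumption, not a
    statement about any particular product's storage format).
    """
    bytes_per_row = dimension * bytes_per_component + avg_payload_bytes
    return vector_count * bytes_per_row / 2**30


# Hypothetical workload: one million 768-dimensional embeddings.
vectors_only = raw_footprint_gib(1_000_000, 768)
with_payload = raw_footprint_gib(1_000_000, 768, avg_payload_bytes=512)
print(f"vectors only: {vectors_only:.2f} GiB")
print(f"with 512-byte payloads: {with_payload:.2f} GiB")
```

The footprint scales linearly with both vector count and dimension, which is consistent with a subscription fee that depends on those two inputs alongside average data length.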

Zilliz GitHub Community Operations

Zilliz Pricing Calculator Example

A Flourishing Ecosystem of Open-Source Tools at the AI Application Layer

A Diverse Array of Open-Source Tools at the Application Layer

A Flourishing Array of AI Application-Layer Products (Source: Sequoia)

Open-Source Tool Landscape at the Application Layer (Selected Examples by Domain)

Market Status of Open-Source Large Model Applications

Internet Giants and Startups Both Making Strides

Different Competitive Dynamics in B2B and B2C Markets

  1. B2B market: Enterprise-facing applications typically focus on improving efficiency, reducing costs, and enhancing decision-making capabilities.
  2. B2C market: Consumer-facing applications place greater emphasis on user experience, interactivity, and ease of use.

Numerous Sub-Scenarios Remain Blue Ocean Markets Without Clear Leaders

New Capabilities Enabled by Large Models Set the Stage for Innovative Applications

Challenges Facing Open-Source Commercialization of Large Models

Rapid Technological Advancement Requires Continuous Iteration to Maintain Competitiveness

Difficulty Defining the Boundaries of Copying/Borrowing

Community Participants Struggle to Make Direct Contributions to Model Iteration

Fast Pace of Open-Source Technology Development Leads to High Later-Stage Update Costs

II. Open Source Security Challenges

Security is a critical factor in determining whether an open-source product can successfully commercialize. According to Synopsys data, as of end-2022, 84% of codebases contained at least one known open-source vulnerability, 48% contained high-risk vulnerabilities, and 34% of respondents reported experiencing "attacks exploiting known vulnerabilities in open-source software" within the past 12 months. Only with robust security guarantees can open-source software go further down the commercialization path.

Open Source Codebase Vulnerabilities (Data Source: Synopsys)

Open Source Software Network Security

Open Source Software Security Issues Are Relatively Widespread

Open Source Software Itself Contains Numerous Security Vulnerabilities

According to the "2022 QI-ANXIN Open Source Project Detection Program," defect density and high-severity defect density have increased for three consecutive years, with an accelerating trend. The security situation of open-source software itself is quite severe.

Three-Year Comparison of Average Defect Density in Open Source Software (Data Source: 2023 China Software Supply Chain Security Analysis Report)

Open Source Projects with Either Too Low or Too High Activity Levels Are More Prone to Security Risks
Some Users Rely on Overly Outdated Software with Confused Version Management

Strategies for Addressing Open Source Software Vulnerability Risks

  • Regular Security Audits and Code Reviews
  • Using SCA (Software Composition Analysis) Tools
  • Enhanced Education and Training

Controlling Open Source Licenses

1. Open Source Licenses Constrain Users of Open Source Resources and Come in Many Varieties

An open source license is a set of constraints on the users of open source resources (including but not limited to software, code, and web pages).

Open source licenses are broadly categorized into three types based on the restrictiveness of their authorization: Permissive, Weak Copyleft, and Strong Copyleft.

Open Source License Classification

Within these broad categories, specific licenses and license families each have unique restrictions, permissions, and additional parameters. The overall logical relationships among licenses are organized as follows:

License Logical Relationships, Source: GitHub

The Open Source Society (Kaiyuanshe) also provides an open source license selector, which helps you quickly understand which license best fits your situation. It is well worth a look if you need to choose a license: https://kaiyuanshe.cn/tool/license-filter

2. Failure to Comply with Licenses When Using Open Source Resources Creates Infringement Risk

  • Open source license infringement
  • License reciprocity requirements expand the scope of open source copyright issues
  • Open source license infringement can lead to serious consequences

3. Open Source Large Model Licenses Differ Significantly from Traditional Licenses

Because open source large models are still developing and iterating, two highly influential open source large models this year — LLaMA 2 and Falcon — both faced questions about whether they were truly "open source" due to adjustments in their open source license terms. Neither used generic license agreements available on the market; instead, each created its own proprietary agreement, the "LLAMA 2 COMMUNITY LICENSE AGREEMENT" and "TII Falcon LLM License" respectively. Both also imposed additional constraints on commercial use.

  • The Purpose of Open Source for Large Models Differs from Traditional Software Open Source

Taking LLaMA 2 as an example, its license is essentially a guidance framework primarily aimed at enterprises intending to develop and deploy AI systems in compliance with Meta's established norms and standards. The purpose of this framework is to ensure that these enterprises can adhere to specific rules and standards set by Meta when developing and deploying AI technology. This approach helps Meta manage the scope and manner of its AI technology applications, thereby safeguarding its commercial interests and brand image.

For enterprises planning to conduct AI development on Meta's platforms, the LLaMA 2 license may constitute a mandatory compliance requirement. This means that when these enterprises use tools and resources provided by Meta to develop and deploy AI models, they must follow Meta's specific norms and requirements. In this process, these enterprises may need to apply for corresponding authorization from Meta, and the LLaMA 2 license is one component of such authorization.

4. Methods to Ensure License Controllability

  • Document usage of open source components
  • Exercise caution when using AI-assisted coding tools
  • Conduct thorough code audits during M&A processes
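
A minimal sketch of the first practice, documenting open source components, might keep an inventory that maps each component to its license and flags entries that need legal review. The mapping below uses SPDX-style identifiers and the three-way classification described earlier; the inventory contents and function names are hypothetical, and a real SCA tool would cover far more licenses.

```python
from enum import Enum


class LicenseClass(Enum):
    PERMISSIVE = "permissive"
    WEAK_COPYLEFT = "weak copyleft"
    STRONG_COPYLEFT = "strong copyleft"
    UNKNOWN = "unknown"


# A small, deliberately incomplete mapping of well-known licenses.
LICENSE_CLASSES = {
    "MIT": LicenseClass.PERMISSIVE,
    "Apache-2.0": LicenseClass.PERMISSIVE,
    "BSD-3-Clause": LicenseClass.PERMISSIVE,
    "LGPL-3.0-only": LicenseClass.WEAK_COPYLEFT,
    "MPL-2.0": LicenseClass.WEAK_COPYLEFT,
    "GPL-3.0-only": LicenseClass.STRONG_COPYLEFT,
    "AGPL-3.0-only": LicenseClass.STRONG_COPYLEFT,
}


def classify(license_id: str) -> LicenseClass:
    """Look up a license; anything unrecognized is treated as UNKNOWN."""
    return LICENSE_CLASSES.get(license_id, LicenseClass.UNKNOWN)


def needs_review(inventory: dict[str, str]) -> list[str]:
    """Return components whose license is strong copyleft or unrecognized."""
    risky = {LicenseClass.STRONG_COPYLEFT, LicenseClass.UNKNOWN}
    return sorted(name for name, lic in inventory.items()
                  if classify(lic) in risky)


# Hypothetical component inventory for a commercial product.
inventory = {
    "web-framework": "MIT",
    "pdf-renderer": "AGPL-3.0-only",
    "geo-utils": "MPL-2.0",
    "in-house-fork": "LicenseRef-Custom",
}
print(needs_review(inventory))  # ['in-house-fork', 'pdf-renderer']
```

Treating unrecognized licenses as needing review is a conservative design choice: custom agreements such as the large model licenses discussed above would land in that bucket rather than being silently accepted.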

3. Open Source AI Safety

With the surge in large models, beyond the license issues mentioned above, more AI safety and controllability concerns have gradually entered public view. Because the technology is so new and lacks clear definitions and standards, this section draws on desk research to list topics currently of concern to relevant practitioners, hoping to spark reader reflection. Discussion and feedback are welcome.

1. Open Source AI Poses New Requirements for Data Security

Dataset quality, and whether a dataset contains malicious data, matter especially for large AI models, and for open source large models in particular. Many open source large model datasets are supplied internally by enterprises, whose data cleansing, monitoring, and compliance processes often cannot match those of dedicated closed-source large model vendors.

  • Improper training dataset handling can trigger a cascade of biases
  • Training data sources should be factored into consideration when selecting open source foundation models

2. Widespread Use of Open Source AI Large Models Triggers Reflection on Social Ethics

  • Large model hallucination issues can lead to serious consequences

Harbin Institute of Technology Classification of Hallucinations

  • Large model outputs may produce content that violates moral and legal standards

    1. How to strengthen large model information filtering mechanisms — technical dimension
    2. How to determine whether large model output constitutes infringement or illegality — legal dimension
  • Large models may exacerbate social fragmentation

    1. Strengthen public education: large models are not omnipotent and require prudent evaluation — public awareness dimension
    2. How to ensure quality of large model training datasets and reduce their bias — technical dimension

III. Capital Market Conditions for Open Source Projects

1. Global Market Conditions

1. 2023 Global VC Investment Scale Declined, but AIGC Was the Center of Attention

Global Venture Capital Up, Flat, and Down Round Proportions (Data Source: KPMG)

However, against an overall pessimistic backdrop, AIGC-related financing became a global focal point, with related funding scale growing substantially. In North America, AI-related companies comprised the largest share of 2023 unicorns, including AI agent startup Imbue, AI+biotech company TrueBinding, generative AI company Runway, and natural language processing company Cohere. In Europe, despite overall funding slowdown, AI companies performed exceptionally well, with numerous startups receiving capital, such as French AI platform company Poolside. Asian investors' interest in AI has also been rising continuously, though regulators in relevant countries have been intensifying oversight of generative AI.

As AI technology iterates rapidly and concepts such as large models and AI agents continue to gain traction, AI-related investment and financing are expected to be relatively insulated from the contraction in global venture capital scale.

2. Global Open Source Financing Conditions

In recent years, the growth of commercial open source companies has been remarkable. Their total market capitalization has surged from $10 billion to surpass the $500 billion mark. This significant growth not only demonstrates the enormous potential of open source technology in the commercial realm, but also reflects investors' high recognition of and confidence in the open source model. According to OSS Capital projections, the market cap of commercial open source companies could reach an astonishing $3 trillion in the future.

Over the past four years, the open source commercial sector has demonstrated robust growth. During this period, more than 400 startups completed approximately 700 funding rounds, totaling $29 billion. Specifically, annual funding scale increased from $270 million in 2020 to $12.5 billion in 2023, representing a compound annual growth rate of 255%.

Even in the lowest funding month of 2023, the monthly funding amount of $386 million exceeded the highest monthly funding amount of 2021, and even surpassed the total annual funding of 2020 ($272 million). This trend reflects the capital market's sustained attention to and recognition of open source commerce.
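
The growth rate quoted above can be sanity-checked with the standard compound annual growth rate formula, (end / start)^(1 / periods) - 1. The sketch below uses the rounded figures from the text ($272 million in 2020, $12.5 billion in 2023) and assumes three compounding periods; the function name is illustrative.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate as a fraction (1.0 means 100%)."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1 / periods) - 1


funding_2020 = 272.0      # USD millions, as quoted in the text
funding_2023 = 12_500.0   # USD millions, as quoted in the text
rate = cagr(funding_2020, funding_2023, periods=3)  # 2020 -> 2023
print(f"CAGR: {rate:.0%}")
```

With these rounded inputs the result comes out at roughly 258%, close to the 255% cited; the small gap presumably reflects rounding in the published figures.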

Global VC funding into commercial open source software companies (Data source: OSS Capital)

Funding round distribution for commercial open source software companies (USD millions) (Data source: OSS Capital)

Cumulative funding scale distribution for commercial open source software companies (Data source: OSS Capital)


Yunqi Capital New Year Collection

What new trends do you see emerging at the intersection of AI and open source? We'll select two participants from the comments section to receive a physical copy of the 2023 Open Source Annual Report: Commercialization Edition!