In-Depth Conversation with the vLLM Team: How to Build a Successful Open-Source Ecosystem from Scratch

July 2, 2024

This episode features the vLLM team — Zhuohan Li, Simon Mo, Lily Xiaoxuan Liu, and Kaichao You — alongside Yusen Dai, Managing Partner at ZhenFund.

In just two years, vLLM has grown from a demo project at UC Berkeley into the world's most popular open-source LLM inference acceleration framework. As AI technology pushes boundaries and large language models ride the wave of the moment, deployment often hits a wall: inference is too slow, GPU utilization too low. With its PagedAttention algorithm at the core, vLLM supports 30+ generative large language models, adapts to multiple hardware vendors, and incorporates cutting-edge optimizations — delivering up to 24x the throughput of Hugging Face Transformers.

Yesterday, ZhenFund officially announced its donation to the vLLM project. For foundational building blocks that shape the future, we're glad to do our small part. In this podcast episode, we've invited four core members of the vLLM team and Yusen Dai, the ZhenFund partner who led this donation, to discuss the open-source story behind vLLM and how the project has gained unstoppable momentum in the AI wave.

In this episode, we explore:
How did vLLM become such a sought-after open-source LLM inference acceleration framework in just two years?
Starting from an academic project, how did the vLLM team leverage its strengths while adapting to new challenges?
Why has the vLLM open-source project built such a vibrant technical community, attracting global talent to co-create?
How does the vLLM team view commercialization, and what is their vision for vLLM as an open-source project?

Whether you're a developer in the large model space, curious about AI's evolution and innovation, or a community builder, this episode offers something for you.

Guests

Yusen Dai: Managing Partner at ZhenFund, investor in leading AI companies including Moonshot AI
Zhuohan Li: PhD candidate at UC Berkeley, co-founder of vLLM, currently leads high-level design and open-source community management
Simon Mo: PhD candidate at UC Berkeley, serves as product manager and open-source community ecosystem lead at vLLM
Lily Xiaoxuan Liu: PhD candidate at UC Berkeley, leads research-oriented design and improvements at vLLM
Kaichao You: PhD candidate at Tsinghua University, visiting researcher at Berkeley, manages open-source content maintenance at vLLM

Episode Highlights

02:07 Introduction to the vLLM project and team
12:04 Before vLLM, no LLM inference framework had attempted optimization from a multi-request perspective
15:38 From zero to one: becoming the fastest, most user-friendly open-source engine
23:12 "David vs. Goliath": open-source model and code quality as the core advantage
30:22 Open-source tradition sparks collision between academia's exploratory spirit and industry's practical demands
32:35 After open-sourcing, vLLM's goal was no longer just speed
35:27 When a model goes viral, issues about it explode
40:20 Papers can have preconditions; systems must handle every edge case
43:51 Future large models will trend toward scenario-driven development, exploring the limits of model efficiency
45:52 What vLLM chooses to do — and what it chooses not to
50:50 Building a broad contributor ecosystem: no shortcuts, just one person at a time
58:32 Resist the urge to do everything yourself; let the community grow gradually
01:02:39 PMF arrived when PRs exploded and we had no bandwidth for new features
01:03:10 Doing research at vLLM is a positive iterative process
01:07:10 More hardware and model support, higher performance optimization — vLLM will continue building and maintaining open-source
01:11:43 We're happy to be a non-commercial project that helps everyone else commercialize better
01:16:57 Fresh open-source projects and book recommendations from the vLLM team

Related Resources

vLLM Project GitHub: https://github.com/vllm-project/vllm

vLLM: An inference and serving engine for large language models. In short, vLLM deploys trained models into production environments with a focus on efficiency and cost — making inference faster, maximizing GPU utilization, and ultimately accelerating AI product launches. Technically, vLLM is an open-source project built on the PagedAttention algorithm, supporting 30+ generative large language models with adaptations for multiple hardware vendors and cutting-edge optimizations.

PagedAttention: A memory management algorithm designed to optimize the attention mechanism in large language models. Its core idea borrows operating system paging and virtual memory techniques to manage the KV cache in Transformer attention operations, enabling more efficient use of computational resources during LLM inference.
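
The paging idea behind PagedAttention can be illustrated with a toy block table. This is a minimal sketch, not vLLM's actual implementation: the names BlockAllocator, Sequence, and BLOCK_SIZE are hypothetical, and real KV blocks hold tensors in GPU memory rather than bare integer ids.

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out fixed-size physical blocks from one shared pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)


class Sequence:
    """Maps one request's logical token positions to physical blocks."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new block is allocated only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence sit unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, pos):
        """Return (physical block id, offset) for token position pos."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self):
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0
```

Because blocks are allocated on demand and need not be contiguous, several requests can share one pool without each reserving memory for its maximum possible length.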

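KV (Key-Value): In attention mechanisms, each input token is projected into three vectors: a Query, a Key, and a Value. The model weights each Value by the similarity between its Key and the current Query. During generation, the Keys and Values of earlier tokens are stored in the KV cache so they do not have to be recomputed at every step.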
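
How Keys, Values, and Queries interact can be shown with single-head scaled dot-product attention in plain Python. This is a sketch for illustration only: the function name attend is hypothetical, and real models operate on batched tensors with many heads.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over cached keys/values.

    Returns the output vector and the attention weights.
    """
    d = len(query)
    # Similarity between the query and every cached key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    # Numerically stable softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted average of the cached values.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights
```

During generation, each new token adds one key and one value to the lists, which is why the KV cache grows with sequence length.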

Paging: An operating system memory management technique that divides memory into fixed-size pages, loading only currently needed pages into physical memory while storing others on disk. When a page not in physical memory is accessed, the OS performs page swapping.

Virtual Memory: A technique allowing computers to use more memory than physically available by storing some data on disk rather than solely in physical memory.

Prefix Caching: An LLM serving optimization that applies the general idea of caching (storing the results of expensive computations and returning them when the same inputs recur) to the KV cache. When a new request shares a prefix with an earlier one, such as a common system prompt, the engine reuses the KV cache already computed for that prefix instead of recomputing it.
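
Applied to LLM serving, the idea is to reuse KV cache entries for prompts that share a prefix. The following is a minimal sketch under stated assumptions: PrefixCache and CACHE_BLOCK are hypothetical names, a string stands in for a real KV block, and each block is keyed by the full token prefix that ends at it.

```python
CACHE_BLOCK = 4  # tokens per cached block (illustrative value, an assumption)


class PrefixCache:
    """Toy cache that reuses per-block results for shared prompt prefixes."""

    def __init__(self):
        self.blocks = {}   # full token prefix -> stand-in for a KV block
        self.computed = 0  # tokens whose KV had to be computed

    def prefill(self, tokens):
        """Process a prompt; return how many tokens were served from cache."""
        reused = 0
        # Only complete blocks are cached; a trailing partial block is skipped.
        for end in range(CACHE_BLOCK, len(tokens) + 1, CACHE_BLOCK):
            key = tuple(tokens[:end])  # a block is valid only for its whole prefix
            if key in self.blocks:
                reused += CACHE_BLOCK
            else:
                self.blocks[key] = f"kv[{end - CACHE_BLOCK}:{end}]"
                self.computed += CACHE_BLOCK
        return reused
```

Keying on the whole prefix matters because a token's Key and Value depend on everything before it, so a block can only be reused when all preceding tokens match.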

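Speculative Decoding: An LLM inference acceleration technique that gained traction in 2023. A small draft model (or another cheap predictor) proposes several tokens, which the large model then verifies in parallel in a single step. This increases parallelism per step and reduces the total number of decoding steps, so the large model's parameters are read from memory fewer times and inference runs faster.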
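
The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a simplified greedy variant, offered as an assumption-laden illustration: toy_target, toy_draft, speculative_step, and generate are hypothetical names, a real engine verifies all proposed positions in one batched forward pass, and sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
def toy_target(ctx):
    """Stand-in for the large model: next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Stand-in for the small model: agrees with the target except after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


def speculative_step(target_next, draft_next, context, k):
    """One greedy draft-then-verify step; returns the tokens accepted."""
    # 1. The cheap draft model proposes k tokens one by one.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target checks every proposed position.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # 3. Everything matched, so the verification also yields one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(target_next, draft_next, prompt, num_tokens, k=4):
    """Generate num_tokens tokens; also return the number of target steps."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        steps += 1
    return out[len(prompt):len(prompt) + num_tokens], steps
```

In this greedy form the output is identical to decoding with the target alone; the gain is that each target step can emit several tokens when the draft guesses correctly.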

CUDA Kernel: In CUDA, a Kernel is a function executed on the GPU. "Launching a Kernel" refers to scheduling and executing that function on the GPU.

Chunked Prefill: Processes longer user request inputs in chunks, creating more room to handle other requests, reducing the impact of a single long request on others, and improving overall service latency stability.
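
One way to picture chunked prefill is a scheduler that works under a fixed per-step token budget. The sketch below is an assumption, not vLLM's scheduler: schedule_step is a hypothetical function, and serving decode requests before prefill chunks is a policy chosen here for illustration.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Pick the work for one engine step under a fixed token budget.

    decoding:   list of request ids that each need one token this step
    prefilling: dict of request id -> prompt tokens still to prefill (mutated)
    Returns a list of (request id, kind, number of tokens) entries.
    """
    plan = []
    budget = token_budget
    # Ongoing generations go first so a long prompt cannot stall them.
    for rid in decoding:
        if budget == 0:
            break
        plan.append((rid, "decode", 1))
        budget -= 1
    # Whatever budget is left is filled with chunks of pending prompts.
    for rid in list(prefilling):
        if budget == 0:
            break
        chunk = min(prefilling[rid], budget)
        plan.append((rid, "prefill", chunk))
        prefilling[rid] -= chunk
        budget -= chunk
        if prefilling[rid] == 0:
            del prefilling[rid]
    return plan
```

With a budget of 6, two decoding requests, and one 10-token prompt, the prompt is spread over three steps while both decoding requests keep producing a token every step.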

MoE (Mixture of Experts): A large model architecture built around specialization. A routing (gating) network sends each input token to a small subset of "expert" sub-networks, so only part of the model's parameters is active for any given token.
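
A common way to implement this delegation is top-k routing, sketched below in plain Python. This is an illustrative assumption rather than any specific model's design: moe_forward, gate_weights, and the toy experts are hypothetical, and real experts are feed-forward networks operating on tensors.

```python
import math


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x to the top_k experts and mix their outputs.

    gate_weights: one weight row per expert, used to score x
    experts:      list of callables mapping a vector to a vector
    Returns the mixed output and the indices of the chosen experts.
    """
    # The gate scores every expert for this input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the chosen experts only, so their weights sum to 1.
    top = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - top) for i in chosen]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Only the chosen experts run; the rest cost no computation.
    output = [0.0] * len(x)
    for p, i in zip(probs, chosen):
        y = experts[i](x)
        output = [o + p * yi for o, yi in zip(output, y)]
    return output, chosen
```

Because only top_k experts run per input, total parameter count can grow without a matching increase in per-token computation.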

Hugging Face: A company, founded by French entrepreneurs, and the open-source machine learning platform it runs. It hosts models and datasets and develops libraries for model training, data preprocessing, and model conversion that make NLP development more convenient.

Transformers: An NLP package developed by Hugging Face that supports loading most current pre-trained models. With the rise of large-scale models like BERT and GPT, it has been increasingly adopted by companies and researchers for NLP applications.

Encoder-Decoder: A model architecture containing both encoder and decoder components.

State Space: Models for describing state representations and predicting next states based on certain inputs.

Latent Attention: DeepSeek-V2 designed an attention mechanism called MLA (Multi-Head Latent Attention). MLA achieves better results than MHA with significantly smaller KV cache requirements through low-rank key-value joint compression.

TensorRT-LLM: An open-source library from NVIDIA for accelerating and optimizing inference of the latest large language models on NVIDIA GPUs. It provides tools for compiling and optimizing LLMs for inference, including kernel fusion, quantization, and runtime optimization, along with a Python API for defining and building new models. GitHub: github.com

TGI (Text Generation Inference): Hugging Face's Rust and Python framework for deploying and serving large language models — a production toolkit for LLM deployment.

ControlNet: A neural network architecture that can be combined with pre-trained diffusion models, particularly Stable Diffusion. It makes generation controllable by feeding in external control signals (for example, edge maps or poses) during the generation process. In 2022, project creator Lvmin Zhang received the first Zhen Summer scholarship. GitHub: github.com

Andrej Karpathy: A founding member of OpenAI who previously led AI and Autopilot at Tesla.

BSD (Berkeley Software Distribution): A Unix-based operating system first developed at UC Berkeley in the late 1970s and later maintained by the university's Computer Systems Research Group (CSRG).

Postgres: A UC Berkeley open-source project that evolved into PostgreSQL, a highly extensible open-source object-relational database system. Website: www.postgresql.org

RISC-V: An open-standard instruction set architecture (ISA). Website: riscv.org

Spark: An open-source distributed computing system providing a fast, general-purpose engine for large-scale data processing. Website: spark.apache.org

Ray: An open-source distributed computing framework designed for large-scale AI and machine learning applications. Website: docs.ray.io

Unsloth: Focuses on accelerating fine-tuning and training of large language models. It offers an open-source version that significantly improves training efficiency, reduces memory usage, and supports NVIDIA, Intel, and AMD GPUs. Key feature: all kernels are rewritten in OpenAI's Triton language. Website: unsloth.ai

PyTorch: An open-source machine learning library widely used for computer vision and natural language processing applications. Website: pytorch.org

Jax: A high-performance numerical computing library developed by Google Brain, supporting automatic differentiation and just-in-time (JIT) compilation. Website: jax.readthedocs.io

Staff

Executive Producers: Jiafen, Wendi, Stone

Post-production: Keyone Studio

About ZhenFund

"This Is Serious" (此话当真) is a general business podcast produced by ZhenFund, where the investment team shares the latest trends and industry insights with leaders across fields.

Founded in 2011, ZhenFund is one of China's earliest angel investment firms. Since inception, ZhenFund has actively sought out the best entrepreneurial teams and era-defining investment opportunities in artificial intelligence, chips and semiconductors, robotics and hardware, healthcare, enterprise services, new energy, cross-border expansion, and consumer lifestyle.

ZhenFund — Your First Stop for Entrepreneurship!

Contact Us

WeChat Official Account: ZhenFund (ID: zhenfund)

Website: www.zhenfund.com

Email: media@zhenfund.com

Listen on Xiaoyuzhou, Apple Podcasts, and Ximalaya.

We welcome your feedback and suggestions in the comments!