"PingCAP" Huang Dongxu: A Deep Dive into New Database Development Trends in 10,000 Words | Yunqi Tech π

Yunqi Capital · February 6, 2023

"Before ChatGPT came out, I always thought there was an element of hype around AI."

"Yunqi Tech π" shares updates from Yunqi Capital's portfolio companies, exploring how cutting-edge technology expands the boundaries of real-world applications and tracking the present and future of tech commercialization. In this edition, we bring you the latest from PingCAP.

This article was written by Edward Huang (Huang Dongxu), co-founder and CTO of PingCAP. Drawing on his firsthand experience in the database industry, he summarizes the major trends in database development over the past year and looks ahead to new directions for 2023, in the hope of offering insights to other industry practitioners.

In his view, the future of databases will inevitably center on "saving people, time, hassle, and money." Under the cloud-native trend, databases will increasingly function not merely as software but as a service — becoming a critical engine for accelerating technological development.

In 2022, we were surrounded by so many tech buzzwords that many enterprises and developers grew increasingly confused about the future of technology.

I spent a considerable amount of time in North American tech circles this year and had some firsthand observations: First, the macro environment has taken a toll on the economy. When times are tough, companies look to cut costs — reducing infrastructure spending, for example — which inevitably affects decision-making. If you're a solutions provider trying to sell a product and your pitch doesn't clearly show how much money you'll save the customer, they probably won't even give you a second look. Second, there's a talent shortage. Put another way, "the barrier to entry for practitioners is dropping." You might have been an infrastructure engineer before, but now you're being pushed to become a full-stack engineer — a jack of all trades, if you will. The threshold for developing business applications needs to be lowered. Everyone now wants to get to market faster with fewer people.

So what trends in database technology address these pain points?

1. For next-generation databases, HTAP is the mandatory technical path. I have a prediction: all databases will eventually be HTAP databases. Pure AP alone is incomplete for business needs — TP is essential.

2. Databases need to be rearchitected in a cloud-native way. It's not just about separating storage and compute, but separating everything that can be separated. Because databases are no longer just software — they're a service. Users don't care what's happening behind the scenes. They want the simplest possible experience, with all infrastructure details hidden away. Minimal cognitive load means minimal onboarding friction and faster time-to-value.

3. Serverless will become the ultimate product form for databases (note: Serverless is not a technology, but a product form).

All Databases Will Eventually Be HTAP Databases

In 2022, the term HTAP gained increasing traction. Although the term was coined by the analyst firm Gartner back in 2014, it remains a relatively new concept to this day, and many still regard it with skepticism. Yet the demand is real. People may not describe it with this specific term, but they are using it. Many are reluctant to make HTAP a proprietary category for now; they simply want to solve business problems with new types of databases or products. I believe HTAP adoption will still take some time, but we're already seeing sparks begin to spread into a prairie fire.

In May 2022, Google Cloud released its latest database product, AlloyDB, with HTAP capabilities as its most prominent highlight. In June, leading Data Cloud vendor Snowflake launched its first HTAP product, Unistore.

At this point, the three major global cloud providers (AWS, Microsoft, and GCP), along with Data Cloud leader Snowflake and database giant Oracle, which is accelerating its cloud transition with MySQL HeatWave, have all released database products with HTAP as a core architectural highlight.

Looking closely at these products, you'll find that MySQL HeatWave, launched in 2021, uses an MPP architecture for analytics, but MySQL itself remains single-node. Google AlloyDB drew inspiration from AWS Aurora's architecture and achieved something even better. The progenitor of the NewSQL branch is Google Spanner, but it is TiDB, which shares the NewSQL architecture, that has made massive ongoing investments in real-time HTAP. TiDB initially solved MySQL's sharding problems, then faced users' real-time analytics demands. With the introduction of TiSpark in 2018, the completion of the TiFlash architecture that closed the HTAP loop in 2020, the MPP capabilities in TiDB 5.0 in 2021, and the delivery of all of this to cloud users through TiDB Cloud, TiDB completed four major leaps in real-time HTAP product capabilities in just five years.

Overall, while specific implementations vary, next-generation HTAP architectures share some clear common pursuits: built on open source, leveraging cloud scalability, seeking a single entry point and unified data stack that enables real-time synchronization between OLTP and OLAP data. Some vendors implement OLAP using approaches similar to MPP pushdown, achieving the "four no's" — no application change, no schema change, no ETL, no data movement — to minimize modifications to existing applications.

Any database trend is the product of three converging forces: "shifting demand × technological change × architectural innovation." HTAP is no exception.

First, on the demand side, next-generation HTAP database vendors consistently use the word "Operation" when discussing HTAP. Using hot data to enable operational-level real-time analytics, gaining real-time insights to support operational feedback loops — this is the biggest demand-side shift driving next-generation HTAP to center stage.

Second, on the technology change and architectural innovation side, the evolution of cloud infrastructure has enabled more thorough separation of storage and compute, opening new possibilities through technological change. The fusion of distributed systems theory with cloud computing and AI algorithms has spawned a new generation of architectural innovations, allowing HTAP in the cloud to support different cloud storage options, AI, and other new technologies to create more cost-competitive innovations.

I believe HTAP truly makes sense in the cloud, because it's all about balance. Only in the cloud can you break the resource imbalance between AP and TP. For TP, you need stable, high-performance, low-latency hardware resources. For AP, you might need massive compute resources for short bursts — to do high-performance AP, you'll find that on-premises, nothing quite fits. Why would I buy such high-spec servers just to run three large full-table scans daily, while the CPU sits underutilized 99% of the time? Cloud-native is entering its next phase. What I observed in North America this year is that nearly every company is undergoing cloud/cloud-native transformation — this isn't up for debate, it's already a done deal.
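The resource imbalance described above can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares keeping analytics servers running all month against renting the same capacity only while the daily scans run. Every number in it (node count, hourly rate, scan duration) is a made-up assumption for illustration, not a real price or a PingCAP figure.

```python
# All figures are hypothetical and exist only to illustrate the arithmetic.
HOURS_PER_MONTH = 730


def always_on_cost(nodes: int, hourly_rate: float) -> float:
    """Cost of keeping `nodes` analytics servers running for the whole month."""
    return nodes * hourly_rate * HOURS_PER_MONTH


def burst_cost(nodes: int, hourly_rate: float, scans_per_day: int,
               hours_per_scan: float, days: int = 30) -> float:
    """Cost of renting the same nodes only while the scans are running."""
    busy_hours = scans_per_day * hours_per_scan * days
    return nodes * hourly_rate * busy_hours


fixed = always_on_cost(nodes=10, hourly_rate=2.0)
elastic = burst_cost(nodes=10, hourly_rate=2.0, scans_per_day=3, hours_per_scan=0.5)
print(f"always-on: ${fixed:,.0f}/month, burst: ${elastic:,.0f}/month")
```

With these assumed inputs the always-on fleet is busy for only 45 of 730 hours, which is why paying per hour of actual use changes the economics of heavy AP workloads.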

Third, this generation of HTAP users is very different from the niche, elite users of the previous generation of in-memory HTAP databases. This generation's users are thoroughly mainstream: essentially any enterprise using the open-source databases MySQL and PostgreSQL can leverage next-generation HTAP architecture to expand the capabilities of both OLTP and OLAP, gaining a database that requires no application modifications and no additional data systems, yet delivers powerful analytics capabilities.

Rearchitecting in a Cloud-Native Way

These days, if you're building a database without offering a cloud service, you're almost embarrassed to say hello to people (soon it will be Serverless). Many people — especially database kernel developers — underestimate the complexity of building a cloud service. The classic refrain: "Isn't it just automated deployment on the cloud?" or "Just support a Kubernetes Operator?"

It's not, and the goal should actually be reversed: what we're building isn't database software, but a database service. Taking a longer view, the latter encompasses the former. This shift in mindset is the first and most important step in building a good database cloud service.

In the past, when we developed programs, different modules saw a homogeneous and deterministic environment. For example, when developing software running on a single machine, different modules might have logical boundaries, but once linked together and running, they still saw only that one computer's fixed resources, where everything is a trade-off. Even with the rise of distributed systems in recent years, classic distributed software was largely an extension of single-machine design thinking, simply connecting multiple computers via RPC. The environment was relatively deterministic, and although many software systems made some adaptations to changes in the underlying environment, such as dynamic scaling and data rebalancing in distributed databases, the essence remained unchanged; only the amount of resources that could be controlled and scheduled had grown.

However, in the cloud, all these assumptions change:

Diverse and virtually unlimited resources are provided through Service APIs, and resource scheduling and allocation can be done through code — this is a revolutionary transformation.
Every resource has a transparent price tag, so optimization shifts from a one-dimensional pursuit of peak performance (since hardware costs are already sunk) to a dynamic problem: getting the most done for the least money.
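A minimal sketch of what "getting the most done for the least money" can look like once resources are allocated through code: given a price list, pick the cheapest way to cover a compute requirement. The `Offer` type, the catalog entries, and their prices are all hypothetical; a real scheduler would also weigh availability, interruption risk, and data locality.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    name: str
    vcpus: int
    hourly_price: float  # hypothetical price in dollars per instance-hour


def cheapest_fleet(offers: list[Offer], required_vcpus: int) -> tuple[str, int, float]:
    """Return (offer name, instance count, hourly cost) of the cheapest option."""
    if not offers:
        raise ValueError("no offers to choose from")
    best = None
    for offer in offers:
        # Round up: partial instances cannot be rented.
        count = math.ceil(required_vcpus / offer.vcpus)
        cost = count * offer.hourly_price
        if best is None or cost < best[2]:
            best = (offer.name, count, cost)
    return best


catalog = [
    Offer("small-on-demand", vcpus=4, hourly_price=0.20),
    Offer("large-on-demand", vcpus=16, hourly_price=0.70),
    Offer("large-spot", vcpus=16, hourly_price=0.25),
]
choice = cheapest_fleet(catalog, required_vcpus=40)
print(choice)
```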

These changed assumptions drive technical changes: a cloud database should first and foremost be a network of autonomous microservices. These microservices don't necessarily run on separate machines—they may physically coexist on one machine—but they must be remotely accessible. They should also be stateless (no side effects), enabling rapid elastic scaling. The implication for developers is clear: abandon your attachment to synchronous semantics. The world is asynchronous and unreliable. I was glad to see my idol, Amazon CTO Werner Vogels, emphasize this very point in his 2022 re:Invent keynote.

What do we gain by giving up the illusion of synchronous, single-machine operation? Let's look at some examples.

First, the much-discussed separation of storage and compute. In the cloud, compute costs far more per unit than storage. If compute and storage are bound together, you can't exploit storage's price advantage. Moreover, certain requests may have compute requirements completely mismatched with a storage node's physical resources—think heavy OLAP requests requiring reshuffles and distributed aggregation. For distributed databases, scaling speed is a critical user experience metric. With storage-compute separation, scaling can in principle become extremely fast: start new compute nodes, warm caches, done. The reverse works equally well.

Second, internal database components can be split out into microservices, for example DDL-as-a-Service. Traditional database DDL impacts online operations even with Online DDL: adding an index inevitably requires backfilling data, which creates jitter for storage nodes serving OLTP workloads. If we examine DDL closely, we see it's global, infrequent, compute-heavy, offline-capable, and idempotent. With a shared storage layer like S3, such modules are perfect candidates to be extracted into serverless services that share data with the OLTP storage engine via S3. The benefits are obvious:

Near-zero performance impact on online workloads
Cost reduction through on-demand execution
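The properties that make DDL a good candidate for extraction (offline-capable and idempotent) can be sketched as a resumable backfill job. This is an illustrative toy, not TiDB's implementation: `BackfillJob`, its in-memory `rows` snapshot, and the checkpoint scheme are assumptions standing in for data shared through object storage.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BackfillJob:
    """Hypothetical offline index backfill over an immutable snapshot of rows."""
    rows: dict[int, str]                              # primary key -> indexed column value
    index: set[tuple[str, int]] = field(default_factory=set)
    checkpoint: int = -1                              # highest primary key already processed

    def run(self, batch_size: int = 2, max_batches: Optional[int] = None) -> bool:
        """Process pending rows in key order; return True once the backfill is complete."""
        pending = sorted(pk for pk in self.rows if pk > self.checkpoint)
        batches = 0
        while pending:
            if max_batches is not None and batches >= max_batches:
                return False  # interrupted, e.g. the worker was shut down
            batch, pending = pending[:batch_size], pending[batch_size:]
            for pk in batch:
                # Adding to a set is idempotent, so replaying a batch is harmless.
                self.index.add((self.rows[pk], pk))
            self.checkpoint = batch[-1]
            batches += 1
        return True


job = BackfillJob(rows={1: "b", 2: "a", 3: "c", 4: "a", 5: "b"})
finished_first = job.run(max_batches=1)  # simulate a worker that stops early
finished_second = job.run()              # a new worker resumes from the checkpoint
index_entries = sorted(job.index)
```

Because the job only reads a snapshot and writes keyed index entries, it can run on separate, disposable compute without touching the nodes that serve online traffic.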

There are many similar examples: logging (low CPU, high storage), LSM-Tree storage engine compaction, data compression, metadata services, connection pools, CDC—all are candidates ripe for extraction. In the new cloud-native version of TiDB, we use Spot Instances for remote compaction of the storage engine, with astonishing cost reductions.
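Remote compaction on Spot Instances works because compaction reads immutable input files and can simply be restarted elsewhere if the instance is reclaimed. The sketch below shows that shape under stated assumptions: `SpotReclaimed`, the worker callables, and the `(key, sequence, value)` entry layout are invented for illustration and do not describe TiDB's storage format.

```python
import heapq
from typing import Callable

Entry = tuple[str, int, str]  # (key, sequence number, value)


class SpotReclaimed(Exception):
    """Raised by a hypothetical worker whose instance was taken back."""


def compact(runs: list[list[Entry]]) -> list[Entry]:
    """Merge runs sorted by (key, sequence), keeping the newest version of each key."""
    merged: list[Entry] = []
    for key, seq, value in heapq.merge(*runs):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, seq, value)  # a later sequence number wins
        else:
            merged.append((key, seq, value))
    return merged


def run_with_retry(task: Callable[[], list[Entry]],
                   workers: list[Callable[[Callable[[], list[Entry]]], list[Entry]]]) -> list[Entry]:
    """Try each worker in turn; the task is safe to restart because inputs are immutable."""
    for worker in workers:
        try:
            return worker(task)
        except SpotReclaimed:
            continue
    raise RuntimeError("no worker available")


def reclaimed_worker(task):
    raise SpotReclaimed()


def healthy_worker(task):
    return task()


runs = [[("a", 1, "old"), ("c", 2, "x")], [("a", 3, "new"), ("b", 4, "y")]]
result = run_with_retry(lambda: compact(runs), [reclaimed_worker, healthy_worker])
```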

Another critical consideration in cloud database design: QoS (Quality of Service). The details include:

Defining write capacity units (WCU) and read capacity units (RCU) as control units; without them, resource allocation, scheduling, and pricing become impossible
Multi-tenancy is mandatory. Tenants must be able to share hardware and even cluster resources; large tenants may also have dedicated resources (single-tenant mode being a specialization of multi-tenancy). The challenges: how to avoid noisy neighbor problems? How to design throttling strategies? How to prevent shared metadata services from being overwhelmed? How to handle extreme hot spots?
There are many more challenges I won't enumerate.
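One common way to turn capacity units into a throttling strategy is a per-tenant token bucket, sketched below. This is a generic technique, not a description of TiDB's QoS implementation; `TenantBucket`, its rates, and the explicit `now` clock argument are assumptions made to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class TenantBucket:
    rate: float        # capacity units refilled per second
    burst: float       # maximum units that can accumulate
    tokens: float      # units currently available
    last: float = 0.0  # time of the last refill, in seconds

    def try_consume(self, units: float, now: float) -> bool:
        """Admit a request costing `units` if the tenant has enough budget left."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if units <= self.tokens:
            self.tokens -= units
            return True
        return False


noisy = TenantBucket(rate=10, burst=20, tokens=20)
quiet = TenantBucket(rate=10, burst=20, tokens=20)
results = [
    noisy.try_consume(15, now=0.0),  # within budget
    noisy.try_consume(15, now=0.0),  # throttled: only 5 units remain
    quiet.try_consume(5, now=0.0),   # unaffected by the noisy tenant
    noisy.try_consume(15, now=1.0),  # admitted again after one second of refill
]
```

Each tenant draws from its own bucket, so a tenant that exhausts its budget is throttled without slowing down its neighbors.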

Another important topic: which cloud services can you depend on? For a third-party vendor, cross-cloud (or even hybrid cloud) product experience is a natural advantage. Tight, deep dependency on specific cloud services sacrifices this flexibility. So choose dependencies carefully. Some principles:

Depend on interfaces and protocols, not implementations. Internal service implementations can be whatever you want, but interfaces exposed to other services should be generic and make minimal assumptions. Simply put: minimize the cognitive burden on the caller—ancient wisdom from the UNIX era. A good example: VPC Peering versus PrivateLink. Following this principle, you'd choose PrivateLink, since VPC Peering tends to expose more details to the consumer.
Follow industry standards where they exist (S3, POSIX filesystems). Every cloud has object storage, and every cloud's object storage API is more or less compatible with the S3 protocol. That's good.
The sole exception is security. If you can't abstract across clouds, don't reinvent the wheel—use whatever the cloud provides. Key management, IAM, etc. Never build your own.
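The first principle, depending on interfaces rather than implementations, can be sketched with a structural interface. The `ObjectStore` protocol, the `InMemoryStore` stand-in, and the key layout in `archive_segment` are all hypothetical names for illustration; the point is only that the caller assumes nothing beyond `put` and `get`.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """The only contract callers rely on: put and get by key."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend; a cloud-specific client exposing the same two methods would also fit."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_segment(store: ObjectStore, segment_id: int, payload: bytes) -> str:
    """Write a segment through the interface, knowing nothing about the backend."""
    key = f"segments/{segment_id:08d}.sst"
    store.put(key, payload)
    return key


store = InMemoryStore()
key = archive_segment(store, 42, b"abc")
```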

A few examples of how Cloud-Native TiDB makes these dependency choices:

Storage

S3. As mentioned, every cloud has S3-compatible object storage. Using tiered storage similar to LSM-Tree in databases enables leveraging different storage tiers through one API: hot data on local disks, cold data on S3, with asynchronous compaction moving data between tiers. This is the foundation of TiDB's storage-compute separation. Only with data on S3 can you unlock operations like remote compaction. The tradeoff: S3's high latency means it can't be on the main read/write path—a cache miss at the upper tier causes severe tail latency. I'm optimistic about this:

If we consider 100% local cache scenarios, we degenerate to classic shared-nothing design. Supporting the most extreme OLTP scenarios is feasible (see current TiKV). The extra cost is merely S3 storage, which is cheap.
With sufficiently fine-grained sharding, cache and hot spot problems become manageable.
Tiered storage can also incorporate EBS (distributed block storage) as a secondary cache to further smooth out latency spikes from local cache misses.
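The read path through these tiers can be sketched as a lookup that falls through from fastest to slowest and promotes data on a hit. Plain dictionaries stand in for local disk, block storage, and object storage here; the tier names and the promote-on-read policy are assumptions for illustration, not TiDB's actual caching logic.

```python
class TieredReader:
    """Looks up a key in tiers ordered fastest to slowest, promoting on a hit."""

    def __init__(self, tiers: list[tuple[str, dict[str, bytes]]]) -> None:
        self.tiers = tiers

    def read(self, key: str) -> tuple[bytes, str]:
        """Return the value and the name of the tier that served it."""
        for depth, (name, tier) in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                # Copy into every faster tier so the next read avoids the slow path.
                for _, faster in self.tiers[:depth]:
                    faster[key] = value
                return value, name
        raise KeyError(key)


local, block, objects = {}, {}, {"sst/001": b"cold data"}
reader = TieredReader([
    ("local-disk", local),
    ("block-storage", block),
    ("object-storage", objects),
])
first = reader.read("sst/001")   # cache miss: falls through to object storage
second = reader.read("sst/001")  # now served from the local tier
```

The first read pays the object-storage latency once; that single slow read is the tail-latency cost the text refers to, and the secondary cache tier exists to make it rarer.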

I noted in a 2020 talk that for cloud-native databases, mastering S3 would be key. That view hasn't changed.

Compute

Containers + Kubernetes. Like S3, every cloud has a Kubernetes service. Kubernetes is the operating system of the cloud, much like Linux. While storage-compute separation simplifies compute management somewhat, tasks like compute resource pool management remain—for example, fast startup for serverless clusters (waking from hibernation). Starting a new pod from zero is too slow; you need reserved resources. Or using Spot Instances for compaction tasks: if a Spot Instance is reclaimed, can you quickly find another machine to continue? Or load balancing and service mesh...

While S3 solves the hardest state problems, scheduling these pure compute nodes is tedious. If you choose to build your own, you'll likely end up reinventing Kubernetes. Better to embrace it directly.

On the cloud, there's another major design question: is the filesystem a good abstraction? This asks at what layer to shield cloud infrastructure. Before S3 became widespread, large distributed storage systems, especially Google's BigTable and Spanner, chose a distributed filesystem as their foundation (I suspect deep Plan 9 influence here, given how many Google infrastructure veterans came from Bell Labs).

So: with S3, do we still need a filesystem abstraction? I haven't fully decided. I'm inclined toward yes, still for caching reasons. With a filesystem layer, you can cache based on file access heat, improving warm-up speed during scaling. Another benefit: better ecosystem tool compatibility. Many UNIX tools can be reused directly, reducing operational complexity.
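As a rough sketch of caching by file access heat, a newly added node could prefetch the most frequently accessed files first. The access log format and the `warmup_plan` helper are hypothetical; a real system would likely decay old accesses and account for file sizes.

```python
from collections import Counter


def warmup_plan(access_log: list[str], cache_slots: int) -> list[str]:
    """Return the files a new node should prefetch, hottest first."""
    heat = Counter(access_log)
    return [path for path, _ in heat.most_common(cache_slots)]


log = ["a.sst", "b.sst", "a.sst", "c.sst", "b.sst", "a.sst"]
plan = warmup_plan(log, cache_slots=2)
```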

In my 2022 PingCAP DevCon keynote, I raised a point: how can cloud databases integrate with modern developer experience? It's an interesting topic. Databases have remained largely unchanged for years—SQL is still king. Yet the applications developers build and the tools they use are vastly different from decades past. As an old programmer from the UNIX era, seeing the dazzling advanced tools and concepts young developers use today, I can only marvel at how each generation surpasses the last. Though SQL remains the standard for data manipulation, can database software do more to integrate with these modern application development experiences?

Serverless will become the ultimate product form of databases

I believe the next phase of cloud-native will become increasingly self-consistent, gradually forming full-stack cloud-native. This full-stack cloud-native will catalyze the development of Serverless. The essence of Serverless is simple: helping developers further hide infrastructure complexity. In summary, nearly all software in the cloud will form a self-consistent Serverless ecosystem.

Serverless—many people treat it as a technical term. I don't think so. Serverless is more important as a user-experience definition of what better cloud software looks like. Or perhaps this should simply be self-evident: why should users care how many nodes you have? Why should they care about internal parameters and configurations? Why, when I click start, do you make me wait half an hour? These things our industry has long taken for granted seem quite absurd on reflection. For example: imagine buying a car and the dealer hands you an engine repair manual, telling you to read it before driving. The car runs slowly, then they tell you some engine parameter needs tuning, and every startup takes half an hour... Isn't that strange?

For Serverless products, the greatest significance from a user experience perspective comes down to three things:

1. It hides configuration, reducing cognitive load on users.

2. It starts extremely fast, which expands use cases and improves usability.

3. It scales to zero, which lowers costs in most scenarios (especially when the workload has clear peaks and valleys that can't be predicted) and can even make small-scale use free.

With these three elements in place, a database can be properly embedded into other application development frameworks — the foundation for building a larger ecosystem.

Beyond Serverless, modern developer experience (DX) includes several other critical elements:

Modern CLI: For developers, the command line is far more efficient than graphical interfaces, and easier to compose with other tools through shell scripts for automation.
Unified cloud-local development / debugging / deployment experience: Nobody wants to touch servers every day. If it can be done locally, don't make people SSH in. Especially for cloud services, how to develop and debug offline remains a market full of pain points.
Example code / demos / scaffolding: A new generation of PLG-oriented service providers — Vercel, Supabase, and company — have gotten very good at this. And it makes sense: for ordinary CRUD applications, the basic code frameworks are all similar. Providing quick-start examples lets developers experience your product's value faster, and helps them build their applications faster.

So, generally speaking, a Serverless database should have four key characteristics:

1. Pay-per-use with transparent pricing. In a Serverless database service, users pay only for the CPU, storage, network bandwidth, disk I/O, and other resources consumed by each transaction or query. No usage means no charges. This saves users substantial costs, especially for applications with unstable or unpredictable workloads. Moreover, Serverless database pricing is fully transparent — users can clearly understand resource consumption and calculate accurate return on investment, or ROI.

2. Extreme elasticity. A Serverless database can automatically match required resources to business complexity, scaling up dramatically in a very short time to handle traffic and load spikes, and scaling down to zero when there's no demand. This ensures applications always have appropriate resources for optimal performance, without over-provisioning.

3. Simplicity. A Serverless database shields users from infrastructure complexity — no resource selection or capacity planning, no worrying about underlying infrastructure management and maintenance. Users are completely freed from tedious resource management work.

4. High availability. When any compute instance, network, or hardware fails, a Serverless database ensures data remains always available and always correct through multi-replica deployment and automatic failover.
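The elasticity characteristic above, including scale-to-zero, can be sketched as a single sizing decision. The function name, the per-unit throughput, the idle timeout, and the unit cap are all assumed values for illustration; a production autoscaler would also smooth the load signal to avoid flapping.

```python
import math


def desired_compute_units(load_rps: float, rps_per_unit: float, idle_seconds: float,
                          idle_timeout: float = 300.0, max_units: int = 64) -> int:
    """Pick how many compute units to run for the current load."""
    if load_rps <= 0:
        # Keep one warm unit briefly, then scale to zero once the tenant stays idle.
        return 0 if idle_seconds >= idle_timeout else 1
    # Round up so capacity never falls below demand, and respect the upper cap.
    return min(max_units, max(1, math.ceil(load_rps / rps_per_unit)))


idle_long = desired_compute_units(load_rps=0, rps_per_unit=100, idle_seconds=600)
idle_short = desired_compute_units(load_rps=0, rps_per_unit=100, idle_seconds=10)
moderate = desired_compute_units(load_rps=250, rps_per_unit=100, idle_seconds=0)
spike = desired_compute_units(load_rps=1_000_000, rps_per_unit=100, idle_seconds=0)
```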

Sounds wonderful. Is it even possible? After nearly a year of work, we finally built our first prototype, TiDB Serverless Tier, and launched it into public beta on November 1st.

I wrote a small program that, in a completely fresh environment, spins up a TiDB Serverless Tier instance through code. Throughout the process, I simply told the program to start a cluster, gave it a name, entered a password, and 20 seconds later I could connect with a MySQL client — a time that will shrink further in the future. Imagine getting that down to three or five seconds: it would dramatically transform application development workflows and experience. And you don't need to worry about scalability — even after going live, when business traffic becomes massive, it can scale up gracefully, and when traffic drops, it scales back down.

There are countless technical details behind this, which I won't go into here. One principle we followed was making the best use of different cloud services — Spot Instances, S3, EBS, elastic Load Balancers. TiDB Serverless Tier integrates all these elastic cloud resources with clever scheduling to deliver an extremely elastic user experience. This represents a further leap beyond the previous generation of cloud-native databases: fewer details, higher abstraction.

I can say with near certainty that for any new database company, if your ticket to entry was cloud-native in the past two years, your ticket this year has become Serverless. Without Serverless, you're basically not even at the table. Serverless databases have now become a domain where public cloud providers and independent database vendors at home and abroad are scrambling to establish positions. AWS, Azure, GCP, Alibaba Cloud, Tencent Cloud, as well as MongoDB, PingCAP, CockroachDB, and Snowflake have all been accelerating their investment in Serverless database services over the past year.

2023 Outlook: New Product Forms, Business Models, R&D Organization, AI Experience

There's still much to be done in database technology going forward. But personally, I see one through-line: staying close to application developers. Whether they are enterprise applications or other kinds, all of these applications are written by developers.

Before ChatGPT emerged, I had always felt AI was overhyped. But ChatGPT genuinely surprised me — it made me feel AI's value. It doesn't directly replace engineers; it enhances their productivity. The integration of such tools with existing internal business operations will be a trend very worth watching.

Essentially, how the industry improves application developer efficiency may be a major direction of development. Perhaps this through-line will evolve to a point where database technology stops looking like database technology in many respects. To build a more usable Serverless database, for example, we rely on a pile of load balancing and elastic computing technologies. I'm even wondering whether SQL is still too complex for application developers, and whether there's a data product form that sits closer to users. TiDB Cloud Serverless Tier recently launched an AI tool, Chat2Query (beta), which lets users generate SQL for data queries in natural language, enabling anyone to easily extract insights from data.

I believe the future of database technology isn't about the technology itself. The ultimate direction is improving every application developer's happiness index. Database technology will inevitably move toward greater simplicity, better usability, and making it easier for everyone to write new applications — accelerating time to market.

To close, here's a partial list of interesting challenges, certainly incomplete, that I hope will be thought-provoking:

1. New product forms. When different tenants' storage engine data all resides on S3, you theoretically unlock a much larger market for data sharing and exchange (imagine Google Docs). Or, S3 plus MVCC could theoretically enable Git-like version control for data — imagine the smooth experience of git checkout, except you're switching database snapshots. I know some cloud database products are already exploring this form, which will create many new application scenarios and unique value.

2. New business models. Cloud is the new computer, but the world probably won't have just a few computers. Beyond standard SaaS models, is it possible to export DBaaS as a whole? This could be an entirely new business model (especially when partnering with second-tier or private cloud providers), where database vendors become vendors of database service products (a bit of a mouthful).

3. New R&D organizations. For a database vendor, R&D and product needs used to be almost entirely about kernel development. But in building cloud services, you're not just developers — you're also operators and business runners. And building cloud services requires a completely different technical stack from database kernel development, which inevitably involves massive organizational transformation and personnel adjustments.

4. Database interaction experience will be completely reshaped by AI. When we combine Serverless, HTAP, and AI, they will fundamentally change the possibilities of how we interact with databases. We can already leverage AI to transform natural language into standard SQL code — anyone, even without much SQL knowledge, can easily perform complex data query and analysis, giving everyone the chance to become a data analyst. This is what the future database looks like, and that day is coming soon.
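The Git-like version control idea in the first item rests on MVCC: every write creates a new version, and a read at an older commit number sees the data as it was then. The toy store below illustrates only that principle; `VersionedStore` and its methods are hypothetical and keep everything in memory rather than on S3.

```python
from typing import Optional


class VersionedStore:
    """Toy multi-version store: writes append versions, reads pick a snapshot."""

    def __init__(self) -> None:
        self._versions = {}  # key -> list of (commit number, value), oldest first
        self._commit = 0

    def put(self, key: str, value: str) -> int:
        """Record a new version and return its commit number."""
        self._commit += 1
        self._versions.setdefault(key, []).append((self._commit, value))
        return self._commit

    def get(self, key: str, at: Optional[int] = None) -> Optional[str]:
        """Read the newest version visible at commit `at` (default: latest)."""
        snapshot = self._commit if at is None else at
        for commit, value in reversed(self._versions.get(key, [])):
            if commit <= snapshot:
                return value
        return None


db = VersionedStore()
v1 = db.put("balance", "100")
v2 = db.put("balance", "80")
db.put("owner", "ann")
old_balance = db.get("balance", at=v1)  # "checkout" of the first snapshot
new_balance = db.get("balance")
owner_at_v2 = db.get("owner", at=v2)    # not yet written at that snapshot
```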

Finally, over the past year, we've seen many Chinese technologies, projects, and developers gain global influence, and I see vast global markets beckoning Chinese enterprises. Going forward, more and more technologies from China will gradually go global, building their own worldwide influence.

For years, Yunqi Capital has remained focused on "technology innovation, industrial empowerment." Open source and infrastructure software are among Yunqi's sustained areas of focus. Yunqi made early lead investments in PingCAP, Zilliz, Jina AI, RisingWave, and other industry-leading companies, and has been invited to share commercial perspectives at multiple world-class industry summits including the Amazon Web Services Summit, China Open Source Conference, VMware Edge & Cloud Open Source Summit, and Huawei Partner & Developer Conference.

As a professional investor in China's open source industry, Yunqi has repeatedly received industry recognition including "Sci-Tech Innovation China" Open Source Innovation awards, and Yunqi's investors are the only ones in the industry recognized as domestic open source pioneers.

As fellow travelers driven by technology, we have always believed in the power of open source and openness. This year, Yunqi once again partnered with "Kaiyuanshe," China's most influential open source community, to officially release the 2022 China Open Source Annual Report. Combining data analysis, survey research, and in-depth study, it paints a comprehensive picture of China's open source landscape — from major events, data, commercialization, and survey perspectives — revealing the current state and trends of China's open source development. The commercialization chapter was written by the Yunqi team.

Follow the Yunqi Capital WeChat account and reply with "2022 Open Source Report" to read the complete Chinese version of the report.