"PingCAP" Huang Dongxu: After GPT-4, ChatGPT + Database = ? | Yunqi Capital ChatGPT Special

Yunqi Capital · March 16, 2023

Draft to website, in just seconds?

After ChatGPT detonated the tech world, people have been debating what the "next step" for AI would be. Fortunately, we didn't have to wait long. In the early hours of March 15, OpenAI released GPT-4. Co-founder Sam Altman called it "the most capable and aligned model yet."

"Yunqi Tech π" continues its ChatGPT special series. This installment is written by Ed Huang, co-founder and CTO of PingCAP, an early Yunqi Capital portfolio company, analyzing the opportunities ChatGPT creates from a database perspective and how ChatGPT is being integrated into TiDB.

➤➤➤ Early this morning, OpenAI released the long-awaited multimodal pre-trained large model GPT-4, billed as the most powerful model to date. Since its debut, ChatGPT has rapidly ignited the entire tech sector in just a few short months.

I started using it the moment it came out, and now I can barely function without it — writing documents, emails, code, basically everything is done with ChatGPT's assistance.

GPT-4 Launch Highlights: Sketch to Website in 10 Seconds

ChatGPT may be the first product of this era to approach AGI (Artificial General Intelligence). Judging by its performance, it's no longer an AI application for some narrow vertical, but a system with general knowledge. What surprised me most is that it actually possesses some capacity for logical reasoning — a level no previous AI system had achieved.

Several application directions based on ChatGPT's capabilities have already become extremely popular:

  • The simplest is text processing, such as translation and summarization. ChatGPT translates exceptionally well, and you can also hand it an article or a block of text and have it generate a summary or extract key features and keywords. This is a very typical current application.
  • The second application is data cleaning. Say you have an Excel or CSV document, or any arbitrary data. In the past, data quality assessment required many clerical staff to manually organize it. Now, you can hand ChatGPT a CSV file and have it directly do things like error correction or data cleaning.
  • The third category is what personally interests me most: developer-facing tools like code generation, or applications like Copilot. There are also applications that help you write articles or emails.

ChatGPT's Major Implications for Database Technology

All of ChatGPT's underlying capabilities are built on data, which represents enormous positive momentum for database and data storage technology. So what new demands does it place on databases? Let's first look at what has historically been most painful about working with data:

First, data storage. In the past, when you were dealing with a bunch of single-machine databases with very limited capabilities — say key-value stores or other storage technologies with restricted workload support — you would easily end up with data silos. Yet data only generates greater value through continuous cross-referencing and interaction. Data silos mean that although data is stored, there's no way to extract more value from it.

Suppose you have a distributed SQL database that conveniently allows you to do correlated cross-analysis across different data segments and different tables. This sounds great, but it creates a challenge: after data is stored, how do you quickly turn that data or insight into an online service? This was historically the job of OLTP databases. For a long time, OLTP and OLAP were completely separate systems. OLAP could certainly do complex join analysis and queries. But you couldn't directly use a data warehouse or OLAP database as an online service to monetize data externally. You still had to move it into a separate OLTP database.

The emergence of HTAP databases solved two problems: one, data can be stored while the underlying layer remains unified, eliminating concerns about data silos; two, these complex queries and data insights can be directly turned into an OLTP or online service for external delivery. ChatGPT would struggle to generate correct SQL on a sharded database — it's like asking a skilled cook to prepare a meal without rice. But it can directly generate SQL to run on an HTAP database, dramatically improving overall efficiency and lowering the barrier to completing such data insight applications.

Second, data monetization. More and more companies now have what's called a data middle platform. CEOs or business people frequently go to data analysts and say, "I want certain data, please run a query for me." But this creates a contradiction: CEOs and managers understand the business but don't know SQL, while data analysts know SQL but don't deeply understand the business. The emergence of AI like ChatGPT is like giving every manager a personal intelligent assistant. You can tell it directly in natural language, "I have a decision to make; please query the database first and show me what the data says about my question." This dramatically lowers the barrier to data monetization, and it may be quite disruptive to data middle platforms and to the data analyst profession.

How We're Integrating ChatGPT into TiDB

For me, ChatGPT's greatest challenge is actually imagination. How should we use it, and how can we amplify its capabilities across different industries as much as possible? This is what I've been thinking about lately.

For infrastructure software companies like ours, ChatGPT's greatest significance is that previously many ordinary people couldn't use this software at all — they had to go through programmers or DBAs to operate it. But now ChatGPT has instantly lowered the database usage barrier to where anyone can use it. As long as you can clearly describe what you want to do, it can help you extract insights from data. From this perspective, ChatGPT is a product form with very far-reaching significance, so we immediately integrated its capabilities into our TiDB Cloud service — launching a natural language to SQL tool called Chat2Query.


Everyone is discussing ChatGPT now, but you could argue that ChatGPT is merely a demo from OpenAI. The underlying model should be a large language model from the GPT-3.5 family. OpenAI exposes these language models to developers worldwide through its API and charges based on usage. Developers can then build their own applications on top of these APIs. In TiDB Cloud we simply call that API and package it behind our product.

Chat2Query became extremely popular upon launch, with many developers building very interesting applications on top of it. On one hand, it helps engineers improve development efficiency by more than an order of magnitude. On the other hand, by combining ChatGPT's ability to automatically write SQL, it transforms the very form of database software itself.

Those familiar with us probably know that last year we built an open source community data analysis platform called OSS Insight. It recently launched a new feature called Data Explorer. You can ask it questions in natural language, and the system automatically gives you answers. For example, you can ask it to generate a summary report for a specific GitHub ID, showing whether that ID typically contributes code, submits issues, or just gives stars.

You could do this through SQL in the past, but the Data Explorer feature lets you ask the question in natural language. It uses OpenAI's API to convert natural language into an equivalent SQL query, which then runs against the TiDB database behind it.
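The flow described above (question in, generated SQL out, SQL run against the database) can be sketched roughly as follows. This is a minimal illustration and not OSS Insight's actual code: `answer_question`, `stub_generate_sql`, and the `github_events` table are hypothetical names, the model call is replaced by a stub that returns a canned query, and an in-memory SQLite database stands in for TiDB so the example runs anywhere.

```python
import sqlite3
from typing import Callable, List, Tuple


def answer_question(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str], str],
) -> List[Tuple]:
    """Turn a question into SQL via `generate_sql`, then run it if it is a SELECT."""
    sql = generate_sql(question).strip()
    # Generated SQL is untrusted, so this sketch only lets SELECT statements through.
    if not sql.lower().startswith("select"):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")
    return conn.execute(sql).fetchall()


def stub_generate_sql(question: str) -> str:
    """Hypothetical stand-in for the model call; always returns a canned query."""
    return (
        "SELECT event_type, COUNT(*) FROM github_events "
        "WHERE actor_login = 'alice' "
        "GROUP BY event_type ORDER BY event_type"
    )


# SQLite stands in for TiDB here (an assumption made purely for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE github_events (actor_login TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO github_events VALUES (?, ?)",
    [
        ("alice", "PushEvent"),
        ("alice", "PushEvent"),
        ("alice", "WatchEvent"),
        ("alice", "IssuesEvent"),
        ("bob", "PushEvent"),
    ],
)

rows = answer_question(conn, "What does alice mostly do on GitHub?", stub_generate_sql)
print(rows)
```

In a real deployment, the stub would be replaced by a call to a language model API, and the SELECT-only check would likely need to be stricter, for example by running the query under a read-only database account.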

Actually, natural language to SQL generation isn't particularly novel — there have been many relevant research papers before. What shocked me most is that ChatGPT isn't a system specifically designed to solve the natural language to SQL conversion problem, yet it produces remarkably good results.

You just need to tell it some simple prompts or hints — like what this table roughly looks like, what rules to pay attention to. Also remind it to use best practices when writing SQL. Through these simple prompts, ChatGPT can generate very high-quality SQL. The SQL it generates actually surpasses many systems specifically designed to solve this class of problems.
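As a rough illustration of the kind of prompt described here, the sketch below assembles a table description, a few rules, and a best-practices reminder into a single piece of text. The function name `build_sql_prompt`, the prompt wording, the example schema, and the rules are all assumptions made for illustration; they are not the prompts Chat2Query actually uses.

```python
from typing import Sequence


def build_sql_prompt(schema: str, question: str, rules: Sequence[str]) -> str:
    """Combine schema, rules, and the user's question into one prompt string."""
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    return (
        "You write SQL for the tables described below. "
        "Follow SQL best practices.\n\n"
        f"Table schema:\n{schema.strip()}\n\n"
        f"Rules:\n{rule_lines}\n\n"
        f"Question: {question.strip()}\n"
        "SQL:"
    )


# Hypothetical schema and rules, used only to show the shape of the prompt.
schema = """
CREATE TABLE github_events (
    actor_login VARCHAR(255),
    event_type  VARCHAR(64),
    created_at  DATETIME
);
"""
rules = [
    "Only generate a single SELECT statement.",
    "Always add a LIMIT clause when the question does not ask for an aggregate.",
]

prompt = build_sql_prompt(schema, "Which users pushed the most code?", rules)
print(prompt)
```

The resulting string would then be sent to the language model, whose reply is expected to be the SQL text that follows the trailing `SQL:` marker.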

And you'll find that the prompt information I give ChatGPT or the OpenAI system doesn't really provide additional information. What does this mean? It's like teaching a child math — you're not teaching them how to solve specific problems, but telling them to read the question carefully and think about which sub-problems need to be solved first. We're just telling the system how to think about problems, and it can then answer your questions with much higher success rates.

In the future, a new profession may emerge: prompt engineering. Using OpenAI's models well takes certain techniques. Many people feel that ChatGPT frequently talks nonsense. In our experience generating SQL from table schema information, we can't simply tell it to generate SQL that answers the question; we need to provide many prompts in the intermediate context.

Let me give a few examples of prompts I've used. First, I have it imagine it's a Python compiler, or a programmer: "Your task below is to rewrite this problem into a Python program." Second, I have it imagine it's a Python interpreter, put the code it just generated into this imaginary interpreter to run, and return the interpreter's results. You'll discover that once you lay out these contextual prompts, its answer accuracy improves dramatically. This is a very interesting example of prompt engineering. Prompt engineering is essentially teaching the model ways of thinking: you just tell the machine "let's think step by step," and accuracy improves significantly.
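The two-step prompting described above can be sketched as a small chain: one prompt asks the model to act as a programmer and write a Python program, and a second asks it to act as a Python interpreter and report what that program prints. Everything named here (`programmer_prompt`, `interpreter_prompt`, `solve_with_program`, and the `ask` callback) is hypothetical, and the model is replaced by a stub with canned replies so the example runs without any API access.

```python
from typing import Callable, List


def programmer_prompt(problem: str) -> str:
    """First step: ask the model to rewrite the problem as a Python program."""
    return (
        "Imagine you are a programmer. Let's think step by step.\n"
        "Your task below is to rewrite this problem into a Python program.\n\n"
        f"Problem: {problem}"
    )


def interpreter_prompt(code: str) -> str:
    """Second step: ask the model to 'run' the generated code and report the output."""
    return (
        "Imagine you are a Python interpreter. Run the following code "
        "and return only what it prints.\n\n"
        f"{code}"
    )


def solve_with_program(problem: str, ask: Callable[[str], str]) -> str:
    """Chain the two prompts; `ask` sends one prompt and returns the model's reply."""
    code = ask(programmer_prompt(problem))
    return ask(interpreter_prompt(code))


# Stub model with canned replies (an assumption, so the sketch is self-contained).
calls: List[str] = []


def stub_ask(prompt: str) -> str:
    calls.append(prompt)
    return "print(17 * 3)" if len(calls) == 1 else "51"


result = solve_with_program("What is 17 times 3?", stub_ask)
print(result)
```

Keeping `ask` as a plain callback means the same chain can be pointed at any model endpoint; only the callback needs to change.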

AI + Serverless + HTAP: A New Milestone in Database Development

For decades, databases had only one form of interaction: writing SQL to perform CRUD. For computer software, every change in how humans interact with software has triggered a massive revolution or transformation. ChatGPT's emergence is like the invention of the internet or of the steam engine: it will inevitably spark a transformation. AI + Serverless + HTAP, fused together, will become a very important milestone in database development. The combination will change the form of database software itself going forward, and even its business model.

When ChatGPT first came out, everyone's first reaction was that it seemed like a search engine. But I don't think so. It's not a search engine; it's more like infrastructure. ChatGPT itself is presented to people as a demo in the form of a chat interface, but what's truly meaningful is the capability packaged beneath these applications. In the coming years, no matter what industry you're in, you need to think about how to integrate with AI. Don't think of it as something only for the IT world. I believe this is an opportunity for every industry.

We must not underestimate the impact large language models will have on human society in the future. This is no longer simply a chatbot; it will profoundly transform all industries. For example, as an assistant to programmers, it can make a very strong programmer 10x or 20x more productive. Since I started using ChatGPT + Copilot, I basically no longer hand-write much code. As long as I describe the problem clearly, it can generate the code, and at most I make some minor adjustments on top. Though it's not 100% correct, it has already saved me 70-80% of my working time. The core point here is that we need to abandon a fixation: the ingrained notion that AI is inferior to humans. For a long time, when people talked about AI, they meant tuning parameters and building recommendation systems. But now that products like ChatGPT can do so much, we can no longer view AI through traditional lenses.

ChatGPT is just the beginning. There will certainly be more and more powerful products emerging on the road to true AGI.