Bilibili's Hottest RAG App! How Zilliz Helps LLMs Become a "Walking Encyclopedia of Chinese History" | Yunqi Tech Talk

云启资本·March 20, 2024·2·1

Hello, Mr History. Hello, Zilliz Pipeline.

Over the past year, we've become quite familiar with AI chatbots. Though they can handle most questions fluently, "hallucinations" and inconsistent corpus quality often lead to irrelevant or fabricated responses.

So how can we make AI into a professional, trustworthy expert consultant? Recently, the well-known tech creator "Ele Lab" released a video documenting the entire development process of a RAG application called Mr History. The video was tagged as "trending" for three consecutive days and quickly climbed to the top of the platform's tech category rankings. To date, it has garnered nearly 700,000 views, with an unusually active comment section where many tech enthusiasts and developers raised technical implementation questions.

Following the rapid development of large language models (LLMs) in 2023, the industry has increasingly discussed RAG (Retrieval-Augmented Generation) technology. As the technical partner for this video, Zilliz explains how RAG cleverly improves retrieval accuracy in current large model applications. In 2017, Yunqi Capital led Zilliz's angel round and has continued to invest in subsequent rounds. As a crucial middleware layer for large models, Zilliz has become a standout in the open-source infrastructure software space and a key driver in advancing RAG technology.

01. Background on the Mr History RAG Application

Following the rapid development of large language models (LLMs) in 2023, the industry has increasingly discussed Retrieval-Augmented Generation (RAG) technology. The key to this technique lies in combining two important elements: retrieval and generation. First, it searches external information sources (such as webpages or databases) to gather information relevant to the question. Then it skillfully incorporates this information into its response, generating a more accurate, reality-grounded answer.

This means that whether your question concerns breaking news, domain-specific knowledge, or anything else, RAG can provide more comprehensive, in-depth answers. RAG discussions have focused mainly on enterprise internal knowledge Q&A, but as a technology, we wanted to explore something at the intersection with the humanities — using RAG for historical source Q&A. The goals were twofold: to explore practical application scenarios for current technology, and to examine the specific challenges RAG faces in deployment.

The Twenty-Four Histories are a collection of twenty-four historical texts spanning from the Yellow Emperor to the Chongzhen era of the Ming Dynasty (1644). Building a Q&A project around them seemed intriguing — after all, history enthusiasts often know the broad strokes but want to dig into finer details. Some might suggest simply looking up the relevant biographies, but in annalistic-biographical format, a figure's biography doesn't capture their full story; their deeds may appear as secondary details scattered across other people's biographies. This is precisely where RAG proves useful: finding clues dispersed throughout the entire corpus and reasoning toward an answer.

An initial look at the Twenty-Four Histories data revealed several important characteristics:

Classical Chinese heavily omits subjects. For instance, "The Duke of Gaogui assumed the throne and was bestowed the title of Marquis Within the Pass" omits the protagonist Zhong Hui.
Classical Chinese typically uses only given names. In "Biao considered their shared imperial lineage," the model must infer that "Biao" refers to Liu Biao for proper retrieval.
Most modern embedding models are aligned on modern Chinese-to-modern Chinese; modern Chinese-to-classical Chinese alignment training is lacking.

Based on this preliminary analysis, we decided to start with modern Chinese translations. The goal and method were clear: the community already has abundant RAG application tutorials. Using LlamaIndex (or any other RAG framework like LangChain), import documents into the vector database Milvus (https://milvus.io/) or Zilliz Cloud (https://zilliz.com.cn/cloud), connect to the ChatGPT API as the query engine, and you have a textbook RAG application for historical sources. The initial results:

Clearly, the current RAG system produced a surprising answer. After completing a standard RAG implementation, we were unsatisfied with the final results. Of course, evaluating RAG quality is typically a complex systems engineering problem, especially without labeled data, so we adopted a case-by-case analysis approach for optimization. Following Occam's razor, if our case-specific optimization is elegant and simple, it should improve most other cases as well. We started by trying to address this particular issue.

02. Improving Embedding Detail Capture

In RAG, text is first split according to chunk size, with each chunk converted to a vector for retrieval. Typical historical questions usually involve (person) (time) (location) and actions, so we want retrieved corpus content to have high semantic overlap with the question on these points.

The embedding model we used (BAAI/bge-base-zh-v1.5) is popular in the open-source community, trained on large-scale Chinese datasets, and strikes a good balance between effectiveness and performance. The first possible cause to analyze was that the chunk size was set too large, causing the embedding to poorly capture details. So we adopted LlamaIndex's SentenceWindowNode, computing embeddings by sentence but returning a broader contextual window to the LLM for reading. This allows the LLM to find more clues when analyzing information. Vector retrieval did show improved relevance for details, but it also included other content about Guan Yu and even irrelevant material.

03. Reranking to Further Improve Retrieved Text Relevance

Even with SentenceWindowNode significantly improving embedding detail capture, suboptimal ranking still occurred. This likely stems from the embedding model being trained on large-scale general corpora. Typical improvements would involve manually labeling data to fine-tune the embedding model, but we wanted to use more general-purpose models instead.

We used a cross-encoder-based rerank model (BAAI/bge-reranker-large) to compute text relevance. The rerank model generates scores directly from the input (question, document fragment), so it receives information from both simultaneously. To use an analogy, embedding retrieval is like screening resumes for a general impression; reranking is like conducting one-on-one interviews — so it typically further improves result relevance. With this technique, we found all texts about Guan Yu killing others, though this also included information about Guan Yu being killed by Wu.

04. LLM Selection

After providing the highest-quality historical source information possible, we reached the final reading comprehension stage with the large model. We initially used gpt-35-turbo-1106 and found its performance on this problem less than ideal (possibly because the corpus consisted of fragmented passages), with hallucinations appearing very easily. After some prompt engineering (e.g., instructing it to faithfully reference the original text), we still couldn't achieve the desired results. Fortunately, OpenAI released the cheaper gpt4-turbo-1205 at year-end, which showed significant improvement in both format adherence and hallucination suppression. We selected gpt4-turbo as the final reader.

05. Adding Citations to Passages

As a historical source RAG application, we wanted to provide not just the original text containing the knowledge but also the specific biography name in the source. Since the corpus format distinguishes biography names from main text, simple rules could extract each passage's corresponding biography name, which we then added as text metadata. We could control this metadata so that only the biography name was shown to the LLM for source labeling, without affecting the embedding and rerank stages — controlling it to appear only in the LLM's prompt.

06. Summary

Now we can produce a RAG system that initially meets expectations — one that captures fine-grained, highly relevant historical sources and accurately cites their corresponding origins. Meanwhile, the stronger LLM model can extract knowledge helpful for answering questions from complex historical fragments.

In this project, we first analyzed the data characteristics and adopted a text approach better suited to existing models. The subsequent techniques we employed focused primarily on improving the relevance between retrieved text and questions, using the trick of making the text for embedding different from the text presented to the LLM — achieving improved retrieval precision, sufficient context, and citation information. We used general-purpose techniques as much as possible. For the historical RAG scenario specifically, we could also identify "person names" for fragment filtering, or additionally analyze the dynasty and use that information to filter retrieved fragments.

Ultimately, it comes down to this: given a query, how do we find the most relevant knowledge as effectively as possible, without wasting prompt space in the LLM. Having said all this, we must include the project address for interested folks — feel free to clone and try it out: https://github.com/wxywb/history_rag.

Additionally, during community use, some non-programmer users encountered insufficient computing power and database deployment issues with history_rag. Currently, besides the Milvus mode requiring local database and vector model deployment, history_rag also supports Zilliz Pipeline mode.

Zilliz Pipeline is a cloud-hosted service launched by Zilliz Cloud that converts text to vectors and inserts them into a cloud database. Combined with other cloud-deployed LLMs, this allows users to forgo high-performance computing devices and avoid complex database management. This solution also decouples from specific LLMs, making it convenient to use various models — the community has contributed a Qwen option that can be switched to with one click.

So if you're passionate about history, you don't need programming or computer skills — just head to GitHub and follow the documentation.

Related Links:

LlamaIndex: https://github.com/run-llama/llama_index
Milvus: https://github.com/milvus-io/milvus
Zilliz Cloud Pipelines: https://zilliz.com/zilliz-cloud-pipelines
history_rag: https://github.com/wxywb/history_rag