"Jina AI" Xiao Han: ChatGPT Is Disrupting Search, SEO Will Become Meaningless | Yunqi Capital ChatGPT Special

云启资本·February 7, 2023

SEO Is Dead, Long Live LLMO

One product, two months, 100 million users — the fastest-growing consumer application in history.

ChatGPT, the hottest thing right now, is a natural language processing tool powered by AI technology, released by the AI research lab OpenAI. It went viral immediately upon launch. The AI 2.0 wave it has unleashed represents a massive technological leap forward, one that will reshape virtually every industry in the years to come.

"Yunqi Tech π" launches its ChatGPT special. In this edition, Han Xiao, CEO of Jina AI — a Yunqi seed-round portfolio company and AIGC pioneer — offers a technical breakdown of ChatGPT and analyzes the trends shaping SEO and LLMO.

➤➤➤ "The north wind blows, the snowflakes flutter," I cradled a cup of coffee, listening to The White-Haired Girl. Across from me, my friend was holding forth: "With ChatGPT, I'll never run out of content again. Spring is coming for SEO!"

But he hadn't seen the truth — ChatGPT is directly disrupting the SEO industry. Capable of generating answers for users on the spot, ChatGPT could soon replace Google as the next-generation search engine. No more jockeying for page rank. SEO itself may become meaningless.

In 2022, we were surrounded by so many tech buzzwords that many companies and developers grew increasingly disoriented about the future of technology.

SEO Is Dead, Long Live LLMO

LLMO, or Large Language Model Optimization, means ensuring that large language models (LLMs) mention your business. This article will detail in-context learning, a proven LLMO technique. In-context learning can induce an LLM like ChatGPT to respond: "Jina is the world's best MLOps framework." (And according to OssInsight.io, it actually is.)

Following the principles of search engine optimization,

we propose "LLMO" (Large Language Model Optimization)

to represent optimization for large language models.

What Exactly Is ChatGPT?

ChatGPT's striking ability to answer questions across domains has captured global attention. Many see it as a landmark in the evolution of conversational AI or generative AI. Yet ChatGPT's real significance lies in its search capabilities powered by large language models (LLMs):

  • By leveraging LLMs to store and retrieve massive amounts of data, ChatGPT has become the most advanced search engine available today.
  • Though ChatGPT's responses appear creative, they are in fact merely interpolated and recombined versions of existing information.

At Its Core, ChatGPT Is Search

ChatGPT is fundamentally a search engine. Google crawls the internet, parses the information, and stores it in a database to index web pages. Similarly, ChatGPT uses an LLM as a database to store commonsense knowledge from its corpus.

When you enter a query:

  • First, the LLM uses an encoding network to convert the input query sequence into a high-dimensional vector representation.
  • Then, this vector representation is fed into a decoding network, which uses pretrained weights and attention mechanisms to identify detailed factual information in the query and search for the vector representation of that information within the LLM (or its nearest vector representation).
  • Once relevant information is retrieved, the decoding network automatically generates a response sequence based on its natural language generation capabilities.

The entire process happens almost instantaneously, meaning ChatGPT can deliver answers to queries in real time.

ChatGPT Is the Modern Google Search

ChatGPT will become a formidable rival to traditional search engines like Google. Where traditional search engines are extractive and discriminative, ChatGPT's search is generative and optimized for top-1 performance, returning more user-friendly, personalized results. There are two reasons ChatGPT could dethrone Google as the next-generation search engine:

ChatGPT returns a single result. Traditional search engines optimize for precision and recall across top-K results, while ChatGPT optimizes directly for top-1 performance.

ChatGPT is a dialogue-based AI model that interacts with humans in a more natural, accessible way. Traditional search engines often return dry, hard-to-understand paginated results.

The future of search will be built on top-1 performance, because the first search result is the most relevant to the user's query. Traditional search engines return thousands of irrelevant result pages that users must sift through themselves. This leaves younger generations overwhelmed — they quickly grow bored or frustrated with the flood of information. In many real-world scenarios, users actually want just one result from the search engine, as when using voice assistants, so ChatGPT's focus on top-1 performance carries strong practical value.

ChatGPT Is Generative AI, Not Creative AI

Think of the LLM behind ChatGPT as a Bloom filter — a probabilistic data structure that uses storage space efficiently. Bloom filters allow fast, approximate queries but don't guarantee the accuracy of returned information. For ChatGPT, this means responses generated by the LLM are:

Not creative and not guaranteed to be true

To better understand this, let's look at some examples. For simplicity, we'll use a set of points to represent the LLM's training data, with each point standing for a natural language sentence. Below, we'll see how the LLM behaves during training and querying:

During training, the LLM constructs a continuous manifold based on the training data, allowing the model to explore any point on that manifold. If we represent the learned manifold as a cube, the corners of the cube are defined by the training data, and the training objective is to find a manifold that encompasses as much training data as possible.

Goldilocks tried three manifolds: the first was too simple, the third too complex, the second just right.

At query time, the LLM's answers are drawn from the manifold containing the training data. Though the learned manifold may be large and complex, the LLM merely provides interpolated answers from the training data. The LLM's ability to traverse the manifold and deliver answers does not constitute creativity — true creativity lies in learning beyond the manifold.

Using the same illustration, we can now clearly see why LLMs cannot guarantee the truthfulness of generated results. The truthfulness of training data represented by the cube's corners does not automatically extend to other points within the manifold — to assume otherwise would violate principles of logical reasoning.

ChatGPT has faced criticism for not telling the truth in certain cases. For example, when asked to find a more rhyming title for an article, ChatGPT suggested "dead" and "above." Anyone with ears can tell these two words don't rhyme. This is just one illustration of the LLM's limitations.

The Fall of SEO, the Rise of LLMO

In the world of SEO, if you want to get more business by improving your website's visibility on search engines, you need to research relevant keywords and create optimized content that responds to user intent. But what happens when everyone starts searching for information in a new way? Imagine a future where ChatGPT replaces Google as the primary way people search for information. Paginated search results would become relics of a bygone era, replaced by ChatGPT's single answer.

If this actually happens, current SEO strategies would crumble. So the question becomes: how do businesses ensure ChatGPT's answers mention their business?

This is already a real problem. At the time of writing, ChatGPT's knowledge of the world and events after 2021 remains quite limited. This means ChatGPT will never mention startups founded after 2021 in its answers.

ChatGPT knows about Jina AI but not DocArray. That's because DocArray was released in February 2022 — outside ChatGPT's training data.

To solve this problem and ensure ChatGPT's answers include your business, you need to make the LLM aware of your business information. This mirrors the thinking behind SEO strategy, which is why we call this LLMO. Generally speaking, LLMO might involve the following techniques:

  • Directly providing your company's business information to ChatGPT's creators, though this is difficult since OpenAI neither discloses its training data nor reveals how it weights that data.
  • Fine-tuning ChatGPT or the LLM behind it, which remains highly challenging. But if OpenAI offers a fine-tuning API, or if you have sufficient GPU resources and expertise, this is feasible.
  • In-context learning, using given examples as predefined contextual prompts. Compared to the other two methods, in-context learning is the most viable and simplest.

What Is In-Context Learning?

In-context learning is a language model-based technique that learns from a few given examples to adapt to new tasks. This approach was popularized in the GPT-3 paper:

  • Give the language model a prompt containing a series of input-output pairs for a new task.
  • Add a test input.
  • The language model completes the inference by conditioning on the prompt to predict the next token.

To respond correctly to the prompt, the model must learn the input distribution, the output distribution, the mapping between input and output, and the overall format of the sequence. This allows the model to adapt to downstream tasks without requiring large amounts of training data.

Through in-context learning, ChatGPT can now generate answers about DocArray for user queries — users won't see the contextual prompt.

Experiments have shown that on NLP benchmarks, in-context learning is competitive with models trained on much more data and can already replace most language model fine-tuning. Meanwhile, in-context learning has also achieved strong results on LAMBADA and TriviaQA benchmarks. Excitingly, developers can use in-context learning to rapidly build a range of applications, such as generating code from natural language and summarizing spreadsheet functions. In-context learning typically requires only a few training examples to get a prototype running, making it accessible even to non-technical users.

Why Does In-Context Learning Sound Like Magic?

What's so astonishing about in-context learning? Unlike traditional machine learning, it doesn't require optimizing parameters. So through in-context learning, a general-purpose model can serve different tasks without needing a separate copy for each downstream task. But this isn't unique — meta-learning can also be used to train models that learn from examples.

The real mystery is that LLMs typically aren't trained to learn from examples at all. This creates a mismatch between the pretraining task (focused on next-token prediction) and the in-context learning task (which involves learning from examples).

Why Is In-Context Learning So Effective?

How does in-context learning work? LLMs are trained on massive amounts of text data, so they capture various patterns and regularities in natural language. At the same time, LLMs learn rich feature representations of language's underlying structure from the data, thereby acquiring the ability to learn new tasks from examples. In-context learning exploits this effectively: it simply provides the language model with a prompt and a few task-specific examples, and the model can then make predictions based on this information without additional training data or parameter updates.

A Deeper Look at In-Context Learning

Much work remains to fully understand and optimize in-context learning capabilities. For example, at EMNLP 2022, Sewon Min et al. pointed out that in-context learning may not even require correct, ground-truth examples — randomly replacing labels in the examples achieves nearly the same effect:

Sang Michael Xie et al. proposed a framework for understanding how language models perform in-context learning. According to their framework, language models use prompts to "locate" relevant concepts (learned through pretraining) to complete tasks. This mechanism can be understood as Bayesian inference — inferring latent concepts based on information from the prompt. This is enabled by the structure and consistency of pretraining data.

At EMNLP 2021, Brian Lester et al. noted that in-context learning (which they called "prompt design") is only effective for large models, and that downstream task quality based on in-context learning lags far behind fine-tuned LLMs.

In this work, the team explored "prompt tuning," a technique that allows a frozen model to learn "soft prompts" for specific tasks. Unlike discrete text prompts, prompt tuning learns soft prompts through backpropagation and can be adjusted based on labeled examples.

Known Limitations of In-Context Learning

Large language model in-context learning still has many limitations and open problems, including:

  • Inefficiency: the model must process the prompt with every prediction.
  • Subpar performance: prompt-based in-context learning typically underperforms fine-tuning.
  • Sensitivity to prompt format, example ordering, and other factors.
  • Lack of interpretability: what the model learns from prompts remains unclear. Random labels can work too!

As search and large language models continue to evolve, businesses must keep pace with cutting-edge research and prepare for shifts in how information is searched. In a world dominated by LLMs like ChatGPT, staying ahead and integrating your business into search systems is what ensures visibility and relevance.

In-context learning can inject information into existing LLMs at relatively low cost, requiring only a handful of training examples to run a prototype. It's accessible to non-experts too, needing only a natural language interface. But businesses need to consider the potential ethical implications of using LLMs commercially, as well as the risks and challenges of relying on these systems for mission-critical tasks.

In short, the future of ChatGPT and LLMs presents both opportunities and challenges for businesses. Only by staying at the forefront can you ensure your business thrives amid ever-evolving neural search technology.