Bolt Picks | How to Evaluate Large Models More Effectively — Evals Are All You Need

线性资本·July 4, 2024

In the pre–large model era, machine learning engineers would often joke that their job was "alchemy." Once the basic model training architecture stabilized, engineers spent more than half their time on cleaning and filtering training data. For applications built on large models, prompt engineers face a similar shift: the core of their work revolves around how to evaluate products efficiently and accurately. Today, drawing on an article by Michael Taylor about...

Evaluation is not only the benchmark for foundation models to climb leaderboards, but also an indispensable part of bringing application products to market.

In the era before large models, machine learning engineers would jokingly call their work "alchemy." Once the basic model training architecture stabilized, engineers spent more than half their time on better cleaning and filtering training data. Similarly, for applications built on large models, prompt engineers center their work around how to efficiently and accurately evaluate products. Today, drawing on Michael Taylor's article about scoring models (click "read original" for the full piece), let's talk about an important part of model training and application — evaluation.

🎙️ Why is evaluation so hard?

AI model outputs are often non-deterministic; even with identical inputs, they may produce different results. Therefore, establishing reliable evaluation metrics is critical. These metrics not only help us understand how a model performs on specific tasks, but also provide crucial data support during model updates and optimization, ensuring that AI models meet business objectives and deliver high-quality results.

Main Evaluation Methods

  1. Programmatic Evaluations:
    • Pros: Fast and low-cost, suitable for questions with clear answers, such as comparing whether an AI-generated response matches a reference answer.
    • Cons: Poor performance on complex tasks; struggles with subjective or ambiguous criteria.

  1. Synthetic Evaluations:

    • Pros: Uses AI to evaluate AI-generated answers; lower cost and faster than human evaluation.
    • Cons: Less accurate than human evaluation; slower and more expensive than programmatic evaluation.
  2. Human Evaluations:

    • Pros: High accuracy, especially suitable for mission-critical tasks.
    • Cons: High cost and slow speed; requires significant human resources.

🎙️ Challenges in Implementing Evaluation

  • Reliability: The non-deterministic nature of AI model outputs leads to inconsistent results, requiring multiple tests to assess stability.
  • Vendor Selection: Different AI vendors offer varying model performance and costs, requiring careful trade-offs.
  • Hallucination: AI models sometimes generate false information. Detecting and reducing such "hallucinations" remains a challenge.

🎙️ Practical Applications of Evaluation

  • Optimizing generated content: Using programmatic evaluation to check whether AI-generated text meets length requirements, or using synthetic evaluation to compare output quality across different models.
  • Reducing hallucinations: For example, in ad generation, using programmatic evaluation to detect whether new attributes have been generated that don't match product descriptions.
  • Improving customer experience: For example, in marketing email generation, human evaluation can ensure content accuracy and personalization.

Thoughts

For AI developers and entrepreneurs, building and using a robust evaluation system is key to ensuring efficient AI model operation. By combining programmatic, synthetic, and human evaluation methods, we can comprehensively improve model reliability and quality.

As open-source models mature, for application developers, figuring out how to use evaluation to smoothly align foundation model capabilities with application scenarios is a core element that's easily overlooked in the rush to ship. This applies not only to prompt engineers like the author, but also to model fine-tuning and alignment — effective evaluation in these areas is also an indispensable part of directly impacting user experience. Startups like HumanLoop and Vellum, which have positioned themselves in the evaluation space, have begun helping clients like Duolingo and Redfin deploy with greater confidence at million-user scale. Using tools and services is one option; the core is still about solving the key problem. Efficient and accurate evaluation is the "last mile" in delivering an exceptional experience to consumer product users.

You can't improve what you don't measure. Bolt encourages every AI developer and entrepreneur to actively explore and apply advanced evaluation methods and tools, and build products that stay close to user needs. (My WeChat: zoey_jingyi)

Bolt Community

If you're an explorer of the AI wave who shares Bolt's perspective, and you'd like to discuss with like-minded peers, welcome to scan the QR code below to briefly introduce yourself. After review, we'll invite you to join the discussion.

Linear Bolt Bolt is Linear Capital's investment program dedicated to early-stage, globally-oriented AI applications. It upholds Linear's investment philosophy and principles, focusing on projects driven by technological change, with the goal of helping founders find the shortest path to their objectives. Whether in speed of action or investment approach, Bolt's commitment is lighter, faster, and more flexible.