Moonshot AI Releases Visual Reasoning Model k1, Outperforming OpenAI o1 in STEM Benchmarks | Z News

真格基金·December 16, 2024

Every pixel deserves deep thought.

Today, Moonshot AI released its visual reasoning model k1, now live on the latest versions of its app and web client.

Built on reinforcement learning, the k1 model offers native end-to-end image understanding and chain-of-thought reasoning, extending its capabilities beyond mathematics into broader foundational sciences.

ZhenFund invested in Moonshot AI at the angel round in 2023. Since Kimi's launch in October 2023, the company has continuously upgraded its foundation model capabilities and expanded its product features and interactions. Monthly active users grew from tens of thousands to tens of millions in just one year — an astonishing roughly 1000x increase. We look forward to co-creating intelligence with Kimi and more users in the future.

"Seeing comes before words. The child looks and recognizes before it can speak."

Ways of Seeing, John Berger

Today, Moonshot AI released its visual reasoning model k1. Built on reinforcement learning, the k1 model offers native end-to-end image understanding and chain-of-thought reasoning, extending its capabilities beyond mathematics into broader foundational sciences. In benchmark tests for foundational sciences including mathematics, physics, and chemistry, the first-generation k1 model outperformed global benchmark models OpenAI o1, GPT-4o, and Claude 3.5 Sonnet.

Kimi's new model is available immediately upon release. The k1 visual reasoning model has been gradually rolling out on the latest versions of the "Kimi AI Assistant" Android and iPhone apps, as well as the web version at kimi.com. Find "Kimi Visual Reasoning" in the latest mobile app or the Kimi+ page on the web version to snap a photo or upload an image and try it out.

"Kimi Visual Reasoning" displays the full reasoning chain of thought (CoT), so users don't just see the answer — they can see the model's entire thought process as it works toward the solution.

01

Scaling Reinforcement Learning

The technical paradigm of the AI large model industry is shifting — we are now in a new era of reinforcement learning. Just as humans don't immediately blurt out answers when faced with difficult problems, but instead analyze the problem, explore different solutions, try various approaches, reflect, refine strategies, and iterate — the new generation of reinforcement learning-based models significantly improves success rates on more complex, harder tasks by incentivizing the model to generate more detailed reasoning steps, forming high-quality chains of thought (CoT). Data shows that with the boost of reinforcement learning techniques, the intelligence level of AI large models is climbing steadily, already reaching or exceeding human expert levels in many domains.

We call this new generation of reinforcement learning-based models "reasoning models." Previously, reasoning models were primarily trained on text data. But vast amounts of information in foundational sciences are embedded in images and diagrams. How to more tightly integrate reasoning capabilities with image understanding while delivering a smoother user experience is a major technical challenge.

The k1 visual reasoning model released and launched today achieves true end-to-end image understanding and reasoning. The model can directly process image information input by users and reason to arrive at answers, without relying on external OCR or additional vision models for information processing — delivering new heights in both user experience and performance.

From a model training perspective, k1 visual reasoning model training consists of two stages: first, pre-training to obtain a base model, followed by reinforcement learning post-training on top of that base model. The k1 base model was specifically optimized for character recognition capabilities, achieving a current state-of-the-art result of 903 points on OCRBench, with scores of 69.1, 66.7, and 96.9 on the MathVista-testmini, MMMU-val, and DocVQA benchmark datasets respectively — placing it in the top tier globally.

k1's reinforcement learning post-training made further optimizations in data quality and learning efficiency, achieving new breakthroughs in the scaling of reinforcement learning — this is the most critical reason the k1 visual reasoning model achieved industry-leading results in benchmark tests.

Additionally, scientific model capability benchmarking is one of the important challenges facing the large model industry. Due to the lack of graphics-based test sets for foundational science subjects on the market, the Kimi model R&D team independently constructed a standardized test set called Science Vista, covering physics, chemistry, and mathematics problems at varying difficulty levels, with a distribution that reasonably matches actual user needs. This test set will be open to the entire industry; click the link to apply for use within licensed scope: https://forms.gle/uWFFWfMVkzsRNsNv9.

In internal testing, we also identified some limitations of the k1 visual reasoning model, such as significant room for improvement in out-of-distribution generalization, success rates on more complex problems, accuracy in noisier scenarios, and multi-turn conversation effects. In some scenarios and generalization capabilities, the k1 model still lags behind OpenAI's o1 series models.

02

Generalizing Foundational Science Capabilities

A month ago, Kimi introduced the k0-math model with mathematical capabilities benchmarked against OpenAI's o1 series. The model demonstrated exceptional mathematical ability, but because it primarily supported text inputs in formats like LaTeX, it struggled with certain geometry problems that depend on graphical understanding.

The new k1 visual reasoning model, leveraging its end-to-end image understanding capabilities, unlocks more comprehensive mathematical abilities including geometry problems. In benchmark tests for geometry and graphics problems at various educational stages, k1-preview achieved results matching or exceeding OpenAI's o1 model.

Let's start with a standard geometry problem to give the k1 visual reasoning model a quick test:

Scroll to view the full chain of thought (CoT); long press to download

Beyond mathematics, the k1 visual reasoning model extends its capabilities to physics, chemistry, and other fields. In physics and chemistry benchmark tests at the foundational science education level, the k1 model's performance is similarly competitive with global leaders OpenAI and Anthropic.

Let's look at an example of the k1 visual reasoning model solving a classic physics circuit problem:

Scroll to view the full chain of thought (CoT); long press to download

Next, let's look at an example of the k1 visual reasoning model analyzing a chemistry technical principle. We took a technical schematic that had won a Nobel Prize in Chemistry and removed most of the text labels, leaving only the two letters "QD," to see how k1 would step by step analyze what principle the diagram illustrates.

Scroll to view the full chain of thought (CoT); long press to download

03

Keen Eye

In real-world scenarios, image information users input to models often contains various types of noise, including dark photos, blurry images, multiple problems photographed together, handwritten text interference, purely handwritten problems, and tilted camera angles.

We specifically tested the k1 model's performance in scenarios closer to real-world usage. In "noisy" scenarios, benchmark test data across multiple foundational science subjects shows that the k1 model has a more significant lead over OpenAI and Anthropic's vision-language models. While most other models see their capabilities drop by more than half in visually noisy scenarios, k1 maintains the lowest capability loss thanks to its superior visual recognition ability.

Let's look at an example.

The image below shows formulas manually worked out by a colleague on a tablet.

Here's how the k1 model step by step analyzed the author's intent.

Scroll to view the full chain of thought (CoT); long press to download

03 Snap What You Don't Understand

When we released the mathematical reasoning model k0-math, we discovered that Kimi Math wasn't just good at math problems — it could also use mathematical thinking to explain anything. For example, if you asked "How much is a heart worth?", Kimi Math might, after some analysis, give us a surprising answer: a heart is priceless.

After learning to tackle more problems across algebra, geometry, physics, chemistry, biology, and other foundational sciences, we found that the k1 visual reasoning model similarly exhibits more general-purpose emergent capabilities.

For example, we found a photo of a scientist's manuscript online and sent it to Kimi Visual Reasoning to see how it would step by step, engagingly reason through the manuscript's content and backstory:

Me:

Step by step, deduce who wrote these notes and what they're specifically about

Kimi Visual Reasoning:

Alright, let's solve this puzzle step by step. I have an image of ancient notes in front of me, and I need to figure out who wrote them and what they contain. This really is like an adventure through history and text — exciting!

First, I carefully observe the image. It's a yellowed sheet of paper covered densely with ink-written text, arranged quite neatly. There are also some hand-drawn illustrations in places, possibly star charts or similar diagrams. Overall, this appears to be a scientific or astronomical manuscript.

The notes begin with large lettering: "Ser.mo Principe." — Italian for "Most Serene Prince." This indicates the author is addressing some eminent figure, likely a ruler or noble of the time. The next few lines mention "Galileo Galilei" — undoubtedly an important clue. Galileo Galilei was a renowned Italian astronomer, physicist, and engineer, known as the "father of modern observational astronomy." This makes me increasingly certain that these notes are closely connected to Galileo.

Continuing to read, the notes mention "Accademia dei Lincei," one of Italy's oldest scientific academies, of which Galileo was a founding member. This further confirms the scientific background of these notes. The notes also mention "Padova" — Padua, a city in northeastern Italy where Galileo once served as professor. These references to places and institutions provide crucial clues for identifying the author and content.

Next, I notice some dates in the notes, such as "Anno 1610" — the year 1610. This was a significant year for Galileo, as it was when he published Sidereus Nuncius (The Starry Messenger), his collection of astronomical discoveries made through the telescope. The notes also mention "Siderius Nuncius," the Latin title of The Starry Messenger. This convinces me further that these notes are a letter Galileo wrote in 1610 to some eminent figure, concerning his astronomical observations.

Looking more carefully at the illustrations in the notes, I can see some star charts and symbols. These diagrams are closely related to the lunar surface, Jupiter's moons, and other astronomical phenomena Galileo observed through his telescope. These observations caused a tremendous stir at the time, because they supported Copernicus's heliocentric theory — that the Earth and other planets orbit the Sun.

To summarize, these notes are very likely a letter from Galileo Galilei written in 1610 to some eminent figure, concerning his astronomical observations, particularly the discoveries he published in The Starry Messenger. These notes hold not only significant scientific value but also serve as important historical testimony to Galileo's personal story.

To verify my hypothesis, I decide to delve deeper into Galileo's life and work in 1610. In 1610, Galileo was a professor of mathematics at the University of Padua and had just begun using the telescope for astronomical observations. He discovered four major moons of Jupiter, a finding that was crucial for supporting Copernicus's heliocentric theory. This discovery proved that not all celestial bodies orbit the Earth, further advancing the field of astronomy.

The "Ser.mo Principe" mentioned in the notes may refer to a noble or ruler from whom Galileo sought support or recognition. Considering Galileo's relationship with the Medici family, particularly Cosimo II de' Medici, Grand Duke of Tuscany, this hypothesis seems quite plausible. Galileo dedicated his discoveries to Cosimo II to gain his support and recognition.

However, I must remain cautious. Perhaps there were other nobles or scholars with whom Galileo had similar exchanges. To further verify, I carefully examine other clues in the notes. The mention of "Accademia dei Lincei," the scientific academy to which Galileo belonged, further confirms the connection between these notes and Galileo.

Additionally, the date "Anno 1610" in the notes coincides with the publication date of Galileo's Sidereus Nuncius. Sidereus Nuncius was a pamphlet in which Galileo described his astronomical observations, including mountains on the Moon's surface, Jupiter's moons, and nebulae in the Milky Way. These discoveries fundamentally transformed human understanding of the universe.

Now I am even more confident about the source of these notes. They are very likely a letter from Galileo to Cosimo II de' Medici in 1610, intended to share his astronomical discoveries and seek support. The illustrations in the notes resemble those in Sidereus Nuncius, further confirming this.

Galileo wrote many letters in 1610 describing his observations, particularly regarding the discovery of Jupiter's moons. These letters typically included sketches and annotations, very similar to the notes I am analyzing.

After this thorough analysis, I am convinced that the author of these notes is Galileo Galilei, and the content concerns his astronomical observations in 1610 through the telescope, particularly his discovery of Jupiter's moons. These notes are an important historical document of Galileo sharing his revolutionary discoveries with the scientific community and nobility of his time.

Therefore, to summarize my findings:

Author of the notes: Galileo

Content of the notes: Describes Galileo's astronomical observations in 1610 through the telescope, including the discovery of Jupiter's moons, and his attempts to seek support and recognition from nobles of the time.

Answer: These notes were written by Galileo Galilei, concerning his astronomical observations in 1610, including the discovery of Jupiter's moons, and his attempts to seek support and recognition from nobles of the time.

We believe that image understanding and reasoning capabilities will unlock broader usage scenarios for Kimi. In the future, when users encounter problems they don't understand, they can simply snap a photo and send it to Kimi. And when they see photos of unfamiliar cities or buildings on social media, exotic tropical fruits they don't recognize, cursive calligraphy they can't read, memes they don't get, or scientists' manuscripts they can't decipher... they can all try sending them to Kimi Visual Reasoning to explore the answers together with Kimi.

We will continue to enhance Kimi Visual Reasoning's capabilities in all aspects, making it more "reliable" and more "versatile." You may have already noticed that Kimi's main product has also comprehensively upgraded its image understanding capabilities.

Kimi's new model k1 is available immediately upon release. Download the latest mobile app or experience the new model on the Kimi web version (kimi.com). Kimi looks forward to exploring the wider world with you.

01

02

03

03

Snap What You Don't Understand

Recommended Reading