Understanding Cerebras: Compute Lets AI Learn to Think, Memory Lets Agents Learn to Work

5Y Capital · May 15, 2026

The Achilles' heel of the agent economy is memory.

Compute makes AI think; memory makes Agents work.

Cerebras went public this week, and Ben Thompson's latest article cuts to the heart of it: as AI evolves from "chatting" to "autonomously executing tasks," the bottleneck in chip architecture has shifted.

When you chat with Doubao, what you are waiting on is speed. When Kimi Claw runs a five-hour task for you, it doesn't matter whether it finishes three seconds faster or thirty seconds slower; what matters is whether it can remember context and keep working. With every step it executes, its working memory (the KV Cache) grows a little larger. GPUs were designed for "humans waiting in front of screens": memory sits idle during prefill and compute sits idle during decode, so half the time is spent waiting.

The real chokepoint isn't how fast you can compute; it's how much you can store, and how fast you can read it. More fundamentally, long-running agents transform the KV Cache from a temporary buffer into persistent working memory. Whoever can make that memory last longer, reuse it more efficiently, and do so at lower cost holds the key to the Agent economy.

This matters far more than benchmark scores.


By Ben Thompson

If timing is everything, May 2026 is about as ideal as it gets for a chip company. Reuters reported over the weekend:

Two people familiar with the matter told Reuters on Sunday that Cerebras Systems is considering increasing the size and pricing of its initial public offering as soon as Monday, driven by sustained demand for the artificial intelligence chip company's stock. The sources said the company is considering raising the price range from its previously indicated $115–$125 per share to $150–$160 per share, and increasing the offering to 30 million shares from 28 million; both spoke on condition of anonymity because the information is not yet public.

The sustained rally in semiconductor stocks is, of course, fundamentally driven by AI — particularly the market's growing realization that agents will consume massive amounts of compute. But Cerebras points to a broader proposition: until now, the AI compute narrative has been almost entirely about GPUs, almost entirely about NVIDIA; the future landscape will grow increasingly heterogeneous.

The GPU Era

The story of how GPUs became the center of AI is well-worn, but briefly:

  • Just as rendering pixels on a screen is an embarrassingly parallel process — more processing units means faster graphics — AI computation works the same way: the number of processing units directly determines computational speed.
  • NVIDIA seized this "dual use" opportunity: it made graphics processors programmable, then pushed that programmability to all developers through CUDA, a complete software ecosystem.
  • The fundamental difference between graphics and AI lies in problem scale — models are vastly larger than video game textures. This drove two parallel evolutions: the dramatic expansion of high-bandwidth memory (HBM) capacity on individual GPUs, and major breakthroughs in chip-to-chip networking that enabled multiple chips to work together as an addressable system. NVIDIA led on both fronts.
  • The primary use case for GPUs has always been training, and training is particularly demanding on the third point above. Each training step is highly parallel internally, but steps are serial: before proceeding to the next step, every GPU must synchronize its results with all other GPUs. This is why a trillion-parameter model must fit into the aggregate memory of tens of thousands of GPUs — and those GPUs must communicate with each other as if they were a single machine. NVIDIA dominates both challenges: locking up HBM supply before anyone else in the industry, and investing heavily in networking technology for years.

Of course, training is not the only AI workload; the other is inference. Inference consists of three main parts:

  1. Prefill: Encoding everything the LLM needs to understand into a comprehensible state; this is highly parallel, and compute is what matters.

  2. Decode Part 1: Includes reading the KV Cache — which stores context, including the output from the prefill phase — to perform attention calculations. This is a serial step where bandwidth is critical, and memory requirements are variable and increasingly large.

  3. Decode Part 2: Feed-forward computation on the model weights; this is also a serial step where bandwidth is critical, with memory requirements determined by the size of the model.

These two decode steps alternate at every layer of the model (they are interleaved rather than purely sequential), meaning that decoding is serial and memory-bandwidth bound. For every token generated, two different memory pools must be read in their entirety: the KV cache, which stores context and grows with each token, and the model weights themselves. Both must be completely read to produce a single output token.
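
The bandwidth bound described above can be put into rough numbers. The following is a minimal sketch, not a performance model: it assumes decode speed is limited only by reading the weights and the KV cache once per token, and it ignores batching, compute time, and whether the data fits on one device. The 40 GB weight size and the KV cache sizes are hypothetical; the 3.35 TB/s figure is the H100 bandwidth cited in this article.

```python
def decode_tokens_per_second(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when memory reads dominate."""
    # Every generated token requires reading all weights and the whole KV cache.
    bytes_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_per_token


GB = 1e9
TB = 1e12

# Hypothetical model with 40 GB of weights on a 3.35 TB/s memory system.
for kv_gb in (1, 10, 30):
    rate = decode_tokens_per_second(40 * GB, kv_gb * GB, 3.35 * TB)
    print(f"KV cache {kv_gb:>2} GB -> at most {rate:.1f} tokens/s")
```

The point of the sketch is the trend rather than the exact values: as the KV cache grows with context, the ceiling on tokens per second falls even though the model itself has not changed.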

GPUs handle all three of these needs perfectly: high compute for prefill, abundant HBM for KV cache and weights, and memory pooling via chip interconnect when a single GPU's memory proves insufficient. In other words, the architecture that works for training also works for inference — just look at the deal between SpaceX and Anthropic. Anthropic noted in an official blog post:

"We have signed an agreement to use all of the compute capacity in SpaceX's Colossus 1 datacenter. This gives us access to more than 300 megawatts of new capacity (more than 220,000 NVIDIA GPUs). This will directly increase service capacity for Claude Pro and Claude Max users."

SpaceX is keeping Colossus 2 — presumably for both training future models and inferencing existing ones. They can do both in the same datacenter because xAI's models don't currently have much usage; more to the point for this article, they can do both because training and inference can both run on GPUs. Indeed, the GPUs Anthropic is contracting for from Colossus 1 were originally used for training; the flexibility of GPUs is a massive advantage.

Reading Cerebras

Cerebras makes something completely different. While silicon wafers are 300mm in diameter, the "reticle limit" — the maximum area that lithography tools can expose on a wafer — is about 26mm x 33mm. This is the effective size limit for a chip; going beyond it requires connecting two separate chips through a relatively slow chip-to-chip interposer, which is what NVIDIA did with the B200. Cerebras invented a way to route wires across "scribe lines" — the boundaries between reticle exposures — turning an entire wafer into a single chip, without the relatively slow chip-to-chip interconnect.

The result is a chip with enormous compute and a massive pool of SRAM that can be accessed astonishingly fast. The numbers: Cerebras's latest WSE-3 has 44GB of on-chip SRAM with 21 PB/s of bandwidth; NVIDIA's H100 has 80GB of HBM with 3.35 TB/s of bandwidth. In other words, the WSE-3 has a bit more than half the memory of the H100, but roughly 6,000 times the memory bandwidth.
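
The comparison is simple arithmetic on the four figures quoted above. A short check, using only those numbers and converting 21 PB/s to TB/s:

```python
# Figures as quoted in the article.
wse3_sram_gb = 44
wse3_bandwidth_tb_s = 21_000  # 21 PB/s expressed in TB/s
h100_hbm_gb = 80
h100_bandwidth_tb_s = 3.35

capacity_ratio = wse3_sram_gb / h100_hbm_gb
bandwidth_ratio = wse3_bandwidth_tb_s / h100_bandwidth_tb_s

print(f"WSE-3 capacity relative to H100: {capacity_ratio:.2f}")
print(f"WSE-3 bandwidth relative to H100: {bandwidth_ratio:,.0f}x")
```

The capacity ratio comes out at 0.55 and the bandwidth ratio at about 6,269, which is where "a bit more than half" and "6,000 times" come from.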

The reason to compare the WSE-3 to the H100 is that the H100 is currently the most widely used chip for inference, and inference is obviously what Cerebras excels at. You can train with Cerebras, but the chip's networking story is not compelling, which means all that compute and on-chip memory mostly sits idle; what actually matters is that it can generate streams of tokens far faster than GPUs.

However, the training limitation also applies to inference: as long as all the data fits in on-chip memory, Cerebras is an absolute delight; once memory requirements exceed that limit (whether from a larger model, or, more commonly, a longer KV cache), Cerebras no longer makes sense, particularly given its price. This "whole wafer as a chip" technology means that high yield is an enormous challenge, which dramatically increases cost.
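
To see how quickly a growing KV cache can exhaust a fixed on-chip budget, here is a rough sketch using the standard KV cache size estimate (one key and one value vector per layer per token). The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) and the 35 GB weight footprint are hypothetical, chosen only for illustration. The 44 GB budget is the WSE-3 SRAM figure from the article, treated here as a single pool.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimated KV cache size for a given context length."""
    # Factor of 2: one key vector and one value vector per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def max_context_tokens(capacity_bytes, weight_bytes, layers, kv_heads,
                       head_dim, bytes_per_value=2):
    """Longest context that still fits once the weights are resident."""
    per_token = kv_cache_bytes(1, layers, kv_heads, head_dim, bytes_per_value)
    free_bytes = capacity_bytes - weight_bytes
    return max(0, free_bytes // per_token)


GB = 10**9

# Hypothetical model shape and weight size; 44 GB is the on-chip budget.
limit = max_context_tokens(44 * GB, 35 * GB, layers=80, kv_heads=8, head_dim=128)
print(f"context limit: {limit} tokens")
```

Under these assumptions each token costs about 328 KB of cache, so the remaining 9 GB holds a context in the tens of thousands of tokens. A long-running agent that keeps accumulating context would cross that line, which is the limitation the paragraph above describes.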

That said, I do think there is a market for Cerebras-style chips. The company is currently emphasizing how useful speed is for programming: reasoning means generating lots of tokens, so dramatically increasing tokens-per-second equates to faster thinking. But I think this is a temporary use case, for reasons I'll explain shortly. What actually matters is how long humans need to wait for an answer, and as AI wearables and similar products become more prevalent, interaction speed (particularly for voice, which will depend on token generation speed) will have a material impact on user experience.

Agentic Inference

I have previously argued that we have experienced three inflection points in the LLM era:

  1. ChatGPT proved the utility of token prediction.

  2. o1 introduced the concept of reasoning, where more tokens means better answers.

  3. Opus 4.5 and Claude Code introduced the first useful agents, which can use reasoning models and a framework of tool use, work verification, and more to actually accomplish tasks.

While all of these are "inference," I think a distinction is emerging between providing an answer — what I call "answer inference" — and accomplishing a task — what I call "agentic inference." Cerebras's target market is answer inference; in the long run, I think the architecture for agentic inference will be radically different from both Cerebras and GPUs.

I noted above that fast inference for programming is a temporary use case. Specifically, using LLMs for programming today still requires a human in the loop. It is the human who defines the task, checks the code, submits the pull request, and so on; it is not hard, though, to see a future where all of this is handled entirely by machines. This will broadly apply to agentic work: the real power of agents is not working for humans, but working independently of them.

By extension, the optimal path to solving agentic inference will look very different from answer inference. What matters most for answer inference is token speed; what matters most for agentic inference is memory. Agents need context, state, and history. Some of this lives in an active KV cache, some in host memory or SSDs, and much more in databases, logs, embeddings, and object storage. The point is that agentic inference will no longer be a GPU answering a question, but rather a complex memory hierarchy system built around a model.

Crucially, this agent-specific memory hierarchy implies a necessary tradeoff: speed for capacity. And if there is no human in the loop in real-time, lower speed is no longer a core consideration. If an agent is running an overnight job, it doesn't care about latency's impact on user experience; it only cares about getting the job done. If a new memory approach makes a complex task possible, some latency is acceptable.
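
The speed-for-capacity tradeoff can be illustrated with a toy tiered store. This is a minimal sketch under stated assumptions, not a description of any real serving system: the class name `TieredContextStore`, the tier names, the capacities, and the latencies are all hypothetical. Small fast tiers spill their least recently used items into larger, slower ones, and a read promotes the item back to the fastest tier.

```python
from collections import OrderedDict


class TieredContextStore:
    """Toy memory hierarchy: small fast tiers spill into larger slow ones."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity_in_items, latency_seconds), fastest first.
        self.tiers = [(name, cap, lat, OrderedDict()) for name, cap, lat in tiers]

    def put(self, key, value):
        """Store an item in the fastest tier, replacing any older copy."""
        for _, _, _, items in self.tiers:
            items.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, latency paid) and promote the item to the fastest tier."""
        for _, _, latency, items in self.tiers:
            if key in items:
                value = items.pop(key)
                self._insert(0, key, value)
                return value, latency
        raise KeyError(key)

    def where(self, key):
        """Name of the tier currently holding the key, or None."""
        for name, _, _, items in self.tiers:
            if key in items:
                return name
        return None

    def _insert(self, level, key, value):
        _, capacity, _, items = self.tiers[level]
        items[key] = value
        if len(items) > capacity:
            # Demote the least recently used item to the next, slower tier.
            old_key, old_value = items.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_value)
            # Items pushed out of the last tier are discarded.


# Hypothetical tiers: capacities are item counts, latencies are illustrative.
store = TieredContextStore([
    ("kv_cache", 2, 1e-6),
    ("host_dram", 4, 1e-4),
    ("ssd", 100, 1e-2),
])
for step in range(5):
    store.put(f"step-{step}", f"state after step {step}")

print(store.where("step-4"), store.where("step-0"))
```

An overnight agent can tolerate the slower reads from the lower tiers because nothing is lost: older state stays reachable at a higher latency, and only the working set has to live in the fast tier.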

Meanwhile, if latency is no longer the primary concern, the pursuit of maximum compute and high-bandwidth memory (HBM) seems misplaced: slower, cheaper memory (like traditional DRAM) becomes more attractive. And if the system is mostly waiting on memory responses, chips don't need to be on the most advanced process nodes. This implies a radical shift in architecture, but not the disappearance of existing architectures:

  • Training: Will remain important, and NVIDIA's current architecture (high compute, HBM, fast networking) will continue to dominate.

  • Answer inference: Will be an important but relatively small market, where maximum speed (like Cerebras or Groq) will be very useful.

  • Agentic inference: Will increasingly decouple from GPUs. GPUs' weakness — wasting memory during prefill and compute during decode — will become more pronounced. It will be replaced by systems dominated by high-capacity, low-cost memory, with "good enough" compute. Indeed, the speed of CPUs handling tool calls may matter more than the speed of GPUs.
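
The idle-resource point in the last bullet can be sketched with a simple roofline-style estimate. This is a deliberately crude model under several assumptions: a step takes as long as the slower of its compute time and its memory time, a forward pass costs about 2 FLOPs per parameter per token, only the weights are read, and KV cache traffic is ignored. The 1 PFLOP/s peak and the 20-billion-parameter model are hypothetical; 3.35 TB/s is the H100 bandwidth quoted in the article.

```python
def step_profile(flops, bytes_moved, peak_flops, bandwidth):
    """Which resource limits a step, and how busy each resource is."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    step_time = max(compute_time, memory_time)
    return {
        "bound": "compute" if compute_time >= memory_time else "memory",
        "compute_busy": compute_time / step_time,
        "memory_busy": memory_time / step_time,
    }


PEAK_FLOPS = 1e15        # hypothetical accelerator peak
BANDWIDTH = 3.35e12      # bytes per second
PARAMS = 20e9            # hypothetical model size
WEIGHT_BYTES = 2 * PARAMS  # 2 bytes per parameter

# Prefill processes a 4096-token prompt in one pass; decode emits one token.
prefill = step_profile(2 * PARAMS * 4096, WEIGHT_BYTES, PEAK_FLOPS, BANDWIDTH)
decode = step_profile(2 * PARAMS * 1, WEIGHT_BYTES, PEAK_FLOPS, BANDWIDTH)

for name, profile in (("prefill", prefill), ("decode", decode)):
    print(f"{name}: {profile['bound']}-bound, "
          f"compute busy {profile['compute_busy']:.1%}, "
          f"memory busy {profile['memory_busy']:.1%}")
```

Under these assumptions prefill keeps compute saturated while the memory system is mostly idle, and decode does the reverse. That mismatch is the motivation for pairing cheaper compute with larger, cheaper memory when latency stops being the constraint.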

At the same time, these categories are not equal in scale and importance. Specifically, agentic inference will be the largest market in the future, because it is not constrained by the number of humans or the amount of human time. Today's agents are just fancy answer inference; true future agentic inference will be computers doing work based on instructions from other computers, with a market size that scales not with population but with compute.

Implications of Agentic Inference for Compute

Until now, mentioning "scaling with compute" has mostly implied being bullish on NVIDIA. However, NVIDIA's relative advantage to date has been largely built on latency: NVIDIA chips are extremely fast at compute, but to keep that compute from sitting idle requires massive investment in HBM and networking. If latency is no longer the core constraint, NVIDIA's approach seems less worth paying a premium for.

NVIDIA is aware of this shift: the company has introduced an inference framework called Dynamo to help decompose the different parts of inference, and products like disaggregated memory and CPU racks to enable larger KV caches and faster tool calls, keeping expensive GPUs busy. But ultimately, hyperscalers may look for alternatives for cost and simplicity on tasks where agentic inference is not GPU-constrained.

Meanwhile, China lacks top-tier compute but has everything needed for agentic inference: GPUs that are fast enough, CPUs that are fast enough, DRAM, hard drives, and more. The challenge is, of course, training compute; additionally, answer inference may matter more for national security (particularly military applications).

Another interesting angle is space: slower chips actually make "space datacenters" more viable. First, if memory can be external, chips can be simpler and run cooler. Second, older process nodes, being physically larger, are more resistant to space radiation. Third, older process nodes draw less power, creating less heat. Fourth, non-leading-edge processes mean higher reliability, which is critical on satellites that cannot be serviced.

NVIDIA CEO Jensen Huang often says "Moore's Law is dead"; he means that future speedups will come from system-level innovation. However, when agents can act independently of humans, the deeper implication may be that Moore's Law no longer matters. The way we get more compute is by realizing that the compute we already have is "good enough."

