OpenAI and Anthropic CPOs in Conversation: The Core Skill for Product Managers in the AI Era Is Writing Evals | Z Talk

ZhenFund · November 20, 2024

A 60% accuracy rate is the "Mendoza Line" for AI products.

Z Talk is ZhenFund's column for sharing perspectives.

This might be the conversation between the two product managers who understand AI products best.

Kevin Weil and Mike Krieger currently serve as Chief Product Officers at the top-tier foundation model companies OpenAI and Anthropic, respectively. Before this, Kevin was VP of Product at Instagram and Twitter, while Mike Krieger was co-founder and CTO of Instagram.

Both have extensive experience building consumer products at the hundred-million-user scale and are well-versed in the logic of internet product development. Their current roles span consumers, enterprises, and developers alike, and the foundation models at their respective companies rank among the most capable available today.

In a recent public conversation with Sarah Guo, founding partner at Conviction, Kevin and Mike drew on their perspectives as AI product managers to discuss the key constraints on current model potential, the core competencies required of product managers building AI-driven features today, and their outlook on how foundation models will evolve.

This article is republished from Founder Park. Here is the full text:

01 Building products at a foundation model company means

doing ToC, ToB, and ToD all at once

Sarah Guo: You both managed Instagram at different points, and now you've both stepped into relatively new roles. I'm excited to hear what's on your minds. Kevin, let's start with you. You've done so many different and interesting things—what was the reaction from friends and your team when you took this new job as CPO at OpenAI?

Kevin Weil: Pure excitement. I think it's one of the most interesting and impactful roles out there, with so much to explore. I've never had a product role this challenging, this interesting—and this sleep-depriving. It has all the usual product challenges: figuring out who your users are, what problems you're solving, all of that.

Normally when you build products, you're developing on top of an existing technical foundation. You know what resources you have, and you do your best to build the best product possible. Here, it's completely different. Every two months, computers can do things they've never been able to do before, and you have to think about how these advances will affect your product. The changes are massive. So being able to witness AI's development from the inside is genuinely fascinating. And I'm really enjoying the process.

Sarah Guo: Mike, what about you? I was struck by your pure curiosity when we had dinner recently. You were like a kid, excitedly saying, "Yeah, I'm learning all this enterprise stuff now." Tell us—what's surprised you about serving customers beyond Instagram, about working in a research-driven organization?

Mike Krieger: This role is a completely new experience for me. When I was 18, I made a very "18-year-old" vow: every year had to be different. So sometimes I'd think, "Oh, another social product? I'm doing the same thing again." I didn't want to repeat myself.

The enterprise market is really unique. Take feedback cycles—they feel more like investments, with much longer timelines than on the consumer side. You might have an initial conversation with someone, think they're really into your product. Then suddenly the project enters procurement approval, and it could be six months before it's actually deployed and you know how it's going.

You have to get used to waiting. When you're anxious and ask, "Why hasn't this launched yet?" they'll say, "Hey, you've only been here two months. It's still making its way through various VPs. It'll get approved." So you have to adapt to a completely different rhythm.

But what's interesting is, once deployment happens, you get real feedback and engagement. You can literally call the customer and ask: "How's the system working? Is it delivering results?"

With consumers, you're stuck doing data analysis. You can talk to a user or two, but they don't have enough incentive to tell you in detail what's working and what isn't. Enterprise feedback works differently, but it's genuinely rewarding.

Sarah Guo: Kevin, you've worked on so many different kinds of products. How much of your intuition carries over here?

Kevin Weil: Yes, I wanted to add something about enterprise customers too, then I'll answer that. Enterprise customers have an interesting dynamic where it's not purely about the product itself—there are other buyer factors at play. They have their own objectives. Even if you build a top-tier product and everyone inside the company loves it, that doesn't necessarily mean anything.

I was in a meeting with a major client where they expressed satisfaction, said the product was great. Then they said, "One thing we need: we want 60 days' notice for any updates you make." I thought to myself, I wish I had 60 days' notice too!

What's interesting is that at OpenAI, we have consumer products, enterprise products, and developer products all at once. So we're basically doing all of these simultaneously. As for intuition—I'd say about half of it applies. When you know what you're building, like when we're about to launch Advanced Voice Mode or Canvas, intuition kicks in. You know who the users are, what problems you're solving—that part feels like a traditional product launch.

But the beginning of these projects is completely different. Some capabilities only emerge during the training of a new model. You might think a certain feature could work, but the research team isn't sure—it's like seeing a vague outline in fog. You don't know if it will actually materialize, or whether its success rate will be 60%, 90%, or 99%. And if something works 60% of the time versus 99%, your entire product design approach changes completely.

So you just wait, checking in with the research team periodically: "Hey, how's it going? How's model training? Any new discoveries?" They'll say, "We're still researching, still figuring things out." It's fascinating because you're exploring together—it's quite random.

Mike Krieger: This reminds me most of the feeling at Instagram every time Apple announced something at WWDC—like, this update might help us, or it might throw us into chaos. Except now it's your own company creating these variables internally. It's cool, but it can also completely disrupt your product plans.

02 When model accuracy hits 60%,

you can start building products

Sarah Guo: How do you possibly make plans when you don't know what capabilities are coming? What's the iterative process for exploring which new capabilities should make it into the product?

Mike Krieger: You can actually see the general direction, even if it's unpredictable. It's moving in a certain direction, and that lets you start building around it.

First, from the product side, you decide which capabilities to invest in, then work with the research team on fine-tuning. With something like "artifacts," we spent a lot of time adjusting things together with research. I think Canvas was the same. It's "co-design, co-research, co-fine-tuning." That's a privilege of working at this kind of company—being able to participate in designing that process.

Second, there are frontier capability breakthroughs, like OpenAI's voice mode. Anthropic's Computer Use feature, which we released this week, is a classic example: at 60% completion, we thought, "Alright, that's good enough." What we try to do is embed designers early in the process, while knowing that you're not betting on a specific product.

As with the experimental process described earlier, your experimental output should be learning, not necessarily a perfect product every time. The results should be demonstrative or informational—things that might spark product ideas—rather than a predictable product development process. By lowering expectations, you've already done your risk mitigation in your head.

Sarah Guo: In investing, we often think about this: if a model succeeds 60% of the time instead of 99%, what can it still do? Many tasks may ultimately land near 60% accuracy, especially ones that are really important and valuable. So how do you evaluate this internally? When you face these tasks, how should product design handle them to ensure that even "failure" cases are presented gracefully to users—or do we just wait for models to get stronger?

Kevin Weil: Actually, you can still build when model accuracy is at 60%. The key is designing for it. You have to expect more human involvement behind the model, rather than full automation.

Take GitHub Copilot. This was really the first product that made people realize AI isn't just for Q&A—it can help with genuinely economically valuable work. The model that shipped, I'm not sure exactly which generation, but it was several generations ago. It definitely wasn't perfect at programming-related tasks. But even with imperfect accuracy, it still delivered value—if it could complete some of your code, it saved you massive amounts of time.

We see similar things now, especially as we move toward intelligent agents and long-horizon tasks. The results may not be perfect, but if the model can save you five to ten minutes, that's still valuable.

More importantly, if the model can recognize where it's uncertain and proactively come back to ask you: "I'm not sure about this part—can you confirm?"—then the human-model collaboration can far exceed that 60% accuracy.
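Editor's sketch: the idea of a model flagging its own uncertainty can be pictured with a toy simulation. Everything here is an assumption made for illustration, including the `Prediction` type, the self-reported `confidence` field, the 0.8 threshold, and the premise that a human reviewer always fixes whatever is sent to them.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str
    confidence: float  # hypothetical self-reported confidence in [0, 1]
    correct: bool      # ground truth, known only inside this toy simulation


def collaborate(predictions, threshold=0.8):
    """Accept confident answers; route the rest to a human reviewer.

    Assumes, for illustration, that the human always fixes what they review.
    Returns (overall_accuracy, fraction_sent_to_human).
    """
    right = 0
    deferred = 0
    for p in predictions:
        if p.confidence >= threshold:
            right += p.correct  # model answer is used as-is
        else:
            deferred += 1
            right += 1  # human confirms or corrects the answer
    n = len(predictions)
    return right / n, deferred / n


# Toy data: on its own, the model is right on 6 of 10 items (60%).
toy = (
    [Prediction("a", 0.95, True)] * 5
    + [Prediction("b", 0.90, False)] * 1
    + [Prediction("c", 0.50, True)] * 1
    + [Prediction("d", 0.40, False)] * 3
)
accuracy, human_share = collaborate(toy)
print(accuracy, human_share)  # 0.9 0.4
```

In this made-up data the model's confidence tracks its mistakes reasonably well, so sending 40% of items to a person lifts the combined result from 60% to 90%. With poorly calibrated confidence the gain would be smaller.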

Mike Krieger: I've noticed that 60%—this "magic 60%"—is actually quite interesting. It's like a threshold.

Kevin Weil: I made it up, thought of it five minutes ago.

Mike Krieger: 60% is our new benchmark—like the Mendoza Line for AI. I find that accuracy tends to be highly unstable. Some tests it performs great on, others it completely bombs. What's also interesting is sometimes we'll get feedback from two different companies on the same day. One says it's good to go, the other says it's not.

Mendoza Line: A baseball term for a .200 batting average, the informal threshold below which a hitter is considered unacceptably poor. It is named after former MLB player Mario Mendoza, whose career average was just .215. The term has since been adopted across American sports, politics, and popular culture as the dividing line between mediocre and poor.

It's not that the results are completely off-base, just that they underperform compared to other models. We have our own internal evaluation standards, but when the model actually gets deployed in real-world scenarios, problems emerge. It's like product design—you do all this design work, then you put it in front of a user and suddenly realize: "Oh, I had it wrong." Models are the same way.

We do our best to anticipate, but users have their own datasets, their own usage patterns, their own ways of interacting with the model. So when it actually lands in production, all sorts of issues come up.

03 The AI-era product manager:

Writing evals is a core skill

Kevin Weil: I'm curious if you've felt this too. I think today's models aren't limited by intelligence—they're limited by evaluation. They can actually do more, perform more accurately across broader domains. The key is teaching them domain-specific knowledge that might not have been in their original training set, but that they can learn if properly guided.

Mike Krieger: We've seen this consistently. About three years ago, a lot of companies were shipping exciting AI deployments. Now those same companies are saying: "We think the new model is better, but we never evaluated because three years ago all we did was ship cool AI features."

The hardest thing to get people over is: "Let's take a step back—what does success actually mean for you? What problem are you solving?" And product managers change hands frequently, so whoever takes over needs to redefine these questions.

We found Claude is actually quite good at writing evaluation criteria, and quite good at scoring. So we can automate a lot of this for you, but you have to tell us what "success" means first. Then we can iterate and improve—that's often the key to pushing task completion from 60% to 85%.
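Editor's sketch: a minimal picture of what "telling us what success means" can look like in practice. Each case pairs a prompt with a check that encodes success for that prompt. The cases, the checks, and `stub_model` are hypothetical stand-ins so the example runs offline. In a real setup the stub would be a model call, and a check could itself be a model-based grader.

```python
from typing import Callable

# Each case pairs an input with a check that encodes what "success" means.
EvalCase = tuple[str, Callable[[str], bool]]

cases: list[EvalCase] = [
    ("Summarize: 'Refund issued on May 3.'",
     lambda out: "refund" in out.lower()),
    ("Extract the date: 'Refund issued on May 3.'",
     lambda out: "May 3" in out),
    ("Reply politely to: 'Where is my order?'",
     lambda out: "sorry" in out.lower() or "thanks" in out.lower()),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("Summarize"):
        return "A refund was issued."
    if prompt.startswith("Extract"):
        return "The date is 3 May."
    return "Thanks for reaching out! Let me check on that."


def run_eval(model, eval_cases):
    """Run every case through the model and grade it with its own check."""
    graded = []
    for prompt, check in eval_cases:
        output = model(prompt)
        graded.append((prompt, output, check(output)))
    pass_rate = sum(ok for _, _, ok in graded) / len(graded)
    return pass_rate, graded


pass_rate, graded = run_eval(stub_model, cases)
failures = [(prompt, output) for prompt, output, ok in graded if not ok]
print(f"pass rate: {pass_rate:.0%}")
for prompt, output in failures:
    print("FAILED:", prompt, "->", output)
```

The one failure here ("3 May" instead of "May 3") is the kind of result that forces a decision: either the prompt needs tightening or the success criterion was too narrow.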

If you interview at Anthropic, you'll find we have you take a bad prompt and make it good. We find this skill is lacking elsewhere, so if there's one thing to teach people, this is probably it.

Kevin Weil: Yes, writing evals. I think this is going to become a core skill for product managers.

Mike Krieger: We have an interesting situation internally. We have research product managers who mainly handle model capabilities and development, and product managers who handle the product interface and APIs. We found that by 2024, 2025, PMs building AI features are doing work that looks more like the former than the latter.

For example, we launched code analysis—Claude can now analyze and write code. The PM gets the feature to 80%, then needs to hand it off to a PM who can write evals to do fine-tuning and prompt optimization. It's really the same role now—your feature quality depends on how good your evaluations and prompts are, so these two PM definitions are converging.

Kevin Weil: Completely agree. We actually set up a bootcamp to teach all our PMs how to write evals, to help them understand the difference between good and bad evaluations. We're not fully there yet, need to keep improving. But this is absolutely key to building great products with AI.

Sarah Guo: For those who want to become strong AI product or research product developers, how do you develop intuition around evaluation and iteration?

Kevin Weil: You can actually use the model itself to learn. Like you said, you can ask the model what makes a good evaluation. You can say "I want to do this, can you write me an eval example?" and it usually gives you something pretty good.

Mike Krieger: Yes, that's genuinely useful. Another thing: if you listen to people like Andrej Karpathy who've been deep in this space for years, they'll say nothing matters more than studying the data. People get fixated on existing eval results, like a new model scoring 78% instead of 80%, and conclude that it's worse and we can't ship. But if we look closely at the failures, we might find: "Oh, this is actually better, our scoring just wasn't good enough."

What's interesting is every model release has a model card, and sometimes when looking at these evaluations, even the ground truth answers seem off to me—like I don't think a human would say that, or the math seems a little questionable. Getting to 100% is really hard because scoring itself is challenging. So my recommendation is: develop intuition by looking at actual answers, even sampling them, and thinking: "Okay, maybe we should improve the eval," or "the score isn't great but it feels right overall." Going deep on the data matters.
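Editor's sketch: a small illustration of why reading the failures matters. The records and both graders below are invented for this example. The point is only that a headline score can move a lot when the scoring rule changes, and that sampling failures by hand is how you notice.

```python
import random

# Hypothetical eval records: (question, reference answer, model answer).
records = [
    ("2 + 2", "4", "4"),
    ("Capital of France", "Paris", "paris"),
    ("10 / 4", "2.5", "2.50"),
    ("Largest planet", "Jupiter", "Saturn"),
]


def exact_match(ref, out):
    """Strict string equality, a common but brittle grader."""
    return ref == out


def lenient_match(ref, out):
    """Compare numerically when both parse as numbers, else ignore case."""
    try:
        return float(ref) == float(out)
    except ValueError:
        return ref.strip().lower() == out.strip().lower()


def score(grader):
    return sum(grader(ref, out) for _, ref, out in records) / len(records)


def sample_failures(grader, k=2, seed=0):
    """Pull a reproducible random sample of failures to read by hand."""
    failed = [r for r in records if not grader(r[1], r[2])]
    return random.Random(seed).sample(failed, min(k, len(failed)))


strict_score = score(exact_match)
lenient_score = score(lenient_match)
print(strict_score, lenient_score)  # 0.25 0.75
print(sample_failures(lenient_match))
```

Under the strict grader the model looks like it fails three of four questions. Two of those "failures" are formatting differences, which only becomes obvious once someone reads the actual answers.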

Kevin Weil: I think this gets more interesting as we move toward longer context or agents. Having the model do math and get the right answer is easy to judge. But when models start handling longer, more ambiguous tasks, like "help me book a hotel in New York," what's the right answer? Personalization matters a lot. If you had two competent people do it, they'd make different choices. So scoring becomes more flexible. We'll probably need to evolve how we evaluate again.

Mike Krieger: Yes, evaluation might become more like performance reviews. Did the model achieve what a competent human could? Did it exceed expectations because it did it faster, or found a restaurant you didn't know about? Evaluation stops being a simple right/wrong judgment and becomes something more nuanced and complex.
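Editor's sketch: one possible shape for a "performance review" style eval is a weighted rubric rather than a right/wrong check. The criteria, weights, and label cutoffs below are made up for illustration and are not taken from either company's practice. The per-criterion ratings would come from a human or a model-based grader.

```python
# Hypothetical rubric for an open-ended task such as booking a hotel.
RUBRIC = {
    "met_stated_constraints": 0.5,  # budget, dates, location
    "matched_preferences": 0.3,     # personal taste
    "went_beyond": 0.2,             # e.g. faster, or surfaced a better option
}


def rubric_score(ratings: dict[str, float]) -> float:
    """Weighted sum of per-criterion ratings, each expected in [0, 1]."""
    missing = set(RUBRIC) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(RUBRIC[name] * ratings[name] for name in RUBRIC)


def review_label(score: float) -> str:
    """Map a numeric score to a review-style label (cutoffs are arbitrary)."""
    if score >= 0.85:
        return "exceeds expectations"
    if score >= 0.6:
        return "meets expectations"
    return "below expectations"


ratings = {
    "met_stated_constraints": 1.0,
    "matched_preferences": 0.5,
    "went_beyond": 0.0,
}
total = rubric_score(ratings)
print(round(total, 2), review_label(total))  # 0.65 meets expectations
```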

Kevin Weil: Not to mention these evals are written by humans, and models are already surpassing humans on some tasks—people sometimes prefer the model's answer to a human's. So what does it mean if humans are writing the evaluation criteria?

Sarah Guo: Evals are key. We need to spend time with these models learning to write them. Beyond that, what other skills do product people need?

Mike Krieger: I think prototyping with these models is an underrated skill. Our best PMs internally do this. Like when we're debating whether the UI should be this or that, before the designer even opens Figma, a PM or engineer will say: "I already had Claude mock up both, here's what they'd look like." I think that's cool. We can now make and evaluate way more prototypes way faster than before. Learning to use model tools for prototyping is a really useful skill.

Kevin Weil: I think this will also push PMs to go deeper on the tech stack. This may change over time—like if you were doing database tech in 2005, you probably needed to understand the fundamentals differently. Now there are more abstraction layers, maybe you don't need to know all the basics.

Not that every PM needs to become a researcher, but having awareness of research, spending time learning the terminology, building intuition for how these things work—that's very helpful.

Mike Krieger: Another aspect is you're dealing with a stochastic, non-deterministic system. Evals are our best attempt, but doing product design in a world where you can't fully control model outputs—you need to think about how to build feedback mechanisms to close the loop. Like how do you tell when the model goes off track? How do you collect feedback quickly? What guardrails do you need? How do you understand its performance at scale across many users? You need to understand this intelligent system producing massive output across many people using it. This is completely different from the clear bug report of "I clicked the button and nothing happened."

Kevin Weil: Maybe people will get used to this over time. But right now we're all still adapting to this non-deterministic UI, let alone non-technical users. It violates all the intuition we've built over 25 years of using computers—same input usually gives same output, but not anymore.

We have to adapt to this ourselves, and when building products, think from the user's perspective about what this means. There are downsides but also really cool advantages, so it's interesting to think about how to leverage this.

Mike Krieger: I remember at Instagram we did a lot of ongoing user research, every week researchers would bring in different users to test prototypes. We do the same at Anthropic.

What's interesting is in those sessions, users' Instagram usage often surprised me. There'd always be something interesting about their use case or reaction to a new feature. And now it's half user behavior, half how the model responds in that situation.

When the model performs well, you feel a sense of pride. When it misunderstands user intent and gives a long, wrong answer, it's frustrating. This probably also requires a kind of Zen mindset—learning to let go of control, accepting that anything can happen in these environments.

04 ToC products might try letting AI "educate" users

Sarah Guo: You've both worked on consumer-facing products that quickly taught hundreds of millions of users new habits. AI is moving even faster. If even product managers and technical people don't have much intuition for how to use these, how are you thinking about educating end users at scale on this counterintuitive product?

Kevin Weil: Humans' ability to adapt to new things is pretty remarkable. I was talking to someone the other day about their first Waymo ride. It's a magical experience. People might say "oh my god, watch out for that cyclist" for the first 30 seconds, then five minutes in they're like "wow, I'm experiencing the future," and ten minutes in they're bored and scrolling on their phone.

How quickly we normalize something completely magical. ChatGPT isn't even two years old, and it was genuinely shocking when it came out. Now if we went back to the original GPT-3.5, people would probably think 3.5 is terrible.

The things we're building today still feel magical, but 12 months from now we might be saying, "Can you believe we used to use that garbage?" The pace of development is incredible. But what surprises me is how quickly people adapt. Even as we try to bring everyone along, people understand where the world is headed — this change is happening, and it's happening fast.

Mike Krieger: One thing we're working hard to improve is letting the product itself do education in a very straightforward way.

Something we didn't do early on, and are now changing, is having Claude talk more about itself — what its training set is, that it's an AI created by Anthropic, and so on. Now we'll directly tell users "here's how to use this feature."

This came from user research, because we found users would ask "how do I use this?" and Claude would say "I don't know, have you tried looking it up online?" Which obviously isn't good enough. So we're really working to make it more grounded. It's a process, and we're continuously improving.

Seeing it now provide exact documentation links, telling users what to do, "oh, you're stuck, let me help you" — that's great. These models are actually quite good at solving UI problems and user confusion, and we should leverage them more in that area.

Sarah Guo: Driving change management in enterprises must be quite different, right? Because there are established ways of working and organizational processes. How do you think about educating entire organizations about productivity gains or other possible changes?

Mike Krieger: The enterprise side is interesting, because while these products have millions of users, the heavy users are still concentrated among early adopters and tech enthusiasts. In enterprises, you're dealing with entire organizations, many of whom aren't very technical. It's fascinating to watch non-technical users encounter LLM-based chat systems for the first time. You can run training sessions, prepare educational materials. We need to learn from these experiences and think about how to educate the next hundred million users on these interfaces.

Kevin Weil: Enterprises usually have some power users who are happy to teach others. At OpenAI, for example, we have customizable GPTs, which let power users create tools that make it easier for others who might not be as proficient to get started with AI. Finding these power users is important — they become evangelists.

Sarah Guo: I have to ask you, because your organizations are all power users living in the future. How's the Computer Use experience? What are you all using it for?

Mike Krieger: Yeah, on internal usage — like Kevin said earlier, we were pretty late to be convinced the product was good enough. It's still early, still makes mistakes, but we felt it was worth trying. The most interesting use case was during beta testing, when someone wanted to see if it could order us pizza, and it actually worked. When Domino's showed up at the office, completely ordered by AI, that was a cool milestone moment. It was Domino's (laughs), but it was AI-ordered, so it was still awesome. And it ordered a lot of pizza.

We're seeing some interesting early applications. One is UI testing — at Instagram we basically had no UI testing because it was hard to write and very brittle. Move a button, the test fails, need to retake screenshots. But Computer Use is quite good at testing "does this work as expected," which is interesting.

We're also exploring agent tasks involving heavy data processing. In support teams and finance teams, for example, there are lots of forms to fill out, data to move from one system to another — all requiring human time. I often use the phrase "boring work" to describe Computer Use applications. Can we automate this boring work so people can focus on creative work, instead of clicking 30 times just to complete one thing?

05 Complex Tasks Should Be Multi-Model Collaboration

Sarah Guo: Kevin, a lot of teams are trying o1. Your current models can do more complex things. But if you're already using GPT-4 or similar models in applications, you can't just swap them out. Can you give us some guidance on how you're using o1 and these new models internally?

Kevin Weil: People may not realize that many of our advanced customers and we ourselves don't actually use a single model for a specific problem. You end up composing different models together, forming workflows and orchestration. We use each model according to its strengths. o1 is strong at reasoning, but it needs some thinking time, and it's not multimodal, among other limitations.

Sarah Guo: Explain what reasoning is? I know this is a basic question.

Kevin Weil: People are already familiar with pre-training and the concept of scaling laws: from GPT-2 to 3 to 4, pre-training at larger and larger scales made models "smarter," or gave them more knowledge. But this is all like System 1 thinking: you ask a question and get an immediate answer, like text completion.

What's interesting is that intuition about human behavior often helps you understand how models work. Like if you ask me a question and I go off track, it's hard to get back on topic — models are the same. But beyond this ever-larger pre-training, o1 is actually expanding intelligence at query time in a different way. Not System 1 thinking with an immediate answer, but pausing to think, like humans do.

Like if I ask you to solve Sudoku or the New York Times Connections puzzle, you start thinking: "How do these words group? These four might be a group? No, I'm not sure..." You're forming hypotheses, using what you know to validate or invalidate them, then continuing to reason. This is how scientific breakthroughs happen, how we solve hard problems. Now we're teaching models to do this. Currently they think for 30 to 60 seconds before answering. Imagine if they could think for 5 hours or 5 days. This is basically a new way to scale intelligence, and we feel we're just at the beginning — like the "GPT-1 stage" of this new type of reasoning.

But again, you won't use it for everything. Sometimes a question needs an immediate answer, can't wait 60 seconds. So we end up composing models in different ways.
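Editor's sketch: the trade-off between an immediate answer and a slower reasoning model can be framed as a routing decision. The model names and latency figures below are placeholders invented for this example, not real product specifications.

```python
# Hypothetical model profiles; names and latencies are illustrative only.
MODELS = {
    "fast-model": {"typical_latency_s": 1, "deep_reasoning": False},
    "reasoning-model": {"typical_latency_s": 45, "deep_reasoning": True},
}


def pick_model(needs_reasoning: bool, latency_budget_s: float) -> str:
    """Prefer the reasoning model only when the task needs it and can wait."""
    candidates = [
        name for name, profile in MODELS.items()
        if profile["typical_latency_s"] <= latency_budget_s
    ]
    if not candidates:
        raise ValueError("no model fits the latency budget")
    if needs_reasoning:
        for name in candidates:
            if MODELS[name]["deep_reasoning"]:
                return name
    # Otherwise fall back to the quickest model that fits the budget.
    return min(candidates, key=lambda name: MODELS[name]["typical_latency_s"])


print(pick_model(needs_reasoning=True, latency_budget_s=120))   # reasoning-model
print(pick_model(needs_reasoning=True, latency_budget_s=5))     # fast-model
print(pick_model(needs_reasoning=False, latency_budget_s=120))  # fast-model
```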

In cybersecurity, for example, you might think models aren't suitable because they hallucinate. But you can fine-tune models for specific tasks, make them very precise about inputs and outputs, have multiple models work together. Some models check other models' outputs, ask for retries when they find problems. This is how we get a lot of value from models internally — for specific use cases, having multiple models collaborate. It goes back to the analogy of how humans work: when we complete complex tasks, people with different expertise work together.
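Editor's sketch: the pattern of one model checking another's output and asking for a retry can be written as a short loop. The two stub functions stand in for separately tuned models and are hypothetical, as are the task, the output format, and the limit of three attempts.

```python
from typing import Callable, Optional


def generate_with_checker(
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Run a generator, have a checker review it, and retry with feedback.

    `check` returns None when the output is acceptable, else a complaint.
    Returns (final_output, attempts_used). Raises if no attempt passes.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = generate(task, feedback)
        feedback = check(output)
        if feedback is None:
            return output, attempt
    raise RuntimeError(
        f"no acceptable output after {max_attempts} attempts: {feedback}"
    )


# Stubs standing in for two models with narrow, well-defined jobs.
def stub_generator(task, feedback):
    # First try is sloppy; once it receives feedback it follows the format.
    return "severity=high" if feedback else "severity: HIGH!!"


def stub_checker(output):
    allowed = {"severity=low", "severity=medium", "severity=high"}
    if output in allowed:
        return None
    return "output must be severity=<low|medium|high>"


result, attempts = generate_with_checker(
    stub_generator, stub_checker, "Triage this alert"
)
print(result, attempts)  # severity=high 2
```

Keeping the checker's job narrow (here, validating a strict output format) is what makes it reliable enough to gate the generator. A checker with a vague brief would just add a second source of error.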

06 The Future of AI Products:

Proactivity, Asynchronicity

Sarah Guo: Tell us about the future, what's coming next. I know you may not know exact release dates, so no need to give us those. But looking ahead, what experiences do you think will become possible, or become common, in 6 to 12 months?

Mike Krieger: One thing I'm focused on is how to make AI more proactive.

Two key points come to mind. First is "proactivity." When the model knows you, and in appropriate circumstances — say you've authorized it to read your email — it might start identifying patterns. Maybe you get a daily summary, it reminds you of important meetings today, or does research ahead of time for you: "Hey, your next meeting is starting, here are some topics you could discuss." If you have an upcoming presentation, it might even prepare a first draft for you proactively. I think this kind of "proactive" capability will be very powerful.

The other aspect is "asynchronicity." Imagine early UI exploration from 0 to 1, where it tells you what it's doing, and maybe you're sitting there waiting, or maybe you say: "It might need some time, let me do something else and come back when it's done."

This is expansion along the time dimension. It might not give you an answer immediately, but goes off to think, to research, maybe even needs to ask other people for help, then gives you a first answer, validates it, and comes back to you in an hour.

Breaking that expectation of "must get an answer immediately" could let you do much more. Not just having AI help you tweak a small UI detail, but handling more complex tasks like: "Help me improve my PRD to adapt to these new market conditions," or "Adjust my strategic plan based on these three new market trends." Being able to push this kind of multi-dimensional progress is the product capability I'm most excited about.

Kevin Weil: I completely agree. And I think models will get smarter at an accelerating rate, which is part of what enables all of this.

Another thing that excites me is seeing these models interact like we humans do. Right now, most of our interaction with AI is through typing — same as when I chat with friends on WhatsApp, though I also speak and see.

We recently released Advanced Voice Mode, and when I was in Korea and Japan talking to people — especially those with whom I share no common language at all — it was truly a magical experience. Before, we might not have been able to exchange a single word. But now I say to the AI: "Hey ChatGPT, when I speak English, translate to Korean; when you hear Korean, translate it to English." Suddenly I have a "universal translator," able to have business conversations with people. It's really amazing. You can imagine, this isn't just business scenarios — imagine if people weren't worried about language barriers, would they be more willing to travel to new places? And you always have a Star Trek-style "universal translator" in your pocket.

I believe experiences like this will soon become normal, but they're still magical, and I'm excited about the future of this technology — especially combined with what Mike just described, it's even more exciting.

Sarah Guo: Since voice mode launched, I've been loving a certain type of video on TikTok — it's basically young people interacting with voice mode, pouring their hearts out to it. I would never think to interact this way, but these 14-year-olds take it for granted: "I want AI to do this." This interaction pattern is completely new to me. And I strongly believe AI will become part of our lives — I really love seeing this phenomenon.

Kevin Weil: Have you let your kids try it?

Sarah Guo: I haven't yet. Two kids, one is 5 and one is 7.

Kevin Weil: My kids are around 8 and 10, and every time we're in the car, they ask, "Can I talk to ChatGPT?" Then they start asking really weird questions, chatting with AI about all kinds of bizarre topics — and they don't find it strange at all. They're just happy to interact with it.

Sarah Guo: Let me share something I've really enjoyed lately, as a closing thought for today. Back when my parents read to me, I rarely got to choose the book. Usually my dad would just say, "We're reading this one today." Now my kids, maybe because they grew up in Silicon Valley, they'll tell me, "Mom, I want to hear a story about a dragon and a unicorn." I'm thinking, "That's a pretty tall order." But I'm glad they believe it's possible, even if this way of creating their own entertainment is pretty wild. So — what surprising use cases have you seen in your products recently?

Mike Krieger: I think it's a shift in behavior and relationship. People are really starting to understand the nuances of models like Claude — they know what it actually is, whether it's a new source of income. People begin to understand that subtle feeling of almost building a friendship with the model, or developing a lot of bidirectional empathy.

And then I'll hear someone say, "This new model feels smarter, but seems a bit distant." That subtle change gives me more empathy as a product manager. You're not shipping a product — you're shipping an intelligence and an empathy, which is exactly what matters in human relationships.

If someone suddenly told you, "I upgraded, my math scores improved by 2%, but I'm different now," you might say, "Oh, I need to adjust, I might be a little worried." So for me, this process is fascinating — understanding the mindset of people using our products.

Kevin Weil: Model behavior is absolutely a product issue. The model's personality matters a lot, and there are interesting questions around how much personalization it should have, or whether OpenAI's models and Claude's models should have different personalities. Will people choose to use a model because they like its personality?

This is actually a very human thing — the reason we become friends with different people is because we prefer some over others. It's something worth thinking about.

We recently ran some experiments that caused a stir on Twitter. People started asking the model: "Based on everything you know about me, all our previous interactions, how would you describe me?" And the model would give a response, sharing its perspective on you based on your past interactions. This interaction is almost like having a conversation with some entity or person. Seeing how people react to this kind of interaction has been really fascinating.
