A MiniMax Early Employee Built a "Swipe Left/Right" Proactive Agent | début
A concrete way to realize a Personal Model.
@elsewhere
Starting today, we're launching a new column: début. The word originally refers to the beginning of an action, work, or speech — also a person's first appearance on stage. début will introduce you to a group of entrepreneurs you've never met before.
We promise this is a completely free, public-interest column. Welcome to début, and to making your first appearance in the public eye.
Reach out via message through our official account, or email us at info@elsewhere.news.
Below is the first installment of this column.
Ever since OpenClaw went viral, the idea that everyone could have their own dedicated AI "personal assistant" no longer seems like science fiction. In the crowded race toward this new world, Boxy, a startup focused on overseas markets, has just secured several million dollars in seed funding from Sequoia Capital China.
What exactly is Boxy?
Put simply, Boxy runs a virtual machine silently in the background of a user's computer, enabling compliant access to full historical data from apps like WhatsApp, LinkedIn, and Instagram. Based on this data, Boxy builds a Personal Model that can sense latent user needs and proactively push task suggestion "cards," continuously iterating the model based on user choices.
Proactive Agent is one of the hottest startup directions right now. "Proactive," in short, means anticipating — even completing — needs the user hasn't yet voiced. OpenClaw simulated this experience through a heartbeat mechanism firing every 30 minutes, inspiring more startups to build truly proactive-trigger agents. The main approach involves capturing complete user context through screen recordings and similar methods, identifying intent, and then executing tasks "before they're spoken."
Unlike the old model where users had to switch between multiple apps, with Boxy you simply swipe left or right in 0.1 seconds — like using a dating app — and the system automatically handles scheduling changes, message replies, and other tasks in the background.
But Boxy's ambitions go further. In founder John's vision, it's not merely an assistant but a complete chain: entering users' communication scenarios through radically simple interaction, building out a vast array of different agent types through both proprietary and third-party development, continuously performing reinforcement learning from human feedback, with the goal of becoming "the largest Personal Model producer in the Agent era."
John is a serial entrepreneur born in 1998 who went to the US for school at age 12 and attended Emory University. On the eve of the AGI wave, he joined MiniMax — a large model unicorn — as employee #8, experiencing the pioneering period of building AI data pipelines from scratch.
That was late 2021, before the concept of AGI had even gained traction in the industry. MiniMax, born out of SenseTime, was searching for AI-to-C possibilities. John and his colleagues built extremely complex human-machine interaction datasets from zero, achieving millimeter-precision multimodal data capture through cutting-edge collection technology. This provided the underlying data methodology for the later emotional interaction and hyper-natural voice performance of hit products like Glow.
But for John personally, "extremely realistic virtual humans" was never his true interest. After nearly a year at MiniMax, the accelerating approach of the AI era convinced him he had to throw himself into building something of his own.
After leaving MiniMax, he kept trying to achieve comprehensive AI-to-C implementation in new fields, but never found a direction that convinced him. Now returning to AI-to-C with Boxy, John says he's found a true "Founder Lifestyle Fit" — no longer entrepreneurship for entrepreneurship's sake, but building infrastructure he desperately wants himself: something that can thoroughly liberate humans from tedious digital labor.
Interacting with an Agent Takes Just 0.1 Seconds
elsewhere: Let's start with the product. The market is flooded with AI assistants in various forms. What's different about Boxy?
John: Boxy isn't an AI Personal Assistant, Proactive Agent, Jarvis, or the OS1 from HER. We help each person create their own Personal Model, so that Agents can truly know "who you are" first, then provide corresponding Agent services.
You can think of it this way: Boxy wants to become an essential interaction medium for users in the Agent era — an entry point for Agents aimed at the general public.
Right now, most Agents are still forcing users to input prompts into a blank dialog box, which creates massive cognitive overload. Ordinary people simply don't know how to precisely describe vague needs to AI. Often, by the time you've refined your prompt, you could have just done the task yourself. So Boxy's solution is to first tell Agents who you are, give them enough context, then let Agents proactively do the right thing at the right time.
elsewhere: Boxy certainly isn't the first to claim it wants to be a proactive assistant. What enables you to actually achieve "proactivity"?
John: We run a virtual machine (VM) silently in the background of the user's computer, with the user's WeChat, LinkedIn, Instagram, and other websites/apps running simultaneously inside it. After obtaining authorization, Boxy controls the mouse to scroll up, takes screenshots, and compliantly accesses full historical chat records and browsing data on these platforms — all without disrupting the user's main interface workflow. Based on this collected full data, Boxy can build each user's own Personal Model from the cold-start phase.
elsewhere: Building an accurate user profile by accessing full data?
John: The so-called Personal Model can be understood as a user-level long-term behavioral model. Compared to traditional "user profiles," this model emphasizes computable behavioral structures — preference tendencies in different scenarios, decision boundaries, communication styles, rhythm patterns, relationship variations, and trend changes over time. This behavioral data is then injected in structured form into the Agent's retrieval, ranking, planning, and generation processes. So even when an Agent first collaborates deeply with a user, the system already possesses a certain degree of individual understanding rather than starting from zero.
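The interview doesn't disclose Boxy's internal schema, but the idea of a "computable behavioral structure" injected into an Agent's pipeline can be sketched roughly as follows. All field and class names here are hypothetical illustrations, not Boxy's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioPreference:
    """Hypothetical per-scenario behavioral record (illustrative only)."""
    scenario: str        # e.g. "work_scheduling", "casual_chat"
    accept_rate: float   # tendency to approve suggestions in this scenario
    reply_tone: str      # observed communication style
    active_hours: tuple  # typical rhythm pattern, e.g. (9, 18)

@dataclass
class PersonalModel:
    """Sketch of a user-level, long-term behavioral model."""
    user_id: str
    preferences: dict = field(default_factory=dict)  # scenario -> ScenarioPreference

    def context_for(self, scenario: str) -> dict:
        """Structured signals an Agent could inject into retrieval,
        ranking, planning, and generation, instead of a free-text profile."""
        pref = self.preferences.get(scenario)
        if pref is None:
            return {"cold_start": True}
        return {
            "cold_start": False,
            "accept_rate": pref.accept_rate,
            "reply_tone": pref.reply_tone,
            "active_hours": pref.active_hours,
        }

pm = PersonalModel("user-1")
pm.preferences["work_scheduling"] = ScenarioPreference(
    "work_scheduling", accept_rate=0.9, reply_tone="concise", active_hours=(9, 18)
)
```

The point of the structure is the last method: a new Agent can query the model for a scenario and get machine-readable preferences, rather than starting from zero.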
From the user's perspective, Agents will use high-value data from the Personal Model to predict needs immediately after the user downloads Boxy and authorizes data access, pushing solutions to you in the form of "cards."
elsewhere: Cards?
John: Exactly. After predicting a user's latent need, similar to dating app Tinder, Boxy pushes suggestion cards to the user, who only needs to "swipe left or right" to complete the decision.
For example, you're focused on rushing a report when you suddenly receive a message from a partner asking if tomorrow's meeting can be moved to the afternoon. Because Boxy has access to your communications, email, calendar, and other data, it discovers you happen to be free tomorrow afternoon. At this point, it won't force you to break your train of thought and switch apps. Instead, it pushes a card at the edge of your screen: "Confirmed no meetings tomorrow afternoon. Reply: 'No problem, let's do 2 PM online tomorrow'?" You don't even need to open your messaging app — just swipe right in 0.1 seconds, and Boxy will automatically send the message and update your calendar in the background. If something feels off, swipe left to reject. If the direction is right but the tone needs tweaking, or you want to add something, you can swipe up on the card to quickly input supplementary instructions or make edits.
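The meeting-reschedule example above can be sketched as a card that bundles a suggestion with the background actions each gesture triggers. Everything here (function names, the card dictionary shape) is a hypothetical illustration of the described behavior, not Boxy's API:

```python
def handle_swipe(card, gesture, edit_text=None):
    """Dispatch a swipe gesture on a suggestion card (illustrative sketch).

    right -> execute the queued background actions as-is
    left  -> reject; nothing runs
    up    -> apply the user's quick edit to the reply, then execute
    """
    if gesture == "left":
        return {"status": "rejected", "executed": []}
    actions = list(card["actions"])
    if gesture == "up" and edit_text:
        # swap the edited text into the outgoing message before executing
        actions = [
            {**a, "payload": edit_text} if a["type"] == "send_message" else a
            for a in actions
        ]
    return {"status": "accepted", "executed": actions}

card = {
    "suggestion": "Reply: 'No problem, let's do 2 PM online tomorrow'?",
    "actions": [
        {"type": "send_message", "app": "messaging",
         "payload": "No problem, let's do 2 PM online tomorrow"},
        {"type": "update_calendar", "payload": "meeting -> tomorrow 14:00"},
    ],
}
```

Note that one right-swipe fans out into multiple background operations (send the reply, update the calendar), which is what makes the 0.1-second interaction meaningful.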
elsewhere: Are you trying to eat iMessage, WhatsApp, and WeChat's lunch?
John: To be precise, not eating their lunch — or rather, we have zero interest in building another IM (instant messaging) app. In our planning, communication is actually a "Trojan horse."
The pain point today isn't lacking apps to chat with — it's information overload, having to maintain multiple conversations across numerous different apps daily. We chose to enter through communication, the highest-frequency, most energy-consuming scenario. Once users get accustomed to handling communication through Boxy, they no longer need to waste time switching between apps.
elsewhere: How long does it take to reestablish a complete set of communication habits?
John: When I receive a pile of messages from different apps, I open Boxy and, with quick right-swipes on cards, have the Agent reply to those conversations across the various apps directly. I believe that once you've tried Boxy, you'll lose all desire to open each app, find the contact, reconstruct the context, manually compose a reply, and send it.
And in the coming "post-app era," I believe the general public will increasingly come to treat today's apps as backends for an Agent system, because these operations are inherently mechanical. So it's not only that our product experience creates a "can't live without it" feeling; the broader trajectory of software development also points to the inevitability of this interaction.
elsewhere: So you still want to overthrow WeChat.
John: That's putting it too strongly. WeChat itself still has its mini-program, payment, and subscription account ecosystems. And we're absolutely not just here to help users send a few messages on their behalf — that's only step one.
In this process, Boxy naturally takes over your schedule, work, and even life flow through messaging. Communication is merely an anchor point. Once you leave your highest-frequency interactions and decisions in Boxy, it naturally becomes the ultimate entry point where you summon all Agents and distribute all tasks in the future.
elsewhere: But nearly every hardware or software product wanting to be a "personal assistant" claims it will increasingly understand the user.
John: This will inevitably happen, yes. But if it requires extensive back-and-forth text between the user and your Agent, the experience is poor. And because the channel is text, users have very low tolerance for misjudgments: they then have to type out their intent themselves and correct the errors.
So scenario selection matters: you need a scenario where Agent and human can interact at high frequency, so the Agent keeps deepening its understanding of the user and the product builds stickiness.
Starting from this point, we believe the current swipe-based interaction is crucial. Swiping brings not just physiological pleasure and cognitive relief; it's also the highest-frequency form of RLHF (Reinforcement Learning from Human Feedback). Each swipe not only triggers the Agent to complete 4-6 operations in the background; every acceptance, rejection, edit, deferral, ignore, and piece of explicit feedback also becomes a new learning signal, continuously refining the Personal Model's understanding of that user.
Through this continuous online learning mechanism, the system can distinguish between long-term stable preferences and short-term situational changes, and can identify behavioral trends and anomalous states. For the Agent, this means it receives not just scattered historical fragments, but a set of compressed, modeled high-value signals — enabling it to do the right thing for the user at the right time.
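The interview doesn't describe the actual learning rule. One minimal way to separate "long-term stable preferences" from "short-term situational changes," as a generic online-learning trick rather than Boxy's method, is a pair of exponential moving averages over the swipe stream with different decay rates; a large gap between the two flags an anomalous run of behavior:

```python
class SwipeSignal:
    """Two-timescale EMA over accept/reject swipes (illustrative sketch)."""

    def __init__(self, slow=0.01, fast=0.3):
        self.slow_rate, self.fast_rate = slow, fast
        self.long_term = 0.5   # slow average: stable preference estimate
        self.short_term = 0.5  # fast average: current situational estimate

    def update(self, accepted: bool):
        y = 1.0 if accepted else 0.0
        self.long_term += self.slow_rate * (y - self.long_term)
        self.short_term += self.fast_rate * (y - self.short_term)

    def anomaly(self, threshold=0.3) -> bool:
        """A large gap between timescales suggests a short-term behavior shift."""
        return abs(self.short_term - self.long_term) > threshold

sig = SwipeSignal()
for _ in range(20):   # the user normally accepts this kind of card
    sig.update(True)
for _ in range(5):    # then a sudden run of rejections
    sig.update(False)
```

After the burst of rejections, the fast average collapses while the slow one barely moves, so the system can distinguish "this user's taste changed today" from "this user's taste changed."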
Personal Model + LLM = A Truly Long-term Evolvable Agent System
elsewhere: There's a massive paradox here: to build each person's Personal Model for "proactive prediction," users must hand over extremely core private social and work data. What gives them confidence to do so?
John: Privacy is a lifeline, and also presents extremely high technical barriers. If, like other products, you throw massive amounts of raw chat records entirely at cloud-based large models for processing, privacy protection is empty talk.
We use an "on-device desensitization" architecture: a desensitization pipeline deployed on the user's device runs entirely locally. Its core task is "data obfuscation." For example, real names, companies, passwords, and other sensitive information in your chat records are replaced with meaningless placeholders (like Person A, Company B) by a small local model.
Then we send this completely desensitized logical description to the cloud large model for strategic reasoning. After the cloud returns results, we perform a "hydration" restoration locally, and finally call local scripts for task execution. In other words, the cloud large model never touches your real privacy, and users can check uploaded data logs at any time — everything is transparent.
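The desensitize-then-hydrate round trip can be sketched as below. In the architecture John describes, a local small model would detect the sensitive entities; here a hand-built mapping stands in for it, and all names and strings are hypothetical:

```python
def desensitize(text: str, entity_map: dict) -> str:
    """Replace sensitive strings with placeholders before anything leaves
    the device. entity_map: real value -> placeholder, kept locally only."""
    for real, placeholder in entity_map.items():
        text = text.replace(real, placeholder)
    return text

def hydrate(text: str, entity_map: dict) -> str:
    """Restore real values into the cloud model's response, locally."""
    for real, placeholder in entity_map.items():
        text = text.replace(placeholder, real)
    return text

# Mapping produced (in the real system) by a local small model:
entities = {"Alice Chen": "Person A", "Acme Corp": "Company B"}

raw = "Ask Alice Chen whether Acme Corp signed"
safe = desensitize(raw, entities)   # only this string is sent to the cloud
# The cloud model reasons entirely in placeholder terms:
cloud_reply = "Draft: Hi Person A, did Company B sign yet?"
final = hydrate(cloud_reply, entities)  # restored on-device before display/send
```

The key property is that `entities` never leaves the device: the cloud sees only the obfuscated logical description, and the mapping back to reality happens locally.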
elsewhere: Compared to LLMs, what role will this Personal Model you describe play in our lives?
John: Now that generative AI has broadly entered the Agent stage, the core problem is making intelligent agents truly understand "a specific user." Relying on general-purpose large language models alone, it's difficult to reliably absorb user behavioral data that spans time (months and years) and platforms (multiple data sources) within limited context windows. So I believe truly valuable individualized capability should come from a Personal Model that exists independently of the foundation model and evolves continuously.
The significance of this architecture isn't about training an "even bigger model," but about adding a layer of long-term, individualized, updatable user state to the general foundation model. The former excels at general reasoning and generation; the latter is responsible for long-term memory and user adaptation. Only with both working together can intelligent agents move from "strong generalization capability" to "truly different for each person" — and this may become the key watershed for next-generation AI Agents moving from demo to daily driver.
Simply put, the large model determines AI's capability ceiling, while the Personal Model determines AI's usefulness floor for you. And Boxy's goal is to become the world's largest Personal Model producer and hosting platform.
elsewhere: I can sense your firm belief in Personal Models. Where does this conviction come from?
John: Looking back, technological development has actually been repeating one thing: returning capabilities that were originally centralized, bit by bit, to each individual.
Phone calls — originally we couldn't do without telecom operators, telephone exchange systems. Wanting to reach someone was essentially requesting a centralized network to "connect" you. Then going online, we relied on internet service providers, remote data centers. Because information, computing, content — they were all "over there," not in our hands. But now, we have our own phones, our own computers. Many capabilities that originally belonged to "the network" have been moved onto "the device." Capabilities are beginning to flow from the center to each individual.
In the future, as on-device computing power increases, we can expect that everyone will have sufficient local computing capability (in fact, people buying Mac Minis in large quantities are stockpiling their own compute). In that future, what this compute will do is something not yet defined. And we believe this compute should run each person's own Personal Model, then cooperate with cloud-based LLMs for sufficiently intelligent, personally data-aware personalized inference.
I believe we are entering a completely new era, one where everyone will possess their own computing resources, their own models. The future AI system will consist of two types of models: one is general-purpose large models deployed in the cloud; the other is Personal Models running on personal devices. Only with both working together can humanity, for the first time, truly possess intelligence of its own.
elsewhere: But at first glance, what you want to do looks no different from what Doubao Phone wants to do.
John: On the surface, everyone is telling the "personal intelligent assistant" story, but the underlying implementation paths and accessible data are fundamentally different. Doubao Phone essentially still follows the giant's playbook of building hardware and owning the OS. And any giant will inevitably be constrained by the thick ecosystem walls between major companies; the subsequent blocking of Doubao Phone by other majors proved this point. Without full cross-platform context, a so-called Personal Model is water without a source, ultimately reduced to a slightly better voice assistant or system plugin. The virtual machine approach means Boxy doesn't need any major company to open APIs for us.
More importantly, there's the trust issue. Realistically, are users willing to hand over their most intimate private social and work data, without reservation, to a company with a massive commercial empire? That's also why we've invested enormous cost in "on-device desensitization." What Boxy wants to be is not a vassal of some major company's ecosystem, but an absolutely neutral, completely transparent "digital asset safe" loyal only to you personally.
elsewhere: This neutral third-party position is probably already crowded with startups. Why can you pull it off?
John: One type of company in this track collects data through local screen recording. That has a built-in limitation: it can only capture fragments of what's on screen "right now" and can't reach deep historical context at all, which makes it very hard to truly build context around a person.
At the same time, many startups want to build a grand unified Agent that can do everything. This is opposite to our principles, and also brings extreme uncontrollability and very high error risk.
What we advocate is Unix philosophy — believing each Agent should do one thing and do it well. Users shouldn't give a vague demand to a centralized intelligent hub and "expect" it to execute smoothly. Boxy delegates specific execution to third-party developers in the Agent Store, each with their own responsibilities.
elsewhere: Similar to Apple's App Store? How will you attract Agent developers?
John: In some ways, very similar. When developers build AI applications, the two biggest headaches are always where the traffic comes from and how to get high-quality context data. Boxy secures stable traffic by rooting itself in the communication scenario. As for full data and understanding implicit preferences, Boxy has already handled these through building the Personal Model.
Every Agent on the Agent Store must clearly state what data it can and cannot see, and all data Agents see is obfuscated, so user privacy isn't leaked. Agents must also declare what tools they can and cannot call. For example, an Agent that helps with research shouldn't have the ability to send messages on any platform. This also prevents malicious prompt injection from making Agents operate apps at will, and even in the rare event of a hallucination, the Agent can't cause excessive damage.
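A manifest-plus-enforcement pattern like the one John describes can be sketched as follows. The schema and function names are hypothetical; the interview only says Agents must declare their data scopes and callable tools:

```python
# Hypothetical manifest for a research Agent: it can read (obfuscated)
# browsing data but has no message-sending tool at all.
RESEARCH_AGENT_MANIFEST = {
    "name": "research-helper",
    "data_scopes": ["browsing_history", "documents"],  # what it may see (obfuscated)
    "tools": ["web_search", "summarize"],              # what it may call
}

def call_tool(manifest: dict, tool: str, *args):
    """Hard-deny any tool outside the manifest. Even a prompt-injected or
    hallucinating Agent then cannot, e.g., send messages it was never granted."""
    if tool not in manifest["tools"]:
        raise PermissionError(f"{manifest['name']} may not call {tool!r}")
    return {"tool": tool, "args": args, "status": "ok"}
```

Because the check is enforced by the platform rather than by the Agent's own prompt, a compromised Agent's blast radius is capped at its declared tool list.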
elsewhere: Apple initially used iPhone as a distribution channel. So why don't you just build hardware directly?
John: Hardware is certainly the perfect physical vessel for intelligence, but my view on hardware is: soul first, then container.
Because they can't access users' full context, AI hardware remains an "amnesiac." The Personal Model is that "soul," while hardware is merely the "container." Our strategy is to first use the fastest method on users' existing PCs to nurture this soul.
elsewhere: How do you plan your own business model?
John: In the short term, our business model is primarily subscription-based. How to better serve each user and make Boxy their communication entry point is our core focus. Later it will definitely expand around the Agent Store ecosystem: we'll take commissions and share revenue with the third-party Agent developers providing vertical services on the platform.
But looking toward the longer-term "new world," I believe the Personal Model will become each person's core digital asset. As personal models continuously improve, they are not just time-saving tools, but can even represent you in creating value.
For example, with privacy desensitization guarantees, your Agent can represent you in participating in specific consumer brands' commercial research, or automatically screen and match business opportunities for you. Our ultimate commercial imagination is absolutely not just selling efficiency software subscriptions. We hope to become the infrastructure for everyone's Agents to conduct value exchange with the external world in this new world.
Cover image: Johannes Vermeer, Girl with a Pearl Earring, c. 1665, Mauritshuis, The Hague
