Focus on what matters most, and everything else will follow | 5Y View

5Y Capital · June 12, 2024

Do the hard right thing.

Recommender

Yunfeng Shi, VP at 5Y Capital

Ion Stoica launched the distributed computing frameworks Spark and Ray, and was the original CEO of Databricks. He now serves as Executive Chairman of both Databricks and Anyscale. His reflections on what made these two companies successful remain worth revisiting.

People usually know what the right thing to do is. They choose the wrong thing simply because doing the right thing is too hard — for example:

Putting company success above personal interests
Betting on major directions and executing fast with conviction: choosing the cloud
"Ideas are easy, execution is everything"

An infrastructure stack guided by scaling laws is now emerging. What is the biggest opportunity within it worth focusing on and executing with conviction?

Author | wandb.ai

Compiled by | OneFlow Community

Originally titled "Ion Stoica: Spark, Ray, and Enterprise Open Source"

Spark and Ray — one open-sourced in 2010 as a fast, general-purpose computing engine designed for large-scale data processing, the other open-sourced in 2018 by UC Berkeley's RISELab as a next-generation high-performance distributed computing framework — have both become standout projects in the open-source world. And behind their growth stands one central figure: Ion Stoica.

Ion Stoica is the co-creator of distributed computing frameworks Spark and Ray, the former CEO of Databricks, and the co-founder and Executive Chairman of both Databricks and Anyscale. He is also a professor of computer science at UC Berkeley and the lead investigator at RISELab, a five-year research lab dedicated to developing low-latency, intelligent decision-making technologies that has incubated many of the most exciting startups to emerge over the past decade.

Spark and Ray are not only hugely influential open-source projects — both have also grown from open-source foundations into commercially successful companies. They must have done a series of hard but right things to get where they are today. What exactly did they do right?

On the machine learning podcast Gradient Dissent, host Lukas Biewald sat down with Ion Stoica for an in-depth conversation. Through this first-hand account, we can understand how key decisions were made — launching Spark and Ray, founding startups, prioritizing open source, embracing the cloud. Through this article, we hope readers can find the secret to Spark and Ray's success.

Ray: Solving the Performance and Flexibility Challenges of Distributed Programming

Lukas: Many listeners will know Ray and Anyscale, but some people working in machine learning may not know what Anyscale is or what it does.

Ion: Basically, if you look at the demands of new applications — machine learning applications or data applications — their growth far outpaces the capabilities of a single node or single processor. Even if you account for specialized hardware like GPUs, TPUs, and so on, it's the same story. So there seems to be no alternative but to handle these workloads with distributed systems.

Now, writing distributed applications is very difficult. And if more and more applications take distributed form, the gap between people's desire to scale workloads through distribution and the expertise most programmers actually have keeps widening.

So before we talk about Databricks, let me start with Ray. Ray's goal is to make writing distributed applications easier, which it does by providing a very flexible, minimal API. Beyond that, we have a very powerful ecosystem of distributed libraries. Many people probably know them — RLlib for reinforcement learning, Tune for hyperparameter tuning, and more recently Serve, plus many third-party libraries like XGBoost, Horovod, and others.

At the end of the day, if you look at the most popular languages, like Java or Python, they succeeded not because they were the best languages (we can argue about that), but because they had powerful library ecosystems. Developers love libraries because if you have a library for a specific application or workload, you just call some APIs rather than writing thousands of lines of code.

Ray is open source now. Anyscale is the cloud-hosted product for Ray. We're committed to building the best platform for developing, deploying, and managing Ray. That means higher availability, better security, autoscaling, tooling, and monitoring when you deploy applications in production.

On the developer side, we try to give them the illusion of an infinite laptop experience. A survey we conducted showed that most machine learning developers still love their laptops — they still do a lot of work on their laptops.

We want to preserve the experience of using editors and other tools on the laptop, while scaling that to the cloud. So whatever you do on your laptop, you can handle in the cloud. We package the application to the cloud, run it there, autoscale — it's very transparent. That's what Anyscale provides. But really, both Anyscale and Ray aim to make scaling applications, especially machine learning applications, as simple as possible.

Lukas: You've put a lot of work into Ray and Anyscale, but obviously there's been a long-standing question: What makes developing a simple distributed framework truly challenging?

Ion: That's a good question. One lesson we learned is that, in a sense, users and developers really prioritize performance and flexibility, even over reliability.

Let me give a few examples. When we started developing Ray, we only used the task abstraction — they were side-effect free. A task would get some input from storage, compute on that input, store the result somewhere, and then another task could consume it.

This was a very simple model on which you could build a very powerful system. It was a lesson from Spark: if you lose some data, you can preserve the chain of tasks that originally created that data. Based on side-effect-free tasks, once you know the order of tasks, you can re-execute them. If tasks are side-effect free and deterministic, re-execution produces the same output. We were quite happy with these properties.
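To make the re-execution idea concrete, here is a minimal sketch in plain Python. It is not Ray's or Spark's implementation: the `LineageStore` class and its method names are hypothetical, and it assumes every task is deterministic and side-effect free, as described above.

```python
from typing import Callable, Dict, Tuple


class LineageStore:
    """Toy in-memory store that remembers how each derived value was produced."""

    def __init__(self) -> None:
        self._values: Dict[str, object] = {}
        # name -> (function, names of the inputs it was applied to)
        self._lineage: Dict[str, Tuple[Callable, Tuple[str, ...]]] = {}

    def put(self, name: str, value: object) -> None:
        """Store a source value (assumed to live in durable storage, never lost)."""
        self._values[name] = value

    def run(self, name: str, func: Callable, *inputs: str) -> None:
        """Run a side-effect-free task and record its lineage."""
        self._lineage[name] = (func, inputs)
        self._values[name] = func(*(self.get(i) for i in inputs))

    def lose(self, name: str) -> None:
        """Simulate a failure that drops a derived value from memory."""
        self._values.pop(name, None)

    def get(self, name: str) -> object:
        """Return a value, re-executing its chain of tasks if it was lost."""
        if name not in self._values:
            func, inputs = self._lineage[name]
            self._values[name] = func(*(self.get(i) for i in inputs))
        return self._values[name]


store = LineageStore()
store.put("raw", [1, 2, 3, 4])
store.run("squared", lambda xs: [x * x for x in xs], "raw")
store.run("total", sum, "squared")

# Drop both derived values, then ask for the final one again.
store.lose("squared")
store.lose("total")
recovered = store.get("total")  # recomputes "squared" first, then "total"
```

Because the tasks are deterministic, the recovered value equals the one computed before the simulated failure. Once a task keeps hidden state, this guarantee no longer holds.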

But then people started wanting more performance, and things began to fall apart.

With GPUs, you don't just want to run tasks, fetch data, and store data. Because even transferring data from RAM, from computer memory, to GPU memory — that's expensive. And then if your task is also doing something like TensorFlow, launching it, initializing all the variables — that takes at least a few seconds. Actually it takes longer.

This overhead started becoming somewhat prohibitive. People wanted state to actually stay on the GPU, which meant you no longer enjoyed those pure, side-effect-free tasks. This in turn made it much harder to provide a very good fault-tolerance model.

Here's another example. People use reinforcement learning and apply it to simulations, emulations, games. Some games are not open source, and for these closed-source games, they have hidden internal state.

They won't give you the state, you can't extract the internal state — they let you take actions like move left, move right, and you can only look at the screen and read the screen.

Because of this, we had to use abstractions like actors, but with actors, providing fault tolerance becomes much harder. In Ray's first paper, we tried assuming that for every type of actor, there was a single sequential thread. That way, basically, you could sequence the methods executed on an actor. Through sequentialization, every command executed was logged, and then could be re-executed to reconstruct state.
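The log-and-replay approach can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Ray's actor implementation: `LoggedActor` and `Counter` are hypothetical names, and the sketch assumes a single caller and deterministic methods.

```python
from typing import Any, List, Tuple


class Counter:
    """Example of the state an actor might hold."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, n: int) -> int:
        self.value += n
        return self.value


class LoggedActor:
    """Runs method calls one at a time and logs each call in order."""

    def __init__(self, cls: type) -> None:
        self._cls = cls
        self._state = cls()
        self._log: List[Tuple[str, tuple]] = []

    def call(self, method: str, *args: Any) -> Any:
        # Log first, then execute, so the log covers every applied call.
        self._log.append((method, args))
        return getattr(self._state, method)(*args)

    def crash_and_recover(self) -> None:
        """Drop the in-memory state, then rebuild it by replaying the log."""
        self._state = self._cls()
        for method, args in self._log:
            getattr(self._state, method)(*args)

    @property
    def state(self) -> Any:
        return self._state


actor = LoggedActor(Counter)
actor.call("add", 2)
actor.call("add", 5)
actor.crash_and_recover()
value_after_recovery = actor.state.value
```

Replay reproduces the state only because the logged order is the order in which the calls actually ran. With several threads or several callers there is no single such order, which is the difficulty described next.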

But guess what? People started using multiple threads. Even though multithreading doesn't work very well in Python, they were still using it. So we wanted to simplify and try to provide some fault tolerance even in multithreaded scenarios.

We added this restriction: if you create an actor, only the party that created it can call methods on that actor. There's only one source for actor method calls, so at least making operations sequential remained feasible.

But then, people started wanting to do things like parameter servers. For parameter servers, not only does the actor's creator want to access it, but a whole group of other actors needs to access it too — so others also need to call the actor's methods. This submission of different methods from different actors or tasks led to complex concurrent behavior.

So in a sense, all of this added complexity. If you talk about fault tolerance, it's still important, especially in distributed systems.

Leslie Lamport, Turing Award winner and developer of Paxos, long ago defined a distributed system as one in which the failure of a machine or service you didn't even know existed can stop your own system from working.

So we had to give up our ideal of transparent fault tolerance. We could support recovering actors, but upper-layer applications would also have to do some state recovery themselves, if they still needed fault tolerance.

In distributed systems, performance and fault tolerance are hard problems. Concurrency is another matter entirely, because things happen in parallel, and on different machines.

Similarly, when you try to make it flexible, things get much harder. Because in Spark, for example, you abstract and constrain parallelism. You don't let users write truly parallel applications, so you can have more control.

If Spark Exists, Why Build Ray?

Lukas: In some ways, Ray is very similar to Spark, which I imagine comes from your experience with Spark. Can you describe Spark, and then talk about how it influenced Ray?

Ion: Overall, Spark was designed for data-parallel applications. As a programmer, what you see when using Spark is a single sequential flow of control: developing on Spark feels like writing ordinary code, one instruction after another. The real difference is that Spark's backend operates on datasets, and these datasets are partitioned across different machines (originally Resilient Distributed Datasets, or RDDs, and now DataFrames).

So you have a dataset that's partitioned across different machines. Now you execute a command on this dataset, and that command also executes in parallel on each data partition in the backend.

When you write programs in Spark, you just manipulate datasets and apply functions. This execution model, also known as Bulk Synchronous Parallel (BSP), is basically "stage-based" computation.

In each stage, we have a bunch of essentially identical computations operating on different partitions of the same data. Stages cooperate in a relay fashion — one stage creates a new dataset for the next stage to operate on.

The most basic stages are map and reduce, both synchronous operations. One stage operates on a dataset, then does a shuffle to create another dataset, and other stages follow suit...
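As a rough illustration of stage-based execution, the sketch below runs a word count as a map stage, a shuffle, and a reduce stage over three partitions. It uses only Python's standard library rather than Spark. The function names, the tiny input, and the thread pool standing in for a cluster are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def map_stage(partition: List[str]) -> Counter:
    """Same computation for every partition: count words locally."""
    return Counter(word for line in partition for word in line.split())


def shuffle(partials: List[Counter], num_reducers: int) -> List[Dict[str, List[int]]]:
    """Regroup the partial counts so each word lands in exactly one bucket."""
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for word, count in partial.items():
            buckets[hash(word) % num_reducers][word].append(count)
    return buckets


def reduce_stage(bucket: Dict[str, List[int]]) -> Dict[str, int]:
    """Same computation for every bucket: sum the partial counts."""
    return {word: sum(counts) for word, counts in bucket.items()}


partitions = [["a b a"], ["b c"], ["a c c"]]

with ThreadPoolExecutor() as pool:
    # Stage 1: list() acts as the barrier, waiting for every map task.
    partials = list(pool.map(map_stage, partitions))
    # The shuffle creates the dataset that the next stage operates on.
    buckets = shuffle(partials, num_reducers=2)
    # Stage 2: starts only after stage 1 has fully finished.
    reduced = list(pool.map(reduce_stage, buckets))

word_counts = {word: count for part in reduced for word, count in part.items()}
```

Tasks inside a stage never talk to each other. They only hand their output to the next stage, which is the constraint that makes this model easy to reason about.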

For Spark programmers, you can't do fine-grained control of parallelism. Because you write one instruction that's syntactically at the dataset level. Only in the backend, when that instruction or function is accepted, does it execute on different partitions.

This is perfect for big data processing. Obviously, for that scenario, Spark has a great API, or data API.

Now, Ray is much lower level. Spark abstracts and hides parallelism; Ray reveals and exposes it. So with Ray, you can actually say: this task will operate on this data, this might happen in parallel, and here are the dependency relationships between these task outputs. You have another task operating on the outputs of these different tasks. This gives you flexibility, but it's harder to program.
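The contrast can be sketched with futures from Python's standard library, where the programmer states which tasks exist and which outputs each one depends on. This is not Ray's API. The `square` and `add` functions are hypothetical, and a local thread pool stands in for a cluster.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def square(x: int) -> int:
    return x * x


def add(left: Future, right: Future) -> int:
    # Blocks until both upstream tasks have produced their outputs.
    return left.result() + right.result()


# The pool has more workers than tasks, so the waiting task cannot starve
# the tasks it depends on.
with ThreadPoolExecutor(max_workers=4) as pool:
    a = pool.submit(square, 3)   # independent tasks: may run in parallel
    b = pool.submit(square, 4)
    c = pool.submit(add, a, b)   # explicitly depends on the outputs of a and b
    result = c.result()
```

Here the dependency graph is written out by hand, which gives fine-grained control but also leaves scheduling concerns, such as the worker count above, to the programmer.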

On the other hand, in Spark and other systems, you have a single master node that launches tasks, starting all tasks in a stage from some state.

Ray is different — a task can launch other tasks or launch actors, and they can communicate with each other. In Spark and other BSP systems, tasks within the same stage can't communicate with each other. Tasks in the same stage just work on their partitions, then propagate changes caused by the previous stage, creating another dataset for the next stage.

But for humans, it's hard to write parallel programs. We're used to thinking sequentially. Even context switching is hard for humans — by definition, context switching isn't necessarily parallel processing. It's multitasking — do a little of this, do a little of that — and even that's difficult. We're not used to thinking in parallel. It's hard.

So that's another reason you need libraries — the libraries on Ray are exactly for abstracting and hiding parallelism. If you use RLlib, or use Tune, you don't need to know the underlying parallel details for it to run well, and you don't need to worry about those details.

But that's how it is. It's a more flexible, lower-level API. Half-jokingly, if Ray can deliver on the promises I hope it can deliver, and someone today wants to redevelop Spark, they should build Spark on top of Ray.

So fundamentally, Ray is an RPC (remote procedure call) framework, plus an actor framework, plus an object store that allows you to efficiently pass data between different functions and actors through references. You just pass references, you don't always have to copy it. That's where the flexibility comes from.
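A toy version of passing references instead of data might look like the sketch below. It runs in a single process and is not Ray's object store: `ObjectStore` and `ObjectRef` here are hypothetical classes written for illustration only.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ObjectRef:
    """Small handle that is cheap to pass around in place of the data."""
    object_id: int


class ObjectStore:
    """Holds each object once and hands out references to it."""

    def __init__(self) -> None:
        self._objects: Dict[ObjectRef, Any] = {}
        self._ids = itertools.count()

    def put(self, value: Any) -> ObjectRef:
        ref = ObjectRef(next(self._ids))
        self._objects[ref] = value
        return ref

    def get(self, ref: ObjectRef) -> Any:
        # Returns the stored object itself, not a copy.
        return self._objects[ref]


def total(store: ObjectStore, ref: ObjectRef) -> int:
    return sum(store.get(ref))


def largest(store: ObjectStore, ref: ObjectRef) -> int:
    return max(store.get(ref))


store = ObjectStore()
big = list(range(1_000_000))
ref = store.put(big)

# Both functions receive only the small reference, never the list itself.
sum_of_values = total(store, ref)
max_value = largest(store, ref)
same_object = store.get(ref) is big
```

In a real distributed setting the reference would also have to identify where the object lives. The sketch only shows why passing a handle avoids copying the data for every consumer.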

Lukas: When you were working at Databricks or on Spark, did you see use cases that made you want to develop Ray? Or was Ray something you'd always wanted to create?

Ion: No, no, no, something happened. I'm a firm believer that if existing systems don't provide the functionality you need, you should develop a new system. But before developing this new system, you'd better try to implement what you want on existing systems.

When we started developing Ray in fall 2015, I was teaching a systems course for graduate students. At the time, I was still CEO of Databricks. Two machine learning students, Robert and Philip, took this course. Their project was about data-parallel machine learning training. They were using Spark for data parallelism at the time. In fact, they modified it slightly and called it SparkNet. But then some challenges emerged.

Spark was too rigid. With reinforcement learning, the computation model you need is much more complex — you need nested parallelism and things like that. Spark is great for data processing, but when you need more flexibility for reinforcement learning, it's not quite suitable.

Another thing was that Spark was in Java, the JVM, and Scala. At least at the time, it didn't have good support for GPUs, Java didn't. That's why we started developing Ray. Robert and Philip started doing some development work themselves.

The Story Behind Spark and Databricks

Lukas: That's great. I'd also love to hear the Spark story. I remember Hadoop had the same kind of value, everyone was excited about it. Spark seemed to replace it in such a different way, which I think is technically rare. I'd love to hear what use cases drove Spark's development, and why you think the shift happened so quickly.

Ion: That's a great question. Spark also started as a course project. In spring 2009, I was teaching a graduate class, probably something like cloud computing services and applications. One of the projects was cluster orchestration. The problem at the time was: you want the same cluster to be able to run multiple frameworks, sharing the same cluster across different frameworks.

The problem actually came from software upgrades. Hadoop at the time wasn't very backward compatible. If you had a new version, upgrading was painful. Most development deployments were on-premises. So it was hard to deploy another cluster internally to test the new version, then migrate to the next version. Therefore, if you could run two Hadoop versions on the same cluster simultaneously, that was much better — at the time, this meant enormous value.

Initially, this system was called Nexus, but then academics told us that name wasn't suitable because they'd already used it, so it was renamed Mesos. Maybe you remember Apache Mesos — the previous generation's Kubernetes.

Four people worked on this project: Matei Zaharia, Andy Konwinski, Ali Ghodsi, and Ben Hindman. One of Mesos's value propositions was that you could have all existing frameworks, and it was easier to build new data frameworks on top of it. Mesos handled some isolation between frameworks, fault detection, and other tedious things, or did some scheduling.

You'll see that one of the reasons for developing Spark was as a showcase for Mesos. Because with Mesos, it was now easier for developers to write just a few hundred lines of code to develop a new framework like Spark and run it on Mesos. This was around mid-2009.

So what were the use cases? The main use case was machine learning. That's a great story: from the RAD Lab to the AMPLab to the RISELab, each lab lasted about five years, with people from machine learning, databases, systems, and other disciplines working together in the same open space. Around that time, Netflix also launched a competition, offering a $1 million prize to whoever developed the best recommendation system. A postdoc named Lester came to us and said: there's a lot of data here, so what tools do you have, and what can I use?

We told him: well, you should use Hadoop, we're working with Hadoop. We showed him how to use it, and Lester actually used it. But then he came back: he had used Hadoop to analyze the data, and though he didn't run out of memory, it was too slow. Of course it was slow, because most machine learning algorithms are inherently iterative: you keep passing over the data, iterating to improve a model until you are satisfied with its accuracy, meaning it has converged.

Each iteration ran as a separate MapReduce job, and every MapReduce job read its input from disk and wrote its output back to disk. At the time, disks were slow spinning drives, so it took a long time.

That was one use case. Another use case was query processing. At the time, some big companies were adopting Hadoop to process massive amounts of data. After all, it was MapReduce, Google was doing MapReduce too, it must be good.

And these people were like database people — they constantly needed to look at or query data, etc. Now you had this massive data distinct from databases, and they needed to access it. They could access the data, the only thing they needed to do was write Java code, that is, MapReduce code, and then they could process the data. But these people didn't work that way — they liked writing SQL statements.

Then people started developing Hive, a layer on top of Hadoop that provided some SQL-like query language. So now you could query using SQL on top of it. The problem was, when you query on a database, you write a query and get an answer almost immediately. When you wrote queries on big data, you got some answers two hours later — very slow.

So these were Spark's target use cases, and its solution was to keep as much of the dataset in memory as possible.

Of course, Spark's trick wasn't just keeping data in memory, but also how to guarantee resilience and fault tolerance.

This was important. Before this, if you wanted powerful processing capability, you bought a supercomputer. But now you had these cheap servers building large computers, clusters — and guess what? They failed easily.

So fundamentally, fault tolerance mechanisms needed to be provided. That's why Hadoop wrote data to disk — persistent storage — and created three replicas for each piece of data. But for Spark, now data was kept in memory — volatile.

So how do you achieve fault tolerance? Since you only have side-effect-free tasks, you just need to preserve the chain of tasks, keep a record of it. If a failure occurs, you re-execute the task: recreate the data that was lost due to the failure.

That's Spark. Because data sits in memory, machine learning applications run much faster, mainly because the data stays in memory between iterations. Incidentally, it's also a more flexible computation model, because a Hadoop job was limited to a map stage followed by a reduce stage, whereas in Spark you can chain more stages. Obviously, if data is in memory, queries return faster too, even if you have to scan the entire dataset in memory.

These were the use cases that inspired Spark. How did it replace Hadoop? In a sense, Hadoop's impact at the time was overstated — it was still in a bubble.

The 2000s were remarkable. At least in tech circles, everyone knew about Hadoop and big data. Around 2012, 2013, the number of companies actually using Hadoop wasn't that large, and Hadoop summits drew maybe 300 to 500 people, perhaps 700. It was like a bubble. Then Spark entered the Hadoop bubble and said, we'll provide a better compute engine. Hadoop had two parts: a compute engine, MapReduce, and HDFS, the file system.

At first, it was a battle, or at least something less intense than one. Spark had long been perceived as only suitable for small data that fit in memory. But operating on disk-resident data wasn't difficult, and Spark actually did that from day one. We still focused on in-memory computing scenarios, because that's where Spark excelled. (Translator's note: Ion Stoica may be emphasizing here the strategy of opening up a large market through a small entry point.)

Then it became a very smooth replacement, another engine in the same ecosystem. Later, Cloudera bet on it in 2013, and it started snowballing.

Lukas: Was the opportunity to build a company around Spark obvious?

Ion: Initially, we built Spark as an academic project. More and more people started using it, and one obvious problem users faced was: "As a company, I love Spark, but can I bet on it? What happens when Matei or others graduate? What happens to the project?"

We really wanted to have impact. We thought this was a better way to process data, and we saw big data processing as a major problem. Ultimately, some company needed to support the open-source project to make it a reliable solution, at least for large customers. There were two ways: get acquired or start a company.

I won't name the company, but we went to a Hadoop company — we were friends with Cloudera, Hortonworks, even MapR. We knew some people who were actually sponsors of our Berkeley lab. We asked, do you want to take over Spark? But they had other plans for what the compute engine after Hadoop and MapReduce should look like.

The acquisition path didn't work, so starting a company became the natural next step. At that point, I happened to be going on sabbatical, Matei was graduating, and Andy and Patrick were already thinking about starting a company — everything aligned. So, alright, let's start a company.

When we founded the company, we had extensive discussions. One major question was whether the company's success was predicated on the success of the open-source project Spark. When we started, it wasn't clear. We began talking about forming a company in fall 2012. Looking around, Linux was still a very special phenomenon. There were no other open-source-based unicorns. MySQL sort of counted, but it was later sold to Oracle.

Lukas: Wasn't Cloudera already big?

Ion: Not big enough. Hortonworks was small, not big enough either. Only a year or two later did they start getting valuations of over four billion dollars.

Also, people forget that the name Cloudera comes from "cloud era." They initially wanted to do cloud, but the cloud business wasn't big enough then, so they pivoted to on-premise deployment.

After we started our company — it's a long story — our decision was to adopt cloud services as a new business model. We only offered a managed version of Spark in the cloud, initially only on AWS.

We believed the success of open source was a necessary condition for the company's success. Once open source succeeded, if we built the open-source project into the best product, we could hopefully win those customers. We believed this even though others would offer Spark-based services too: later Cloudera provided Spark to their users, then MapR and Hortonworks, and then cloud providers such as AWS and Microsoft Azure.

We bet on open-source success and put a lot of effort into it.

Lukas: Now, the open-source-based business model seems like a very popular strategy for infrastructure companies.

Ion: Databricks was one of the earliest to do this. Before that, it was basically the on-premise business model, which is much heavier. Some companies founded around the same time failed or weren't as successful as people thought. Doing a cloud-hosted product was a pretty big bet at the time. At least initially, we faced enormous pressure to pivot to on-premise. But now, building managed products for open source is quite common.

Why TensorFlow and PyTorch Weren't Commercialized

Lukas: Why do you think popular deep learning frameworks like TensorFlow and PyTorch haven't been hosted in the cloud? Many enterprises typically use them, but that business model doesn't seem to exist there.

Ion: That's a great question. Obviously it's not entirely true — for PyTorch, there's Grid.AI offering managed products.

I think the large companies behind these open-source projects aren't very interested in commercializing them directly. For example, Google's monetization thinking for TensorFlow might be: TensorFlow and everything else runs best on GCP, especially using TPUs, and that's how they make money. The best place to train TensorFlow models would be GCP.

It's the same with Kubernetes. It's hard for a company to build a business without the creators of that open-source project involved. If the open-source project was already open-sourced by another company, and your company doesn't have the founders of that project, it's even harder. You can't coordinate, you can't synchronize development between open source and product. So far, I don't know of any company that has achieved huge success commercializing Kubernetes. What can you do? Most Kubernetes developers are still at Google.

Another thing is that managed products based on distributed solutions are more valuable, because the value lies in managing the complexity of clusters. As long as you're only running on a single machine, the value is smaller.

Of course, now TensorFlow can run across different machines, etc. — it's distributed. But I do think these are two different things. First, most PyTorch and TensorFlow usage is on a single machine. Second, most developers of these open-source libraries still work at large companies like Google, Facebook. I may be wrong, but these are some of the differences I see.

Choosing Co-Founders and Other Startup Advice

Lukas: As someone who founded two very successful companies, do you think the people you chose as co-founders mattered? What did you see in them, or what are some commonalities among the co-founders you chose?

Ion: People matter enormously. Databricks is an older company, so I've had more time to observe it. At some point, to succeed, you need everything, including luck. I think the team members are complementary. Despite all their achievements, they have no airs.

As a team we're very open. Take Matei: I met him after he joined Berkeley in 2006, 2007. Ali came to Berkeley in 2009, Andy was there too, then Patrick. We've known each other a long time and are very open discussing any issue.

We don't always agree. Sometimes we argue passionately, shouting and so on. Later people told us that during these very heated exchanges, because Berkeley's small offices had poor soundproofing, people could hear almost everything we were talking about, and we had no idea.

In a way, this looks a bit scary, because other people expect us to lead the company, yet we might not even agree on very basic things. But we're happy to debate.

I think Robert and Philip at Anyscale are the same way, which brings me back to "low ego." What you want from everyone, including the CEO, is to put the company's success above personal agendas. This is absolutely true. As the saying goes, there's no winner on a losing team.

When you've known someone a long time, trust is absolutely foundational. Every company has highs and lows in its survival. A small company is like flying a plane very close to the ground — not much room to maneuver. I'm not saying everyone needs to be absolutely humble, but they absolutely need to recognize that the company's success matters most.

Lukas: When you founded Anyscale, you had already been running Databricks for a while and were starting to see real success. I imagine you were a very different person. Did you consider founding this company differently than you did Databricks?

Ion: What struck me most was how much great feedback you get from people, and how much of it you ignore.

In retrospect, getting things done mainly comes down to a few basic principles. At least in theory, everyone knows what you need to build a great company. Of course, you need a great team. You need vision, strategy, real focus on product-market fit. Iterate from early customers, make them incredibly successful.

People usually know what the right thing is, but they choose the wrong thing simply because doing the right thing is too hard.

Imagine you go to San Francisco, or pick any city you like, and ask people on the street: what does success require? People will say you need to work hard, focus, plus a bit of luck, etc. Everyone will tell you what's needed, you'll get many similar answers. But the question is: how many people actually do it? The reason is that doing the right thing is just hard.

Looking back, the reason Databricks developed fairly well is that we persisted in some things, we got some things right.

For example, we chose the cloud. We wanted to focus, and we realized early on that building for the cloud versus on-premise is a completely different engineering problem — you essentially need two teams to do both. We weren't even sure we could build a great engineering team to do one thing well, let alone two.

So we thought, let's just do the cloud, because we believed the cloud market was big enough for us. If you had told me there was a tens-of-billions on-premise market, we still couldn't have done anything about it. We only had 40 or 80 people. Capturing even a sliver of that market would have taken years. What was the point of thinking about it?

What I'm trying to say is, in a sense, beyond following some very basic principles, there was nothing particularly special about us. Anyscale is the same way. You just focus on where you want to innovate, and for everything else, you try to use the most advanced solutions available.

So what's different between Anyscale and Databricks now? In a sense, Databricks convinced me even more that these fundamentals are right, and that there's no shortcut to success — it's just hard work.

And it makes you more aware of the importance of execution, which is probably the most important thing. As John Doerr said: "Ideas are easy. Execution is everything."

You'll meet people who can have massive impact, like Ron Gabrisko, who eventually became our CRO. He joined us when the company was doing a few million in revenue and has now taken us to hundreds of millions.

Everyone tells you how important it is to do thorough reference checks when hiring, but it's hard, because it all takes effort. Unfortunately, I can't tell you there's any silver bullet. Just stick to the basics. There are no shortcuts.

You also need to think about how every company is different and has to be different, because things change — Anyscale versus Databricks, for example.

When we started Databricks, AWS was the dominant cloud. Now you have GCP, Azure, multiple clouds — you can't ignore them. When we built Databricks, we were primarily focused on data scientists, data engineers, and so on. But Anyscale is different now — it has to target a broader set of developers and ML developers, and different users obviously want different things.

Again, this isn't earth-shattering. It's obvious. The small things like execution and speed matter.

Lukas: What did you do differently the second time around building a company? My understanding is it's almost exactly the same — like you know what you should do, but you do it better in the details. But I'm curious, because your experience is different from mine: the second time you founded a company, but you're not the CEO. Was it difficult in some ways working with Robert? I mean, he seems very smart and impressive, but I think this might be his first job out of school. You must have thought, "I know how to do this but you're not doing it this way" — did that happen?

Ion: No. I think the reason I was CEO of Databricks earlier was that no one else was sure they wanted to do it long-term. In fact, Ali wanted to go into academia, Matei had a job at MIT and was on leave, and so on. Robert and Philipp, by contrast, didn't look at anything else or ask about anything else; this is what they wanted to do.

I think that arrangement, looking back, was the right one. We've been working together since 2015, so for four years before we founded the company, and we already knew each other very well.

I think in terms of responsibilities, Robert and I, and Philipp to some extent, have divided things up well. As you know, there's so much to do as a CEO, and having someone you can trust to help share some of that responsibility to solve...

Dream Project for the Future: Sky Computing

Lukas: If you had more time, is there another project like Spark or Ray that you'd dream of doing?

Ion: I think one thing I'm looking forward to now is a project called Sky Computing. It's a multi-cloud platform — think of it as the internet for cloud computing. Fundamentally, the belief here is... what the internet did was link together a bunch of different networks and provide the abstraction of a single network, so when you send a packet, you don't know which networks the packet will travel through.

I think in cloud computing, we'll increasingly see this layer emerge, which we call the intercloud layer, that will abstract away the clouds.

There will also be specialized clouds. For example, you have an ML pipeline: data processing, training, serving. For good reasons, you might actually want to run each component on a different cloud.

For example, maybe you're working with confidential data and want to strip PII (personally identifiable information) from it. You might decide to do that on Azure because Azure has confidential computing. You might decide to train on TPUs, and you might decide to serve using Amazon Inferentia, the new chip.

I think you'll also see the rise of more proprietary clouds, especially for machine learning. NVIDIA made an announcement with Equinix, which is essentially about tightly built, GPU-optimized data centers.

So I think this is an exciting thing. As the trend plays out, clouds will inevitably be pushed by open source toward offering increasingly similar services. That provides a great foundation for the next level of abstraction to emerge.

I think it will happen. By the way, for every company to succeed, you need to bet on some kind of uncertainty. If you're not betting on anything, you must be doing what everyone else is doing. In other words, when you start a company, what you believe and what you're doing must not be that obvious yet.

Decision-Making Techniques in Difficult Situations

Lukas: Last question — what's the hardest part of putting machine learning models into production? And as a founder, what's the hardest part of building a truly successful, large company?

Ion: Obviously every company has ups and downs. I think when things are bad, you may need to course-correct. It could be bad because the product isn't shipping, or because you went down the wrong path while building the product, or maybe you have the wrong people...

When things are going well, it's great. But I think it always comes back to fundamentals: try not to be emotional and always focus on the facts. Are there industry trends? Is there data from customers? Is there a problem with certain people not being a fit? We're emotional humans. I've always found it hard to set emotion aside, put it in the back seat, and consider only the facts when making decisions.

The harder things get, the more emotional you become. Because you take it personally — and that's what I think is the hardest thing. And I'm an emotional person too. To some extent, I'm also very impulsive.

Overall, when you try to make decisions based on emotion, it works for some people — that's intuition. But in my case, it doesn't work.

Lukas: Do you have any techniques for managing emotions and thinking clearly under pressure?

Ion: When you're under pressure, a lot of things are happening. I'll try to think about what's most important, try to forget everything else, and try to simplify the problem so it's easier to make decisions based on what matters — that's what I've found. Especially when you're torn about a decision involving multiple dimensions of variables, reducing dimensionality and simplifying the problem is always useful.

For example, when we started Databricks, there was a lot of discussion about whether open source success was that important, especially since we were building a company and a product around an open source project, and at some point needed to generate revenue. Obviously, there are 2×2 = 4 possibilities: open source succeeds but the company doesn't, open source succeeds and the company succeeds, open source doesn't succeed and the company doesn't succeed, open source doesn't succeed but the company succeeds.

When we thought about this, we couldn't see a path to the company succeeding if open source didn't succeed. Once we reached that conclusion, there was nothing else to discuss. We just tried to find ways to simplify it — as long as you focus on the most important thing, everything else follows. Of course, trying to oversimplify can sometimes be bad. Try to imagine: what's the most important thing I need to solve? What's the most important dimension?

(Original link: https://wandb.ai/wandb_fc/gradient-dissent/reports/Ion-Stoica-Spark-Ray-and-Enterprise-Open-Source--VmlldzoxNDEyMzY0?galleryTag=gradient-dissent)

5Y Capital seeks out, supports, and inspires lone entrepreneurs, backing them in everything from morale to the day-to-day operations of running a business. We believe that when the person others see as crazy starts to be believed in, the world becomes a different place.

BEIJING · SHANGHAI · SHENZHEN · HONG KONG