Jeff Magnusson: How To Create A Self-Service Data Platform For Data Scientists
In this podcast, Wayne Eckerson and Jeff Magnusson discuss a self-service model for data science work and the role of a data platform in that environment. Magnusson also talks about Flotilla, a new open source API that makes it easy for data scientists to execute tasks on the data platform.
Magnusson is the vice president of data platform at Stitch Fix. He leads a team responsible for building the data platform that supports the company's team of 80+ data scientists, as well as other business users. That platform is designed to facilitate self-service among data scientists and promote velocity and innovation that differentiate Stitch Fix in the marketplace. Before Stitch Fix, Magnusson managed the data platform architecture team at Netflix where he helped design and open source many of the components of the Hadoop-based infrastructure and big data platform.
- It's difficult to align data scientists, product engineers, and data engineers
- The data engineering role can merge with the data science role, making a full-stack data scientist
- The data engineer should focus on building a self-service platform for data scientists
- Data scientists benefit greatly from more autonomy and fewer handoff negotiations
- By optimizing for velocity, fast iteration speed, and an environment of innovation, you sacrifice technical efficiency and quality
- Sometimes you don't need to tightly engineer code
The following is an edited transcript from the podcast between Wayne Eckerson and Jeff Magnusson
Wayne: You wrote a blog that circulated widely in Silicon Valley titled Engineers Shouldn’t Write ETL. In that blog you wrote, “For the love of everything sacred and holy in the profession,” the data engineer or the ETL engineer “should not be a dedicated or specialized role.” Those are pretty strong words. What made you come to that conclusion and what should data engineers do if not ETL and data prep?
Jeff: I think with blog posts in general, some amount of click-bait comes into play, so I’ll admit those are very strong words and very opinionated, but there are shades to that. My real intent with that blog post was to highlight an occurrence of the thinker-doer problem that I think crops up in many data science departments. What that looks like is departments organize in a way that requires several handoffs down the pipeline to accomplish any kind of data science or algorithmic work.
So data gets acquired by data engineers, and that’s handed off to data scientists based on some requirements that the data scientists need to accomplish. Data scientists prototype, build models, have ideas, and those things eventually get productionized into some form of data product that then gets handed off to a product engineer to implement. The problem with handoffs is that they create a coordination cost. And in my experience, I found that the motivations and the roadmaps of those groups of specialists are seldom well aligned.
So the data scientist is trying to innovate and quickly test things out into the product. The production engineer is really trying to optimize uptime and latencies to the client, and to juggle a huge amount of workload from several data scientists. The data engineers are trying to build these robust data pipelines that meet the SLA of an aggregate group of needs, and it’s hard to align those things.
A lot of it is based on the notion that ETL, data wrangling, and data engineering are hard, and require a specialization to do. Certainly for companies like a Facebook or a Google or a Twitter, I wouldn’t argue with that; there are a lot of use cases with huge data volumes that need to be handled. But for most companies, I don’t think it’s that hard, and I’d rather focus good, strong engineers on building tools and abstractions that make ETL, data movement, and data science easier, versus having those folks deployed engineering each specific data pipeline that needs to get developed.
And so, by creating those tools, that in turn empowers data scientists to take full ownership of their pipelines from data acquisition to productionization, and then they can control their iteration cycles, and that often increases velocity. They’re also in control of the requirements that they’re pushing out. So they can make tradeoffs if it’s going to take an extra 100 hours to acquire an additional field of data or to add an additional feature to their model. They’re able to see, ‘Okay, is it worth it to do that? Is it worth it to support that in production or should I not do that and push to production and get things tested earlier?’ Having that kind of freedom makes both sides happier as well as increases the iteration speed and velocity and creates an environment of innovation.
Wayne: Okay, that’s pretty radical thinking, rethinking the roles of a data engineer and a data scientist. If the data engineers are not creating those pipelines, what are they doing?
Jeff: I think it’s a notion of just not having the traditional data engineering role and merging it with the data science role. At Stitch Fix we have Data Platform and we have full-stack data scientists, who are a combination of a data engineer and a data scientist. A lot of the traditional data engineering expertise gets pushed onto Data Platform. I think a lot of companies have an infrastructure group or data platform group that is tasked with maintaining the infrastructure, and I think that pushes responsibility up the stack and makes that role a little bit harder.
So rather than building that infrastructure and making it easy to use for a small group of specialized data engineers, our data platform folks are tasked with making that environment easy to use and performant for our data scientists. And that requires a very different tool set and support for a breadth of skills and levels of engineering expertise that most traditional platforms probably wouldn’t support.
Wayne: In practice, what does that platform look like? What are they building to empower the data scientist to be a full-stack data scientist as you say?
Jeff: Our platform is 100 percent deployed on AWS. It’s a combination of Spark, Presto and a bunch of containerized ECS Docker tasks, and I don’t think in practice those pieces look so different from a lot of other big data environments. It’s really the focus one level up that starts to differentiate a bit. So, self-service becomes a primary focus of any kind of API or toolset we would push up, and a lot of the time those require guard rails.
So when you’ve got a group of data scientists, often they’re vertically focused on problems that need to be solved for the business. So, they’re full-stack. They’re trying to build some kind of analysis or data product or model that’s going to solve a problem in the business, versus the more traditional horizontally deployed data engineer who’s just trying to create cohesive and performant data models for the data scientist. So, you end up having this tragedy of the commons where that entire core infrastructure is shared. And things need to be scheduled, executed, and meet some SLAs.
So, when you don’t have the horizontal group of engineers who are deployed to make sure that things are scheduled in a way that the SLAs get hit, then you have to satisfy these verticals that work independently. And we have built isolation guarantees into the platform for that. We make it impossible for one group of data scientists’ jobs to clobber and take all the resources from another. We build the ability to request and guarantee the way resources get granted into our self-service APIs. And one of the things that makes that easy for us is running on AWS versus a data center.
AWS is, at least for our scale, infinitely elastic, so we’re able to spin up the number of containers or the number of nodes on our clusters that each job needs to succeed. And we can guarantee that it’s going to meet an SLA because we can push it into an area where it’s not going to impact other things.
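For illustration, the resource-request-with-guard-rails idea Jeff describes might be sketched like this. The team names, quota numbers, and field names below are hypothetical, not Stitch Fix's actual self-service API; the point is that the API validates a job's requested resources against a per-team limit before the platform grants capacity.

```python
from dataclasses import dataclass

# Hypothetical per-team quotas; the teams and limits here are
# illustrative only, not Stitch Fix's actual configuration.
TEAM_QUOTAS = {
    "team-recs": {"cpu": 64, "memory_gb": 256},
    "team-styling": {"cpu": 32, "memory_gb": 128},
}

@dataclass
class ResourceRequest:
    """A job's declared resource needs, submitted via a self-service API."""
    team: str
    cpu: int
    memory_gb: int

    def validate(self) -> bool:
        """Reject requests that would let one team clobber shared capacity."""
        quota = TEAM_QUOTAS.get(self.team)
        if quota is None:
            raise ValueError(f"unknown team: {self.team}")
        if self.cpu > quota["cpu"] or self.memory_gb > quota["memory_gb"]:
            raise ValueError("request exceeds team quota")
        return True
```

On an elastic backend like AWS, a validated request can then be satisfied by spinning up dedicated containers or nodes, so one team's job never starves another's.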
In the majority of cases just providing visibility into how things are performing, alerting when things start to run longer than normal, and alerting to the number of resources you’re using for any job is powerful. One of the tricks that we’ve used for several years successfully is to publish leaderboards, such as top 10 lists or lists of the most expensive jobs. Those will get published every week to the department and people will see the cost of the jobs that they’re running and they can self-police to make sure that it’s worth it or that they’re responsibly using the right slice of the pie.
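The weekly leaderboard trick could be sketched roughly as follows. The record schema and the flat node-hour cost model are assumptions for illustration; the real accounting would come from the platform's job-tracking data.

```python
from collections import defaultdict

# Hypothetical job-run records: (job_name, owning_team, node_hours).
RUNS = [
    ("nightly_feature_agg", "team-styling", 20.0),
    ("model_retrain", "team-recs", 6.5),
    ("nightly_feature_agg", "team-styling", 19.5),
    ("adhoc_backfill", "team-recs", 2.0),
]

NODE_HOUR_COST = 0.50  # assumed flat $/node-hour rate

def leaderboard(runs, top_n=10):
    """Aggregate weekly cost per job and rank the most expensive first."""
    totals = defaultdict(float)
    for job, team, node_hours in runs:
        totals[(job, team)] += node_hours * NODE_HOUR_COST
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

for (job, team), cost in leaderboard(RUNS):
    print(f"{job} ({team}): ${cost:.2f}")
```

Publishing a report like this to the whole department each week is what lets teams self-police: the owners of the most expensive jobs see the cost and can judge whether it's worth it.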
It has solved a lot of problems for us. If things are silently taking 20 hours of compute time every day, but they end up completing when they need to complete, they’re just marvelously inefficient. It’s really easy not to see that unless it gets pointed out to you and data scientists typically are less curious about that by nature than a data engineer would be who probably has those sort of efficiency concerns ingrained in them through education and experience.
Wayne: So you’re taking the data engineering role and you’re pushing some of it down onto the data scientist, and you’re moving the data engineer upstream to build a self-service platform; self-service for the data scientist to build these vertical pipelines themselves without getting into trouble when they deploy them on the cluster or use the cluster to build them and model them. Is that a fair summary?
Jeff: Yeah, that’s a fair summary.
Wayne: Good. So, now let’s go to the data scientists. If their role is expanded, how do they survive without a data engineer feeding them data and putting their code in production? I guess you’ll say it’s the platform, but in reality does that truly work out, and do you trust data scientists to put their code into production without a data engineer?
Jeff: We definitely trust them to push their code into production without a data engineer, and you’re right, I’m going to say that it’s the platform. But we put more resources onto building our platform and the level of abstraction that it provides than we would if it was just put in place to service the data engineers building those pipelines. We try to make it as easy as possible to develop these things when we see common patterns or pieces of the platform that can be put into place that don’t require a high amount of iteration or business logic.
The charter that I’ve given my team is to build and own as much as possible for the data scientist, but still guaranteeing that you don’t get in the way of their iteration cycle. So basically guaranteeing that you’re going to avoid a handoff of requirements and concerns. I actually think that there’s a whole lot that you can build and abstract up the stack.
One of the things that Stitch Fix does and has done for a long time is create machine- and human-based recommendations systems. So we recommend a personal stylist that we think is the most appropriate for clients, and the personal stylist consumes those recommendations and has the final say in curating the group of clothes that we’re going to send to our clients. We’ve been doing that since day one of data science at Stitch Fix. And it’s well-understood what a recommender is going to look like in an environment, the type of models that are going to serve them out, as well as the latency requirements and data quality concerns on serving the stylist. So, there’s a high degree of tooling that guarantees that when code gets pushed into a production environment that it’s going to serve recommendations to a stylist that can meet data quality and latency guarantees and degradation concerns and APIs, and it’s able to scale and report the correct metrics to our systems.
Basically, there’s a much tighter interface that our data scientists are implementing when pushing that code, but they still have the autonomy over the math, which is really what we’re trying to get to. They’re trying to iterate on those models and the type of machine learning techniques that we deploy into production. They don’t really care about how the machinery is scaling up those recommendations or monitoring them. So, the platform takes care of that and guarantees the quality of the code is up to standards, and it’s going to raise hell when it isn’t through alerting and waking people up through on-call. And it’s going to, again, provide a high degree of visibility into how things are performing, but we’re not going to get in the way of the math or the iteration cycle of our data scientists. And we try to apply that principle to every other piece of our stack and our other data products that we’re developing as well.
So I think as things become more understood, they can become more and more highly abstracted, but another thing that’s important to realize, and I think it gets lost in a lot of these handoff processes, is that data science is often experimental and highly iterative in nature. So, the quality of code that’s only going to be around in a production environment for a few weeks before it gets rewritten is probably less important than the quality of code that’s going to be in production for a long time and iterated on and enhanced by a larger team of software engineers and developers. So, we don’t necessarily enforce that the code gets tightly engineered when it’s going to be a little more transient in nature.
We focus more on the qualities of the APIs. So if there’s an SLA behind an API or a product that we’re pushing to the business, then the platform gets involved to guarantee that there are degradation paths if code fails. It’s easy to roll it back. We have a high degree of visibility onto the data products and there’s a testing pipeline to guarantee that code is meeting the minimum bar before it gets pushed into the production environment.
Wayne: So let’s talk a little bit about the impact of this approach on the people themselves. If we’re kind of changing data engineers’ role, do they disappear? Do they move into the platform team? I assume you need fewer of them. Is that true?
Jeff: I think it’s true that we need fewer of them, and that’s because some amount of the data scientists’ time is being taken by doing the data engineering. So, it’s not that there’s necessarily less effort happening to engineer data. It’s just that we’re not specializing into two separate roles, so we don’t have that differentiation in role.
We do have some folks on the platform team that come from a more traditional data engineering background, and I think that that helps to kind of partner with data scientists when they hit tricky data modeling problems so they have a resource to go to. And it helps to inform the types of APIs and abstractions that we need to build to make this stuff easier. I think that’s the place that a data engineer would find on the data platform team; being able to leverage knowledge of what data engineering takes to help you be a good partner with the data scientists as well as go for the right abstractions.
And then you look at our data science teams; some of our data scientists have more engineering experience than others. And the amount of data engineering or production engineering expertise that is required to be successful in those groups varies from team to team. But for the most part, as long as they have a decent background in coding, we can teach them the right skills to be successful in our environment.
Wayne: So does every data scientist thrive in this environment or do you find that you lose some because they don’t want to be full-stack; they don’t want to do the data engineering?
Jeff: Certainly, our environment favors a mindset, and it’s the same with the platform engineers too. Some people really don’t want to build these self-service tools that empower the data scientists to do the data engineering tasks. It’s a divisive kind of thing. It’s very political in a way.
Some people would prefer a department where we do specialize in data engineering, so I think we do lose those folks that have a different mindset. But what we’ve found is that a lot of our data scientists and platform engineers, as long as they’re open-minded and give the environment a chance, have few complaints.
Certainly, we’re open to debate whether there are needs that have evolved that need to be specialized or not, and those debates do occasionally crop up. But for the most part, our data scientists and our data engineers, our platform engineers, tend to thrive and get very protective over their sense of autonomy where they see some clear benefits through not having to negotiate handoffs.
I think people of a more entrepreneurial mindset, engineers that are comfortable with a high degree of ambiguity, and data scientists as well, tend to thrive. It’s a little more challenging where people want to go deep into a highly technical problem or want to do more research-oriented data science. If you want to focus on a certain area of the stack and not learn the skills required to go full-stack, then it’s more challenging to fit into our environment.
Wayne: I would imagine that the data scientists are much more vested in the outcomes since they’re building it from end-to-end. Is that true?
Jeff: Yeah, I think there’s a higher degree of ownership and accountability.
Wayne: What do you think are the tradeoffs of this self-service approach? Are there any times when it would make more sense to use this specialist model or the assembly line model like most companies do?
Jeff: For sure. I think there are tradeoffs with the approach, and I don’t think it’s meant for every occasion or even every department. In designing our department the way that we did, we made some very conscious tradeoffs. What we’re trying to optimize for is velocity, fast iteration speed, and an environment of innovation. And if you’re optimizing for something, that means you’re not optimizing for something else. So we’re achieving those things at the expense of technical efficiency and technical quality in a lot of cases.
So even in our environment, as pipelines and data products mature, specialization can start to make a lot of sense. And so, you’ve got this maturity and adoption curve where you eat a lot of the low-hanging fruit and then iteration slows. And once things mature and the quality and reliability benefits that we can obtain outweigh the hit that iteration will take, then we tend to build further and further up the stack, and specialists can come in and solve a problem once and be done, versus having that handoff.
The truth is Stitch Fix can get away with the self-service model a lot easier than a lot of other data science-focused companies. The reason for that is that Stitch Fix doesn’t have a huge breadth of end-user-facing data science components.
So, for example, our recommendations go to our personal stylists who are employees. They don’t go to our end users, and most other data products at Stitch Fix follow a similar model where they power internal tools and processes, but not directly user-facing ones. And that gives us a little more time to correct faults when they crop up and a higher tolerance for periods of higher latency, for example, or downtime into an internal tool where it’s less impactful to the business than downtime of a user-facing one. A good example of that is if latency is spiked on our recommendation APIs and our stylists have to wait 500 milliseconds to a second to get recommendations for a fix that they’re styling. We can tolerate that. It’s not ideal, but we can tolerate it. Whereas if a client has to wait that long for an API call to complete which is part of rendering the recommendation into a webpage, there are some material impacts to the business.
When you’ve got those more user-facing concerns or APIs and things that can’t tolerate faults, then specialization makes a lot more sense. Some obvious areas where I probably wouldn’t advocate strongly for this model are high-speed trading or ad tech, given their performance requirements. It’s totally understandable where you want to put tighter controls on pushing into production.
Wayne: Now, you guys just pushed some code out into the open source community, an API called Flotilla. Tell us what that does and why someone would use it.
Jeff: I’m really excited to have the team open source Flotilla and happy to have a chance to advertise it today. A large portion of the work that happens in most data science departments is going to be batch-oriented. So, data gets aggregated, models get trained, features get computed. A lot of those are batch-oriented tasks, and at Stitch Fix we use two platforms for batch execution, Spark and Docker on ECS.
So, most data wrangling and movement happens on Spark and, for us, most of our machine learning and model training happens inside our Docker containers. Our use of containerization for batch execution isn’t necessarily super mainstream; I think containerization for service deployment is well-adopted at this point, but for batch execution, less so.
So some level of abstraction needed to be developed to manage the execution of those batch tasks and, particularly, the execution of our Dockerized tasks. So, there are concerns that we wanted to abstract over there like launching and parameterizing jobs, making sure that they execute in the correct environment with the correct resource allocations, monitoring their status for success or failure, being able to queue them up, being able to retry them. And that’s basically what Flotilla is. It’s a job execution service.
So if you’ve got a batch-oriented task of work that needs to execute, we’re going to execute that in some sort of container in our environment and Flotilla is going to help abstract that, so it abstracts over ECS. It makes the details of containerization fairly invisible to the end user, especially in that batch job context, and so it’s going to handle the job launching, the queuing, the resource allocation, and the monitoring for you.
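As a rough sketch of the abstraction Jeff describes, a containerized batch task might be registered and launched with a small declarative definition like the one below. The field names and the commented endpoint paths are assumptions for illustration only; consult the flotilla-os documentation for the actual API spec.

```python
import json

# Hypothetical task definition for a containerized batch job. The fields
# here are illustrative, not necessarily Flotilla's real schema.
task_definition = {
    "alias": "train-recs-model",            # human-readable job name
    "image": "mycompany/recs-train:latest", # Docker image to run
    "command": "python train.py",           # what to execute in the container
    "memory": 4096,                         # MB requested for the container
    "env": [{"name": "ENVIRONMENT", "value": "production"}],
}

payload = json.dumps(task_definition)

# A client would register the definition with the service, then launch
# runs of it, with the service handling queuing, ECS resource allocation,
# status monitoring, and retries. Conceptually (endpoint paths assumed):
#   requests.post(f"{FLOTILLA_URL}/api/v1/task", data=payload)
#   requests.put(f"{FLOTILLA_URL}/api/v1/task/train-recs-model/execute")
```

The value of the abstraction is that the data scientist only fills in the definition; the details of containers, clusters, and scheduling stay invisible.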
Wayne: So is the information that comes through the APIs directed to the data scientist or someone else?
Jeff: Any type of work that needs to execute in our environment is going to execute on top of Flotilla, and so, data scientists interact with it on a day-to-day basis through command lines and some abstractions that we put on top. So “Flotilla, run my job that I defined with this level of resources and these parameters” is kind of the abstraction that data scientists interact with when they’re scheduling things. So we’ve got a scheduler that we’ve developed, and when you’re putting a task into that scheduler environment, it’s really pointing at Flotilla to execute each individual task of work. So that’s typically how our data scientists see it. It’s like scheduling this kind of production workflow to execute on top of Flotilla.
We use it for other things as well. We even use it to execute short-lived or ad hoc tasks. We’ve got tools built on top of it that data scientists use to allocate notebook environments, Python or RStudio-type notebooks, to do some research, analyze data, or start to productionize some jobs. And so, it’s useful for a wide variety of things. Some of it is interacting directly with the Flotilla GUI or command line for the data scientists. Some other fancier things happen through platform abstractions and tools that we haven’t open sourced yet, but are built on top of it to make other things easier for our data scientists.
Wayne: So where can people get Flotilla?
Jeff: If you go to github.com/stitchfix/flotilla-os, that’s the GitHub repo. Also, if you just Google Flotilla Stitch Fix it will take you there. And there is the stitchfix.github.io/flotilla-os webpage that’s a nicer landing page with some screenshots and links to some pretty nicely developed documentation and API specs for the product. I highly encourage people to check it out. You can set it up in about 5 minutes if you clone the repo and follow the quick start instructions to run it in a local environment on your laptops, so you can experiment with it and try it out.
If you liked the podcast, please subscribe!