Jeff Magnusson: Architecting For Data Science And Blending Man And Machine
Magnusson is the vice president of data platform at Stitch Fix. He leads a team responsible for building the data platform that supports the company's team of 80+ data scientists, as well as other business users. That platform is designed to facilitate self-service among data scientists and promote velocity and innovation that differentiate Stitch Fix in the marketplace. Before Stitch Fix, Magnusson managed the data platform architecture team at Netflix where he helped design and open source many of the components of the Hadoop-based infrastructure and big data platform.
- Business leaders should use data science and algorithms to lead the business, not inform the business
- Modern environments handle more concurrency and flexibility
- Whether to start with descriptive or prescriptive analytics depends on who is leading the charge; it doesn’t need to be either one in particular
- Jeff uses R, Python, Spark, Presto, Java, Scala, and lots of custom built tools
- Blending art and science makes for better decisions
- Machines excel at rote calculation, in contrast to the artistry and intuition that a human brings
- Push data quality standards as far down the stack as possible
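
As a rough illustration of the last takeaway, the sketch below validates records at the point of ingestion, so that downstream dashboards and models only ever see clean rows. It is not taken from the podcast and does not describe Stitch Fix's platform; the `Order` record, its fields, and the three rules are hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Order:
    # Hypothetical record type, for illustration only.
    order_id: str
    quantity: int
    unit_price: float


def validate(order: Order) -> list[str]:
    """Return a list of rule violations; an empty list means the record is clean."""
    problems = []
    if not order.order_id:
        problems.append("missing order_id")
    if order.quantity <= 0:
        problems.append("quantity must be positive")
    if order.unit_price < 0:
        problems.append("unit_price must be non-negative")
    return problems


def ingest(orders: Iterable[Order]) -> tuple[list[Order], list[tuple[Order, list[str]]]]:
    """Split incoming records into accepted rows and rejected rows with reasons."""
    accepted, rejected = [], []
    for order in orders:
        problems = validate(order)
        if problems:
            # Rejected at the lowest layer, before any consumer can read it.
            rejected.append((order, problems))
        else:
            accepted.append(order)
    return accepted, rejected


accepted, rejected = ingest([
    Order("A-1", 2, 19.99),
    Order("", 1, 5.00),
    Order("A-3", 0, -1.0),
])
```

Keeping the rejection reasons alongside each bad record means the team that owns the source can see exactly which rule failed, rather than each downstream consumer rediscovering the problem on its own.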
The following is a transcription of two questions and answers from the podcast:
Wayne Eckerson: Is there a role for the MPP relational database? The one thing I thought they were better for over a Hadoop environment was user concurrency, but you’re saying that’s not necessarily true.
Jeff Magnusson: Yeah. For sure. And even at Stitch Fix we’ve had a Redshift database for a long time. Primarily it serves dashboards, reports, and ad hoc query cases. And I think that’s where the sweet spot is for a lot of these tools - and when you don’t have a huge influx of concurrent data processes running against an MPP database you’re not hitting concurrency problems. It’s more like they don’t excel at concurrent ML workloads where it’s very difficult to tune different data access patterns. It’s not like they’re serving dashboards in that ML development sense. They’re serving model building and training workloads which tend to be intensive in a different way.
Wayne Eckerson: You mentioned you lead with data science and the reporting and dashboarding comes after. I’ve heard a couple people at more advanced organizations say this – that they start with predictive and if needed they develop the descriptive after. Could you comment on how you start predictive and what role descriptive plays?
Jeff Magnusson: I think we believe in both. The starting place could be either or. It depends on who’s leading the charge. A lot of ideas that come out of Stitch Fix data science are actually originated by the data scientists themselves. And in those cases, especially if it’s not an area of the business where a high degree of descriptive analytics has been asked for in the past, then that’s going to default to a place that’s going to produce data products and automation over dashboards meant to inform.
Where there’s already a big investment around operating on a certain dataset we absolutely believe in providing transparency into the data and informing our users to make the best decisions possible. In that case, we’re going to invest in that visibility and building dashboards. And as our data scientists engage with the business more and more and figure out ways to integrate automated decision-making, then it transitions to that place. So the starting point is really what team begins leading the charge, so if it’s operators in the business who need the data to operate better, then it’s going to start from a place of dashboards and descriptive analytics versus data scientists having an insight that leads to tackling a problem, which typically defaults to a more automated approach.