Source: Timo Elliott
Asking your Data Scientists to deploy their Machine Learning models at scale is like having your graphic designers decide which sorting algorithm to use: it’s not a good skill fit. The fact of the matter is that in 2018, the standard Data Science curriculum doesn’t prepare students for the low-level infrastructure work that deployment requires. This post will walk through the knowledge base that most Data Scientists have and why it’s a poor fit for production models.
The most important thing to understand in the context of this topic is that Data Science is still evolving, and nothing is set in stone. Roles and their boundaries are still up in the air, most companies haven’t developed meaningful expertise in hiring for them, and the educational system is struggling to keep pace with a rapidly moving cutting edge. Expect variation and rapid change.
What Data Scientists do know
Data Science spans multiple roles, and those roles aren’t defined by any global standard. That means a Data Scientist at one company might be doing the work of a Data Engineer at another, and both might be doing the work of a Machine Learning Engineer at a third. In fact, there’s an ongoing debate about how vertical (that is, how full-stack) Data Scientists should be.
But what is standard is the Data Science pipeline, which almost always looks something like this: ingestion → cleaning → manipulation → visualization → modeling → deployment.
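To make the stages concrete, here’s a hypothetical sketch of that pipeline as a chain of plain Python functions. The function names mirror the stages above; the toy data and the closed-form line fit are illustrative assumptions, not anything from a real project.

```python
# Illustrative sketch of the Data Science pipeline: each stage is a function,
# and the output of one feeds the next. Toy data stands in for a real source.

def ingest():
    # In practice: pull rows from a database, an API, or a data lake.
    return [("2018-01", 10.0), ("2018-02", None), ("2018-03", 14.0)]

def clean(rows):
    # Drop records with missing values.
    return [(month, value) for month, value in rows if value is not None]

def manipulate(rows):
    # Derive a numeric feature: the index of each month.
    return [(i, value) for i, (_, value) in enumerate(rows)]

def model(points):
    # Fit y = a*x + b by ordinary least squares (closed-form, 1 feature).
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

# Chain the stages; "deployment" is exactly the step this sketch stops short of.
predict = model(manipulate(clean(ingest())))
print(round(predict(2), 1))  # → 18.0
```

Everything up to and including `model` is well covered by standard training; turning `predict` into a scalable production service is the part that isn’t.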
Data Scientists will typically be very good at one of these parts of the pipeline. You’ll have some who thrive in building the infrastructure to stream and access data, some who are skilled at manipulating and visualizing data, and some who like to focus on the technical parts of statistically modeling relationships. The educational system largely reflects this: there are programs targeted at Data Engineering (the pipes), Data Science (manipulation / visualization), and Machine Learning (modeling).
Data Scientists will typically be proficient in one or more scripting languages, the specific one depending on their focus. They’re expected to have familiarity with databases and query languages, use packages like Pandas and Numpy, and typically write code in Jupyter Notebooks. Basic Machine Learning skills are a given, and those more Machine Learning inclined will spend their time with Deep Learning and Linear Algebra.
Zooming in on that last part, there are often entire Data Science teams focused exclusively on Machine Learning. These positions are sometimes called “Machine Learning Engineers” but typically fall under the purview of Data Science. As part of their day-to-day, they’ll create, tune, and adjust complex Machine Learning models on the company’s data.
But even on the Machine Learning end of the Data Science spectrum (and certainly on the rest), there’s an important element missing: the last mile, or deploying models into production.
What it takes to deploy Machine Learning at scale
There’s a good reason why the methods required to deploy Machine Learning into production aren’t taught in most Data Science programs: the experience and skill set that deployment demands are totally different from those of most Data Science tasks. Deployment is a software engineering discipline, not a Data Science one.
Deploying Machine Learning into production is hard. You need to:
- Build and use the right cloud infrastructure on the right cloud provider
- Design and implement public and internal APIs for model usage
- Orchestrate a fleet of containers
- Implement a load balancer to ensure you can scale to meet inference needs
- Integrate with data pipelines and consistently update models
These are all complex, integrated, and difficult tasks to worry about when you want to create an application that works at scale.
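To give a feel for even the simplest version of that last mile, here’s a hedged sketch of one item on the list: wrapping a trained model in an HTTP inference API, using only Python’s standard library. The endpoint path, the JSON schema, and the stub model are all assumptions for illustration; a real deployment would still need the containers, load balancing, and pipeline integration described above.

```python
# Minimal sketch of an inference endpoint using only the standard library.
# predict() is a stand-in for a real trained model.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(features):
    # Stub model: in production this would load and call a trained artifact.
    return sum(features)

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and run inference.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        out = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):
        # Silence per-request logging for this demo.
        pass

# Serve on an OS-assigned port in a background thread.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Exercise the endpoint the way a client would.
req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.5]}).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.loads(urlopen(req).read())
server.shutdown()
print(resp)  # → {'prediction': 3.5}
```

Note how little of this is statistics: it’s request parsing, serialization, and server lifecycle management, which is precisely why it lands in software engineering territory rather than Data Science.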
Working with low-level cloud infrastructure like this is not part of the Data Science curriculum, and that’s why there’s such a fundamental mismatch. Graphic Design is to Data Science as Data Science is to Machine Learning deployment: each step requires a different kind of technical literacy. These, of course, are not value judgements about the roles themselves: they’re just very different, and the skill sets often don’t overlap.
How do you deploy then?
The reality is that it’s not going to be easy: most companies will struggle to deploy their Machine Learning into production, and without solving that last mile problem it’s tough to see meaningful ROI. The emerging trends of microservices and serverless are starting to give some direction, but it’s still largely a black box.
One of the ways that companies are starting to tackle this issue is with specialized teams. Hiring managers are forming dedicated groups of “DevOps” engineers to own this process, though applying DevOps to Machine Learning is very much an emerging practice. A DevOps engineer is typically a more senior role that draws on experience from a variety of areas. For example, a Nike job posting for this type of engineer asks for experience in:
- CI/CD infrastructure
- Docker, Kubernetes, and other container tech
- AWS components and services
- Python, Go, shell scripting, and SQL
- Automation to deploy R, Spark ML, and Python apps
- Data processing and storage frameworks like Airflow, Hadoop, etc.
This is quite a list of qualifications; it’s no surprise that these engineers are few and far between.
Another solution for deploying your Machine Learning models into production is to use a platform that takes care of it for you. Algorithmia lets you deploy Machine Learning at scale with a simple Git push or code upload, turning your deployment process from something that takes months into something that takes minutes.