Newsletter #11: System Design for Machine Learning - Part I
Exploring how to build and deploy machine learning systems
We all start out by coding and training ML models in test environments (usually our own PC) to get the hang of how to build models, how they work, and a few use cases. However, building and deploying ML models in the real world takes much more than just coding and training a model. In this first post of a two-part series, we will explore the basics of designing an ML system. In the next post, we will examine a few case studies of how different companies deployed their ML models.
While reading articles on ML system design, I found that most of them start with data gathering and preparation. Although that is a crucial step, I found the first step described by Chip Huyen particularly interesting, so I will start the article there.
Step 1: Setting up the project
Even before you start looking at which datasets you can use, you need to define the goals of the project. Here are a few questions you can ask:
Goal: What are you trying to achieve? Why is ML useful for achieving the goal?
User experience: What should the end UX be like? For example, DALL·E 2 is intended for everyone, whereas Generative Disco is intended for people with a background in video editing.
Constraints: These can be resource constraints, such as hardware and cost, or software constraints, such as having to use a particular framework or platform.
Evaluation: What criteria will you use to judge the model's performance?
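One lightweight way to make these answers concrete is to write them down as a small, version-controlled project brief. The sketch below is purely illustrative, and the field names and example values are my own, not from Chip Huyen's article:

```python
from dataclasses import dataclass, field


@dataclass
class ProjectBrief:
    """Illustrative project brief capturing the setup questions above."""
    goal: str                                             # what are we trying to achieve, and why ML?
    users: str                                            # who is the intended audience / end UX?
    constraints: list[str] = field(default_factory=list)  # hardware, cost, framework, ...
    evaluation: str = ""                                  # criteria used to judge performance


# Hypothetical example for a spam-detection project
brief = ProjectBrief(
    goal="Reduce spam reaching user inboxes; static rules miss new spam patterns",
    users="All email users; filtering should be invisible and fast",
    constraints=["inference under 100 ms per email", "train on a single GPU"],
    evaluation="Precision and recall on a held-out labelled set",
)
print(brief)
```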
Step 2: Setting up the Data Pipeline
The data pipeline is the process of collecting, preparing, and doing an exploratory analysis of the data. Once this is done, you can get an idea of which parts of the data are important and which are not, which will be useful for feature engineering.
Here are a few things to consider:
Data source: Where is the data coming from? Do we need additional data? If so, how will we collect or generate it?
Data storage: What is the most effective and efficient form of storage?
Data preparation and exploration: As mentioned above, this step gives you an idea of measures of central tendency, which features are useful, and so on (see the short sketch after this section).
Privacy issues and bias analysis: As ML models become more integrated into our lives, this is a crucial step when working with data. Data should be collected with consent, and any bias present should be identified and removed to keep the model as neutral as possible.
The data pipeline should also include a runtime component that takes real-time data and feeds it into future training iterations.
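As a concrete illustration of the preparation-and-exploration step, here is a minimal pandas sketch. The file name, the `target` column, and the cleaning choices are placeholders I picked for the example, not part of any particular pipeline:

```python
import pandas as pd

# Load raw data (path and schema are hypothetical placeholders)
df = pd.read_csv("raw_data.csv")

# Basic exploration: measures of central tendency, spread, and missing values
print(df.describe())      # mean, std, quartiles for numeric columns
print(df.isna().sum())    # missing values per column

# Simple preparation: drop duplicates and fill missing numeric values with the median
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Correlation with the target gives a first hint at which features matter
print(df.corr(numeric_only=True)["target"].sort_values(ascending=False))
```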
Step 3: Modelling
In addition to model selection and model training, this stage also includes feature engineering, hyper-parameter tuning, scaling, etc.
Model selection is largely based on the goal, and hence depends on how well the project was set up. In addition to the obvious considerations, like supervised vs. unsupervised and regression vs. classification, it is important to keep resource constraints in mind.
Even though it might be tempting to always reach for a neural network of some sort, neural nets are expensive to train and hard to debug. It's a good idea to give "traditional" models like KNN, k-means, and decision trees a try before moving on to deep learning models.
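As one hedged sketch of this "try the simple thing first" idea, the snippet below fits two cheap scikit-learn baselines; the built-in dataset and the specific models are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset; in practice this would be your own prepared data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Try cheap "traditional" baselines before reaching for a neural network
for model in (KNeighborsClassifier(), DecisionTreeClassifier(random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```

If a simple baseline already meets the evaluation criteria from Step 1, a deep learning model may not be worth its extra cost.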
Feature engineering is choosing which features your model will use to make its predictions. The choice depends on how much impact each feature has on the outcome; methods such as PCA can be used to remove redundant features.
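For instance, here is a minimal sketch of dimensionality reduction with scikit-learn's PCA; the toy data and the 95% explained-variance threshold are arbitrary assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for your prepared dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# PCA is sensitive to feature scale, so standardise first
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain 95% of the variance (arbitrary threshold)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"{X.shape[1]} features reduced to {X_reduced.shape[1]} components")
```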
Hyper-parameters are settings that govern the training process and affect the performance of the model, for example the learning rate or a softmax temperature. Techniques such as grid search, random search, and Bayesian optimization can be used to choose between different hyper-parameter values.
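As a small illustration of the first of these, the sketch below runs a grid search with scikit-learn's GridSearchCV; the model, dataset, and parameter grid are assumptions made only for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Exhaustively try every combination of candidate hyper-parameter values with 5-fold CV
param_grid = {"n_neighbors": [3, 5, 7, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```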
Step 4: Deployment
For ML models, deployment is not as simple as deploying a website. In addition to the software components involved, hardware considerations are just as important, both for the initial deployment and for later scaling.
Here are a few things to consider while deploying a model:
Hardware requirements: How will the model be trained? What hardware resources are required (GPUs, TPUs, etc.)? How will these resources be scaled up if needed?
Software requirements: Where will the model be deployed, on the user's device (a dedicated app) or behind a website? How do we set up feedback loops to get real-time data into the future training process?
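The article does not prescribe a serving stack, but as one possible illustration of the software side, here is a minimal FastAPI endpoint that loads a saved scikit-learn model with joblib; the file name, request schema, and logging comment are placeholders of my own:

```python
# Minimal model-serving sketch (FastAPI + joblib); names and schema are illustrative
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical file produced after training


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])[0]
    # In production, log the request and prediction so the feedback loop
    # mentioned above can feed real-world data back into future training runs.
    return {"prediction": float(prediction)}
```

Assuming the file is called `main.py`, this could be served locally with `uvicorn main:app`; scaling it up is where the hardware questions above come back into play.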
These are the four main components I learned about while researching this article. I feel they provide a good framework for thinking about how to set up a production-level ML project. If you have any suggestions or experience that could improve this framework, please reply to this email or leave a comment!
That’s it for this issue. I hope you found this article interesting. Until next time!