The figure above shows a typical machine learning (ML) pipeline, taken from the Udemy course "Deployment of Machine Learning Models".
Gathering Data Sources
Collecting data is the first and an essential step in any data science project. The data for building ML models can come from business units or from public datasets.
Data Analysis
Understanding the data well is good practice, and it is as important as other steps such as building ML models. In this step, it is important to answer questions such as:
- what variables are available?
- how are they related?
- what are the characteristics of those variables (numerical or categorical)?
- are there missing values or outliers?
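The questions above can be explored with a few pandas calls. This is a minimal sketch on a made-up toy table: the column names (`age`, `income`, `city`) and their values are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51, 38],
    "income": [40_000, 52_000, 61_000, 58_000, 1_000_000, 49_000],
    "city": ["Paris", "Lyon", "Paris", None, "Nice", "Lyon"],
})

# What variables are available, and are they numerical or categorical?
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

# How are the numerical variables related?
correlations = df[numerical].corr()

# Missing values per column.
missing = df.isna().sum()

# Outliers in one column, flagged with the 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(numerical, categorical)
print(correlations)
print(missing)
print(outliers)
```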
Data Pre-processing (Feature Engineering)
This step requires understanding how the raw data can be used and transformed to build ML models. For example, can we use the raw data directly, or should we transform it in some way before using it in later stages?
Our goal here is to make the data ready for building ML models! Many things can be done to this end, including but not limited to:
- filling missing values in the data
- dealing with (e.g., removing) outliers
- transforming categorical values
- ...
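The first three items in the list could look like the sketch below. It is only an illustration under stated assumptions: the data and column names are hypothetical, and median imputation, the 1.5 × IQR rule, and one-hot encoding are just one possible choice for each item.

```python
import pandas as pd

# Hypothetical raw data with a missing value, an outlier, and a categorical column.
raw = pd.DataFrame({
    "age": [25.0, 32.0, None, 51.0, 38.0, 29.0],
    "income": [40_000, 52_000, 58_000, 1_000_000, 49_000, 45_000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Fill missing values: here, with the median of the column.
    out["age"] = out["age"].fillna(out["age"].median())
    # Deal with outliers: here, drop rows outside 1.5 * IQR for income.
    q1, q3 = out["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out[keep]
    # Transform categorical values: here, with one-hot encoding.
    return pd.get_dummies(out, columns=["city"])


clean = preprocess(raw)
print(clean)
```

In a real pipeline, the median and the outlier bounds would be computed on the training data only and then reused at prediction time.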
Variable Selection (Feature Selection)
This step aims to select a subset of all available features, which is critical to ML model performance. It matters because, without feature selection, the number of features can be very large, which is detrimental to both model building and deployment.
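As an illustration, here is a minimal sketch of one simple selection technique, univariate selection with scikit-learn's `SelectKBest`. The synthetic dataset and the choice of `k=5` are assumptions made for the example, and many other selection methods exist.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, of which only 5 carry signal.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Keep the 5 features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)
print("selected feature indices:", selector.get_support(indices=True))
```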
ML Model Building
This is the step we are familiar with: trying different ML models and choosing the best one for the given problem. It is worth noting, however, that in practice this step is just one part of the whole pipeline, and the other steps are as important as this one, or even more so.
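"Trying different models and choosing the best one" can be sketched as a cross-validated comparison. The two candidate models, the synthetic data, and accuracy as the metric are assumptions made for this example, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for the prepared features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

print(scores)
print("best:", best_name)
```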
Deployment
After finding out which ML model to use to solve the business problem, it is time to deploy it so that the business can access it and use its predictions to create value!
Deployment does not mean deploying only the ML model, but the entire data and ML pipeline!
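One way to make this concrete is to bundle preprocessing and model into a single object and ship that object. The sketch below assumes scikit-learn's `Pipeline` and `joblib` for persistence, which is a choice made for this example and not necessarily the course's. The data, column names, and file name are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data.
train = pd.DataFrame({
    "age": [25.0, 32.0, None, 51.0, 38.0, 29.0],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
    "bought": [0, 1, 0, 1, 1, 0],
})
features = ["age", "city"]

# Preprocessing and model bundled into one object.
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("model", LogisticRegression()),
])
pipeline.fit(train[features], train["bought"])

# Persist the whole pipeline, then reload it as a serving process would.
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, path)
served = joblib.load(path)

# Raw, unprocessed input goes straight into the reloaded pipeline.
new_rows = pd.DataFrame({"age": [None, 45.0], "city": ["Lyon", "Berlin"]})
predictions = served.predict(new_rows)
print(predictions)
```

Because the imputer and encoder travel with the model, the serving side applies exactly the transformations that were fitted during training, including for a missing `age` or a city it has never seen.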
The following shows four different architectures for deploying your ML models, taken from the aforementioned Udemy course.