This post summarizes an interesting video on MLOps: From Model-centric to Data-centric AI, by Andrew Ng of DeepLearning.AI, which stresses a shift of mindset from the current model-centric AI to data-centric AI.
Why? The motivation is that an AI system is made of code (including models) and data, and the current focus has been heavily on developing and improving models (e.g., on benchmark datasets). However, we already know, and hear often, that roughly 80% (or much more) of a data science project's time goes into preparing high-quality data, with the rest spent training a model, because of "garbage in, garbage out". Across many data science projects spanning a wide range of industries, Andrew and his team have also noticed that fixing the model and working on the data can yield significant improvement compared to working on the model with the data fixed (details can be found in the video). In this regard, having tools and processes for producing high-quality data (even a small amount) is critical, and better than having a (relatively larger volume of) noisy data. The suggested iterative, data-centric workflow is:
- Train a model
- Do error analysis to identify the types of data on which the algorithm struggles
- Either get more of that data via data augmentation, data generation, or data collection (change inputs X); or give a consistent definition for labels if they are found to be ambiguous (change labels Y)
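The error-analysis step above can be sketched in code. This is a minimal illustration, not from the video: the toy data, slice tags, and the `error_analysis` helper are all hypothetical, standing in for a real model and a real slicing scheme.

```python
# Hypothetical sketch of the "train -> error analysis -> improve data" loop:
# group misclassified examples by a slice tag to find the data type to fix first.
from collections import defaultdict

def error_analysis(examples, predict):
    """Group misclassified examples by their slice tag, worst slice first."""
    errors = defaultdict(list)
    for x, y, slice_tag in examples:
        if predict(x) != y:
            errors[slice_tag].append((x, y))
    # Slices with the most errors are where to augment/collect/relabel data.
    return sorted(errors.items(), key=lambda kv: len(kv[1]), reverse=True)

# Toy data: (input, label, slice) triples with a deliberately bad "dark" slice.
examples = [
    (0.9, 1, "bright"), (0.8, 1, "bright"), (0.2, 0, "bright"),
    (0.6, 0, "dark"), (0.7, 0, "dark"), (0.4, 1, "dark"),
]
predict = lambda x: 1 if x > 0.5 else 0  # stand-in for a trained model

for slice_tag, errs in error_analysis(examples, predict):
    print(slice_tag, len(errs))  # the "dark" slice accounts for all errors
```

Here the analysis would point at the "dark" slice, so the next iteration would collect or augment more "dark" examples (change X) or re-check their labels for ambiguity (change Y).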
After deployment, monitor model performance and flow new data back to refine and update the model:
- Systematically check for concept drift/data drift (performance degradation)
- Flow new data back to retrain/update the model regularly
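The drift check above can be sketched as follows. This is an assumed, simplified test (mean shift of one feature measured in reference standard deviations); production monitoring would typically use windowed statistical tests such as KS or PSI across many features, plus direct performance tracking.

```python
# Hypothetical data-drift check: compare a production window of one feature
# against its training-time reference distribution and flag retraining.
import statistics

def drift_score(reference, recent):
    """Shift of the recent window's mean, in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.pstdev(reference) or 1.0  # avoid division by zero
    return abs(statistics.mean(recent) - ref_mean) / ref_std

def should_retrain(reference, recent, threshold=2.0):
    """True when the feature has drifted more than `threshold` std devs."""
    return drift_score(reference, recent) > threshold

reference = [10.0, 11.0, 9.0, 10.5, 9.5]  # feature values seen at training time
recent = [15.0, 16.0, 14.5, 15.5, 15.0]   # recent production window
print(should_retrain(reference, recent))  # prints True: flow new data back, retrain
```

A threshold-based check like this is what turns "monitor performance" into an actionable trigger: when it fires, the recent production data is labeled and flowed back into the next training run.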
The following figure summarizes MLOps really well via its analogy to software engineering (SE) and DevOps.