MLOps: From Model-centric to Data-centric AI

This post summarizes a video I found really interesting, MLOps: From Model-centric to Data-centric AI, by Andrew Ng of DeepLearningAI. It stresses a shift in mindset from today's model-centric AI to data-centric AI.

Why? The motivation is that an AI system is made up of code (including the model) and data, and the focus so far has been heavily on developing and improving models (e.g., on benchmark datasets). However, we often hear that roughly 80% (or maybe much more) of the time in a data science project goes to preparing high-quality data and the rest to training a model, because of "garbage in, garbage out". Across many data science projects in a wide range of industries, Andrew and his team have also noticed that holding the model fixed and working on the data can yield significant improvements compared to working on the model with the data held fixed (details can be found in the video). In this regard, having tools and processes for high-quality data, even a small amount of it, is critical, and is better than having a relatively larger volume of noisy data.




So what does data-centric AI refer to?
It means we need tools and processes that improve data quality in a systematic way. The following figure shows a clear difference between the model-centric and data-centric views in the context of speech recognition after error analysis.


More specifically, making the tools and processes systematic means iteratively improving the data:
  • Train a model
  • Do error analysis to identify the types of data that the algorithm struggles with
  • Either get more of that data via data augmentation, data generation, or data collection (change inputs X), or give the labels a consistent definition if they are found to be ambiguous (change labels Y).
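
To make the loop above a bit more concrete, here is a minimal sketch of the error-analysis step in Python. It is only an illustration under my own assumptions, not something shown in the video: the slice tags ("clean", "car_noise", "cafe_noise"), the record layout, and the function names are all hypothetical. The idea is to compute the error rate per type of data so you know where to collect or augment (change X), and to list examples whose annotators disagree so the labeling definition can be tightened (change Y).

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """Return {slice_tag: error_rate} for (tag, y_true, y_pred) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for tag, y_true, y_pred in records:
        totals[tag] += 1
        if y_true != y_pred:
            errors[tag] += 1
    return {tag: errors[tag] / totals[tag] for tag in totals}


def worst_slice(records):
    """Return the slice tag with the highest error rate."""
    rates = error_rate_by_slice(records)
    return max(rates, key=rates.get)


def ambiguous_examples(annotations):
    """Return sorted ids of examples whose annotators gave different labels.

    annotations: {example_id: [label from each annotator]}
    """
    return sorted(
        ex_id for ex_id, labels in annotations.items() if len(set(labels)) > 1
    )


# Hypothetical validation results: (slice tag, true label, predicted label)
records = [
    ("clean", "yes", "yes"),
    ("clean", "no", "no"),
    ("car_noise", "yes", "no"),
    ("car_noise", "no", "no"),
    ("cafe_noise", "yes", "yes"),
    ("cafe_noise", "no", "no"),
]

# Hypothetical transcriptions of the same clips by two annotators
annotations = {
    "clip_1": ["um, today's weather", "um... today's weather"],
    "clip_2": ["hello", "hello"],
}

rates = error_rate_by_slice(records)
target = worst_slice(records)          # slice to get more data for (change X)
to_relabel = ambiguous_examples(annotations)  # labels to make consistent (change Y)
```

In practice the slice tags would come from whatever metadata you have or add during error analysis; the point is only that the decision about what data to fix is driven by measured errors rather than guesswork.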
After deployment, monitor model performance and flow new data back to refine and update the model:
  • Systematically check for concept drift/data drift (performance degradation)
  • Flow new data back to retrain/update the model regularly
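
The monitoring step can be sketched in a similar way. Again, this is a simplified illustration under my own assumptions rather than a method from the video: the function names, the 0.05 accuracy tolerance, and the threshold of 2 standard deviations are hypothetical choices, and a real system would likely use a proper statistical test and more than one feature. The sketch flags retraining when either accuracy on recently labeled data drops noticeably below the baseline, or the mean of an input feature moves far away from what was seen at training time.

```python
from statistics import mean, pstdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def performance_degraded(baseline_acc, recent_acc, tolerance=0.05):
    """True if recent accuracy fell more than `tolerance` below the baseline."""
    return (baseline_acc - recent_acc) > tolerance


def mean_shift(reference, recent):
    """Shift of the recent mean, in units of the reference standard deviation."""
    sd = pstdev(reference)
    if sd == 0:
        return 0.0 if mean(recent) == mean(reference) else float("inf")
    return abs(mean(recent) - mean(reference)) / sd


def needs_retraining(baseline_acc, recent_acc, reference, recent,
                     tolerance=0.05, shift_threshold=2.0):
    """Flag retraining on performance degradation or a large input shift."""
    return (
        performance_degraded(baseline_acc, recent_acc, tolerance)
        or mean_shift(reference, recent) > shift_threshold
    )


# Hypothetical feature values seen at training time vs. in production
reference_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
recent_feature = [4.0, 5.0, 6.0, 7.0, 8.0]

recent_acc = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
shift = mean_shift(reference_feature, recent_feature)
retrain = needs_retraining(0.92, recent_acc, reference_feature, recent_feature)
```

When the flag is raised, the newly collected (and labeled) production data would be added to the training set and the model retrained, which closes the loop described above.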

The following figure summarizes MLOps really well through its analogy to software engineering (SE) and DevOps.