01/10/2023
If you are really interested in machine learning in industry, 👉 Designing Machine Learning Systems (2022)👈 by Chip Huyen offers a holistic view. Each chapter could be expanded into a book or more 👍
1⃣ 👉 Overview Of Machine Learning Systems
Topping academic leaderboards (state of the art, or SOTA) often does not translate into production benefits.
Latency, throughput, data dynamics, fairness, and interpretability are afterthoughts in research but critical in production.
2⃣ 👉 Introduction To Machine Learning Systems Design
Mind the gap between model and business metrics.
Developing an ML system is iterative and never-ending, just like traditional software engineering (SWE).
3⃣ 👉 Data Engineering Fundamentals
SQL exploits relational structure and a fixed schema; NoSQL lets applications define the schema.
The terms OLTP and OLAP are becoming outdated as the boundary between them blurs.
Data can be passed via databases, via microservices, or via streaming services.
4⃣ 👉 Training Data
This chapter covers sampling, labeling, class imbalance, and data augmentation.
5⃣ 👉 Feature Engineering
Handle missing data according to why it is missing, e.g. missing not at random (MNAR) vs. missing at random (MAR).
Be cautious about data leakage.
Engineer features that are neither too specific nor too generic.
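To make the leakage warning concrete, here is a minimal sketch of one common form of it: computing scaling statistics on the whole dataset before splitting. The feature values and the helper names (`fit_scaler`, `transform`) are made up for illustration. The idea is simply to split first (by time, if the data is time-ordered) and fit the statistics on the training portion only.

```python
import statistics

def fit_scaler(train_values):
    """Compute scaling statistics from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical feature values, ordered by time.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]

# Split by time first, then fit on the earlier portion only,
# so nothing about the later (test) values leaks into training.
train, test = values[:4], values[4:]
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
```

Had the scaler been fitted on all six values, the two large later values would have shifted the mean and spread seen during training.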
6⃣ 👉 Model Development And Offline Evaluation
When selecting a model, avoid the state-of-the-art trap and human biases, weigh performance now against performance later, use ensembles, and track your experiments.
Monitor trends at ML conferences such as NeurIPS, ICLR, and ICML. Oh, there are also distributed training and AutoML.
Evaluate your model against a baseline and on different populations. Calibrate your model.
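As a rough illustration of what "calibrate your model" is checking, the sketch below bins predicted probabilities and compares each bin's average prediction with the observed positive rate. The function names, the bin count, and the toy data are assumptions for this example, not something taken from the book.

```python
def calibration_table(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows

def expected_calibration_error(probs, labels, n_bins=5):
    """Size-weighted average gap between predicted and observed rates."""
    rows = calibration_table(probs, labels, n_bins)
    total = sum(n for _, _, n in rows)
    if total == 0:
        return 0.0
    return sum(abs(m - o) * n for m, o, n in rows) / total

# Toy predictions: a well-calibrated model has a small gap in every bin.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
ece = expected_calibration_error(probs, labels)
```

A model that says "90%" should be right about 90% of the time in that bin; large gaps suggest the scores should not be read as probabilities without recalibration.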
7⃣ 👉 Model Deployment And Prediction Service
Batch prediction with a batch pipeline vs. online prediction with a streaming pipeline.
Compress the model for fast inference. Compile and optimize models for edge devices.
8⃣ 👉 Data Distribution Shifts And Monitoring
Google researchers found that 60 out of 96 failures were due to causes not directly related to ML.
Data distribution shifts include covariate, label, and concept shifts.
Monitoring is the act of tracking a system's behavior; observability means setting the system up so that it gives you visibility into what is happening inside it.
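One simple way to watch for covariate shift on a single numeric feature is a two-sample statistic between a reference window and a recent window. The sketch below implements the Kolmogorov-Smirnov statistic with the standard library only. The sample values and the alert threshold are hypothetical, and a real monitor would also need a significance test and per-feature tuning.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0.0 = identical, 1.0 = disjoint)."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(ref_sorted) | set(cur_sorted):
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap

# Hypothetical feature values from training time and from production.
training_window = [1.0, 2.0, 3.0, 4.0, 5.0]
production_window = [3.0, 4.0, 5.0, 6.0, 7.0]

ALERT_THRESHOLD = 0.3  # assumed value for illustration only
shift_score = ks_statistic(training_window, production_window)
shift_detected = shift_score > ALERT_THRESHOLD
```

This only looks at the input distribution; label and concept shifts need labels or proxy signals to detect.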
9⃣ 👉 Continual Learning And Test In Production
Continual learning means learning in batches or micro-batches.
“Online learning” makes one think of online education.
Continuous learning means your model continuously learns with each incoming sample and it could also mean continuous delivery of ML as in CI/CD.
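Stateful training vs. stateless retraining (there is no "stateful retraining": in the stateful case the model continues training on new data instead of being retrained from scratch).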
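The difference between the two can be sketched with a toy "model" that just predicts the mean of the targets it has seen. The class `RunningMeanModel` and its method names are invented for this illustration. For a running mean both routes land on the same answer, which is generally not the case for real models, but it shows what "keeping state" means.

```python
class RunningMeanModel:
    """Toy 'model' that predicts the mean of the targets seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def fit_from_scratch(self, targets):
        """Stateless retraining: discard old state and train on all the data."""
        self.count = len(targets)
        self.mean = sum(targets) / self.count if self.count else 0.0
        return self

    def update(self, new_targets):
        """Stateful training: continue from the current state using only new data."""
        for y in new_targets:
            self.count += 1
            self.mean += (y - self.mean) / self.count  # incremental mean update
        return self

old_data = [1.0, 2.0, 3.0]
new_data = [4.0, 5.0]

# Stateless: needs old and new data together every time.
stateless = RunningMeanModel().fit_from_scratch(old_data + new_data)

# Stateful: starts from the previous model and only touches the new data.
stateful = RunningMeanModel().fit_from_scratch(old_data).update(new_data)
```

The stateful route never re-reads `old_data`, which is where the savings in data and compute come from.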
✳️Check the book for more. 👇