Exercise: decide which pictures above are computer-generated.
MLMW: Machine Learning My Way
Textbooks, articles, reports and learning resources to navigate the concepts in statistical modelling and bridge them to skills in data management and model productisation.
This collection of notes appeared as a response to questions about how to build a career in machine learning or data work, asked by people coming from different backgrounds. It is not a beginner guide, though: it assumes you have tried a few things already and want a bigger picture of how data, models, computing technologies and business outcomes connect or separate.
The notes are structured for a reader who knows a bit about statistical models and wants to master the data management and computing needed to make statistical models useful in a business context or at a bigger scale.
© Evgeny Pogrebnyak, 2024
Last updated: December 9, 2024.
Version 0.8.4
Get the newest version:
- Mailing list: https://buttondown.email/mlmw/
- Reddit: r/ml_my_way
- Telegram: https://t.me/ml_my_way
- Website: https://trics.me/
v0.8.4 - December 9, 2024
- The cloud computing guide (part 82)
v0.8.3 - December 8, 2024
- Need a new data engineering section
- Inspired by Comments on the weekly plan for DE job interview (pipeline2insights)
- Initial edits for parts 78-79 (Data Engineering).
v0.8.2 - October 17, 2023
- New tentative TOC, MLMW and MDTP edited together.
- Developer Voices podcast with Kris Jenkins
- Feedback loop article, with short introduction to dynamic systems (ACM, 2023):
https://dl.acm.org/doi/fullHtml/10.1145/3617694.3623227
v0.8.0 - May 10, 2024
- The MLMW is now three resources: a short topic list (must request access), a public longread guide and the MLMW website. Check out https://trics.me/ for details.
- Interviews moved from the longread to the website. Read randomlyCoding interview on production pipelines, engineering skills and job roles.
- Also at the website are the beginner track and probability section.
- A few topics were merged in the MLMW longread and now it is exactly 100 topics: 23 are annotated with extra links and structure, and 2 are notes sections.
- MLMW additions: the 2024 State of AI report, Aubrey Clayton videos about E.T. Jaynes, the ITMO ML job role model and u/AllenDowney in the persona list.
Tentative structure:
Introduction
- Statistical thinking and intuition
- Models and control
- Observation and experiment
Models and methods
- Probability and statistics
- Econometrics
- Machine learning
- Deep learning
- Other useful models
Steps in analysis
- Descriptive analysis
- Research pipelines
- Business-driven modelling
Tools for the individual statistical modeller
- Notebooks, IDEs
- BI and data visualisation
- Libraries for statistical modelling
- Building data pipelines
Data storage and operations
- Data types and sources. Tabular data. Unstructured data.
- SQL, dataframes and relational databases.
- Stream vs batch processing.
- Data warehousing (DWH).
- ETL and orchestration (Airflow and similar)
Data governance
- Data quality
- Metadata and lineage
- Data security and personal data
Building software
- How computers work. Operating systems. Networks. Local vs remote machine.
- From smaller to bigger systems. Frontend vs backend. System architecture. APIs.
- Software Development Lifecycle (SDLC) and DevOps.
- Projects and requirements. Products and features.
- Waterfall vs iterative development methodologies.
- Developer vs SWE vs CS areas. Team roles.
- Technical debt and code quality.
Making things work for the business
- Legacy enterprise systems vs data-driven systems.
- From business hypothesis to a valuable project.
- What companies usually do with data.
- Job market, companies and company valuation.
- Society benefits and regulation.
Part 1. Models and methods
Intuition and foundational concepts
Deep learning and neural networks
Interaction, feedback, networks and optimisation
Harder, dull or less obvious topics
Part 2. Data types, sources and quality
Not just numbers: text, images and sound
Data in business and economic perspective
Part 3. From research design to model productisation
Part 4. Software tools
Programming languages and statistical software
Orchestration and data engineering tools
Part 5. Business change and society impact
Technology companies as machine learning market players
Adoption in broad economy and society
Selected industry domains and use cases
Changelog – timeline of this document
Part 1. Models and methods
Intuition and foundational concepts
- Probability and randomness. [annotated]
- Probability as repeated events (Bernoulli) vs plausibility estimate (E.T. Jaynes).
- Random variables and their distributions.
- Sequence of events and conditional probability.
- Joint distribution of random variables and marginal probability.
- Axioms of probability and measure theory.
- Generating random numbers practically (pseudorandom and seed).
See a listing of probability and mathematical statistics textbooks at the end of the chapter.
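The practical point about pseudorandom numbers and seeds can be shown in a few lines with the standard-library `random` module (NumPy's `default_rng(seed)` behaves the same way): fixing the seed makes the "random" stream fully reproducible.

```python
import random

random.seed(42)          # fixing the seed makes the stream reproducible
first = [random.random() for _ in range(3)]

random.seed(42)          # re-seeding restarts the same deterministic sequence
second = [random.random() for _ in range(3)]

print(first == second)   # same seed, same numbers
```

This is why analyses and model training runs record their seeds: "random" results can then be replayed exactly.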
- Data generating process (DGP). Sample vs population. Learning from sample about the population (inference). [annotated]
In general, statistics relies on the notion that there is some discoverable law, the data generating process, that produces the data we collect and analyse.
We rarely know the true model of a DGP, but we might reason about its functional form based on theory or prior knowledge, or take the simplest functional form as a guess. Next we estimate the parameters of that functional form from the observed data points.
“Statistical pragmatism emphasizes the assumptions that connect statistical models with observed data.” Figures from Robert E. Kass (2012) Statistical Inference: The Big Picture.
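The DGP idea can be made concrete: below is a sketch that simulates data from a known (hypothetical) linear DGP, y = 2 + 3x + noise, and then recovers the parameters from the sample with the closed-form OLS estimates. In real work the true coefficients are unknown; here we choose them so we can check the estimator.

```python
import random

random.seed(0)

# Hypothetical DGP: y = 2 + 3*x + Gaussian noise.
n = 1000
x = [random.uniform(0, 10) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]

# OLS estimates from the sample:
# slope b = cov(x, y) / var(x), intercept a = mean(y) - b * mean(x)
mx = sum(x) / n
my = sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

print(round(a, 2), round(b, 2))  # estimates should be close to the true 2 and 3
```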
- Inference (econometrics) and generalization (machine learning). [annotated]
- Inference and parameter interpretation (statistics and econometrics).
- Generalization and prediction (machine learning or statistical learning).
- Change of behavior (causal inference and heterogeneous treatment).
Readings:
- Presentation: Hal Varian (2014). Machine Learning and Econometrics.
- Article: Susan Athey and Guido Imbens (2019). Machine Learning Methods That Economists Should Know About.
- Presentation: The Impact of Machine Learning on Economics and the Economy (2019).
- Correlation, causality, common drift and spurious regressions.
Example: German Cheeses and a directed acyclic graph (DAG) in Determining causality in correlated time series.
- Observation, experiment and experiment design.
- Measurement errors and missing data.
- Model performance and model evaluation. Modeling trade-offs (eg bias vs variance) and no free lunch.
Extra: What books or papers are must reads for every professional statistician? : r/statistics
Econometrics
Survey article: Undergraduate Econometrics Instruction: Through Our Classes, Darkly.
Modern introduction course: Mathematical Econometrics I by Roth and Hall
- Cross-section, time series, panel and spatial data. Single vs multivariate response variable.
- Linear regression and ordinary least squares (OLS).
- Violation of OLS assumptions.
Note: Peter Kennedy textbook (1998) is built on listing the violations, very clear to follow.
- Difference-in-Differences. Instrumental Variables. Regression Discontinuity.
- Time series. Seasonal adjustment, smoothing, filtering. [annotated]
Reference text: Hamilton.
See more textbooks in Econometrics Navigator: Time Series Section.
Extra: Forecasting: Principles and Practice by Hyndman and Athanasopoulos.
- Systems of equations. [annotated]
An important part of econometrics for imposing structure. See the rise and fall of large macroeconometric models in the 1960s-1970s.
- Klein’s model in Cowles Commission papers (1950).
- Goldberger (1972). Structural Equation Methods in the Social Sciences.
- Heckman and Vytlacil (2005). Structural Equations, Treatment Effects, and Econometric Policy Evaluation. (EP: kind of says the field is alive and well.)
Estimation
- Methods of estimation.
OLS and extensions.
Maximum likelihood.
Bayesian estimation
MCMC.
- OLS extensions.
logit/probit
GMM
2- and 3-stage least squares (2SLS, 3SLS)
Quantile regressions
Lasso, ridge
- Machine learning textbooks [annotated]
ISLR/ISLP is a career starter. Bishop or Murphy are more advanced texts with more math; they are also older and not supplemented with code. Do not confuse ISLR/ISLP with The Elements of Statistical Learning (ESL), a more mathematically rigorous book.
Acronym | Title | Authors | Latest edition |
ISLP | An Introduction to Statistical Learning | Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani | 2023 |
PRML | Pattern Recognition and Machine Learning | Christopher Bishop | 2006 |
PML | Probabilistic Machine Learning | Kevin Patrick Murphy | 2012; follow-up books in 2022 and 2023 |
Reddit Quote: For the longest time the best books for a mathematical treatment of ML were Chris Bishop's "Pattern Recognition and Machine Learning" and Kevin Murphy's "Machine Learning: A Probabilistic Perspective". Both authors have written new and updated books, better adapted to the deep learning era. Bishop's new book is "Deep Learning: Foundations and Concepts". Murphy released two books: "Probabilistic Machine Learning: An Introduction" and "Probabilistic Machine Learning: Advanced Topics".
I would say that Murphy's two tome book currently provides the most comprehensive and thorough treatment of probabilistic ML. The first chapters of the introductory book are basically mathematical preliminaries, so it's more accessible than before. Additionally, the most frequently used book for getting a strong mathematical foundation for ML is "Mathematics for Machine Learning" by Deisenroth et al.
CS229 Lecture Notes Fall 2022 by Andrew Ng. Very good structure.
For machine learning (not deep learning), I recommend the Andrew Ng lecture notes from Stanford's CS229 course. The reason I really like these notes is because you can find past problem sets that went along with them, and the problem sets are very good: difficult but not impossible, and close to a 50/50 mix of math and programming. I never feel like I've learned a topic just from reading about it, so having good problems to go along with the reading was very important to me (Reddit quote).
scikit-learn: machine learning in Python by Gael Varoquaux. Part of Scientific Python Lectures. “One document to learn numerics, science, and data with Python.”
Two books below are paywalled, but very practical.
Andreas Mueller and Sarah Guido (2016). Introduction to Machine Learning with Python.
HOML: Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd edition) by Aurélien Geron. Repo: https://github.com/ageron/handson-ml3/
- Statistical learning theory. Supervised vs unsupervised learning. [annotated]
Beginner: Chapter 8 “When Models Meet Data” from Mathematics for Machine Learning by Deisenroth et al (2020).
Advanced: John Shawe-Taylor (2023). Statistical Learning Theory for Modern Machine Learning (video, slides)
- Typical ML tasks.
- Classification.
- Prediction.
- Clustering.
- Dimensionality reduction.
- Decision trees.
- Support vector machines (SVM) and discriminant analysis.
- Ensembles and forecast combination. Choosing between models and AutoML.
Deep learning and neural networks
- How does a simple neural network like a perceptron work? How do more complex networks train and operate? [annotated]
- Deep learning textbooks [annotated]
Freely available online:
- Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning Book (DLB)
- Aston Zhang, Zachary C. Lipton, Mu Li and Alex J. Smola. Dive into Deep Learning (d2l) (with notebooks).
- Simon Prince. Understanding Deep Learning (UDL) (with notebooks).
Paywalled:
- Deep Learning: Foundations and Concepts by Bishop and Bishop.
- Neural network architectures
Feed-forward Neural Network
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Generative Adversarial Network (GAN)
Transformers
- GPT models. Interaction, dialogue, prompt engineering. Retrieval augmented generation (RAG) and model fine-tuning.
- One-shot, federated, transfer learning and other advances in deep learning.
Artificial intelligence
In short: not everything in AI is a neural network. The AI winter ended with backpropagation and the rise of computational power. There is no AGI (yet): a computer cannot “think”.
- State of AI Report: Artificial Intelligence Index Report 2024 [annotated]
Great summary by a Reddit user, beat that clarity: “AI good, getting gooder. Tech > academics. AI costs $$$, gonna cost $$$$$. US numba 1. Benchmarks are meh. GenAI is trending. AI regulations increase. People & science can and do benefit from AI. People have begun to pay attention to AI.”
- History and branches of AI. [annotated]
microsoft/AI-For-Beginners: 12 Weeks, 24 Lessons (contains lessons on the symbolic approach with Knowledge Representation and reasoning, Genetic Algorithms and Multi-Agent Systems in addition to neural nets).
Annotated History of Modern AI and Deep Learning by Jürgen Schmidhuber.
Textbook: Artificial Intelligence: A Modern Approach by Russell and Norvig (a bit old).
- Artificial general intelligence (AGI). [annotated]
Economist: How to define artificial general intelligence.
Other modeling approaches
For modelling topic classification see JEL Classification System: Mathematical and Quantitative Methods. Compare with AMS Classification for subjects in mathematics.
Bayes and causality
Departing from frequentism: Bayesian modelling and causality.
- Bayes theorem and Bayesian modeling. Probabilistic programming.
Probability and Bayesian Modeling (2020)
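A classic worked example of Bayes' theorem, with hypothetical numbers for a diagnostic test, shows why a positive result on a rare condition is far from conclusive:

```python
# Hypothetical numbers: a test with 99% sensitivity, 95% specificity,
# for a condition with 1% prevalence.
p_d = 0.01               # P(disease)
p_pos_d = 0.99           # P(positive | disease)
p_pos_nd = 0.05          # P(positive | no disease)

# Bayes theorem: P(disease | positive) = P(pos|d) * P(d) / P(pos)
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # total probability of a positive
p_d_pos = p_pos_d * p_d / p_pos

print(round(p_d_pos, 3))  # about 0.167: most positives are false positives
```

The low prior (1% prevalence) dominates the strong likelihood, which is the core intuition behind Bayesian updating.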
- Causality and do-notation. [annotated]
Book of Why by Judea Pearl.
Causal Inference The Mixtape by Scott Cunningham.
Causal ML Book by Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, Vasilis Syrgkanis (2024)
Software: PyWhy, EconML
Interaction, feedback, networks and optimisation
- Operations research and statistical decision theory.
- Agents. Reinforcement learning. Game theory. Auction design.
- Systems with feedback and system dynamics (SD). Control theory.
- Graphs and networks. Knowledge graphs.
- Optimisation models and solvers. [annotated]
Linear programming (LP). PuLP package.
Convex Optimization textbook by Boyd and Vandenberghe.
Several book suggestions here in a Reddit post.
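To build intuition for what LP solvers exploit, here is a brute-force sketch of a toy problem (all numbers made up; a real model would use PuLP or scipy.optimize.linprog): the optimum of a linear program lies at a vertex of the feasible polygon, so enumerating feasible intersections of constraint boundaries finds it.

```python
from itertools import combinations

# Toy LP: maximise x + 2y subject to x + y <= 4, x <= 3, y <= 2, x >= 0, y >= 0.
# Each constraint is written as a*x + b*y <= c.
constraints = [(1, 1, 4), (1, 0, 3), (0, 1, 2), (-1, 0, 0), (0, -1, 0)]

def intersect(c1, c2):
    """Intersection of the boundary lines a*x + b*y = c, or None if parallel."""
    (a1, b1, d1), (a2, b2, d2) = c1, c2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        return None
    return ((d1 * b2 - d2 * b1) / det, (a1 * d2 - a2 * d1) / det)

def feasible(p):
    return all(a * p[0] + b * p[1] <= c + 1e-9 for a, b, c in constraints)

# A vertex of the feasible region is a feasible intersection of two boundaries.
vertices = [p for c1, c2 in combinations(constraints, 2)
            if (p := intersect(c1, c2)) is not None and feasible(p)]
best = max(vertices, key=lambda p: p[0] + 2 * p[1])
print(best)  # the optimum sits at a vertex of the feasible polygon
```

The simplex method is, in essence, a smart walk along these vertices instead of an exhaustive enumeration.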
Harder, dull or less obvious topics
- Combinatorics.
- Random variable distributions and their families.
- Point estimation. Confidence intervals. Hypothesis testing.
Informal review on hypothesis testing in Notes for Nonparametric Statistics.
- Convergence and central limit theorems. Asymptotics.
- Sampling methods and techniques.
Sampling techniques: cross-validation, bootstrap, jackknife.
Course (advanced): Keisuke Hirano and Jack Porter (2022). Modern Sampling Methods: Design and Inference.
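The bootstrap idea in miniature, on made-up data: resample with replacement, recompute the statistic many times, and read an approximate confidence interval off the percentiles of the replicates.

```python
import random
import statistics

random.seed(1)
sample = [random.gauss(10, 2) for _ in range(200)]  # toy observed sample

# Bootstrap: resample with replacement and recompute the statistic.
means = sorted(
    statistics.mean(random.choices(sample, k=len(sample))) for _ in range(1000)
)
lo, hi = means[25], means[974]   # approximate 95% interval from percentiles
print(round(lo, 2), round(hi, 2))
```

The same resampling pattern underlies cross-validation (resampling without replacement into folds) and the jackknife (leave-one-out).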
- Non-parametric methods. [annotated]
Introduction to Nonparametric Statistics by Bodhisattva Sen
All of Nonparametric Statistics by Larry Wasserman
Reddit quotes:
- https://www.reddit.com/r/statistics/comments/obhpte/comment/h3og15r/
- https://www.reddit.com/r/statistics/comments/1bya8dd/comment/kyi7i8q/
- Differentiation and differential equations.
- Random processes.
- Information theory. Entropy and cross-entropy.
- Boolean vs fuzzy logic. Qubit and quantum computing.
- Knowledge representation and ontologies.
- Probability as part of measure theory.
Reddit quote: What is necessary however, is to understand measure-theoretic probability. If you have a solid foundation in measure theory, that should be quite straightforward. You will see that only a subset of results/theorems from measure theory make frequent appearances. These include the Fubini/Tonelli theorem, absolute continuity, the Radon-Nikodym derivative, the Borel-Cantelli lemma etc. and you can always refer back to your measure theory book for things.
For probabilistic/statistical machine learning, you almost always assume probability measures are dominated by the Lebesgue measure on the underlying Euclidean space and work directly with pdfs. The only area of ML theory that I know of that is measure-theory heavy is PAC-Bayes / advanced statistical learning theory.
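The entropy and cross-entropy item in the list above can be written out directly:

```python
from math import log2

def entropy(p):
    """H(p) = -sum p_i log2 p_i, the average bits needed with an optimal code."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): average bits when data follow p but the code is built for q."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
print(entropy(p))                         # 1.5 bits for this distribution
print(cross_entropy(p, p))                # equals the entropy when the model is exact
print(cross_entropy(p, [1/3, 1/3, 1/3]))  # a mismatched model costs more bits
```

Cross-entropy exceeding entropy by exactly the KL divergence is why cross-entropy is the standard loss for classifiers: minimising it pushes the model distribution toward the data distribution.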
Mathematical prerequisites
Reddit user: “Eventually, after years of trying to get in through various "shortcuts" I realised that I actually have to learn maths and statistics like all the other guys”.
- From d2l preface: Linear Analysis by Bollobás (1999) covers linear algebra and functional analysis in great depth. All of Statistics (Wasserman, 2013) provides a marvelous introduction to statistics. Joe Blitzstein’s books and courses on probability and inference are pedagogical gems.
- Mathematics for Machine Learning (MML) by Marc Peter Deisenroth, A. Aldo Faisal and Cheng Soon Ong (2020) - highly recommended. Part 1 is math and part 2 is math application to classic problems of regression, dimensionality reduction, density estimation and classification. Part 2 is focused just on these four problems, which gives a feeling of completeness and achievement. Chapter 8 “When Models Meet Data” is a great introduction to statistical learning theory.
- Calculus. [annotated]
What is Calculus I, II and III in the US: Naming of calculus courses.
See Econometrics Navigator – Mathematic preliminaries – Calculus
Real analysis introduction by Hunter (advanced).
- Linear algebra. [annotated]
See Econometrics Navigator – Mathematical preliminaries – Linear Algebra
- Probability and mathematical statistics. [annotated]
- Blitzstein and Hwang (with the Statistics 110 video series), Wackerly, Mood. Introduction to Probability Theory by Hoel, Port and Stone, and some others discussed here: Best Probability Theory textbook? : r/math.
- Using Julia for Introductory Statistics by John Verzani – you can set aside the fact that the book uses the Julia programming language; it is a very thoughtful text about probability and statistics in general.
- Why Another Probability Textbook? (2022). Probability and Statistics Cookbook (2011).
- Larry Wasserman “All of statistics” is the textbook for 10-705 Intermediate Statistics Course (see lectures notes).
- Intermediate probability – a Reddit comment in Overlap between Mathematical Statistics and Probability Theory textbooks:
An intermediate probability course must include:
- Probability spaces (axiomatic development of sigma algebras and Probability measure, besides the usual topics)
- Random variables (as particular cases of measurable functions, probability distribution and density functions with their properties, and the usual intro to common RVs)
- Random vectors (similar to R variables; criteria of independence, conditional densities)
- Distribution of R variables and order statistics
- Moments of RVs (characteristic function and its properties are very important)
- (most important) Asymptotic theory (convergence in mean squared, in probability and in distribution, and related theorems (Markov, Tchebychev, Slutsky, Helly-Bray, Lèvy, etc.))
Now, Wackerly offers a very good introduction to topic 1 (without the axiomatic development of sigma algebras and probability spaces), topics 2 and 3 (without explaining measurability of RVs). It is very clear explaining random vectors. Topic 4 is very well covered and topic 5 (without the characteristic function). Does not touch topic 6, which is essential for inference. Its main advantages are the examples, the clarity of the explanations and the visual organization which makes it very easy to read.
I would suggest you read Wackerly first (till chapter 6) and then read about asymptotics in Hogg's Intro to Math Stats (8th ed.) or Roussas' A Course in Math Stats (2nd ed.) or Mittelhammer's Math. Stats. for Economics and Business (2nd ed.).
That's the probability you are going to need if you want to study statistical inference at an intermediate level. Wackerly is rather elementary in this regard, but can be an excellent introduction to inference if you are new to the field.
Textbook review
- Textbook review [annotated]
More beginner-friendly vs more advanced texts, with focus on open source texts (o) and code examples (c) or Jupyter notebooks (j). Top picks marked with (*).
Subject Area | Beginner-friendly (undergraduate) | Advanced (graduate) | Reference or other texts |
Math prerequisites | Deisenroth* (o, 2020) | ||
Probability | Blitzstein and Hwang * | Wentzel (1982) | |
Statistics | Wasserman* (2013) Casella/Berger | ||
Econometrics | Kennedy* Stock/Watson | Greene | |
Time series | Hamilton | ||
Machine Learning | ISLR/ISLP* (o,c) | PRML by Bishop (o) PML by Murphy (o) | ESL (o) Vapnik (1998) |
Deep Learning | Andrew Ng* | UDL(o,j), d2l (o,j), DLB(o) | |
Artificial Intelligence | Russell/Norvig (2020) | | |
Part 2. Data types, sources and quality
Tabular data
- Table in a dataframe and in a relational database. Data types and table schema. Data serialization.
- Textbook datasets. Kaggle and similar datasets. Official statistics, data search, open data.
Not just numbers: text, images and sound
- Text as vector. Natural language processing. Deep learning in NLP (ChatGPT). [annotated]
Textbooks
- Speech and Language Processing by Jurafsky and James H. Martin
- Big Ideas: Natural Language Processing with MacArthur Fellow Dan Jurafsky
- Programming and written exercises
Manning et al (2008) Introduction to Information Retrieval.
Lewis Tunstall, Leandro von Werra, Thomas Wolf (2022). Natural Language Processing with Transformers.
Courses: Lena Voita NLP Course “For You”
Extra: Linguistic fundamentals for natural language processing: 100 essentials from morphology and syntax by Emily M. Bender (2013, paywalled, see TOC)
Few articles (via Ilya Gusev):
- Word2Vec: Mikolov et al., Efficient Estimation of Word Representations in Vector Space https://arxiv.org/pdf/1301.3781.pdf
- FastText: Bojanowski et al., Enriching Word Vectors with Subword Information https://arxiv.org/pdf/1607.04606.pdf
- Attention: Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
- Transformers: Vaswani et al., Attention Is All You Need https://arxiv.org/abs/1706.03762
- BERT: Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805
Libraries:
- Hugging Face (transformers)
- spaCy
- NLTK
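A minimal illustration of "text as vector": bag-of-words counts and cosine similarity, on toy sentences (real pipelines would use embeddings from the libraries above, but the geometry is the same).

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words: a text becomes a sparse vector of word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

doc1 = bow("the cat sat on the mat")
doc2 = bow("the cat lay on the rug")
doc3 = bow("stock markets fell sharply")

print(cosine(doc1, doc2))  # documents sharing words score higher
print(cosine(doc1, doc3))  # no shared words: similarity is zero
```

Word2Vec and transformer embeddings replace raw counts with dense learned vectors, so that similarity also captures meaning rather than only shared vocabulary.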
- Images and video. Color representation. Computer vision.
- Sound, noise, music. Waves.
Data in business and economic perspective
- Big vs small data. [annotated]
Article: Hal Varian (2014). Big Data: New Tricks for Econometrics
Report: Matthew Harding and Jonathan Hersh (2018). Big Data in Economics.
Course: Melissa Dell and Matthew Harding (2023). Machine Learning and Big Data.
- Data quality and limitations (incomplete, not granular, not relevant, not yours). Data governance frameworks (DMBOK).
Data Governance Is A Top Priority For 65% Of Data Leaders (Gartner via Atlan).
Part 3. From research design to model productisation
Steps in analysis
- Descriptive statistics. [annotated]
John Tukey and the Origins of EDA
Remembrances of Things: EDA (article about John W. Tukey work).
The 4 R’s in EDA (from A Course in Exploratory Data Analysis)
- Data visualization. Dashboards. [annotated]
Economist: Mistakes, We Have Drawn a Few.
Grammar of Graphics (ggplot)
- Business hypothesis and ways to test it. Business outcomes.
- Interacting with business or retail customer
- Controlling own system or business processes.
- Supply chain, sourcing, inputs control.
- Reporting, regulation, compliance, audit.
Advanced analytics and dashboards loop vs automation of control loop.
- Analysis as a DAG.
Quote from drivendata: cookiecutter as a starter (but not universal) project template.
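"Analysis as a DAG" can be sketched with the standard-library graphlib (Python 3.9+): steps declare the steps they depend on, and a topological sort yields a valid execution order. The step names here are hypothetical.

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical analysis steps and their dependencies, expressed as a DAG:
# each step lists the steps that must finish first.
pipeline = {
    "clean": ["load"],
    "features": ["clean"],
    "train": ["features"],
    "evaluate": ["train", "features"],
    "report": ["evaluate"],
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # a valid execution order: "load" first, "report" last
```

Orchestrators such as Airflow and Prefect are, at their core, this same idea plus scheduling, retries and monitoring.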
- Reproducible research and reproducibility crisis.
ML in production
- Making business change with data [notes]
—-
A - Identify the business case or hypothesis where you can earn more or save some money or make other improvements.
B - Create the adequate model:
Make a simplified representation of the value chain, business process or client interaction that you are changing (the model).
State the points of control where you do something and how these points of control affect the whole system under control (control points).
State the points of data collection where you gather the data and how they help shape the decision and judge its success (data points).
C - State what you are about to do (proposed change):
Why (motivation), what/how/where (proposal) and to what end (success criteria)
Estimate if it is worth it.
Lay down the most simple way to test the proposed change.
D - Prove the change is worth it with experiments
E - Next:
Scale if experiments show it is profitable or useful.
Continue the search – update the model, proposed change or the business hypothesis.
Give up and move to the next thing (with lesson drawing or without).
F - Be able to iterate on this chart faster, cheaper and with better outcomes.
—-
- Responsibilities of a data engineer, data scientist or modeler, machine learning engineer, business analyst and other roles. [notes]
Perfect world (“bam” is a sound like “bang”, popularized by StatQuest videos to mark an “aha” moment).
Idea what to improve -> Good data -> Known model -> Quick inference -> Bam to production -> Bam cool and unambiguous effect -> Bam business liked it
Practically perfect world:
Many ideas -> Business and modelling hypothesis -> Expected result -> Where to deploy? -> Is there a model for this? Which do we pick? -> Is there data for this? -> Can we estimate/train the model? -> Does it seem to work? -> Can we deploy? -> How is it doing in production? -> Can we improve? -> Is business happy? -> How long would the result persist?
Pick roles who is in charge of what:
- A full-stack “data scientist”
- A modeller
- Data engineer / Data architect
- Machine learning engineer / ML platform engineer
- Software engineers
- Research scientist
- Business analyst
- Business lead or product manager
- Vendor/consultant/ChatGPT
When does the data pipeline become a “product”?
- any data pipeline that worked and delivered the business result, however trivial from modeling viewpoint;
- a complex system that is not just model+prediction (frontend, hardware, business rule change, etc);
- anything that a business or end user wants and has a bit of data or intelligence in it;
- something you can sell as a solution in a specific industry.
What can go wrong in a data pipeline? How are companies different with respect to data and machine learning? What are the most aggressive promises about AI in a specific industry? Why is this not happening yet?
Broader perspective:
- system and actors vs control system and desired outcomes
- customer/asset life cycle, person-centricity
- business intelligence, business value and margins
- silos, vested interest, delegation of control/responsibility
- change of business models and company boundary
- immediate ROI, long-term sustainability, business valuation
- data as representation of objects, processes, behaviors
- ML pipelines and roles as told by the companies and experts. [notes]
- ITMO University role model (very detailed roles in the picture):
https://github.com/aimclub/ai-competency-model/
- Exercise: spot the paragraph wrongly placed in a guide https://learn.microsoft.com/en-us/training/modules/leverage-ai-tools/6-understand-machine-learning-lifecycle
- Exercise: which part do you think is most important in a working data pipeline (extracted from Demystifying AI for the Enterprise book).
- Ask a Specific Question
- Start Simple
- Try Many Algorithms
- Treat Your Data with Suspicion
- Normalize Your Inputs
- Validate Your Model
- Ensure the Quality of Your Training Data
- Set Up a Feedback Loop
- Don’t Trust Black Boxes
- Correlation Is Not Causation
- Monitor Ongoing Performance
- Keep Track Of Your Model Changes
- Don’t be Fooled by “Accuracy”
- Machine learning pipelines and MLOps.
https://github.com/EthicalML/awesome-production-machine-learning
- Model life cycle, model drift vs data drift.
Part 4. Software tools
Programming languages and statistical software
- Programming languages (R, Python, Julia) and machine learning libraries.
R for statistical packages and classic statistics.
Python for statistics (statsmodels), machine learning (scikit-learn) and deep learning (PyTorch, TensorFlow, Keras).
Automatic differentiation and composability in Julia and JAX.
Exercise: Compare a tabular dataframe implementation in R, Python (pandas or polars) and Julia.
- Statistical software, proprietary vs open source, and documentation.
Proprietary (SAS, Stata, MATLAB) vs open source (R, Julia) statistics tools.
Software documentation as a learning tool (often great to read even if you are not using the package):
- Modern Applied Statistics with S <https://www.stats.ox.ac.uk/pub/MASS4/>
- R vignettes
- MATLAB
- gretl
- eviews
- scikit-learn lectures
- JASP and jamovi
- Notebooks vs plain files and packages. Version control. Refactoring and cleaner code.
The Missing Semester of Your CS Education (ignore metaprogramming chapter).
- Extra topic: open source viability and funding models
VC-driven: streamlit
Sponsored: PyWhy
Burnout: curl
Databases and storage
- Disk: HDD, SSD and cloud (S3) storage. Disk costs and time to access. File systems (HDFS).
- Database management systems (DBMS). Relational databases and SQL.
Other types of databases and NoSQL (key-value, graph, column, vector, time series).
Processing large data in parallel: MapReduce, Hadoop (HDFS+Yarn+MapReduce), Spark.
Search databases (ElasticSearch, Splunk, Solr)
Database popularity:
- https://db-engines.com/en/ranking
- https://www.jetbrains.com/lp/devecosystem-2023/databases/
- https://survey.stackoverflow.co/2023/#section-admired-and-desired-databases
- Data warehouses (DW) and DW architectures. Decoupling storage and compute.
- Cloud providers (AWS, GCP, Azure). New data solution providers (Snowflake, Databricks).
- Mergers and acquisitions, venture financing and forks:
Sun buys MySQL (2008), Oracle buys Sun (2010), EU and US antitrust approval.
SAP's acquisition of Sybase (2010).
Cloudera and Hortonworks merger (2019).
Valkey, a Redis fork after licence change (2024).
Exercise: find venture-funded database projects here and explain their valuations.
- More elements of database theory and implementation. Relational algebra and ER-diagrams. ACID, CAP, BASE. DDL and DML. Normalization and normal forms. Hashing and B-trees. OLAP and OLTP.
Extra video: The ancient art of data management (2023) from DuckDB co-founder.
Extra reading: Lecture notes on database engineering (VSUUT, India).
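One of the items above, the A in ACID (atomicity), can be demonstrated with the standard-library sqlite3 module: a transaction that fails midway leaves no partial update. The table and amounts are made up.

```python
import sqlite3

# Atomicity in miniature: either both updates of a transfer happen, or neither.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
con.commit()

try:
    with con:  # the context manager commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
        # (never reached -- the matching credit to bob would go here)
except RuntimeError:
    pass

balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(balances)  # the half-done transfer was rolled back: alice still has 100
```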
Data engineering tools
- Data engineering lifecycle from https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/
See also Exploring the Modern Data Warehouse by Microsoft Learn.
- Workflow and orchestration tools (MLFlow, Airflow, Prefect, Luigi, etc).
What the data stack is:
Here is a reddit post and a link to docs.
- More DE sources:
Andriy Burkov's MLOps book.
Data Engineering Zoomcamp.
See MAD@firstmark.com company landscape.
See my notes Comments on the weekly plan for DE job interview (pipeline2insights).
The cloud
… is not just someone else’s computer.
- Knowing the cloud is good for the job:
One thing I would say is usually a *must* is familiarity with Linux and a cloud provider (AWS, GCP, Azure). You don't need to know all 3 cloud providers (pick AWS if you don't know any yet - it has 50% market share) but if you don't know any of them it'll be harder to on board you and your first few weeks would be a lot more overwhelming – even knowing a different one to the one you use at a specific job will help as they all have similar functionality.
- Business user guide to the cloud
Client-server ideas as a starting point:
- your own machine is not the most powerful computer available; making it so would be too costly
- you can connect to a remote machine as admin (eg via ssh)
- users can connect to a server for their own workloads (eg API calls)
- a lot of compute work can happen on the server, where data is also stored
- historically, software started as a monolith and then became more modular
Cloud mindmap:
- the server is virtualised, no longer a single bare-metal machine (resource efficiency)
- there are various data storage solutions (storage)
- an application can spin up a container (containerisation)
- containers can be managed in a smart way (orchestration)
- there can be a hosted environment for applications (PaaS)
- an application can be very small (lambda functions, "serverless")
- but there is still a network with latency in the middle (zones, CDNs, etc.)
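The "serverless" point above boils down to shipping a single handler function; the platform supplies the server, scaling and routing. A minimal sketch, with an event shape that is a simplified stand-in for what a provider such as AWS Lambda would actually pass in:

```python
# Hypothetical serverless handler: the provider invokes it per request,
# passing an event dict; 'context' carries runtime metadata (unused here).
def handler(event, context=None):
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"hello, {name}"}

# Locally you can invoke it directly, which is also how unit tests work
response = handler({"name": "cloud"})
print(response)  # {'statusCode': 200, 'body': 'hello, cloud'}
```

The appeal is that you pay only per invocation; the cost is provider-specific event formats and cold-start latency.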
On-premise vs cloud criteria:
- hardware costs
- maintenance costs
- security
- speed
- flexibility
- ownership
Economics of cloud computing:
- virtualisation (hypervisor, virtual machine) achieves higher hardware utilisation
- decoupling of storage (S3) and compute (EC2)
- workload scheduling and containerisation (Kubernetes, OpenShift)
- network costs
- data center economics
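The utilisation argument behind the first bullet can be shown with back-of-envelope arithmetic. All numbers below are invented for illustration:

```python
# Toy utilisation arithmetic: 10 teams each need peak capacity of 10 units
# but use only 2 units on average. Peaks rarely coincide, so a shared
# (virtualised) pool can be sized far below the sum of individual peaks.
teams, peak, average = 10, 10, 2

on_prem_capacity = teams * peak        # everyone provisions for their own peak
pooled_capacity = teams * average * 2  # shared pool with 2x headroom over average

print(on_prem_capacity, pooled_capacity)  # 100 vs 40
```

This is the hypervisor's economic case in miniature: pooling smooths out peaks, so the same workloads run on less hardware.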
Cloud providers:
- AWS/GCP/Azure and others, especially outside the US
- software, platform or infrastructure (SaaS/PaaS/IaaS)
- typical products and product overlap
- vendor lock and switching costs
- competition issues (EU investigations)
Also happening:
- "managed" and "hybrid" clouds emerged
- there are both lower-level and higher-level services
- providers offer a myriad of services (and terrible APIs)
- you cannot easily switch a cloud provider
- "you grow out of cloud"
- cloud seeks extra value streams
Data centers and energy efficiency:
- hyperscalers vs smaller providers
Deals:
- Broadcom buying VMware for $68 billion (2023)
- Microsoft settles EU cloud complaint for $22m (2024)
Links:
- https://news.microsoft.com/download/archived/presskits/cloud/docs/The-Economics-of-the-Cloud.pdf
- https://www.networkworld.com/article/2516091/microsoft-settles-cloud-complaint-for-22m-to-avoid-eu-antitrust-probe.html
- https://www.computerweekly.com/feature/Broadcoms-VMware-acquisition-explained-The-impact-on-your-IT-strategy
Part 5. Business change and society impact
Making money and making change.
Technology companies as machine learning market players
- Internet-scale data owners.
- Cloud providers.
- Companies that sell ML tools and solutions, their valuations and strategy.
H2O, DataRobot.
- Hardware providers (NVIDIA).
Adoption in broad economy and society
- Fairness, biases, equity, human in the loop.
- Economics, cost and payoffs of applying ML. Business value of ML.
- Job market: in-house data modeller, consultant or a vendor?
- Who’s got more data? Data privacy and data protection. Markets and pricing of data.
- What gets to be regulated. Does a national AI or data strategy make sense?
- Why the hype: what makes corporations hype and overpromise? Why do investors buy it?
Selected industry domains and use cases
- Social sciences (sociology, political science, psychology, anthropology).
- Clinical trials.
- Industrial processes (discrete or continuous). Quality control and dependability.
- Recommender systems (RecSys).
- ML in finance
See Halperin textbook.
Appendix
Interviews
randomlyCoding: On production pipelines, engineering skills and job roles.
More personal skills
- Common sense, logic and critical reasoning.
- Writing well, explaining, inquiring, communicating.
On Writing Well: An Informal Guide to Writing Nonfiction by William Zinsser; see also "On Writing Well and keeping it up-to-date for 35 years" in The American Scholar.
Personas
- People who strike me as great thinkers and educators, who make complex things easy to follow for the rest of us – through courses, books and personal interaction:
- Will Kurt
- Scott Cunningham
- Laura Mayoral
- Allen Downey (u/AllenDowney)
You may also be surprised that a textbook author or a professor can be reachable on Twitter/X or other social media:
- Jeffrey Wooldridge (@jmwooldridge)
- Paul Goldsmith-Pinkham (@paulgp), repo for Yale Applied Empirical Methods PhD Course and the video list (PGP)
- Hall of shame: story of Siraj Raval (plagiarism in education).
Glossary
- Common terms, professional slang and buzzwords.
Common terms:
- Supervised vs unsupervised vs semi-supervised learning.
- Structured vs unstructured data.
Professional slang:
- Feature engineering (variable selection and transformation).
- ETL vs ELT (data ingestion).
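Feature engineering in its simplest form can be sketched in a few lines. The dataset below is made up; the two transforms shown (standardising a numeric column, one-hot encoding a categorical one) are among the most common:

```python
# Made-up dataset: one numeric and one categorical column
ages = [20, 30, 40]
cities = ["paris", "rome", "paris"]

# Standardise the numeric column to zero mean and unit variance
mean = sum(ages) / len(ages)
std = (sum((a - mean) ** 2 for a in ages) / len(ages)) ** 0.5
ages_scaled = [(a - mean) / std for a in ages]

# One-hot encode the categorical column
categories = sorted(set(cities))  # ['paris', 'rome']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities]

print(ages_scaled)  # roughly [-1.22, 0.0, 1.22]
print(one_hot)      # [[1, 0], [0, 1], [1, 0]]
```

Libraries such as scikit-learn wrap these transforms (StandardScaler, OneHotEncoder), but the underlying arithmetic is this simple.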
Fading buzzwords:
- Data mining
- Big data
Other resources
- Courses, syllabuses and exercises
- CIS 4190/5190: Applied Machine Learning (Spring 2023) – great list of resources on one page.
- https://deepmleet.streamlit.app – aims to be the leetcode of machine learning.
- Society of Actuaries (SOA) exams: Statistics for Risk Modeling (SRM) Exam (topics are fine, but the literature is a bit dated, e.g. ISLR instead of ISLP in the syllabus).
- Video series
- StatQuest with Josh Starmer (This man is a genius.)
- 3Blue1Brown by Grant Sanderson. (Very high quality content!)
- Machine Learning Street Talk (Suggested by a reader: "Sometimes a bit too dense for absolute beginners but really good. They list resources, papers, books.")
Changelog – timeline of this document
v0.7.0 (April 29, 2024):
- Excellent undergrad econometrics course Mathematical Econometrics I by Roth and Hall.
- Updated econometrics vs machine learning section with papers and courses from Hal Varian and Susan Athey.
- Classic probability textbooks added (Blitzstein and Hwang, Wackerly, Mood) along with modern free websites.
- Links from Econometric Navigator for time series, calculus and linear algebra.
- Total count is 173 links and 111 topics.
v0.6.2:
- Discussion (r/MachineLearning): Thoughts on Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow by Geron
v0.6.1:
- Interview with u/randomlyCoding/
- Using Julia for Introductory Statistics by John Verzani
v0.6.0
General interest:
- Overview of econometrics topics in JEL Section C and the AMS classification for math papers.
- Annotated History of Modern AI and Deep Learning by Jürgen Schmidhuber.
Niche:
- Bodhisattva Sen and Larry Wasserman on nonparametric statistics.
- Informal review on hypothesis testing (online book appendix chapter).
v0.5.5:
- Big Ideas: Natural Language Processing with MacArthur Fellow Dan Jurafsky (interview, beginner-friendly) and Lena Voita NLP Course | For You
- scikit-learn: machine learning in Python by Gael Varoquaux. Part of Scientific Python Lectures, one document to learn numerics, science, and data with Python. I think it is an underappreciated resource (beginner, but programming knowledge required).
- Andrew Ng's lecture notes on machine learning from a 2022 course. EP: Andrew Ng is best known for his deep learning course, but these classic machine learning notes are very well structured (intermediate).
- Deisenroth et al. (2020). Mathematics for Machine Learning. Chapter 8 “When Models Meet Data” is an accessible introduction to statistical learning (very beginner friendly).
- Shawe-Taylor (2023). Statistical Learning Theory for Modern Machine Learning, has video and slides (advanced).
- Causal ML Book by Chernozhukov et al. (2024) (advanced).
April 9, 2024
Added a 3x2 table on the title page with key topics. Also removed a few images.
April 6, 2024 (88 topics)
- Finalised Databases and storage.
- Edited SEMs and references from Paul Goldsmith-Pinkham
- Marked for review DAG and ML project flow.
- To add next new causal ML book.
March 31, 2024
The topic count is 77, now organized into textbook, data, project, adoption and cases sections. The list was well received in Reddit comments, but the post was removed by moderators with no specific reason stated.
March 29, 2024
A way to keep up with data modelling and sort out what you already know. So far it is a list of topics organized by section, perhaps somewhat upside down compared to a traditional textbook or course, but I hope you like the perspective. A few links added where most appropriate and where I remembered good material. There are open textbook and blog links at the Econometrics Navigator website, my previous work. 33 topics in the original post.
March 28, 2024
First published as a Reddit post.
Quotes and reader feedback
This is pretty neat. Congratulations on putting together such a great list!
Amazing, thanks man, it would also be much better to provide resource lists as well, still pretty useful, thanks!
This list is gold, thanks :)
This list is pretty comprehensive. I would have a bit on MLOps side because most advanced practitioners of ML should have some amount of understanding of how models are productionized. Perhaps a few topics on model drift, data drift, understanding how experiments are set up etc can be beneficial. Overall looks pretty good and will probably even use this to brush up on my own skills.
Guide roadmap
| Models and methods | Pipelines | Tools |
| --- | --- | --- |
| Stats and econometrics | Descriptive analysis | Programming languages |
| Machine learning (ML) | Task design and outcomes | ML and DL libraries |
| Deep learning (DL) | ML in production | Databases, DE and MLOps |
| Other methods | Reproducible research | Infrastructure for ML |
| Model evaluation | | |

| Data | Players and impacts | Economics |
| --- | --- | --- |
| Types of data | Technology companies | Cost of (not) doing ML |
| Sources and ownership | Non-tech business | Markets for data |
| Data quality and DG | The human user | Rationale for regulation |
| NLP, CV and Robotics | Society impacts | |
ML = machine learning, DE = data engineering, MLOps = devops for ML, DG = data governance
Link to this document: https://t.ly/RcA2Q.