MLMW: Machine Learning My Way



Textbooks, articles, reports and learning resources to navigate the concepts in statistical modelling and bridge them to skills in data management and model productisation.

This collection of notes grew out of questions about how to build a career in machine learning or data work, asked by people coming from different backgrounds. It is not a beginner guide, though: it assumes you have tried a few things and want a bigger picture of how data, models, computing technologies and business outcomes connect or separate.

The notes are structured in favour of a person who knows a bit about statistical models and wants to master data management and computing to make statistical models useful in a business context or at a bigger scale.

© Evgeny Pogrebnyak, 2024

e.pogrebnyak+mlmw@gmail.com

Last updated: December 9, 2024.

Version 0.8.4

Get the newest version:

v0.8.4 - December 9, 2024

  • The cloud computing guide (part 82)

v0.8.3 - December 8, 2024

v0.8.2 - October 17, 2023

  • New tentative TOC, MLMW and MDTP edited together.
  • Developer Voices podcast with Kris Jenkins
  • Feedback loop article, with short introduction to dynamic systems (ACM, 2023):

https://dl.acm.org/doi/fullHtml/10.1145/3617694.3623227

v0.8.0 (May 10, 2024):

  • The MLMW is now three resources: a short topic list (must request access), a public longread guide and the MLMW website. Check out https://trics.me/ for details.
  • Interviews moved from the longread to the website. Read the randomlyCoding interview on production pipelines, engineering skills and job roles.
  • Also at the website are the beginner track and probability section.
  • A few topics were merged in the MLMW longread and now it is exactly 100 topics; 23 are annotated with extra links and structure and 2 are notes sections.
  • MLMW additions are the 2024 State of AI report, Aubrey Clayton videos about E. T. Jaynes, the ITMO ML job role model and r/AllenDowney/ in the persona list.

Tentative structure:

Introduction

  • Statistical thinking and intuition
  • Models and control
  • Observation and experiment

Models and methods

  • Probability and statistics
  • Econometrics
  • Machine learning
  • Deep learning
  • Other useful models

Steps in analysis

  • Descriptive analysis
  • Research pipelines
  • Business-driven modelling

Tools for the individual statistical modeller

  • Notebooks, IDEs
  • BI and data visualisation
  • Libraries for statistical modelling
  • Building data pipelines

Data storage and operations

  • Data types and sources. Tabular data. Unstructured data.
  • SQL, dataframes and relational databases.
  • Stream vs batch processing.
  • Data warehousing (DWH).
  • ETL and orchestration (Airflow and similar)

Data governance

  • Data quality
  • Metadata and lineage
  • Data security and personal data

Building software

  • How computers work. Operating systems. Networks. Local vs remote machine.
  • From smaller to bigger systems. Frontend vs backend. System architecture. APIs.
  • Software Development Lifecycle (SDLC) and DevOps.
  • Projects and requirements. Products and features.
  • Waterfall vs iterative development methodologies.
  • Developer vs SWE vs CS areas. Team roles.
  • Technical debt and code quality.

Making things work for the business

  • Legacy enterprise systems vs data-driven systems.
  • From business hypothesis to a valuable project.
  • What companies usually do with data.
  • Job market, companies and company valuation.
  • Society benefits and regulation.

Part 1. Models and methods

Intuition and foundational concepts

Econometrics

Machine learning

Deep learning and neural networks

Artificial intelligence

Other modeling approaches

Bayes and causality

Interaction, feedback, networks and optimisation

Harder, dull or less obvious topics

Mathematical prerequisites

Textbook review

Part 2. Data types, sources and quality

Tabular data

Not just numbers: text, images and sound

Data in business and economic perspective

Part 3. From research design to model productisation

Steps in analysis

ML in production

Part 4. Software tools

Programming languages and statistical software

Databases and storage

Orchestration and data engineering tools

The cloud

Part 5. Business change and society impact

Technology companies as machine learning market players

Adoption in broad economy and society

Selected industry domains and use cases

Appendix

Interviews

More personal skills

Personas

Glossary

Other resources

Changelog – timeline of this document

Quotes and reader feedback

Part 1. Models and methods

Intuition and foundational concepts

  1. Probability and randomness. [annotated]
  • Probability as repeated events (Bernoulli) vs plausibility estimate (E. T. Jaynes).
  • Random variables and their distributions.
  • Sequence of events and conditional probability.
  • Joint distribution of random variables and marginal probability.
  • Axioms of probability and measure theory.
  • Generating random numbers practically (pseudorandom and seed).
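
The last bullet fits in a few lines: a pseudorandom generator is deterministic, so fixing the seed makes the whole "random" sequence reproducible. A minimal sketch with numpy:

```python
import numpy as np

# Same seed, same generator algorithm -> the exact same draws.
rng1 = np.random.default_rng(seed=42)
a = rng1.normal(size=3)

rng2 = np.random.default_rng(seed=42)  # re-seed ...
b = rng2.normal(size=3)                # ... and the draws repeat

print(np.allclose(a, b))  # True
```

This is why papers and notebooks state their seeds: it makes simulations and model training runs repeatable.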

See a listing of probability and mathematical statistics textbooks at the end of the chapter.

  1. Data generating process (DGP). Sample vs population. Learning from sample about the population (inference). [annotated]

In general, statistics relies on the notion that there is some discoverable law, the data generating process, that produces the data we collect and analyse.

We rarely know the true model of a DGP, but might reason about its functional form based on theory or prior knowledge or take the most simple functional form as a guess. Next we estimate the parameters in that functional form given the observation data points that we have.
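
A tiny simulation makes this concrete: invent a DGP (here y = 2 + 3x + noise, purely illustrative), draw a sample from it, and check that estimation recovers the parameters we pretended not to know:

```python
import numpy as np

# Hypothetical DGP: y = 2 + 3*x + noise. In practice these
# coefficients are unknown; here we set them to verify recovery.
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, n)

# OLS on the design matrix [1, x] via least squares
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta.round(2))  # close to [2., 3.]
```

With a large sample and a correctly specified functional form the estimates sit very close to the true parameters; misspecify the form and no amount of data fixes it.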

“Statistical pragmatism emphasizes the assumptions that connect statistical models with observed data.” Figures from Robert E. Kass (2012) Statistical Inference: The Big Picture.

  1. Inference (econometrics) and generalization (machine learning). [annotated]
  • Inference and parameter interpretation (statistics and econometrics).
  • Generalization and prediction (machine learning or statistical learning).
  • Change of behavior (causal inference and heterogeneous treatment).

Readings:

  1. Correlation, causality, common drift and spurious regressions.

Example: German Cheeses and a directed acyclic graph (DAG) in Determining causality in correlated time series.
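
A quick illustration of common drift with made-up data: two independent random walks often show a sizeable sample correlation in levels, which vanishes once the series are differenced:

```python
import numpy as np

# Two independent random walks share no causal link, yet their
# levels often correlate strongly: a spurious relationship
# driven by common drift, not by any shared DGP.
rng = np.random.default_rng(1)
a = np.cumsum(rng.normal(size=2000))
b = np.cumsum(rng.normal(size=2000))

r_levels = np.corrcoef(a, b)[0, 1]
r_diffs = np.corrcoef(np.diff(a), np.diff(b))[0, 1]
print(round(r_levels, 2), round(r_diffs, 2))
# differencing removes the drift: r_diffs is near zero
```

This is the classic spurious-regression trap with trending time series: always ask whether a correlation survives detrending or differencing.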

  1. Observation, experiment and experiment design.
  1. Measurement errors and missing data.
  1. Model performance and model evaluation. Modeling trade-offs (eg bias vs variance) and no free lunch.
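
The bias-variance trade-off in miniature, with illustrative numbers: fit polynomials of rising degree and compare in-sample against held-out error. Training error can only fall as complexity grows; held-out error need not:

```python
import numpy as np

# Toy data: a sine curve plus noise, split into train and test.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
x_tr, y_tr = x[:40], y[:40]
x_te, y_te = x[40:], y[40:]

errors = {}
for degree in (1, 3, 9):
    coef = np.polyfit(x_tr, y_tr, degree)
    err_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    errors[degree] = (err_tr, err_te)
    print(degree, round(err_tr, 3), round(err_te, 3))
# training error shrinks monotonically with degree (nested models);
# test error typically bottoms out at moderate complexity
```

Choosing complexity by training error alone therefore always favours overfitting; held-out evaluation is what keeps the comparison honest.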

Extra: What books or papers are must reads for every professional statistician? : r/statistics

Econometrics

Survey article: Undergraduate Econometrics Instruction: Through Our Classes, Darkly.

Modern introduction course: Mathematical Econometrics I by Roth and Hall

  1. Cross-section, time series, panel and spatial data. Single vs multivariate response variable.
  1. Linear regression and ordinary least squares (OLS).
  1. Violation of OLS assumptions.

        Note: the Peter Kennedy textbook (1998) is built around listing the violations; very clear to follow.

  1. Difference-in-Differences. Instrumental Variables. Regression Discontinuity.
  1. Time series. Seasonal adjustment, smoothing, filtering. [annotated]

Reference text: Hamilton.

See more textbooks in Econometrics Navigator: Time Series Section.

Extra: Forecasting: Principles and Practice by Hyndman and Athanasopoulos.

  1. Systems of equations. [annotated]

An important part of econometrics for imposing structure. The rise and fall of large macroeconometric models in the 1960s-1970s.

Estimation

  1. Methods of estimation.

OLS and extensions.

Maximum likelihood.

Bayesian estimation

MCMC.

  1. OLS extensions.

logit/probit

GMM

2- and 3-stage least squares

Quantile regressions

Lasso, ridge

  1. Machine learning textbooks [annotated]

ISLR/ISLP is a career starter. Bishop or Murphy are more advanced texts with more math; they are also older and not supplemented with code. Do not confuse ISLR/ISLP with Elements of Statistical Learning (ESL), a more mathematically rigorous book.

Acronym | Title | Authors | Latest edition
--- | --- | --- | ---
ISLP | An Introduction to Statistical Learning | Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani | 2023
PRML | Pattern Recognition and Machine Learning | Christopher Bishop | 2006
PML | Probabilistic Machine Learning | Kevin Patrick Murphy | 2012; follow-up books in 2022 and 2023

Reddit Quote: For the longest time the best books for a mathematical treatment of ML were Chris Bishop's "Pattern Recognition and Machine Learning" and Kevin Murphy's "Machine Learning: A Probabilistic Perspective". Both authors have written new and updated books, better adapted to the deep learning era. Bishop's new book is "Deep Learning: Foundations and Concepts". Murphy released two books: "Probabilistic Machine Learning: An Introduction" and "Probabilistic Machine Learning: Advanced Topics".

I would say that Murphy's two-tome book currently provides the most comprehensive and thorough treatment of probabilistic ML. The first chapters of the introductory book are basically mathematical preliminaries, so it's more accessible than before. Additionally, the most frequently used book for getting a strong mathematical foundation for ML is "Mathematics for Machine Learning" by Deisenroth et al.

CS229 Lecture Notes Fall 2022 by Andrew Ng. Very good structure.

For machine learning (not deep learning), I recommend the Andrew Ng lecture notes from Stanford's CS229 course. The reason I really like these notes is because you can find past problem sets that went along with them, and the problem sets are very good: difficult but not impossible, and close to a 50/50 mix of math and programming. I never feel like I've learned a topic just from reading about it, so having good problems to go along with the reading was very important to me (Reddit quote).

scikit-learn: machine learning in Python by Gael Varoquaux. Part of Scientific Python Lectures. “One document to learn numerics, science, and data with Python.”

Two books below are paywalled, but very practical.

Andreas Mueller and Sarah Guido (2016). Introduction to Machine Learning with Python.

HOML: Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd edition) by Aurélien Geron. Repo: https://github.com/ageron/handson-ml3/

  1. Statistical learning theory. Supervised vs unsupervised learning. [annotated]

Beginner: Chapter 8 “When Models Meet Data” from Mathematics for Machine Learning by Deisenroth et al (2020).

Advanced: John Shawe-Taylor (2023). Statistical Learning Theory for Modern Machine Learning (video, slides)

  1. Typical ML tasks.
  • Classification.
  • Prediction.
  • Clustering.
  • Dimensionality reduction.
  • Decision trees.
  • Support vector machines (SVM) and discriminant analysis.
  1. Ensembles and forecast combination. Choosing between models and AutoML.
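
The tasks above are typically one or two calls each in scikit-learn. A minimal supervised classification example on the bundled iris dataset (the model choice and parameters are illustrative, not a recommendation):

```python
# Train a small decision tree and evaluate it on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(round(acc, 2))  # typically well above 0.9 on iris
```

Swapping the estimator class (SVM, clustering, PCA) keeps the same fit/predict pattern, which is much of scikit-learn's appeal.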

Deep learning and neural networks

  1. How does a simple neural network like a perceptron work? How do more complex networks train and operate? [annotated]
  1. Deep learning textbooks [annotated]

Freely available online:

Paywalled:
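
Back to the first question of this section, how a simple perceptron works: a weighted sum passed through a step function, with weights nudged on every mistake. A self-contained numpy sketch on the AND function, which is linearly separable, so the algorithm converges:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # the AND function

w = np.zeros(2)
b = 0.0
for _ in range(10):  # a few passes over the data are enough here
    for xi, target in zip(X, y):
        pred = int(w @ xi + b > 0)   # step activation
        error = target - pred        # 0 if correct, +1/-1 if wrong
        w += error * xi              # perceptron update (learning rate 1)
        b += error

preds = [int(w @ xi + b > 0) for xi in X]
print(preds)  # [0, 0, 0, 1]
```

Deeper networks replace the step function with differentiable activations so that the same "nudge the weights" idea can be driven by gradients (backpropagation).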

  1. Neural network architectures

Feed-forward Neural Network

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

Generative Adversarial Network (GANs)

Transformers

  1. GPT models. Interaction, dialogue, prompt engineering. Retrieval-augmented generation (RAG) and model fine-tuning.
  1. One-shot, federated, transfer learning and other advances in deep learning.

Artificial intelligence

In short: not everything in AI is a neural network. The AI winter ended with backpropagation and the rise of computational power. No AGI (yet): a computer cannot “think”.

  1. State of AI Report: Artificial Intelligence Index Report 2024 [annotated]

A great summary by a Reddit user, beat that clarity: “AI good, getting gooder. Tech > academics. AI costs $$$, gonna cost $$$$$. US numba 1. Benchmarks are meh. GenAI is trending. AI regulations increase. People & science can and do benefit from AI. People have begun to pay attention to AI.”

  1. History and branches of AI. [annotated]

microsoft/AI-For-Beginners: 12 Weeks, 24 Lessons (contains lessons on the symbolic approach with Knowledge Representation and reasoning, Genetic Algorithms and Multi-Agent Systems in addition to neural nets).

Annotated History of Modern AI and Deep Learning by Jürgen Schmidhuber.

Textbook: Artificial Intelligence: A Modern Approach by Russell and Norvig (a bit old).

  1. Artificial general intelligence (AGI). [annotated]

Economist: How to define artificial general intelligence.

Other modeling approaches

For modelling topic classification see JEL Classification System: Mathematical and Quantitative Methods. Compare with AMS Classification for subjects in mathematics.

Bayes and causality

Departing from frequentism: Bayesian modelling and causality.

  1. Bayes theorem and Bayesian modeling. Probabilistic programming.

Probability and Bayesian Modeling (2020)
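
Bayes' theorem in one arithmetic example (the numbers are invented for illustration): even a fairly accurate test applied to a rare condition yields mostly false positives.

```python
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
prior = 0.01          # P(disease): the condition is rare
sensitivity = 0.95    # P(positive | disease)
false_pos = 0.05      # P(positive | no disease)

# Total probability of a positive test result:
p_pos = sensitivity * prior + false_pos * (1 - prior)
posterior = sensitivity * prior / p_pos
print(round(posterior, 3))  # 0.161
```

So a positive result moves the probability from 1% to about 16%: far from certainty. Probabilistic programming languages automate exactly this prior-to-posterior update for models too complex to solve by hand.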

  1. Causality and do-notation. [annotated]

         Book of Why by Judea Pearl.

Causal Inference The Mixtape by Scott Cunningham.

Causal ML Book by Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, Vasilis Syrgkanis (2024)

Software: PyWhy, EconML 

Interaction, feedback, networks and optimisation

  1. Operations research and statistical decision theory.
  1. Agents. Reinforcement learning. Game theory. Auction design.
  1. Systems with feedback and system dynamics (SD). Control theory.
  1. Graphs and networks. Knowledge graphs.
  1. Optimisation models and solvers. [annotated]

Linear programming (LP). PuLP package.

Convex Optimization textbook by Boyd and Vandenberghe.

Several book suggestions here in a Reddit post.

Harder, dull or less obvious topics

  1. Combinatorics.
  1. Random variable distributions and their families.
  1. Point estimation. Confidence intervals. Hypothesis testing.

Informal review on hypothesis testing in Notes for Nonparametric Statistics.

  1. Convergence and central limit theorems. Asymptotics.
  1. Sampling methods and techniques.

Sampling techniques: cross-validation, bootstrap, jackknife.

Course (advanced): Keisuke Hirano and Jack Porter (2022). Modern Sampling Methods: Design and Inference.
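
The bootstrap fits in one loop: resample the data with replacement and take the spread of the statistic across resamples as its standard error. A sketch with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=200)  # some sample

# Resample with replacement, recompute the mean each time.
boot_means = [rng.choice(data, size=data.size, replace=True).mean()
              for _ in range(2000)]

se_boot = np.std(boot_means)                      # bootstrap SE
se_classic = data.std(ddof=1) / np.sqrt(data.size)  # textbook SE
print(round(se_boot, 3), round(se_classic, 3))
# the two estimates agree closely for the sample mean
```

For the mean the bootstrap just reproduces the textbook formula; its value shows for statistics (medians, quantiles, model coefficients) where no closed-form standard error exists.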

  1. Non-parametric methods. [annotated]

Introduction to Nonparametric Statistics by Bodhisattva Sen

All of Nonparametric Statistics by Larry Wasserman

Reddit quotes:

  1. Differentiation and differential equations.
  1. Random processes.
  1. Information theory. Entropy and cross-entropy.
  1. Boolean vs fuzzy logic. Qubit and quantum computing.
  1. Knowledge representation and ontologies.
  1. Probability as part of measure theory.

Reddit quote: What is necessary however, is to understand measure-theoretic probability. If you have a solid foundation in measure theory, that should be quite straightforward. You will see that only a subset of results/theorems from measure theory make frequent appearances. These include the Fubini/Tonelli theorem, absolute continuity, the Radon-Nikodym derivative, the Borel-Cantelli lemma etc. and you can always refer back to your measure theory book for things.

For probabilistic/statistical machine learning, you almost always assume probability measures are dominated by the Lebesgue measure on the underlying Euclidean space and work directly with pdfs. The only area of ML theory that I know of that is measure-theory heavy is PAC-Bayes / advanced statistical learning theory. 

Mathematical prerequisites

Reddit user: “Eventually, after years of trying to get in through various "shortcuts" I realised that I actually have to learn maths and statistics like all the other guys”.

  • From d2l preface: Linear Analysis by Bollobás (1999) covers linear algebra and functional analysis in great depth. All of Statistics (Wasserman, 2013) provides a marvelous introduction to statistics. Joe Blitzstein’s books and courses on probability and inference are pedagogical gems.
  • Mathematics for Machine Learning (MML) by Marc Peter Deisenroth, A. Aldo Faisal and Cheng Soon Ong (2020) - highly recommended. Part 1 is math and part 2 is math application to classic problems of regression, dimensionality reduction, density estimation and classification. Part 2 is focused just on these four problems, which gives a feeling of completeness and achievement. Chapter 8 “When Models Meet Data” is a great introduction to statistical learning theory.
  1. Calculus. [annotated]

What is Calculus I, II and III in the US: Naming of calculus courses.

See Econometrics Navigator – Mathematic preliminaries – Calculus

Real analysis introduction by Hunter (advanced).  

  1. Linear algebra. [annotated]

See Econometrics Navigator – Mathematical preliminaries – Linear Algebra

  1. Probability and mathematical statistics. [annotated]
  • Using Julia for Introductory Statistics by John Verzani – you can dismiss the fact that the book uses Julia; it is a very thoughtful text about probability and statistics in general.

An intermediate probability course must include:

  1. Probability spaces (axiomatic development of sigma algebras and Probability measure, besides the usual topics)
  2. Random variables (as particular cases of measurable functions, probability distribution and density functions with their properties, and the usual intro to common RVs)
  3. Random vectors (similar to R variables and criteria of independence, conditional densities)
  4. Distribution of R variables and order statistics
  5. Moments of RVs (characteristic function and its properties are very important)
  6. (most important) Asymptotic theory (convergence in mean squared, in probability and in distribution, and related theorems (Markov, Tchebychev, Slutsky, Helly-Bray, Lèvy, etc.))

Now, Wackerly offers a very good introduction to topic 1 (without the axiomatic development of sigma algebras and probability spaces), topics 2 and 3 (without explaining measurability of RVs). It is very clear explaining random vectors. Topic 4 is very well covered, as is topic 5 (without the characteristic function). It does not touch topic 6, which is essential for inference. Its main advantages are the examples, the clarity of the explanations and the visual organization which makes it very easy to read.

I would suggest you read Wackerly first (till chapter 6) and then read about asymptotics in Hogg's Intro to Math Stats (8th ed.) or Roussas' A Course in Math Stats (2nd ed) or Mittelhammer's Math. Stats. for Economics and Business (2nd ed.).

That's the probability you are going to need if you want to study statistical inference at an intermediate level. Wackerly is rather elementary in this regard, but can be an excellent introduction to inference if you are new to the field.

Textbook review

  1. Textbook review [annotated]

More beginner-friendly vs more advanced texts, with focus on open source texts (o) and code examples (c) or Jupyter notebooks (j). Top picks marked with (*).

Subject Area | Beginner-friendly (undergraduate) | Advanced (graduate) | Reference or other texts
--- | --- | --- | ---
Math prerequisites | Deisenroth* (o, 2020) | |
Probability | Blitzstein and Hwang* | | Wentzel (1982)
Statistics | Wasserman* (2013) | Casella/Berger |
Econometrics | Kennedy* | Stock/Watson | Green
Time series | | | Hamilton
Machine Learning | ISLR/ISLP* (o,c) | PRML by Bishop (o); PML by Murphy (o) | ESL (o); Vapnik (1998)
Deep Learning | Andrew Ng* | UDL (o,j); d2l (o,j); DLB (o) |
Artificial Intelligence | | | Russell/Norvig (2020)

Part 2. Data types, sources and quality

Tabular data

  1. Table in a dataframe and in a relational database. Data types and table schema. Data serialization.
  1. Textbook datasets. Kaggle and similar datasets. Official statistics, data search, open data.
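
The first topic can be shown side by side: the same table as a pandas dataframe and as a relational table in an in-memory SQLite database (the table and column names here are invented for illustration). Column dtypes play the role of the table schema:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [3.5, 19.0]})

con = sqlite3.connect(":memory:")
df.to_sql("weather", con, index=False)  # dataframe -> relational table
back = pd.read_sql("SELECT * FROM weather WHERE temp_c > 10", con)
print(back)  # one row: Lima
```

Dataframes are in-memory and column-oriented for analysis; the database adds persistence, a declared schema and SQL querying; moving data between the two is one function call in either direction.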

Not just numbers: text, images and sound

  1. Text as vector. Natural language processing. Deep learning in NLP (ChatGPT). [annotated]

Textbooks

Jurafsky:

Manning et al (2008) Introduction to Information Retrieval.

Lewis Tunstall, Leandro von Werra, Thomas Wolf (2022). Natural Language Processing with Transformers.

Courses: Lena Voita NLP Course “For You”

Extra: Linguistic fundamentals for natural language processing: 100 essentials from morphology and syntax by Emily M. Bender (2013, paywalled, see TOC)

A few articles (via Ilya Gusev):

  • Word2Vec: Mikolov et al., Efficient Estimation of Word Representations in Vector Space https://arxiv.org/pdf/1301.3781.pdf
  • FastText: Bojanowski et al., Enriching Word Vectors with Subword Information https://arxiv.org/pdf/1607.04606.pdf
  • Attention: Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
  • Transformers: Vaswani et al., Attention Is All You Need https://arxiv.org/abs/1706.03762
  • BERT: Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.0480

Libraries:

  • huggingface
  • spacy
  • nltk
  1. Images and video. Color representation. Computer vision.
  1. Sound, noise, music. Waves.
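
"Text as vector" in its simplest form is the bag-of-words representation: each document becomes a vector of counts over a shared vocabulary. A dependency-free sketch:

```python
from collections import Counter

docs = ["the cat sat", "the cat and the dog"]

# Shared vocabulary across all documents, in a fixed order.
vocab = sorted({word for doc in docs for word in doc.split()})
# One count vector per document, aligned with the vocabulary.
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)    # ['and', 'cat', 'dog', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Word2Vec, FastText and transformer embeddings in the articles above replace these sparse count vectors with dense learned ones, but the starting idea is the same: text mapped into a vector space.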

Data in business and economic perspective

  1. Big vs small data. [annotated]

Article: Hal Varian (2014). Big Data: New Tricks for Econometrics

Report: Matthew Harding and Jonathan Hersh (2018). Big Data in Economics.

Course: Melissa Dell and Matthew Harding (2023). Machine Learning and Big Data.

  1. Data quality and limitations (incomplete, not granular, not relevant, not yours). Data governance frameworks (DMBOK).

Data Governance Is A Top Priority For 65% Of Data Leaders (Gartner via Atlan).

Part 3. From research design to model productisation

Steps in analysis

  1. Descriptive statistics. [annotated]

John Tukey and the Origins of EDA

Remembrances of Things: EDA (article about John W. Tukey work).

The 4 R’s in EDA (from A Course in Exploratory Data Analysis)

  1. Data visualization. Dashboards. [annotated]

        Economist: Mistakes, We Have Drawn a Few.

        Grammar of Graphics (ggplot)

  1. Business hypothesis and ways to test it. Business outcomes.
  • Interacting with business or retail customer
  • Controlling own system or business processes.
  • Supply chain, sourcing, inputs control.
  • Reporting, regulation, compliance, audit.

Advanced analytics and dashboards loop vs automation of control loop.

  1. Analysis as a DAG.

Quote from drivendata: cookiecutter as a starter (but not universal) project template.

  1. Reproducible research and reproducibility crisis.

ML in production

  1. Making business change with data [notes]

---

A - Identify the business case or hypothesis where you can earn more or save some money or make other improvements.

B - Create the adequate model:

Make a simplified representation of the value chain, business process or client interaction that you are changing (the model).

State the points of control where you do something and how these points of control affect the whole system under control (control points).

State the points of data collection where you gather the data and how they help shape the decision and judge its success (data points).

C - State what you are about to do (proposed change):

Why (motivation), what/how/where (proposal) and to what end (success criteria)

Estimate if it is worth it.

Lay down the most simple way to test the proposed change.

D - Prove the change is worth it with experiments  

E - Next:

Scale if experiments show it is profitable or useful.

Continue the search – update the model, proposed change or the business hypothesis.

Give up and move to the next thing (with lesson drawing or without).

F - Be able to iterate on this chart faster, cheaper and with better outcomes.

---

  1. Responsibilities of a data engineer, data scientist or modeler, machine learning engineer, business analyst and other roles. [notes]

        Perfect world (“bam” is a sound like “bang”, popularized by StatQuest videos: an “aha” moment).

Idea what to improve -> Good data -> Known model -> Quick inference -> Bam to production -> Bam cool and unambiguous effect -> Bam business liked it

        Practically perfect world:

Many ideas -> Business and modelling hypothesis -> Expected result -> Where to deploy? -> Is there a model for this? Which do we pick? -> Is there data for this? -> Can we estimate/train the model? -> Does it seem to work? -> Can we deploy? -> How is it doing in production? -> Can we improve? -> Is business happy? -> How long would the result persist?

        Pick roles: who is in charge of what.

  • A full-stack “data scientist”
  • A modeller
  • Data engineer / Data architect
  • Machine learning engineer / ML platform engineer
  • Software engineers
  • Research scientist        
  • Business analyst
  • Business lead or product manager
  • Vendor/consultant/ChatGPT

        When does the data pipeline become a “product”?

  1. any data pipeline that worked and delivered the business result, however trivial from modeling viewpoint;
  2. a complex system that is not just model+prediction (frontend, hardware, business rule change, etc);
  3. anything that a business or end user wants and has a bit of data or intelligence in it;
  4. something you can sell as a solution in a specific industry.

        What can go wrong in a data pipeline? How are companies different with respect to data and machine learning? What are the most aggressive promises about AI in a specific industry? Why is this not happening yet?

        Broader perspective:

  • system and actors vs control system and desired outcomes
  • customer/asset life cycle, person-centricity
  • business intelligence, business value and margins
  • silos, vested interest, delegation of control/responsibility
  • change of business models and company boundary
  • immediate ROI, long-term sustainability, business valuation
  • data as representation of objects, processes, behaviors
  1. ML pipelines and roles as told by the companies and experts. [notes]

  1. Machine learning pipelines and MLOps.

https://github.com/EthicalML/awesome-production-machine-learning

  1. Model life cycle, model drift vs data drift.

Part 4. Software tools

Programming languages and statistical software

  1. Programming languages (R, Python, Julia) and machine learning libraries.

R for statistical packages and classic statistics.

Python for statistics (statsmodels), machine learning (scikit-learn) and deep learning (PyTorch, TensorFlow, Keras).

Differentiation and composability in Julia and JAX.

Exercise: Compare a tabular dataframe implementation in R, Python (pandas or polars) and Julia.        

  1. Statistical software, proprietary vs open source, and documentation.

Proprietary (SAS, Stata, MATLAB) vs open source (R, Julia) statistics tools.

Software documentation as learning tools (they are great to read even if you are not using the package):

  1. Notebooks vs plain files and packages. Version control. Refactoring and cleaner code.

        The Missing Semester of Your CS Education (ignore metaprogramming chapter).

  1. Extra topic: open source viability and funding models

VC-driven: streamlit

Sponsored: PyWhy

Burnout: curl

Databases and storage

  1. Disk: HDD, SSD and cloud (S3) storage. Disk costs and time to access. File systems (HDFS).
  1. Database management systems (DBMS). Relational databases and SQL.

Other types of databases and NoSQL (key-value, graph, column, vector, time series).

Processing large data in parallel: MapReduce, Hadoop (HDFS+Yarn+MapReduce), Spark.
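
The MapReduce pattern fits in a few lines of plain Python: map emits (key, value) pairs, shuffle groups them by key, reduce aggregates each group. Systems such as Hadoop and Spark distribute exactly these phases across machines. A word-count sketch:

```python
from functools import reduce
from itertools import groupby

lines = ["to be or not to be", "to be is to do"]

# map: emit one (word, 1) pair per occurrence
mapped = [(word, 1) for line in lines for word in line.split()]
# shuffle: bring equal keys together (sort, then group)
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])
# reduce: sum the values within each key's group
counts = {key: reduce(lambda acc, kv: acc + kv[1], group, 0)
          for key, group in shuffled}

print(counts["to"], counts["be"])  # 4 3
```

The point of the pattern is that map and reduce are independent per key, so each phase parallelises trivially; the expensive part in real clusters is the shuffle, which moves data across the network.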

Search databases (ElasticSearch, Splunk, Solr)

Database popularity:

  1. Data warehouses (DW) and DW architectures. Decoupling storage and compute.
  1. Cloud providers (AWS, GCP, Azure). New data solution providers (Snowflake, Databricks).
  1. Mergers and acquisitions, venture financing and forks:

Sun buys MySQL (2008), Oracle buys Sun (2010), EU and US antitrust approval.

SAP's acquisition of Sybase (2010).

Cloudera and Hortonworks merger (2019).

Valkey, a Redis fork after licence change (2024).

Exercise: find venture-funded database projects here and explain their valuations.

  1. More elements of database theory and implementation. Relational algebra and ER-diagrams. ACID, CAP, BASE. DDL and DML. Normalization and normal forms. Hashing and B-trees. OLAP and OLTP.

Extra video: The ancient art of data management (2023) from DuckDB co-founder.

Extra reading: Lecture notes on database engineering (VSSUT, India).

Data engineering tools

  1. Data engineering lifecycle from https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/

See also Exploring the Modern Data Warehouse by Microsoft Learn.

  1. Workflow and orchestration tools (MLFlow, Airflow, Prefect, Luigi, etc).

What the data stack is:

Here is a reddit post and a link to docs.

  1. More DE sources:

Andriy Burkov MLOps book.

Data Engineering Zoomcamp.

See MAD@firstmark.com company landscape.

See my notes Comments on the weekly plan for DE job interview (pipeline2insights).

The cloud

… is not just someone else’s computer.

  1. Knowing the cloud is good for the job:

One thing I would say is usually a *must* is familiarity with Linux and a cloud provider (AWS, GCP, Azure). You don't need to know all 3 cloud providers (pick AWS if you don't know any yet - it has 50% market share) but if you don't know any of them it'll be harder to on board you and your first few weeks would be a lot more overwhelming – even knowing a different one to the one you use at a specific job will help as they all have similar functionality.

  1. A business user’s guide to the cloud

Client-server ideas as a starting point:

- your own machine is not the most powerful computer; a more powerful one would be too costly

- you can connect to a remote machine as admin (eg via ssh)

- users can connect to the server for their own workloads (eg API calls)

- a lot of compute work can happen at the server, and data is also stored there

- historically, software started as a monolith and then became more modular

Cloud mindmap:

- the server is virtualised, no longer a single bare-metal machine (resource efficiency)

- there are various data storage solutions (storage)

- an application can spin off a container (containerisation)

- containers can be managed in a smart way (orchestration)

- there can be a hosted environment for applications (PaaS)

- an application can be very small (lambda, "serverless")

- but there is still network latency in the middle (zones, CDN, etc)

On-premise vs cloud criteria:

- hardware costs

- maintenance costs

- security

- speed

- flexibility

- ownership

Economics of cloud computing:

- virtualisation (hypervisor, virtual machine) achieves higher hardware utilisation

- decoupling of storage (S3) and compute (EC2)

- load switching and containerisation (Kubernetes, OpenShift)

- network costs

- data center economics

Cloud providers:

- AWS/GCP/Azure and others, especially outside the US

- software, platform or infrastructure (SaaS/PaaS/IaaS)

- typical products and product overlap

- vendor lock and switching costs

- competition issues (EU investigations)

Also happening:

- "managed" and "hybrid" clouds emerged

- there are more lower-level and higher-level services

- providers offer a myriad of services (and terrible APIs)

- you cannot easily switch a cloud provider

- "you grow out of cloud"

- cloud seeks extra value streams

Data centers and energy efficiency:

- hyperscalers vs smaller providers

Deals:

- Broadcom buying VMware for $68 billion (2023)

- Microsoft settles EU cloud complaint for $22M (2024)

Links:

- https://news.microsoft.com/download/archived/presskits/cloud/docs/The-Economics-of-the-Cloud.pdf

- https://www.networkworld.com/article/2516091/microsoft-settles-cloud-complaint-for-22m-to-avoid-eu-antitrust-probe.html

- https://www.computerweekly.com/feature/Broadcoms-VMware-acquisition-explained-The-impact-on-your-IT-strategy

Part 5. Business change and society impact

Making money and making change.

Technology companies as machine learning market players

  1. Internet-scale data owners.
  1. Cloud providers.
  1. Companies that sell ML tools and solutions, their valuations and strategy.

        H2O, DataRobot.

  1. Hardware providers (NVIDIA).

Adoption in broad economy and society

  1. Fairness, biases, equity, human loop.
  1. Economics, cost and payoffs of applying ML. Business value of ML.
  1. Job market: in-house data modeller, consultant or a vendor?
  1. Who’s got more data? Data privacy and data protection. Markets and pricing of data.
  1. What gets to be regulated. Does a national AI or data strategy make sense?
  1. Why the hype: what makes the corporation play hype and overpromise? Why do investors buy that?

Selected industry domains and use cases

  1. Social sciences (sociology, political science, psychology, anthropology).
  1. Clinical trials.
  1. Industrial processes (discrete or continuous). Quality control and dependability.
  1. Recommender systems (RecSys).
  1. ML in finance

See Halperin textbook.

Appendix

Interviews

randomlyCoding: On production pipelines, engineering skills and job roles.

More personal skills

  1. Common sense, logic and critical reasoning.
  1. Writing well, explaining, inquiring, communicating.

On Writing Well: An Informal Guide to Writing Nonfiction by William Zinsser. See also his American Scholar essay on writing On Writing Well and keeping it up to date for 35 years.

Personas

  1. People that strike me as great thinkers and educators who make complex things easy to follow for the rest of us – through courses, books and personal interaction:

You may also be surprised that a textbook author or a professor can be reachable on Twitter/X or other social media:

  1. Hall of shame: story of Siraj Raval (plagiarism in education).

Glossary

  1. Common terms, professional slang and buzzwords.

Common terms:

  • Supervised vs unsupervised vs semisupervised learning.
  • Structured vs unstructured data.
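
A minimal sketch of the supervised vs unsupervised split, with made-up one-dimensional data: the supervised part learns from labelled pairs, the unsupervised part sees only the inputs.

```python
# Labelled points (x, y): supervised learning uses both.
labelled = [(1.0, "low"), (1.1, "low"), (5.0, "high"), (5.2, "high")]

def predict(x):
    # Supervised sketch: 1-nearest-neighbour on the labelled data.
    return min(labelled, key=lambda p: abs(p[0] - x))[1]

# Unsupervised sketch: only the inputs, split by a naive threshold.
xs = [x for x, _ in labelled]
threshold = (min(xs) + max(xs)) / 2
clusters = [int(x > threshold) for x in xs]

print(predict(4.8))  # high
print(clusters)      # [0, 0, 1, 1]
```

Semisupervised learning sits in between: a few labelled points plus many unlabelled ones.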

Professional slang:

  • Feature engineering (variable selection and transform).
  • ETL vs ELT (data ingestion)
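
The ETL vs ELT distinction can be sketched in a few lines; the records and the transform below are illustrative. The difference is only where the transform happens: before loading into the target, or inside it afterwards.

```python
# Raw records as extracted from a source system; one is malformed.
raw = [{"price": "10.5"}, {"price": "3.2"}, {"price": "n/a"}]

def transform(rows):
    # Parse prices, dropping rows that fail to parse.
    out = []
    for r in rows:
        try:
            out.append({"price": float(r["price"])})
        except ValueError:
            pass
    return out

# ETL: Extract -> Transform -> Load; only clean data reaches the target.
etl_target = transform(raw)

# ELT: Extract -> Load raw as-is -> Transform later inside the target.
elt_target_raw = list(raw)        # raw copy lands first
elt_target = transform(elt_target_raw)

assert etl_target == elt_target == [{"price": 10.5}, {"price": 3.2}]
```

ELT keeps the raw data available for re-transformation, which is why it dominates in cheap-storage warehouses.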

Fading buzzwords:

  • Data mining
  • Big data

Other resources

  1. Courses, syllabuses and exercises
  1. Video series
  • 3Blue1Brown by Grant Sanderson. (Very high quality content!)
  • Machine Learning Street Talk (Suggested by a reader: “Sometimes a bit too dense for absolute beginners but really good. They list resources, papers, books.”)

Changelog – timeline of this document

v0.7.0 (April 29, 2024):

  • Excellent undergrad econometrics course Mathematical Econometrics I by Roth and Hall.
  • Updated econometrics vs machine learning section with papers and courses from Hal Varian and Susan Athey.
  • Classic probability textbooks added (Blitzstein and Hwang, Wackerly, Mood) along with modern free websites.
  • Links from Econometric Navigator for time series, calculus and linear algebra.
  • Total count is 173 links and 111 topics.

v0.6.2:

v0.6.1:

v0.6.0

General interest:

Niche:

v0.5.5:

  • Andrew Ng lecture notes on machine learning from a 2022 course. EP: Andrew Ng best known for a deep learning course, but the classic machine learning notes are very well structured (intermediate).

April 9, 2024

Added a 3x2 table with key topics on the title page. Also removed a few images.

April 6, 2024 (88 topics)

  • Finalised Databases and storage.
  • Edited SEMs and references from Paul Goldsmith-Pinkham
  • Marked for review DAG and ML project flow.
  • To add next new causal ML book.

March 31, 2024

The topic count is 77, now organised into textbook, data, project, adoption and cases sections. The list was well received in Reddit comments, but the post was removed by moderators with no specific reason stated.

March 29, 2024

A way to keep up with data modelling and sort out what you already know. So far it is a list of topics organised by section, perhaps somewhat upside down compared to a traditional textbook or course, but I hope you like the perspective. A few links added where most appropriate and where I remembered good material. There are open textbook and blog links at the Econometrics Navigator website, my previous work. 33 topics in the original post.

March 28, 2024

First published as a Reddit post.

Quotes and reader feedback

This is pretty neat. Congratulations on putting together such a great list!

Amazing, thanks man, also it would also be much better to provide resource lists as well, still pretty useful, thanks!

This list is gold, thanks :)

This list is pretty comprehensive. I would have a bit on MLOps side because most advanced practitioners of ML should have some amount of understanding of how models are productionized. Perhaps a few topics on model drift, data drift, understanding how experiments are set up etc can be beneficial. Overall looks pretty good and will probably even use this to brush up on my own skills.

Guide roadmap

Models and methods

Pipelines

Tools

Stats and econometrics

Descriptive analysis

Programming languages

Machine learning (ML)

Task design and outcomes

ML and DL libraries

Deep learning (DL)

ML in production

Databases, DE and MLOps

Other methods

Reproducible research

Infrastructure for ML

Model evaluation

Data

Players and impacts

Economics

Types of data

Technology companies

Cost of (not) doing ML

Sources and ownership

Non-tech business

Markets for data

Data quality and DG

The human user

Rationale for regulation

NLP, CV and Robotics

Society impacts

ML = machine learning, DE = data engineering, MLOps = devops for ML, DG = data governance

Link to this document: https://t.ly/RcA2Q.


EP. ML Topic Guide