Machine Learning From Scratch


This is the 2.0 version of this workshop, after I audited many AI courses and filtered for the best resources. All the prerequisites (probability, linear algebra, vector calculus) will be covered as we go. Any books can be found on Library Genesis.


We will do a short-ish course taught by a core developer of the scikit-learn library, who will also teach us how to contribute. The slides are annotated with commentary if you press P while viewing (or click the speech bubble icon). The course also covers Keras and AutoML libraries.
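The whole course is built around scikit-learn's fit/predict estimator convention, which is worth internalizing before lecture one. A minimal sketch of that convention in pure Python, using a hypothetical nearest-centroid classifier (the class name and data are made up for illustration; it is not a real scikit-learn class):

```python
# Sketch of scikit-learn's fit/predict estimator convention using a
# hypothetical nearest-centroid classifier (illustrative, not from sklearn).
class NearestCentroidSketch:
    def fit(self, X, y):
        # Learn one centroid (per-feature mean) per class label.
        self.centroids_ = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            n = len(rows)
            self.centroids_[label] = [sum(col) / n for col in zip(*rows)]
        return self  # fit returns self, as scikit-learn estimators do

    def predict(self, X):
        # Assign each point to the class with the closest centroid.
        def dist2(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return [min(self.centroids_, key=lambda c: dist2(x, self.centroids_[c]))
                for x in X]

clf = NearestCentroidSketch().fit([[0, 0], [1, 1], [9, 9], [10, 10]], [0, 0, 1, 1])
print(clf.predict([[0.2, 0.1], [9.5, 9.5]]))  # → [0, 1]
```

Every real scikit-learn estimator (trees, SVMs, pipelines) exposes this same `fit`/`predict` surface, which is why the library's models are interchangeable in the course notebooks.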

Choose your own theory/algorithm course

I audited most of these courses; mix and match your own curriculum to go with COMS-W4995. If a topic like AdaBoost or the Perceptron doesn't make sense in one course, try watching it in another.

CS4780 This is the best of all the intro theory/algorithm-style ML courses, as the prof actually explains the math notation and the motivation behind the theory. A small chalkboard sketch of a hyperplane vector is much easier to understand than an hour of somebody talking over static, notation-filled slides in a Zoom lecture. It doesn't matter that the lectures are from 2018: they're almost the same as the 2021 version, and these topics keep coming back into fashion, as in the recent paper Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.

CS480 This course focuses on neural network models, as they are the most hyped right now (though see the above paper on approximating neural networks with kernel machines); some NN models have mind-boggling numbers of parameters, requiring large-scale distributed machine learning. CS4780 also covers much of the material here, like AdaBoost, the Perceptron, and SVMs, so if those topics don't make sense, watch them again in CS4780.

10-701 General graduate intro for students who aren't specializing in machine learning and want to use it in other fields like systems biology. Taught by a researcher into algorithms found in nature and by Prof. Xing, now president of MBZUAI (yes, you can get into MBZUAI). The lectures are worth watching, as they discuss what works now, what doesn't anymore, and why there are so many classifiers; however, it's a grad course, so mathematical sophistication is assumed. Very good lectures on computational learning theory and probabilistic graphical models.

10-301 Undergrad/MSc version of 10-701. There are errors in the slides, so you have to watch the lectures, where he handwrites all the algorithms on a tablet. If you don't understand a topic, watch it again here for more clarity, like the PAC learning theory lectures.

CS-433 EPFL's machine learning course, being taught right now (Dec 2021). It has more learning theory than most other intro courses, which is good; I didn't try any of the assignments, but the labs come with full solutions.

18.337J Try a research exploration into the crazy world of scientific ML, such as physics-informed neural networks, where you can drop in partial differential equations: the math models describing the rules of any system with spatial as well as time dependence, such as diffusion (heat transfer, population dynamics), finance, biology, sound waves, fluid dynamics, electrodynamics, conservation laws, relativity, and who knows how many more. These PDEs act as prior information encoding the physical laws of the system, so the model can learn 'data-efficiently' from noisy samples instead of requiring millions of carefully prepared samples as in typical supervised learning.
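The "PDE as prior" idea can be made concrete without any neural network: a candidate solution is scored by how badly it violates the governing equation, and a physics-informed loss penalizes exactly that residual. A minimal NumPy sketch (the function names and the finite-difference check are my own illustration, not from the course), using the 1-D heat equation u_t = u_xx:

```python
import numpy as np

# Sketch of the physics-informed idea: score a candidate solution by its
# PDE residual. Here the 1-D heat equation u_t = u_xx is the "prior", and
# the residual is estimated with central finite differences.
def heat_residual(u, x, t, h=1e-4):
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
    return u_t - u_xx  # zero everywhere for a true solution

# exact solves the heat equation; wrong ignores the time decay entirely.
exact = lambda x, t: np.exp(-np.pi**2 * t) * np.sin(np.pi * x)
wrong = lambda x, t: np.sin(np.pi * x)

x = np.linspace(0.1, 0.9, 5)
print(np.max(np.abs(heat_residual(exact, x, 0.1))))  # ~0 (physics satisfied)
print(np.max(np.abs(heat_residual(wrong, x, 0.1))))  # large (physics violated)
```

In an actual PINN, `u` would be a neural network and this residual (via automatic differentiation rather than finite differences) is added to the data-fit term of the training loss, which is what lets a few noisy samples suffice.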

This course also covers high-performance computing for large-scale ML.

Many More MIT now offers its 2020 Machine Learning course on its open library with recorded blackboard lectures, and even the classic Andrew Ng machine learning course on Coursera is still worth doing.

Theoretical Foundations of Modern Machine Learning

My choice of curriculum is to understand the core theory, so here I will go through the first 20 chapters of the (free) book Understanding Machine Learning: From Theory to Algorithms, with lectures. The book is still taught by the author at Waterloo in 2021 and in CMU's PhD-track intro course 10-715, and 10-701 covers some of its chapters, like Rademacher complexities. There's a solutions manual for the few exercises in the book, which will be an exercise in itself to try to understand.

Every chapter in the book has a corresponding lecture in the above courses so I'll probably end up doing all of CS4780 and some of 10-701.


Get a math crash course reference

Obtain the book All the Math You Missed But Need to Know for Graduate School and the book Mathematics for Machine Learning to use as references, though we will still build up the prerequisites from scratch as we go.

Install Software

For COMS-4995, either install conda locally or use Google Colab (free). See this brief tutorial for setting up Google Drive, or how to directly import Kaggle datasets into Colab. If you had to, you could do all the intro ML courses entirely on a phone or tablet using Colab.

Applied ML lecture 1

The coreq for this course is the Python Data Science Handbook, which you can quickly audit by going through the notebooks on GitHub, or just read the NumPy and pandas documentation as those libraries come up; if you did the software workshop, this will be trivial.
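A quick taste of why the NumPy chapters of the handbook matter: vectorized array expressions and boolean masks replace explicit Python loops. A minimal sketch (the array values are made up for illustration):

```python
import numpy as np

# Vectorization: operate on whole arrays instead of looping element by element.
heights_cm = np.array([170, 183, 165, 191])
weights_kg = np.array([65.0, 80.0, 58.0, 95.0])

# One broadcast expression computes BMI for every entry at once.
bmi = weights_kg / (heights_cm / 100) ** 2

# A boolean mask replaces a filtering loop.
tall = heights_cm > 180
print(bmi.round(1))      # BMI per person
print(heights_cm[tall])  # → [183 191]
```

The pandas chapters build the same habits on top of labeled tables (DataFrames), so the time spent on this style of code pays off throughout the course notebooks.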

Watch the lectures or read the slides with notes, though what he says in lecture is often different, as the notes are from a previous semester and act as an outline. @50m he notes how ML differs from statistics: statisticians draw up a hypothesis to ask a question about the data, whereas we make predictions on unseen data with models filled with assumptions. This course doesn't cover large-scale machine learning tooling, as the prof uses AWS as his personal computer, loading up an instance with 512GB of RAM to work on a large data subset instead of messing around with the massive complexity of distributed ML frameworks.
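The prediction-on-unseen-data point can be made concrete with a held-out test set: fit on one split, report error only on the other. A minimal NumPy sketch (the noisy-line dataset and the 30/10 split are made up for illustration):

```python
import numpy as np

# Judge a model by its error on data it never saw during fitting,
# not by how well it explains the training data.
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 40)
y = 3 * X + 1 + rng.normal(0, 0.1, size=X.shape)  # made-up noisy line

# Hold out the last 10 points as "unseen" data.
X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

# Fit a line on the training split only.
slope, intercept = np.polyfit(X_train, y_train, deg=1)

# Report error on the held-out split: this is the number ML cares about.
test_mse = np.mean((slope * X_test + intercept - y_test) ** 2)
print(round(float(test_mse), 4))  # small when the model generalizes
```

The course's cross-validation material generalizes exactly this pattern: many train/test splits instead of one, so the reported error doesn't depend on which points happened to be held out.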

Reading: IMLP Ch 1, APM Ch 1-2