Machine Learning From Scratch


This is the 2.0 version of this workshop, after I audited many ML courses and filtered for the best resources. All the prereqs (probability, linear algebra, vector calculus) will be covered as we go. Any books can be found on Library Genesis.


We will do a short-ish course taught by a core developer of the scikit-learn library, who will also teach us how to contribute. The slides are annotated with commentary if you press P while viewing (or click the speech bubble icon). The course also covers Keras and AutoML libraries.

Choose your own theory/algorithm course

I audited most of these courses; mix and match your own curriculum to go with COMS-W4995. If one topic doesn't make sense, try reading or watching it in another source.

CS4780 This is the best of all the intro theory/algorithm style ML courses because the prof actually explains the math notation and the motivation behind the theory. A small chalk drawing of a hyperplane vector on a board is much easier to understand than an entire hour of somebody talking over static, notation-filled slides in a Zoom lecture. It doesn't matter that the lectures are from 2018: they're almost the same as the 2021 version, and all these topics come back into fashion again, as in the recent paper Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.

Bio/Chem Machine Learning Maybe you want an entirely different perspective: the exact same algorithms and theory we are doing can also be learned in the life sciences. In fact, that lecture is probably one of the best introductions to ML/neural nets I've seen, and you could use that course instead of the ones here to learn these algorithms/theory. AlphaFold2 did not totally solve protein folding, and you can get this data yourself. There are countless other uses for ML, like drug design or reverse engineering eyesight.

CS480 This course focuses on neural network models, as they are the most hyped right now (though see the above paper on approximating neural networks with kernel machines). Some NN models have mind-boggling numbers of parameters, requiring large-scale distributed machine learning. CS4780 also covers much of the material here (AdaBoost, Perceptron, SVM, etc.), so if those topics don't make sense, watch them again in CS4780.

10-701 General graduate intro for students who aren't on the PhD track in machine learning and want to use it in other fields. Taught by a systems biology researcher (who studies algorithms used by nature) and Prof Xing, now president of MBZUAI (yes, you can get into MBZUAI). The lectures are worth watching because they discuss what works now, what doesn't anymore, and why there are so many classifiers; however, it's a grad course, so mathematical sophistication is assumed. Very good lectures on computational learning theory and probabilistic graphical models.

10-301/601 Undergrad/MSc version of 10-701. There are errors in the slides, so you have to watch the lectures, where he handwrites all the algorithms on a tablet. If you don't understand a topic, watch it again here for more clarity, like the PAC/learning theory lectures.

CS-433 EPFL's machine learning course, being taught right now (Dec 2021). It has more learning theory than most other intro courses, which is good; I didn't try any assignments, but the labs come with full solutions.

18.337J Try a research exploration into the crazy world of scientific ML, such as physics-informed neural networks, where you can drop in partial differential equations: the math models describing the rules of any system with spatial as well as time dependence, such as diffusion (heat transfer, population dynamics), finance, biology, sound waves, fluid dynamics, electrodynamics, conservation laws, relativity, and who knows how many more. These PDEs act as prior information encoding the physical laws of the system, so the model can learn 'data efficiently' from noisy samples instead of requiring millions of carefully prepared samples as in typical supervised learning. This course also covers large-scale machine learning library optimization, but aimed at high-performance computing labs.
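To make the "PDE as prior" idea concrete, here is a minimal sketch of a physics-informed loss in plain NumPy. All the names here (`model`, `physics_residual`, `loss`) are hypothetical, the "network" is stood in for by a cubic polynomial, and the physics prior is the 1-D steady heat equation u''(x) = 0, approximated with finite differences — a toy stand-in for the autodiff machinery a real PINN library would use.

```python
import numpy as np

def model(params, x):
    # cubic polynomial ansatz, a stand-in for a neural network
    return np.polyval(params, x)

def physics_residual(params, x, h=1e-3):
    # u''(x) via a central finite difference; zero when the model
    # satisfies the steady heat equation u''(x) = 0
    u = lambda z: model(params, z)
    return (u(x + h) - 2 * u(x) + u(x - h)) / h**2

def loss(params, x_data, y_data, x_collocation):
    # data term: ordinary supervised fit on the (noisy) samples
    data_term = np.mean((model(params, x_data) - y_data) ** 2)
    # physics term: the PDE acting as a prior/regularizer at
    # collocation points, no labels needed there
    physics_term = np.mean(physics_residual(params, x_collocation) ** 2)
    return data_term + physics_term
```

Minimizing this combined loss is what lets a small amount of noisy data go a long way: the physics term constrains the model everywhere, not just at the labeled points.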

Many More MIT now offers its 2020 Machine Learning course on Open Library with recorded blackboard lectures, and even the classic Andrew Ng machine learning course on Coursera is still worth doing: none of the fundamentals have changed.

Theoretic foundations of modern ML

My choice is to understand the core theory, so in this workshop, in parallel with COMS-W4995, I will go through the first 20 chapters of the (free) book Understanding Machine Learning: From Theory to Algorithms with lectures. It's still taught by the author at Waterloo in 2021 and in CMU's PhD-track intro course 10-715, and 10-701 covers some of its chapters, like Rademacher complexities, in its learning theory lectures. There's a solutions manual for the few exercises in the book, which will be an exercise in itself to try and understand.

Every chapter in the book has a corresponding lecture in the above courses, so I'll probably end up doing most of CS4780 and some of 10-701 too, but if you understand the book, you don't need the other courses.


Get a math crash course reference

Obtain the book All the Math You Missed: But Need to Know for Graduate School [ATMYM] and the book Mathematical Modeling and Applied Calculus [MMAC] to use as reference to look up topics as we come across them. 10-601 has a math resources pdf with links to other pdfs for more reference.

Install Software

For COMS-4995, either install conda locally or use Google Colab (free). See this brief tutorial for setting up Google Drive or how to directly import Kaggle datasets into Colab. You can do all the intro ML courses entirely on a phone or tablet using Colab if you had to.

Applied ML lecture 1

The coreq for this course is the Python Data Science Handbook, which you can quickly audit by going through the notebooks on GitHub, or just read the NumPy and Pandas documentation as those libraries come up; if you did the software workshop this will be trivial.

Watch or read the slides with notes, though what he says in lecture is often different, as the notes are from a previous semester and act as an outline. @50m he notes how ML differs from statistics: statisticians draw up a hypothesis to ask a question about the data, whereas we make predictions on unseen data with models filled with assumptions. This course doesn't cover large-scale machine learning tooling; the prof uses AWS as his personal computer, loading up an instance with 512GB of RAM to work on a large data subset instead of messing around with the massive complexity of distributed ML frameworks.

Reading: IMLP Ch 1, APM Ch 1-2 Both books can be found on libgen; reading APM 1.3 Terminology is helpful for understanding 'class' or 'predictors'. Chapter 2 (and the first lecture) introduces a linear and a quadratic (power) model, both covered in the first chapter of MMAC. The first is a linear function f(x) = mx + b (well, technically an affine function) whose parameters are m, the slope (rise/run), and b, the vertical y-intercept. Now you know what 'parameters' means in the context of modeling: in this case, the 2 things that define a line. The second is a quadratic f(x) = ax^2 + bx + c, built out of power-function terms of the form Cx^k where C and k are constants. Note that setting the parameter a to 0 in f(x) = ax^2 + bx + c gives you a linear function f(x) = bx + c, which is just mx + b again, so a linear function is a special case of a quadratic function, which itself is built from power-function terms, for our purposes of machine learning.
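The linear and quadratic models above can be fit in a couple of lines with NumPy; `np.polyfit` returns exactly the parameters we just named. The data here is made up (roughly y = 3x + 2 plus noise) to show that when the data is really linear, the fitted quadratic's leading coefficient a comes out near 0 — the linear special case.

```python
import numpy as np

# Synthetic data: roughly y = 3x + 2 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 0.5, size=x.shape)

# Linear model f(x) = mx + b: polyfit(deg=1) returns the parameters [m, b]
m, b = np.polyfit(x, y, deg=1)

# Quadratic model f(x) = ax^2 + bx + c: polyfit(deg=2) returns [a, b, c]
a, b2, c = np.polyfit(x, y, deg=2)

print(m, b)   # close to the true parameters 3 and 2
print(a)      # close to 0: the quadratic collapses to the linear case
```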

Let's read his book, IMLP. In the lecture he also said that What question(s) am I trying to answer? Do I think the data collected can answer that question? are fundamental. We are taught the fundamental NumPy data structure, the NumPy array, where nested brackets denote a second dimension as in np.array([[1,2,3],[4,5,6]]). For a vector, its dimension is its length. There's a mention here about efficient sparse matrix handling with SciPy, which is explained here: basically sparse matrices come up all the time, and multiplying matrices together is inefficient unless they are sparse, in which case you can speed it up with SciPy built-ins. We will learn what a matrix is in the linear algebra crash course we do. The matplotlib section is the subject of the entire next lecture.
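A quick sketch of both points: the nested brackets that make a 2-D array, and a SciPy sparse matrix that stores only the nonzero entries.

```python
import numpy as np
from scipy import sparse

# A 2-D array needs nested brackets: the outer list holds the rows
X = np.array([[1, 2, 3], [4, 5, 6]])
print(X.shape)  # (2, 3): 2 rows, 3 columns

# A mostly-zero matrix stored sparsely: only nonzero entries are kept
eye = np.eye(4)             # 4x4 identity: 12 of 16 entries are zero
S = sparse.csr_matrix(eye)  # compressed sparse row format
print(S.nnz)                # number of stored nonzeros: 4
```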

The first application (page 20) is very similar to the petal-length k-nearest neighbors example in Kevin Murphy's book, which we'll also read in CS4780. Terminology is clearly defined here. You don't have to read this whole chapter; you can use the Building your first model section as a guideline for the first assignment. That's what I normally would do: read it as I worked on something else.

Theory 1

Watching ML Theory lecture 1. Skip to 8m; so far this is a pretty excellent intro, and finally somebody supplied me with a concrete definition of inductive reasoning. @56:30 computational complexity plays a major role in ML, i.e. the runtimes of these training/prediction algorithms. The only prereqs are some basic asymptotics for algorithm analysis and familiarity with linear algebra and basic stats (like what a distribution is); he claims he will teach us the rest of the math in the lectures. Important to note: the training data is 'randomly generated', meaning that even if we knew the distribution, knowing one sample can't help us find out what another sample is, and we don't care at all about the probability distribution (which we haven't even covered yet here) the random data comes from because, as you'll see in the CS4780 lecture, it's impossible to know.

We get a very intuitive mathematical model of machine learning, largely statistics-free, that anyone could follow. The introduction of the book [UML] covers what we just saw in the lecture, i.e. bait shyness. Of course there is already a problem with his papaya learning example: in Thailand they will thinly shave a green unripe papaya to make it into a salad that is quite tasty, but our example learning function would label it as not tasty.

CS4780 1

Watching Supervised Learning Setup, which covers what we just learned with some more explanation; you can of course watch any of the other courses you want. @36:10 'curly X cartesian product curly Y' means every possible ordered pair (x, y) with x from set X and y from set Y, for example if X = {1, 2} and Y = {3} then X x Y = {(1, 3), (2, 3)}, so all his features x matched with a label y for each training sample. A feature vector is defined @44:50. At the end of the lecture we did the bag-of-words vector representation, which was the first assignment in the CS19 workshop.
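The Cartesian product from the lecture is a one-liner in Python with `itertools.product`, using the same small sets as the example above:

```python
from itertools import product

# X x Y: every ordered pair (x, y) with x from X and y from Y
X = {1, 2}
Y = {3}
pairs = set(product(X, Y))
print(pairs)  # {(1, 3), (2, 3)}
```

A labeled training set is a subset of exactly this product: each sample is one (features, label) pair drawn from X x Y.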

Math review

Sets & functions

Watch Review of sets and functions. It is the best visual explanation I can find, with exactly the amount of information we need for these courses. You can skip the parts of the lecture where the class works out exercises; for example, @18:49-27:20 is an exercise. A set is a data structure in math, that's all.

For future reference, the MMAC book covers single- and multivariable functions in the first two chapters in a way anybody can understand, and Terence Tao's Analysis I chapter 3, Set Theory, will teach you anything you want to know about the definition of functions, their images/inverses, and the operations on set data structures, though you don't have to do it all now; wait until we come across it in lectures. There's a lot of similarity to programming in Tao's book: Example 3.4.2, the image of a set, is exactly what map() in programming is: consume a data structure, apply a function to each one of its elements, and return a new data structure.
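The image-of-a-set/map() correspondence in one line of Python:

```python
# The image of A under f, exactly what map() does: apply f to each
# element and collect the results into a new data structure.
f = lambda x: x * x
A = [1, 2, 3]
image = list(map(f, A))
print(image)  # [1, 4, 9]
```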

Linear algebra summary

Recall polynomials from high school: they are a specification of a function. You have this notion of something that is dependent: y = f(x), meaning the output y depends on the input x. Polynomials are a way to understand the properties that function has, such as: for which x is f(x) = 0? What are the maximum or minimum values of f(x)? What is its graph if we use polynomials to specify it? Knowing these things tells you all you want to know about a function f(x).

In linear algebra, a matrix also encodes a function: you input a vector and it outputs a vector. This means a matrix can be abstracted to more than just a table of numbers; you can use linear algebra wherever there is a vector space, and that includes all kinds of different mathematics, so it's almost like the underlying theory of it all. You can factor a function encoded as a matrix using a technique called the SVD, which lets you analyze each matrix component. A matrix has four fundamental subspaces: a column space (every possible linear combination of the columns), a row space (every possible linear combination of the rows), a nullspace, and the nullspace of the transpose. Since a matrix is a function, the nullspace is every solution to f(vector-input) = 0. A transpose operation takes a row and makes it a column, or turns a column into a row, so the big idea is: transpose the rows to relate the row space to the column space; collect every input with f(vector-input) = 0 to find the nullspace; and take the nullspace of the transpose to find the inputs the columns send to zero. These four subspaces together tell you everything about a matrix, which means you know everything about the function it encodes.
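A tiny NumPy sketch of this: the SVD of a rank-deficient matrix reveals both its rank (the number of nonzero singular values) and its nullspace (the trailing rows of Vt). The matrix here is made up for illustration — its second row is twice the first, so it has rank 1.

```python
import numpy as np

# A rank-1 matrix: every row is a multiple of [1, 2]
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))
print(rank)  # 1: only one nonzero singular value

# The rows of Vt past the rank span the nullspace: A @ v = 0
v = Vt[rank]
print(np.allclose(A @ v, 0))  # True
```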

With these tools you can also do projection to find best approximations; you can do calculus, statistics, probability, optimization, and combinatorics/permutations; you can find linear maps between all these spaces and do linear transformations between dimensions, like reducing R^3 to R^2. Best of all, you can do all this with a programming language, and these tools extend into other subjects like quantum computing, where you learn about the discrete Fourier transform matrix: give it a vector of data as input and get out the patterns and frequencies of that data. If you know linear algebra you know an enormous space of mathematics, so it is the most critical subject for us in machine learning and where we should spend most of our time.
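Projection as "best approximation" is exactly what least squares computes: it finds the point in the column space of A closest to b. A minimal sketch with made-up numbers — fitting a line through three points that don't lie on one:

```python
import numpy as np

# Columns of A: a constant term and x -- the model y = c0 + c1 * x
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])  # the three observed y values

# lstsq projects b onto the column space of A and returns the
# coefficients of that projection (the best-fit intercept and slope)
coef, residuals, rank, s = np.linalg.lstsq(A, b, rcond=None)
projection = A @ coef  # the closest point to b inside the column space
print(coef)  # intercept 7/6, slope 1/2
```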

Ultimate linear algebra/probability crash course

We may as well do a read-through of the notes and short lectures (disable HTTPS blocking or 'blocked content' to see the YouTube embeds) by CMU prof Zico Kolter, designed for machine learning/data science students. At the same time, why don't we also do a single read-through of these probability notes by CMU prof Cosma Shalizi. The idea is to become familiar with the notation, not to learn everything there is about these two subjects up front when we're merely using them as tools for machine learning.

TODO, 1hr a day of notes