AI in 2025
Intro
We will learn the 'full stack' of modern deep learning systems by building our own library from scratch. Neural networks come in different architectures, just as there are different computer hardware architectures (x86, RISC, ARM). The most popular architecture currently is the transformer, used for large language models (Grok-n, GPT-n).
Neel Nanda has helpfully produced a suggested curriculum for reverse engineering transformers. He even offers mentorship through the ML Alignment & Theory Scholars program or you can claim bounties for GeoHot's TinyGrad as a freelancer.
Limits of AI
Current popular AI is all foundation models like Grok-n, GPT-n, DALL-E, etc. Have you ever wondered: if we had infinite resources, infinite data, and perfect training algorithms with no errors (aka ideal models), could we use this type of model for everything, aka 'General AI'? Someone, with help from the Beijing Academy of AI, used category theory to model this scenario and see what is possible. The multimodal content here is interesting.
Curriculum
Most of the work is programming in (free) Google Colab notebooks in C++ and Python, though since we are building everything from scratch you could use the language of your choice. Someone wrote Llama.cpp, why not write your own too.
What we want to achieve
- 10-714 DL Algorithms & Implementation (CMU)
- Build transformers, RNNs, MLPs, everything from scratch
- Includes hardware acceleration
- Mechanistic Interpretability (Neel Nanda@Google DeepMind)
Research
This is a graduate course on trying to integrate everything (text, audio, video, actions) into a shared reasoning and representation system, so it's the final boss of the AI game. Multimodal models don't require the enormous pretraining that transformer-based models do, and thus can overcome data scarcity. This means we as pleb researchers can modify open source models like this finance multimodal foundation model.
- Multimodal Machine Learning (CMU)
- We have access to the 2023 lectures, but the recent lecture structure remains the same
How to get there
Intro to machine learning
Every school has an undergrad introduction to machine learning, and these courses don't really change from year to year since the underlying theory remains the same.
See CMU's latest Intro to Machine Learning (2025) and compare it with the chapters from this 2014 book on the theory of ML. K-nearest neighbors, perceptron (linear predictors), gradient descent, feature engineering, optimization (convex and non-convex), decision trees, PAC learning (probably approximately correct), loss functions, boosting, random forests, dimensionality reduction: it's all in that 2014 book because that is the mathematical model of machine learning. If you want the CMU undergrad version you can go back in time using any university's Panopto recordings (Summer 2022).
The graduate versions of introductory machine learning still use that 2014 book, so we may as well. Lucky for us the author of that book made a series of lectures for his course at Waterloo, and it's a very unusual course: he walks through all the math assuming little background.
- CS 485 Theory of ML (Waterloo)
Neural Networks
This is to understand the neural networks in the interpretability courses we do and the CMU course on building our own libraries.
- CS 479 Neural Networks (Waterloo)
- Survey of modern NN from the perspective of theoretical neuroscience
- We also take Building Micrograd by Andrej Karpathy
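To preview what we're building there: micrograd is a tiny scalar autograd engine. Here is a minimal sketch in that spirit (illustrative only, not Karpathy's exact code; his video builds it up step by step and adds many more operations):

```python
# A scalar that remembers how it was computed so gradients can flow back
# through the chain rule. Minimal sketch in the spirit of micrograd.

class Value:
    def __init__(self, data, _parents=()):
        self.data = data            # the scalar this node holds
        self.grad = 0.0             # d(final output)/d(this node), filled by backward()
        self._parents = _parents    # the Values this one was computed from
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(a+b)/da = 1
            other.grad += out.grad           # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # visit parents first (topological order), then apply the chain rule in reverse
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# y = x*x + 3x at x = 2, so dy/dx = 2x + 3 = 7
x = Value(2.0)
y = x * x + x * 3
y.backward()
print(y.data, x.grad)   # 10.0 7.0
```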
Math we need
These can all be done in parallel
Linear Algebra
- Coding the Matrix (Brown)
- All the recorded lectures are open to anyone
- Programmed w/Python and has a (nonfree) book
Created by Philip Klein, a name you will recognize if you take any algorithms course. Probability and the SVD are covered, which we'll need. The book is at least cheap ($35) and worth buying, or you can use Anna's Archive or whatever the latest Library Genesis domain is to get a PDF.
Neel Nanda, who provided the reverse engineering course, suggested we take Axler's LADR, and the new 4th edition is totally free (he even upgraded his cat in the about-the-author pic). If you look at the tables of contents for Brown's Coding the Matrix and LADR they match up almost perfectly, so we can take both at the same time and/or review LADR when needed.
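Since the SVD is singled out as something we'll need, here is what it looks like numerically (a minimal NumPy sketch; the matrix is an arbitrary example):

```python
import numpy as np

# The SVD we'll need later: any matrix factors as U @ diag(s) @ Vt.
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                                    # singular values, largest first
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: the factorization reconstructs A
```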
Calculus
We will take the world's shortest shortcut to learn basic calculus then fill in all the blanks with matrix calculus.
- Matrix Calculus (MIT)
- All the calculus we need generalized to higher dimensions
Statistics
A free workbook on probability/statistics that we can take with the probability content in Coding the Matrix and the stats content in Understanding Machine Learning.
Day 1 Neuroscience Models
Reminder: this is 'learn AI from scratch', thus here are neural models from scratch.
- What is a Neuron (26m) from CS479.
- Hodgkin-Huxley Neuron Model
@6:39 his example of an action potential, using a water bucket being filled and dumped, is exactly the same intuition as a potential function used for amortized analysis in a data structures course. A dynamic array, once filled up to a certain percentage, needs to perform the costly action of doubling itself to make more free space. You take the infrequent cost of doubling and average it over many reads/writes to come up with an individual cost for every array operation. There are some differential equations here, but he draws out what they mean; we don't have to know them, we just need to know this is a non-linear model and that everything to do with neurons is electrical. Hz (hertz) is spikes per second.
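That amortized-analysis intuition is easy to check numerically. A minimal sketch (my own illustration, not from the lecture): charge one unit per write plus the full copy cost whenever the array doubles, then average.

```python
def amortized_append_cost(n_appends):
    capacity, size, total_cost = 1, 0, 0
    for _ in range(n_appends):
        if size == capacity:     # the array is full: pay to copy every element over
            total_cost += size
            capacity *= 2
        total_cost += 1          # the ordinary cost of writing one element
        size += 1
    return total_cost / n_appends

print(amortized_append_cost(1_000_000))   # ~2.05: a small constant, not O(n) per append
```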
You can access all these notebooks he's using for the demo here if interested.
- Simpler Neuron Models (28m) from CS479.
I don't have a physics background either, but he explains the dynamics of a membrane well enough that we can get the gist of what's going on. The difference between this first model and the previous Hodgkin-Huxley model is that we are no longer modeling the spikes themselves, only the potential needed for a spike to occur and a reset, which reduces complexity. @18:44 even simpler models. @22:20 ReLU and Softmax are introduced. We will later have to write our own Softmax.
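Since we'll have to write our own Softmax later, here's a minimal NumPy sketch of ReLU and a numerically stable softmax (illustrative, not the course's implementation):

```python
import numpy as np

def relu(x):
    # max(0, x) elementwise: negative inputs are clipped to zero
    return np.maximum(0.0, x)

def softmax(x):
    # subtract the max before exponentiating so large inputs don't overflow
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, -3.0])
print(relu(scores))      # [2. 1. 0.]
print(softmax(scores))   # probabilities that sum to 1, largest weight on 2.0
```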
Day 2 Calculus
Let's look over the lecture notes of MIT's 18.01. We are skipping most of this course because you will always use software to calculate a derivative/integral; we just need the absolute basics to begin learning matrix calculus.
Derivatives
Reading lecture 1. If you want worked-through examples then watch this from Clemson's MATH 1060, but this will all be properly explained in higher dimensions when we do calculus in the realm of linear algebra.
The answer to 'what is a tangent line exactly' is: at that point P on the curve, super zoom in until its neighboring points appear to lie on a straight line. The slope of the tiny straight line between P and P + 0.001 is the tangent. Then imagine doing this for the entire function, creating many tiny little straight lines for every point in the function, which is a linear approximation of that curve/function.
The geometric definition: pretend 'delta-x' \(\Delta x\) is a tiny displacement dx = 0.001, then imagine the distance from P to P + dx shrinking as dx approaches zero. Here is a graph (desmos.com) showing what is going on. Notice on that graph that in order to go from (2, 4) to (2.001, 4.004) you move right along the x-axis by 0.001 and then up 0.004 to meet the graph again. This is what the derivative tells us: if a function is perturbed by some tiny extra input, f(x + dx), how sensitive is the function when we analyze the change in output, i.e. the rate of change.
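That 'perturb the input, watch the output' idea is only a couple of lines of code (a minimal sketch reusing the f(x) = x^2 point from the desmos example):

```python
def sensitivity(f, x, dx=0.001):
    # rise over run across the tiny straight line from x to x + dx
    return (f(x + dx) - f(x)) / dx

f = lambda x: x**2
print(f(2.001) - f(2.0))          # ~0.004001: the 'up 0.004' step on the graph
print(sensitivity(f, 2.0))        # ~4.001, close to the true derivative 2x = 4
print(sensitivity(f, 2.0, 1e-8))  # ~4.0 as dx shrinks toward zero
```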
Example 1 of the MIT notes for f(x) = 1/x
- Plug into derivative equation
- Extract out 1/delta-x
- Combine the fractions using a/b - c/d = (ad - bc)/(bd)
- Cancel the extracted delta-x leaving -1
- Take limit to zero of delta-x
- Result: -1/x^2
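Written out symbolically (my reconstruction of the steps above):

\[
\frac{d}{dx}\,\frac{1}{x}
= \lim_{\Delta x \to 0} \frac{1}{\Delta x}\left(\frac{1}{x+\Delta x} - \frac{1}{x}\right)
= \lim_{\Delta x \to 0} \frac{1}{\Delta x}\cdot\frac{x - (x+\Delta x)}{x(x+\Delta x)}
= \lim_{\Delta x \to 0} \frac{-1}{x(x+\Delta x)}
= -\frac{1}{x^2}
\]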
Why is it negative? Look at the graph: if 0.001 is added to the x input then the y output has to drop down to meet the graph of f(x) = 1/x again, so the rate of change is negative.
Finding the tangent line: the equation for a line is y = mx + b and they have merely substituted in values, but we will never use this, so feel free to skip it and the area computation after.
Try inputs to f(x) = x^2 to understand Big-O
- (2 + 0.001)^2 = 4.004001
- (3 + 0.001)^2 = 9.006001
- (x + dx)^2 = y + dy + 0.000001
- (x + dx)^2 = y + dy + O((dx)^2)
If dx is approaching zero then (dx)^2 will reach zero before dx and can be discarded.
From those inputs to f(x) = x^2, notice that dy = 2x·dx + (dx)^2 (at x = 2 that's 0.004 + 0.000001), and if we ignore the (dx)^2 term then dy/dx = 2x, which is the derivative at that point. The derivative of the entire function is dy = (2x)dx, or dy/dx = 2x. This is the instantaneous rate of change: like taking a picture of a speeding car and asking what its speed is right now, which is technically zero in an instantaneous snapshot, but through the magic of limits we pretend the denominator of the derivative equation is so close to zero that no other positive quantity could wedge itself between the limit and zero.
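The same bookkeeping in symbols (plugging in x = 2 and \(\Delta x\) = 0.001 reproduces the 4.004001 above):

\[
(x+\Delta x)^2 = x^2 + 2x\,\Delta x + (\Delta x)^2
\quad\Longrightarrow\quad
\frac{\Delta y}{\Delta x} = \frac{(x+\Delta x)^2 - x^2}{\Delta x} = 2x + \Delta x \;\longrightarrow\; 2x \text{ as } \Delta x \to 0
\]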
'dx' means 'a little bit of x' and 'dy' means 'a little bit of y'; it is infinitesimal notation, where an infinitesimal is some extremely small positive quantity which is not zero.
Limits
Reading lecture 2. Watch left and right limits and these techniques for computing limits such as multiplying by a conjugate which is how you eliminate square roots from limit calculations.
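For example, the conjugate trick in action (the particular limit here is my own illustration, not from the notes):

\[
\lim_{x \to 0} \frac{\sqrt{x+1} - 1}{x}
= \lim_{x \to 0} \frac{\sqrt{x+1} - 1}{x}\cdot\frac{\sqrt{x+1} + 1}{\sqrt{x+1} + 1}
= \lim_{x \to 0} \frac{x}{x\,(\sqrt{x+1} + 1)}
= \frac{1}{2}
\]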
A limit was already described perfectly by Isaac Newton as the ability to make the difference between a quantity and a fixed value less than any given positive number. He called these the ultimate ratios of 'vanishing quantities', where the ultimate ratio is not the ratio before they vanish nor after, but the ratio with which they vanish.
These MIT notes tell us that if f is differentiable at a point then f is continuous at that point, which of course makes sense if we go back to the definition of a derivative as a linear approximation, where around any super-zoomed point a tiny displacement lands on a straight line (the tangent).
Trig functions
There are some trig limits here. If you forgot trig, 3blue1brown has many YouTube lectures about sine and cosine, or watch this. Brown University also has a Trig Boot Camp.
We have just seen another linearization. Watch a few minutes of this (Wildberger Rational Trig) starting @5:56 to see motion around a nonlinear curve being mapped to linear motion back and forth on the two axes.
Day 3 Integrals
This is the only concept we will practice before matrix calculus as it will come up all the time in probability and elsewhere.
- Indefinite integral and Antiderivatives
- Area approximations (sets up Riemann sums)
- Riemann sums and the Definite Integral
- Fundamental Theorem of Calculus I and II
Those videos explain everything. Let's practice:
- Integral problem book from CLP textbooks
- Techniques (Integration competition)
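Before the problem books, a quick numerical sanity check of the Riemann-sum idea (a minimal NumPy sketch; the integrand x^2 on [0, 1] is my own choice, and the exact answer from the Fundamental Theorem is 1/3):

```python
import numpy as np

def riemann_sum(f, a, b, n):
    x = np.linspace(a, b, n, endpoint=False)  # left endpoint of each slice
    dx = (b - a) / n
    return np.sum(f(x) * dx)                  # total area of the rectangles

f = lambda x: x**2
for n in (10, 100, 10_000):
    print(n, riemann_sum(f, 0.0, 1.0, n))     # approaches 1/3 as n grows
```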
TODO