AI in 2025
Table of Contents
Intro
We will learn the 'full stack' of modern deep learning systems by building our own library from scratch. Neural networks have different architectures just like there are different computer hardware architectures (x86, RISC, ARM). The most popular right now is the transformer architecture used by large language models (Grok 4, GPT-4).
Neel Nanda has helpfully produced a suggested curriculum for reverse engineering transformers. He even offers mentorship through the ML Alignment & Theory Scholars program or you can claim bounties for GeoHot's TinyGrad as a freelancer.
Limits of AI
Current popular AI is all foundation models: Grok-n, GPT-n, DALL-E, etc. Have you wondered whether, given infinite resources, infinite data, and perfect training algorithms with no errors, we could use this type of model for everything, aka 'General AI'? Someone, with help from the Beijing Academy of AI, used category theory to model this scenario and see what is possible.
Curriculum
- 10-714 DL Algorithms & Implementation (CMU)
- Build from scratch a library similar to PyTorch
- CS 479 Neural Networks (Waterloo)
- Designing and training neural networks
- Mechanistic Interpretability (Google DeepMind)
- Reverse engineer a trained neural network
- Matrix Calculus (MIT)
- All the calculus we need generalized to higher dimensions
- IAP course or 'Independent Activities Period' where faculty can run a 4-week course
Linear Algebra
There is no shortage of excellent linear algebra courses. These are the ones I will do here:
- Coding the Matrix (Brown)
- All the recorded lectures are open to anyone
- Programmed w/Python and has a (nonfree) book
Created by Philip Klein, a name you will recognize if you take any algorithms course. Probability and the SVD are covered, which we'll need. The book is at least cheap ($35) and worth buying, or you can use Anna's Archive or whatever the latest Library Genesis domain is to get a pdf, but it will likely be missing a lot of graphical content. This is a very good course for anyone interested in game graphics, or for anyone taking an algorithms design course who wants to manipulate graphs using linear algebra. If you hate everything else here, do this and you'll be fine.
- Linear Algebra Done Right - Sheldon Axler
- New completely free 4th version
- He upgraded his cat too, the one that was always in the about-the-author pic
- Abstract treatment which we need but is considered a second course
- Contains some calculus
- Seems designed to prepare students for functional analysis (and thus ML)
Neel Nanda says we should do this, and it can be completed in parallel with the Klein book.
Basic calculus
We can audit some of the slides from MIT OpenCourseWare, then the Matrix Calculus course will teach us the behavior of derivatives.
Day 1 Neuroscience Models
Reminder: this is 'learn AI from scratch', so here are neural models from scratch.
- What is a Neuron (26m) from CS479.
- Hodgkin-Huxley Neuron Model
@6:39 his example of an action potential, using a water bucket being filled and dumped, is exactly the same intuition as a potential function in a data structures analysis course, used for amortized analysis. A dynamic array, once filled past a certain percentage, needs to perform the costly action of doubling itself to create more free space. You take the infrequent cost of doubling and average it over many reads/writes to come up with an individual cost for all array operations. There are some differential equations here, but he draws out what they mean; we don't have to know them, we just need to know this is a non-linear model and that everything to do with neurons is electrical. Hz (hertz) is spikes per second.
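If that analogy is new to you, here is a throwaway sketch (plain Python, all names are mine) of the dynamic-array argument: the occasional expensive doubling averages out to a small constant cost per append.

```python
# Sketch of amortized analysis for a doubling dynamic array.
# Each append costs 1 "write"; when the array is full we pay an extra
# cost equal to its current size to copy everything into a bigger array.

def amortized_append_cost(n_appends, initial_capacity=1):
    capacity = initial_capacity
    size = 0
    total_cost = 0
    for _ in range(n_appends):
        if size == capacity:          # full: double and copy (the "bucket dumps")
            total_cost += size        # copying `size` elements
            capacity *= 2
        total_cost += 1               # the ordinary write
        size += 1
    return total_cost / n_appends     # average (amortized) cost per append

for n in (10, 1_000, 1_000_000):
    print(n, round(amortized_append_cost(n), 3))
# The average stays below 3 no matter how large n gets, even though
# individual appends occasionally cost O(n).
```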
You can access all these notebooks he's using for the demo here if interested.
- Simpler Neuron Models (28m) from CS479.
I don't have a physics background either, but he explains the dynamics of a membrane well enough that we can get the gist of what's going on. The difference between this first model and the previous Hodgkin-Huxley model is that we are no longer modeling the spikes themselves, only the potential needed for a spike to occur and a reset, which reduces complexity. @18:44 even simpler models. @22:20 ReLU and Softmax are introduced. We will later have to write our own Softmax.
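Since we'll be writing our own Softmax later anyway, here is a minimal NumPy sketch of ReLU and a numerically stable softmax; this is my own throwaway version, not code from the course.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: pass positive values through, zero out the rest.
    return np.maximum(0, x)

def softmax(x):
    # Subtract the max first so exp() never overflows; the result is unchanged
    # because softmax is invariant to adding a constant to every input.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, -1.0])
print(relu(scores))           # [2. 1. 0.]
print(softmax(scores))        # probabilities that sum to 1, largest for the largest score
print(softmax(scores).sum())  # 1.0
```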
Day 2 Calculus
Let's look over the lecture notes of MIT's 18.01. We are skipping most of this course because you will always use software to calculate a derivative/integral; we just need the absolute basics to begin learning matrix calculus.
Derivatives
Reading lecture 1. If you want worked-through examples then watch this from Clemson's MATH 1060, but this will all be properly explained in higher dimensions when we do calculus in the realm of linear algebra.
The answer to 'what is a tangent line exactly' is: at that point P on the curve, zoom in until its neighboring points appear to lie on a straight line. The slope of a tiny straight line between P and P + 0.001 approximates the tangent. Then imagine doing this for the entire function, creating many tiny straight lines, one for every point, which together form a linear approximation of that curve/function.
For the geometric definition, pretend 'delta-x' \(\Delta x\) is a tiny displacement dx = 0.001, then imagine the distance from P to P + dx shrinking as dx approaches zero. Here is a graph (desmos.com) showing what is going on. Notice on that graph, in order to go from (2, 4) to (2.001, 4.004), you move right along the x-axis by 0.001 then up 0.004 to meet the graph again. This is what the derivative tells us: if a function is perturbed by some tiny extra input, f(x + dx), how sensitive is its output to that change.
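You can watch the same sensitivity numerically; a quick sketch (plain Python, the function and step sizes are just my example values):

```python
# Difference quotient for f(x) = x**2 at the point (2, 4) from the graph above.
f = lambda x: x ** 2
x = 2.0
for dx in (0.1, 0.01, 0.001, 0.000001):
    slope = (f(x + dx) - f(x)) / dx   # rise over the tiny run dx
    print(dx, slope)
# As dx shrinks the slope settles on 4.0, the derivative of x**2 at x = 2,
# and f(2.001) = 4.004001: move right 0.001, up ~0.004 to meet the graph again.
```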
Example 1 of the MIT notes for \(f(x) = 1/x\):
- Plug into the derivative equation (the limit definition)
- Extract out \(1/\Delta x\)
- Perform \(\frac{a}{b} - \frac{c}{d} = \frac{ad - bc}{bd}\)
- The numerator becomes \(-\Delta x\), which cancels the extracted \(1/\Delta x\), leaving \(-1\) over \((x+\Delta x)\,x\)
- Take the limit as \(\Delta x\) goes to zero
- \(-\frac{1}{x^2}\)
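Written out, those steps are:

\[
\frac{d}{dx}\,\frac{1}{x}
= \lim_{\Delta x \to 0} \frac{1}{\Delta x}\left(\frac{1}{x+\Delta x} - \frac{1}{x}\right)
= \lim_{\Delta x \to 0} \frac{1}{\Delta x}\cdot\frac{x - (x+\Delta x)}{(x+\Delta x)\,x}
= \lim_{\Delta x \to 0} \frac{-1}{(x+\Delta x)\,x}
= -\frac{1}{x^2}
\]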
Why is it negative? Look at the graph: if 0.001 is added to the x input then the y output has to drop (by roughly \(0.001/x^2\)) to meet the graph of f(x) = 1/x again; the curve is decreasing, so the slope is negative.
Finding the tangent line: the equation for a line is y = mx + b and they have merely substituted in values. We will never use this, so feel free to skip it and the area computation after.
Try inputs to \(f(x) = x^2\) to understand Big-O:
- \(f(2 + 0.001) = 4.004001\)
- \(f(3 + 0.001) = 9.006001\)
- \(f(x + dx) = y + dy + 0.000001\)
- \(f(x + dx) = y + dy + O\big((dx)^2\big)\)
If dx is approaching zero then \((dx)^2\) will reach zero before dx does and can be discarded.
From those inputs to \(f(x) = x^2\), notice that the full change in output is \(2x\,dx + (dx)^2\) (at x = 2 that is 0.004 + 0.000001), and if we ignore the \((dx)^2\) then dy/dx = 2x = 4 at that point. The derivative of the entire function is dy = (2x)dx, or dy/dx = 2x.
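A quick numeric check of that claim (plain Python, the example values are my own):

```python
# For f(x) = x**2, the exact change is dy = 2*x*dx + dx**2.
# The dx**2 remainder dies off much faster than dx, so it can be discarded.
x = 2.0
for dx in (0.1, 0.001, 0.00001):
    dy = (x + dx) ** 2 - x ** 2
    linear_part = 2 * x * dx
    remainder = dy - linear_part      # this is exactly dx**2
    print(dx, dy, remainder, remainder / dx)
# remainder/dx shrinks to zero along with dx, so dy/dx -> 2*x = 4 at x = 2.
```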
'dx' means 'a little bit of x' and 'dy' means 'a little bit of y'; it is infinitesimal notation, where an infinitesimal is some extremely small positive quantity which is not zero.
Limits
Reading lecture 2. Watch left and right limits and these techniques for computing limits, such as multiplying by a conjugate, which is how you eliminate square roots from limit calculations.
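As one worked example of the conjugate trick (my own example, not one taken from those notes): to remove the square root, multiply top and bottom by the conjugate \(\sqrt{x+1}+1\):

\[
\lim_{x\to 0}\frac{\sqrt{x+1}-1}{x}
= \lim_{x\to 0}\frac{(\sqrt{x+1}-1)(\sqrt{x+1}+1)}{x(\sqrt{x+1}+1)}
= \lim_{x\to 0}\frac{(x+1)-1}{x(\sqrt{x+1}+1)}
= \lim_{x\to 0}\frac{1}{\sqrt{x+1}+1}
= \frac{1}{2}
\]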
A limit was already described perfectly by Isaac Newton as the ability to make the difference between a quantity and a fixed value less than any given positive number. He called these the ultimate ratio of 'vanishing quantities' where the ultimate ratio is not before they vanish or after but the ratio with which they vanish.
These MIT notes tell us that if f is differentiable at a point then f is continuous at that point, which of course makes sense if we go back to the definition of a derivative being a linear approximation, where around any super-zoomed point there is a tiny displacement point on a straight line (the tangent).
Trig functions
There are some trig limits here. If you forgot trig, 3blue1brown has many YouTube lectures about sine and cosine, or watch this. Brown University also has a Trig Boot Camp.
We have just seen another linearization. Watch a few minutes of this (Wildberger Rational Trig) starting @5:56 to see motion around a nonlinear curve being mapped to linear motion on the two axes, moving back and forth.
Day 3 Integrals
This is the only concept we will practice before matrix calculus as it will come up all the time in probability and elsewhere.
- Indefinite integral and Antiderivatives
- Area approximations (sets up Riemann sums)
- Riemann sums and the Definite Integral (sanity-checked in code after the practice list below)
- Fundamental Theorem of Calculus I and II
Those videos explain everything. Let's practice:
- Integral problem book from CLP textbooks
- Techniques (Integration competition)
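Riemann sums are also easy to sanity-check in code before hitting the problem book; a minimal sketch (assuming NumPy, the integrand is my own pick):

```python
import numpy as np

def left_riemann_sum(f, a, b, n):
    # Split [a, b] into n equal slices and add up rectangle areas,
    # using the left endpoint of each slice for the rectangle height.
    width = (b - a) / n
    lefts = a + width * np.arange(n)
    return np.sum(f(lefts) * width)

# Approximate the definite integral of x**2 on [0, 1], which is exactly 1/3.
for n in (10, 100, 10_000):
    print(n, left_riemann_sum(lambda x: x ** 2, 0.0, 1.0, n))
# The sums approach 1/3 as n grows, which is what the definite integral means.
```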
TODO