Competitive Programming for Programming Interviews

What is this

How to game the current technical interview system for fun and profit. The bar we need to clear: "it's not enough to be able to solve the problems, you need to be able to solve them on the spot, without much thinking, in an interview context". Codility, a popular site for filtering applicants, based its service on the IOI competition, where problems usually have an easy but suboptimal solution and the optimal solution requires algorithmic skill, so let's get that skill.

Every book listed here, if it isn't already free, can be sourced from Library Genesis or found through a WorldCat library search.

Gonzo journalism

I actually mass apply to various tech outfits and practice this in person, doing interviews for fun just to see their process, and maybe get in and walk around to see what goes on there.

Often in interviews you will get 'ranked', which is completely suspect. My cynical take is that it means they already have a candidate they want but are waiting for that candidate to lowball themselves on salary in negotiations. If that candidate asks for too much, they go down the ranking list and maybe you're next, and if they don't like your demands, it's on to the next in the ranking. They should do all this before making you jump through multiple interviews only to get ranked, but that's the game, and we can't change it.

Recently the most interesting interviews I've had were due to the material taught here by MIT's Chris Rackauckas, who will teach you the full math model, like loops as discrete dynamical systems. Absolutely amazing content if this kind of thing is your bag.


  • MIT's 6.006 sp2020 course on OCW

Contains recorded full problem walkthroughs. I preferred this course to others because they actually use a board instead of Zoom slides, so you can see problems being fully worked out, and it's permanently archived on MIT OCW so it won't disappear behind campus logins. The course uses DPV as an optional text, but you should pick whatever algorithms text interests you; I will use The Art of Computer Programming by Don Knuth.

  • A short workshop Functional Data Structures in OCaml

This is a high-level computer scientist giving you a stroll through the subject, as if you had a personal PhD tutor. Prof Ragde mostly follows the MIT syllabus, offering a different view of the same material but one that is much more mind opening. Should you use the 'Master Theorem' or the Akra-Bazzi method? Is best-case analysis just marketing? The author has many subject strolls in this style, including notes on Programming Language Foundations in Agda, complete with ASCII screencasts, if you want to learn some PL theory.


  • The book Competitive Programming 4 (first volume, 4 chapters)

The book presents methods to solve a curated list of problems for each chapter. Practice is done with the Kattis Problem Archive, so you produce working code for hundreds of problems and submit it for tests of correctness and runtime constraints. All the algorithms have visualizations on the author's site VisuAlgo. Any practice site will work, like LeetCode or Codeforces, but competitive programming online judges have the best test suites and their problems can be solved creatively; there isn't just one solution.

I like this book as the ultimate reference: it's handbook sized, cost $20, and anytime some problem pattern comes up I can flip through it and always find optimization advice not easily found anywhere else.


The MIT course is all you really need, in fact just the recitations and the recorded problem-solving seminars, though Erik Demaine always gives a good lecture and they're worth watching. He didn't read any textbooks in undergrad; instead he practiced what he was taught in class by trying to construct objects like Tetris pieces, puzzles, and folding algorithms for origami using the course materials. You can do the same: make your own music using this workshop, or whatever it is you want to do.

This guy made the Kazehakase browser as a hobby and now works for Mozilla. I used to use his browser all the time back in the day. Practice what you want to do and you will conjure it into existence.


Do this in the language you enjoy, not what you think is 'employable' because it won't matter.

The book has a public code repository of all the algorithms it contains, and explains the complexity of the standard library data structures for OCaml, Python 3, Java 11, and C++17, but you can use any language. Whatever language you choose, there is guaranteed to be a tutorial on using it for competitive programming, like this Haskell example. If you don't see the language you want, go on SPOJ, where you can use assembly, sed, Scheme, anything you want.

Using a phone/tablet

If you must, you can use any basic text editor and cut+paste your solutions; in fact we'll watch a CMU competitive programming lecture (now that I've linked it, it will probably disappear) on dynamic programming where that's exactly what the prof is doing. Looking around the Android store, the only decent offline app seems to be a Pascal compiler made by a high school kid in Vietnam. Kattis accepts Pascal submissions (I think the IOI still does too), and Pascal syntax is simple pseudocode you can write on a whiteboard. If you want an online compiler, there are plenty to try.


Let's see what competitive programming practice looks like.

Competitive programming class

A great writeup here by an exchange student from Canada who took CS 3233 in Singapore, describing how the author of the CP4 book organizes those classes. It's a 12-day course expanded to 4 months, as most of the work is the class competing against each other.

Competition practice

Here is Gennady Korotkevich, aka tourist, practicing on Codeforces and explaining what he's doing, though he didn't start with C++; he used to win the IOI with Pascal. A good article about him is here, which explains his spartan background: he chooses to live in a student dorm. At school he gives lectures on discrete math/combinatorics and complexity theory, which makes sense; if you are a master problem solver you are likely interested in trying to figure out unsolved problems in theoretical CS.

The first thing to notice is he's using a basic text editor, an unusual shell that looks like Minimalist GNU for Windows (MinGW), and a text-mode file manager, and he manually submits through the Codeforces website UI. He prepares individual directories named after each problem, then always makes a sol.cpp file in that directory so he doesn't have to change a macro he's written to compile the solution from the shell.

His C++ template:

// include every C++ std library via the bits header
// (GCC-specific, doesn't work on all platforms)
#include <bits/stdc++.h>

// bring the std namespace into scope to avoid having to
// write std::cin or std::cout
using namespace std;

int main() {
  // turn off default cout flushing and C-stream synchronization
  ios::sync_with_stdio(false);
  cin.tie(nullptr);

  // declare variable tt as an integer, store the first input
  // (the number of test cases) in tt
  int tt;
  cin >> tt;
  // decrement tt toward zero, running the body once per test case
  while (tt--) {
    // solve one test case here
  }
  return 0;
}
std::cin is synchronized to std::cout and performs default buffer-flushing operations, so to save runtime he disables this sync. This is all documented in online C++ references you can read yourself if you choose to use C++. He explains his compiler flags in one of these Twitch streams, or you can figure them out yourself as you go; most of them make compiler messages extra verbose for debugging. Note, you would never, ever write software like this outside of competitive programming, because he has opened a gigantic scope.

You don't have to watch the entire 2-hour vid; the last problem seems like the most informative. He begins most problems by writing himself examples as comments, something we already learned in CS19. He copies the problem's input test cases into a text file 'in1', then pipes it into his compiled program from the shell as stdin (standard input): "sol.exe < in1". He writes multiple submissions to compare their runtime/memory usage. He looks for some kind of mathematical structure in the formulas given in the problem writeup. After submission he looks up the solutions by the problem writers and compares them to his own.

@1:43:48 or so he reads the writeup for a problem describing local maximum and minimum, which is covered in the chapter on optimization in the calculus book here if you're interested. A time-limit-exceeded verdict shows how he would optimize to get an accepted verdict; he also trolls the stream by shortening all the variable names to single letters, claiming it will speed up the program. The judging server ran his program faster than his own laptop, something to remember if you're debugging and believe your solution still isn't optimized enough: resubmit anyway, as you get unlimited submissions.

C++ crash course

If you look at the STL, the Standard Template Library, C++ is a generic programming language, not really an object-oriented one. You can alter all the built-ins by supplying custom functions without needing to manipulate function pointers.

Some materials for C++ I found useful:

There are countless C++ data structure and algorithms books on Amazon. In the Stepik online course I found some problems, like this one, have only an 18% completion rate by students, meaning they are obviously too hard for the targeted audience, and it will not show you solutions until you enter a passing solution. If you get stumped early on the same problem:

For this problem I searched around to find out that a C++ true or false boolean in a return statement like 'return x < y' can be cast to an integer to represent 1 or 0, though later, in the solutions guide after I solved this problem, I discovered you don't even need the cast; it converts to an integer automatically.

return rand() < RAND_MAX * p;

I only knew this from the AI workshop, because for the distribution of a biased coin the expected value is p, whatever the probability is. If you flip that biased coin 100 times the expected value is 100p. I'm not sure how a beginner student who hasn't done probability courses would know that at all, which is likely why it has an 18% completion rate.

Try some problems

Start on Kattis with the easiest problems from the book, just to get used to I/O and the STL containers. You will learn by doing, here's an example of some of the first few Methods to Solve problems:

// problem Faktor
#include <iostream>
using namespace std;

int main() {
  int a, b;
  cin >> a >> b;
  // minimum citations for a articles to reach impact factor b
  cout << 1 + a * (b - 1) << endl;
  return 0;
}
// problem carrots
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
  int total, answer;
  cin >> total >> answer;
  // first input determines how many throwaway 3rd inputs:
  vector<string> a(total);
  for (int i = 0; i < total; i++) {
    cin >> a[i];
  }
  cout << answer;
  return 0;
}

In this example for Roaming Romans I couldn't pass the last test case until I abandoned all C++ built-ins like ceil/round and just cast the type to an int to cut off the decimal:

#include <iostream>

int main() {
  float a;
  std::cin >> a;
  // multiply, add 0.5, then truncate: rounds to the nearest int
  int n = static_cast<int>(a * 1087.7626 + 0.5);
  std::cout << n;
  return 0;
}

To start I didn't import every library or untie cin/cout, so I could remember which library had those functions.

OCaml crash course

Here's a quick rundown on OCaml and how abstract it can get should you wish to try it to solve Kattis problems. I'm using the interactive OCaml top-level where code is compiled and executed on the fly.

# let compose f g = fun x -> f(g x);;
val compose : ('a -> 'b) -> ('c -> 'a) -> 'c -> 'b = <fun>

# compose (fun x -> x + x) (fun y -> y + y) 2;;
- : int = 8

Above is function composition, f(g(input)), where a function g is applied first to an input and the result is the input for function f. Compose takes two functions and something to apply the second function to. Lambda syntax (a function without a name) in OCaml is fun x -> x, and you can bind that anonymous function to a name with let. There are many ways to declare a function in OCaml, which you can look up yourself in the documentation; in fact this composition is already in the standard library as the application operator, and there's always another way to do something in OCaml.

val compose : ('a -> 'b) -> ('c -> 'a) -> 'c -> 'b = <fun>

These arrows give the type of a function, typ -> typ', where typ is the domain type (the type of the inputs) and typ' is its range type (the type of the result). The last -> 'b on the right is the result, and everything to the left denotes the parameters the function takes. Arrows with parentheses around them, (typ -> typ'), denote that a function is expected as a parameter.

These are abstract type variables, meaning they become whatever type of data you give them; if you give it strings it's now type string. They are called alpha 'a, beta 'b and gamma 'c. Think of them like math variables: in a + b = 2, a and b can be the same value, 1, or different, like -2 and 4. Type variables 'a -> 'b can be int -> int or int -> string but must be used consistently or you will get a type error. The = <fun> notation is a placeholder for the bytecode generated by the top-level, since it would make no sense to display it to you in the top-level/UTop.

let g x =
  let f = fun y -> y + x in
  let x = 1 in
  f x 

A popular OCaml idiom using nested evaluation, sometimes called 'inlining', which is a way both to limit the scope of internal bindings you create, like libraries you open, and to help compiler optimization, since you give it all it needs at once, similar to Common Table Expressions in SQL: 'here's everything I want to do, you figure out how best to optimize it'. At the top, g is the function you call from the outside, and f is an internal function not accessible outside that let scope. All this is doing is naming components before combining them into one final expression.

# let g x =
 let f = fun y -> y + x in
 let x = 3 in
 f x
val g : int -> int = <fun>

(* Eval is by substitution: *)
# g 10
let f = fun y -> y + 10 in
 let x = 3 in
 f x
let x = 3 in
(fun y -> y + 10) x
(fun y -> y + 10) 3
(3 + 10)

The step (fun y -> y + 10) 3 is similar to lambda calculus syntax: the function on the left is applied to the value on the right and evaluated via substitution until you reach a value. In ML-family languages functions can be values.

# let rec power f n = if n <= 0 then (fun x -> x) else compose f (power f (n-1));;
val power : ('a -> 'a) -> int -> 'a -> 'a = <fun>

# let sqr = power (fun x -> x * x) 2;;
val sqr : int -> int = <fun>
# sqr 2;;
- : int = 16

A power function that consumes a function and the number of times you want to compose that function with itself. The rec keyword is to get around the problem of declaring a function that references itself before it is defined. If you didn't include it you'd get an error along the lines of 'power is used but not defined'.

Hand stepping power:
# let sqr = power (fun y -> y * y) 2;;
(compose (fun y -> y * y) (power (fun y -> y * y) 1))
(compose (fun y -> y * y) (compose (fun y -> y * y) (power (fun y -> y * y) 0)))
(* n = 0 returns the identity, fun x -> x *)
(compose (fun y -> y * y) (compose (fun y -> y * y) (fun x -> x)))
(* compose f g returns fun x -> f (g x) *)
(compose (fun y -> y * y) (fun x -> (fun y -> y * y) ((fun x -> x) x)))
fun x -> (fun y -> y * y) ((fun x -> (fun y -> y * y) x) x)
val sqr : int -> int = <fun>
Hand stepping sqr:
# sqr 2;;
(fun x -> (fun y -> y * y) ((fun x -> (fun y -> y * y) x) x)) 2
(fun y -> y * y) ((fun x -> (fun y -> y * y) x) 2)
(fun y -> y * y) ((fun y -> y * y) 2)
(fun y -> y * y) (2 * 2)
(* (fun y -> y * y) 4 is the same as f 4 or f(4): function application to the value 4 *)
(fun y -> y * y) 4
(4 * 4)
- : int = 16

I probably screwed up the nested parens and confusing variables, but for recursion, pretend every call to power() from power() is to a fresh clone of power(), and you're waiting for results to return just like in any other program that calls an outside function. Recursion is solving the problem one step at a time, repeating until you reach a base case; you stop the recursion by returning a value, and then all the delayed computations can compute, from right to left.

# let fib n =
    let rec f prev next count =
      if count < 2 then next
      else f next (prev + next) (count - 1) in
    f 0 1 n;;
val fib : int -> int = <fun>

# fib 50;;
- : int = 12586269025

Virtually every programming book since SICP uses a version of the recursive Fibonacci algorithm to show how inefficient recursion can be at growing a stack, but you can rewrite it as a simple loop that still refers to itself without re-entrancy, meaning a stack to keep track of delayed computations is not needed.

# let derivative dx f = function x -> (f(x +. dx) -. f(x)) /. dx;;
val derivative : float -> (float -> float) -> float -> float = <fun>

The dot notation +. or /. means a float is expected. This function consumes a tiny float dx and an arbitrary function f, then returns the derivative of f (which is a new function), written f' in math notation. For example, for a power function f(x) = \(x^2\) the formula for its derivative is f'(x) = \(nx^{n-1}\), here f'(x) = \(2x^1\): the exponent is moved down to multiply the base, and 1 is subtracted from the original exponent. The formula in the OCaml code above is a generalized version of the derivative. If you took the derivative of f'(x) = 2x you would get another function back, f''(x) = \(2x^0\), which is just the constant function x -> 2 for every input x, but still a function.

# let squared = derivative 1e-5 (fun x -> x *. x);;
val squared : float -> float = <fun>
# squared 4.;;
- : float = 8.00000999952033

# let three = derivative 1e-5 (fun x -> x *. x *. x);;
val three : float -> float = <fun>
# three 2.;;
- : float = 12.0000600002612128

Squared is fun x -> \(x^2\) and its derivative should be fun x -> 2x, which checks out (8 ≈ 2 · 4). The derivative of three should be fun x -> \(3x^2\), which also checks out (12 ≈ 3 · \(2^2\)).

(* atan is a built-in *)
# let pi = 4.0 *. atan 1.0;;
val pi : float = 3.14159265358979312

(* sin is a built-in *)
# let sin''' = (power (derivative 1e-5) 3) sin;;
val sin''' : float -> float = <fun>

# sin''' pi;;
- : float = 0.999998972517346263

Idiomatic OCaml version

# let sin''' =
   (power (derivative 1e-5) 3) sin in
     sin''' pi;;
- : float = 0.999998972517346263

Partially apply the derivative function with a tiny 1e-5 for parameter dx. The returned function is then fed into power, which composes the derivative operator with itself three times. Apply that third-derivative operator to sin, like how derivative was applied to squared, x -> x * x, earlier, and the result can be applied to pi, a float, returning the third derivative of sin at that point. This calculus example was lifted directly from the book Using, Understanding, and Unraveling the OCaml Language and exists here to explain the example if you choose to read that paper (sci-hub has a better version than the above link), and to showcase the kind of crazy abstract passing around of functions and partial application you can do in OCaml.

Helpful materials

Of course the CP4 book also has OCaml code on the book's public github for every algorithm in the text.

Math functions

Exponential functions

Watch this to learn exponential functions. You can skip the chain rule and derivative content and still understand the video; he shows how \(e^x\) is preferred to other exponential bases since it has nicer properties, so that's what we use.

Parameters for C\(e^{kx}\) or C\(e^{k(x-h)}\)

  • C is the y-intercept (shift along y-axis)
  • k affects the steepness (slope)
  • x is the function input
  • h is a horizontal shift on the x-axis
    • never crosses the x-axis as y is never equal to zero

Statistics example

Statistics functions will now make sense when you break them apart by parameters. Look up 'bell curve' or 'normal distribution'. A picture is here. The formula is:

\(f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\,\right)\)

or the equivalent \(e^x\) notation:

\(f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\)

C = \(\frac{1}{\sigma\sqrt{2\pi}}\), k = -1/2, and x is being shifted by h = \(\mu\)

Type this into Desmos and make the symbols for the standard deviation and mu into variables you can adjust with a slider. Playing around with this, you now know the standard deviation is how far everything in the distribution is spread out from the mean. With k = -1/2 the exponent peaks at x = \(\mu\): left of the mean the curve is climbing, right of the mean it is falling. This function is a density: the probability that x lies in some given interval is the area under the curve over that interval, and pi is used here as a trick to normalize the distribution so when you integrate it sums to 1, as all probability distributions do.

Logarithmic functions

Watch this to learn logarithms.

Parameters for f(x) = C(ln(x - h)) + v

  • v is a vertical shift
  • h is the horizontal shift
  • C is the magnitude/rate of increase

A log is an exponent, so you manipulate logs exactly how you combine exponents. For example, since multiplying two powers of the same base adds the exponents, \(b^x \cdot b^y = b^{x+y}\), the log of a product is the sum of the logs: log(xy) = log(x) + log(y).

Polynomial functions

Polynomials can be used as a specification for another function, such as functions in nature, to see what inputs drive what responses (outputs). If you watch this lecture up to 12 mins, you will find out how they are used for approximating other functions, so all the statistics of that function can be learned, like what its max/min is and what its derivative is. Since they can be evaluated with only addition and multiplication, they can be encoded as bitwise operations or as vectors; therefore if you have a chance to use a polynomial function model, you take it.


A polynomial is a function in which the output is a sum of terms, each the product of a constant and the input raised to some nonnegative integer power. A polynomial of a polynomial is another polynomial: polynomials are closed under addition, multiplication, and composition.

The general form of a single-variable polynomial:

\(f(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0\)

f(x) = 1/x is not a polynomial, negative exponents like f(x) = \(x^{-3}\) are not polynomials, f(x) = \(\sqrt{x}\) is not a polynomial, but a constant like f(x) = 0 is a polynomial since f(x) = \(0x^0\) = 0. Often in these courses and books they will refer to a single variable as a function, such as 'a function n', which is \(n^1\), a polynomial.

Polynomial time

Polynomial functions are used as models of the running times of programs so we can see what kinds of inputs drive what responses, either in a single worst-case scenario of near-infinite input, or as an average cost over multiple runs and different-sized inputs. This is called polynomial time, meaning a polynomial exists that can bound our algorithm's running time as a function of the size n of the input. What we don't want is exhaustive search: exponential running time.

Polynomial functions \(n^c\), where c is some constant, are in P, the class of problems for which an algorithm exists that can solve instances of size n in time \(n^c\). It's a flexible definition: if we alter the details of our computational model or add computational power, P doesn't change, you just move up or down to a larger or smaller polynomial. If you are using a parallel computer, as long as it only has a polynomial number of processors, you are saving at most polynomial time. Let's say you can solve problems of size n = 1 in one microsecond; then a problem instance of size n = 100 being solved by an algorithm with running time \(n^3\) can complete in under a minute. A problem of size n = 100 with \(2^n\) running time will take longer than the estimated age of the universe.


The MIT lectures just jump into this with no explanation so let's figure it out first.

RAM computational model

The Random-Access Machine model is one big array. You can access or write any memory/storage in constant time.

To use it you go through a program, applying +1 cost to essentially everything: assignments or writes, every memory fetch, if-else statements (take the highest-cost branch if doing worst-case analysis), and all operations like multiplication. If it isn't a loop over n-sized inputs, it only has +1 cost. If your program calls another function, you count all its operations in the same way and add them to the total cost.

A loop costs size-n times whatever cost its body has, for example a counter:

ctr := ctr + 1

+1 operation for the assignment, +1 for the addition, for a total cost of 2n inside a size-n loop. If it's nested loops, meaning on every iteration you call another loop, it's n * n cost, assuming that nested loop only does constant work and doesn't call yet another loop.

This is a simple model, oblivious to the language you are using; the MIT lectures have more details, look at the first recitation. There are of course faults with this model, and I've given a vague and partial sketch of it here, but at this stage who cares, we will learn some real-life algorithm analysis specific to an architecture anyway as we go.


Let's say your program on every call does 50 constant operations before calling three other functions that run n-sized loops sequentially, so one finishes and another begins, each doing 20 operations per iteration. Your program's running time is f(n) = 50 + 20n + 20n + 20n; enter this function into Desmos, zooming out for millions of inputs.

Try to bound your running time with another function; I chose g(x) = 62x. Keep zooming out until you're at a ridiculous number of inputs and notice your running time f(x) never outgrows g(x) = 62x. Reset the graph back to default zoom: hey, your function is no longer bounded by g(x) = 62x. It is only when inputs are greater than a certain threshold that your function becomes bounded by g(x). This is the definition of big-O: \(f(x) \le c \cdot g(x)\) for all inputs past some threshold denoted \(x_0\).

More clarity on notation

I highly recommend this chapter. The author calls these flaneries and describes them as a 'short stroll through a subject'.

Reading up to section 2.2: in 2.1.1, Ambiguities in mathematics, he clarifies what \(n^2\) means in big-O notation like \(O(n^2)\): given an input n it will grow on the order of \(n^2\), or \(O([n \mapsto n^2])\) using lambda (anonymous function) notation.

In 2.1.3, Abuse of Notation, the 'small stuff theorem' demonstrates how you tell if one function is in another's bound by taking a ratio: \(n^2/n\) is \(n/1\) and has no limit, it just grows forever, so \(n^2\) is not in O(n), whereas \(n/n^2\) is \(1/n\) and converges to 0, so n is bounded by \(O(n^2)\). This f(n)/g(n) ratio is explained in the first chapter of DPV, but the MIT lectures will also teach us this. Note that constants are absorbed into the definition of big-O, so we can ignore them.

2.1.6, Algorithm analysis in practice: \(\max_{I \in I_n} R(A,I)\), if you didn't understand: R() outputs a running time and consumes an algorithm A and an instance I of the problem. Example: A = insertion sort and I = inserting an element into an already sorted list; then R(A,I) will return O(1) constant time, because all insertion sort did was constant work to paste the element onto the head of the list, assuming your programming language does this in constant time and does not copy the entire list to new memory just to attach a new head, which some do. The max means the longest-running-time instance in the set \(I_n\), which for this problem would be inserting elements given in reverse order, where R(A,I) would return \(O(n^2)\). And that's it, you know what worst-case running time is now.

The sequential and product theorems are all you ever need to know to analyze sequential (imperative) code for worst-case running time, assuming there aren't hidden costs in the language implementation.


Watch Big O and friends, part of a lecture from CMU's CS Theory Toolkit. You can type all these definitions into Desmos to see their visualization like I did here, then keep zooming out to see how they change when inputs become gigantic. The constant and the logarithm (log n) become indistinguishable, and n log(n) grows only slowly away from a linear function like 2x, because it's linear times a logarithmic factor. 'Factor' here means you are multiplying two things, so, like in high school, they will break into factors, and one will be logarithmic because you multiplied n by log(n).

@14:50: \(\log^3 n\) means \((\log n)^3\). Now let's figure out O-tilde and poly(), which you will see in every algorithm paper:

  • poly(g(x)) or f(x) = \(g(x)^{O(1)}\)
    • g(x) raised to some constant exponent (a polynomial in g(x))
    • ex: \(O(n^2)\) can be rewritten poly(n)
  • \(\widetilde{O}\)(g(x)) or f(x) = g(x) * poly(log(g(x)))
    • g(x) times some logarithm factors, i.e. 'polylogarithmic' factors
    • ex: \(n^2 \cdot \log^3 n\) is \(O(n^2 \cdot \mathrm{poly}(\log n))\) or \(\widetilde{O}(n^2)\)

His \(\widetilde{O}\)(g(x)) example: \(n^5 \cdot 3^{n}\). The log of \(3^n\) is proportional to n, so now all the n's are polylogarithmic factors and can be suppressed with this notation (the exponential \(3^n\) bounds any polynomial anyway). He remarks O-tilde indicates there are some logarithmic growth factors in your bound that you don't care about or couldn't pin down. His check that this isn't \(\widetilde{O}(2^n)\) involved taking the ratio f(x)/g(x), so \(\frac{3^{n}}{2^n} = 1.5^n\), and making sure it's not smaller than poly(n), which it's not, since any \(n^{O(1)}\) will never outgrow the exponential function \(1.5^n\).

Everything past 22:00 or so is in the book Asymptopia by Spencer, covering the standard forms of functions you want to use for g(x), the function you use to compare/bound your algorithm's f(x) against. Now the 'g(x)' in big-O notation is demystified: you know it's an arbitrary function from a family of functions that has to be in a specific form to represent a bound on f(x). The rest of these lectures I'll mix in as they come up.

The Art of Computer Programming

Many of the MIT 6.006 lectures follow the structure of TAOCP so I will go through some of it with the MIT lectures.

TAOCP is not what I would call an algorithms text, and it's not a reference, as many critical results are only found in the exercises, with even the index pointing you to exercises. It is best described as an entertaining and casual hobby for programmers, like those people you see on the subway or in coffee shops doing crosswords or sudoku, except you are diving deep into programming puzzles. It only considers code that can be run on a machine, so there is no 'inputs go to infinity' style analysis, and all the code is in minimal-abstraction assembly targeting a simplified architecture so you can see exactly what's happening in steps. There is no operating system in the third edition, which still uses MIX; with each algorithm you often design your own memory layout schemes and your own auxiliary data structures like stacks or threaded trees, and everything is from scratch, which of course helps you learn exactly how all this works together. Much of the analysis of algorithms uses Kirchhoff's circuit law, that whatever goes in must equal what comes out, and it's almost always multiple algorithms being compared, not a single algorithm.

The physical book itself is representative of the work that has gone into its writing, with high-quality typography, perfect paper, and a binding where no matter what page you flip to, the book stays flat. If you were ever to invest money into a physical book, this would be one of those books.

The Bill Gates quote

On the back of volume 1 there's an old quote from Bill Gates, 'You should definitely send me a resume if you can read the whole thing', which is always brought up on social media with much confusion. I found a more recent quote from 2005:

"BILL GATES: ...say somebody came for an interview and they said 'Hey, I read The Art of Computer Programming, that's all I ever read, I did all the problems', I would hire them right then.

MARIA KLAWE: You’d hire them right then.

BILL GATES: Yeah, that’s right.

MARIA KLAWE: So would I.

BILL GATES: Even if they didn’t do the double-star problems, I mean, just the fact that they’d read the whole book, you know, those are the kinds of things you need to know to be a good programmer. Actually, there’s some of that you don’t even need to know, but the kind of algorithmic thinking that’s promoted there." (2005 interview with then Princeton Dean of Science).

I'm not sure what the double-star problems are; maybe earlier editions had this notation. Also, the whole book is 5 books so far, since vol 4 is split into two parts, so he's probably only talking about the first volume. Of course I wouldn't read TAOCP for a job, but as Knuth puts it in interviews, 'all the best programmers can move effortlessly through levels of abstraction', so it can't hurt.


You can, if you want, use the MMIX Supplement by Ruckert to replace most of the MIX/MIXAL content with MMIX/MMIXAL in volumes 1-3. MMIX isn't a single machine: you can customize the hardware however you want using Knuth's free meta-simulator program, which is a test bed for trying out different chip configurations. You can change instruction throughput, pipeline configurations, branch prediction, cache sizes and associativity, bus speeds, anything. He documented that program in the book MMIXware: A RISC Computer for the Third Millennium, and all the code is written in literate programming style so you can read the programs like you would any other book. If you want to better understand a RISC architecture, RISC-V is taught in MIT's 6.004 course on YouTube.

To start I'm just going to use the original 5-byte word MIX, since it's small enough to keep in my head and programming a chip directly without an operating system is interesting to me; I can switch to MMIX anytime, or choose to read in MIX but do all the assignments and algorithm code in MMIX. You can do the entire book on pieces of paper, or if you want to try MIX/MIXAL there's GNU MDK or other software to run MIXware, meaning compile .mixal files to .mix and run them with either mixvm, the gmixvm GUI, or a Scheme interpreter. Many distros have this as a package, so you can install mixvm/mixasm easily.

For MMIX you can use the 'mixmasters' version or the sealed-in-time 2017 Knuth version you compile from CWEB sources; personally I went with the original CWEB sources as they're self-documented. Somehow Knuth declared them 'eternally bug free by definition' and they will never be modified again.

Optimizing compilers are dead

Another reason to learn this material is given by DJ Bernstein's talk from a few years ago on the death of optimizing compilers; see the slides, where at the end Knuth describes how hand-coded assembly will always be faster than the best optimizing compilers, which at present are absolute junk. Djb notes in this talk that CloudFlare hand-rolled his ChaCha/Poly in assembly to avoid wasting resources. Knuth also mentions Hoare suggesting that future compilers should have interactive transformations the programmer can use to specify optimizations themselves. If you design algorithms that must 'scale', then performance is critical and you're going to have to target a specific architecture and rewrite CPU-heavy hot spots in assembly.

MIT 6.006 Lecture 1

In the first lecture we get a definition of algorithms (sort-of) and a confusing example of induction. Both are explained better in the first two chapters of The Art of Computer Programming volume 1 so I'll go through those chapters.

TAOCP 1.1 This chapter is a thorough explanation of what an algorithm actually is. Knuth claims you must step through an algorithm on some sample cases if you ever hope to understand it; he considers this pretty much mandatory. His formal computational method on page 8 gives these computational states/steps:

  • f((n)) = (n)
    • given a singleton, leave it alone: this is the terminal state holding the answer
  • f((m,n)) = (m,n,0,1)
    • given a pair (m,n), extend it to (m,n,0,1): set r=0 and go to step 1
  • f((m,n,r,1)) = (m,n, m mod n, 2)
    • step 1: r is set to the remainder m mod n, go to step 2
  • f((m,n,r,2))
    • step 2: if r=0 match f((n)), which just returns n, otherwise go to step 3
  • f((m,n,p,3)) = (n,p,p,1)
    • step 3: set m←n and n←p, then back to step 1
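This state machine can be sketched directly in Python. This is a toy transcription, not Knuth's code; `f` and `run` are my own hypothetical names:

```python
# Sketch of TAOCP 1.1's computational method for Euclid's algorithm:
# states are tuples, f maps state -> state, and a singleton (n,) is the
# terminal state holding gcd(m, n).

def f(state):
    if len(state) == 1:            # f((n)) = (n): terminal state
        return state
    if len(state) == 2:            # f((m,n)) = (m,n,0,1): set r=0, go to step 1
        m, n = state
        return (m, n, 0, 1)
    m, n, r, step = state
    if step == 1:                  # step 1: r <- m mod n, go to step 2
        return (m, n, m % n, 2)
    if step == 2:                  # step 2: if r = 0 output (n), else step 3
        return (n,) if r == 0 else (m, n, r, 3)
    if step == 3:                  # step 3: (m,n,p,3) -> (n,p,p,1)
        return (n, r, r, 1)

def run(m, n):
    state = (m, n)
    while len(state) > 1:          # iterate f until the terminal singleton
        state = f(state)
    return state[0]

print(run(119, 544))   # 17, since gcd(119, 544) = 17
```

Stepping `run` by hand for a couple of inputs, as Knuth insists, is exactly the exercise the chapter is asking for.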

The last example, if you're interested, search for Markov algorithm: it's a model of computation that pattern matches the first occurrence of a string, then replaces it as per the matching rule, and repeats until it matches a terminating rule. If the string is 'tttabcdd' and the matching rule is 'abc' -> 'Z', then the first iteration is 'tttZdd'. That's what \(\alpha\theta\omega\) is in Knuth's description: the string \(\theta\) occurs between the strings \(\alpha\) and \(\omega\). I don't entirely get it either, but I see it's some kind of substitution operation with termination and that makes sense to me. Trying the exercises is illuminating, which is why I like this book.
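A minimal Markov algorithm interpreter makes the substitute-and-repeat loop concrete. This is my own sketch, not from the book; rules are (pattern, replacement, terminal) triples:

```python
# Minimal Markov algorithm sketch: scan the rules in order, replace the
# FIRST occurrence of the first matching pattern, and restart; a rule
# marked terminal stops the machine after it fires.

def markov(s, rules, max_steps=1000):
    for _ in range(max_steps):
        for pattern, replacement, terminal in rules:
            if pattern in s:
                s = s.replace(pattern, replacement, 1)  # first occurrence only
                if terminal:
                    return s
                break                # restart the rule scan from the top
        else:
            return s                 # no rule matched: halt
    raise RuntimeError("did not terminate")

# The example from the text: 'abc' -> 'Z' applied to 'tttabcdd'
rules = [("abc", "Z", False)]
print(markov("tttabcdd", rules))   # tttZdd
```

In \(\alpha\theta\omega\) terms: here \(\alpha\) is 'ttt', \(\theta\) is 'abc', and \(\omega\) is 'dd'.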

1.2.1 Induction Knuth teaches us strong induction with a simple algorithm, and more importantly tells us how we can find these properties to prove.

Strong induction is given its own algorithm: prove the property for the base case P(0) or P(1), then P(2), then P(3), and each time you are proving n+1; once you can assume everything up to P(n) and prove the next case as well, you're done and can conclude by implication that P(n) holds for all n. We skip 'weak induction' completely.

Knuth demonstrates this by showing how assertions made at each step of an algorithm lead you to discovering invariants and other facts, which can then be proved by strong induction simply by running the algorithm for the next n+1 case. Knuth then tells us there are no induction proofs in the book, because these assertions are usually given, so the algorithm is de facto proved. He reminds us again that stepping through these assertions, and trying to find them ourselves, is the only way to both prove and learn algorithms.

His non-algorithm math examples: one adds \(2n + 1\) to each side because that's the next odd number, the piece that takes you from case n to case n+1; if \(2^2 = 1 + 3\), then \(3^2 = 1 + 3 + (2 \cdot 2 + 1)\).
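Written out, the identity behind this example is the sum of the first n odd numbers, and the induction step is exactly the "add \(2n+1\)" move:

```latex
\[
1 + 3 + 5 + \dots + (2n - 1) = n^2
\]
Base case: \(n = 1\) gives \(1 = 1^2\).
Inductive step: assume the identity holds for \(n\); adding the next odd
number \(2n + 1\) to both sides gives
\[
n^2 + (2n + 1) = (n + 1)^2,
\]
so the identity holds for \(n + 1\), and by induction for all \(n \ge 1\).
```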

There are many examples of strong induction on YouTube. You can prove up to n-1 when proving a recursive algorithm, like in Robert Harper's SML book. You can prove n + 3 in that above Michael Penn video if it makes it easier. You can do what Knuth does and just keep proving every piece you throw on the stack, then claim that since you've proved n+1 the entire time, you're done. Try the end-of-chapter exercises and study them carefully. All the answers you want are there; it will take time to understand them, staring at the page and trying to figure out what's going on, but Knuth informs us we will never use these proofs in the entire text because they are 'self evident' if you write down enough invariants and assertions about how an algorithm works.

This is how this material works. You just stare at it enough and work through edge cases and you'll get it. It might take a while; you may have to sleep on it and wake up the next day before it finally clicks.


We are skipping all the TAOCP math chapters, because they will come up in the MIT lectures and then we can read them. I'm reading 1.3.1 Description of MIX. You don't have to read all of this either; just get the gist of what's going on, then when you write programs for the very foreign MIX computer you can look things up as you go.

Assembly programming: you have a register that you do all the operations in, and even though this is supposed to be 'least abstract', it's actually one of the most abstract things you will have to learn in the beginning, until you get what's happening. Reading 1.3.1, we have a 5-byte-word computer. Everything makes sense until we get to the STA operator on page 130. What is going on, the examples wtf. We are taking bytes from the right end of the register and stuffing them into the specified field of that memory location. So STA 2000(2:2) means the field is one byte wide: take one byte from the right of rA and stuff it into byte 2 of the memory word. If rA ends in ...9, 0, the 9 stays put and only the 0 is stored, because the field only has room for one byte; that's the 'shifted left' in the text. Wtf right? It's alien, but it makes sense the more you do this, and the idea carries over to partial-register stores on modern x86. There is an equation for STA, and if you look on Stack Exchange somebody will tell you exactly how to calculate this memory store, but it doesn't matter; just get the intuition of cramming bytes from the right into the field.
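The field-store behavior is easy to simulate. This is a sketch with a made-up helper `sta`, representing words as byte lists and ignoring the sign byte for simplicity:

```python
# Sketch of MIX's STA with a field specification (L:R), sign byte ignored:
# the field is R-L+1 bytes wide, and STA copies that many bytes from the
# RIGHT end of register A into bytes L..R of the memory word.

def sta(rA, mem, L, R):
    width = R - L + 1
    taken = rA[-width:]                  # rightmost `width` bytes of rA
    return mem[:L-1] + taken + mem[R:]   # splice into bytes L..R (1-indexed)

rA  = [6, 7, 8, 9, 0]    # register A, most significant byte first
mem = [1, 2, 3, 4, 5]    # contents of location 2000

print(sta(rA, mem, 2, 2))   # STA 2000(2:2): only rA's rightmost byte 0 lands in byte 2
print(sta(rA, mem, 1, 5))   # STA 2000(1:5): the whole word is replaced
```

The first call prints [1, 0, 3, 4, 5]: the 9 in rA never moves, because the one-byte field only takes the rightmost 0.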

Let's go to the next very confusing operator: page 132, Examples of arithmetic instructions, where Knuth describes: 'these examples have been prepared with the philosophy that it is better to give a complete, baffling description than an incomplete, straightforward one.'

The first example makes sense, though we're reading it wrong: you should read it as one whole number, not as independent fields, it's all one number. Example 2, SUB 1000: wtf, how does the 0 0 packed field minus 150 equal 140? Because we are not doing operation codes, this is memory, and recall 2 digits per byte, so this is 123400009 - 200001500. But wait, there's more: recall MIX is both a decimal computer AND a binary computer, so we never know if we're programming in decimal or binary. That's why there's a question mark: the last field's value isn't well defined.

Page 132, arithmetical operations: look at the third example, MUL 1000. It makes no sense until you remember each byte holds 0-63. So this is a word of five 1-bytes times another word of five 1-bytes, the base-64 analog of 101010101 x 101010101, and when you type that into a full precision calculator on some random internet site you get the analog of the result + 0 1 2 3 4 5 4 3 2 1 in the 10 bytes of rAX as MIX defines it.
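We can check this example numerically. This sketch (my own helpers `word_value` and `to_bytes`, not MIX code) treats a word as base-64 digits, squares the all-ones word, and reads the product back as ten bytes:

```python
# Checking the MUL 1000 example: with byte size 64, a word whose five
# bytes are all 1 has value 64^4 + 64^3 + 64^2 + 64 + 1; squaring it and
# re-reading the product as ten base-64 bytes reproduces Knuth's answer.

BASE = 64

def word_value(bytes_):        # bytes, most significant first -> integer
    v = 0
    for b in bytes_:
        v = v * BASE + b
    return v

def to_bytes(v, n):            # integer -> n base-64 bytes, most significant first
    out = []
    for _ in range(n):
        out.append(v % BASE)
        v //= BASE
    return out[::-1]

a = word_value([1, 1, 1, 1, 1])
print(to_bytes(a * a, 10))     # [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]
```

No carries occur because every digit of the product is at most 5, well under the byte size of 64, which is why the base-10 pattern 10203040504030201 looks the same.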