An Artificial Intelligence Curriculum


This is an (ongoing) workshop, Part II in the AI Tripos.

Start here

We start with CMU's 15-388/688 Practical Data Science from fall 2019 because it will teach us the 'full stack' so collection, processing, visualizations, building models using machine learning, and how to diagnose problems. When you finish this course you can get paid to do this in industry while we later take more advanced AI courses.

This course assumes some math and programming background, but the math is all applied in a blackbox way you'll be able to figure out, and if you know higher order functions map/filter/fold and lists from the software workshop you should be fine for the 15-388 assignments whereas the 15-688 assignments assume an entire undergrad already. .

Material req

Like the software workshop if all you have is access to a phone or tablet it's possible to complete this course using Google colab (free) otherwise the course uses Python3 and provides Vagrant files and instructions to set up your own environment. I went with miniconda. All the assignments are handed out as jupyter notebooks, you can certainly do them in PyCharm too if you wanted or vim even. The lectures are through panopto and open to the public.

Lecture 1 Introduction

We're watching the first lecture named introduction. If you want to download these lectures, right-click, open inspector, click on <head id="HeadNode"> and expand the tree, scroll down to <meta property="og:video" content =" …"> and cut and paste the .mp4 address into another browser tab, you can now save the video the locally. There's also notes for this lecture.

This is a great description of statistics, from a book we will be able to complete soon:

Statistics is the branch of mathematical engineering which designs and analyses methods for drawing reliable inferences from imperfect data. The subject of most sciences is some aspect of the world around us, or within us. Psychology studies minds; geology studies the Earth’s composition and form; economics studies production, distribution and exchange; mycology studies mushrooms. Statistics does not study the world, but some of the ways we try to understand the world — some of the intellectual tools of the other sciences. Its utility comes indirectly, through helping those other sciences. This utility is very great, because all the sciences have to deal with imperfect data.

Data may be imperfect because we can only observe and record a small fraction of what is relevant; or because we can only observe indirect signs of what is truly relevant; or because, no matter how carefully we try, our data always contain an element of noise. Over the last two centuries, statistics has come to handle all such imperfections by modeling them as random processes, and probability has become so central to statistics that we introduce random events deliberately (as in sample surveys). Statistics, then, uses probability to model inference from data. We try to mathematically understand the properties of different procedures for drawing inferences: Under what conditions are they reliable? What sorts of errors do they make, and how often? What can they tell us when they work? What are signs that something has gone wrong? Like other branches of engineering statistics aims not just at understanding but also at improvement: we want to analyze data better: more reliably, with fewer and smaller errors, under broader conditions, faster, and with less mental effort. – Advanced Data Analysis From an Elementary Point of View by Cosma Shalizi

All the examples he talks about you can see the slides better in the lecture notes. The one example about thatched roofs turned out to be a bad indicator as one student mentions. Googling this I discover: "There are many structures in Kenya and Uganda that have thatched roofs, but are not households, such as kitchens and sheds. These buildings were artificially inflating the count of households with thatched roofs, making areas look poorer than they were."

We don't have the recommended background but you'll see later most of these requirements are given a crash course enough to finish the assignments. The rest of this lecture is just course logistics, the masters version of this course 15-688 has extra homework assignments, which are up to you if you want to try them.

Lecture 2 Data collection and scraping

Next lecture when he introduces Jupyter notebook, you can export the notes for this lecture as a notebook to experiment around with the regex library, they are fully interactive can run them yourself, change anything to see how corner cases are matched, go through the Python tutorial looking up anything you don't understand like what range does or list/dict comprehensions.

The regex poll done in class is explained in the notes:


"These rules can of course be combined with the rules to match potentially very complicated expressions. For instance, if we want to match the text “something science” where something is any alphanumeric character, and there can be any number of spaces of any kind between something and the word “science”, we could use the expression r"\w+\s+science".

Regex practice

If you want some practice, try this.

Lecture 3 Jupyter notebooks

Watching lecture 3.

Colab If you must it's possible to use Google Colab (python v 3.6) online for free click on File -> New Notebook and enter in a throwaway or regular google account, a new notebook should appear. Free version of Colab, the VM will shut down after some indeterminate amount of time being idle, which you then have to reset with Runtime -> Restart Runtime. You can back up everything to a Google drive account or Github. You can redefine all the heavily used key shortcuts in Colab so if you're stuck with just a tablet, then redefine Shift-Enter (run and move to new cell) and Ctrl-Enter (run cell) to whatever you want. You can enter shell commands with ! so !cd or !ls or !python –version to match up to the online python documentation.

Local environment There is a vagrant installation method here (scroll down to environment) but it likely won't install anymore and you'll have to manually change the box in the Vagrantfile to another ubuntu distro. If you're wondering what Vagrant does in this context, it's a pile of scripts to download and configure a virtual environment, starts up virtualbox, you minimize that window and from command line 'vagrant ssh' into the running VM, so you can start jupyter notebook which gives you a localhost IP, and then whatever regular browser you use can go to that address and run the notebook. You can also install all this with your regular OS though a vm is nice to keep environments seperate as we'll installing pip packages for the course. I took his advice in the lecture and installed Anaconda, actually Miniconda, which has a cheat sheet for it's package manager (Python is so complex now that it needs distros). One reason anaconda exists is because there are countless scientific libraries and that all require a specific environment of python to run so anaconda solves this problem, and in many research labs and bigcorps you need written permission to download and install new software from the various layers of corporate security officers so anaconda full install can give you everything up front. Other solutions exist like NixOS or Guix in order to run multiple different versions of the same software.

His shill of python growth based on the amount of stackoverflow questions could also mean the documentation is terrible or error messages so cryptic that everybody needs to consult stackoverflow. For our use in this course, which is hacking around with data, scripting it into various datastructures, visualization, and machine learning, Python is the obvious choice because of the libraries available like Facebook's Pytorch. If whatever prototype you come up with needs to be deployed in some other language there is a C++ API for Pytorch.

I assume you have a notebook open while he's going through the lab. Note the syntax of python is nearly identical to what we've been using in the other workshop (pyret, with is python + racket) except each function needs a return statement.

  • Efficiency lists:
    • Accessing a list element O(1)
    • Appending O(1)
    • Inserting O(n) due to 'lists' being an array
    • If you forgot efficiency from CS019 see this
    • There's a somewhat accurate list of python builtins complexity

Note he uses a magic command for %%timeit.

  • Efficiency dicts:
    • Value lookup O(1)
    • Insert O(1)
    • Delete O(1)
      • Worse time constant than lists
      • Use for more complex type than integers

Classes, methods and OOP in python are explained here and here, more here and finally using sets and dictionaries in classes but we won't be using any kind of complex OOP or engineering software in this course just scripting together libraries.

Recitation 1

Let's do the first recitation. Download the Jupyter notebook for this recitation and untar it, open the .ipynb file in jupyter notebook and now you can run all the code.

If you're using Google Colab, look at this otherwise mount a google drive, save the homework .tar.gz file directly to your google drive, and use something like ZIP extractor to open it there. Other method is open a new notebook in colab and type the following, then File -> Open from drive and load the .ipnyb notbook and run all code blocks Runtime -> Run All

!tar -xvf recitation_1.tar.gz
!cp recitation_1.ipynb '/content/drive/My Drive' 

# note this can all be automated with python (pydrive) or google colab library

Run all the cells (Cell -> Run All). Recitation begins with notes about testing, in every homework assignment notebook it comes with a large amount of tests already written for us. The 'from' statement is of course importing a library. This recitation is self explanatory, notice what 'pprint(dict(response.headers))' did, returned the header in a dictionary with keys. Anytime you want try some examples yourself insert a code cell below what you want to manipulate.

They show that Python libraries can create encoding conflicts. Remember code points from cs019, Unicode/UTF-8 replaces ASCII, ISO 8859 and other encodings, and supports any written language characters. These encodings are all different ways to interpret raw bytes and represent textual data.

Regexp they want us to read a blog, it can be summarized as: break up your regexp into newlines so it's not a massive oneliner, use comments so you don't forget what you wrote, use a regexp tool like 'regex buddy' that allows you to paste in regexp and figure out what they do should you come across them in code, and use a real parser for parsing. The python documentation will explain metacharacters like *, ?, + which are repetition operators or ^ $. You can add more examples and run them in this recitation notebook.

Finite State Automata

A diversion into theoretical CS so you can understand regular languages and expressions as a finite automaton then all these Kleene star notations and repetition/assertion operators will make sense. You can understand DFAs even if you haven't taken set theory in the math workshop.

  • Finite Automata 15-251 Great Theoretical Ideas in CS
    • In the beginning the slides are slightly out of order, match with these slides.

The accepting state in a regular expression is a matching state. The 5 tuple he describes, is similar to the definition of an algorithm in the first chapter of The Art of Computer Programming vol 1 by Knuth. Now you know how to do those exercises should you ever read that book.

Around 59:25 we begin regular operations, concatenation (multiplication), union and star.

  • Example: Concatenation
    • \(L_1\) = {00,10}; \(L_2\) = {0,1} then \(L_1 \cdot L_2\) = {000, 001, 100, 101}
    • The dot notation is dropped in regex, abc is still multiplication
  • Example: Star, the concatenation of i copies of L for i > 0
    • If L = {0,1} then \(L^{2}\) = {00, 01, 10, 11} (L x L concatenated) and \(L^{*}\) = {none, 0, 00, 0000, 01, 111, 01110 etc}
    • Select any finite number of strings from the language and concat them all together (including possibility of no strings selected, as empty concat empty gives empty)

Their example regular expression of 'x starts and ends with the same char, in the language L of {a,b} then \(L^{*}\) would be {a, b, aa, bb, aaa, bbb, ababa, etc}

a(a|b)*a | b(a|b)*b | a | b which translates to: a concatenated with any amount of a's or b's, and ends with a, OR b concatenated with any amount of a's or b's, ending with b OR a OR b

There is also order precedence: Repetition like ?, star (highest), concat, union (lowest) unless you use parenthesis to denote order. For example ab* is a(b*).

Some regex and their corresponding state diagrams:

  • The automaton for "a?bc" (matches empty string as well)
  • Kleene star matching example and it's corresponding automaton for [fr]*og
    • the DFA explains why [fr]*og matches "og" as well as "frog", * will also match empty, so this regex is any combination of f or r (ffffff, or rr), or nothing, concatenated with og
  • The regex automaton for "[A-Z]+[aeiou]+[bd]?"
    • Two acceptance states, because [bd]? is none or b or d
  • The automaton for "[abc]+[123]*|[def]*[456]+"
  • The automaton for "([A-C][0-3][D-F])+"

Assignment 1

Walking through the first assignment the the rest won't be as verbose. Let's try homework 1. Untar it and run 'pip install -r requirements.txt'. Get Started notebook, you fill in rotate list and reverse dict, both of which are easy one liners straight from the previous lectures. Rotate list is return list slice + list slice, reverse dict is return swapped keys for a,b in d.items(). The second task is write some tests, notice this is an oracle, it consumes a function and then tests it. You need to add @test decorator and name your test function as specified in the notebook

def successor(n):
    return n + 1

def successor_test(successor):
    test.equal(successor(0), 1)
    test.equal(successor(1), 2)

The next notebook, Scraper, install requirements.txt again with pip. You unfortunately have to create a Yelp account to use the API, and this means relentless spam so use a throwaway email. How I'm doing this assignment is keep open the python tutorial, keep open the course notes which have almost all the information you'll need, keep open the library documentation for BeautifulSoup and Requests, and the Yelp API documentation.

Jupyter Notebook/IPython issues

  • If you get a persistent @test error 'TypeError: 'dict' object is not callable' or something similar, restart the kernel. You can try %history -g dict to see what command may have corrupted your state but I think I spent an hour trying to figure out this cryptic error only to discover it vanishes upon kernel reset.
  • Annoying indentation errors, don't use tab use spaces
  • Global variables, a few times I forgot I was actually hacking around with some global reponse variable 'r' I designated to slice apart some nested dictionary/list and test some things, only later to wonder why my tests weren't passing.
  • Jupyter config can be done in the browser dom or peristent config in a local .json file
  • Reserved keywords, to see what they are:
import keyword


Q0: Basic HTTP Requests

You may have to change the test to: test.true("This domain is for use in illustrative examples in documents." in text). This is just a requests.get(url), and return the status code, r.text

Q1: Authenticated HTTP Request with the Yelp API

We'll need the Yelp API documentation. A warning about the API, it may block your key for 24 hours if you do too many requests too quickly trying to figure out the assignment, I highly recommend retrieving once and then storing that in some global variable so you can hack around with the response unpacking the content instead of repeated calls. I used .content here instead of text to play around with the binary response in other cells, you would probably want .text

def yelp_search_test(yelp_search):
    total, business = yelp_search(read_api_key(), "Pittsburgh")
    test.true(abs(total - 2600) < 60)
    expected_keys = ['id', 'name', 'phone', 'review_count']
    if len(business):
        test.true(all(k in set(business[0].keys()) for k in expected_keys))

def yelp_search(api_key, query):
    Make an authenticated request to the Yelp API.

        query (string): Search term

        total (integer): total number of businesses on Yelp corresponding to the query
        businesses (list): list of dicts representing each business
    api = ''
    headers = {'Authorization': 'Bearer ' + api_key}
    params = {'location': query}
    r = requests.get(api, params=params, headers=headers).content
    s = json.loads(r)
    return s['total'], s['businesses']

Q2: Aquire all of the restaurants in Pittsburgh (on Yelp)

I cut + paste the yelp search function but added an extra parameter 'offset' and returned only r['businesses']. I also experimented here seeing how many python builtins I could chain together (again using binary r.content, you probably want r.text). This violates DRY principal obviously, if this wasn't an assignment you'd generalize yelp search to take a parameter of how many businesses you want from the API.

def all_restaurants(api_key, query, offset):
    Retrieve ALL the restaurants on Yelp for a given query.

        query (string): Search term

        results (list): list of dicts representing each business
    api = ''
    headers = {'Authorization': 'Bearer ' + api_key}
    params = {'location': query, 'limit': '50', 'offset': offset}
    r = json.loads(
        requests.get(api, params=params, headers=headers)
    return r["businesses"]

You could have this inside the all restaurants function or in a seperate cell and keep all restaurants generalized to call for any offset. There's probably a pythonic recommended way to do this, I'm just haphazardly scripting away on cruise control. You can check your code with pycodestyle –show-source (pip install pycodestyle) or use a linter for this in PyCharm and other IDEs.

# Yelp API docs: up to 1k results, max 50 returned at once
# Use offset parameter to get next 50, and so on
# Use time library so you don't get blocked

# Can also append to a dict w/content[i] for i in range()
content = []
for y in range(0, 1000, 50):
    content += list(all_restaurants(read_api_key(), 'Polish, Hill, Pittsburgh', y))

print([x['name'] for x in content])

Q2.5 Parse the API Responses and Extract the URLs

Jam the string into a json dict and use another list comprehension

def parse_api_response_test(parse_api_response):
    test.equal(parse_api_response(json_src), [''])

def parse_api_response(data):
    Parse Yelp API results to extract restaurant URLs.
        data (string): String of properly formatted JSON.

        (list): list of URLs as strings from the input JSON.
    dict_json = json.loads(data)
    return [x['url'] for x in dict_json['businesses']]

Q3: Parse a Yelp restaurant Page

Yelp has changed their page since this assignment was written, all the information is in a script tag now. There's some assignment tests to return 'page count' (20 per page) not the total count, which is integer division, and you have to cast the ratings to floats. What I did here was first dump the page into beautifulsoup and noticed what I wanted was all in the <script> tags, specifically the third element with that tag. Then I dumped the text into json.loads() and printed the keys to see which one's I could iterate on, and used a list comprehension to return the dict keys and page total as a tuple.

def parse_page(html):
   Parse the reviews on a single page of a restaurant.
       html (string): String of HTML corresponding to a Yelp restaurant
       Tuple[List[Dict], int]: a tuple of two elements
           first element: list of dictionaries corresponding to the extracted review information
           second element: number of pages total.
    soup = BeautifulSoup(html, 'html5lib')
    json_extract ="[type='application/ld+json']")[3]
    json_dict = json.loads(json_extract.text)
    return [{'author': x['author'],
            'rating': float(x['reviewRating']['ratingValue']),
            'date': x['datePublished'],
            'description': x['description']} 
           for x in json_dict['review']], json_dict['aggregateRating']['reviewCount']//20

Q3.5: Extract all the Yelp reviews for a Single Restaurant

This assignment the yelp url goes up by 20 each time, so after page one you add ?start=20, ?start=40, to offset the reviews. I actually did this assignment incorrectly and instead used the API to get all the reviews, then when rereading figured out I was supposed to crawl each page. Essentially this is the same as other assignments you can either grab the itemprop tags or you can grab the type=application/json in the <script> again, except you will have to use regex to clean up the beginning and end capturing everything after <!– and everything before –> which is grouping that we did in the lecture. I selected"[type='application/json']")[2] for every page after the first page, cleaned it w/regex, returned group(1) and loaded this in a json dict which then was trivial to use a comprehension to return all the desired nested values. The first page you extract the page count and then this is your range to loop over.

Now you can scrape any page you want, and are familiar with APIs.

Assignment 1 15-688 (Optional)

The graduate version of this course has it's own homework, and expects you already have an undergrad where you programmed in some OOP language for a year or so, so is up to you if you want to try the homework. For hw1_xml_parser.ipynb remember the automatons from the DFA lecture, keep one in your head and process the characters sending them to a non accepting loop or to a matching state, being aware of greedy matching vs lazy matching.

Q1: Regular expression for identifying tags

Answers go in the second code box, above test_snippet="bunch of xml":

... other regexs
comment = re.compile(r'<!--(.[^-]*)')
xml_prolog = re.compile(r'(xml version.+")')
html_declaration = re.compile(r'DOCTYPE \w*')

For my comment regex, I got [^-] from here "all of these chars except stopping at X" so stop on the - in –>. There's probably a hundred different ways to match these strings.

There is some weird spacing with their tests:

Mine   [('note', 'date="8/31/12"'), ('to', ''), ('from', ''), ('heading', 'type="Reminder"/'), ('body', ''), ('h1', ''), ('p', '')]
Theirs [('note', ' date="8/31/12"'), ('to', ''), ('from', ''), ('heading', ' type="Reminder"/'), ('body', ''), ('h1', ''), ('p', '')])

date and type='Reminder' have a space for some reason,  only test I failed

Q2: XML Parser

There is actually a paper floating around of somebody who parsed XML with a single regex but since XML is not a regular language you would never want to do this and instead use/make a real parser. For this assignment try the following then you'll get what you have to do, you already made tag_open, tag_close, comment etc., in the last assignment.

  • matchopen =, 117) #search(text-to-search, position)
    • <re.Match object; span=(138, 142), match='<to>'>
  • matchclosed =, 117)
    • <re.Match object; span=(146, 151), match='</to>'>

So you follow the algorithm they've described, testing with if-statements for a match, and moving around the matches using the .position methods like match.end() or span.

Q3: Searching for tags

I actually didn't try this one, since I spent longer than I wanted on the XML parser, and we can only do so many assignments on regex parsing hacks. It consumes an XMLNode and then searches for tags so isn't a lot of code to write. I put this in a todo notebook, anytime I run out of time for an assignment the unfinished ones go in there and later I go back and see if I can redo them, or if it's worth the time to redo them.

Lecture 4 Relational Data

Standard dbms crash course, more here if you're interested in learning some advanced sql like CTEs (common table expressions) where you just declare everything up front 'here's what I want' in a single statement and the dbms optimizes the query for you. Reminder you can load the notes for this lecture in jupyter notebook and test out all the Pandas functions as he introduces them.

Lecture 5 Visualizations

This lecture is about making your own visualizations for data exploration much like how we wrote examples in CS019 in the software workshop to better understand the full domain of the problem. Some good advice in here about how not to visualize, like never using a pie chart. There's no code in this lecture.

Lecture 6 Linear Algebra

None of this will make sense unless you've done a bit of linear algebra already, so let's get a crash course.

Vector Boot Camp

This crash course from Brown University, 3 tutorials on vectors, scalars, dot/cross product and a worksheet of exercises with solutions to practice.

Let's do part 1:

  • Some of this we've already seen in the math workshop: adding, subtracting, scalar multiplication
  • The rest of this part covers the law of cosines using vectors (assuming you've already watched the trig bootcamp)

Pretty easy, let's see part 2:

  • If you've also done the software workshop, the dot product is the first assignment
  • Dot product properties are at 13:55
  • Vector and scalar projection (the rest of the video) is optional, part 3 of the bootcamp also optional

The practice solutions the first answer is obviously wrong, that's 4v not 4w. The rest are straight forward if you remember how to multiply square roots, for example 1(e) is \(\sqrt48\) which simplifies to \(\sqrt 16\) * \(\sqrt 3\) or \(4\sqrt 3\).

We can't do the cross product exercises yet since we haven't watched part 3 which we'll cover eventually in the math workshop. The cross product produces a vector instead of a scalar (dot product) and only works 'nicely' in 3 and 7 dimensional spaces.

Linear Algebra Review

The same prof has a linear algebra crash course where this material is better explained, with notes. You won't need all of this playlist for the course. Ian Goodfellow (author of Deep Learning) used these notes when he took Andrew Ng's machine learning class at Stanford. Of course linear algebra is a huge subject, with a geometric interpretation, and we'll need to know much more of it which is what the math workshop is for.

Take the notes with some pieces of paper and make your own examples. You'll see really you just need to know how the dimensions get manipulated, and the formulas for the various products and sums and you're good to go for the purposes of this course. We are interested in programming linear algebra using libraries not doing it by hand, so a high level overview of all the rules and operations are all that's needed. All these operations have a geometric explanation such as inverse matrices but the math workshop will explain this better than even those 3blue1Brown animations since it is geometric focused from the beginning.

Going through this playlist:

  • Introduction: we've seen this already in the math workshop
  • Basic Notation: transpose is also described here. Notation is \(R^{Row \times Column}\)
  • Basic Operations 1: There are examples of these operations here with actual values. The properties that allow multiplication described at 2:30 is all you need to understand the rest of this video. Notice the dimensions follow the rules, where the two inner dimensions must be equal, ie: A \(\in R^{n \times n} \cdot\) B \(\in R^{n \times p}\) the inner n's are equal size so this operation is possible. That's why these various properties like associativity works, since AB have same dimensions, and BC have same dimensions.
  • Basic Operations 2: All these operations are equivalent (you'll get the same output). You don't have to memorize this because there are notes to refer to inner and outer vector products.
  • Special Matrices: Identity square (n x n) matrix, Zero matrix, all-ones vector, standard basis vector, square symmetric matrices, diagonal matrices (also square n x n) to scale columns and rows depending on how you multiply another matrix with the diagonal, right side scales columns, left scales rows.
  • Matrix inverses and solving linear eq: non-singular has an inverse, singular matrix does not. The inverse if you're wondering exists so you can do 'division' with a matrix, since there is no concept of dividing a matrix.
  • Matrix functions: all the above operations but now with function notation, the trace, vector norms, determinant.
  • Putting equations into matrix form 1, 2, and 3: Various examples on how you can use previous definitions to rewrite equations into a single matrix expression. Refer to the notes, write your own examples with tiny matrices.
  • (Optional) Advanced topics: Range, Linear Independence, Rank, Orthogonal vectors/matrix, Nullspace, Eigenvalues and Quadratic forms I didn't watch since we won't need them yet.

15-388 Linear Algebra Lecture 6

Now we can watch this lecture, we have an idea of linear algebra and how multiplication of matrices works enough so that we can program linear algebra using Numpy. todo

Assignment 2