Compilers

  • StanfordOnline: SOE.YCSCS1

A ten-week undergraduate course in compilers. The project of the course is to implement a compiler for Cool.

Although no textbook is required, recommended readings are provided for the interested student.

Suggested readings by topic:

  • Lexical Analysis and Finite Automata
    • CPTT: Sections 3.1, 3.3, 3.4, 3.6, 3.7, 3.8
    • EC: Chapter 2 through Section 2.5.1 except for 2.4.4
    • MCI: Chapter 2
  • Parsing
    • CPTT: Sections 4.1-4.6, 4.8.1, 4.8.2
    • EC: Sections 3.1-3.5
    • MCI: Chapter 3
  • Semantic Analysis and Types
    • CPTT: Sections 5.1-5.3, 6.3, 6.5
    • EC: Sections 4.1-4.4
    • MCI: Chapters 4 and 5
  • Runtime Systems and Code Generation
    • CPTT: Sections 6.2, 7.1-7.4, 8.1-8.3, 8.6
    • EC: Chapter 5, Sections 7.1-7.2
    • MCI: Chapters 6, 7, and 14
  • Optimization
    • CPTT: Sections 8.5, 8.7, 9.1-9.4
    • EC: Sections 8.1-8.4, 8.6, 9.1-9.3
    • MCI: Chapter 10
  • Advanced Topics: Register Allocation, Garbage Collection
    • CPTT: Sections 7.5-7.6, Section 8.8
    • EC: Sections 13.1-13.2, 13.4
    • MCI: Chapters 11 and 13

Week 1: Introduction & the Cool Programming Language

This week has 1:35:11 of lecture videos.

01-01: Introduction

A brief introduction: what a compiler is versus an interpreter, and when and why programming languages (starting with FORTRAN) were developed.

01-02: Structure of a Compiler

Compilation works in five phases:

  1. Lexical analysis divides the program text into tokens.
  2. Parsing applies the grammar of the language to the token stream--determining whether the string of tokens is valid and uncovering its structure.
  3. Semantic analysis attempts to understand things like types--the meaning of the text.
  4. Optimization attempts to modify the program so that, for example, it uses less time or space.
  5. Code generation actually produces the machine code.
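
As an illustrative example (not from the lecture), consider the statement x = y + 2. Lexical analysis produces the tokens id(x), =, id(y), +, int(2); parsing builds a tree with the assignment at the root and a + node (with children y and 2) beneath it; semantic analysis checks, for instance, that y is declared and has a type that can be added to an integer; optimization might replace y + 2 with a constant if y is known to be constant; and code generation emits the corresponding machine instructions.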

01-03: The Economy of Programming Languages

Programming languages are created to serve the different requirements of different application domains.

Most of the cost associated with using a new programming language is in training programmers, which leads to a few facts:

  1. New languages look like old languages, both because old languages might have good ideas, and because programmers are familiar with old languages, so require less training to become conversant in the new language.
  2. Popular languages can change only very slowly, because the cost of retraining a community of programmers is very high.
  3. New languages can be created and changed very 'cheaply', since they have few users.

02-01: Cool Overview

Cool is the Classroom Object Oriented Language. It is designed (by the instructor of this course, in fact) to be easy to implement.

This class will create a compiler from Cool to MIPS assembly language. There will be five programming assignments:

  1. Write a Cool program
  2. Lexical analysis
  3. Parsing
  4. Semantic analysis
  5. Code generation

Optimization will be touched on, as well.

A simple hello world program like the following is demonstrated:

class Main inherits IO {
    main(): SELF_TYPE {
        out_string("Hello, World.\n")
    };
};

02-02: Cool Example II

A program is demonstrated that takes an integer input and calculates the factorial, using both recursive and iterative techniques.

class Main inherits A2I {

    main() : Object {
        (new IO).out_string(i2a(fact(a2i((new IO).in_string()))).concat("\n"))
    };

    fact(i: Int): Int {
        let fact: Int <- 1 in {
            while (not (i = 0)) loop
                {
                    fact <- fact * i;
                    i <- i - 1;
                }
            pool;
            fact;
        }
    };
};

02-03: Cool Example III

Creating new classes is demonstrated. A list class that can print a string representation of itself is created.

class List inherits A2I {
    item: Object;
    next: List;

    init(i: Object, n: List): List {
        {
            item <- i;
            next <- n;
            self;
        }
    };

    flatten(): String {
        let string: String <-
            case item of
                i: Int => i2a(i);
                s: String => s;
                o: Object => { abort(); ""; };
            esac
        in
            if (isvoid next) then
                string
            else
                string.concat(next.flatten())
            fi
    };
};

class Main inherits IO {

    main(): Object {
        let hello: String <- "Hello ",
            world: String <- "World!",
            i: Int <- 42,
            newline: String <- "\n",
            nil: List,
            list: List <-
                (new List).init(hello,
                    (new List).init(world,
                        (new List).init(i,
                            (new List).init(newline, nil))))
        in
            out_string(list.flatten())
    };
};

Week 2: Lexical Analysis & Finite Automata

This week has 2:22:30 of lecture videos.

03-01: Lexical Analysis

The goal of lexical analysis is to divide the program text into its lexical units, called lexemes, and to classify each unit into the appropriate token class, such as identifiers, keywords, individual symbols, numbers, etc. The lexeme together with its token class make up a pair called a token.
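
For example (an illustrative fragment, not from the lecture), in foo = 42 the lexemes foo, =, and 42 would typically be classified as the tokens ⟨identifier, "foo"⟩, ⟨operator, "="⟩, and ⟨integer, "42"⟩.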

03-02: Lexical Analysis Examples

In FORTRAN, whitespace is not significant. For example, VA R1 is the same as VAR1. As a result, DO 5 I = 1,25 begins a loop, while DO 5 I = 1.25 is an assignment to a variable--it is equivalent to DO5I = 1.25. Deciding whether DO is a loop keyword or part of a variable name requires looking at what comes later in the statement. This is called lookahead.

In PL/I, keywords are not reserved. So something like this is valid:

IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN

Also in PL/I:

DECLARE(ARG1, ..., ARGN)

In this statement, DECLARE could, depending on context, be either a keyword or an array reference (if it is followed by =). Since there might have been any number of arguments (indices), it requires unlimited lookahead to determine which of the two is intended.

03-03: Regular Languages

Regular languages are used to specify the set of strings that belong to a token class. A regular language is usually described using regular expressions.

The language denoted by the regular expression 'c', for a single character c, is the single-string language { "c" }.

The language denoted by ε is the language consisting of just the empty string, { "" }. Note that this is not the empty language!

The union of two regular languages is a regular language: A + B = { s | s ∈ A or s ∈ B }.

The concatenation of two regular languages is a regular language: AB = { ab | a ∈ A and b ∈ B }.

The iteration of a regular language (i.e. its concatenation with itself any finite number of times, including zero) is a regular language: A* = A⁰ + A¹ + A² + ⋯, where A⁰ = { ε }.

This is called the Kleene closure or Kleene star.

The regular expressions over an alphabet Σ are the smallest set of expressions including ε, the single-character expressions for each character in Σ, and closed under union, concatenation, and iteration.

03-04: Formal Languages

A meaning function L maps expressions to the sets of strings that they denote: L(e) is the language denoted by the expression e.

This formalism is worthwhile because:

  1. It allows us to clearly distinguish syntax from semantics
  2. It allows us to consider notation as a separate issue
  3. Syntax does not map one-to-one to semantics: it is many-to-one.

Consider Roman vs. Arabic numerals. These have identical semantics, but very different syntax, and the latter is much more convenient for computation.

03-05: Lexical Specifications

  • keywords: for example, 'if', 'then', 'else'
  • integers: a non-empty string of digits, so:
    • digit = '0', '1', ...
    • integers = digit digit*
    • or: integers = digit+
  • identifier: a string of letters or digits, starting with a letter
    • letter = 'a', 'b'..., 'A', ...
    • or: letter = [a-zA-Z]
    • identifier = letter (letter + digit)*
  • whitespace: a non-empty sequence of blanks, newlines, and tabs
    • whitespace = (' ', '\n', '\t')+

Comment

Aiken gives email addresses as a simple example of something that can be described by regular expressions, though he admits that to fully parse them would be slightly more complicated. It's more than just slightly more complicated, actually. Here is some discussion on that, and a look at any given email validator library will further illuminate the issue.

04-01: Lexical Specification

  1. Write regular expressions for each token class, e.g. identifier, whitespace.
  2. Construct R, matching all lexemes for all tokens (R = Keyword + Identifier + ..., a union of all the above regular expressions).
  3. Call the input x₁…xₙ, and for each 1 ≤ i ≤ n check whether x₁…xᵢ ∈ L(R).
  4. If so, then x₁…xᵢ ∈ L(Rⱼ) for some j.
  5. Then remove x₁…xᵢ from the input string and go to step 3.

What if there are two matching prefixes, i.e. there exist i < j such that both x₁…xᵢ ∈ L(R) and x₁…xⱼ ∈ L(R)? Then we take the maximal munch, which is the longest matching prefix.

What if multiple tokens match? For example, an otherwise valid identifier might also be a keyword. The usual rule is that keywords win, which is accomplished by a priority ordering, e.g. choose the regexp that was listed first.

What if no rule matches? Add a rule for errors that matches all strings not in the language, and put it last in priority.

There are good algorithms that accomplish this efficiently and require only a single pass over the input.
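
As a sketch of how maximal munch, priority ordering, and an error rule fit together, here is a small hand-rolled tokenizer in C++ (not the course's code; the token classes, names, and input are illustrative). At each position it tries every token regular expression anchored at the current position, keeps the longest match, lets the earlier-listed class win ties (so keywords beat identifiers), and falls back to an error token consuming one character when nothing matches:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

struct Rule { std::string cls; std::regex re; };

int main() {
    // Priority order: keywords are listed before identifiers so they win ties.
    std::vector<Rule> rules = {
        {"keyword",    std::regex("if|then|else")},
        {"identifier", std::regex("[a-zA-Z][a-zA-Z0-9]*")},
        {"integer",    std::regex("[0-9]+")},
        {"whitespace", std::regex("[ \t\n]+")},
    };

    std::string input = "if x1 then 42 else iffy";
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::string best_cls = "error";  // error rule: lowest priority
        std::size_t best_len = 0;
        for (const Rule& r : rules) {
            std::smatch m;
            // match_continuous anchors the match at the current position.
            if (std::regex_search(input.cbegin() + pos, input.cend(), m, r.re,
                                  std::regex_constants::match_continuous)) {
                std::size_t len = static_cast<std::size_t>(m.length(0));
                // Maximal munch: a strictly longer match wins; on a tie the
                // rule listed first (higher priority) is kept.
                if (len > best_len) { best_cls = r.cls; best_len = len; }
            }
        }
        if (best_len == 0) best_len = 1;  // error token: consume one character
        std::cout << "<" << best_cls << ", \"" << input.substr(pos, best_len) << "\">\n";
        pos += best_len;
    }
}

On the sample input, iffy comes out as an identifier (maximal munch) even though its first two characters spell a keyword, while the leading if is a keyword because of the priority ordering. A real lexer would typically discard the whitespace tokens rather than emit them.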

04-02: Finite Automata

A finite automaton consists of:

  • An input alphabet Σ
  • A finite set of states S
  • A start state n
  • A set of accepting states F ⊆ S
  • A set of transitions, each taken on a particular input: s₁ ⟶ᵃ s₂ means that from state s₁, on input a, the automaton moves to state s₂

There is a graphical notation for finite automata: states are drawn as circles, accepting states as double circles, the start state is marked with an incoming arrow, and transitions are arrows between states labeled with the input that triggers them.

The language of a finite automaton is the set of all strings that it accepts.

A finite automaton can have another type of transition: ε-moves. These allow the machine to change state without consuming input.

A deterministic finite automaton (DFA):

  • has only one transition per input per state
  • no ε-moves
  • takes exactly one path through the state graph per input

A non-deterministic finite automaton (NFA):

  • may have multiple transitions per input per state
  • may have ε-moves
  • may take different paths through the graph for the same input
  • accepts if any path terminates in an accepting state

Multiple transitions per input per state can be simulated by ε-moves, though it makes the automaton appear more complicated.

Both NFAs and DFAs recognize the same set of languages, the regular languages. DFAs are faster to execute, but NFAs are, in general, smaller than the equivalent DFA.

04-03: Regular Expressions into NFAs

For each regular expression M, we will define an equivalent NFA. This is done by composing the machines for each part of M using a few simple rules, one per regular-expression construct: for example, the machine for AB connects the accepting state of the machine for A to the start state of the machine for B with an ε-move, and the machine for A + B adds a new start state with ε-moves into both sub-machines.

04-04: NFA to DFA

Define the ε-closure of a state as the set of all states that may be reached from it by any number of ε-moves, plus the state itself.

Given an NFA with states S, start state s ∈ S, and final states F ⊆ S, define, for a set of states X ⊆ S and an input a,

a(X) = { y | x ∈ X and x ⟶ᵃ y }

and extend the ε-closure operation to a set of states by taking the union of the ε-closures of its members.

The states of the DFA will be the nonempty subsets of S. The start state will be the ε-closure of s. The final states of the DFA will be { X | X ∩ F ≠ ∅ }.

There will be a transition X ⟶ᵃ Y in the DFA if Y = ε-closure(a(X)).

Converting the NFA from the last section to a DFA with this construction is demonstrated in the lecture.
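
Here is a minimal sketch of the subset construction in C++, assuming an NFA represented with integer states and a transition map in which the symbol '\0' stands for an ε-move (the representation and names are illustrative, not the course's):

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative NFA: integer states; transitions keyed by (state, symbol).
struct NFA {
    int start;
    std::set<int> accepting;
    std::map<std::pair<int, char>, std::set<int>> delta;  // '\0' = epsilon
};

// epsilon-closure: everything reachable from 'states' by epsilon-moves,
// plus the states themselves.
std::set<int> epsClosure(const NFA& n, std::set<int> states) {
    std::vector<int> work(states.begin(), states.end());
    while (!work.empty()) {
        int s = work.back(); work.pop_back();
        auto it = n.delta.find({s, '\0'});
        if (it == n.delta.end()) continue;
        for (int t : it->second)
            if (states.insert(t).second) work.push_back(t);
    }
    return states;
}

// Subset construction: each DFA state is a set of NFA states.
void subsetConstruct(const NFA& n, const std::string& alphabet) {
    std::vector<std::set<int>> dstates = {epsClosure(n, {n.start})};
    std::map<std::pair<int, char>, int> dtrans;
    for (std::size_t i = 0; i < dstates.size(); ++i) {
        for (char a : alphabet) {
            std::set<int> moved;                        // a(X)
            for (int s : dstates[i]) {
                auto it = n.delta.find({s, a});
                if (it != n.delta.end())
                    moved.insert(it->second.begin(), it->second.end());
            }
            if (moved.empty()) continue;
            std::set<int> next = epsClosure(n, moved);  // eps-closure(a(X))
            std::size_t j = 0;
            while (j < dstates.size() && dstates[j] != next) ++j;
            if (j == dstates.size()) dstates.push_back(next);
            dtrans[{static_cast<int>(i), a}] = static_cast<int>(j);
        }
    }
    std::cout << "DFA has " << dstates.size() << " states\n";
    for (const auto& [key, to] : dtrans)
        std::cout << key.first << " --" << key.second << "--> " << to << "\n";
}

int main() {
    // NFA for strings over {a, b} ending in "ab":
    // state 0 loops on a and b, 0 -a-> 1, 1 -b-> 2 (accepting).
    NFA n{0, {2}, {{{0, 'a'}, {0, 1}},
                   {{0, 'b'}, {0}},
                   {{1, 'b'}, {2}}}};
    subsetConstruct(n, "ab");
}

Running this yields a three-state DFA for the example NFA, as the construction predicts.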

04-05: Implementing Finite Automata

A DFA can be implemented by a 2D table T:

  • One dimension is states
  • The other dimension is input symbols
  • For every transition from state i to state k on input a, define T[i, a] = k

It can also be implemented as a one-dimensional table indexed on states and pointing to a table for the inputs. This would be more efficient if there are duplicated rows in T.
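
A minimal sketch of the 2D-table approach, for a DFA that accepts integer literals (digit+); the states and table below are illustrative, not taken from the course:

#include <cctype>
#include <iostream>
#include <string>

// DFA for digit+ implemented with a 2D table T.
// States: 0 = start, 1 = seen at least one digit, 2 = dead (reject).
// Inputs are collapsed to two classes: 0 = digit, 1 = anything else.
int main() {
    const int T[3][2] = {
        {1, 2},  // state 0: digit -> 1, other -> dead
        {1, 2},  // state 1: digit -> 1, other -> dead
        {2, 2},  // state 2: dead state loops on everything
    };
    const bool accepting[3] = {false, true, false};

    for (std::string s : {"42", "7x", ""}) {
        int state = 0;
        for (char c : s)
            state = T[state][std::isdigit(static_cast<unsigned char>(c)) ? 0 : 1];
        std::cout << '"' << s << "\" " << (accepting[state] ? "accepted" : "rejected") << "\n";
    }
}

Note that rows 0 and 1 of T are identical; this is exactly the kind of duplication that the one-dimensional, pointer-per-state variant described above can share.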

Alternatively, an NFA could be used directly, with a table containing sets of possible states.

Week 3: Parsing & Top-Down Parsing

05-01: Introduction to Parsing

Consider the language consisting of the set of all balanced parentheses: { (ⁱ )ⁱ | i ≥ 0 }.

This is not a regular language, and so cannot be expressed with regular expressions.

In general, a finite automaton can only count things mod k for some chosen k; it cannot do the unbounded counting needed to match nested parentheses.

Phase    Input                  Output
Lexer    String of characters   String of tokens
Parser   String of tokens       Parse tree

05-02: Context Free Grammars

A context-free grammar (CFG) consists of:

  • a set of terminals, T
  • a set of non-terminals, N
  • a start symbol, S ∈ N
  • a set of productions, X → Y₁ … Yₙ, where X ∈ N and each Yᵢ ∈ N ∪ T ∪ { ε }

The productions may be viewed as a set of rules about substitutions that allow us to transform from a non-terminal start state to a state consisting only of terminals.

The terminals have no rules for replacement; they should be the tokens of the language.

Let G be a CFG with start symbol S. Then the language of G is L(G) = { a₁ … aₙ | S ⟶* a₁ … aₙ and every aᵢ is a terminal }.

Many grammars generate the same language. Tools may differ in how a grammar must be written.

05-03: Derivations

A derivation is a sequence of productions applied starting from the start symbol. A derivation can also be drawn as a tree, where the children of a node are the symbols on the right-hand side of the production applied to it. This is the parse tree of the string.

A parse tree has terminals at the leaves, and non-terminals at the interior nodes.

An in-order traversal of the leaves is the original input.

In general, a parse tree may have many derivations. For parser implementation, we will be most interested in left-most derivations, in which the left-most children of the parse tree are resolved into terminals first, and right-most derivations, for the opposite case. These differ only on the order in which they are produced--the final parse tree is the same.
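
As a small worked example (using the illustrative ambiguous expression grammar E → E + E | E * E | ( E ) | id, not necessarily the lecture's exact one), a left-most derivation of id * id + id is:

E ⟹ E + E
  ⟹ E * E + E
  ⟹ id * E + E
  ⟹ id * id + E
  ⟹ id * id + id

The corresponding right-most derivation expands the right-most non-terminal at each step instead; it produces the same parse tree, only the order of the steps differs.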

05-04: Ambiguity

A grammar is ambiguous if it has more than one parse tree for some string.

The ambiguity might be resolved by rewriting the grammar to eliminate the ambiguity (without changing the language).

Alternatively, accept the ambiguity, but define some disambiguating rules, such as precedence and associativity.
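
For instance (an illustrative rewrite, not necessarily the one shown in the lecture), the ambiguous grammar E → E + E | E * E | ( E ) | id can be stratified so that * binds tighter than + and both operators associate to the left:

E → E + T | T
T → T * F | F
F → ( E ) | id

Each string now has exactly one parse tree. Alternatively, parser generators typically let you keep the ambiguous grammar and declare the precedence and associativity of + and * directly.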

06-01: Error Handling

Three major kinds of error handling behavior:

  • Panic mode
  • Error productions
  • Automatic local or global correction

The first two are in use today.

In panic mode, when an error is detected, the parser discards tokens until it finds a synchronizing token, and then begins again. For example, in (1 ++ 2) + 3, upon encountering the second + the compiler might skip ahead until it finds an integer, and continue from there.

Error productions specify known, common mistakes in the grammar. For example, writing 5 x instead of 5*x. This complicates the grammar, but allows alternate syntax for these cases.

Error correction attempts to find a 'nearby' program that does not have the error. You might try to find a program with a low edit distance, for example, that compiles. However, this is difficult to implement, slow, and may yield a program that compiles but behaves in the wrong way.

06-02: Abstract Syntax Trees

An abstract syntax tree (AST) is like a parse tree, but with some details ignored. For example, since the tree structure captures the nesting of the program, we don't need to keep track of the parentheses once parsing is complete.
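
A minimal sketch of AST nodes for arithmetic expressions in C++ (the class names are illustrative, not the course's): note that the tree built for 5 * (2 + 3) contains no node for the parentheses, because the nesting of the tree already records the grouping.

#include <memory>

// Illustrative AST node classes for arithmetic expressions.
struct Expr {
    virtual ~Expr() = default;
};

struct IntLit : Expr {
    int value;
    explicit IntLit(int v) : value(v) {}
};

struct BinOp : Expr {
    char op;                         // '+', '*', ...
    std::unique_ptr<Expr> lhs, rhs;
    BinOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
};

int main() {
    // AST for 5 * (2 + 3): no parenthesis nodes, just nesting.
    auto tree = std::make_unique<BinOp>('*',
        std::make_unique<IntLit>(5),
        std::make_unique<BinOp>('+',
            std::make_unique<IntLit>(2),
            std::make_unique<IntLit>(3)));
    (void)tree;  // a real compiler would now walk this tree
}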

06-03: Recursive Descent Parsing

Recursive descent is a top-down parsing algorithm. It constructs the parse tree from the top, and from left to right.

06-04: Recursive Descent Algorithm

For the grammar

E → T | T + E
T → int | int * T | ( E )

a recursive-descent parser might be

TOKEN *next;  // pointer to the next input token

// Check for a particular terminal and advance the input.
bool term(TOKEN tok) { return *next++ == tok; }

// Forward declarations, so the alternatives below can refer to E() and T().
bool E();
bool T();

bool E_1() { return T(); }
bool E_2() { return T() && term(PLUS) && E(); }

bool E() { TOKEN *save = next; return (next = save, E_1())
                                   || (next = save, E_2()); }

bool T_1() { return term(INT); }
bool T_2() { return term(INT) && term(TIMES) && T(); }
bool T_3() { return term(OPEN) && E() && term(CLOSE); }

bool T() { TOKEN *save = next; return (next = save, T_1())
                                   || (next = save, T_2())
                                   || (next = save, T_3()); }

06-04-1: Recursive Descent Limitations

The above parser will fail for input like int * int: E() succeeds via E_1() and T_1() after matching only the first int, so the parser concludes it is done, leaving * int unconsumed.

It fails as a whole because if a production for a non-terminal X succeeds, the parser has no way to go back and try an alternative production for X.

This is a limitation of the implementation, and not the overall strategy, though: general recursive-descent parsers support such "full" backtracking.

06-05: Left Recursion

Consider a production S → S a:

bool S_1() { return S() && term(a); }
bool S() { return S_1(); }

This obviously contains an infinite recursion.

A left-recursive grammar has a non-terminal S such that S ⟶⁺ S α for some α. Recursive-descent parsers do not work for left-recursive grammars.

Consider the grammar S → S a | b, which generates all strings consisting of a single b followed by any number of a's.

This grammar is left-recursive. The same language can be described by a right-recursive grammar: S → b S', S' → a S' | ε.

A general algorithm for eliminating left-recursion (in essence, rewriting A → A α | β as A → β A', A' → α A' | ε) can be found in Compilers: Principles, Techniques, and Tools, Second Edition. Parsing tools can and do eliminate left-recursion automatically.

Week 4: Bottom-Up Parsing I & II

07-01: Predictive Parsing

Predictive parsers are top-down, like recursive-descent, but they can 'predict' the correct production to use by looking at upcoming tokens (lookahead), without backtracking.

These parsers restrict the grammars they accept to LL(k) grammars: the input is scanned Left-to-right, a Left-most derivation is produced, and k tokens of lookahead are used.

In LL(1) grammars, for each step, there is at most one choice of production.

Given the grammar

E → T + E | T
T → int | int * T | ( E )

we can left-factor the grammar to remove common prefixes:

E → T X
X → + E | ε
T → int Y | ( E )
Y → * T | ε

This grammar is converted to a parse table, and the unprocessed frontier of the parse is recorded on a stack. When the top of the stack is a non-terminal, the parse table entry for that non-terminal and the next input token says which production's right-hand side to push in its place; when the top of the stack is a terminal, it must match the next input token, and both are consumed. In the end, we should end up with an empty stack and all input consumed.
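
For the left-factored grammar above, the parse table and a trace of parsing int * int $ work out as follows ($ marks the end of the input, blank entries are errors; this is reconstructed from the grammar and is consistent with the First and Follow sets computed in the next two sections):

       int      *      +      (      )      $
E      T X                    T X
X                      + E           ε      ε
T      int Y                  ( E )
Y               * T    ε             ε      ε

Stack          Input          Action
E $            int * int $    E → T X
T X $          int * int $    T → int Y
int Y X $      int * int $    match int
Y X $          * int $        Y → * T
* T X $        * int $        match *
T X $          int $          T → int Y
int Y X $      int $          match int
Y X $          $              Y → ε
X $            $              X → ε
$              $              accept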

07-02: First Sets

Consider a non-terminal A, a production A → α, and a token t. Then the parse table entry for (A, t) should be α in two cases.

First, if α ⟶* t β: α can derive a t in the first position, and we say that t ∈ First(α).

Alternatively, if A ⟶* α ⟶* ε and S ⟶* β A t δ: then the only way to make progress is to eliminate A (by deriving ε from it) and proceed, which is useful exactly when t can follow A in some derivation. In this case, we say that t ∈ Follow(A).

Formally, let us define First(X) = { t | X ⟶* t α } ∪ { ε | X ⟶* ε }.

The algorithm to compute First is:

  1. First(t) = { t }, where t is a terminal.
  2. ε ∈ First(X):
    • if X → ε, or
    • if X → A₁ … Aₙ and ε ∈ First(Aᵢ) for each 1 ≤ i ≤ n.
  3. First(α) ⊆ First(X) if X → A₁ … Aₙ α and ε ∈ First(Aᵢ) for each 1 ≤ i ≤ n.

Consider the left-factored version of our favorite grammar:

E → T X
X → + E | ε
T → int Y | ( E )
Y → * T | ε

Let us compute the first sets. For a given terminal, the first set is simply the set containing that terminal. For the other cases:

First(T) = { int, ( }. Since E → T X and T cannot derive ε, we know that First(E) = First(T) = { int, ( }.

And for the other non-terminals: First(X) = { +, ε } and First(Y) = { *, ε }.

07-03: Follow Sets

Formally, define Follow(X) = { t | S ⟶* β X t δ }.

Intuitively, we can see that if X → A B then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B).

Also, if B ⟶* ε, then Follow(X) ⊆ Follow(A).

If S is the start symbol, then $ ∈ Follow(S).

The algorithm to compute follow sets goes:

  1. First(β) − { ε } ⊆ Follow(X) for each production A → α X β.
  2. Follow(A) ⊆ Follow(X) for each production A → α X β where ε ∈ First(β).

Given the same grammar we worked with before, compute the follow sets. We can see these relations: Follow(E) and Follow(X) are subsets of each other (so they are equal), First(X) − { ε } ⊆ Follow(T), Follow(E) ⊆ Follow(T) (because X can derive ε), and Follow(T) and Follow(Y) are likewise equal. In addition, T → ( E ) puts ) into Follow(E), and $ ∈ Follow(E) since E is the start symbol.

And therefore: Follow(E) = Follow(X) = { ), $ } and Follow(T) = Follow(Y) = { +, ), $ }.

07-04: LL(1) Parsing Tables

Not all grammars are LL(1). For example, a grammar is not LL(1) if it is:

  • not left factored
  • left recursive
  • ambiguous

Most programming language CFGs are not LL(1).

07-05: Bottom-Up Parsing

Bottom-up parsing is more general than top-down parsing, but it is just as efficient. It is the preferred method used by most parsing tools.

Bottom-up parsers don't need left-factored grammars.

Bottom-up parsing reduces a string to the start symbol by inverting productions. Reading the sequence of reductions backward gives a right-most derivation of the string.

07-06: Shift-Reduce Parsing

Bottom-up parsing uses only two kinds of moves: shift and reduce.

A shift reads one token of input. It pushes a terminal onto the stack.

A reduce applies an inverse production at the right end of the string. It pops symbols off the stack and pushes a non-terminal onto the stack.
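
A worked example (reconstructed, assuming the expression grammar from 06-04: E → T | T + E; T → int | int * T | ( E )): shift-reduce parsing of int * int + int, with | marking the boundary between the stack (left) and the remaining input (right):

|int * int + int     shift
int |* int + int     shift
int * |int + int     shift
int * int |+ int     reduce T → int
int * T |+ int       reduce T → int * T
T |+ int             shift
T + |int             shift
T + int |            reduce T → int
T + T |              reduce E → T
T + E |              reduce E → T + E
E |                  accept

Reading the reduce steps from bottom to top gives the right-most derivation E ⟹ T + E ⟹ T + T ⟹ T + int ⟹ int * T + int ⟹ int * int + int.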

08-01: Handles

08-02: Recognizing Handles

08-03: Recognizing Viable Prefixes

08-04: Valid Items

08-05: SLR Parsing

08-06: SLR Parsing Example

08-07: SLR Improvements

08-08: SLR Examples

Name         Role
Alex Aiken   Instructor
edX          Publisher