Type | WebPage |
---|---|
Tags | programming, computer science, compilers, Project: Online Classes, mooc |
A ten-week undergraduate course in compilers. The project of the course is to implement a compiler for Cool.
Although no textbook is required, there are recommendations provided for the interested student:
Suggested readings by topic:
This week has 1:35:11 of lecture videos.
A brief introduction: what is a compiler vs. interpreter, when and why were programming languages (starting with FORTRAN) developed.
Compilation works in five phases:
Programming languages and created in order to serve the different requirements of different application domains.
Most of the cost associated with using a new programming language is in training programmers, which leads to a few facts:
Cool is the Classroom Object Oriented Language. It is designed (by the instructor of this course, in fact) to be easy to implement.
This class will create a compiler from Cool to MIPS assembly language. There will be five programming assignments:
Optimization will be touched on, as well.
A simple hello world program like the following is demonstrated:
class Main inherits IO { main(): SELF_TYPE { out_string("Hello, World.\n") }; };
A program is demonstrated that takes an integer input and calculates the factorial, using both recursive and iterative techniques.
class Main inherits A2I { main() : Object { (new IO).out_string(i2a(fact(a2i((new IO).in_string()))).concat("\n")) }; fact(i: Int): Int { let fact: Int <- 1 in { while (not (i = 0)) loop { fact <- fact * i; i <- i - 1; } pool; fact; } }; };
Creating new classes is demonstrated. A list class that can print a string representation of itself is created.
class List inherits A2I { item: Object; next: List; init(i: Object, n: List): List { { item <- i; next <- n; self; }; flatten(): String { let string: String <- case item of i: Int => i2a(i); s: String => s; o: Object => { abort(); ""; }; esac in if (isvoid next) then string else string.concat(next.flatten()) fi }; }; class Main inherits IO { main(): Object { let hello: String <- "Hello ", world: String <- "World!", i: Int <- 42, newline: String <- "\n", nil: List, list: List <- (new List).init(hello, (new List).init(world, (new List).init(42, (new List).init(newline, nil)))) in out_string(list.flatten()) }; };
This week has 2:22:30 of lecture videos.
The goal of lexical analysis is to divide code up into its lexical units, called a lexeme, and to classify those units into the appropriate token class, such as identifiers, keywords, individual symbols, numbers, etc. The lexeme together with its token class make up a pair called a token.
In FORTRAN, whitespace is not significant. For example, VA R1
is the same as VAR1
. As a result, DO 5 I = 1,25
begins a loop, while DO 5 I = 1.25
declares a variable--it is equivalient to DO5I = 1.25
. In order to know whether the DO segment is a keyword for a loop or part of a variable name requires knowing what comes after. This is called lookahead.
In PL/I, keywords are not reserved. So something like this is valid:
IF ELSE THEN THEN ELSE; ELSE ELSE = THEN
Also in PL/I:
DECLARE(ARG1, ..., ARGN)
In this statement, DECLARE
could, depending on context, be either a keyword or an array reference (if it is followed by =
). Since there might have been any number of arguments (indices), it requires unlimited lookahead to determine which of the two is intended.
Regular languages are used to specify the set of strings that belong to a token class. A regular language is usually described using regular expressions.
The language denoted by the regular expression is a single-string language, namely .
The language denoted by ε is the language consisting of the empty string, . Note that this is not the empty language!
The union of two regular languages is a regular language:
The concatenation of two regular languages is a regular language:
The iteration (i.e. the concatenation a finite number of times) of a regular language is a regular language:
This is called the Kleene closure or Kleene star.
The regular expressions over an alphabet Σ are the smallest set of expressions including the empty string, the single-character strings for each character in Σ, and the closure under union, concatenation, and iteration of these expressions.
A meaning function maps expressions to the sets of strings that they denote.
This formalism is worthwhile because:
Consider Roman vs. Arabic numerals. These have identical semantics, but very different syntax, and the latter is much more convenient for computation.
What if there are two matching prefixes, e.g. there exists a where both sequences up to and match? Then we take the maximal munch, which is the longest matching prefix.
What if multiple tokens match? For example, an otherwise valid identifier might also be a keyword. The usual rule is that keywords win, which is accomplished by a priority ordering, e.g. choose the regexp that was listed first.
What if no rule matches? Add a rule for errors that matches all strings not in the language, and put it last in priority.
There are good algorithms that accomplish this efficiently and require only a single pass over the input.
A finite automaton consists of:
There is a graphical notation for finite automata:
The language of a finite automaton is the set of all strings that it accepts.
A finite automaton can have another type of transition: ε-moves. These allow the machine to change state without consuming input.
A determininstic finite automaton (DFA):
A non-deterministic finite automaton (NFA):
Multiple transitions per input per state can be simulated by ε-moves, though it makes the automaton appear more complicated.
Both NFAs and DFAs recognize the same set of languages, the regular languages. DFAs are faster to execute, but NFAs are, in general, smaller than the equivalent DFA.
For each regular expression M, we will define an equivalent NFA. This is done by composing the machines for each part of M using some simple rules. An example:
Call as the ε-closure of a state the set of all states that may be reached from it by any number of ε-moves, plus the state itself.
Given an NFA with states S, start state s, and final states F ⊆ S, define
and the ε-closure operation.
The states of the DFA will be all the nonempty subsets of S. The start state will be the ε-closure of s. The final states of the DFA will be
There will be a transition on the DFA if
Converting the NFA from the last section to a DFA yields:
A DFA can be implemented by a 2D table T:
It can also be implemented as a one-dimensional table indexed on states and pointing to a table for the inputs. This would be more efficient if there are duplicated rows in T.
Alternatively, an NFA could be used directly, with a table containing sets of possible states.
Consider the language consisting of the set of all balanced parentheses:
This is not a regular language, and so cannot be expressed with regular expressions.
In general, a finite automaton can count things for some chosen .
Phase | Input | Output |
---|---|---|
Lexer | String of characters | String of tokens |
Parser | String of tokens | Parse tree |
A context-free grammar (CFG) consists of:
The productions may be viewed as a set of rules about substitutions that allow us to transform from a non-terminal start state to a state consisting only of terminals.
The terminals have no rules for replacment; they should be the tokens of the language.
Let b a CFG with start symbol . Then the language of is:
Many grammars generate the same language. Tools may differ in how a grammar must be written.
A derivation is a sequence of productions. A derivation can also be considered as a tree, where the children of a node are the elements to the right of the arrow in the grammar. This is the parse tree of the string.
A parse tree has terminals at the leaves, and non-terminals at the interior nodes.
An in-order traversal of the leaves is the original input.
In general, a parse tree may have many derivations. For parser implementation, we will be most interested in left-most derivations, in which the left-most children of the parse tree are resolved into terminals first, and right-most derivations, for the opposite case. These differ only on the order in which they are produced--the final parse tree is the same.
A grammar is ambiguous if it has more than one parse tree for some string.
The ambiguity might be resolved by rewriting the grammar to eliminate the ambiguity (without changing the language).
Alternatively, accept the ambiguity, but define some disambiguating rules, such as precedence and associativity.
Three major kinds of error handling beavior:
The first two are in use today.
In panic mode, when an error is detected, it begins discarding tokens until it finds a synchronizing token, and then begins again. For examples, in (1 ++ 2) + 3
, upon encountering the second +
the compiler might begin skipping until it finds an integer, and continue from there.
Error productions specify known, common mistakes in the grammar. For example, writing 5 x
instead of 5*x
. This complicates the grammar, but allows alternate syntax for these cases.
Error correction attempts to find a 'nearby' program that does not have the error. You might try to find a program with a low edit distance, for example, that compiles. However, this is difficult to implement, slow, and may yield a program that compiles but behaves in the wrong way.
An abstract syntax tree (AST) is like a parse tree, but with some details ignored. For example, since the tree structure captures the nesting of the program, we don't need to keep track of the parentheses once parsing is complete.
Recursive descent is a top-down parsing algorithm. It constructs the parse tree from the top, and from left to right.
For the grammar
A recursive-descent parser might be
bool term(TOKEN tok) { return *next++ == tok; } bool E_1() { return T(); } bool E_2() { return T() && term(PLUS) && E(); } bool E() { TOKEN *save = next; return (next = save, E_1()) || (next = save, E_2()); } bool T_1() { return term(INT); } bool T_2() { return term(INT) && term(TIMES) && T(); } bool T_3() { return term(OPEN) && E() && term(CLOSE); } bool T() { TOKEN *save = next; return (next = save, T_1()) || (next = save, T_2()) || (next = save, T_3()); }
The above parser will fail for input like int * int
, because it will succeed at matching in T_1() and then conclude the program is done, leaving unconsumed input.
It fails as a whole because if a production for a non-terminal X succeeds, the parser has no way to go back and try an alternative production for X.
This is a limitation of the implementation, and not the overall strategy, though: general recursive-descent parsers support such "full" backtracking.
Consider a production .
bool S_1() { return S() && term(a); } bool S() { return S_1(); }
This obviously contains an infinite recursion.
A left-recursive grammar has a non-terminal S for some . Recursive-descent parsers do not work for left-recursive grammars.
Consider a grammar
This grammar is left-recursive. This language can be described by a right-recursive grammar:
A general algorithm for eliminating left-recursion can be found in Compilers: Principles, Techniques, and Tools, Second Edition. Compilers can and do eliminate left-recursion automatically.
Predictive parsers are top-down, like recursive-descent, but they can 'predict' the correct production to use by looking at upcoming tokens (lookahead), without backtracking.
These parsers restrict the grammars. They accept LL(k) grammars. These are left-to-right grammars that use left-most derivation, and have k tokens of lookahead.
In LL(1) grammars, for each step, there is at most one choice of production.
Given the grammar
we can left-factor the grammar to remove common prefixes:
This grammar is converted to a parse table, and the unprocessed frontier of the parse is recorded in a stack. As items are matched using the parse table, they are pushed onto the stack, and matched terminals are popped off the stack. In the end, we should end up with an empty stack and all input consumed.
Consider non-terminal , production , and token . Then in two cases.
First, if . In this case, can derive a in the first position, and we say that .
Alternatively, if we have and and , then we can eliminate and proceed. In this case, we say that .
Formally, let us define
The algorithm to compute First is:
Consider the left-factored version of our favorite grammar:
Let us compute the first sets. For a given terminal, the first set is simply the set containing that terminal. For the other cases:
Since , we know that
And for the other non-terminals:
Formally define follow sets as:
Intuitively, we can see that if then and
Also, if , then .
If is the start symbol, then .
The algorithm to compute follow sets goes:
Given the same grammar we worked with before, compute the follow sets. We can see these relations:
And therefore:
Not all grammars are LL(1). For example, a grammar is not LL(1) if it is:
Most programming language CFGs are not LL(1).
Bottom-up parsing is more general than top-down parsing, but it is just as efficient. It is the preferred method used by most parsing tools.
Bottom-up parsers don't need left-factored grammars.
Bottom-up parsing reduces a string to the start symbol by inverting productions. If you read the reductions backward, you get a set of productions that amount to a right-most derivation.
Bottom-up parsing uses only two kinds of moves: shift and reduce.
A shift reads one token of input. It pushes a terminal onto the stack.
A reduce applies an inverse production at the right end of the string. It pops symbols off the stack and pushes a non-terminal onto the stack.
It is not always correct to reduce as soon as reduction is possible.
A handle is a reduction that will allow further reductions back to the start symbol. Obviously, we only want to reduce at handles. Importantly, handles appear only at the top of the stack.
There are no known efficient algorithms for finding handles, in general. However, there are good heuristics, which work on some CFGs reliably.
A viable prefix is one that, given an appropriate suffix, could be part of a handle. No parsing error has yet occurred.
For any grammar, the set of viable prefixes is a regular language.
We can build an NFA to recognize these.
Precedence declarations can resolve conflicts.
Most of the work of the DFA is repeated, because most of the stack does not change each time it is run. So, change the stack to hold pairs of (symbol, DFA state) to save time.
Name | Role |
---|---|
Alex Aiken | Instructor |
edX | Publisher |
Comment
Aiken gives email addresses as a simple example of something that can be described by regular expressions, though he admits that to fully parse then would be slightly more complicated. It's more than just slightly more complicated, actually. Here is some discussion on that, and a look at any given email validator library will further illuminate the issue.