Toward a Compiler Compiler: Parse Tree Reconstruction

quasiceo 2014-01-20

展开全文

Overview

For this project, you will turn the recognizer you built for PA4 into a full-fledged parser. In other words, you will

Reconstruct the parse tree from the parse edges. This means keeping track of how edges were derived during the parsing process.
Complete the diagnosis of ambiguous grammars. Now that you have the parse tree, you can finish the implementation of associativity for %left and %right directives.
Execute the syntax-directed translation (aka semantic actions). This will allow you to make translate the parse tree to an AST or to other interesting applications.
Rejoice: you now have a complete interpreter for our CS164 language. :)

Teamwork and Repositories

You will work in the same groups as in PA4 and use the same repository. If your team has changed in any form, please fill out this form.

Implementation

In the file parser_generator.py from your PA4, you must implement a function makeParser(grammar.Grammar), returning an object with a parse(string) method instead of recognize(string) method. This parse() method should return the semantic value (val) of the start symbol from the grammar (which is the root of the parse tree), after parsing and SDT (if the input string is valid).

(Compared to PA4, in PA5, when ambiguities remain, pick one parse tree arbitrarily but do NOT raise an exception. To ease grammar debugging, you may want to print a diagnostic message when you encounter two conflicting unresolved edges (for your own information), but suppress the message before submitting.)

When your parser generator encounters an error (such as getting a syntactically invalid input), you should print an error to sys.stderr and exit with a non-zero code. We have provided the function util.error() for printing the error. Please see the util module documentation

We have broken down the implementation of the second part of your parser generator into the steps below. These steps are merely meant to be a helpful guide; you are free to follow your own plan.

You can implement this part of the parser generator however you want. You can add as many functions, classes, and methods as you wish to the files from the starter kit. We only require that you don't change or remove any of the classes or methods already in the starter kit. These form your parser-generator API, which we will use in later projects. We will also use that API to test your solution to this project.

Step 1: Reconstruct Parse Trees

We recommend you implement this as soon as possible. Seeing reconstructed parse trees can make debugging the later stages much easier. Along with reconstructing parse trees, you need to write code that visualizes parse trees with graphviz. This will be useful for debugging.

Our tree reconstruction took about 25 lines of code; printing parse trees took about 10. Hints:

Reconstruct the parse tree (rooted at an edge) at the point when you are inserting the edge into the graph. Note that you need to reconstruct the parse tree before disambiguation of this edge with an existing edge.
All edges, not just complete edges, need to keep track of children. As you move the dot from E+.E to E+E., the latter edge receives a copy of references to children stored at the former edge.
~~The copy of children is expensive. Do it lazily by wrapping the copy code in a lambda and invoking the lambda at the latest convenient moment.~~ No need to do this.

Step 2: Complete Disambiguation

You have already done operator precedence for PA4. Now that you have the parse tree, you can do associativity easily. Please review the detailed notes of PA4 on disambiguation. Our core disambiguation logic required about 30 lines of code. The associated bookkeeping added about another 20 lines.

Hint: Recall that you disambiguate between E->E1+E and E->E2-E by comparing how much input was derived from E1 vs from E2.

Step 3: Execute Semantic Actions

Start by hard-wiring actions into your generated parsers (e.g., E.action = lambda n1,n2: n1.val+n2.val). Once your hard-coded SDT is correct, use the actual actions in the Grammar objects. The given code from PA4 has already created the function bodies in the Grammar objects for you. It is stored as actions field in Production object.

Semantic Actions

Synthesized Attributes

The following grammar specification is intended to evaluate simple arithmetic expressions:

%left '+'
%%
E -> E '+' E   %{ return n1.val + n3.val %}  // S₁
   | /[0-9]+/  %{ return int (n1.val) %}     // S₂
;

This specification contains two semantic actions: S₁ and S₂. They compute the translation (value) of the left-hand-side E from the values of E's children.

In the action for E -> E '+' E, we refer to the second right-hand-side E's value with n3.val because the '+' symbol also has a value, n2.val (see below).

For example, given the input 1+2+3, your parser would produce the parse tree and syntax-directed translation below:Grammar productions without actions get the default action:

%{ return n1.val %}

This convention is adopted from bison. As you might have guessed, in semantic actions, the objects containing the attributes of each production's symbols can be referred to by n_i. The left-hand-side symbol is n₀, and the right-hand-side symbols are n₁, n₂, etc., from left to right.

Action Creation and Execution

Each action (e.g., %{ return n1.val %}) defines the body of a Python function. Each action must be syntactically valid Python were it to appear as [BODY] in the following statement:

def foo ():[BODY]

Thus, the action in this grammar is valid:

%%
S -> A B  %{
  if n1.val:
    return 3
%} ;
...

but this is invalid:

%%
S -> A B %{ if n1.val:
return 3
%} ;
...

As shown in the examples above, actions can access the attributes of all the symbols in the right-hand side of their productions. Actions can also access the attributes of the left-hand side symbol. You will pass these symbol-attribute objects as arguments to the Python function created from the action. These function parameters must be named:

(n0)   (n1)  (n2)  (n3)
L ->   A     B     C ... ;

When you call action functions during SDT, each n_i argument should reference its parse-tree node's attribute object. So an action such as %{ n1 = 0 %} only sets the action-function's local variable n1 to be 0; no parse-tree node's attribute object is modified, and n₁'s original reference to the parse-tree node is lost to the action.

Calling an action has the effect: n0.val = s_action (n0, n1, n2, ...). In other words, your parser code that performs the SDT must set the synthesized attribute of the LHS symbol's node to the value returned by its action.

The attribute objects referenced by n_i must have at least one field: val, set initially to None. By convention, val is a node's synthesized attribute.

Actions must execute in their own Python namespace, shared among all actions. This namespace must contain all the modules declared by %import in the grammar specification. We provide the util module to assist you with creating this namespace and these Python functions. See the util documentation for example code that creates a namespace and a function.

If an action body is syntactically invalid, Python will raise an exception when you attempt to dynamically create a function from it. You should treat this as an error in the grammar specification: call util.error() with an error message. If an action raises an exception during SDT, you should also call util.error() with an error message.

Consider this grammar:

E -> Integer '+' Integer %{ return ['+', n1.val, n3.val] %}
  ;

Integer -> /[0-9]+/   %{ return int(n1.val) %} ;

Let p be a Production object that represents E -> Integer '+' Integer. You can execute the corresponding action body by calling p.actions(result,a,b), where result, a, and b must be objects with val field; essentially, result, a, and b are passed to the action function as n0, n1, and n2 respectively.

If you've never used python feature for a calling function with a variable number of parameters, this article might be useful for you.

Epsilons and Semantic Actions

If you match an epsilon with an SDT rule, its value must be None. Consider a small grammar:

%%

S -> 'A' T; 
T -> _           %{ print 'Epsilons are represented as ' + repr(n1.val) %};

It will print Epsilons are represented as None, when parsing the string 'A'.

Deliverables

Your complete parser, with SDT.
Submit three test grammars (each with semantic rules) and three inputs for each grammar. Name these folders grammar1, grammar2, and grammar3.

Logistics

Starter-kit

You can fetch the starter-kit by running the following Mercurial commands:

hg pull https://<your username>@/cs164_overlord/cs164sp13
hg merge or hg update, whichever Mercurial tells you to run.

This will create in your repository a new PA5 directory containing the starter-kit. Then, copy over the following files from your PA4 project

grammar.py
grammar_parser.py
parser_generator.py

Finally, one can invoke the parser with

$ ./main.py test.grm input file

Output

Your parser must produce the following output

Print the AST, as produced by the SDT rules, when there exists a parse tree. If there is an unresolved ambiguity, pick any parse tree.
Print Error otherwise (i.e. when there is no parse tree).

Errata and Bug fixes

Once a bug is discovered in the starter-kit, we will publish a corrected version on Bitbucket. Then, you can update your starter-kit (and get the bugfix) with following commands:

hg pull https://<your username>@/cs164_overlord/cs164sp13
hg merge or hg update, whichever Mercurial tells you to run.

Submission

Autograder

We will use the same autograder as for the previous PAs. The same rules apply:

You have to make sure your interpreter plays nice with the autograder. Do not modify main.py, or change the way the interpreter is invoked. Printing debugging information will interfere with the autograder. Make sure you remove all debugging print statements before submitting.

Using The Autograder With Your Own Tests

The structure of the tests directory has slightly changed. Now, each subfolder of tests represent a language. Inside, the grammar is always named grammar.grm. All the other files are input/output pairs. For example:

./tests/simple/grammar.grm
./tests/simple/1.in
./tests/simple/1.out
./tests/simple/2.in
./tests/simple/2.out

This defines a language simple, whose grammar is grammar.grm with 2 input/output pairs.

Using Our Remote Parser as Reference

You can query our remote parser using rparse.py as follow:
./rparse.py -g your_grammar.grm < file_containing_input_to_parse
./rparse.py -g your_grammar.grm "input to parse"

The Submission Itself

You will submit by tagging the revision you want us to grade with the following tag: "PA5_SUBMIT". Case matters. To do so, please follow the instructions below:

Make sure your interpreter plays nice with the autograder by running ./grader.py. Make sure you pass all the tests. To do so, all debugging print statements must be removed.
Please don't forget to add all your files to the repository (use hg add). This includes the parser code, the grammars, and the test files.
Commit with "hg commit -m message_here"
Add the tag with the command "hg tag PA5_SUBMIT"
Publish the project on the server with "hg push"
Go to https:///<username>/cs164sp13/changesets
You should see the tag in the timeline. If you do, you are all set.