Download Irony_src.zip - 247.12 KB
IntroductionThe goal of Irony, a new open source project, is to build a complete set of libraries and tools for language implementations in .NET platform. It is currently in its first phase, which includes building the two compiler front-end modules - scanner and parser. This article provides an overview of the technology and focuses on parser implementation with Irony. The project is hosted at CodePlex: http://www./irony. Irony brings several principal innovations into the field of compiler construction. Like many parser-building tools in use today, Irony produces a working parser from grammar specifications. However, unlike the existing parser builders, Irony does not use a separate meta-language for encoding the target grammar. In Irony the grammar is encoded directly in c# using BNF-like expressions over grammatic elements represented by .NET objects. Additionally, no code generation is employed - Irony‘s universal LALR parser uses the information derived from c#-encoded grammar to control the parsing process. Using the code in downloadThe download supplied with this article contains Irony‘s core assembly, Grammar Explorer application, and a set of sample parsers: a parser for arithmetic expressions and parsers based on simplified grammars for Scheme, Python and Ruby. With Grammar Explorer you can view the analysis data for sample grammars and test the parsers on source samples. The following image shows a source sample and its parsed syntax tree in the Grammar Explorer:
In order to try out the project:
BackgroundMany modern applications use parsing technologies in one form or another. Compiler development is one area that heavily employs parsers. The other examples are script engines and expression evaluators, source code analysis and refactoring tools, template-based code generators, formatters and colorizers, data import and conversion tools, etc. Therefore it is important that application developers have reliable and straightforward methods of implementing a parser when one is needed. Unfortunately, this is not the case - with the current state of technology constructing a parser is still a big challenge. In author‘s opinion, the reason for this is the fact that parser construction technology did not change much since the introduction of LEX/YACC generators in the mid-seventies. Most of the generators in use today are ports or derivatives of these original tools. Many things had changed, new languages and technologies had been introduced, OOP being probably the most prominent example. But somehow these advances didn‘t make into parser construction technology. Same plain c-style generated code, even in java- or c#- targeted parsers, with the same global variables and even the same yy- naming conventions. Irony is an attempt to fix the situation, to bridge the existing gap and to bring the power of a modern language environment like c# and .NET into parser and compiler construction field. Simple Expression ParserTo illustrate how Irony‘s approach is different, let‘s use it to build a parser for simple arithmetic expressions. We will target expressions that contain only numbers, variables, parenthesis and arithmetic operators +, -, *, / and **. The first step is to write a formal grammar for expression language using the Backus-Naur Form (BNF) notation: Expr -> n | v | Expr BinOp Expr | UnOp Expr | ( Expr ) BinOp -> + | - | * | / | ** UnOp -> - ExprLine -> Expr EOF Our grammar has four non-terminals: Our grammar references several terminals: To define this grammar in Irony, we create a new class inherited from the ![]() using System;
using System.Collections.Generic;
using System.Text;
using Irony.Compiler;
namespace Irony.Samples {
//Sample expression grammar - recognizes arithmetic expressions with numbers and variables
public class ExpressionGrammar : Irony.Compiler.Grammar {
public ExpressionGrammar() {
// 1. Terminals
Terminal n = new NumberTerminal("number");
Terminal v = new IdentifierTerminal("variable");
// 2. Non-terminals
NonTerminal Expr = new NonTerminal("Expr");
NonTerminal BinOp = new NonTerminal("BinOp");
NonTerminal UnOp = new NonTerminal("UnOp");
NonTerminal ExprLine = new NonTerminal("ExprLine");
// 3. BNF rules
Expr.Expression = n | v | Expr + BinOp + Expr | UnOp + Expr | "(" + Expr + ")";
BinOp.Expression = Symbol("+") | "-" | "*" | "/" | "**";
UnOp.Expression = "-"; //Irony provides default conversion here
ExprLine.Expression = Expr + Eof; //Eof is a predefined EOF terminal
this.Root = ExprLine; //Set grammar top element
// 4. Set operators precedence and associativity
RegisterOperators(1, "+", "-");
RegisterOperators(2, "*", "/");
RegisterOperators(3, Associativity.Right, "**");
// 5. Register parenthesis as punctuation symbols so they will not appear in the syntax tree
PunctuationSymbols.AddRange(new string[] { "(", ")" });
}
}
}
The entire grammar is defined in the class constructor. First we define terminal and non-terminal variables that we will use in grammar rules. We use standard classes defined by Irony. The name parameter in constructor is used to identify the terminal in listings of grammar data that you see in Grammar Explorer. Notice that we do not need to explicitly create terminals for operator symbols and parenthesis - we will use the string symbols directly in grammar expressions. Irony grammar processor discovers all these symbols automatically and transform them into symbol terminals. Next comes the main part - encoding the grammar rules. NonTerminal objects have a property Expression which is assigned an expression over other terminals and non-terminals. We use c# expressions that look very similar to the original BNF - Irony overloads "+" and "|" operators for syntax element classes so we can write BNF expressions directly in c#! A few comments on the code. In the expression for The last statement in step 3 sets the Finally, we specify operators‘ precedence and associativity - this information will be used by the parser to identify the correct grouping of operations. The last statement marks parenthesis as punctuation symbols - they will be excluded from the output syntax tree. Once we have the syntax tree, we don‘t need such symbols anymore as the order of operations is encoded explicitly in the tree structure. The following is the example of how to use this grammar to parse an expression: ExpressionGrammar grammar = new ExpressionGrammar();
LanguageCompiler compiler = new LanguageCompiler(grammar);
string expr = "a + b * 3.14 / (2.0 * x + y**z**2)";
AstNode rootNode = compiler.Parse(expr);
//rootNode now contains an AST node that is the root of the syntax tree.
The following image shows the parsed expression tree as it appears in the Grammar Explorer:
The following sections provide more detailed information on the parsing process as it is implemented in Irony. Irony Parsing PipelineA parser is a program that transforms the source text written in an input language into a tree-like representation called Abstract Syntax Tree (AST) or simply syntax tree. The tree nodes correspond to the syntactic structures of the input language. Schematically the process may be represented as follows: source -> PARSER -> syntax tree Traditionally this transformation is performed in two steps. First, the module called scanner combines the sequences of input characters into more meaningful combinations called tokens - the words of the input language. The examples of tokens are numbers, string literals, keywords, special characters, etc. Scanner is sometimes called "lexer" or "tokenizer". The tokens produced by a scanner are then passed to the parser proper that actually constructs the syntax tree from the stream of tokens. The processing schema now changes to the following: source -> SCANNER -> tokens -> PARSER -> syntax tree This structure is a standard processing pipeline used by all parsing solutions today. In Irony this standard schema is extended. Irony adds optional processing modules called Token Filters between scanner and parser: source -> SCANNER -> tokens -> (TOKEN FILTER)* -> tokens -> PARSER -> syntax tree (The Kleene star indicates "zero or more" objects). We will talk more about token filters and what they are used for in one of the following sections. For now lets talk about the connection between the modules. Irony uses a bit unconventional method to connect the modules - the pipeline is as a chain of c# iterators. Iterators are a very powerful feature introduced in c# 2.0. To illustrate how it works let‘s look at a sample token filter. The following code snippet shows filter‘s main processing method. The sample filter does nothing, simply passes all input tokens to the module in the pipeline: public override IEnumerable
As the comment in the code indicates, we can easily add or exclude tokens to/from the stream. Using iterators for connecting the modules provides much more flexibility and independence in constructing the individual modules while maintaining high pseudo-parallelism in execution. The execution point literally jumps from one module to another while processing the input as the tokens are moved along the pipeline. Let‘s now discuss each module in the pipeline in more details. ScannerThe purpose of the scanner is to group input characters into meaningful words represented by objects called tokens. The examples of tokens are: numbers, string literals, keywords, variables, special symbols like parenthesis, comma, operator symbols, etc. The other responsibility of the scanner is to eliminate the whitespace and comments from the input stream, leaving only meaningful content. Irony‘s The expression grammar we just built uses two standard terminal classes: public virtual Token TryMatch(CompilerContext context, ISourceStream source) {
//recognize a token in the source stream and return it, or return null if there is no match
....
}
The The following is a simplified description of Irony‘s scanning algorithm. The scanning process starts at initial (0) position in source code. First the scanner skips all characters that represent whitespace. Once a non-space character is reached, the scanner runs through the list of all defined terminals and for each calls the The actual implementation also does some optimization. The scanner does not actually run through all terminals all the time - it uses a hash table to quickly lookup the list of terminals that need to be called based on the current character in the input stream. The lookup table is constructed at initialization time from prefixes that each terminal may explicitly declare. This terminal hash table significantly speeds up the scanning process. In multiple aspects the Irony scanner operates differently compared to existing compiler toolkits. First, the scanner completely ignores the whitespace, even for indentation-sensitive languages like Python. However, the indentation information is not completely lost - each token produced by the scanner contains the location (row and column) information. The indentation tokens can be recreated if neccessary from content tokens later in a dedicated token filter. This arrangement simplifies the general-purpose scanner and separates the indentation-analysis code into a separate layer (token filter) resulting in a better layered architecture. Secondly, unlike other other LEX-like solutions, Irony scanner does not need to define hundreds of numeric constants for each distinct token type. Irony parser matches the tokens to grammar elements directly by their content, so the token type is identified by a containing symbol, e.g. "+", therefore it can be matched by the parser against a "+" terminal referenced in the grammar. Finally, in most cases Irony scanner does not need to distinguish between keywords and identifiers. It simply tokenizes all alpha-numeric words as identifiers leaving it to the parser to differentiate them from each other. Some LEX-based solutions try to put too much responsibility on the scanner and make it recognize not only the token itself, but also its role in the surrounding context. This in turn requires extensive lookaheads, making scanner code quite complex. In author‘s opinion, identifying the token role is responsibility of a parser, not scanner. The scanner should recognize tokens without using any contextual information. Token FiltersToken filters are special processors that intercept the token stream between scanner and parser. Although the term "filter" might suggest that filters only remove some tokens from the stream, they in fact are free to remove from or add tokens to the original stream. So why Irony introduces this extra processing layer in addition to traditional scanner/parser pair? The answer lies in realization that many processing tasks in scanning and parsing are best handled by an intermediate processor that sits between parser and scanner. The following are examples of such tasks:
The list above is certainly not all exaustive. Importantly, implementations can define or reuse several filters in the pipeline, each responsible for its own task. This results in a clear layered architecture and opens opportunities for wide code reuse. Irony defines the base class ParserAfter the token stream produced by the scanner passes the token filters it is consumed by the parser. Irony implements the so-called LALR(1) parser. This type of parser is considered to be the most efficient, yet difficult to implement. It requires building huge action/transition tables from language grammar, so LALR parsers are usually generated by parser generators. Irony does not employ code generation; instead, it construct the transition graph (equivalent of a table) from the language grammar during initialization and uses general-purpose parser implementation. LALR parser is a Deterministic Finite Automaton (DFA) - an abstract machine with finite number of internal states. The parser state represents a phase in analysis process of the input stream. Two states are special: initial state and final state. The parser starts in initial state and moves from one state to another while consuming input tokens and executing the predefined actions, until it arrives at the final state. In each state, it looks at the input token, retrieves the appropriate action from action table for the current state and input token, executes the action and moves to the new state. The new state is derived indirectly from the result of the action. When the parser arrives at the final state, the result of the last Reduce action is an AST node for the entire input. It is the node corresponding to the root non-terminal of the grammar. This node is the root of the syntax tree and is returned as a result of the parsing. The two main action types that parser executes are Shift and Reduce (that‘s why this kind of parser is also called shift-reduce parser). Each action manipulates the parser stack - an internal memory buffer. The parser accumulates input tokens in the stack by "shifting" them from the input. At some point it recognizes that there is a sequence on top of the stack that represents some language construct with corresponding non-terminal defined in the grammar. The parser then executes the Reduce action - it creates an AST node object corresponding to the recognized non-terminal, and replaces the sequence in the stack with the created node. The newly constructed node receives a list of its child nodes so it can properly link to them. The stack is effectively "reduced" in size in Reduce actions. Thus the stack contains either tokens shifted from the input by Shift actions, or AST nodes produced by Reduce actions. In fact they are all AST nodes because the The final result of the parsing process is a syntax tree for the source text. Normally each node in the tree would be of a specific type (class) matching the construct it represents. The language implementation would define node classes for all constructs in the language: there will be a node class for "if" statement, "for" statement, etc. Irony will also predefine some common node types in the future. In order to know which class is needed to create a node the parser looks up the type stored in the Grammar DataIrony‘s scanner and parser are implemented as c# classes which are intended to be used as is, without inheriting in custom language implementations. All custom behavior associated with the target language is encoded in the custom grammar class. The list of token filters is also specified in the grammar. However, Irony does not use the target grammar class directly. Instead, it uses derived information in the Grammar -> GrammarData -> Terminals -> SCANNER -> State Graph -> PARSER The construction of the LALR GrammarsAny programming language syntax may be expressed in more than one grammar. While these multiple grammars are all equivalent in terms of language they produce, they are not equally usable by different parsing algorithms. The LALR(1) algorithm used by Irony requires a special kind of grammar - the so-called LALR grammar. All deviations from LALR requirements result in conflicts - situations when parser wouldn‘t know what to do, or will have more than one available action. The target grammar therefore must be carefully written and fine-tuned. To make this process easier Irony provides a Grammar Explorer application that displays grammar analysis data in multiple tabs on the form. This application is included in the download. One of the tabs in Grammar Explorer entitled "Grammar Errors" shows all conflicts that were detected during grammar analysis. Language implementor should try to eliminate all errors and warning by tweaking the grammar, although it is not always possible to eliminate all of them. For ambiguous situations the parser chooses a preferred action - for example, Shift is preferred over Reduce in shift-reduce conflicts. PerformanceContrary to the common belief that universal parsers are slow (especially scanners), Irony‘s parsing is really fast. The initial performance measurements for languages like Python or Ruby give estimate in between 10 and 20 thousand lines per second for a standard 2GB/2GHz machine, which is impressive, even compared to professional-grade compilers. The initialization time (analyzing the grammar and building the state graph) is about 300 milliseconds, which is acceptable in most cases, but may be a bit too much for production compilers. The initialization time will be reduced in the future by introducing Future DevelopmentThe ultimate goal of the project is to create a complete set of tools for language implementations on .NET platform. As for more short term plan, several features are under development:
More long-term goals include building a runtime infrastructure and implementing back-end parts of the compiler with IL code generation. Finally, I would like to welcome everybody to participate, both in development of the core feature set, and in language implementations based on Irony. References
HistoryJanuary 2008 - Initial version. LicenseThis article is licensed under The MIT License About the Author
|
|