Using the Castalia Delphi Parser | TwoDesk Delphi Blog



Using the Castalia Delphi Parser

By jacob on May 21, 2012 in Castalia for Delphi, Delphi, Programming, Writing Code

Did you know that the parser used in Castalia is open source?

Well, it is. You can find it on GitHub.

It’s a fork of the original parser, which was written by Martin Waldenburg in the late 1990s and abandoned sometime before 2003, when I found the code and began working on it.

It’s been used in a few projects you may know of:

Castalia, my collection of Delphi IDE plugins
Anders Ohlsson’s Unicode Statistics Tool, which helps in migrating older Delphi applications to newer Delphi versions which support Unicode
The Delphi Mock Wizard, a mock object framework for Delphi

This post is a brief introduction to the parser and a short tutorial on how you might use it.

Parser Structure

The Castalia Delphi Parser isn’t a component that you install. It’s a collection of units that you add to your project; you then write classes that descend from the parser’s “generic” classes.

There are 4 units in the Delphi Parser:

CastaliaPasLex.pas
CastaliaSimplePasPar.pas
CastaliaPasLexTypes.pas
CastaliaSimplePasParTypes.pas

The interesting parts are in CastaliaPasLex.pas and CastaliaSimplePasPar.pas. The other two units are just convenient places to declare the many types required for the lexer and parser.

…Which brings us to the basic structure of the parser.

The parser consists of two parts: The lexer, and the parser. The lexer breaks the input into tokens, which are then analyzed by the parser.

Here’s where people who try to use the parser get confused: The parser doesn’t produce any output. You have to do that part yourself.

The Lexer

The lexer is a fast, hand-written lexical analysis class that takes a string and breaks it into tokens. A token is the basic building block of source code. A token might represent an identifier, a keyword, a punctuation mark, etc.

For example, consider the code ShowMessage('hello, world!');

This code would split into 5 tokens:

Identifier (ShowMessage)
OpenParenthesis
String (hello, world!)
CloseParenthesis
Semicolon

If you’re interested in the hardcore computer science of tokenization, read Chapter 3 of The Dragon Book.

(Note: The Castalia Delphi Parser lexer is not generated with Lex or any similar tool. It is hand-written.)

The Parser

The parser takes that stream of tokens produced by the lexer and analyzes them according to Delphi syntax.

The Castalia Delphi Parser is a recursive-descent parser, which means it analyzes the tokens starting at the beginning, and makes decisions as it encounters new tokens, calling procedures recursively as it decides which grammatical element it has encountered. It’s also hand-written, very fast, and efficient.

As it comes from GitHub, the parser doesn’t produce any output, but it will validate whether some code is syntactically correct or not.

Again, if you’re interested in the theory and hardcore computer science behind recursive-descent parsers, read Chapter 4 of The Dragon Book.
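To make the recursive-descent idea concrete, here is a minimal sketch of a toy parser for arithmetic expressions. It is not taken from the Castalia code: the class TToyParser, its three-rule grammar, and its procedure names are all hypothetical, and it reads characters directly instead of using a separate lexer. It has one procedure per grammar rule, the procedures call each other recursively, and it only checks whether the input is syntactically valid.

```delphi
program RecursiveDescentSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical toy grammar (not the Delphi grammar):
  //   Expression = Term   { ('+' | '-') Term }
  //   Term       = Factor { ('*' | '/') Factor }
  //   Factor     = Number | '(' Expression ')'
  TToyParser = class
  private
    FText: string;
    FPos: Integer;
    function Peek: Char;
    procedure Expect(C: Char);
  protected
    // One virtual procedure per grammar rule, so a descendant could
    // override a rule and react whenever it is encountered.
    procedure Expression; virtual;
    procedure Term; virtual;
    procedure Factor; virtual;
  public
    procedure Run(const AText: string);
  end;

function TToyParser.Peek: Char;
begin
  if FPos <= Length(FText) then
    Result := FText[FPos]
  else
    Result := #0; // end of input
end;

procedure TToyParser.Expect(C: Char);
begin
  if Peek <> C then
    raise Exception.CreateFmt('Expected "%s" at position %d', [C, FPos]);
  Inc(FPos);
end;

procedure TToyParser.Expression;
begin
  Term;
  while (Peek = '+') or (Peek = '-') do
  begin
    Inc(FPos);
    Term;
  end;
end;

procedure TToyParser.Term;
begin
  Factor;
  while (Peek = '*') or (Peek = '/') do
  begin
    Inc(FPos);
    Factor;
  end;
end;

procedure TToyParser.Factor;
begin
  case Peek of
    '0'..'9':
      // Consume a run of digits as one number.
      while (Peek >= '0') and (Peek <= '9') do
        Inc(FPos);
    '(':
      begin
        Inc(FPos);
        Expression; // the recursive step
        Expect(')');
      end;
  else
    raise Exception.CreateFmt('Unexpected character at position %d', [FPos]);
  end;
end;

procedure TToyParser.Run(const AText: string);
begin
  FText := AText;
  FPos := 1;
  Expression;
  if Peek <> #0 then
    raise Exception.CreateFmt('Unexpected character at position %d', [FPos]);
end;

var
  Parser: TToyParser;
begin
  Parser := TToyParser.Create;
  try
    try
      Parser.Run('2+3*(4-1)'); // accepted
      Writeln('valid');
      Parser.Run('2+*3');      // rejected: '*' cannot start a Factor
      Writeln('valid');
    except
      on E: Exception do
        Writeln('syntax error: ', E.Message);
    end;
  finally
    Parser.Free;
  end;
end.
```

Declaring the rule procedures as virtual is a design choice of this sketch. It mirrors the general shape described in this post, where you get at the parse by overriding the procedures for the rules you care about.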

Using the Lexer

If all you care about is the individual tokens, and not the grammar, you can use the lexer by itself without the parser:

First, include CastaliaPasLex in the uses clause of your code.

When you want to use the Lexer, declare a variable of type TmwPasLex:

Lexer: TmwPasLex;

Create the lexer like any other object:

Lexer := TmwPasLex.Create;

Now, assuming you have the source code you want to examine in a string, you’ll need to tell the lexer where to find the code. This is done with the TmwPasLex.Origin property, which is a PChar. Simply point the Origin to the first character of the string:

Lexer.Origin := PChar(SourceString);

Finally, initialize the lexer by calling Init:

Lexer.Init;

At this point, the Lexer will be ready to produce a token stream from the source code in SourceString. To move to the first token, call Next:

Lexer.Next;

The lexer will now be able to give you information about the first token. You can get the type of the token from Lexer.TokenID (see CastaliaPasLexTypes for all the various token types), or the actual text of the token from Lexer.Token. You can also find the token’s location in the source code with properties like Lexer.RunPos and Lexer.PosXY.

Continue calling Lexer.Next; for each token until Lexer.TokenID is ptNull. When the TokenID is ptNull, the lexer has reached the end of the string. Don’t forget to free the Lexer when you’re finished with it.

As a demonstration, here is a simple function that takes a string containing Delphi source code and returns the number of identifier tokens in the string:

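function CountTokens(ASource: string): Integer;
var
  Lexer: TmwPasLex;
  Count: Integer;
begin
  // Token kinds such as ptNull and ptIdentifier are declared in CastaliaPasLexTypes.
  Count := 0;
  Lexer := TmwPasLex.Create;
  try
    Lexer.Origin := PChar(ASource); // point the lexer at the first character
    Lexer.Init;
    Lexer.Next; // move to the first token
    while Lexer.TokenID <> ptNull do // ptNull marks the end of the string
    begin
      if Lexer.TokenID = ptIdentifier then
        Inc(Count);
      Lexer.Next;
    end;
  finally
    Lexer.Free;
  end;
  Result := Count;
end;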
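As a variation on the same loop, the following sketch prints the text of every token instead of counting identifiers, which can be a handy way to see how the lexer splits a piece of code. The program name and the DumpTokens procedure are hypothetical; the lexer calls are only the ones already described above (Origin, Init, Next, TokenID, Token). The sketch makes no claim about the exact text reported for each token, such as whether a string token includes its quotes or whether whitespace comes through as separate tokens.

```delphi
program DumpTokensDemo;

{$APPTYPE CONSOLE}

uses
  CastaliaPasLex, CastaliaPasLexTypes;

// Hypothetical helper: writes the text of each token on its own line,
// wrapped in brackets so that empty or blank tokens stay visible.
procedure DumpTokens(const ASource: string);
var
  Lexer: TmwPasLex;
begin
  Lexer := TmwPasLex.Create;
  try
    Lexer.Origin := PChar(ASource);
    Lexer.Init;
    Lexer.Next;
    while Lexer.TokenID <> ptNull do
    begin
      Writeln('[', Lexer.Token, ']');
      Lexer.Next;
    end;
  finally
    Lexer.Free;
  end;
end;

begin
  // The sample line from the tokenization example earlier in the post.
  DumpTokens('ShowMessage(''hello, world!'');');
end.
```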

Using the Parser

The parser is only a little more complicated than the lexer. To use the parser, you must create your own parser class that descends from TmwSimplePasPar:

TMyParser = class(TmwSimplePasPar)

end;

TmwSimplePasPar declares a virtual procedure for every grammar rule in the Delphi grammar. The grammar is based first on the Delphi grammar as published in the “Delphi Language Guide” that used to come with older versions of Delphi, with additions by me as the language has been expanded.

Speaking of Delphi Grammar….

…Since Embarcadero no longer publishes the Delphi Language Guide, Joe White has published a reverse-engineered Delphi grammar at dgrok.. The Castalia Delphi Parser is not related to this grammar in any way (in fact, I just found it a few minutes ago via Google), but it may help you understand how Delphi’s grammar works. The procedure names in the Castalia Delphi Parser are probably different from Joe’s rule names.

In order to utilize the parser, you’ll need to override the procedures for the grammar rules you want to work with. For example, if you wanted to get a list of all of the units listed in a uses clause, you would override the UsedUnitName procedure.

To get information about the source code during parsing, you access the parser’s built-in lexer with the Lexer property. Here is an example of an overridden UsedUnitName procedure that writes the name of a used unit to the console:

procedure TMyParser.UsedUnitName;
begin
  Writeln(Lexer.Token);
  inherited;
end;

(Always ensure that your overridden method calls inherited, or parsing will stop and you won’t get the information you want.)
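For an override to compile, the procedure also has to be declared in the class with the override directive. The sketch below shows one way the pieces could fit together in a unit, collecting the unit names into a list instead of writing them to the console. The unit name UnitCollector, the class TUnitCollector, and its Units property are hypothetical. Placing the override in the protected section is also an assumption; check where TmwSimplePasPar declares UsedUnitName and match it.

```delphi
unit UnitCollector; // hypothetical unit name

interface

uses
  Classes, CastaliaPasLex, CastaliaSimplePasPar;

type
  TUnitCollector = class(TmwSimplePasPar)
  private
    FUnits: TStrings; // supplied by the caller, not owned by the parser
  protected
    // Assumption: UsedUnitName is declared protected in TmwSimplePasPar.
    procedure UsedUnitName; override;
  public
    property Units: TStrings read FUnits write FUnits;
  end;

implementation

procedure TUnitCollector.UsedUnitName;
begin
  // Record the current token (the unit name) before parsing continues.
  if Assigned(FUnits) then
    FUnits.Add(Lexer.Token);
  inherited; // keep parsing
end;

end.
```

The caller would create a TStringList, assign it to Units before running the parser, and read the names from it afterwards. Keeping the list outside the parser avoids having to add a constructor or destructor to the descendant.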

There are about 250 procedures that can be overridden. Here are a few useful examples:

ProcedureMethodName, FunctionMethodName: the name of a procedure or function
WithStatement: a with..do clause
Block: a begin..end block

Take a look at the definition of TmwSimplePasPar to see the complete list of procedures that you can override. If you’re lucky enough to still have an old Delphi Language Guide lying around, the names of the procedures pretty closely parallel the names of those grammar rules.

Now that you’ve got your own parser class, you need to run it on your source code. Again, we’ll assume you have your source code in a string called SourceString.

Unlike the lexer, the parser expects the code to be in a TMemoryStream (really, any descendant of TCustomMemoryStream). I’ll show below how to write your string to a TMemoryStream, but once you have a TMemoryStream with the right data, you call the Run method on an instance of your parser, passing in the unit name (which may be blank, depending on your usage) and the memory stream object:

Parser.Run('MyUnit', SourceMemoryStream);

To wrap up, here’s an example of a complete procedure that takes a source string as input and outputs all of the used units to the console using our TMyParser as defined above:

procedure ListUsedUnits(ASourceCode: string);
var
  Parser: TMyParser;
  SourceStream: TMemoryStream;
begin
  SourceStream := TMemoryStream.Create;
  try
    // Length counts characters, not bytes, so scale by SizeOf(Char)
    // (2 in Unicode versions of Delphi, 1 in older ones).
    SourceStream.WriteBuffer(Pointer(ASourceCode)^, Length(ASourceCode) * SizeOf(Char));
    SourceStream.Position := 0;
    Parser := TMyParser.Create;
    try
      Parser.Run('', SourceStream);
    finally
      Parser.Free;
    end;
  finally
    SourceStream.Free;
  end;
end;

Conclusion

This has been a very basic overview of how to use the Castalia Delphi Parser. I love hearing about the creative things people have done with the parser, so if you find it useful, please drop me a line and let me know what you’ve done with it.

Development on the Castalia Delphi Parser is supported by Castalia, my collection of time-saving tools for Delphi programmers. Try Castalia today!

17 Responses to Using the Castalia Delphi Parser

Mason Wheeler May 21, 2012 at 2:59 pm #

In a recent Delphi version (I forget which; I think it was either 2010 or XE), they changed TStringStream to be a TCustomMemoryStream descendant. That makes it even simpler to get a string into a memory stream.

Jason Southwell May 21, 2012 at 3:17 pm #

What level of language compatibility does this handle? I see methods for generics, so presumably that, but what about anonymous methods, complex variant type parsing, inline keyword, etc?

- jacob May 21, 2012 at 6:25 pm #
  
  @Jason: It should handle anything that’s legal in Delphi XE2, with the exception that it doesn’t handle all compiler directives.
  
  - Jason Southwell May 30, 2012 at 10:54 am #
    
    It doesn’t look like the code that works with XE2 is updated on github. the defines.inc is missing code for versions higher than ver230 and XE2 is VER230 IIRC. I don’t know if there are other changes required or not. If you could push the current version though, that would be great.
    
Roger May 21, 2012 at 8:49 pm #

The input is a PChar (just a memory stream of bytes?), but does the parser support ANSI files, UTF-8, UTF-16, etc?

- jacob May 21, 2012 at 8:51 pm #
  
  The lexer expects a stream of Chars (not to be confused with a stream of Bytes). You read in the file however you want, as long as it’s in a native Delphi String by the time you give it to the lexer.
  
David M May 22, 2012 at 2:48 pm #

Very interesting! I had heard it was open-source, but it’s great to have an example of how to use it.

“I love hearing about the creative things people have done with the parser”

I’m curious. What have people done with it? Obviously there’s you and Castalia… what else have people written?

- jacob May 22, 2012 at 4:18 pm #
  
  Here are two other projects that have used it:
  Anders Ohlsson’s Unicode Statistics Tool: http://cc./item/27398
  The Delphi Mock Wizard: http://code.google.com/p/delphi-mock-wizard/
  
David M May 22, 2012 at 2:52 pm #

Actually, one other thing – units often depend on other units, etc. How complex can the parser get? Is it unit-specific, or is it possible to feed it an entire program, correlate types across units, etc? I guess some of that is what Castalia must do for functions like refactoring. At what point does understanding the structure of a program leave the parser and move to other, non-open-source (?) code?

- jacob May 22, 2012 at 4:22 pm #
  
  The parser doesn’t produce any output. It doesn’t create a syntax tree or symbol table or anything. Castalia does these things internally, but that is not part of the parser.
  
  You might think of the parser as an event-driven system. You can say things like “parse this file, and every time you find the name of a used unit, do X.” X can be anything from printing out the name of the unit (as in the example) to figuring out what file that unit is in, loading it, parsing it, and compiling it, in which case you would have a fully functional Delphi compiler.
  
Jon Lennart Aasenden May 23, 2012 at 7:49 am #

This is a great parser, I’ve used it a couple of times in prototyping ideas. What you have to provide yourself is a symbolic tree context. For instance, when you override various functions, they will be triggered as the parser encounters them, but it’s up to you to provide the context that gives meaning to a symbol or token.

John Jacobson May 23, 2012 at 5:14 pm #

Is this 64-bit compatible?

- jacob May 23, 2012 at 5:22 pm #
  
  I can’t think of any reason it wouldn’t work with 64-bit, but I don’t know for sure; I’ve only compiled it with the 32-bit compiler.
  
Warren P. May 23, 2012 at 5:25 pm #

I’m actually working on a side project that uses this parser. I’m modifying it to create something that warns about Delphi Style Violations, as well as a Unit “Uses” “ratsnest” detangling tool.

Thanks for making this nice open source project.

Warren

Constantin L May 27, 2012 at 12:49 pm #

Bear in mind that the parser has some bugs and can not properly handle files when define D8_NEWER is activated. For example it can’t handle code below:

procedure Test;
var
Helper: Integer;
begin

end;

Martin Waldenburg June 10, 2012 at 8:57 am #

Nice to see something given back, which is very rare.

DelphiLui July 15, 2013 at 8:49 am #

I am currently building an AutoComplete-Helper for Delphi 2007 because our project is very big and the standard helper takes hours to generate the matching variables etc.