【深入淺出教你寫編譯器（Compiler）】四、語意分析（Semantic analysis） by Dukeland

quasiceo 2014-02-02

展开全文

【深入淺出教你寫編譯器（Compiler）】四、語意分析（Semantic analysis）

大家好，又見到西杰了。在上兩章我們探討了如何編寫 Scanner 和 Parser，能夠把一份程式文件轉變成一棵 Parse tree，如果文件有 syntax error 的話亦能夠被偵測出來並且告訴開發人員，現在要進行最後一步的分析了。今天要說的是語意分析，即 Semantic analysis，這是什麼來的？Semantic analysis 要做的工作就是分析語意啦！哈哈。同學們或許你們會問，當我們建立了一棵 Parse tree 之後，不就可以 compile 了嗎？其實不然，你現在有的只是 N 句句子，但這還不是一個完整故事，還要分析一下上文下理電腦才知道你說的是什麼故事，Semantic analysis 就是做這樣的工作了。給你一個例子：

1	`var` `a:bool =` `1` `+` `true;`

一個數字跟一個布林值相加是一個怎樣的概念？別說電腦看不懂，連西杰也看不明白，Semantic analysis 要做的就是把這個問題抽出來。

由於 Wescript 比較簡單，我們只會做以下幾款分析：

一、不可重複定義變數

二、自動初始化變數

三、變數要先定義後使用

四、變數類型檢查

不可重複定義變數

開始吧，第一步要做的事是要修改一下 Parser，因為我們做 semantic analysis 是會把一整棵 Parse tree 讀進來，所以我們需要在建立 Parse tree 時記下那些 node 是哪一行建立出來的，那樣我們匯報錯誤時才可以告訴開發人員哪一行出錯，因此就有了以下的修改：

function Node(){

this.line = 0;

}

Node.prototype.setLine = function (line){

this.line = line;

return this;

}

在 base class “Node” 添加一個 method 來記錄行數。之後在每次建立新 node 時都要加上一句

1	`.setLine(this.scanner.currLine)`

現在可以開始寫 Analyser 了，先看代碼後解說。

//Analyser class

function Analyser(){

this.vars = {};

}

Analyser.prototype.evaluateExpressionBlockNode = function (node){

for (var i = 0, l = node.expressions.length; i < l; i++){

var expressionNode = node.expressions[i];

this.evaluateExpressionNode(expressionNode);

}

Analyser.prototype.evaluateExpressionNode = function (node){

if (node instanceof VariableNode){

this.evaluateVariableNode(node);

}

Analyser.prototype.evaluateVariableNode = function (node){

if (this.vars[node.varName]){

//this variable has been declared before

//since we can find it in our variable table

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The variable \"" + node.varName + "\" has been declared already",

line: node.line

});

}else{

this.vars[node.varName] = node;

//if we do not use "else", this variable declaration will replace the previous one

//This may result in wrong data type checking later on

}

this.evaluateExpressionNode(node.initExpressionNode);

}

Analyser 暫時有三個 method，第一個是 evaluateExpressionBlockNode，它要做的工作很簡單，只是純綷把它其下的所有 expression 都遍歷一次，並執行 evaluateExpressionNode。第二個是 evaluateExpressionNode，它負責判斷 node 的類型並分配到不同的 evaluator。

第三個是 evaluateVariableNode，它就是做第一項語意分析的核心程式了，每當我們遇到 VariableNode 的時候，就是要定義新變數的時候，所以我們就在這個時候檢查一下是否已經定義了變數。檢查的方法就是先建立一個變數 hash map，然後每次 evaluate 時都檢查一下這個變數是否已經在 hash map 當中，是的話就算是重複定義變數了，這個時候就要匯報錯誤了。如果還沒有定義的話，我們要把變數放到 hash map 中以便下次檢查。

注意，我們不可以用以下這種寫法，因為這會做成變數類型不斷改變的情況，後面做變數類型檢查時有機會出錯。

Analyser.prototype.evaluateVariableNode = function (node){

if (this.vars[node.varName]){

//this variable has been declared before

//since we can find it in our variable table

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The variable \"" + node.varName + "\" has been declared already",

line: node.line

});

}

this.vars[node.varName] = node;

this.evaluateExpressionNode(node.initExpressionNode);

}

運行一下，看看程式是否正常運作。

Analyser 的基本結構就是這樣子了，看起來有點熟悉吧？其實跟 Parser 差不多，都是一堆 mutual recursion 而已，把大問題斬成多個小問題，那就沒問題了。

自動初始化變數

下一步是自動初始化變數，這一步應該放在 Parser 中做還是在 semantic analysis 才做，其實都可以，不過西杰認為這應該屬於 Wescript 的特性之一，所以就放在 semantic analysis 才做。

Analyser.prototype.evaluateVariableNode = function (node){

if (this.vars[node.varName]){

//this variable has been declared before

//since we can find it in our variable table

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The variable \"" + node.varName + "\" has been declared already",

line: node.line

});

}else{

this.vars[node.varName] = node;

//if we do not use "else", this variable declaration will replace the previous one

//This may result in wrong data type checking later on

}

if (node.initExpressionNode){

this.evaluateExpressionNode(node.initExpressionNode);

}else{

if (node.type == "bool"){

node.initExpressionNode = new BoolNode("false");

}else if (node.type == "int"){

node.initExpressionNode = new IntNode("0");

}

其實就是改寫了一下最底的那一小段，如果有 initExpressionNode 的話就 evaluate 一下，否則我們就要自行建立相應的預設數值 node 了（當然，這些是 compiler 建立的就不用再 evaluate 了）。

變數要先定義後使用

接下來我們就要檢查一下是不是所有變數都先被定義過然後才被使用。

Analyser.prototype.evaluateIdentifierNode = function (node){

if (! this.vars[node.identifier]){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Variable \"" + node.identifier + "\" must been declared before using",

line: node.line

});

}

技巧照舊，同樣是使用 hash map 來檢查就可以了，沒有難度吧。看看運行結果：

注意，有數個 method 上面沒有解說的，因為它們的做法都很普通，跟之前沒有太大分別，所以就不在此著墨了

變數類型檢查

最後是檢查變數類型，這個比較複雜，我們一步一步來吧。首先要定義兩個常數用來分辨變數的類型：

1 2	`Analyser.TYPE_BOOL = 1;` `Analyser.TYPE_INT = 2;`

然後我們就要在 evaluate 時為某些 node 賦予類型了，首先是 IntNode 和 BoolNode，很明顯地它們分別是 integer 和 boolean 類型了。

Analyser.prototype.evaluateBoolNode = function (node){

node.valueType = Analyser.TYPE_BOOL;

}

Analyser.prototype.evaluateIntNode = function (node){

node.valueType = Analyser.TYPE_INT;

}

另外是 compound node，現在先做一個較簡單的版本，並未處理有運算符的情況：

Analyser.prototype.evaluateCompoundNode = function (node){

var type = null;

for (var i = 0, l = node.nodes.length; i < l; i++){

var subNode = node.nodes[i];

this.evaluateExpressionNode(subNode);

if (type == null){

type = subNode.valueType;

}

node.valueType = type;

}

這裏假設 compound node 裏沒有運算符，並只有一個數值，我們就可以得出以上代碼。

還有 identifier node，我們也要為它們賦予類型，所以要改寫一下原本的 evaluateIdentifierNode。

Analyser.prototype.evaluateIdentifierNode = function (node){

if (! this.vars[node.identifier]){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Variable \"" + node.identifier + "\" must been declared before using",

line: node.line

});

}else{

node.valueType = this.vars[node.identifier].valueType;

}

最後我們要改寫 evaluateVariableNode，加入變數類型檢查。

Analyser.prototype.evaluateVariableNode = function (node){

if (this.vars[node.varName]){

//this variable has been declared before

//since we can find it in our variable table

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The variable \"" + node.varName + "\" has been declared already",

line: node.line

});

}else{

this.vars[node.varName] = node;

//if we do not use "else", this variable declaration will replace the previous one

//This may result in wrong data type checking later on

}

if (node.initExpressionNode){

this.evaluateExpressionNode(node.initExpressionNode);

if (node.type == "bool" &&

node.initExpressionNode.valueType != Analyser.TYPE_BOOL){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The variable \"" + node.varName + "\" is Boolean type but the assignment value is not Boolean",

line: node.line

});

}else if (node.type == "int" &&

node.initExpressionNode.valueType != Analyser.TYPE_INT){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The variable \"" + node.varName + "\" is Integer type but the assignment value is not Integer",

line: node.line

});

}

}else{

if (node.type == "bool"){

node.initExpressionNode = new BoolNode("false");

}else if (node.type == "int"){

node.initExpressionNode = new IntNode("0");

}

node.valueType = (node.type == "bool" Analyser.TYPE_BOOL : Analyser.TYPE_INT);

}

在 initialise 時，我們會檢查一下變數的類型和初始化類型是否匹配，不是的話就需要報錯。現在試試運行一下：

現在再改寫一下 evaluateCompoundNode，因為 compound node 通常都不只有一個數值，而是一堆數值加一堆運算符。（記住一點，compound node 裏的運算是排列好的，可以由左至右直接讀！）做法是先記下左面的數值類型，以及運算符類型，當遇到右面的數值才看看有沒有語意錯誤，於是得出以下代碼。

Analyser.prototype.evaluateCompoundNode = function (node){

var type = null;

var operator = null;

for (var i = 0, l = node.nodes.length; i < l; i++){

var subNode = node.nodes[i];

this.evaluateExpressionNode(subNode);

if (type == null){

type = subNode.valueType;

}else{

if (subNode instanceof OperatorNode){

operator = subNode;

}else{

if (operator instanceof OperatorPlusNode){

if (type != Analyser.TYPE_INT || subNode.valueType != Analyser.TYPE_INT){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Require Integers on both sides of \"+\"",

line: operator.line

});

}

type = Analyser.TYPE_INT;

operator = null;

}else if (operator instanceof OperatorEqualNode){

if ((type == Analyser.TYPE_BOOL && subNode.valueType != Analyser.TYPE_BOOL) ||

(type == Analyser.TYPE_INT && subNode.valueType != Analyser.TYPE_INT)){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Require the type on both sides of \"==\" to be the same",

line: operator.line

});

}

type = Analyser.TYPE_BOOL;

operator = null;

}

node.valueType = type;

}

又運行一下看看結果吧：

現在把餘下可能出現在 compound node 中的 node 都編寫下來！

Analyser.prototype.evaluateCompoundNode = function (node){

var type = null;

var operator = null;

for (var i = 0, l = node.nodes.length; i < l; i++){

var subNode = node.nodes[i];

this.evaluateExpressionNode(subNode);

if (type == null){

type = subNode.valueType;

}else{

if (subNode instanceof OperatorNode){

operator = subNode;

}else{

if (operator instanceof OperatorPlusNode ||

operator instanceof OperatorMinusNode ||

operator instanceof OperatorMultNode ||

operator instanceof OperatorDivNode ||

operator instanceof OperatorModNode){

if (type != Analyser.TYPE_INT || subNode.valueType != Analyser.TYPE_INT){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Require Integers on both sides of arithmetic operator",

line: operator.line

});

}

type = Analyser.TYPE_INT;

operator = null;

}else if (operator instanceof OperatorAndNode ||

operator instanceof OperatorOrNode){

if (type != Analyser.TYPE_BOOL || subNode.valueType != Analyser.TYPE_BOOL){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Require Booleans on both sides of logical operator",

line: operator.line

});

}

type = Analyser.TYPE_BOOL;

operator = null;

}else if (operator instanceof OperatorEqualNode ||

operator instanceof OperatorNotEqualNode){

if ((type == Analyser.TYPE_BOOL && subNode.valueType != Analyser.TYPE_BOOL) ||

(type == Analyser.TYPE_INT && subNode.valueType != Analyser.TYPE_INT)){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Require the type on both sides of comparison operator to be the same",

line: operator.line

});

}

type = Analyser.TYPE_BOOL;

operator = null;

}else if (operator instanceof OperatorAssignNode){

if ((type == Analyser.TYPE_BOOL && subNode.valueType != Analyser.TYPE_BOOL) ||

(type == Analyser.TYPE_INT && subNode.valueType != Analyser.TYPE_INT)){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Require the type on both sides of assignment operator to be the same",

line: operator.line

});

}

}else if (operator instanceof OperatorPlusAssignNode ||

operator instanceof OperatorMinusAssignNode){

if (type != Analyser.TYPE_INT || subNode.valueType != Analyser.TYPE_INT){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "Require the type on both sides of plus/minus assignment operator to be Integer",

line: operator.line

});

}

node.valueType = type;

}

其實工作跟之前的都差不多，只要用多幾個 if 就可以了。另外，這次亦處理了 unary 運算符，做法是先 evaluate 一下那些運算符下的 node，之後再設定自己類型。

Analyser.prototype.evaluateNegateNode = function (node){

this.evaluateExpressionNode(node.node);

node.valueType = node.node.valueType;

}

Analyser.prototype.evaluateNotNode = function (node){

this.evaluateExpressionNode(node.node);

node.valueType = node.node.valueType;

}

Analyser.prototype.evaluateParenNode = function (node){

this.evaluateExpressionNode(node.node);

node.valueType = node.node.valueType;

}

Analyser.prototype.evaluatePostIncrementNode = function (node){

this.evaluateExpressionNode(node.node);

node.valueType = node.node.valueType;

}

Analyser.prototype.evaluatePreIncrementNode = function (node){

this.evaluateExpressionNode(node.node);

node.valueType = node.node.valueType;

}

Analyser.prototype.evaluatePostDecrementNode = function (node){

this.evaluateExpressionNode(node.node);

node.valueType = node.node.valueType;

}

Analyser.prototype.evaluatePreDecrementNode = function (node){

this.evaluateExpressionNode(node.node);

node.valueType = node.node.valueType;

}

現在還有什麼 node 未被處理呢？就是 while 跟 if。它們的處理方法都不難，只是使用之前編寫了的 recursion 就可以了（這就是 recursion 的威力嚕）。

Analyser.prototype.evaluateIfNode = function (node){

this.evaluateExpressionNode(node.conditionExpression);

if (node.conditionExpression.valueType != Analyser.TYPE_BOOL){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The condition must be of Boolean type",

line: node.conditionExpression.line

});

}

this.evaluateExpressionBlockNode(node.expressions);

this.evaluateExpressionBlockNode(node.elseExpressions);

}

Analyser.prototype.evaluateWhileNode = function (node){

this.evaluateExpressionNode(node.conditionExpression);

if (node.conditionExpression.valueType != Analyser.TYPE_BOOL){

Errors.push({

type: Errors.SEMANTIC_ERROR,

msg: "The condition must be of Boolean type",

line: node.conditionExpression.line

});

}

this.evaluateExpressionBlockNode(node.expressions);

}

最後運行一下：

總結

語意分析（Semantic analysis）到此算是完了，大家學到了什麼呢？其實這一章的編寫風格跟上一章寫 Parser 真的差不多，都是一堆 mutual recursion 湊合起來，組合成一個強大的分析器。現在大家可算是過了最艱難的時刻了，接下來的幾章都應該比較簡單（或者可能是因為大家都認識了這種編寫方法了吧），大家請緊記熟讀這前幾章，打好基礎，咱們下禮拜再來。