Bison Internals

GNU/Bison is a LALR parser generator written in C. It can generate C and C++ code that implements LALR and GLR parsers. This project is an effort to document internal data structures of Bison.

Wednesday, June 07, 2006

Bison Grammar

It took me a lot of time to understand Bison's internal representation of a grammar. Once we have this data structure figured out well, the rest of the program should be easier to follow along. So here goes.

The grammar specification supplied to bison is like any other programming language input and must be subjected to lexical analysis and parsing before it can be converted into a parser. For lexical analysis, bison uses GNU flex. The file scan-gram.l contains the specification of a lexer for bison.

Not surprisingly, bison grammar is parsed by a bison itself (well, you should have a sane previous version of bison already installed) and an LALR parser is generated! The semantic actions in the grammar file parse-gram.y construct an internal representation of the input grammar. Well this is not yet the "final" internal representation as we will see in a moment.

The whole grammar is first flattened into a list of symbols (well, not exactly a flat list as explained in the mid-rule field below). The file symlist.c contains the necessary functions to manage a linked list of symbols. Each node in the list contains the following values:

Symbol *sym // The symbol in this node
location location //location of the symbol
symbol_list *mid_rule //If this a 'made-up' symbol for a midrule action, a pointer to the first node of the mid-rule. This makes the list sort of not-to-be flat.
const char *action; //all code in the action
location action_location;
bool used; //will be set if the symbol's value is used in the current action.
symbol *ruleprec //The precedence of this rule is determined by the prec of this symbol
int dprec, merger // for GLR parsers
struct symbol_list *next; //the most natural thing to find in a linked list

The LHS of a rule is first inserted into the list, followed by one node each for the symbols of the RHS of the rule. At the end of the rule, an empty node is inserted with a NULL in place of sym field. This special node marks the end of the rule.


when all rules are collected into this list, a function packgram() (see reader.c) is called to convert this list into the 'real' 'final' internal representation of the grammar...

The final representation of the input grammar is very simple. There are 2 arrays RULES and RITEMS. RULES is an array of structures of the type "struct rule". RITEMS is just a long array of integers. Lets deal with RULES first..

1. Each symbol of the grammar is given a unique number (See symbol table post below). Note that the symbols numbers are also given to bison internal symbols - $accept, $undefined, $end, error

2. The structure rule: Each rule consists of the following fields (see gram.h):

rule_number (btw thats just an int typedefed) user_number; //user's rule number
rule_number number; //this might be different as some useless rules will be thrown away
symbol *lhs; //The LHS of any rule is a non-terminal symbol and it is represented by a pointer to the symbol in the symbol table
item_number (huh just another int) *rhs; //Now! watch that! rhs is a pointer to an int. Will explain below.
int dprec,merge; //for GLR parsers
symbol *prec; //This is the symbol that determines the precedence of the rule
symbol *precsym; //Symbol specified in the %prec declaration.
location location;
bool useful; //will be set later to be useful/not useful
const char *action; //action associated with this rule
location action_location;

3. So each rule contains all information about a rule (hold on for a second more for that rhs part). There is this array RULES[0..nrules-1] for n rules. RULES[0] is always the rule appened by Bison

$accept -> S $end

and then the rest of the rules follow.

4. Now to deal with the elusive RHS of a rule. Well, for that, there is this array called RITEMS[0..nritems] - a simple array of integers. Now each integer represents a symbol - thats right, thats what the symbol numbers are for. Specifically, each integer in this array represents a symbol that appears on the RHS of a rule. Suppose symbols 1,2 appear on the RHS of rule#1, and symbols 3,4,5 appear on RHS of rule#2 RITEMS would contain:

1, 2, -1, 3, 4, 5, -2

Thats right, the RHS of each rule is terminated by a negative number, which is the negation of the rule number.

Anyway, now this long array goes on and on for all the rules. And yeah,

RULES[x].rhs = the address of the element in RITEMS where the rhs begins (basically base address of RITEMS + index of first RHS symbol of Rule x).

I wrote some simple print statements in pack gram() to make bison spit out this internal representation. Then ran bison on a simple grammar:

s-> NUM x
x -> NUM

(NUM being a token stolen from the calc examples on bison manual page.)

here is what I got:

********Trap: From packgram*********
Rule#0:
LHS:$accept, symNO: 4
RHS:5
Rule#1:
LHS:S, symNO: 5
RHS:3
Rule#2:
LHS:X, symNO: 6
RHS:3
RITEM#0:
5
RITEM#1:
0
RITEM#2:
-1
RITEM#3:
3
RITEM#4:
6
RITEM#5:
-2
RITEM#6:
3
RITEM#7:
-3

I printed out *(RULES[i].rhs) instead of the address stored in rhs field to make some sense. And it did make sense to me!

0 Comments:

Post a Comment

<< Home