Bison Internals

GNU/Bison is a LALR parser generator written in C. It can generate C and C++ code that implements LALR and GLR parsers. This project is an effort to document internal data structures of Bison.

Tuesday, June 20, 2006

The nullable list

Bison also creates a list of non-terminals that can expand into the empty string. A vector Nullable is created in nullable.c

Nullable[i] != 0 (zero) if i th non-terminal can expand into the empty string.

The algorithm itself is very simple. If X-> /* empty */ then clearly X is added to Nullable set

if X-> beta where beta contains only non-terminal symbols, X is also added to nullable set if all of the nonterminals can expand to the null string; if beta contains atleast one terminal symbol, then X cannot be included in the nullable set.

The implementation of the above algorithm is not as simple. It takes a good amount of linked list tweaking before we arrive at the final array. Look at nullable_compute () in nullable.c

The derives list

The next task for bison is to generate a list of rules assosciated with Non-terminals. By simply walking thru all the rules, and some complex linked list programming, an array DERIVES[0..NVARS] is computed in the file derives.c (derives_compute()). I am able to document the structure of the derives array for now. Why is this array required and how it will be used will be filled in later.

A look at the print_derives() funtion reveals the structure of DERIVES:

Derives[i] = list of rules with i th nonterminal as LHS.

The list of rules is flat list of rule numbers followed by NULL.

e.g, S->NUM X | Y
X -> NUM
Y -> 'a'

derives[0] = 0 NULL (Non terminal $accept is the LHS of rule 0 $accept -> S $end)
derives[1] = 1 2 NULL (S is the LHS of rules 1 and 2)
derives[2] = 3 NULL
derives[3] = 4 NULL

Monday, June 19, 2006

Reducing the grammar

Okay, so far we have parsed the input grammar, entered different symbols in the symbol table and made an internal representation for the user's input. Its time to start operating on the input.

The next function after a call to reader() is reduce_grammar() (see main.c post). This function is defined in the file reduce.c. The main aim of functions in this file is to sort out

1. Useless non-terminals,
2. Unreachable symbols,
3. useless rules

from the grammar. Useless Non-terminals are those that do not appear in any derivation of strings defined by a grammar; these are calculated by using a classic algorithm. The algorithm itself decribed very well inside comments in the function useless_nonterminals() (reduce.c). So I will just quot it here:


The set being computed is a set of nonterminals which can derive
the empty string or strings consisting of all terminals. At each
iteration a nonterminal is added to the set if there is a
production with that nonterminal as its LHS for which all the
nonterminals in its RHS are already in the set. Iterate until
the set being computed remains unchanged. Any nonterminals not
in the set at that point are useless in that they will never be
used in deriving a sentence of the language.



Bitsets are used to represent sets of symbols. All nonterminals, terminals and rules are nothing more than integers now; (symbol numbers in case of tokens and Nonterminals, and rule numbers in case of the productions); so they are represented by bitsets as sets of integers; The working of a bit set is simple. You can set and clear bits, test if the i th bit is set etc. So, if you want to add number 6 to a bitset, you set bit no. 6 to 1; This is a clever way of implementing sets. This way, nomatter howmany times you add 6 to a set, the same bit gets set to 1 and uniqueness of the set is preserved without searching and sorting!

A production is useful if all Non terminals in its RHS are in the set of useful Nonterminals.

There is also a function to calculate the set of all accessible symbols in the grammar by a similiar logic.

After calculating these sets, a bit of re-shuffling is done.
1. The non-terminals are re-numbered. Useless non-terminals are pushed to the end.
2. Accordingly, the symbol table is also re-shuffled to mirror the order of symbol numbers of non-terminals. This is because, symbols[i] should always point to the i th symbol.
3. RITEMS must also be modified to get the new symbol numbers for nonterminal symbols appearing on the RHS
4. RULES must be updated: The rule numbers are re-assigned (useful rules first and useless rules later) (Rules with a useless NT as LHS is anyway a useless rule, so we dont bother to update non terminal numbers on the LHS?!)
5. And hence the RULE markers (negated rule numbers) in RITEMS must also be updated!

The structure of the internal representation of the grammar is still the same except the fact that now, useless rules and non-terminals are pushed behind the useful ones. The useless rules can always be identified by qurying the boolean flag 'useful' in struct rule. Also, nrules -= nuseless_productions and nsyms -= nuseless_nonterminals; and nvars -= nuseless_nonterminals;

so we always know where to draw the line while operating on the grammar in futher steps.

A note on gram.c

gram.c contains many little useful functions that operate on the internal structure of grammar. A study of these functions will ground the understanding of bison's internal representation of grammar. There are functions to print the rules, dump the entire grammar, print out the RITEMS array in a useful way etc; I inserted calls to these functions in reader.c after packgram() finished its job; and what came our was amazing;

Here is my grammar:

S: NUM X
| Y
X: NUM
Y: 'a'
;

Here are the internals.. enjoy!

SYMBOL:Y NUMBER:8
SYMBOL:NUM NUMBER:3
SYMBOL:$end NUMBER:0
SYMBOL:error NUMBER:1
SYMBOL:'a' NUMBER:4
SYMBOL:$accept NUMBER:5
SYMBOL:S NUMBER:6
SYMBOL:$undefined NUMBER:2
SYMBOL:X NUMBER:7
********Trap: From packgram*********
Rule#0:
LHS:$accept, symNO: 5
RHS:6
Rule#1:
LHS:S, symNO: 6
RHS:3
Rule#2:
LHS:S, symNO: 6
RHS:8
Rule#3:
LHS:X, symNO: 7
RHS:3
Rule#4:
LHS:Y, symNO: 8
RHS:4
RITEM#0:
6
RITEM#1:
0
RITEM#2:
-1
RITEM#3:
3
RITEM#4:
7
RITEM#5:
-2
RITEM#6:
8
RITEM#7:
-3
RITEM#8:
3
RITEM#9:
-4
RITEM#10:
4
RITEM#11:
-5

RITEM
S $end (rule 0)
NUM X (rule 1)
Y (rule 2)
NUM (rule 3)
'a' (rule 4)


Grammar

0 $accept: S $end

1 S: NUM X
2 | Y

3 X: NUM

4 Y: 'a'


Dump1

ntokens = 5, nvars = 4, nsyms = 9, nrules = 5, nritems = 12

Variables
---------

Value Sprec Sassoc Tag
5 0 0 $accept
6 0 0 S
7 0 0 X
8 0 0 Y


Rules
-----

Num (Prec, Assoc, Useful, Ritem Range) Lhs -> Rhs (Ritem range) [Num]
0 ( 0, 0, 1, 0- 1) 5 -> 6 0 [0]
1 ( 0, 0, 1, 3- 4) 6 -> 3 7 [1]
2 ( 0, 0, 1, 6- 6) 6 -> 8 [2]
3 ( 0, 0, 1, 8- 8) 7 -> 3 [3]
4 ( 0, 0, 1, 10-10) 8 -> 4 [4]


Rules interpreted
-----------------

0 $accept: S $end
1 S: NUM X
2 S: Y
3 X: NUM
4 Y: 'a'

Wednesday, June 07, 2006

Bison Grammar

It took me a lot of time to understand Bison's internal representation of a grammar. Once we have this data structure figured out well, the rest of the program should be easier to follow along. So here goes.

The grammar specification supplied to bison is like any other programming language input and must be subjected to lexical analysis and parsing before it can be converted into a parser. For lexical analysis, bison uses GNU flex. The file scan-gram.l contains the specification of a lexer for bison.

Not surprisingly, bison grammar is parsed by a bison itself (well, you should have a sane previous version of bison already installed) and an LALR parser is generated! The semantic actions in the grammar file parse-gram.y construct an internal representation of the input grammar. Well this is not yet the "final" internal representation as we will see in a moment.

The whole grammar is first flattened into a list of symbols (well, not exactly a flat list as explained in the mid-rule field below). The file symlist.c contains the necessary functions to manage a linked list of symbols. Each node in the list contains the following values:

Symbol *sym // The symbol in this node
location location //location of the symbol
symbol_list *mid_rule //If this a 'made-up' symbol for a midrule action, a pointer to the first node of the mid-rule. This makes the list sort of not-to-be flat.
const char *action; //all code in the action
location action_location;
bool used; //will be set if the symbol's value is used in the current action.
symbol *ruleprec //The precedence of this rule is determined by the prec of this symbol
int dprec, merger // for GLR parsers
struct symbol_list *next; //the most natural thing to find in a linked list

The LHS of a rule is first inserted into the list, followed by one node each for the symbols of the RHS of the rule. At the end of the rule, an empty node is inserted with a NULL in place of sym field. This special node marks the end of the rule.


when all rules are collected into this list, a function packgram() (see reader.c) is called to convert this list into the 'real' 'final' internal representation of the grammar...

The final representation of the input grammar is very simple. There are 2 arrays RULES and RITEMS. RULES is an array of structures of the type "struct rule". RITEMS is just a long array of integers. Lets deal with RULES first..

1. Each symbol of the grammar is given a unique number (See symbol table post below). Note that the symbols numbers are also given to bison internal symbols - $accept, $undefined, $end, error

2. The structure rule: Each rule consists of the following fields (see gram.h):

rule_number (btw thats just an int typedefed) user_number; //user's rule number
rule_number number; //this might be different as some useless rules will be thrown away
symbol *lhs; //The LHS of any rule is a non-terminal symbol and it is represented by a pointer to the symbol in the symbol table
item_number (huh just another int) *rhs; //Now! watch that! rhs is a pointer to an int. Will explain below.
int dprec,merge; //for GLR parsers
symbol *prec; //This is the symbol that determines the precedence of the rule
symbol *precsym; //Symbol specified in the %prec declaration.
location location;
bool useful; //will be set later to be useful/not useful
const char *action; //action associated with this rule
location action_location;

3. So each rule contains all information about a rule (hold on for a second more for that rhs part). There is this array RULES[0..nrules-1] for n rules. RULES[0] is always the rule appened by Bison

$accept -> S $end

and then the rest of the rules follow.

4. Now to deal with the elusive RHS of a rule. Well, for that, there is this array called RITEMS[0..nritems] - a simple array of integers. Now each integer represents a symbol - thats right, thats what the symbol numbers are for. Specifically, each integer in this array represents a symbol that appears on the RHS of a rule. Suppose symbols 1,2 appear on the RHS of rule#1, and symbols 3,4,5 appear on RHS of rule#2 RITEMS would contain:

1, 2, -1, 3, 4, 5, -2

Thats right, the RHS of each rule is terminated by a negative number, which is the negation of the rule number.

Anyway, now this long array goes on and on for all the rules. And yeah,

RULES[x].rhs = the address of the element in RITEMS where the rhs begins (basically base address of RITEMS + index of first RHS symbol of Rule x).

I wrote some simple print statements in pack gram() to make bison spit out this internal representation. Then ran bison on a simple grammar:

s-> NUM x
x -> NUM

(NUM being a token stolen from the calc examples on bison manual page.)

here is what I got:

********Trap: From packgram*********
Rule#0:
LHS:$accept, symNO: 4
RHS:5
Rule#1:
LHS:S, symNO: 5
RHS:3
Rule#2:
LHS:X, symNO: 6
RHS:3
RITEM#0:
5
RITEM#1:
0
RITEM#2:
-1
RITEM#3:
3
RITEM#4:
6
RITEM#5:
-2
RITEM#6:
3
RITEM#7:
-3

I printed out *(RULES[i].rhs) instead of the address stored in rhs field to make some sense. And it did make sense to me!