Bison Internals

GNU/Bison is a LALR parser generator written in C. It can generate C and C++ code that implements LALR and GLR parsers. This project is an effort to document internal data structures of Bison.

Tuesday, June 20, 2006

The nullable list

Bison also creates a list of non-terminals that can expand into the empty string. A vector Nullable is created in nullable.c

Nullable[i] != 0 (zero) if i th non-terminal can expand into the empty string.

The algorithm itself is very simple. If X-> /* empty */ then clearly X is added to Nullable set

if X-> beta where beta contains only non-terminal symbols, X is also added to nullable set if all of the nonterminals can expand to the null string; if beta contains atleast one terminal symbol, then X cannot be included in the nullable set.

The implementation of the above algorithm is not as simple. It takes a good amount of linked list tweaking before we arrive at the final array. Look at nullable_compute () in nullable.c

The derives list

The next task for bison is to generate a list of rules assosciated with Non-terminals. By simply walking thru all the rules, and some complex linked list programming, an array DERIVES[0..NVARS] is computed in the file derives.c (derives_compute()). I am able to document the structure of the derives array for now. Why is this array required and how it will be used will be filled in later.

A look at the print_derives() funtion reveals the structure of DERIVES:

Derives[i] = list of rules with i th nonterminal as LHS.

The list of rules is flat list of rule numbers followed by NULL.

e.g, S->NUM X | Y
X -> NUM
Y -> 'a'

derives[0] = 0 NULL (Non terminal $accept is the LHS of rule 0 $accept -> S $end)
derives[1] = 1 2 NULL (S is the LHS of rules 1 and 2)
derives[2] = 3 NULL
derives[3] = 4 NULL

Monday, June 19, 2006

Reducing the grammar

Okay, so far we have parsed the input grammar, entered different symbols in the symbol table and made an internal representation for the user's input. Its time to start operating on the input.

The next function after a call to reader() is reduce_grammar() (see main.c post). This function is defined in the file reduce.c. The main aim of functions in this file is to sort out

1. Useless non-terminals,
2. Unreachable symbols,
3. useless rules

from the grammar. Useless Non-terminals are those that do not appear in any derivation of strings defined by a grammar; these are calculated by using a classic algorithm. The algorithm itself decribed very well inside comments in the function useless_nonterminals() (reduce.c). So I will just quot it here:


The set being computed is a set of nonterminals which can derive
the empty string or strings consisting of all terminals. At each
iteration a nonterminal is added to the set if there is a
production with that nonterminal as its LHS for which all the
nonterminals in its RHS are already in the set. Iterate until
the set being computed remains unchanged. Any nonterminals not
in the set at that point are useless in that they will never be
used in deriving a sentence of the language.



Bitsets are used to represent sets of symbols. All nonterminals, terminals and rules are nothing more than integers now; (symbol numbers in case of tokens and Nonterminals, and rule numbers in case of the productions); so they are represented by bitsets as sets of integers; The working of a bit set is simple. You can set and clear bits, test if the i th bit is set etc. So, if you want to add number 6 to a bitset, you set bit no. 6 to 1; This is a clever way of implementing sets. This way, nomatter howmany times you add 6 to a set, the same bit gets set to 1 and uniqueness of the set is preserved without searching and sorting!

A production is useful if all Non terminals in its RHS are in the set of useful Nonterminals.

There is also a function to calculate the set of all accessible symbols in the grammar by a similiar logic.

After calculating these sets, a bit of re-shuffling is done.
1. The non-terminals are re-numbered. Useless non-terminals are pushed to the end.
2. Accordingly, the symbol table is also re-shuffled to mirror the order of symbol numbers of non-terminals. This is because, symbols[i] should always point to the i th symbol.
3. RITEMS must also be modified to get the new symbol numbers for nonterminal symbols appearing on the RHS
4. RULES must be updated: The rule numbers are re-assigned (useful rules first and useless rules later) (Rules with a useless NT as LHS is anyway a useless rule, so we dont bother to update non terminal numbers on the LHS?!)
5. And hence the RULE markers (negated rule numbers) in RITEMS must also be updated!

The structure of the internal representation of the grammar is still the same except the fact that now, useless rules and non-terminals are pushed behind the useful ones. The useless rules can always be identified by qurying the boolean flag 'useful' in struct rule. Also, nrules -= nuseless_productions and nsyms -= nuseless_nonterminals; and nvars -= nuseless_nonterminals;

so we always know where to draw the line while operating on the grammar in futher steps.

A note on gram.c

gram.c contains many little useful functions that operate on the internal structure of grammar. A study of these functions will ground the understanding of bison's internal representation of grammar. There are functions to print the rules, dump the entire grammar, print out the RITEMS array in a useful way etc; I inserted calls to these functions in reader.c after packgram() finished its job; and what came our was amazing;

Here is my grammar:

S: NUM X
| Y
X: NUM
Y: 'a'
;

Here are the internals.. enjoy!

SYMBOL:Y NUMBER:8
SYMBOL:NUM NUMBER:3
SYMBOL:$end NUMBER:0
SYMBOL:error NUMBER:1
SYMBOL:'a' NUMBER:4
SYMBOL:$accept NUMBER:5
SYMBOL:S NUMBER:6
SYMBOL:$undefined NUMBER:2
SYMBOL:X NUMBER:7
********Trap: From packgram*********
Rule#0:
LHS:$accept, symNO: 5
RHS:6
Rule#1:
LHS:S, symNO: 6
RHS:3
Rule#2:
LHS:S, symNO: 6
RHS:8
Rule#3:
LHS:X, symNO: 7
RHS:3
Rule#4:
LHS:Y, symNO: 8
RHS:4
RITEM#0:
6
RITEM#1:
0
RITEM#2:
-1
RITEM#3:
3
RITEM#4:
7
RITEM#5:
-2
RITEM#6:
8
RITEM#7:
-3
RITEM#8:
3
RITEM#9:
-4
RITEM#10:
4
RITEM#11:
-5

RITEM
S $end (rule 0)
NUM X (rule 1)
Y (rule 2)
NUM (rule 3)
'a' (rule 4)


Grammar

0 $accept: S $end

1 S: NUM X
2 | Y

3 X: NUM

4 Y: 'a'


Dump1

ntokens = 5, nvars = 4, nsyms = 9, nrules = 5, nritems = 12

Variables
---------

Value Sprec Sassoc Tag
5 0 0 $accept
6 0 0 S
7 0 0 X
8 0 0 Y


Rules
-----

Num (Prec, Assoc, Useful, Ritem Range) Lhs -> Rhs (Ritem range) [Num]
0 ( 0, 0, 1, 0- 1) 5 -> 6 0 [0]
1 ( 0, 0, 1, 3- 4) 6 -> 3 7 [1]
2 ( 0, 0, 1, 6- 6) 6 -> 8 [2]
3 ( 0, 0, 1, 8- 8) 7 -> 3 [3]
4 ( 0, 0, 1, 10-10) 8 -> 4 [4]


Rules interpreted
-----------------

0 $accept: S $end
1 S: NUM X
2 S: Y
3 X: NUM
4 Y: 'a'

Wednesday, June 07, 2006

Bison Grammar

It took me a lot of time to understand Bison's internal representation of a grammar. Once we have this data structure figured out well, the rest of the program should be easier to follow along. So here goes.

The grammar specification supplied to bison is like any other programming language input and must be subjected to lexical analysis and parsing before it can be converted into a parser. For lexical analysis, bison uses GNU flex. The file scan-gram.l contains the specification of a lexer for bison.

Not surprisingly, bison grammar is parsed by a bison itself (well, you should have a sane previous version of bison already installed) and an LALR parser is generated! The semantic actions in the grammar file parse-gram.y construct an internal representation of the input grammar. Well this is not yet the "final" internal representation as we will see in a moment.

The whole grammar is first flattened into a list of symbols (well, not exactly a flat list as explained in the mid-rule field below). The file symlist.c contains the necessary functions to manage a linked list of symbols. Each node in the list contains the following values:

Symbol *sym // The symbol in this node
location location //location of the symbol
symbol_list *mid_rule //If this a 'made-up' symbol for a midrule action, a pointer to the first node of the mid-rule. This makes the list sort of not-to-be flat.
const char *action; //all code in the action
location action_location;
bool used; //will be set if the symbol's value is used in the current action.
symbol *ruleprec //The precedence of this rule is determined by the prec of this symbol
int dprec, merger // for GLR parsers
struct symbol_list *next; //the most natural thing to find in a linked list

The LHS of a rule is first inserted into the list, followed by one node each for the symbols of the RHS of the rule. At the end of the rule, an empty node is inserted with a NULL in place of sym field. This special node marks the end of the rule.


when all rules are collected into this list, a function packgram() (see reader.c) is called to convert this list into the 'real' 'final' internal representation of the grammar...

The final representation of the input grammar is very simple. There are 2 arrays RULES and RITEMS. RULES is an array of structures of the type "struct rule". RITEMS is just a long array of integers. Lets deal with RULES first..

1. Each symbol of the grammar is given a unique number (See symbol table post below). Note that the symbols numbers are also given to bison internal symbols - $accept, $undefined, $end, error

2. The structure rule: Each rule consists of the following fields (see gram.h):

rule_number (btw thats just an int typedefed) user_number; //user's rule number
rule_number number; //this might be different as some useless rules will be thrown away
symbol *lhs; //The LHS of any rule is a non-terminal symbol and it is represented by a pointer to the symbol in the symbol table
item_number (huh just another int) *rhs; //Now! watch that! rhs is a pointer to an int. Will explain below.
int dprec,merge; //for GLR parsers
symbol *prec; //This is the symbol that determines the precedence of the rule
symbol *precsym; //Symbol specified in the %prec declaration.
location location;
bool useful; //will be set later to be useful/not useful
const char *action; //action associated with this rule
location action_location;

3. So each rule contains all information about a rule (hold on for a second more for that rhs part). There is this array RULES[0..nrules-1] for n rules. RULES[0] is always the rule appened by Bison

$accept -> S $end

and then the rest of the rules follow.

4. Now to deal with the elusive RHS of a rule. Well, for that, there is this array called RITEMS[0..nritems] - a simple array of integers. Now each integer represents a symbol - thats right, thats what the symbol numbers are for. Specifically, each integer in this array represents a symbol that appears on the RHS of a rule. Suppose symbols 1,2 appear on the RHS of rule#1, and symbols 3,4,5 appear on RHS of rule#2 RITEMS would contain:

1, 2, -1, 3, 4, 5, -2

Thats right, the RHS of each rule is terminated by a negative number, which is the negation of the rule number.

Anyway, now this long array goes on and on for all the rules. And yeah,

RULES[x].rhs = the address of the element in RITEMS where the rhs begins (basically base address of RITEMS + index of first RHS symbol of Rule x).

I wrote some simple print statements in pack gram() to make bison spit out this internal representation. Then ran bison on a simple grammar:

s-> NUM x
x -> NUM

(NUM being a token stolen from the calc examples on bison manual page.)

here is what I got:

********Trap: From packgram*********
Rule#0:
LHS:$accept, symNO: 4
RHS:5
Rule#1:
LHS:S, symNO: 5
RHS:3
Rule#2:
LHS:X, symNO: 6
RHS:3
RITEM#0:
5
RITEM#1:
0
RITEM#2:
-1
RITEM#3:
3
RITEM#4:
6
RITEM#5:
-2
RITEM#6:
3
RITEM#7:
-3

I printed out *(RULES[i].rhs) instead of the address stored in rhs field to make some sense. And it did make sense to me!

Thursday, May 18, 2006

Bison Symbol Table

I am looking at the symbol table used by Bison today. The symbol table is implemented in symtab.h and symtab.c source files as a hashtable. The implementation of a general hashtable is provided by GNU libc, in the files hash.h and hash.c; this implementation is used here.

There are 3 types of symbols declared in symtab.h:
1. token_sym - A terminal symbol
2. nterm_sym - A non-terminal symbol
3. unknown_sym - unrecognized symbol

These 3 symbol types are declared as an enumeration symbol_class. Then the actual symbol structure follows. The elements of a symbol are:

(type variable_name)
1. uniqstr tag - This is the name of the symbol; Also the key for searching
2. location location - This is the location of the first occurance of this symbol
3. uniqstr type_name - The name declared in the %type declaration of this token/non-terminal
4. location type_location - Location of the %type declaration (A location is a structre containing line number and col number)
5. const char* destructor - Guess this is the %destructor action for this token.
6. location destructor_location
7. const char* printer - TBD.
8. location printer_location
9. symbol_number number - symbol_number is actually typedef int symbol_number. This is an internal index of the symbol.
10. location prec_location - This should be the location of precedence information;
11. int prec - should be the precedence of this token;
12. assoc assoc - Associativity of this token. assoc is actually an enumeration defined in assoc.h
typedef enum
{
undef_assoc,
right_assoc,
left_assoc,
non_assoc
} assoc;

13. int user_token_number - This token number is set from the grammar. The internal symbol_number number is different from this because, some useless symbols may have been eliminated after parsing; This number however is the number set from the actions of the bison grammar for bison.

14. symbol *alias - There is a comment there:
/* Points to the other in the identifier-symbol pair for an alias.
Special value USER_NUMBER_ALIAS in the identifier half of the
identifier-symbol pair for an alias. */
Also see note below

15. symbol_class class; - The class of the symbol as described at the beginning.
16. bool declared - Determines if this symbol is declared by the user.

Thats the entire structure of a symbol; A token symbol can have an Alias if declared like:
%token UMINUS "-"
The UMINUS is numbered regularly, the string "-" is considered an alias for this symbol; if the field user_token_number is set to USER_NUMBER_ALIAS this means this symbol has an alias.

After this structure, there are a number fo useful functions; The most important might be:
1. symbol_get(key, location) - Lookup the symbol table. This function doubles up as a symbol_create function; If the symbol is not present, it creates one into the hash table.

2. symbol_make_alias(sym1, sym2, loc) - Make sym2 an alias of sym1 at location loc. This is done by setting sym1->alias to sym2 and sym1->user_token_number = USER_NUMBER_ALIAS. Also sym2->alias will backpoint to sym1. Symbol number of both symbols will be set to the number the lowest between sym2->number and sym1->number.

Other functions set the type, class, precedence and associativity etc. There is no one-one function mapping for all the structure fields;

There is an interesting blurb of code at this point. Several distunguished symbols are declared here- These symbols are internally used by Bison and never declared by the user. They are:
1. error - The builtin error terminal symbol
2. $undefined - A symbol used when the next token is not recognizable by Bison
3. $end - The token that marks the last symbol in the Augmented grammar ($start -> S $end)
4. $accept - A symbol to represent acceptance
5. $startsymbol - bison augmented start symbol.

Then there are four functions to manage the symbol table:
1. symbols_new - Creates a new Symbol table
2. symbols_free - Frees up the symbol table
3. symbols_check_defined - Checks if a symbol is defined: basically check that the sym_class does not contain unknown_sym: if it does, complain.
4. symbols_pack - This will allocate proper numbers to all non-terminal symbols and does various sanity checks.

Monday, May 15, 2006

Begin main.c

Here is a brief outline of the entry point to bison - the file src/main.c

1. Initialize uniqstrs table by calling uniqstrs_new(); Apparently bison maintains a table of unique strings.
Source: uniqstrs.h, uniqstrs.c

2. Parse the command line arguments by calling getargs(argc,argv). This function uses GNU getopt to parse the command line and sets all the options (requested by user) globally. It also takes up the responsibility to display help info when invoked with --help option and the version info when invoked with -V option.
Source: getargs.h, getargs.c

3. Initialize a "Muscle table" by calling muscle_init() - looks like this is something related to M4 processing at the backend.
source: muscle_tab.h, muscle_tab.c

4. All action begins now. A call to reader() will read the input grammar specification and make an internal representation of the grammar.
source: reader.h, reader.c

5. A call to reduce_grammar() will get rid of useless Non-terminals and productions.
source: reduce.h, reduce.c

6. Record "other info" about grammar:
derives_compute(): umm, I think it is computing the derieved non-terminals for each pair of non-terminal symbols - this is useful for computing gotos as explained in sec 4.7 (under Efficient construction of LALR parsing tables) in the "Dragon book" by Aho, Ullman and Sethi;)
source: derives.h, derives.c
nullable_compute(): Calculate the set of non-terminals that can expand to the null string. This is also expalined in the same section in the book.
source: nullable.h, nullable.c

7. Generate parser states (the LR(0) items) by calling generate_states(). A state is represented by a very elobarate data structure found in state.h. This fucntion will move the dot over various symbols in the grammar productions and find all the LR0 items (which actually represent states of an NFA that recognizes the viable prefixes of the grammar). I guess only the "Kernel" items are generated.
source: LR0.h, LR0.c

8. The generated LR0 items form a NFA; we have to convert it into a DFA by subset construction (or sets of items) - This is done by calling lalr(); The computation of lookahead tokens is also done here (see algorithm 4.12 of dragon book) and then my guess is alogrithm 4.13 follows (Efficient computation of Kernels of LALR(1) collection of sets of items)
source: lalr.h, lalr.c

9. Now any Shift/reduce or reduce/reduce conflicts can be solved using precedence info supplied by grammar-writer. This is done by calling conflicts_solve() and conflicts_print() [to report conflicts to user]
source: conflicts.h, conflicts.c

10. The parser tables are now generated from the DFA created above. A call to tables_generate() will achieve this.
source: tables.h, tables.c

11. Here there is also a call to grammar_rules_never_reduced_report() which will report rules that will never be reduced because of conflicts
source: gram.h, gram.c

12. compute_output_file_names() - computes o/p file names. If your input file is parser.y then the ouput will be parser.tab.c
source: files.h, files.c

13. Here there are a couple of decisions - If you asked for a detailed report (--verbose) then a call to print_results() will print out the results to parser.output file.
If you also asked for a graph (VCG) a call to print_graph() will achieve that.
sources: print.h, print.c, print_graph.h, print_graph.c

14. Now it is time output the actual parser.tab.c file; the computed tables are used to output the parser file. A function call to output() will do the job.
source: output.h, output.c

15. The next task is to free up all the memory allocated by bison. So various _free functions do this job here.
source: each free fucntion is found in the corresponding c file; e.g., lalr_free() is found in lalr.c

That is a moderately simplified description of main.c. I skipped little parts where it checks for errors from various steps;

Based on these steps, bison can be split into the front-end, core and back-end; The steps 1-4 comprise the front-end, 5-11 the core and the rest is the backend; The backend is also resposible for generating a C++ parser instead of a C parser if requested by the user.

Whoever said opensource software is not organized well is 100% wrong! that is a really well organized structure; dont you think!

Wednesday, May 10, 2006

Where to begin

I was wondering where to begin hacking bison; Akim sent a mail describing the structure of bison in brief. Just want to record that mail here:

"Well, that would depend upon which part you wish to study. There is a front end, which task is to read the grammar and apply a few transformations upon it. Then there is the core of the program, which builds the LR(0) automaton, then the LALR automaton, deals with precedence, associativity and other conflicts, and then prepares (compresses) the tables. Then there is the back end, which prepares a batch of M4 definitions corresponding to the grammar, and it calls M4 giving it these definitions, and the selected skeleton.

Basically, see... main.c :)"

So I jumped into main.c and lo and behold! everything is hazily making sense...; but I decided not to rush. I am reading the complete bison manual first since I had not used all the features of bison, at least I need to know which is what. For e.g, there is this feature called "locations" that I had not used before;

Started to read bison manual today. It is at http://www.gnu.org/software/bison/manual/html_mono/bison.html.gz

Tuesday, May 09, 2006

Bison!


GNU/Bison is the GNU version of the popular yacc program. It is a LALR(1) parser generator (It can also generate GLR parsers). The bison home page is located at:

http://www.gnu.org/software/bison

This blog is for documenting bison internal data structures as best as I can. If you notice a mistake or typo please post a comment.