Bison Internals

GNU/Bison is a LALR parser generator written in C. It can generate C and C++ code that implements LALR and GLR parsers. This project is an effort to document internal data structures of Bison.

Thursday, May 18, 2006

Bison Symbol Table

I am looking at the symbol table used by Bison today. The symbol table is implemented in symtab.h and symtab.c source files as a hashtable. The implementation of a general hashtable is provided by GNU libc, in the files hash.h and hash.c; this implementation is used here.

There are 3 types of symbols declared in symtab.h:
1. token_sym - A terminal symbol
2. nterm_sym - A non-terminal symbol
3. unknown_sym - unrecognized symbol

These 3 symbol types are declared as an enumeration symbol_class. Then the actual symbol structure follows. The elements of a symbol are:

(type variable_name)
1. uniqstr tag - This is the name of the symbol; Also the key for searching
2. location location - This is the location of the first occurance of this symbol
3. uniqstr type_name - The name declared in the %type declaration of this token/non-terminal
4. location type_location - Location of the %type declaration (A location is a structre containing line number and col number)
5. const char* destructor - Guess this is the %destructor action for this token.
6. location destructor_location
7. const char* printer - TBD.
8. location printer_location
9. symbol_number number - symbol_number is actually typedef int symbol_number. This is an internal index of the symbol.
10. location prec_location - This should be the location of precedence information;
11. int prec - should be the precedence of this token;
12. assoc assoc - Associativity of this token. assoc is actually an enumeration defined in assoc.h
typedef enum
{
undef_assoc,
right_assoc,
left_assoc,
non_assoc
} assoc;

13. int user_token_number - This token number is set from the grammar. The internal symbol_number number is different from this because, some useless symbols may have been eliminated after parsing; This number however is the number set from the actions of the bison grammar for bison.

14. symbol *alias - There is a comment there:
/* Points to the other in the identifier-symbol pair for an alias.
Special value USER_NUMBER_ALIAS in the identifier half of the
identifier-symbol pair for an alias. */
Also see note below

15. symbol_class class; - The class of the symbol as described at the beginning.
16. bool declared - Determines if this symbol is declared by the user.

Thats the entire structure of a symbol; A token symbol can have an Alias if declared like:
%token UMINUS "-"
The UMINUS is numbered regularly, the string "-" is considered an alias for this symbol; if the field user_token_number is set to USER_NUMBER_ALIAS this means this symbol has an alias.

After this structure, there are a number fo useful functions; The most important might be:
1. symbol_get(key, location) - Lookup the symbol table. This function doubles up as a symbol_create function; If the symbol is not present, it creates one into the hash table.

2. symbol_make_alias(sym1, sym2, loc) - Make sym2 an alias of sym1 at location loc. This is done by setting sym1->alias to sym2 and sym1->user_token_number = USER_NUMBER_ALIAS. Also sym2->alias will backpoint to sym1. Symbol number of both symbols will be set to the number the lowest between sym2->number and sym1->number.

Other functions set the type, class, precedence and associativity etc. There is no one-one function mapping for all the structure fields;

There is an interesting blurb of code at this point. Several distunguished symbols are declared here- These symbols are internally used by Bison and never declared by the user. They are:
1. error - The builtin error terminal symbol
2. $undefined - A symbol used when the next token is not recognizable by Bison
3. $end - The token that marks the last symbol in the Augmented grammar ($start -> S $end)
4. $accept - A symbol to represent acceptance
5. $startsymbol - bison augmented start symbol.

Then there are four functions to manage the symbol table:
1. symbols_new - Creates a new Symbol table
2. symbols_free - Frees up the symbol table
3. symbols_check_defined - Checks if a symbol is defined: basically check that the sym_class does not contain unknown_sym: if it does, complain.
4. symbols_pack - This will allocate proper numbers to all non-terminal symbols and does various sanity checks.

Monday, May 15, 2006

Begin main.c

Here is a brief outline of the entry point to bison - the file src/main.c

1. Initialize uniqstrs table by calling uniqstrs_new(); Apparently bison maintains a table of unique strings.
Source: uniqstrs.h, uniqstrs.c

2. Parse the command line arguments by calling getargs(argc,argv). This function uses GNU getopt to parse the command line and sets all the options (requested by user) globally. It also takes up the responsibility to display help info when invoked with --help option and the version info when invoked with -V option.
Source: getargs.h, getargs.c

3. Initialize a "Muscle table" by calling muscle_init() - looks like this is something related to M4 processing at the backend.
source: muscle_tab.h, muscle_tab.c

4. All action begins now. A call to reader() will read the input grammar specification and make an internal representation of the grammar.
source: reader.h, reader.c

5. A call to reduce_grammar() will get rid of useless Non-terminals and productions.
source: reduce.h, reduce.c

6. Record "other info" about grammar:
derives_compute(): umm, I think it is computing the derieved non-terminals for each pair of non-terminal symbols - this is useful for computing gotos as explained in sec 4.7 (under Efficient construction of LALR parsing tables) in the "Dragon book" by Aho, Ullman and Sethi;)
source: derives.h, derives.c
nullable_compute(): Calculate the set of non-terminals that can expand to the null string. This is also expalined in the same section in the book.
source: nullable.h, nullable.c

7. Generate parser states (the LR(0) items) by calling generate_states(). A state is represented by a very elobarate data structure found in state.h. This fucntion will move the dot over various symbols in the grammar productions and find all the LR0 items (which actually represent states of an NFA that recognizes the viable prefixes of the grammar). I guess only the "Kernel" items are generated.
source: LR0.h, LR0.c

8. The generated LR0 items form a NFA; we have to convert it into a DFA by subset construction (or sets of items) - This is done by calling lalr(); The computation of lookahead tokens is also done here (see algorithm 4.12 of dragon book) and then my guess is alogrithm 4.13 follows (Efficient computation of Kernels of LALR(1) collection of sets of items)
source: lalr.h, lalr.c

9. Now any Shift/reduce or reduce/reduce conflicts can be solved using precedence info supplied by grammar-writer. This is done by calling conflicts_solve() and conflicts_print() [to report conflicts to user]
source: conflicts.h, conflicts.c

10. The parser tables are now generated from the DFA created above. A call to tables_generate() will achieve this.
source: tables.h, tables.c

11. Here there is also a call to grammar_rules_never_reduced_report() which will report rules that will never be reduced because of conflicts
source: gram.h, gram.c

12. compute_output_file_names() - computes o/p file names. If your input file is parser.y then the ouput will be parser.tab.c
source: files.h, files.c

13. Here there are a couple of decisions - If you asked for a detailed report (--verbose) then a call to print_results() will print out the results to parser.output file.
If you also asked for a graph (VCG) a call to print_graph() will achieve that.
sources: print.h, print.c, print_graph.h, print_graph.c

14. Now it is time output the actual parser.tab.c file; the computed tables are used to output the parser file. A function call to output() will do the job.
source: output.h, output.c

15. The next task is to free up all the memory allocated by bison. So various _free functions do this job here.
source: each free fucntion is found in the corresponding c file; e.g., lalr_free() is found in lalr.c

That is a moderately simplified description of main.c. I skipped little parts where it checks for errors from various steps;

Based on these steps, bison can be split into the front-end, core and back-end; The steps 1-4 comprise the front-end, 5-11 the core and the rest is the backend; The backend is also resposible for generating a C++ parser instead of a C parser if requested by the user.

Whoever said opensource software is not organized well is 100% wrong! that is a really well organized structure; dont you think!

Wednesday, May 10, 2006

Where to begin

I was wondering where to begin hacking bison; Akim sent a mail describing the structure of bison in brief. Just want to record that mail here:

"Well, that would depend upon which part you wish to study. There is a front end, which task is to read the grammar and apply a few transformations upon it. Then there is the core of the program, which builds the LR(0) automaton, then the LALR automaton, deals with precedence, associativity and other conflicts, and then prepares (compresses) the tables. Then there is the back end, which prepares a batch of M4 definitions corresponding to the grammar, and it calls M4 giving it these definitions, and the selected skeleton.

Basically, see... main.c :)"

So I jumped into main.c and lo and behold! everything is hazily making sense...; but I decided not to rush. I am reading the complete bison manual first since I had not used all the features of bison, at least I need to know which is what. For e.g, there is this feature called "locations" that I had not used before;

Started to read bison manual today. It is at http://www.gnu.org/software/bison/manual/html_mono/bison.html.gz

Tuesday, May 09, 2006

Bison!


GNU/Bison is the GNU version of the popular yacc program. It is a LALR(1) parser generator (It can also generate GLR parsers). The bison home page is located at:

http://www.gnu.org/software/bison

This blog is for documenting bison internal data structures as best as I can. If you notice a mistake or typo please post a comment.