Bison Internals

GNU/Bison is a LALR parser generator written in C. It can generate C and C++ code that implements LALR and GLR parsers. This project is an effort to document internal data structures of Bison.

Thursday, May 18, 2006

Bison Symbol Table

I am looking at the symbol table used by Bison today. The symbol table is implemented in symtab.h and symtab.c source files as a hashtable. The implementation of a general hashtable is provided by GNU libc, in the files hash.h and hash.c; this implementation is used here.

There are 3 types of symbols declared in symtab.h:
1. token_sym - A terminal symbol
2. nterm_sym - A non-terminal symbol
3. unknown_sym - unrecognized symbol

These 3 symbol types are declared as an enumeration symbol_class. Then the actual symbol structure follows. The elements of a symbol are:

(type variable_name)
1. uniqstr tag - This is the name of the symbol; Also the key for searching
2. location location - This is the location of the first occurance of this symbol
3. uniqstr type_name - The name declared in the %type declaration of this token/non-terminal
4. location type_location - Location of the %type declaration (A location is a structre containing line number and col number)
5. const char* destructor - Guess this is the %destructor action for this token.
6. location destructor_location
7. const char* printer - TBD.
8. location printer_location
9. symbol_number number - symbol_number is actually typedef int symbol_number. This is an internal index of the symbol.
10. location prec_location - This should be the location of precedence information;
11. int prec - should be the precedence of this token;
12. assoc assoc - Associativity of this token. assoc is actually an enumeration defined in assoc.h
typedef enum
{
undef_assoc,
right_assoc,
left_assoc,
non_assoc
} assoc;

13. int user_token_number - This token number is set from the grammar. The internal symbol_number number is different from this because, some useless symbols may have been eliminated after parsing; This number however is the number set from the actions of the bison grammar for bison.

14. symbol *alias - There is a comment there:
/* Points to the other in the identifier-symbol pair for an alias.
Special value USER_NUMBER_ALIAS in the identifier half of the
identifier-symbol pair for an alias. */
Also see note below

15. symbol_class class; - The class of the symbol as described at the beginning.
16. bool declared - Determines if this symbol is declared by the user.

Thats the entire structure of a symbol; A token symbol can have an Alias if declared like:
%token UMINUS "-"
The UMINUS is numbered regularly, the string "-" is considered an alias for this symbol; if the field user_token_number is set to USER_NUMBER_ALIAS this means this symbol has an alias.

After this structure, there are a number fo useful functions; The most important might be:
1. symbol_get(key, location) - Lookup the symbol table. This function doubles up as a symbol_create function; If the symbol is not present, it creates one into the hash table.

2. symbol_make_alias(sym1, sym2, loc) - Make sym2 an alias of sym1 at location loc. This is done by setting sym1->alias to sym2 and sym1->user_token_number = USER_NUMBER_ALIAS. Also sym2->alias will backpoint to sym1. Symbol number of both symbols will be set to the number the lowest between sym2->number and sym1->number.

Other functions set the type, class, precedence and associativity etc. There is no one-one function mapping for all the structure fields;

There is an interesting blurb of code at this point. Several distunguished symbols are declared here- These symbols are internally used by Bison and never declared by the user. They are:
1. error - The builtin error terminal symbol
2. $undefined - A symbol used when the next token is not recognizable by Bison
3. $end - The token that marks the last symbol in the Augmented grammar ($start -> S $end)
4. $accept - A symbol to represent acceptance
5. $startsymbol - bison augmented start symbol.

Then there are four functions to manage the symbol table:
1. symbols_new - Creates a new Symbol table
2. symbols_free - Frees up the symbol table
3. symbols_check_defined - Checks if a symbol is defined: basically check that the sym_class does not contain unknown_sym: if it does, complain.
4. symbols_pack - This will allocate proper numbers to all non-terminal symbols and does various sanity checks.

3 Comments:

  • At 4:23 AM, Anonymous Anonymous said…

    Keep posting stuff like this i really like it

     
  • At 7:47 AM, Anonymous Anonymous said…

    nice post. thanks.

     
  • At 7:27 PM, Anonymous Anonymous said…

    Hm hm.. that's amazing but honestly i have a hard time understanding it... I'm wondering what others have to say....

     

Post a Comment

<< Home