Bison Internals

GNU/Bison is a LALR parser generator written in C. It can generate C and C++ code that implements LALR and GLR parsers. This project is an effort to document internal data structures of Bison.

Monday, June 19, 2006

Reducing the grammar

Okay, so far we have parsed the input grammar, entered different symbols in the symbol table and made an internal representation for the user's input. Its time to start operating on the input.

The next function after a call to reader() is reduce_grammar() (see main.c post). This function is defined in the file reduce.c. The main aim of functions in this file is to sort out

1. Useless non-terminals,
2. Unreachable symbols,
3. useless rules

from the grammar. Useless Non-terminals are those that do not appear in any derivation of strings defined by a grammar; these are calculated by using a classic algorithm. The algorithm itself decribed very well inside comments in the function useless_nonterminals() (reduce.c). So I will just quot it here:


The set being computed is a set of nonterminals which can derive
the empty string or strings consisting of all terminals. At each
iteration a nonterminal is added to the set if there is a
production with that nonterminal as its LHS for which all the
nonterminals in its RHS are already in the set. Iterate until
the set being computed remains unchanged. Any nonterminals not
in the set at that point are useless in that they will never be
used in deriving a sentence of the language.



Bitsets are used to represent sets of symbols. All nonterminals, terminals and rules are nothing more than integers now; (symbol numbers in case of tokens and Nonterminals, and rule numbers in case of the productions); so they are represented by bitsets as sets of integers; The working of a bit set is simple. You can set and clear bits, test if the i th bit is set etc. So, if you want to add number 6 to a bitset, you set bit no. 6 to 1; This is a clever way of implementing sets. This way, nomatter howmany times you add 6 to a set, the same bit gets set to 1 and uniqueness of the set is preserved without searching and sorting!

A production is useful if all Non terminals in its RHS are in the set of useful Nonterminals.

There is also a function to calculate the set of all accessible symbols in the grammar by a similiar logic.

After calculating these sets, a bit of re-shuffling is done.
1. The non-terminals are re-numbered. Useless non-terminals are pushed to the end.
2. Accordingly, the symbol table is also re-shuffled to mirror the order of symbol numbers of non-terminals. This is because, symbols[i] should always point to the i th symbol.
3. RITEMS must also be modified to get the new symbol numbers for nonterminal symbols appearing on the RHS
4. RULES must be updated: The rule numbers are re-assigned (useful rules first and useless rules later) (Rules with a useless NT as LHS is anyway a useless rule, so we dont bother to update non terminal numbers on the LHS?!)
5. And hence the RULE markers (negated rule numbers) in RITEMS must also be updated!

The structure of the internal representation of the grammar is still the same except the fact that now, useless rules and non-terminals are pushed behind the useful ones. The useless rules can always be identified by qurying the boolean flag 'useful' in struct rule. Also, nrules -= nuseless_productions and nsyms -= nuseless_nonterminals; and nvars -= nuseless_nonterminals;

so we always know where to draw the line while operating on the grammar in futher steps.

0 Comments:

Post a Comment

<< Home