Canonical LR Parser - Overview

Overview

The LR(1) parser is a deterministic automaton and as such its operation is based on static state transition tables. These codify the grammar of the language it recognizes and are typically called "parsing tables".

The parsing tables of the LR(1) parser are parameterized with a lookahead terminal. Simple parsing tables, like those used by the LR(0) parser represent grammar rules of the form

A1 → A, B

which means that if we go from state A to state B then we will go to state A1. After parameterizing such a rule with a lookahead we have:

A1 → A, B, a

which means that the transition will now be performed only if the lookahead terminal is a. This allows for richer languages where a simple rule can have different meanings depending on the lookahead context. For example, in a LR(1) grammar, all of the following rules transition to a different state in spite of being based on the same state sequence.

A1 → A, B, a
A2 → A, B, b
A3 → A, B, c
A4 → A, B, d

The same would not be true if a lookahead terminal was not being taken into account. Parsing errors can be identified without the parser having to read the whole input by declaring some rules as errors. For example

E1 → B, C, d

can be declared an error, causing the parser to stop. This means that the lookahead information can also be used to catch errors, as in the following example:

A1 → A, B, a
A1 → A, B, b
A1 → A, B, c
E1 → A, B, d

In this case A, B will be reduced to A1 when the lookahead is a, b or c and an error will be reported when the lookahead is d.

The lookahead can also be helpful in deciding when to reduce a rule. The lookahead can help avoid reducing a specific rule if the lookahead is not valid, which would probably mean that the current state should be combined with the following instead of the previous state. That means that in the following example

  • State sequence: A, B, C
  • Rules:
A1 → A, B
A2 → B, C

the state sequence can be reduced to

A, A2

instead of

A1, C

if the lookahead after the parser went to state B wasn't acceptable, i.e. no transition rule existed. It should be noted here that states can be produced directly from a terminal as in

X → y

which allows for state sequences to appear.

LR(1) parsers have the requirement that each rule should be expressed in a complete LR(1) manner, i.e. a sequence of two states with a specific lookahead. That makes simple rules such as

X → y

requiring a great many artificial rules that essentially enumerate the combinations of all the possible states and lookahead terminals that can follow. A similar problem appears for implementing non-lookahead rules such as

A1 → A, B

where all the possible lookaheads must be enumerated. That is the reason why LR(1) parsers cannot be practically implemented without significant memory optimizations.

Read more about this topic:  Canonical LR Parser