Lexing module

The module lexing.py aims to reproduce the functionality of GNU flex in Python. It defines a Lexer class that can be subclassed to define a new lexer. Here is a simple example.

from lexing import Lexer, silent

class Calc(Lexer):
    
    order = 'real', 'integer' # try real first, so 123.45 is one real token
    separators = 'sep', 'operator', 'bind'
    
    sep = silent(r'\s') # This won't produce tokens
    integer = r'[0-9]+'
    real = r'[0-9]+\.[0-9]*'
    variable = r'[a-z]+'
    operator = r'[-+*/]' # '-' first so it is not read as a range
    bind = r'='
    
    def process_integer(state, val): return int(val)
    
    def process_real(state, val): return float(val)

Token types are created automatically:

>>> Calc.variable
<tokentype variable>
>>> print Calc.variable
variable
>>>

Tokens can be created if needed (but usually just obtained through scan()):

>>> varx = Calc.variable.token('x')
>>> varx
<token variable: x>
>>> print varx
x
>>>

Use the scan() method to tokenise a string. It returns an iterator over all the tokens in the string.

>>> list(Calc.scan('a = 3.0*2'))
[<token variable: a>, <token bind: =>, <token real: 3.0>, <token operator: *>, <token integer: 2>]
>>>

The val attribute of real tokens is a float thanks to the process_real() function.
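
For instance, a quick check of both attributes (using the val and strval attributes described under Token objects below):

>>> tok = list(Calc.scan('3.0'))[0]
>>> tok.val
3.0
>>> tok.strval
'3.0'
>>>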

Optional attributes

tokentype
a type used to create tokens. The default is lexing.Tokentype; if overridden, it should inherit from it or implement the same interface
init_state
the value of the state object when tokenising starts. This state object is passed to the process functions, and it should be possible to change its attributes. By default it is an empty instance of lexing.State (see the sketch after this list).
separators
a sequence of the names of token types that act as separators. If not defined, all tokens are considered separators. A non-separator token must be surrounded by separators.
order
a sequence of token names in order of priority. If two token types match at the current position, the first one in this list is chosen.
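
As an illustration of init_state, here is a sketch of a lexer that numbers the words it scans. The Numbered class, its word rule and its count attribute are all ours, and the sketch assumes State can be imported from lexing alongside Lexer and silent; the key point, stated above, is that process functions may change the attributes of the state object.

class Numbered(Lexer):
    
    separators = 'sep',
    
    init_state = State() # an empty lexing.State instance
    init_state.count = 0 # our own attribute, read by process_word below
    
    sep = silent(r'\s')
    word = r'[a-z]+'
    
    def process_word(state, val):
        state.count += 1 # the state is shared across the whole scan
        return (state.count, val) # val becomes a (position, text) pair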

Defining rules for tokens

To create a new token type, simply define a class attribute whose value is a regular expression recognising that token. E.g.

    ident = r'[_a-zA-Z][_a-zA-Z0-9]*'
    number = r'[0-9]+'

If you want the number token to have an integer value rather than a string, define a process_number() function. E.g.

    def process_number(state, val):
        return int(val)

Some rules (e.g. for comments) need not produce tokens at all. Wrap them in the silent() function. E.g.

    comment = silent(r'#.*')
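
For example, a lexer with this rule skips comments entirely (a sketch: the Mini class and its word rule are ours; given the examples above, scanning should yield just the word token):

class Mini(Lexer):
    
    sep = silent(r'\s')
    comment = silent(r'#.*')
    word = r'[a-z]+'

>>> list(Mini.scan('foo # ignored'))
[<token word: foo>]
>>>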

Start conditions

Use the istart() and xstart() functions to define inclusive and exclusive start conditions (with the same meaning as in GNU flex). To attach a start condition to a token rule, write COND >> rule. To make a token rule change the current start condition, write rule >> COND (use None as COND to clear the start condition). E.g.

    STRING = xstart()

    start_string = r'"' >> STRING
    string_body = STRING >> r'[^"]*'
    end_string = STRING >> r'"' >> None
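
Put together, the rules above might sit in a lexer like this (a sketch: the Strings class and its word rule are ours, and it assumes xstart can be imported from lexing like Lexer and silent; with separators left undefined, every token type counts as a separator, as noted above):

class Strings(Lexer):
    
    STRING = xstart() # exclusive: only STRING rules apply inside it
    
    sep = silent(r'\s')
    word = r'[a-z]+'
    start_string = r'"' >> STRING # entering a string sets the condition
    string_body = STRING >> r'[^"]*' # only matched while in STRING
    end_string = STRING >> r'"' >> None # the closing quote clears it

Because STRING is exclusive, the word rule cannot fire inside a string; with an inclusive istart() condition it still could.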

Token objects

The token objects yielded by the scan() method have a number of useful attributes:

val and strval
By default they are both equal to the string recognised by the token, but val is replaced by the return value of a process function such as process_real() if one was defined (see the example above).
toktype
This is the token's type. Token types are created automatically when the Lexer subclass is created, and are accessible as class attributes. Each token type has a name attribute (see the example below).
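
For example, continuing with the varx token created earlier (name is assumed to be the plain string seen in the reprs at the top of this page):

>>> varx.toktype
<tokentype variable>
>>> varx.toktype.name
'variable'
>>>
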
Last updated on Tue Oct 28 16:16:12 2008
arno AT marooned.org.uk