Tokenizer Module

The first module in the processing chain is the tokenizer. It converts plain text into a list of word objects, according to a set of tokenization rules.

Tokenization rules are regular expressions that are matched against the beginning of the text line being processed. The first matching rule is used to extract the token, the matching substring is deleted from the line, and the process is repeated until the line is empty.
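The matching loop described above can be sketched as follows. This is an illustrative assumption, not the library's actual implementation: the rule set, the use of std::wregex, and the fallback of skipping one character when no rule matches are all choices made for this sketch.

```cpp
#include <list>
#include <regex>
#include <string>

// Sketch of the rule-matching loop: each rule is a regular expression
// anchored at the beginning of the remaining text; the first rule that
// matches (with a non-empty match) extracts the next token.
std::list<std::wstring> tokenize_line(std::wstring line,
                                      const std::list<std::wregex> &rules) {
    std::list<std::wstring> tokens;
    while (!line.empty()) {
        bool matched = false;
        for (const auto &rule : rules) {
            std::wsmatch m;
            // match_continuous anchors the match at the start of the line
            if (std::regex_search(line, m, rule,
                                  std::regex_constants::match_continuous)
                && m.length() > 0) {
                std::wstring tok = m.str();
                // keep the matched substring as a token unless it is blank
                if (tok.find_first_not_of(L" \t") != std::wstring::npos)
                    tokens.push_back(tok);
                line.erase(0, tok.size());  // delete the match, repeat on the rest
                matched = true;
                break;
            }
        }
        if (!matched) line.erase(0, 1);  // no rule matched: skip one character
    }
    return tokens;
}
```

With rules for whitespace, words, numbers, and punctuation (in that order), the line "Hello, world 3.14!" would yield the tokens "Hello", ",", "world", "3.14", "!"; rule order matters, since the first matching rule wins.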

The API of the tokenizer module is the following:

class tokenizer {
  public:
    /// Constructor
    tokenizer(const std::wstring &);

    /// tokenize string 
    void tokenize(const std::wstring &, std::list<word> &) const;
    /// tokenize string, return result as list
    std::list<word> tokenize(const std::wstring &) const;
    /// tokenize string, tracking offset
    void tokenize(const std::wstring &, unsigned long &, std::list<word> &) const;
    /// tokenize string, tracking offset, return result as list
    std::list<word> tokenize(const std::wstring &, unsigned long &) const;
};

That is, once created, the tokenizer module receives plain text in a string, tokenizes it, and returns a list of word objects, one per extracted token.
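The offset-tracking overloads suggest that the caller can tokenize a document chunk by chunk while keeping token spans relative to the whole document. The toy class below illustrates that idea under assumption: it splits on whitespace only, and the word struct, the span fields, and the convention of advancing the offset past each chunk are simplifications for this sketch, not FreeLing's actual classes.

```cpp
#include <cwctype>
#include <list>
#include <string>

// Hypothetical stand-in for the word class: the token text plus its
// character span as absolute offsets in the original document.
struct word {
    std::wstring form;
    unsigned long start, end;
};

// Toy whitespace tokenizer mirroring the overload shape of the module's
// API, to show how an offset argument could compose across calls.
class toy_tokenizer {
  public:
    void tokenize(const std::wstring &text, unsigned long &offset,
                  std::list<word> &lw) const {
        size_t i = 0;
        while (i < text.size()) {
            if (std::iswspace(text[i])) { ++i; continue; }
            size_t j = i;
            while (j < text.size() && !std::iswspace(text[j])) ++j;
            // span is expressed relative to the whole document
            lw.push_back({text.substr(i, j - i),
                          offset + (unsigned long)i,
                          offset + (unsigned long)j});
            i = j;
        }
        offset += text.size();  // advance past this chunk of text
    }

    std::list<word> tokenize(const std::wstring &text) const {
        unsigned long offset = 0;
        std::list<word> lw;
        tokenize(text, offset, lw);
        return lw;
    }
};
```

Calling tokenize(L"one two", off, lw) and then tokenize(L" three", off, lw) with the same off and lw accumulates three words whose spans index into the concatenated text.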


Lluís Padró 2013-09-09