FreeLing  3.1
freeling::tokenizer Class Reference

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.

#include <tokenizer.h>


Public Member Functions

 tokenizer (const std::wstring &)
 Constructor.
void tokenize (const std::wstring &, std::list< word > &) const
 tokenize string
std::list< word > tokenize (const std::wstring &) const
 tokenize string, return result as list
void tokenize (const std::wstring &, unsigned long &, std::list< word > &) const
 tokenize string, tracking offset
std::list< word > tokenize (const std::wstring &, unsigned long &) const
 tokenize string, tracking offset, return result as list

Private Attributes

std::set< std::wstring > abrevs
 abbreviations set (Dr., Mrs., etc.); the trailing period is not separated from the token
std::list< std::pair< std::wstring, freeling::regexp > > rules
 tokenization rules
std::map< std::wstring, int > matches
 substrings to convert into tokens in each rule

Detailed Description

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.
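
The general idea of an ordered list of named regular-expression rules plus an abbreviation set can be sketched in a few lines. The sketch below is not FreeLing's implementation: it uses std::wregex instead of freeling::regexp, and Token, SketchTokenizer, the first-match-wins policy, and the single-character fallback are all assumptions made for illustration.

```cpp
#include <cwctype>
#include <list>
#include <regex>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for freeling::word: only the surface form is kept.
struct Token {
    std::wstring form;
};

// Hypothetical rule-based splitter (an illustration, not the FreeLing class).
class SketchTokenizer {
public:
    SketchTokenizer(std::list<std::pair<std::wstring, std::wregex>> rules,
                    std::set<std::wstring> abbreviations)
        : rules_(std::move(rules)), abrevs_(std::move(abbreviations)) {}

    std::list<Token> tokenize(const std::wstring &text) const {
        std::list<Token> out;
        auto pos = text.cbegin();
        const auto end = text.cend();
        while (pos != end) {
            if (std::iswspace(*pos)) { ++pos; continue; }  // skip whitespace
            bool matched = false;
            // Assumption: rules are tried in order and the first match wins.
            for (const auto &rule : rules_) {
                std::wsmatch m;
                // match_continuous anchors the match at the current position.
                if (!std::regex_search(pos, end, m, rule.second,
                                       std::regex_constants::match_continuous))
                    continue;
                if (m.length(0) == 0) continue;  // ignore empty matches
                std::wstring form = m.str(0);
                pos = m[0].second;
                // Keep the period attached when form + "." is a known abbreviation.
                if (pos != end && *pos == L'.' && abrevs_.count(form + L".") > 0) {
                    form += L'.';
                    ++pos;
                }
                out.push_back(Token{form});
                matched = true;
                break;
            }
            if (!matched) {  // fallback (assumption): emit one character
                out.push_back(Token{std::wstring(1, *pos)});
                ++pos;
            }
        }
        return out;
    }

private:
    std::list<std::pair<std::wstring, std::wregex>> rules_;
    std::set<std::wstring> abrevs_;
};
```

With the rules WORD = [A-Za-z]+, NUMBER = [0-9]+ and PUNCT = [.,;:!?], and the abbreviation "Dr.", the input "Dr. Smith paid 20 euros." is split into six tokens: "Dr.", "Smith", "paid", "20", "euros" and ".".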


Constructor & Destructor Documentation

freeling::tokenizer::tokenizer ( const std::wstring &  tokFile)

Constructor.

Create a tokenizer, using the abbreviations and tokenization patterns read from the given file (tokFile).

References freeling::config_file::add_section(), freeling::config_file::close(), ERROR_CRASH, freeling::config_file::get_content_line(), freeling::config_file::get_section(), freeling::config_file::open(), and TRACE.


Member Function Documentation

void freeling::tokenizer::tokenize ( const std::wstring &  ,
std::list< word > &   
) const

tokenize string

std::list<word> freeling::tokenizer::tokenize ( const std::wstring &  ) const

tokenize string, return result as list

void freeling::tokenizer::tokenize ( const std::wstring &  ,
unsigned long &  ,
std::list< word > &   
) const

tokenize string, tracking offset

std::list<word> freeling::tokenizer::tokenize ( const std::wstring &  ,
unsigned long &   
) const

tokenize string, tracking offset, return result as list
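
The offset-tracking overloads take an unsigned long by reference. One plausible reading, assumed here rather than confirmed by this page, is that the caller keeps a running character offset so that token positions stay relative to the whole input when text is fed in piece by piece. The sketch below illustrates that pattern with a plain whitespace splitter; SpanToken and tokenize_with_offset are hypothetical names, not part of FreeLing.

```cpp
#include <cstddef>
#include <cwctype>
#include <list>
#include <string>

// Hypothetical token type carrying a span relative to the whole input.
struct SpanToken {
    std::wstring form;
    unsigned long start;   // offset of the first character
    unsigned long finish;  // offset one past the last character
};

// Splits on whitespace and advances `offset` by the length of `text`,
// so that the next call continues where this one stopped (assumption).
std::list<SpanToken> tokenize_with_offset(const std::wstring &text,
                                          unsigned long &offset) {
    std::list<SpanToken> out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (std::iswspace(text[i])) { ++i; continue; }
        std::size_t j = i;
        while (j < text.size() && !std::iswspace(text[j])) ++j;
        SpanToken t;
        t.form = text.substr(i, j - i);
        t.start = offset + static_cast<unsigned long>(i);
        t.finish = offset + static_cast<unsigned long>(j);
        out.push_back(t);
        i = j;
    }
    offset += static_cast<unsigned long>(text.size());
    return out;
}
```

Calling this first on "Hello world\n" with offset 0 and then on "Bye" leaves offset at 15, and the span of "Bye" starts at 12 rather than 0.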


Member Data Documentation

std::set<std::wstring> freeling::tokenizer::abrevs [private]

abbreviations set (Dr., Mrs., etc.); the trailing period is not separated from the token

std::map<std::wstring,int> freeling::tokenizer::matches [private]

substrings to convert into tokens in each rule
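
A per-rule integer like this could be used to decide how many captured substrings of a match become separate tokens. The sketch below assumes one such interpretation: 0 means the whole match is a single token, and n > 0 means capture groups 1 to n each become a token. That interpretation, the function apply_rule, and the example rule are assumptions for illustration; std::wregex stands in for freeling::regexp.

```cpp
#include <list>
#include <map>
#include <regex>
#include <string>

// Applies one named rule at the start of `text` and returns the token forms.
// `matches` maps a rule name to the number of capture groups to emit
// (assumed semantics: 0 = whole match, n = groups 1..n).
std::list<std::wstring> apply_rule(const std::wstring &text,
                                   const std::wstring &rule_name,
                                   const std::wregex &re,
                                   const std::map<std::wstring, int> &matches) {
    std::list<std::wstring> out;
    std::wsmatch m;
    if (!std::regex_search(text, m, re, std::regex_constants::match_continuous))
        return out;  // rule does not apply at this position
    const auto it = matches.find(rule_name);
    const int n = (it == matches.end()) ? 0 : it->second;
    if (n <= 0) {
        out.push_back(m.str(0));  // whole match as one token
        return out;
    }
    for (int g = 1; g <= n && g < static_cast<int>(m.size()); ++g)
        out.push_back(m.str(g));  // one token per capture group
    return out;
}
```

For a hypothetical rule CONTRACTION with the pattern ([A-Za-z]+)(n't), a count of 2 splits "doesn't" into "does" and "n't", while a count of 0 keeps "doesn't" as one token.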

std::list<std::pair<std::wstring, freeling::regexp> > freeling::tokenizer::rules [private]

tokenization rules


The documentation for this class was generated from the following files: