This class wraps a libfoma FSM and allows fast retrieval of similar words via string edit distance based search.
The API of the class is the following:
class foma_FSM {
  public:
    /// build automaton from a file
    foma_FSM(const std::wstring &, const std::wstring &mcost=L"");
    /// delete FSM
    ~foma_FSM();

    /// Use automata to obtain closest matches to given form, and
    /// add them to given list.
    void get_similar_words(const std::wstring &, std::list<std::pair<std::wstring,int> > &) const;
    /// set maximum edit distance of desired results
    void set_cutoff_threshold(int);
    /// set maximum number of desired results
    void set_num_matches(int);
    /// Set default cost for basic SED operations
    void set_basic_operation_cost(int);
};
The constructor takes one parameter naming the file to load, and an
optional second parameter naming a file with the cost matrix for
string edit distance (SED) operations. If the cost matrix is not
given, all operations default to a cost of 1 (or to the value set
with the method set_basic_operation_cost).
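To illustrate what a default operation cost means, the following is a minimal sketch of a weighted edit distance. It is not the libfoma implementation and does not use the FOMA cost matrix format: the names SedCosts and weighted_distance are hypothetical, and the per-pair substitution table is only an assumed stand-in for a cost matrix. Every insertion, deletion, or substitution costs `basic` unless an override is given.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cost table (NOT the FOMA cost matrix format).
struct SedCosts {
  int basic = 1;  // default cost of any insertion, deletion or substitution
  // Optional overrides for substituting one specific character by another.
  std::map<std::pair<wchar_t, wchar_t>, int> substitution;
};

// Weighted edit distance between a and b, computed by dynamic programming.
int weighted_distance(const std::wstring &a, const std::wstring &b,
                      const SedCosts &costs) {
  assert(costs.basic >= 0);
  std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
  // Turning the empty prefix of a into b[0..j) takes j insertions.
  for (std::size_t j = 0; j <= b.size(); ++j)
    prev[j] = static_cast<int>(j) * costs.basic;
  for (std::size_t i = 1; i <= a.size(); ++i) {
    // Turning a[0..i) into the empty string takes i deletions.
    cur[0] = static_cast<int>(i) * costs.basic;
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int subst_cost = 0;
      if (a[i - 1] != b[j - 1]) {
        auto it = costs.substitution.find(std::make_pair(a[i - 1], b[j - 1]));
        subst_cost = (it != costs.substitution.end()) ? it->second : costs.basic;
      }
      int substitute = prev[j - 1] + subst_cost;
      int remove = prev[j] + costs.basic;
      int insert = cur[j - 1] + costs.basic;
      cur[j] = std::min(substitute, std::min(remove, insert));
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}
```

With a default-constructed SedCosts this reduces to the plain edit distance with unit costs; raising `basic` scales every operation, which is the effect the text attributes to set_basic_operation_cost.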
The automaton file may have the extension .src or .bin.
If the extension is .src, the file is interpreted as a text
file with one word per line. The FSM is built to recognize the
vocabulary contained in the file.
If the extension is .bin, the file is interpreted as a binary
libfoma FSM. To compile such a binary file, the FOMA command
line front-end must be used. The front-end is not included in
FreeLing, so you will need to install FOMA if you want to create binary
FSM files. See http://code.google.com/p/foma for details.
A cost matrix for SED operations may be specified only for
text FSMs (i.e., for .src files).
To use a cost matrix with a .bin file, you can compile
it into the automaton using the FOMA front-end.
The format of the cost matrix must comply with FOMA formats. See the
FOMA documentation, or the examples provided in
data/common/alternatives in the FreeLing tarball.
The method get_similar_words receives a string and adds to the
given list the entries in the FSM vocabulary that are closest to it,
sorted by string edit distance to the input string.
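The behaviour described above can be pictured with a brute-force sketch that scans a vocabulary directly instead of searching an automaton. This is not how libfoma works internally, and the function names edit_distance and similar_words are hypothetical. The sketch assumes unit operation costs, treats the cutoff threshold as inclusive, and assumes the integer in each result pair is the distance; none of these details is confirmed by the text above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Plain edit distance with unit costs for insert, delete and substitute.
int edit_distance(const std::wstring &a, const std::wstring &b) {
  std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int substitute = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      cur[j] = std::min(substitute, std::min(prev[j] + 1, cur[j - 1] + 1));
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}

typedef std::pair<std::wstring, int> Match;  // (word, distance), an assumption

// Order matches by increasing distance.
static bool closer(const Match &x, const Match &y) { return x.second < y.second; }

// Hypothetical brute-force stand-in for get_similar_words: keep words within
// cutoff_threshold (assumed inclusive), sort by distance, and return at most
// num_matches of them.
std::list<Match> similar_words(const std::vector<std::wstring> &vocabulary,
                               const std::wstring &form,
                               int cutoff_threshold, int num_matches) {
  assert(cutoff_threshold >= 0 && num_matches >= 0);
  std::vector<Match> hits;
  for (std::size_t k = 0; k < vocabulary.size(); ++k) {
    int d = edit_distance(form, vocabulary[k]);
    if (d <= cutoff_threshold) hits.push_back(Match(vocabulary[k], d));
  }
  // stable_sort keeps vocabulary order among words at the same distance.
  std::stable_sort(hits.begin(), hits.end(), closer);
  if (hits.size() > static_cast<std::size_t>(num_matches))
    hits.resize(static_cast<std::size_t>(num_matches));
  return std::list<Match>(hits.begin(), hits.end());
}
```

The two integer arguments play the roles that set_cutoff_threshold and set_num_matches play in the class. A linear scan like this costs one distance computation per vocabulary word, which is the cost an automaton-based search is meant to avoid.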
Lluís Padró 2013-09-09