Approximate search dictionary

This class wraps a libfoma FSM and allows fast retrieval of similar words via search based on string edit distance (SED).

The API of the class is the following:

class foma_FSM {

  public:
    /// build automaton from a file
    foma_FSM(const std::wstring &, const std::wstring &mcost=L""); 
    /// delete FSM
    ~foma_FSM();

    /// Use automaton to obtain closest matches to given form, and
    /// add them to given list.
    void get_similar_words(const std::wstring &, 
                           std::list<std::pair<std::wstring,int> > &) const;    
    /// set maximum edit distance of desired results
    void set_cutoff_threshold(int);
    /// set maximum number of desired results
    void set_num_matches(int);
    /// Set default cost for basic SED operations
    void set_basic_operation_cost(int);
  };

The constructor takes the name of the file to load and, optionally, the name of a file containing the cost matrix for SED operations. If no cost matrix is given, every operation costs 1 (or the value set with the method set_basic_operation_cost).
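To make the role of operation costs concrete, the following is a minimal sketch of a weighted string edit distance, written independently of libfoma. The names sed_costs and weighted_sed are hypothetical, and the in-memory cost table is only an illustration: it is not the FOMA cost matrix format and not the way foma_FSM computes distances.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cost table (NOT the FOMA cost matrix format): one default
// cost for insertions, deletions and substitutions, plus optional
// overrides for specific substitution pairs.
struct sed_costs {
  int basic = 1;                                     // default cost of any operation
  std::map<std::pair<wchar_t, wchar_t>, int> subst;  // per-pair substitution overrides

  // Cost of replacing character a (in the source) by b (in the target).
  int substitution(wchar_t a, wchar_t b) const {
    if (a == b) return 0;
    auto it = subst.find(std::make_pair(a, b));
    return it != subst.end() ? it->second : basic;
  }
};

// Weighted string edit distance between s and t, computed by dynamic
// programming over two rows of the usual distance table.
int weighted_sed(const std::wstring &s, const std::wstring &t,
                 const sed_costs &c) {
  std::vector<int> prev(t.size() + 1), cur(t.size() + 1);
  for (std::size_t j = 0; j <= t.size(); ++j)
    prev[j] = static_cast<int>(j) * c.basic;         // j insertions
  for (std::size_t i = 1; i <= s.size(); ++i) {
    cur[0] = static_cast<int>(i) * c.basic;          // i deletions
    for (std::size_t j = 1; j <= t.size(); ++j) {
      int del = prev[j] + c.basic;
      int ins = cur[j - 1] + c.basic;
      int sub = prev[j - 1] + c.substitution(s[i - 1], t[j - 1]);
      cur[j] = std::min({del, ins, sub});
    }
    std::swap(prev, cur);
  }
  return prev[t.size()];
}
```

With the default table, turning "casa" into "cosa" costs 1. Raising basic to 2 makes the same pair cost 2, which is roughly the effect one would expect from changing the basic operation cost. An override for a single substitution pair changes only the distances that use that pair.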

The automaton file may have the extension .src or .bin. If the extension is .src, the file is interpreted as a text file with one word per line, and the FSM is built to recognize the vocabulary contained in the file.

If the extension is .bin, the file is interpreted as a binary libfoma FSM. Compiling such a binary file requires the FOMA command-line front-end, which is not included in FreeLing: you will need to install FOMA if you want to create binary FSM files. See http://code.google.com/p/foma for details.

A cost matrix for SED operations may be specified only for text FSMs (i.e., for .src files). To use a cost matrix with a .bin file, compile it into the automaton using the FOMA front-end.

The format of the cost matrix must comply with FOMA formats. See the FOMA documentation, or the examples provided in data/common/alternatives in the FreeLing tarball.

The method get_similar_words receives a string and fills the given list with entries of the FSM vocabulary, sorted by string edit distance to the input string.
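The expected shape of the results can be sketched with a brute-force stand-in that scans a whole vocabulary instead of traversing an automaton. The functions edit_distance and similar_words are hypothetical and do not use libfoma. The sketch assumes that the integer in each result pair is the edit distance, that the cutoff is inclusive, and that all operations cost 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Plain Levenshtein distance: every operation costs 1.
int edit_distance(const std::wstring &s, const std::wstring &t) {
  std::vector<int> prev(t.size() + 1), cur(t.size() + 1);
  for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= s.size(); ++i) {
    cur[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= t.size(); ++j) {
      int sub = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
    }
    std::swap(prev, cur);
  }
  return prev[t.size()];
}

// Hypothetical brute-force stand-in for the search. Appends to 'results'
// at most 'max_matches' vocabulary words whose distance to 'form' is at
// most 'cutoff' (inclusive, by assumption), closest first.
void similar_words(const std::vector<std::wstring> &vocabulary,
                   const std::wstring &form, int cutoff,
                   std::size_t max_matches,
                   std::list<std::pair<std::wstring, int> > &results) {
  typedef std::pair<std::wstring, int> match;
  std::vector<match> found;
  for (const std::wstring &w : vocabulary) {
    int d = edit_distance(form, w);
    if (d <= cutoff) found.push_back(std::make_pair(w, d));
  }
  // Stable sort keeps vocabulary order among words at the same distance.
  std::stable_sort(found.begin(), found.end(),
                   [](const match &a, const match &b) {
                     return a.second < b.second;
                   });
  if (found.size() > max_matches)
    found.erase(found.begin() + max_matches, found.end());
  results.insert(results.end(), found.begin(), found.end());
}
```

The cutoff and max_matches arguments play the roles that set_cutoff_threshold and set_num_matches play in the class: the first bounds the distance of accepted words, the second bounds how many of them are reported.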

Lluís Padró 2013-09-09