Language Identifier Options File

The language identifier options file is divided into three sections: <Languages>, <Threshold>, and <ScaleFactor>, which are closed by </Languages>, </Threshold>, and </ScaleFactor>, respectively.

Section <Languages> contains a list of filenames, one per line. Each file contains a language model (generated with the train_language method). The filenames may be absolute or relative. Relative filenames are resolved against the location of the identifier options file.

Sections <Threshold> and <ScaleFactor> each contain a single line consisting of a real number.

The identifier uses a 4-gram visible Markov model to compute the probability of the text in each candidate language. Since the probability of a sequence depends on its length, the result is divided by the text length to obtain a per-character "averaged" probability. Even so, the resulting probability is usually low and unintuitive. The ScaleFactor parameter multiplies this result to enlarge the difference between languages and to give probabilities on a more human scale. The Threshold parameter states the minimum value that a language must achieve to be considered a possible result. If no language reaches the threshold, the identify_language method will return none.

Note that this scaling is artificial and doesn't change the results; it only makes them more readable. The results with ScaleFactor=1.0 and Threshold=0.04 would be the same as with ScaleFactor=5.0 and Threshold=0.2.
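
The equivalence of the two settings can be checked with a small Python sketch. It is only an illustration, not the identifier's actual code: the function names, the made-up per-character scores, the use of ">=" for "reaches the threshold", and returning the string "none" are all assumptions.

```python
def accepted_languages(per_char_scores, scale_factor, threshold):
    """Return the languages whose scaled score reaches the threshold, best first.

    per_char_scores maps a language code to a per-character averaged
    probability (hypothetical values; the real ones come from the model).
    """
    scaled = {lang: score * scale_factor for lang, score in per_char_scores.items()}
    # Assumption: a language "reaches" the threshold when scaled score >= threshold.
    passing = [lang for lang, score in scaled.items() if score >= threshold]
    return sorted(passing, key=lambda lang: scaled[lang], reverse=True)


def best_language(per_char_scores, scale_factor, threshold):
    """Return the best accepted language, or "none" if no language qualifies."""
    ranked = accepted_languages(per_char_scores, scale_factor, threshold)
    return ranked[0] if ranked else "none"


# Made-up scores, for illustration only.
scores = {"es": 0.05, "ca": 0.045, "it": 0.03, "pt": 0.02}

unscaled = accepted_languages(scores, scale_factor=1.0, threshold=0.04)
scaled = accepted_languages(scores, scale_factor=5.0, threshold=0.2)
print(unscaled, scaled)  # both lists are ['es', 'ca']
```

Multiplying every score and the threshold by the same positive factor preserves both the ranking and the set of accepted languages, which is why only the ratio Threshold/ScaleFactor affects the outcome.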

An example of a language identifier options file is:

   <Languages>
   ./es.dat
   ./ca.dat
   ./it.dat
   ./pt.dat
   </Languages>
   <Threshold>
   0.2
   </Threshold>
   <ScaleFactor>
   5.0
   </ScaleFactor>
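
To make the layout concrete, a file like the one above could be read with the following Python sketch. This is not the library's own parser: the function name parse_options, the returned dictionary keys, and the base_dir argument (standing in for the directory of the options file) are hypothetical, and the sketch assumes well-formed input with all three sections present.

```python
from pathlib import Path

EXAMPLE = """
   <Languages>
   ./es.dat
   ./ca.dat
   ./it.dat
   ./pt.dat
   </Languages>
   <Threshold>
   0.2
   </Threshold>
   <ScaleFactor>
   5.0
   </ScaleFactor>
"""


def parse_options(text, base_dir="."):
    """Parse the three sections of an identifier options file (sketch)."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("</") and line.endswith(">"):
            current = None  # closing tag ends the current section
        elif line.startswith("<") and line.endswith(">"):
            current = line[1:-1]  # opening tag starts a new section
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    base = Path(base_dir)
    # Relative model filenames are resolved against the options file location.
    models = [p if p.is_absolute() else base / p
              for p in map(Path, sections.get("Languages", []))]
    return {
        "languages": models,
        "threshold": float(sections["Threshold"][0]),
        "scale_factor": float(sections["ScaleFactor"][0]),
    }


options = parse_options(EXAMPLE, base_dir="/opt/config")
print(options["languages"], options["threshold"], options["scale_factor"])
```

Passing the directory of the options file as base_dir mirrors the rule that relative filenames are interpreted relative to that file rather than to the current working directory.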

Lluís Padró 2013-09-09