The language identifier options file is divided into three sections: <Languages>, <Threshold>, and <ScaleFactor>, which are closed by </Languages>, </Threshold>, and </ScaleFactor>, respectively.
Section <Languages> contains a list of filenames, one per line. Each file contains a language model (generated with the train_language method). The filenames may be absolute or relative. If relative, they are resolved relative to the location of the identifier options file.
Sections <Threshold> and <ScaleFactor> each contain a single line consisting of a real number.
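As an illustration of the layout described above, the following is a minimal Python sketch of how such a file could be read. It is not FreeLing's own parser: the function name parse_ident_options, the returned dictionary keys, and the base_dir parameter are hypothetical, and the sketch assumes each section appears exactly once.

```python
import os
import re


def parse_ident_options(text, base_dir="."):
    """Parse the <Languages>, <Threshold> and <ScaleFactor> sections.

    `text` is the content of the options file; `base_dir` is the directory
    holding that file, used to resolve relative model filenames.
    (Hypothetical helper, not part of FreeLing.)
    """
    sections = {}
    for name in ("Languages", "Threshold", "ScaleFactor"):
        match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        if match is None:
            raise ValueError(f"missing section <{name}>")
        # Keep non-empty lines only, stripped of surrounding whitespace.
        sections[name] = [ln.strip() for ln in match.group(1).splitlines() if ln.strip()]

    # Relative filenames are taken relative to the options file location.
    models = [
        p if os.path.isabs(p) else os.path.normpath(os.path.join(base_dir, p))
        for p in sections["Languages"]
    ]
    return {
        "models": models,
        "threshold": float(sections["Threshold"][0]),
        "scale_factor": float(sections["ScaleFactor"][0]),
    }


sample = """<Languages>
./es.dat
./ca.dat
</Languages>
<Threshold>
0.2
</Threshold>
<ScaleFactor>
5.0
</ScaleFactor>
"""
options = parse_ident_options(sample, base_dir="/opt/cfg")
print(options)
```

The base directory is passed explicitly so that the same text can be parsed regardless of the current working directory.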
The identifier uses a 4-gram visible Markov model to compute the probability of the text in each candidate language. Since the probability of a sequence depends on its length, the result is divided by the text length to obtain a per-character "averaged" probability. Even so, the resulting probability is usually low and unintuitive. The ScaleFactor parameter multiplies this result to enlarge the difference between languages and to give probabilities on a more human scale. The Threshold parameter states the minimum value that a language must achieve to be considered a possible result. If no language reaches the threshold, the identify_language method will return none.
Note that this scaling is artificial and does not change the results; it only makes them more readable. The results with ScaleFactor=1.0 and Threshold=0.04 would be the same as with ScaleFactor=5.0 and Threshold=0.2.
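The equivalence between the two settings can be checked with a small Python sketch. It is only an illustration under stated assumptions: the per-character averaged probabilities are made-up values, the function identify is hypothetical, and the comparison is assumed to be "greater than or equal to the threshold", which the text above does not specify.

```python
def identify(avg_probs, scale_factor, threshold):
    """Return the best-scoring language, or "none" if it misses the threshold.

    `avg_probs` maps a language code to its per-character averaged
    probability (illustrative values, not real model output).
    """
    scaled = {lang: p * scale_factor for lang, p in avg_probs.items()}
    best = max(scaled, key=scaled.get)
    # Assumption: a language qualifies when its scaled score is >= threshold.
    return best if scaled[best] >= threshold else "none"


clear_text = {"es": 0.05, "ca": 0.03, "it": 0.01}
unclear_text = {"es": 0.02, "ca": 0.01}

# Both parameter settings accept and reject the same inputs.
print(identify(clear_text, 1.0, 0.04), identify(clear_text, 5.0, 0.2))
print(identify(unclear_text, 1.0, 0.04), identify(unclear_text, 5.0, 0.2))
```

Multiplying both the scores and the threshold by the same positive factor leaves every comparison unchanged, which is why only the readability of the reported values differs.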
An example of a language identifier option file is:
<Languages>
./es.dat
./ca.dat
./it.dat
./pt.dat
</Languages>
<Threshold>
0.2
</Threshold>
<ScaleFactor>
5.0
</ScaleFactor>
Lluís Padró 2013-09-09