This section presents the options that can be given to the analyzer program (and thus, also to the analyzer_server program and to the analyze script). Options marked N/A in the configuration file column are available only on the command line; all others can be written in the configuration file as well as on the command line. The latter always takes precedence over the former.
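The precedence rule can be pictured as a dictionary merge. The following Python sketch is only an illustration, not FreeLing code: the function name effective_options is hypothetical, and it assumes both sources have already been parsed and normalized to the configuration-file option names.

```python
def effective_options(config_file, command_line):
    """Merge two option dicts; command-line values override config-file ones."""
    merged = dict(config_file)      # start from the configuration file
    merged.update(command_line)     # the command line takes precedence
    return merged


# Hypothetical values, keyed by configuration-file option names.
from_file = {"Lang": "es", "TraceLevel": 0}
from_cli = {"Lang": "ca"}
print(effective_options(from_file, from_cli))
# {'Lang': 'ca', 'TraceLevel': 0}
```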
Command line | Configuration file |
-h , --help , --help-cf |
N/A |
Prints a help screen with valid options to stdout and exits.
--help provides information about command line options.
--help-cf provides information about configuration file options.
Command line | Configuration file |
-v , --version |
N/A |
Prints the version number of the currently installed FreeLing library.
Command line | Configuration file |
-f <filename> |
N/A |
Specify configuration file to use (default: analyzer.cfg).
Command line | Configuration file |
--server |
ServerMode=(yes|y|on|no|n|off) |
Activate server mode. Requires that option --port is also provided. Default value is off.
Command line | Configuration file |
-p <int> , --port <int> |
ServerPort=<int> |
Specify port where server will be listening for requests. This option must be specified if server mode is active, and it is ignored if server mode is off.
Command line | Configuration file |
-w <int> , --workers <int> |
ServerMaxWorkers=<int> |
Specify maximum number of active workers that the server will launch. Each worker attends a client, so this is the maximum number of clients that are simultaneously attended. This option is ignored if server mode is off.
Default value is 5. Note that a high number of simultaneous workers will result in forking that many processes, which may overload the CPU and memory of your machine, resulting in a system collapse.
When the maximum number of workers is reached, new incoming requests are queued until a worker finishes.
Command line | Configuration file |
-q <int> , --queue <int> |
ServerQueueSize=<int> |
Specify maximum number of pending clients that the server socket can hold. This option is ignored if server mode is off.
Pending clients are requests waiting for a worker to be available. They are queued in the operating system socket queue.
Default value is 32. Note that the operating system has an internal limit for the socket queue size (e.g. modern Linux kernels set it to 128). If the given value is higher than the operating system limit, it will be ignored.
When the pending queue is full, new incoming requests get a connection error.
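The interplay between the maximum number of workers and the pending queue can be sketched as follows. This is a simplified model, not the server's implementation: the function classify_clients is hypothetical, it assumes all clients arrive at once and no worker finishes in the meantime, and it does not model the operating system limit on the socket queue. The default values 5 and 32 are the ones given above.

```python
def classify_clients(n_clients, max_workers=5, queue_size=32):
    """Return (attended, queued, refused) for n_clients simultaneous requests.

    Simplifying assumption: no worker finishes while the requests arrive.
    """
    attended = min(n_clients, max_workers)          # one worker per client
    queued = min(n_clients - attended, queue_size)  # wait for a free worker
    refused = n_clients - attended - queued         # get a connection error
    return attended, queued, refused


print(classify_clients(3))    # (3, 0, 0)
print(classify_clients(40))   # (5, 32, 3)
```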
Command line | Configuration file |
-l <int> , --tlevel <int> |
TraceLevel=<int> |
Set the trace level (0 = no trace, higher values = more trace), for debugging purposes.
This will work only if the library was compiled with tracing information, using ./configure --enable-traces. Note that the code with tracing information is slower than the code compiled without it, even when traces are not active.
Command line | Configuration file |
-m <mask> , --tmod <mask> |
TraceModule=<mask> |
Specify modules to trace. Each module is identified with a hexadecimal flag. All flags may be OR-ed to specify the set of modules to be traced.
Valid masks are defined in file src/include/freeling/morfo/traces.h, and are the following:
Module | Mask |
Splitter | 0x00000001 |
Tokenizer | 0x00000002 |
Morphological analyzer | 0x00000004 |
Options management | 0x00000008 |
Number detection | 0x00000010 |
Date identification | 0x00000020 |
Punctuation detection | 0x00000040 |
Dictionary search | 0x00000080 |
Affixation rules | 0x00000100 |
Multiword detection | 0x00000200 |
Named entity detection | 0x00000400 |
Probability assignment | 0x00000800 |
Quantities detection | 0x00001000 |
Named entity classification | 0x00002000 |
Automata (abstract) | 0x00004000 |
Sense annotation | 0x00010000 |
Chart parser | 0x00020000 |
Parser grammar | 0x00040000 |
Dependency parser | 0x00080000 |
Coreference resolution | 0x00100000 |
Utilities | 0x00200000 |
Word sense disambiguation | 0x00400000 |
Orthographic correction | 0x00800000 |
Database storage | 0x01000000 |
Feature extraction | 0x02000000 |
Language identifier | 0x04000000 |
Omlet | 0x08000000 |
Phonetics | 0x10000000 |
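As an illustration of how flags from the table above combine, the following Python sketch ORs three of them into a single mask. The dictionary keys and the function name trace_mask are hypothetical labels chosen for this example; only the numeric values come from the table. The sketch computes the value only and makes no claim about the exact literal syntax the option accepts.

```python
# Subset of the module flags listed in the table above.
TRACE_FLAGS = {
    "tokenizer": 0x00000002,
    "dictionary_search": 0x00000080,
    "affixation_rules": 0x00000100,
}


def trace_mask(*modules):
    """OR together the flags of the requested modules."""
    mask = 0
    for name in modules:
        mask |= TRACE_FLAGS[name]
    return mask


mask = trace_mask("tokenizer", "dictionary_search", "affixation_rules")
print("0x%08x" % mask)   # 0x00000182
```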
Command line | Configuration file |
--lang <language> |
Lang=<language> |
Code for language of input text. Though it is not required, the convention is to use two-letter ISO codes (as: Asturian, es: Spanish, ca: Catalan, en: English, cy: Welsh, it: Italian, gl: Galician, pt: Portuguese, ru: Russian, old-es: old Spanish).
Other languages may be added to the library. See chapter 7 for details.
Command line | Configuration file |
--locale <locale> |
Locale=<locale> |
Locale to be used to interpret both input text and data files.
Usually, the value will match the locale of the Lang option (e.g. es_ES.utf8 for Spanish, ca_ES.utf8 for Catalan, etc.). The values default (stands for en_US.utf8) and system (stands for currently active system locale) may also be used.
Command line | Configuration file |
--flush , --noflush |
AlwaysFlush=(yes|y|on|no|n|off) |
When this option is inactive (the most usual choice), the sentence splitter buffers lines until a sentence marker is found. Then, it outputs a complete sentence.
When this option is active, the splitter never buffers any token, and considers each newline as a sentence end, thus processing each line as an independent sentence.
Command line | Configuration file |
--inpf <string> |
InputFormat=<string> |
Format of input data (plain, token, splitted, morfo, tagged, sense).
Command line | Configuration file |
--outf <string> |
OutputFormat=<string> |
Format of output data (token, splitted, morfo, tagged, shallow, parsed, dep).
Command line | Configuration file |
--train |
N/A |
When this option (only available on the command line) is specified, OutputFormat is forced to tagged and results are printed in the format:
word lemma tag # lemma1 tag1 lemma2 tag2 ...
that is, one word per line, with the selected lemma and tag as fields 2 and 3, a separator (#) and a list of all possible lemma-tag pairs for the word (including the selected one).
This format is expected by the training scripts. Thus, this option can be used to annotate a corpus, correct the output manually, and use it to retrain the taggers with the script src/utilities/train-tagger/bin/TRAIN.sh provided in FreeLing package. See src/utilities/train-tagger/README for details about how to use it.
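When correcting such output manually, it may help to read the lines programmatically. The following Python sketch parses one line of the format described above. It is not part of FreeLing: the function parse_train_line is hypothetical, it assumes fields are separated by whitespace, and the sample line (including its tags) is invented for illustration.

```python
def parse_train_line(line):
    """Split 'word lemma tag # lemma1 tag1 lemma2 tag2 ...' into its parts."""
    fields = line.split()
    if len(fields) < 4 or fields[3] != "#":
        raise ValueError("expected 'word lemma tag # ...': %r" % line)
    word, lemma, tag = fields[:3]
    rest = fields[4:]
    if len(rest) % 2 != 0:
        raise ValueError("candidate list must hold lemma-tag pairs: %r" % line)
    # Pair up consecutive fields: (lemma1, tag1), (lemma2, tag2), ...
    candidates = list(zip(rest[0::2], rest[1::2]))
    return word, lemma, tag, candidates


# Invented sample line, for illustration only.
sample = "casa casa NCFS000 # casa NCFS000 casar VMIP3S0"
print(parse_train_line(sample))
# ('casa', 'casa', 'NCFS000', [('casa', 'NCFS000'), ('casar', 'VMIP3S0')])
```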
Command line | Configuration file |
-I <filename> , --fidn <filename> |
N/A |
Configuration file for language identifier. See section 3.1 for details.
Command line | Configuration file |
--abrev <filename> |
TokenizerFile=<filename> |
File of tokenization rules. See section 3.2 for details.
Command line | Configuration file |
--fsplit <filename> |
SplitterFile=<filename> |
File of splitter options rules. See section 3.3 for details.
Command line | Configuration file |
--afx , --noafx |
AffixAnalysis=(yes|y|on|no|n|off) |
Whether to perform affix analysis on unknown words. Affix analysis applies a set of affixation rules to the word to check whether it is a derived form of a known word.
Command line | Configuration file |
-S <filename> , --fafx <filename> |
AffixFile=<filename> |
Affix rules file. See section 3.9.2 for details.
Command line | Configuration file |
--usr , --nousr |
UserMap=(yes|y|on|no|n|off) |
Whether to apply a file of customized word-tag mappings.
Command line | Configuration file |
-M <filename> , --fmap <filename> |
UserMapFile=<filename> |
User Map file to be used. See section 3.7 for details.
Command line | Configuration file |
--loc , --noloc |
MultiwordsDetection=(yes|y|on|no|n|off) |
Whether to perform multiword detection. This option requires that a multiword file is provided.
Command line | Configuration file |
-L <filename> , --floc <filename> |
LocutionsFile=<filename> |
Multiword definition file. See section 3.10 for details.
Command line | Configuration file |
--numb , --nonumb |
NumbersDetection=(yes|y|on|no|n|off) |
Whether to perform numerical expression detection. Deactivating this feature will affect the behaviour of date/time and ratio/currency detection modules.
Command line | Configuration file |
--dec <string> |
DecimalPoint=<string> |
Specify decimal point character for the number detection module (for instance, in English it is a dot, but in Spanish it is a comma).
Command line | Configuration file |
--thou <string> |
ThousandPoint=<string> |
Specify thousand point character for the number detection module (for instance, in English it is a comma, but in Spanish it is a dot).
Command line | Configuration file |
--punt , --nopunt |
PunctuationDetection=(yes|y|on|no|n|off) |
Whether to assign a PoS tag to punctuation signs.
Command line | Configuration file |
-F <filename> , --fpunct <filename> |
PunctuationFile=<filename> |
Punctuation symbols file. See section 3.6 for details.
Command line | Configuration file |
--date , --nodate |
DatesDetection=(yes|y|on|no|n|off) |
Whether to perform date and time expression detection.
Command line | Configuration file |
--quant , --noquant |
QuantitiesDetection=(yes|y|on|no|n|off) |
Whether to detect currency amounts, physical magnitudes, and ratios.
Command line | Configuration file |
-Q <filename> , --fqty <filename> |
QuantitiesFile=<filename> |
Quantity recognition configuration file. See section 3.12 for details.
Command line | Configuration file |
--dict , --nodict |
DictionarySearch=(yes|y|on|no|n|off) |
Whether to search word forms in the dictionary. Deactivating this feature also deactivates the AffixAnalysis option.
Command line | Configuration file |
-D <filename> , --fdict <filename> |
DictionaryFile=<filename> |
Dictionary database. See section 3.9 and chapter 7 for details.
Command line | Configuration file |
--prob , --noprob |
ProbabilityAssignment=(yes|y|on|no|n|off) |
Whether to compute a lexical probability for each tag of each word. Deactivating this feature will affect the behaviour of the PoS tagger.
Command line | Configuration file |
-P <filename> , --fprob <filename> |
ProbabilityFile=<filename> |
Lexical probabilities file. The probabilities in this file are used to compute the most likely tag for a word, as well as to estimate the likely tags for unknown words. See section 3.13 for details.
Command line | Configuration file |
-e <float> , --thres <float> |
ProbabilityThreshold=<float> |
Threshold that must be reached by the probability of a tag given the suffix of an unknown word in order to be included in the list of possible tags for that word. Default is zero (all tags are included in the list). A non-zero value (e.g. 0.0001, 0.001) is recommended.
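The effect of the threshold can be sketched as a simple filter. This Python fragment is an illustration under stated assumptions, not the library's code: the function possible_tags is hypothetical, and the tag names and probabilities are invented. It keeps every tag whose probability reaches the threshold, so a threshold of zero keeps all tags.

```python
def possible_tags(tag_probabilities, threshold=0.0):
    """Keep the tags whose probability reaches the threshold."""
    return [tag for tag, prob in tag_probabilities.items() if prob >= threshold]


# Invented probabilities of each tag given the suffix of an unknown word.
probs = {"NC": 0.6, "AQ": 0.39995, "VM": 0.00005}
print(possible_tags(probs))           # ['NC', 'AQ', 'VM']
print(possible_tags(probs, 0.0001))   # ['NC', 'AQ']
```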
Command line | Configuration file |
--ner [bio|basic|none] |
NERecognition=(bio|basic|none) |
Whether to perform NE recognition and which recognizer to use: "bio" for AdaBoost-based NER, "basic" for a simple heuristic NE recognizer, and "none" to perform no NE recognition. Deactivating this feature will cause the NE Classification module to have no effect.
Command line | Configuration file |
--ner , --noner |
NERecognition=(yes|y|on|no|n|off) |
Whether to perform NE recognition.
Command line | Configuration file |
-N <filename> , --fnp <filename> |
NPDataFile=<filename> |
Configuration data file for NE recognizer.
See section 3.11 for details.
Command line | Configuration file |
--nec , --nonec |
NEClassification=(yes|y|on|no|n|off) |
Whether to perform NE classification.
Command line | Configuration file |
--fnec <filename> |
NECFile=<filename> |
Configuration file for the Named Entity Classifier module.
See section 3.19 for details.
Command line | Configuration file |
--phon , --nophon |
Phonetics=(yes|y|on|no|n|off) |
Whether to add phonetic transcription to each word.
Command line | Configuration file |
--fphon <filename> |
PhoneticsFile=<filename> |
Configuration file for the phonetic encoding module.
See section 3.18 for details.
Command line | Configuration file |
-s <string> , --sense <string> |
SenseAnnotation=<string> |
Kind of sense annotation to perform, if any.
If active, the PoS tag selected by the tagger for each word is enriched with a list of all its possible WN synsets. The sense repository used depends on the options "Sense Annotation Configuration File" and "UKB Word Sense Disambiguator Configuration File" described below.
Command line | Configuration file |
-W <filename> , --fsense <filename> |
SenseConfigFile=<filename> |
Word sense annotator configuration file. See section 3.15 for details.
Command line | Configuration file |
-U <filename> , --fukb <filename> |
UKBConfigFile=<filename> |
UKB configuration file. See section 3.16 for details.
Command line | Configuration file |
-t <string> , --tag <string> |
Tagger=<string> |
Algorithm to use for PoS tagging.
Command line | Configuration file |
-H <filename> , --hmm <filename> |
TaggerHMMFile=<filename> |
Parameters file for HMM tagger. See section 3.17.1 for details.
Command line | Configuration file |
-R <filename> , --rlx <filename> |
TaggerRelaxFile=<filename> |
File containing the constraints to apply to solve the PoS tagging. See section 3.17.2 for details.
Command line | Configuration file |
-i <int> , --iter <int> |
TaggerRelaxMaxIter=<int> |
Maximum number of iterations to perform if relaxation does not converge.
Command line | Configuration file |
-r <float> , --sf <float> |
TaggerRelaxScaleFactor=<float> |
Scale factor to normalize supports inside the RL algorithm. It is comparable to the step length in a hill-climbing algorithm: the larger the scale factor, the smaller the step.
Command line | Configuration file |
--eps <float> |
TaggerRelaxEpsilon=<float> |
Real value used to determine when a relaxation labelling iteration has produced no significant changes. The algorithm stops when no weight has changed above the specified epsilon.
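The roles of the maximum number of iterations and of epsilon can be sketched with a generic iterate-until-stable loop. This Python fragment is not the relaxation labelling algorithm itself: the function iterate_until_stable and the toy update rule are hypothetical stand-ins, and the scale factor is not modelled. It only shows the two stop conditions: no weight changed by more than epsilon, or the iteration limit was reached.

```python
def iterate_until_stable(weights, update, max_iter, epsilon):
    """Apply update() until no weight changes above epsilon or max_iter is hit.

    Returns (weights, iterations_done, converged).
    """
    for iteration in range(1, max_iter + 1):
        new_weights = update(weights)
        largest_change = max(
            (abs(new - old) for new, old in zip(new_weights, weights)),
            default=0.0,
        )
        weights = new_weights
        if largest_change <= epsilon:
            return weights, iteration, True    # no significant change
    return weights, max_iter, False            # gave up without converging


def toy_update(ws):
    """Toy rule (an assumption): move each weight halfway towards 1.0."""
    return [w + (1.0 - w) / 2 for w in ws]


print(iterate_until_stable([0.0], toy_update, max_iter=500, epsilon=0.1))
# ([0.9375], 4, True)
print(iterate_until_stable([0.0], toy_update, max_iter=3, epsilon=0.1))
# ([0.875], 3, False)
```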
Command line | Configuration file |
--rtkcon , --nortkcon |
RetokContractions=(yes|y|on|no|n|off) |
Specifies whether the dictionary must retokenize contractions when found, or leave the decision to the TaggerRetokenize option.
Note that if this option is active, contractions will be retokenized even if the TaggerRetokenize option is not active. If this option is not active, contractions will be retokenized depending on the value of the TaggerRetokenize option.
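The interaction between RetokContractions and TaggerRetokenize can be summarized as a small truth table. The following Python sketch encodes one reading of the description above, namely that contractions end up retokenized when either option is active. The function names are hypothetical, and accepting only the six lowercase spellings listed for on/off options is an assumption about how values are matched.

```python
ACTIVE = {"yes", "y", "on"}
INACTIVE = {"no", "n", "off"}


def parse_switch(value):
    """Map one of the listed on/off spellings to a boolean."""
    if value in ACTIVE:
        return True
    if value in INACTIVE:
        return False
    raise ValueError("not a valid on/off value: %r" % value)


def contractions_retokenized(retok_contractions, tagger_retokenize):
    """True if contractions are retokenized under the reading described above."""
    by_dictionary = parse_switch(retok_contractions)
    by_tagger = parse_switch(tagger_retokenize)
    return by_dictionary or by_tagger


for rc in ("yes", "no"):
    for tr in ("yes", "no"):
        print(rc, tr, contractions_retokenized(rc, tr))
# yes yes True
# yes no True
# no yes True
# no no False
```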
Command line | Configuration file |
--rtk , --nortk |
TaggerRetokenize=(yes|y|on|no|n|off) |
Determine whether the tagger must perform retokenization after the appropriate analysis has been selected for each word. This is closely related to affix analysis and PoS taggers, see sections 3.9.2 and 3.17 for details.
Command line | Configuration file |
--force <string> |
TaggerForceSelect=(none|tagger|retok) |
Determine whether the tagger must be forced to (probably randomly) make a unique choice, and when.
See section 3.17 for more information.
Command line | Configuration file |
-G <filename> , --grammar <filename> |
GrammarFile=<filename> |
This file contains a CFG for the chart parser, and some directives to control which chart edges are selected to build the final tree. See section 3.20.1 for details.
Command line | Configuration file |
-T <filename> , --txala <filename> |
DepTxalaFile=<filename> |
Rules to be used to perform dependency analysis. See section 3.21.1 for details.
Command line | Configuration file |
--coref , --nocoref |
CoreferenceResolution=(yes|y|on|no|n|off) |
Whether to perform coreference resolution.
Command line | Configuration file |
-C <filename> , --fcorf <filename> |
CorefFile=<filename> |
Configuration file for coreference resolution module.
See section 3.22 for details.
Lluís Padró 2013-09-09