Feature extraction rules are defined in a .rgf file. This section describes the format of that file. The syntax of the rules is described in section 4.4.2.
Rules are grouped in packages. The beginning and end of a package are marked with the keywords RULES and ENDRULES. Packages help to simplify the rules and to speed up feature computation by avoiding computing the same features several times.
A line with the format TAGSET filename may precede the rule package definitions. The given filename is interpreted as a path relative to the location of the .rgf file, and must point to a tagset definition file (see section 4.1), which is used to obtain short versions of PoS tags. The TAGSET line is needed only if the short tag property t is used in some rule (see section 4.4.4 below).
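The relative-path interpretation described above can be illustrated with a minimal Python sketch. The function name resolve_tagset_path and the example file names are hypothetical and only serve the illustration; this is not the library's own loading code.

```python
from pathlib import Path

def resolve_tagset_path(rgf_path, tagset_filename):
    """Resolve the TAGSET filename against the directory holding the .rgf file."""
    # Hypothetical helper: the path is taken relative to the .rgf location,
    # not relative to the current working directory.
    return Path(rgf_path).parent / tagset_filename

# Example file names are assumptions for illustration only.
print(resolve_tagset_path("models/en/features.rgf", "tagset.dat").as_posix())
```

Under these assumptions, a TAGSET line naming tagset.dat inside models/en/features.rgf would refer to models/en/tagset.dat.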
The RULES keyword that starts a package must be followed by a condition (see section 4.4.4).
The rules in a package will only be applied to those words matching the package condition, thus avoiding unnecessary tests.
For instance, the rules in the package:
RULES t matches ^NP
   ...
ENDRULES
will be applied only to words whose PoS tag (t) starts with NP. The same result could be obtained without the package by adding the same condition to each rule, but then the applicability of every rule would have to be tested on every word, resulting in a higher computational cost.
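The saving provided by a package condition can be sketched in Python as follows. This is only an illustration of the idea, not the actual implementation: the function apply_package, the rule names capitalized and short_form, and the PoS tags are all hypothetical.

```python
import re

def apply_package(words, condition, rules):
    """Apply the rules of one package only to words matching its condition."""
    features = {}
    for i, word in enumerate(words):
        if not condition(word):
            continue  # a single test skips every rule in the package for this word
        features[i] = [name for name, test in rules if test(word)]
    return features

# Words as (form, PoS tag) pairs; the tags are illustrative.
words = [("John", "NP00000"), ("sleeps", "VBZ"), ("Paris", "NP00000")]

# Package condition equivalent to "t matches ^NP".
is_np = lambda w: re.search(r"^NP", w[1]) is not None

# Hypothetical rules: (feature name, test on the word).
rules = [("capitalized", lambda w: w[0][0].isupper()),
         ("short_form", lambda w: len(w[0]) <= 4)]

print(apply_package(words, is_np, rules))
```

In this sketch the word sleeps is discarded with one test, whereas without the package each of its rules would have to repeat the tag check.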
The package condition may be ALL. In this case, the rules contained in the package will be checked for all words in the sentence. This condition also has an extra effect: the features extracted by rules in this package are cached, in order to avoid repeating computations when a rule uses a window to get features from neighbour words.
For instance, the rule:
RULES ALL
   punct_mark@ [-2,2] t matches ^F
ENDRULES
will generate, for each word, features indicating which of the words within two positions to its left or right are punctuation symbols.
With this rule applied to the sentence "Hi ! , said John .", the word said would get the features punct_mark@-1, punct_mark@-2, and punct_mark@2. The word John would get the features punct_mark@-2 and punct_mark@1. Since the package has condition ALL, the features are computed once per word and then reused (that is, the fact that the comma is a punctuation sign will be checked only once, regardless of the size of the sentence and the size of the windows in the rules).
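The windowed rule and its caching can be reproduced with a short Python sketch. It is a simplified model written for this example rather than the library's code: the function window_features is hypothetical, the PoS tags are illustrative (punctuation tags are assumed to start with F), and the window is assumed to include offset 0.

```python
import re

def window_features(tags, name, pattern, lo, hi):
    """Emit name@offset for each neighbour in [lo, hi] whose tag matches pattern."""
    # Cache: the condition is evaluated exactly once per word,
    # then reused by every window that covers that word.
    cache = [re.search(pattern, t) is not None for t in tags]
    result = []
    for i in range(len(tags)):
        feats = []
        for off in range(lo, hi + 1):
            j = i + off
            if 0 <= j < len(tags) and cache[j]:
                feats.append(f"{name}@{off}")
        result.append(feats)
    return result

words = ["Hi", "!", ",", "said", "John", "."]
tags = ["UH", "Fat", "Fc", "VBD", "NP", "Fp"]  # illustrative tags

feats = window_features(tags, "punct_mark", r"^F", -2, 2)
for word, f in zip(words, feats):
    print(word, f)
```

For said and John this sketch yields the same feature sets as listed above, and the match against ^F is performed six times in total, once per word, whatever the window size.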
Lluís Padró 2013-09-09