Feature Extraction Rule File

Feature extraction rules are defined in a .rgf file. This section describes the format of the file. The syntax of the rules is described in section 4.4.2.

Rules are grouped in packages. The beginning and end of a package are marked with the keywords RULES and ENDRULES. Packages simplify the rules and speed up feature computation, since they avoid computing the same features several times.

A line of the form TAGSET filename may precede the package definitions. The given filename is interpreted as a path relative to the location of the .rgf file, and points to a tagset definition file (see section 4.1) that is used to obtain short versions of PoS tags. The TAGSET line is needed only if some rule uses the short tag property t (see section 4.4.4 below).

The RULES package starting keyword must be followed by a condition (see section 4.4.4).

The rules in a package are applied only to words that match the package condition, thus avoiding unnecessary tests.
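The file layout described so far (an optional TAGSET line followed by RULES ... ENDRULES packages) can be illustrated with a small reader. This is only a sketch, not the parser FreeLing uses: the function name parse_rgf is hypothetical, rule lines are kept as raw text, and details such as comment handling are ignored.

```python
import os

def parse_rgf(text, rgf_path):
    """Split .rgf text into an optional tagset path and a list of packages.

    Hypothetical helper for illustration only; rule lines are not interpreted.
    """
    tagset = None
    packages = []      # list of (condition, [raw rule lines])
    current = None     # package being read, or None when outside a package
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if current is None and line.startswith("TAGSET "):
            # The path is relative to the directory holding the .rgf file.
            base = os.path.dirname(rgf_path)
            name = line.split(None, 1)[1]
            tagset = os.path.normpath(os.path.join(base, name))
        elif line == "ENDRULES":
            if current is not None:
                packages.append(current)
            current = None
        elif line.startswith("RULES "):
            # Everything after the keyword is the package condition.
            current = (line.split(None, 1)[1], [])
        elif current is not None:
            current[1].append(line)
    return tagset, packages
```

Each returned package pairs its condition (for example ALL, or t matches ^NP) with the untouched text of its rules, so the condition can be tested once per word before any rule is looked at.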

For instance, the rules in the package:

 
RULES t matches ^NP
 ...
ENDRULES
will be applied only to words with a PoS tag (t) starting with NP. The same result could be obtained without the package by adding the same condition to each rule, but then every rule would need its own applicability test on every word, resulting in a higher computational cost.
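The saving can be made concrete with a short simulation. It is a sketch under stated assumptions: matches is taken to behave like a regular expression search, the two rules (capitalized, short) and the tags VB and Fp are hypothetical and chosen only for illustration, and apply_package is not a FreeLing function.

```python
import re

def apply_package(words, package_cond, rules):
    """Apply rules only to words whose tag matches the package condition.

    words: list of (form, tag) pairs.
    rules: list of (feature_name, regex_on_form) pairs (hypothetical rules).
    Returns the extracted features and the number of rule tests performed.
    """
    cond = re.compile(package_cond)
    features = {}
    rule_checks = 0
    for form, tag in words:
        if not cond.search(tag):
            # A single test per word rejects the whole package.
            continue
        feats = []
        for name, rule_cond in rules:
            rule_checks += 1
            if re.search(rule_cond, form):
                feats.append(name)
        features[form] = feats
    return features, rule_checks

# Illustrative input: two proper nouns, one verb, one punctuation mark.
words = [("John", "NP"), ("said", "VB"), ("Mary", "NP"), (".", "Fp")]
rules = [("capitalized", r"^[A-Z]"), ("short", r"^.{1,4}$")]
features, checks = apply_package(words, r"^NP", rules)
```

Here only the two NP words reach the rules, so four rule tests are run. Attaching the NP condition to each rule instead would test both rules against all four words.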

The package condition may be ALL. In this case, the rules contained in the package will be checked for all words in the sentence. This condition also has an extra effect: the features extracted by rules in this package are cached, to avoid repeating computations when a rule uses a window to get features from neighbour words.

For instance, the rule:

 
RULES ALL
 punct_mark@   [-2,2]   t matches ^F
ENDRULES
will generate, for each word, features indicating which of the two words to its left and the two words to its right are punctuation symbols.

With this rule applied to the sentence "Hi ! , said John .", the word said would get the features punct_mark@-1, punct_mark@-2, and punct_mark@2. The word John would get the features punct_mark@-2 and punct_mark@1. Since the package has condition ALL, the features are computed once per word and then reused (that is, the fact that the comma is a punctuation sign will be checked only once, regardless of the size of the sentence and the size of the windows in the rules).
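The example can be reproduced with a few lines of code. This is a minimal sketch of the idea, not FreeLing's implementation: window_features is a hypothetical name, the tags assigned to the words are illustrative (only the leading F of the punctuation tags matters), and offset 0 is assumed to be part of the window, which does not affect the two words discussed.

```python
import re

def window_features(tags, name, lo, hi, pattern):
    """Emit name@offset for each neighbour in [lo, hi] whose tag matches pattern."""
    cond = re.compile(pattern)
    # Test the condition once per word and reuse the result in every window,
    # mimicking the caching done for packages with condition ALL.
    hit = [cond.search(t) is not None for t in tags]
    out = []
    for i in range(len(tags)):
        feats = []
        for off in range(lo, hi + 1):
            j = i + off
            if 0 <= j < len(tags) and hit[j]:
                feats.append(f"{name}@{off}")
        out.append(feats)
    return out

# "Hi ! , said John ." with illustrative tags (assumed, not from a real tagger).
forms = ["Hi", "!", ",", "said", "John", "."]
tags = ["I", "Fat", "Fc", "VB", "NP", "Fp"]
feats = window_features(tags, "punct_mark", -2, 2, r"^F")
by_form = dict(zip(forms, feats))
```

With these inputs, by_form["said"] holds punct_mark@-2, punct_mark@-1 and punct_mark@2, and by_form["John"] holds punct_mark@-2 and punct_mark@1, matching the features listed above. The condition is evaluated six times in total (once per word), whatever the window size.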

Lluís Padró 2013-09-09