Multiword Definition File

The file lists the multiwords to be recognized, one per line. Each line has the format:
form lemma1 pos1 lemma2 pos2 ... [A|I]
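
As an illustration of this layout, the following Python sketch splits one definition line into its fields. It is not the parser used by the library: the names MultiwordEntry and parse_entry are hypothetical, and the sketch assumes that fields are separated by whitespace and that the final A/I field is always present.

```python
from typing import List, NamedTuple, Tuple


class MultiwordEntry(NamedTuple):
    """Hypothetical container for one line of the multiword file."""
    form: str                        # e.g. "<accidente>_de_trabajo"
    analyses: List[Tuple[str, str]]  # (lemma, tag) pairs, in file order
    ambiguous: bool                  # True for "A", False for "I"


def parse_entry(line: str) -> MultiwordEntry:
    """Split 'form lemma1 pos1 lemma2 pos2 ... A|I' into its parts.

    Assumption: fields are whitespace-separated and the last field
    is always present and is either "A" or "I".
    """
    fields = line.split()
    # form + at least one (lemma, tag) pair + flag => even count, >= 4
    if len(fields) < 4 or len(fields) % 2 != 0:
        raise ValueError(f"malformed multiword entry: {line!r}")
    form, flag = fields[0], fields[-1]
    if flag not in ("A", "I"):
        raise ValueError(f"last field must be A or I, got {flag!r}")
    middle = fields[1:-1]
    analyses = list(zip(middle[0::2], middle[1::2]))
    return MultiwordEntry(form, analyses, flag == "A")


entry = parse_entry("a_causa_de a_causa_de SPS00 I")
print(entry.form, entry.analyses, entry.ambiguous)
```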

The multiword form may contain lemmas in angle brackets, meaning that any form with that lemma will be considered a valid component for the multiword.

The form may also contain PoS tags. Any uppercase component in the form will be treated as a PoS tag.
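
The three kinds of component described so far (literal form, lemma in angle brackets, PoS tag) can be told apart mechanically. The sketch below is only an illustration: classify_components is a hypothetical name, and it assumes that components are joined by underscores and that no single component contains an underscore itself.

```python
from typing import List, Tuple


def classify_components(form: str) -> List[Tuple[str, str]]:
    """Label each component of a multiword form.

    Returns (kind, value) pairs where kind is "lemma", "tag" or "form".
    Assumption: components are separated by "_".
    """
    result = []
    for part in form.split("_"):
        if len(part) > 2 and part.startswith("<") and part.endswith(">"):
            # <lemma>: any word form with this lemma is accepted
            result.append(("lemma", part[1:-1]))
        elif part.isupper():
            # Uppercase component: treated as a PoS tag.
            # str.isupper() ignores digits, so "VMN0000" qualifies.
            result.append(("tag", part))
        else:
            # Anything else is a literal word form
            result.append(("form", part))
    return result


print(classify_components("<acabar>_de_VMN0000"))
# [('lemma', 'acabar'), ('form', 'de'), ('tag', 'VMN0000')]
```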

Any number of lemma-tag pairs may be assigned to the multiword. The PoS tagger will select the most probable one given the context, as with any other word.

For instance:

a_buenas_horas a_buenas_horas RG A
a_causa_de a_causa_de SPS00 I
<accidente>_de_trabajo accidente_de_trabajo $1:NC I
<acabar>_de_VMN0000 acabar_de_$L3 $1:VMI I
Z_<vez> TIMES:$L1 Zu I

The tag may be specified directly, or as a reference to the tag of one of the multiword components. In the example list above, the third specification (<accidente>_de_trabajo) builds a multiword from either of the forms accidente de trabajo or accidentes de trabajo. The multiword takes the first component's tag ($1) that starts with NC. This assigns the right singular or plural tag to the multiword, depending on whether the form was "accidente" or "accidentes".
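
A minimal sketch of how such a tag specification could be resolved is shown below. It is an illustration under stated assumptions, not the library's implementation: resolve_tag is a hypothetical name, each component is assumed to carry a list of candidate tags, component numbers are assumed to range from 1 to 9, and the tags NCMS000 and NCMP000 are used only as sample values.

```python
import re
from typing import List

# "$<n>:<prefix>" refers to the tag of component n that starts with <prefix>
_TAG_REF = re.compile(r"^\$([1-9]):(.+)$")


def resolve_tag(spec: str, component_tags: List[List[str]]) -> List[str]:
    """Return the tag(s) a multiword would receive for one tag field.

    component_tags[i] holds the candidate tags of component i+1
    (an assumption about how the data is available).
    """
    match = _TAG_REF.match(spec)
    if match is None:
        # Tag given directly, e.g. "RG" or "SPS00"
        return [spec]
    index = int(match.group(1)) - 1   # "$1" is the first component
    prefix = match.group(2)
    if index >= len(component_tags):
        raise ValueError(f"{spec!r} refers to a missing component")
    return [tag for tag in component_tags[index] if tag.startswith(prefix)]


# "accidentes de trabajo": the first word is plural, so is the multiword
print(resolve_tag("$1:NC", [["NCMP000"], ["SPS00"], ["NCMS000"]]))
# ['NCMP000']
```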

The lemma of the multiword may be specified directly, or as a reference to the form or lemma of one of the multiword components. In the example list above, the fourth specification (<acabar>_de_VMN0000) builds a multiword from phrases such as acabo de comer, acababa de salir, etc. The lemma will be acabar_de_XXX, where XXX is replaced with the lemma of the third multiword component ($L3).

Lemma replacement strings can be $F1, $F2, $F3, etc. to select the lowercased form of any component, or $L1, $L2, $L3, etc. to select the lemma of any component. Component numbers can range from 1 to 9.
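
The replacement rule can be sketched in a few lines of Python. This is an illustration only: resolve_lemma is a hypothetical name, components are assumed to be available as (form, lemma) pairs, and the sample forms and lemmas in the usage lines are made-up values.

```python
import re
from typing import List, Tuple

# $F1..$F9 select a lowercased form, $L1..$L9 select a lemma
_LEMMA_REF = re.compile(r"\$([FL])([1-9])")


def resolve_lemma(template: str, components: List[Tuple[str, str]]) -> str:
    """Expand $Fn / $Ln references in a multiword lemma template.

    components[i] is the (form, lemma) pair of component i+1.
    A reference to a missing component raises IndexError.
    """
    def substitute(match: "re.Match[str]") -> str:
        kind = match.group(1)
        form, lemma = components[int(match.group(2)) - 1]
        return form.lower() if kind == "F" else lemma

    return _LEMMA_REF.sub(substitute, template)


words = [("Acabo", "acabar"), ("de", "de"), ("comer", "comer")]
print(resolve_lemma("acabar_de_$L3", words))   # acabar_de_comer
print(resolve_lemma("$F1_de_$L3", words))      # acabo_de_comer
```

A template without references, such as a_causa_de, is returned unchanged.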

The last field states whether the multiword is ambiguous (A) or not (I) with respect to its segmentation, i.e. whether it may or may not be a multiword depending on the context. The multiword is built in either case, but the ambiguity information is stored in the word object, so the calling application can consult it and take the necessary decisions (e.g. un-glue the multiword) if needed.
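
The following sketch shows one way a calling application might act on that information. It does not use the library's API: the Word class and its fields are hypothetical stand-ins for the word object, and splitting every ambiguous multiword is a deliberately naive policy chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for the library's word object."""
    form: str
    ambiguous_mw: bool = False                           # the A/I field
    parts: List["Word"] = field(default_factory=list)    # empty if not a multiword


def unglue_ambiguous(sentence: List[Word]) -> List[Word]:
    """Replace each ambiguous multiword with its component words.

    A real application would look at the context before deciding;
    here every multiword marked as ambiguous is split.
    """
    result: List[Word] = []
    for word in sentence:
        if word.parts and word.ambiguous_mw:
            result.extend(word.parts)
        else:
            result.append(word)
    return result


sentence = [
    Word("a_buenas_horas", True, [Word("a"), Word("buenas"), Word("horas")]),
    Word("a_causa_de", False, [Word("a"), Word("causa"), Word("de")]),
]
print([w.form for w in unglue_ambiguous(sentence)])
# ['a', 'buenas', 'horas', 'a_causa_de']
```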

Lluís Padró 2013-09-09