10 Morpho-syntactic content


This section explains how to attach morpho-syntactic content to word forms, and how to define reusable tagsets that provide compact notations through tags and control the validity of this content.

The previous section explains how to enrich a document with morpho-syntactic annotations. However, it does not define the content of these annotations. What set of features and feature values should we use to express this content (within element wordForm), and with what meaning?

Such a set is usually referred to as a tagset, which specifies the content of possible annotations. However, the diversity of approaches and languages makes it almost impossible to propose a unique tagset. More modestly and pragmatically, the current proposal seeks to provide mechanisms to define tagsets by relying on a Data Category Registry (DCR) and Feature Structure Representation (FSR).

An annotated document will therefore be completed by either adding or referring to a tagset.

10.1 Using feature structures

A word form may be completed by morpho-syntactic content defining its linguistic nature and its grammatical function in its current context. This content is expressed using feature structures, following the recommendation of the ISO 24610 Part 1 document on “Feature Structure Representation” [FSR]. As a first approximation, a feature structure attaches one or several (possibly complex) values to linguistic properties (e.g., noun to part of speech, present to tense, indicative to mood, ...).

<comment>nice</comment>
<token id="t0">belle</token>
<wordForm entry="urn:lexicon:fr:beau" lemma="beau" tokens="t0">
 <fs>
  <f name="pos">
   <symbol value="adjective"/>
  </f>
  <f name="adj_type">
   <symbol value="qualifier"/>
  </f>
  <f name="gender">
   <symbol value="feminine"/>
  </f>
  <f name="number">
   <symbol value="singular"/>
  </f>
 </fs>
</wordForm>

The feature structure content attached to a word form may also provide additional information of interest about that word form.
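
As a rough illustration of how such a feature structure might be consumed, the following Python sketch reads the example above into a dictionary of feature names and values. The helper name features_of is hypothetical and is not part of MAF or FSR; only the standard library module xml.etree.ElementTree is used, and only simple symbol values are handled.

```python
import xml.etree.ElementTree as ET

# The annotation from the example above, embedded as a string.
WORD_FORM = """
<wordForm entry="urn:lexicon:fr:beau" lemma="beau" tokens="t0">
 <fs>
  <f name="pos"><symbol value="adjective"/></f>
  <f name="adj_type"><symbol value="qualifier"/></f>
  <f name="gender"><symbol value="feminine"/></f>
  <f name="number"><symbol value="singular"/></f>
 </fs>
</wordForm>
"""


def features_of(word_form):
    """Return {feature name: symbol value} for the <fs> child of a wordForm.

    Hypothetical helper: features whose value is not a plain <symbol>
    are skipped.
    """
    features = {}
    for f in word_form.findall("./fs/f"):
        symbol = f.find("symbol")
        if symbol is not None:
            features[f.get("name")] = symbol.get("value")
    return features


FEATURES = features_of(ET.fromstring(WORD_FORM))
print(FEATURES)
```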

10.2 Compact morpho-syntactic tags

The FSR proposal provides ways to represent feature structures compactly, by relying on libraries that name feature values and feature specifications (a feature specification being a pair formed by a feature and a value). These names may be used in the tag attribute of wordForm to get compact tags, following a standard practice in the NLP community.

<token id="t0">belle</token>
<wordForm
  tokens="t0"
  entry="urn:lexicon:fr:beau"
  tag="pos.adj adj_type.qual gender.fem num.sing"/>

The content of attribute tag should be similar to the content of attribute feats defined in FSR, namely a space-separated sequence of feature specification identifiers.

The libraries naming recurrent values and feature specifications are part of the tagset(s) coming with the annotated document.

10.2.1 FSR libraries

The generic way provided by FSR to use libraries is illustrated by the following example, with the attribute feats of element fs:

<comment>A feature value library</comment>
<fvLib n="French morpho values">
 <symbol xml:id="noun" value="noun"/>
 <symbol xml:id="sing" value="singular"/>
 <symbol xml:id="plu" value="plural"/>
 <symbol xml:id="masc" value="masculine"/>
 <symbol xml:id="fem" value="feminine"/>
</fvLib>
<comment>A feature specification library</comment>
<fLib>
 <f xml:id="pos.n" name="pos" fVal="noun"/>
 <f xml:id="num.s" name="number" fVal="sing"/>
 <f xml:id="num.p" name="number" fVal="plu"/>
 <f xml:id="gen.f" name="gender" fVal="fem"/>
 <f xml:id="gen.m" name="gender" fVal="masc"/>
</fLib>

With such a library, following FSR rules, one may write:

<wordForm lemma="prime_minister" tokens="t1">
 <fs feats="pos.n num.s gen.f"/>
</wordForm>

or, equivalently, by using attribute tag, one may write:

<wordForm tokens="t1 t2" lemma="prime_minister" tag="pos.n num.s gen.f"/>
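
To make the lookup concrete, here is a minimal Python sketch of how a tool might expand a compact tag through the two libraries above. The function names load_libraries and expand_tag are hypothetical, the libraries are wrapped in a tagset element only to obtain a single XML root, and only plain symbol values are handled; this is an illustration under those assumptions, not a resolution procedure prescribed by FSR or MAF.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# A reduced copy of the libraries above, wrapped in one root element.
TAGSET = """
<tagset>
 <fvLib n="French morpho values">
  <symbol xml:id="noun" value="noun"/>
  <symbol xml:id="sing" value="singular"/>
  <symbol xml:id="fem" value="feminine"/>
 </fvLib>
 <fLib>
  <f xml:id="pos.n" name="pos" fVal="noun"/>
  <f xml:id="num.s" name="number" fVal="sing"/>
  <f xml:id="gen.f" name="gender" fVal="fem"/>
 </fLib>
</tagset>
"""


def load_libraries(tagset):
    """Map each feature specification id to a (feature name, value) pair."""
    values = {s.get(XML_ID): s.get("value")
              for s in tagset.findall("./fvLib/symbol")}
    return {f.get(XML_ID): (f.get("name"), values[f.get("fVal")])
            for f in tagset.findall("./fLib/f")}


def expand_tag(tag, specs):
    """Expand a space-separated tag into {feature name: value}.

    Raises KeyError if the tag uses an identifier missing from the library.
    """
    return dict(specs[identifier] for identifier in tag.split())


SPECS = load_libraries(ET.fromstring(TAGSET))
EXPANDED = expand_tag("pos.n num.s gen.f", SPECS)
print(EXPANDED)
```

In this sketch, an identifier that is absent from the library raises a KeyError, which gives a simple way to detect tags that do not belong to the tagset.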

Disjunctive values are allowed by FSR and may also be simplified, following the same mechanism:

<comment>A feature value library</comment>
<tagset>
 <fvLib>
  <vAlt xml:id="first.third">
   <symbol value="first"/>
   <symbol value="third"/>
  </vAlt>
  <symbol xml:id="verb" value="verb"/>
  <symbol xml:id="sg" value="singular"/>
 </fvLib>
 <comment>A feature specification library</comment>
 <fLib>
  <f xml:id="pers.13" name="pers" fVal="first.third"/>
  <f xml:id="pos.v" name="pos" fVal="verb"/>
  <f xml:id="num.sg" name="number" fVal="sg"/>
 </fLib>
</tagset>
<comment>Annotated document</comment>
<token id="t0">porte</token>
<wordForm
  tokens="t0"
  entry="urn:lexicon:fr:porter"
  tag="pos.v pers.13 num.sg"/>
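
One way to read a tag that involves a disjunctive value is to enumerate the fully specified readings it stands for. The Python sketch below does this for the example above; the function names load_values and readings are hypothetical, and treating a vAlt as a flat list of symbol alternatives is a simplifying assumption rather than the full FSR semantics.

```python
import itertools
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

TAGSET = """
<tagset>
 <fvLib>
  <vAlt xml:id="first.third">
   <symbol value="first"/>
   <symbol value="third"/>
  </vAlt>
  <symbol xml:id="verb" value="verb"/>
  <symbol xml:id="sg" value="singular"/>
 </fvLib>
 <fLib>
  <f xml:id="pers.13" name="pers" fVal="first.third"/>
  <f xml:id="pos.v" name="pos" fVal="verb"/>
  <f xml:id="num.sg" name="number" fVal="sg"/>
 </fLib>
</tagset>
"""


def load_values(tagset):
    """Map each value id to the list of alternatives it names."""
    values = {}
    for node in tagset.find("fvLib"):
        if node.tag == "vAlt":
            # Assumption: alternatives are plain symbols.
            values[node.get(XML_ID)] = [s.get("value")
                                        for s in node.findall("symbol")]
        elif node.tag == "symbol":
            values[node.get(XML_ID)] = [node.get("value")]
    return values


def readings(tag, tagset):
    """List every fully specified reading denoted by a compact tag."""
    values = load_values(tagset)
    specs = {f.get(XML_ID): (f.get("name"), values[f.get("fVal")])
             for f in tagset.findall("./fLib/f")}
    names, alternatives = zip(*(specs[i] for i in tag.split()))
    return [dict(zip(names, combination))
            for combination in itertools.product(*alternatives)]


READINGS = readings("pos.v pers.13 num.sg", ET.fromstring(TAGSET))
for reading in READINGS:
    print(reading)
```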

10.3 Designing tagsets

The features, values, and possibly feature types used to specify morpho-syntactic content are not just labels but carry linguistic meanings, or, in other words, semantic content. To avoid misinterpretations, the semantic content attached to a feature, a value or a type should be clearly defined. The combination of features, values and types should also be controlled in order to avoid linguistically invalid combinations, such as using /neuter/ as a value for /gender/ in French, or using a feature /tense/ for nouns in most languages.

MAF does not try to define the semantic content of a unique, complete set of such features, values, and types. That would be an almost impossible task given the diversity of languages, and it would be equally impossible to assign to each component a meaning agreed on by the whole community.

Instead, it is proposed that an annotated document should be completed by including or referring to one or more tagsets.

The first objective of a tagset is to list the terminology used to annotate a document as a set of data categories whose meanings are precisely defined in a Data Category Registry, following the recommendation of the ISO 12620 proposal on “Data Category Registry”. The process may be seen as selecting a subset of morpho-syntactic data categories (Data Category Selection – DCS).

<tagset>
 <dcs local="genre" registered="dcs:morphosyntax:gender:fr" rel="eq"/>
 <dcs local="fem"
   registered="dcs:morphosyntax:gender:fr:feminine" rel="eq"/>

</tagset>

The correspondence with a registered data category may not be perfect. The rel attribute may be used to specify which relationship exists between the local and registered data categories. For instance, one may introduce a local data category /advneg/ as being subsumed by a more general registered data category /adverb/.

<dcs local="advneg" registered="dcs:morphosyntax:pos:adverb" rel="subs"/>
<dcs local="strange" rel="none"/>

It is also possible (but not advised) to introduce a local data category bearing no relationship with any registered data category.

<dcs local="title"/>

When the correspondence is imperfect or missing, a few words of description should be added to define the meaning of the local data category.

<dcs local="title">
 <description> A part of speech used to denote honorific titles like
   Pr. or S.A.S.
 </description>
</dcs>
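
The recommendation above lends itself to a simple automatic check. The following Python sketch lists the local data categories that lack a description although their correspondence with a registered category is imperfect or missing. The function name missing_descriptions is hypothetical, and the sketch assumes that only a registered category combined with rel="eq" counts as a perfect correspondence; MAF does not state what an absent rel attribute means.

```python
import xml.etree.ElementTree as ET

TAGSET = """
<tagset>
 <dcs local="genre" registered="dcs:morphosyntax:gender:fr" rel="eq"/>
 <dcs local="advneg" registered="dcs:morphosyntax:pos:adverb" rel="subs"/>
 <dcs local="title">
  <description>A part of speech used to denote honorific titles.</description>
 </dcs>
</tagset>
"""


def missing_descriptions(tagset):
    """Return local names that need a description but do not have one."""
    missing = []
    for dcs in tagset.findall("dcs"):
        # Assumption: only registered + rel="eq" is a perfect correspondence.
        exact = dcs.get("registered") is not None and dcs.get("rel") == "eq"
        described = (dcs.findtext("description") or "").strip() != ""
        if not exact and not described:
            missing.append(dcs.get("local"))
    return missing


MISSING = missing_descriptions(ET.fromstring(TAGSET))
print(MISSING)
```

Here /advneg/ is reported because it is only subsumed by /adverb/ and carries no description, while /title/ passes because its description is present.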

The second objective of a tagset is to specify the set of valid feature structures based on the selected data categories. This will be achieved by relying on the proposed ISO 24610 Part 2 on “Feature System Declaration” [FSD].

The third objective of a tagset is to name the most common morpho-syntactic structures through the use of FSR libraries, as seen in Section 10.2.1.
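
To give an idea of what the second objective (controlling validity) could look like in practice, the Python sketch below checks a feature structure against two small tables. Both tables and the function name violations are hypothetical stand-ins for a real declaration: the FSD format itself is not shown here, and the listed features and values are illustrative assumptions for French.

```python
# Hypothetical declaration standing in for an FSD: allowed values per feature.
ALLOWED_VALUES = {
    "pos": {"noun", "verb", "adjective"},
    "gender": {"masculine", "feminine"},  # no /neuter/ in French
    "number": {"singular", "plural"},
    "tense": {"present", "past", "future"},
}

# Hypothetical declaration: features allowed for each part of speech.
ALLOWED_FEATURES = {
    "noun": {"pos", "gender", "number"},
    "verb": {"pos", "number", "tense"},
    "adjective": {"pos", "gender", "number"},
}


def violations(fs):
    """Return a list of messages describing invalid parts of a feature structure."""
    problems = []
    for name, value in fs.items():
        if name not in ALLOWED_VALUES:
            problems.append(f"unknown feature /{name}/")
        elif value not in ALLOWED_VALUES[name]:
            problems.append(f"/{value}/ is not a valid value for /{name}/")
    allowed = ALLOWED_FEATURES.get(fs.get("pos"))
    if allowed is not None:
        for name in fs:
            if name not in allowed:
                problems.append(f"/{name}/ is not allowed for /{fs['pos']}/")
    return problems


print(violations({"pos": "noun", "gender": "neuter"}))
print(violations({"pos": "noun", "tense": "present"}))
```

The two calls correspond to the invalid combinations mentioned earlier in this section: /neuter/ as a value of /gender/ in French, and a /tense/ feature on a noun.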

10.4 Formal description: tagset

  • tagset the tagset to be used to check and interpret the annotations
    ref
  • dcs (Data Category Selection) The selection of Data Categories used to express the annotations
    local local name of the category
    registered registered name of the category in the ISO Data Category Registry
    rel (relationship) Relationship between the local meaning of a category and the registered one
    desc (description) Informal description of a Data Category
  • fvLib (feature-value library) assembles a library of reusable feature value elements (including complete feature structures).
  • fLib (feature library) assembles a library of feature elements.

The dcs element corresponds to a Data Category Selection part whose exact content is still to be defined.

The fsd element corresponds to a Feature System Declaration part whose normalization is yet to be done.




Copyright ISO 2007
Version rev4 -- This page generated on 2008-01-18T13:33:52+01:00