Language Resource Management -- Morpho-syntactic Annotation Framework (MAF)
Éric Villemonte de la Clergerie
rev4
Lionel Clément
December 2007
Warning
This document is not an ISO International Standard. It is
distributed for review and comment. It is subject to change
without notice and may not be referred to as an International
Standard.
Recipients of this document are invited to submit, with
their comments, notification of any relevant patent rights
of which they are aware and to provide supporting
documentation.
Copyright notice
This ISO document is a draft international standard and is
copyright-protected by ISO. Except as permitted under the
applicable laws of the user's country, neither this ISO
draft nor any extract from it may be reproduced, stored in a
retrieval system or transmitted in any form or by any means,
electronic, photocopying, recording or otherwise, without
prior written permission being secured.
Requests for permission to reproduce should be addressed to
ISO at the address below or ISO's member body in the country
of the requester:
Copyright Manager
ISO Central Secretariat
1 rue de Varembé
1211 Geneva 20
Switzerland
tel. +41 22 749 0111
fax +41 22 734 0179
e-mail central@iso.ch
Reproduction may be subject to royalty payments or a
licensing agreement.
Violators may be prosecuted.
Foreword
ISO (The International Organization for Standardization) is
a worldwide federation of national standards bodies (ISO
member bodies). The work of preparing International
Standards is normally carried out through ISO technical
committees. Each member body interested in a subject for
which a technical committee has been established has the
right to be represented on that committee. International
organizations, governmental and nongovernmental, in liaison
with ISO, also take part in the work. ISO collaborates
closely with the International Electrotechnical Commission
(IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the
rules given in the ISO/IEC Directives, Part 3. Draft
International Standard 24611 was prepared by Technical
Committee ISO/TC 37, Terminology and Other Language
Resources, Subcommittee SC 4, Language Resource Management.
All the Annexes are for information only.
Introduction
This standard provides a reference format for the
representation of morpho-syntactic annotations.
Scope
In Natural Language Resource Management, the
morpho-syntactic annotation phase assigns to each document
segment (either text or speech) one or more
tags providing morpho-syntactic
information about the part of speech (noun,
adjective, verb, ...), morphological and
grammatical features (such as number, gender,
person, mood, verbal tense, ...) and possibly other specific
linguistic properties. The morpho-syntactic annotations
attached to a segment do not refer to other segments or
annotations, even if the choice of an annotation may depend
on the surrounding context.
Normative references
ISO 8879: 1986 (SGML) as
extended by TC2 (ISO/IEC JTC 1/SC 34 N029: 1998-12-06) to
allow for XML
ISO 19757-2, Document Schema
Definition Language, part 2, to allow for RELAX NG
specifications. RELAX NG (REgular LAnguage for XML Next
Generation) is a schema language for XML that simplifies and
extends the features of DTDs (Document Type Definitions).
ISO 12620 on Data Category
Registry (DCR)
ISO 24610-1 on Feature
Structure Representation (FSR)
ISO 24610-2 on Feature System
Declaration (FSD)
ISO 24612 on Linguistic
Annotation Framework (LAF)
Text Encoding Initiative (TEI) – Chapters to be defined
Terms and definitions
For the purposes of this international standard, the terms
and definitions given in ISO
12620:200?, ISO
24610-1, ISO
24610-2, and the following apply:
relation by which a linguistic unit is associated with
other units. It is a virtual association which does not
require their effective presence and differs from a
paradigmatic relation in that the latter only refers to
linguistic units associated by substitutability.
data category whose content
is constrained by a list of permissible values which
comprise its conceptual domain
A typical closed data category might be
grammaticalNumber, which can have as
its content the values: singular,
plural or dual.
finite list of simple data
categories that may be the values of a complex
data category
result of the specification of a given data field or the
content of a closed data field
A data category is to be used as an elementary descriptor
in a linguistic structure or an annotation scheme.
Examples are: term,
definition, part of
speech and grammaticalGender. Data categories for the management of
lexical resources and terminology are comparable to data
element concepts in ISO/IEC
11179-3:2003.
graph with directed edges and no cycle
the assignment of a value to a feature. In MAF, a
feature shall denote a morpho-syntactic feature of
a linguistic unit, such as the mood or tense of a verb.
a set of feature
specifications, used in MAF to express
morpho-syntactic content.
finite set of transitions from state to state, with an
initial state and a final one. See also: DAG.
any sequence of letters, pictograms and numerals used to
write or pronounce a word
modification or marking of a word so that it reflects grammatical
(i.e. relational) information, such as grammatical
gender, tense, person, etc.
a table illustrating the forms of an inflected word
form that a word can take when used in a sentence or a
phrase
An inflected form of a word is associated with a
combination of morphological features, such as grammatical
number or case.
term often used in the NLP community to denote (with
some slight confusion with the notion of algebraic
lattice), a directed acyclic graph with an
initial node and a final node. See also: DAG, FSA.
conventional form chosen to represent a lexeme (e.g.,
the infinitive form for French verbs).
Fundamental unit, generally associated with a set of
forms sharing a common
meaning.
container for managing a set of forms and possibly one
or several meanings to describe a lexeme.
resource comprising a collection of lexical entries for a
language.
smallest linguistic unit bearing a signification in a
discourse and that cannot be divided into smaller
meaningful units. A morpheme is either grammatical
(grammeme) or lexical (lexeme).
category induced from the inflected form of a word
ISO 12620 provides a
comprehensive list of values for European languages.
An example of a morphological feature is:
grammaticalGender.
description comprising the lemmatized form or forms of a
word, plus additional information on its part
of speech data
categories, possibly its inflectional paradigm
or paradigms, and possibly its explicitly listed
inflected forms.
The term morpho-syntax is often used in place of
morphology as it describes such features as number,
gender, case etc. which are essential for syntactic
agreement.
an expression composed of an ordered group of words that
has properties that are not predictable from the
properties of the individual words or of their normal
mode of combination.
The group of words making up an MWE can be continuous
or discontinuous.
Examples are "father in law" or "to be over the moon", which mean
something different from what they appear to mean.
the field of study covering knowledge and techniques which
allow computerized processing of linguistic data. This field
combines a variety of skills including linguistics,
mathematical logic, statistics, and algorithms.
data category whose content
cannot be fully enumerated due to the organic nature of
language
Typical open data categories might include
term, lemma.
relation by which linguistic units in a discourse are
associated.
To an associative relation corresponds a feature, for
which the related entities share the same value. The
morpho-syntactic tag lists some of these features
(part-of-speech, grammatical category, etc.).
category assigned to a word based on its grammatical and
semantic properties
ISO 12620 provides a
comprehensive list of values for European languages.
Examples of such values are: noun
and verb.
non-empty contiguous discourse sequence identified as
such by a morpho-phonological analysis or an automatic
processing of the discourse.
This can involve the recognition of a regular or algebraic
language (matching of the separators), or a lexicological
analysis (recognition of roots, morphological derivation
and inflection, etc.).
the process identifying tokens
contiguous or non-contiguous entity from a speech or
text sequence identified as such in an associative relation. This
identification is the basis of morpho-syntactic tagging
(part-of-speech,
grammatical category, agreement feature, etc.).
Morpho-syntactic units may have no acoustic or graphic
realization, or correspond to one or more tokens.
transliteration from a
non-Latin script into a Latin script.
set of graphic characters used for the written form of
one or more languages (ISO/IEC
10646-1, 4.14)
data category that may be
the possible content of a closed data category, but
that cannot itself be further sub-divided
masculine, feminine,
and neuter are possible simple data
categories associated with the conceptual
domain of the closed data
category grammaticalGender
as it is associated with the German language.
form resulting from a coherent method of writing down
speech sounds
form resulting from the conversion of one writing system
into another
in the context of a given language, is a description
composed of at least a part of
speech and a lemmatized
form
The description can include more morphological
information and/or syntactic and semantic information. A
word is either a single word or
a multi-word expression.
part of speech
Key standards used by MAF
ISO 12620 Data Category Registry (DCR)
The designers of any specific MAF tagset shall use data
categories from the ISO 12620
DCR. The DCR is a set of data category specifications
defined by ISO 12620 and
maintained as a global resource by ISO TC 37 in compliance
with ISO/IEC 11179-3:2003.
Tagset creators can define a set of new data categories to
cover data category concepts that are needed and that are
not currently available in the DCR. The tagset creators
shall be responsible for negotiating the addition of the
new data categories to the DCR. This supplemental set of
data categories shall be represented and managed in
conformance with ISO 12620.
ISO 24610 Feature Structures (FSR and FSD)
Morpho-syntactic content shall be expressed using the
ISO 24610 part 1 document on
Feature Structure Representation and validated using the
future ISO 24610 part 2
companion document on Feature System Declaration.
OLAC Metadata
Metadata for MAF shall be expressed following the
recommendations and categories proposed by the Open
Language Archives Community (OLAC), as described in
the latest version of OLAC Metadata Standard
(http://www.language-archives.org/OLAC/metadata.html).
The OLAC Metadata Set includes the Dublin Core Metadata
Set with qualifiers.
Unified Modeling Language (UML)
MAF complies with the specifications and modeling
principles of UML as defined by OMG. MAF uses the subset of
UML that is described in Annex E.
General characteristics of MAF
Overview
In the Linguistic Community, morpho-syntactic annotations
provide an important layer of linguistic information in a
document, even if they do not cover the full range of
possible linguistic annotations. Other kinds of annotation
on references, discourse, prosody, or parsing may complete
morpho-syntactic annotations.
Syntax and semantics cannot be avoided in the definition
of parts of speech and of grammatical categories. For
instance, pronouns and substantives intrinsically carry a
reference to some entity; the tense or the aspect of verbs
indicate the temporal deixis; the person, modality and
other grammatical categories indicate the enunciation
context, etc. Therefore, it is not easy to provide an
exact and precise definition of what morpho-syntactic
annotations cover because they are strongly related to
many other linguistic properties of a given language in a
given context.
Nevertheless, the present proposal tries to delimit
minimal and maximal sequences in documents (either text or
speech) that can be identified as morpho-syntactic
units and tries to categorize the linguistic
properties that may be used to mark these units, within
some larger syntagmatic context. Minimal units cannot be
broken into sub-parts that could be identified by similar
morpho-syntactic criteria, but may however still be broken
into smaller units with morphological or phonological
properties. Morpho-syntactic units can be nested to form
maximal units (such as compound words) that act as
elementary units for other levels of linguistic analysis,
particularly parsing. The exact boundary between
morpho-syntax and parsing is sometimes difficult to define.
MAF Meta-Model
Figure 1 presents a simplified view of the proposed
meta-model for morpho-syntactic annotations, while
Figure 2 presents a more formal view based on UML. An
annotated document is formed by a raw original document
and a set of annotations. The annotations are carried by
word forms covering zero, one or more
segments or tokens of the original document.
A word form may reference a lexicon entry and provides
information about its underlying lemma and inflected form.
The morpho-syntactic content attached to a word form is
expressed by feature structures following the
guidelines of one or more tagsets. The
terminology or set of categories used in
tagsets is described with respect to registered data
categories. Because of structural
ambiguities, both tokens and word forms are organized into
one or more flows, materialized by lattices, or more
formally by Directed Acyclic
Graphs [DAGs]. The current proposal
addresses the representation of segments (through tokens),
word forms, morpho-syntactic content, tagsets, and
ambiguity. A MAF model is instantiated from the MAF
meta-model through the selection of a set of data
categories.
Segmenting with tokens
Morpho-syntactic annotations are carried by segments, called
tokens, present in the document flow, but this
does not imply that the resulting segmentation corresponds
to a sequence of adjacent segments partitioning the original
document. It is particularly important to distinguish the
morpho-syntactic units from their realizations. Some parts
of a document may carry no annotations (typographic marks,
stage directions, markup elements, ...); other parts may not
exactly correspond to their segmented form (abbreviations,
brachygraphies, typographic errors and variations,
typographic and morphological contractions, ...). Also, a
morpho-syntactic unit may not correspond to a segment
identified by typographic marks (such as white spaces or
hyphens), for instance for German compound words, speech
transcription, or Sanskrit writing.
The element token is used to represent
these segments of the original document that, roughly
speaking, follow typographical, morphological, or
phonological boundaries. The current proposal does not define
the linguistic properties of tokens. In different languages,
a token may be identified through typographic properties
(white-space, hyphens, characters, ...) and/or morphological
properties (radical, affix, morpheme, ...). The description of
the morphological, phonological or lexicological structures
that may define a token is not covered by the current
proposal.
Other typographical marks used to format pages or to
separate words and paragraphs, as well as encoding
information, do not belong to morpho-syntactic annotations
and are also not covered by this proposal, but rather by
TEI.
Standoff notation
The element token provides an
independence from the original document by providing a way
to reference intervals in documents. The attributes
from and to
are used to define such intervals. The content of these
attributes depends on the chosen addressing scheme used to
denote unambiguous document positions, and on the
nature of the original document.
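The interval mechanism described above can be illustrated with a minimal Python sketch. It assumes plain character offsets as the addressing scheme and an exclusive end position; both are assumptions made for illustration, since the proposal leaves the addressing scheme open. The class name StandoffToken and the field names start and end are hypothetical stand-ins for the attributes from and to (from is a reserved word in Python).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffToken:
    """A token that points into the raw document instead of containing text."""
    start: int  # stands in for the `from` attribute (assumed: character offset)
    end: int    # stands in for the `to` attribute (assumed: exclusive offset)

    def text(self, document):
        # Resolve the interval against the untouched original document.
        return document[self.start:self.end]


document = "Jean propose de partir"
tokens = [
    StandoffToken(0, 4),
    StandoffToken(5, 12),
    StandoffToken(13, 15),
    StandoffToken(16, 22),
]
surface = [t.text(document) for t in tokens]
print(surface)
```

Because the tokens only hold positions, the original document is never modified, and other annotation layers can address the same intervals independently.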
Embedding notation
It is not always necessary to separate the original
document from its annotations. For simple cases, textual
content may be directly embedded within
token.
The embedding notation will be used for most of the
provided examples for MAF but it should be noted that the
use of this notation is not recommended. A first reason is
that the morpho-syntactic annotations may conflict with
other annotations. A second reason is that the content of
the textual material separating the textual content
embedded within token is not precisely
defined (white-space, newlines, no space, hyphen, ...),
except by relying on attribute
join.
Informative attributes
Tokens address segments of the original document but also
provide a level of possible abstraction w.r.t. this
document, for instance w.r.t. graphical or phonological
variations that are not linguistically pertinent. The non
mandatory attributes form,
transcription,
transliteration may be used to
perform this abstraction, providing, for instance, the
phonetic transcription of a speech segment, the roman
transliteration of some Cyrillic word, the expansion of an
abbreviation, the correction of a typographical error, or
the choice of a normalized form in presence of variations:
Examples (markup not preserved): “etc.”; “csar”, “tsar”; “February, 23rd 2003”; “plateau”.
The abstraction provided by the attribute
form is also adequate to handle the
phenomena of contraction and agglutination where two
tokens may cover the same segment of the original document
for distinct values (see Section 6.4.2).
Completing the embedding token notation
As mentioned above, the embedding token notation is less
precise than the standoff one, in particular for making explicit
the contiguity and the overlapping of tokens (which are
easy to check using the document positions in the case
of the standoff notation).
Joining tokens
The embedding notation for tokens is completed by the
attribute join used to specify
how a token is joined with its sibling tokens. By
default, two sibling tokens are considered to be
separated by whatever separator is standard for the
document language (for instance, space separated for
many languages). By using the attribute
join, it is possible to indicate
that a token is contiguous with its left or right
sibling or with both.
Example (markup not preserved): L'on dit [it is said ...]
It should be noted that a token may enclose material
usually considered as separator, such as spaces,
newline, dash, apostrophe, ..., even if these tokens do
not anchor linguistic units at the level of word forms.
Example (markup not preserved): L'on dit [it is said ...]
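The effect of the attribute join on embedded tokens can be sketched as follows in Python. This is only an illustration: the value names "none", "left", "right" and "both" are assumptions (the text only states that a token may be contiguous with its left sibling, its right sibling, or both), and the class EmbeddedToken and function render are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedToken:
    text: str
    join: str = "none"  # assumed values: "none", "left", "right", "both"


def render(tokens, separator=" "):
    """Rebuild the surface string, omitting the separator where a join is declared."""
    pieces = []
    for i, tok in enumerate(tokens):
        if i != 0:
            prev = tokens[i - 1]
            # Two siblings are contiguous if either side declares the join.
            glued = prev.join in ("right", "both") or tok.join in ("left", "both")
            if not glued:
                pieces.append(separator)
        pieces.append(tok.text)
    return "".join(pieces)


sentence = [EmbeddedToken("L'", join="right"), EmbeddedToken("on"), EmbeddedToken("dit")]
rendered = render(sentence)
print(rendered)
```

The default separator is a single space here; a real document language may call for a different default.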
Another example, in Modern Greek, is provided by the
idiomatic expression “καλοκαγαθὸς” (good and brave) that may be
segmented into three agglutinated segments “καλὸς”, “και”, and “αγαθὸς” and represented by:
Example (markup not preserved): καλοκαγαθὸς
Overlapping tokens
Two tokens may overlap, for instance to denote an
agglutinated or contracted form (for instance, in
French, “des” may be
seen as a contraction for “de
les” [of
the]), or to denote multi-locutor
documents with overlapping discourses. In these cases, a
token may not mark just the
realization of a typographical or vocal sequence, but
expresses a deeper linguistic reality pertinent for
segmenting a document. It is however still possible not
to mention overlapping at the level of tokens and to
postpone the issue at the level of linguistic units,
i.e. word forms.
The value overlap for the token
attribute join may be used to
denote overlapping at the level of embedding tokens. For
instance, the following example illustrates the
contraction of an abbreviation with a punctuation mark
for “etc.”, for the
standoff and embedding notations for element
token:
Standoff notation: example for “etc.” (markup not preserved)
Embedding notation: example for “etc.” (markup not preserved)
Formal description: token
Word Forms as linguistic units
The segments identified by token elements
are used to anchor word forms, that may
generally be associated, through attribute
entry, to a lexical entry in a
lexicon. Word forms are also characterized by a part of
speech as well as morphological and grammatical properties
expressed by feature structures (see Section 8.1). Immediate
information about the lemma and inflected forms may also be
attached with the attributes lemma
and form. In particular, the
attribute form is useful when the
inflected form attached to the word form does not coincide
with the content attached to the covered tokens, because,
for instance, of spelling corrections.
A token may be associated with more than one word form and,
conversely, a word form may cover more than one token.
For instance, in French, the morphological agglutination of
auquel (“of which”) may have several
representations, depending on the granularity
of the tokenization:
coarse granularity: The
character sequence auquel
is not decomposed and is covered by a single
token, with two word forms covering this
segment.
Example (markup not preserved): auquel
fine granularity: The tokenizer identifies two agglutinated parts
materialized by two tokens, each of them anchoring a word
form:
Example (markup not preserved): auquel
The choice of a level of granularity can be motivated by the
usage or by the available tools for a given language.
As mentioned before, there are no mandatory linguistic
properties for defining the tokens, which can, for instance,
be automatically recognized by regular languages. On the
other hand, a word form, that may cover zero, one or more
tokens, should represent a linguistic unit carrying
morpho-syntactic information.
The current proposal does not discuss the linguistic choices
that define these linguistic units but provides enough
flexibility to annotate them. The choice may be motivated by
lexical or morphological properties based on context and
language (depending on the nature and function of words).
Token attachment
One token; one word form
The simplest case of relationship between tokens and
word forms is when a word form covers a single token.
Example (markup not preserved): apple
Several contiguous tokens; one word form
However, the current proposal allows the handling of
more complex cases, as the identification of compound
words covering several adjacent tokens:
Example (markup not preserved): prime minister
Several discontinuous tokens; one word form
A sequence of non contiguous tokens may also be attached
to a word form, for instance to handle cases where some
material is inserted inside the components of a word
form:
Example (markup not preserved): afin justement de
This kind of phenomenon may also occur for verbs with
detached particles, for instance in English or German.
The English infinitive verbal form “to <verb>” may
also fit in this scheme.
Example (markup not preserved): to eventually decide
In order to identify a discontinuous word form while
preserving some information about the position of each
component in the flow of word forms, one may use word
forms covering the same sequence of tokens and referring to
the same entry (but possibly sub-entries).
Example (markup not preserved): to eventually decide
Zero token; one word form
Another case that may arise is when one wishes to insert
a word form which is not realized in the original
document, and is, therefore, associated with an empty
sequence of tokens, e.g., some pronouns in Spanish or
the hypothesis of traces.
Example (markup not preserved): Jean propose de partir
Even if a word form covers no tokens, it still has a
relative position w.r.t. the other word forms. It is
this relative position which is pertinent for further
processing, rather than some absolute document position.
One token; several word forms
Finally, several word forms may be attached to the same
token, as illustrated by the following examples.
Example (markup not preserved): Damelo [Give it to me]
Example (markup not preserved): auquel [of which]
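The attachment cases listed in this clause (one, several, or zero tokens per word form, and several word forms per token) can be summarized in a small Python sketch. The token identifiers t1 to t3, the lemma values, and the label PRO for an unrealized word form are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordForm:
    lemma: str
    tokens: List[str] = field(default_factory=list)  # ids of the covered tokens


def forms_per_token(word_forms):
    """Index word forms by the tokens they cover."""
    index = defaultdict(list)
    for wf in word_forms:
        for token_id in wf.tokens:
            index[token_id].append(wf.lemma)
    return dict(index)


word_forms = [
    # One token, several word forms: "Damelo" covered by three word forms.
    WordForm("dar", ["t1"]),
    WordForm("me", ["t1"]),
    WordForm("lo", ["t1"]),
    # Several contiguous tokens, one word form.
    WordForm("prime minister", ["t2", "t3"]),
    # Zero tokens: a word form with no realization in the document.
    WordForm("PRO", []),
]

index = forms_per_token(word_forms)
unrealized = [wf.lemma for wf in word_forms if not wf.tokens]
print(index, unrealized)
```

Word forms that cover no token do not appear in the index; their position is only defined relative to the other word forms, as stated above.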
Referring lexicon entries
A word form is a linguistic unit carrying morpho-syntactic
properties. Generally, a linguistic unit may be
characterized by a label corresponding to an entry of some
lexicon. This identification is materialized by the
attribute entry, whose content
should express a reference (a URN) to the lexicon entry.
Example (markup not preserved): Prime minister
The notion of “lexicon entry” is outside the
scope of MAF. A reference to a lexicon entry is therefore
not precisely defined but, as a first approximation, should
correspond to a URN (Uniform Resource Name).
It should be noted that one may wish to reference lexicon
“sub-entries” for polysemous entries or for compound
forms.
Example (markup not preserved): to eventually decide
A token or a sequence of tokens may sometimes be identified
as forming a word form because of various properties but
cannot be associated with some lexicon entry, either because
no lexicon is available or because the word form
corresponds to a named entity (a proper name, a date, an
address, ...) or to a neologism. In that case, the content
of attribute entry may be left
empty. The other informative attributes
lemma and
form may still be used to provide
more information about the word form.
Example (markup not preserved): October, 23rd 2005
For such unknown words, it is however suggested that they
be collected into a document-specific lexicon, so that
the unknown words can refer to entries in this lexicon.
Compound word forms
The structure of compound forms (including multi-word
expressions) may be expressed using nested word forms,
therefore providing information about the subparts even
when none is available for the whole, for instance for
neologisms:
Example (markup not preserved): Geburtstagsgeschenkpapier [birthday gift wrapping paper]
Note: Specifying the derivational
morphology of a compound word is outside the scope of MAF.
Still, the addition of a deriv
attribute on embedded word forms is being investigated, for
instance to mention the head of a compound form.
Formal description: wordForm
Morpho-syntactic content
This section explains how to attach morpho-syntactic content
to word forms and how to define reusable tagsets
to provide compact notations through tags and
to control the validity of these contents.
The previous section explains how to enrich a document with
morpho-syntactic annotations. However, it does not define the
content of these annotations. What set of features and
feature values should we use to express this content (within
element wordForm) and with which meaning?
Such a set is usually referred to as a tagset
specifying the content of possible annotations. However, the
diversity of approaches and languages makes it almost
impossible to propose a single tagset. More
modestly or pragmatically, the current proposal seeks to
provide mechanisms to define tagsets by relying on a Data
Category Registry (DCR) and Feature Structures
Representations (FSR).
An annotated document will therefore be completed by either
adding or referring to a tagset.
Using feature structures
A word form may be completed by a morpho-syntactic content
defining its linguistic nature and its grammatical function
in its current context. This content is expressed using
Feature Structures, following the
recommendation of ISO 24610 Part
1 document on “Feature Structure
Representation” [FSR]. As a first approximation, a
feature structure may attach one or several (possibly
complex) values to linguistic properties (e.g., noun to
part of speech, present to tense, indicative to mood, ...).
Example (markup not preserved): belle [nice]
The feature structure content attached to a word form may
also provide additional information of interest about that
word form.
Compact morpho-syntactic tags
The FSR proposal provides ways for the compact representation
of feature structures, by relying on
libraries naming feature values and feature specifications (a feature
specification being a pair formed by a feature and a
value). These names may be used in
wordForm attribute
tag to get compact tags, following
a standard practice in the NLP community.
Example (markup not preserved): belle
The content of attribute tag should
be similar to the content of attribute
feats defined in FSR, namely a
space-separated sequence of feature specification identifiers.
The libraries naming recurrent values and feature specifications are part of
the tagset(s) coming with the annotated document.
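The relation between a compact tag and the feature structure it abbreviates can be sketched in Python as a lookup in a library of named feature specifications. The identifiers adj, fem and sing, the feature names, and the function expand_tag are hypothetical; an actual tagset would define its own library, and FSR identifiers are normally references rather than bare names.

```python
# Hypothetical library: identifier -> (feature, value) pair.
FEATURE_LIBRARY = {
    "adj": ("pos", "adjective"),
    "fem": ("gender", "feminine"),
    "sing": ("number", "singular"),
}


def expand_tag(tag, library=FEATURE_LIBRARY):
    """Expand a space-separated tag into a flat feature structure."""
    features = {}
    for identifier in tag.split():
        name, value = library[identifier]
        features[name] = value
    return features


belle = expand_tag("adj fem sing")
print(belle)
```

An unknown identifier raises a KeyError in this sketch, which is one simple way of signalling that a tag refers to something the tagset does not declare.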
FSR libraries
The generic way provided by FSR to use libraries is
illustrated by the following example, with the attribute
feats of element
fs:
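A feature value library (markup not preserved)
A feature specification library (markup not preserved)
With such a library, following FSR rules, one may write:
(example markup not preserved)
or, equivalently, by using
attribute tag, one may write:
(example markup not preserved)
Disjunctive values are allowed by FSR and may also be
simplified, following the same mechanism:
A feature value library (markup not preserved)
A feature specification library (markup not preserved)
Annotated document (markup not preserved): porte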
Designing tagsets
The features, values, and possibly feature types used to
specify morpho-syntactic content are not just labels but
carry linguistic meanings, or, in other words, semantic
content. To avoid misinterpretations, the semantic content
attached to a feature, a value or a type should be clearly
defined. The combination of features, values and types
should also be controlled in order to avoid linguistically
invalid combinations, such as using
neuter as a value for
gender in French, or using a feature
tense for nouns in most languages.
MAF does not try to define the semantic content of a
unique complete set of such features, values, and types.
It would be an almost impossible task given the diversity
of languages, and it would be equally impossible to assign
to each component a meaning agreed on by the whole
community.
Instead, it is proposed that an annotated document should be
completed by including or referring to one or
more tagsets.
The first objective of a tagset is to list the terminology
used to annotate a document as a set
of data categories whose
meanings are precisely defined in a Data Category
Registry, following the recommendation
of ISO 12620 proposal on
“Data Category Registry”. The process may be
seen as selecting a subset of morpho-syntactic data
categories (Data Category Selection – DCS).
The correspondence with a registered data category may not
be perfect. The rel may be used to
specify which relationship exists between the local and
registered data categories. For instance, one may introduce
a local data category advneg as being
subsumed by a more general registered data category
adverb.
It is also possible (but not advised) to introduce a local
data category bearing no relationship with any registered
data category.
When the correspondence is not perfect or missing, a few
words of description should be added to define the meaning of
a local data category.
A part of speech used to denote honorific titles like
Pr. or S.A.S.
The second objective of a tagset is to specify the set of
valid feature structures based on the selected data
categories. It will be achieved by relying on the proposed
ISO 24610 Part 2 on
“Feature System Declaration” [FSD].
The third objective of a tagset is to name the most common
morpho-syntactic structures through the use of FSR
libraries, as seen in Section 8.2.1.
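The second objective above, controlling which feature structures are valid, can be illustrated with a minimal Python sketch. It is not an implementation of a Feature System Declaration; the tables ALLOWED_VALUES and FEATURES_BY_POS and the function validate are hypothetical and only encode the two invalid combinations mentioned earlier (neuter gender in French, tense on a noun).

```python
# Hypothetical declaration for a small French tagset.
ALLOWED_VALUES = {
    "pos": {"noun", "verb", "adjective"},
    "gender": {"masculine", "feminine"},  # no neuter in French
    "number": {"singular", "plural"},
    "tense": {"present", "past", "future"},
}
FEATURES_BY_POS = {
    "noun": {"gender", "number"},
    "adjective": {"gender", "number"},
    "verb": {"number", "tense"},
}


def validate(features):
    """Return a list of problems found in a flat feature structure."""
    errors = []
    for name, value in features.items():
        if name not in ALLOWED_VALUES:
            errors.append(f"unknown feature: {name}")
        elif value not in ALLOWED_VALUES[name]:
            errors.append(f"invalid value for {name}: {value}")
    pos = features.get("pos")
    if pos in FEATURES_BY_POS:
        for name in features:
            if name != "pos" and name not in FEATURES_BY_POS[pos]:
                errors.append(f"feature {name} not allowed for {pos}")
    return errors


print(validate({"pos": "noun", "gender": "neuter"}))
print(validate({"pos": "noun", "tense": "present"}))
```

Returning a list of problems rather than raising on the first one makes it easier to report every invalid combination in an annotated document in one pass.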
Formal description: tagset
The dcs corresponds to a Data Category
Selection part whose exact content is still to be defined.
The fsd corresponds to a Feature
System Declaration part whose normalization is yet to
be done.
Handling ambiguities
Ambiguities naturally arise when handling natural
language, and especially for automatically produced
annotations. Ambiguities may occur at various levels and,
therefore, MAF proposes several alternatives to cope with
ambiguities as simply as possible.
Word form Content Ambiguities
The proposal on Feature Structure Representation provides
several ways to represent ambiguities, for instance at the
level of feature values. These mechanisms may be used to
handle the ambiguities occurring within the
morpho-syntactic content of a word-form.
For instance, the French inflected verb form “mange” (to
eat) is ambiguous between the 1st and 3rd persons, and
this ambiguity can be captured by
the vAlt element present in FSR:
Example (markup not preserved): mange
A compact tag notation can still be used by registering
most frequent cases of ambiguities in FSR libraries
(Section 8.2.1).
Example (markup not preserved): mange
Lexical Ambiguities
Ambiguities between different lexical entries for the same
sequence of tokens can be handled by the element wfAlt:
Example (markup not preserved): porte
Structural Ambiguities
Structural ambiguities over word forms
A very general answer is to describe the
possible readings as paths through a Directed Acyclic
Graph (DAG) whose edges are labeled by word forms. Such
DAGs form a subclass of Finite State Automata and also
cover the notion of word lattice used in the
parsing and speech recognition communities. They are
powerful enough to represent ambiguities between several
decompositions into compound forms. They can also be
used to denote simpler cases of lexical ambiguities.
For instance, the French textual sequence “fer à cheval” (horse shoe) can be
decomposed into several readings (“[horse shoe]”, “[iron] [on horse]”,
“[iron] [on] [horse]”), giving the following DAG:
fer à cheval
The linguistic units “fer à
cheval”, “fer”, “à”, “cheval”, and “à cheval” correspond to minimal
syntagmatic units that can be annotated.
Additional information could be added to edges such as
probabilities.
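As an informal illustration (not part of MAF), the word form DAG for “fer à cheval” can be modelled as a list of labeled edges whose paths are the possible readings. The state names and the readings helper are assumptions made for this sketch; they are not MAF element or attribute names.

```python
from collections import defaultdict

# Hypothetical encoding: (source state, target state, word form label).
EDGES = [
    ("s0", "s3", "fer à cheval"),  # compound reading
    ("s0", "s1", "fer"),
    ("s1", "s3", "à cheval"),
    ("s1", "s2", "à"),
    ("s2", "s3", "cheval"),
]


def readings(edges, init, final):
    """Return every path from init to final as a list of word form labels.

    Termination relies on the graph being acyclic.
    """
    outgoing = defaultdict(list)
    for source, target, label in edges:
        outgoing[source].append((target, label))

    def walk(state):
        if state == final:
            yield []
            return
        for target, label in outgoing[state]:
            for rest in walk(target):
                yield [label] + rest

    return list(walk(init))


ALL_READINGS = readings(EDGES, "s0", "s3")
```

Each element of ALL_READINGS is one reading of the sequence; a probability could be attached to each edge by adding a fourth component to the tuples.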
Structural ambiguities over tokens
Structural ambiguities may also arise over sequences of
tokens, resulting from ambiguities in the tokenization
of the annotated document, e.g. speech documents.
Structural ambiguities over tokens are represented by
transitions labeled by tokens. The attributes tinit and tfinal
on fsm elements are used to state the
initial and final states for the token paths.
The two levels of structural ambiguities are represented
by two lattices that form a kind of chart. It is not
mandatory but advised that the two lattices share their
states, whenever possible.
A validity condition has to be expressed between the two
levels of structural ambiguity:
the tokens covered by word forms along a word form
path belong to some token path.
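The validity condition is not checked by the schema, so a processing tool has to verify it itself. The following sketch shows one strict reading of the condition: the concatenation of the tokens covered along every word form path must be exactly the label sequence of some token path. The edge encoding, the token identifiers and the function names are assumptions made for illustration.

```python
def paths(edges, init, final):
    """Enumerate the label sequences of all paths from init to final (DAG)."""
    if init == final:
        return [[]]
    found = []
    for source, target, label in edges:
        if source == init:
            for rest in paths(edges, target, final):
                found.append([label] + rest)
    return found


def is_valid(token_edges, tinit, tfinal, wf_edges, init, final):
    """Check that every word form path covers the tokens of some token path."""
    token_paths = {tuple(p) for p in paths(token_edges, tinit, tfinal)}
    for wf_path in paths(wf_edges, init, final):
        # Each word form edge is labeled by the tuple of token ids it covers.
        covered = tuple(tok for tokens in wf_path for tok in tokens)
        if covered not in token_paths:
            return False
    return True


# Hypothetical lattices sharing their states 0..3.
TOKEN_EDGES = [(0, 1, "t1"), (1, 2, "t2"), (2, 3, "t3")]
GOOD_WF_EDGES = [
    (0, 3, ("t1", "t2", "t3")),
    (0, 1, ("t1",)),
    (1, 3, ("t2", "t3")),
]
BAD_WF_EDGES = [(0, 3, ("t1", "t3"))]  # skips token t2

GOOD = is_valid(TOKEN_EDGES, 0, 3, GOOD_WF_EDGES, 0, 3)
BAD = is_valid(TOKEN_EDGES, 0, 3, BAD_WF_EDGES, 0, 3)
```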
Simplified structuring variants
Non ambiguous linear representation
When there is no ambiguity, MAF allows the
global lattice notation to be replaced by a much simpler linear
notation in which
the token, wordForm
and wfAlt elements are implicitly
chained in their order of appearance, as illustrated
by the following example:
fer à cheval
Mixed linear and lattice representation
Ambiguities are generally localized, and it is tempting
to restrict the use of the lattice notation to the places
where it is needed. MAF allows local
fsm lattices to be inserted in a linear flow
of token, wordForm
and wfAlt elements.
afin de grandir, il mange des pommes de terre
Expanding the simplified variants
The simplified variants are allowed because they may
always be expanded into a global lattice, by applying
the steps sketched in the following sub-sections.
Separating tokens and word forms
All tokens embedded within a word form may be extracted
and moved just before the word form (and before any
enclosing wfAlt), without changing the
relative order of the tokens.
des
becomes
des
Note: There are no clear semantics for
tokens embedded in word forms that are themselves
embedded in transitions. This case should be avoided.
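The separation step can be sketched on a simplified in-memory model of a linear flow. The tuple layout, the identifiers and the lexical entry “manger” are assumptions made for illustration, and the sketch does not cover word forms nested in wfAlt elements or transitions.

```python
def separate(flow):
    """Move tokens embedded in word forms just before them, keeping order."""
    result = []
    for item in flow:
        if item[0] == "wordForm":
            _, entry, embedded = item
            result.extend(embedded)  # the tokens come first...
            token_ids = [token_id for _, token_id, _ in embedded]
            # ...then the word form, which now only refers to them by id.
            result.append(("wordForm", entry, token_ids))
        else:
            result.append(item)
    return result


# Hypothetical flow: a free token followed by a word form embedding a token.
FLOW = [
    ("token", "t1", "il"),
    ("wordForm", "manger", [("token", "t2", "mange")]),
]
SEPARATED = separate(FLOW)
```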
Wrapping into local lattices
Tokens and word forms outside transitions are embedded
into local lattices, wfAlt elements
being considered as word forms.
il mange des
becomes
il mange
Lattice states are local to each lattice.
Merging local lattices
Two adjacent lattices may be merged by renaming the
intermediary states in order to avoid name clashes and
in such a way that the word form (resp. token) final
state of the first lattice equals the word form (resp.
token) initial state of the second lattice. Whenever
possible, it is recommended, when merging, to rename the
lattice states in such a way that the initial (resp. final)
states for tokens and word forms coincide.
The previous example becomes:
il mange
and then
il mange
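The merging step can be sketched as follows, assuming a simplified lattice model with a single pair of initial and final states; in MAF the same renaming would be applied to both the word form and the token states. The dictionary layout and the “a.” and “b.” prefixes are hypothetical.

```python
def merge(first, second):
    """Concatenate two lattices whose state names are local to each of them."""

    def a(state):
        # States of the first lattice get their own namespace.
        return "a." + state

    def b(state):
        # The initial state of the second lattice is identified with the
        # final state of the first one; other states are simply renamed.
        if state == second["init"]:
            return a(first["final"])
        return "b." + state

    edges = [(a(s), a(t), label) for s, t, label in first["edges"]]
    edges += [(b(s), b(t), label) for s, t, label in second["edges"]]
    return {
        "init": a(first["init"]),
        "final": b(second["final"]),
        "edges": edges,
    }


# Two local lattices reusing the same local state names "0" and "1".
IL = {"init": "0", "final": "1", "edges": [("0", "1", "il")]}
MANGE = {"init": "0", "final": "1", "edges": [("0", "1", "mange")]}
MERGED = merge(IL, MANGE)
```

Prefixing every state before gluing the lattices is only one way to avoid name clashes; any injective renaming would do.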
Removing wfAlt
A transition over a lexical ambiguity, materialized by
a wfAlt element, may be expanded into
two equivalent simpler transitions.
becomes
The ordering of transitions inside lattices is not
significant. On the other hand, the ordering of word forms
and tokens outside lattices is significant, as is the relative
ordering of local lattices.
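The expansion of a wfAlt into simpler transitions can be sketched as below. The tuple encoding of transitions and the two readings of “porte” (noun “porte”, verb “porter”) are assumptions made for illustration.

```python
def expand_wfalt(transitions):
    """Replace each transition over a wfAlt by one transition per word form."""
    expanded = []
    for source, target, content in transitions:
        if content[0] == "wfAlt":
            for word_form in content[1]:
                expanded.append((source, target, word_form))
        else:
            expanded.append((source, target, content))
    return expanded


# Hypothetical transition over a lexical ambiguity for the form "porte".
TRANSITIONS = [
    ("s0", "s1", ("wfAlt", [("wordForm", "porte"), ("wordForm", "porter")])),
]
EXPANDED = expand_wfalt(TRANSITIONS)
```

The expanded transitions share the source and target states of the original one, so the set of paths through the lattice is unchanged.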
Formal description: wfAlt and fsm
Header and metadata
The global maf element is introduced as
a root element to encapsulate morpho-syntactic annotations
and carries global metadata relative to the annotated
documents.
Two MAF-specific metadata categories are introduced for the
token standoff notation, namely
the document
and addressing attributes.
The addressing attribute indicates
the addressing schema used to refer to positions in the
annotated document. A full list of such schemas will be
provided in the ISO 24612 proposal
“Linguistic Annotation Framework” (LAF). The
following fragment illustrates the use of these attributes
for a video document:
The other non-mandatory metadata are handled following the
recommendations of the OLAC Metadata Standard
and should therefore be included in
an olac:olac element.
MySuperMorphoTool
2005/09/30
1.1
TDM80MAF.1.1
TDM80MAF.1.0
http://abu.cnam.fr/cgi-bin/donner_abu?tdm80j2
French
MyInstitution
Le Tour du Monde en 80 Jours version MAF
A set of MAF annotations for Jules Verne's famous novel
MyInstitution
LGPL-LR
Formal description
The complete list of addressing schemas
allowed by MAF will be inherited from the ISO
24612 document on the Linguistic Annotation
Framework (LAF). A possible list of such schemas could
include:
TEI ptrs,
XML XPointers,
character offsets (depending on the original document encoding),
MPEG-7 multimedia addressing (MediaTimePointType).
(informative) RELAX NG compact schema
Note: The following RELAX NG compact
schema may be found online at http://atoll.inria.fr/~clerger/MAF/maf.rnc
Morphosyntactic Annotation Framework
MAF Start element
Attributes to be used in case of standoff annotations
(attribute) the URL of the annotated document
(attribute) the addressing mode for referring to positions in the annotated document
To be defined in LAF.
Global attributes for maf elements
Segment of the input document
Attributes to denote a span in the annotated document
Attributes used to denote a span
Left span boundary
Right span boundary
Relationship with neighbor tokens
no
Attributes used to provide additional information on the
content of a token
(possibly corrected) form of the token
phonetic transcription
general transcription
transliteration to some other script
Linguistic units built upon tokens
lemma attached to a wordForm
form attached to a wordForm
lexicon entry attached to a wordForm
Reference to a wordForm content
sequence of token identifiers
the tagset to be used to check and interpret the annotations
Data Category Selection
The selection of Data Categories used to express the
annotations
local name of the category
registered name of the category in the ISO Data
Category Registry
relationship
Relationship between the local meaning of a category
and the registered one
eq
description
Informal description of a Data Category
Finite State Machine
Used to describe an ambiguous flow of token
and/or wordForm elements
init state of the FSM wrt wordForms
final state of the FSM wrt wordForms
init state of the FSM wrt tokens
final state of the FSM wrt tokens
FSM transition in a flow of tokens and/or wordForms
source state of a transition
target state of a transition
WordForm Alternative
Simplified form to express an alternative between several word forms
Validating MAF documents
To validate a MAF document, the first step is to convert the
RELAX NG compact schema into an XML RELAX NG schema (for
instance using trang). Such an XML RELAX
NG schema may be found at http://atoll.inria.fr/~clerger/MAF/maf.rng
Then, the validation may be performed, for instance, using xmllint (from libxml2).
xmllint --relaxng maf.rng mafdoc.xml
It should be noted that some semantic constraints of MAF
are not checked by the RELAX NG schema, in particular the
constraint between the word form and token paths expressed
in Section 9.3.2.
(informative) DTD
The following DTD is only an approximation of the RELAX NG schema.
Note: The current DTD is outdated w.r.t.
the RELAX NG schema.
(informative) Illustrative examples
Tagsets
Demonstrator
A preliminary demonstrator covering most of the
functionalities provided by MAF may be tried on line at
http://atoll.inria.fr/mafdemo
(illustrative) Morpho-syntactic Data Categories
This annexe lists and documents the morpho-syntactic data
categories used in the MAF examples.
A repository of data categories, including morpho-syntactic
data categories, may be found at http://syntax.inist.fr/
grammaticalGender with conceptual values feminine, masculine
grammaticalNumber with conceptual values singular, plural
grammaticalPos with conceptual values noun, verb, preposition, determiner, adverb
grammaticalMood with conceptual values indicative, subjunctive
grammaticalTense with conceptual values present
grammaticalPerson with conceptual values first, second, third
(informative) UML notions used within MAF
Introduction
MAF complies with the specifications and modeling
principles of UML as defined by OMG [32]. UML is well
defined and broadly used in the industry. MAF uses a subset
of UML that is relevant for linguistic description.
The following notions are used:
The notion of class
The notion of relationship
The notion of instance
The notion of package
The notion of class
A class is a named descriptor for a set of objects that
share the same attributes and relationships. Classes are
described within a class model.
The notion of attribute
An attribute is the description of a named element of a
specified type in a class; each object of a class
separately holds a value of the type.
The notion of relationship
A relationship is a connection between classes. This includes
association and generalization. Relationships are described
within a class model.
The notion of association
An association is a relationship between two specified
classes that describes connections among their objects.
The extension of the association is a collection of such
links. Associations are the
glue that holds together the model: without
associations, there is only a set of isolated classes. An
association has two ends. Each end has a multiplicity
and an ordering qualifier.
The multiplicity is the specification of the range of
allowable cardinality values that a collection may assume.
The multiplicity range is an integer interval with its
minimum and maximum values.
An ordering qualifier specifies whether the connection
forms a set (an unordered collection) or a list (an
ordered collection).
The notion of aggregation
An aggregation is a form of association that specifies a
whole-part relationship between an aggregate (a whole) and
a constituent part. It is not permissible for both ends to
be aggregates.
The notion of generalization
A generalization relationship is a directed relationship
between two classes. One class is called the parent or
the super-class, and the other is called the child or the
sub-class. The parent is the description of a set of
objects with common properties over all children. The
child is a description of a subset of those objects that
have the properties of the parent but that also have
additional properties peculiar to the child. A parent may
have more than one child and a child may have more than
one parent. Generalization is a transitive and
anti-symmetrical relationship. No directed generalization
cycles are allowed. A child inherits the attributes and
associations of its parent.
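The properties of generalization stated above (transitivity, multiple parents, absence of directed cycles, inheritance of attributes) can be sketched on a toy class model. The class names, attribute names and helper functions below are hypothetical and are not taken from the MAF model.

```python
def ancestors(parents, cls, seen=()):
    """Return all direct and indirect parents of cls (transitive closure).

    Raises ValueError if a directed generalization cycle is found.
    """
    if cls in seen:
        raise ValueError("generalization cycle through " + cls)
    result = set()
    for parent in parents.get(cls, []):
        result.add(parent)
        result |= ancestors(parents, parent, seen + (cls,))
    return result


def all_attributes(parents, own, cls):
    """Attributes of cls: its own plus those inherited from every ancestor."""
    attributes = set(own.get(cls, []))
    for ancestor in ancestors(parents, cls):
        attributes |= set(own.get(ancestor, []))
    return attributes


# Hypothetical model: "Child" has two parents, one of which has a parent too.
PARENTS = {"Child": ["ParentA", "ParentB"], "ParentA": ["Root"]}
OWN = {"Root": ["id"], "ParentA": ["form"], "ParentB": ["span"], "Child": ["lemma"]}

CHILD_ATTRIBUTES = all_attributes(PARENTS, OWN, "Child")
```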
The notion of instance
An instance is an object that conforms to a class.
Instances are not described within a class model but
within an instance model (sometimes called an object
model).
The notion of package
A package is a grouping of classes and relationships. Usually
there is a single root package that owns the entire model
for a system. A package may contain nested packages.
Packages may have dependencies on other packages.
Graphical notations
Each notion has a graphical notation that is precisely
defined as follows: