Language Resource Management -- Morpho-syntactic Annotation Framework (MAF):

8 Segmenting with tokens


Morpho-syntactic annotations are carried by segments, called tokens, present in the document flow, but this does not imply that the resulting segmentation is a sequence of adjacent segments partitioning the original document. It is particularly important to distinguish morpho-syntactic units from their realizations. Some parts of a document may carry no annotations (typographic marks, stage directions, markup elements, ...); other parts may not correspond exactly to their segmented form (abbreviations, brachygraphies, typographic errors and variations, typographic and morphological contractions, ...). Also, a morpho-syntactic unit may not correspond to a segment delimited by typographic marks (such as white spaces or hyphens), as is the case for German compound words, speech transcription, or Sanskrit writing.

The element token is used to represent those segments of the original document that, roughly speaking, follow typographical, morphological, or phonological boundaries. The current proposal does not define the linguistic properties of tokens. In different languages, a token may be identified through typographic properties (white-space, hyphens, characters, ...) and/or morphological properties (radical, affix, morpheme, ...). The description of the morphological, phonological, or lexicological structures that may define a token is not covered by the current proposal.

Other typographical marks used to format pages or to separate words and paragraphs, as well as encoding information, do not belong to morpho-syntactic annotations and are also not covered by this proposal, but rather by TEI.

8.1 Standoff notation

The element token makes annotations independent of the original document by providing a way to reference intervals in that document. The attributes from and to define such intervals. The content of these attributes depends on the addressing scheme chosen to denote unambiguous document positions, and on the nature of the original document.
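
As an illustration only, the following Python sketch resolves standoff tokens against a document. It assumes one possible addressing scheme that the text above does not prescribe: from and to are zero-based character offsets delimiting a half-open interval. The names StandoffToken and covered_text are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StandoffToken:
    id: str
    start: int  # value of the `from` attribute (`from` is a reserved word in Python)
    end: int    # value of the `to` attribute


def covered_text(document: str, token: StandoffToken) -> str:
    """Return the interval [start, end) of the document addressed by the token.

    Assumption: offsets are zero-based character positions; a real
    addressing scheme may differ (bytes, time codes, pointers, ...).
    """
    return document[token.start:token.end]


document = "The victim's friends told police"
tokens = [
    StandoffToken("t1", 0, 3),
    StandoffToken("t2", 4, 10),
    StandoffToken("t3", 10, 12),
]
surface = [covered_text(document, t) for t in tokens]
```

Under this assumed scheme the tokens never copy the text: the document can be stored and annotated separately, and the covered strings are recovered on demand.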

8.2 Embedding notation

It is not always necessary to separate the original document from its annotations. For simple cases, textual content may be directly embedded within token.

<token id="t1">The</token>
<token id="t2">victim</token>
<token id="t3">'s</token>
<token id="t4">friends</token>
<token id="t5">told</token>
<token id="t6">police</token>
<token id="t7">that</token>
<token id="t8">Krueger</token>
<token id="t9">drove</token>
<token id="t10">into</token>
<token id="t11">the</token>
<token id="t12">quarry</token>
<token id="t13">and</token>
<token id="t14">never</token>
<token id="t15">surfaced</token>
<token id="t16">.</token>

The embedding notation is used for most of the MAF examples provided, but its use is not recommended. First, the morpho-syntactic annotations may conflict with other annotations. Second, the textual material separating the content embedded within successive token elements is not precisely defined (white-space, newlines, no space, hyphen, ...), except by relying on the attribute join.
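
The second reason can be made concrete with a short Python sketch. It reads a few embedded tokens with the standard library module xml.etree.ElementTree and rebuilds the text naively with spaces. The wrapping tokens root element is a hypothetical container added only so that the fragment parses as a single XML document; it is not part of MAF.

```python
import xml.etree.ElementTree as ET

fragment = """
<token id="t1">The</token>
<token id="t2">victim</token>
<token id="t3">'s</token>
<token id="t4">friends</token>
"""

# Hypothetical <tokens> wrapper: the fragment has several top-level
# elements and would not parse as one XML document without a root.
root = ET.fromstring(f"<tokens>{fragment}</tokens>")

# Collect (id, embedded text) pairs in document order.
embedded = [(tok.get("id"), tok.text or "") for tok in root.iter("token")]

# Naive reconstruction: assume every pair of tokens is space separated.
naive = " ".join(text for _, text in embedded)
```

The naive reconstruction separates the possessive marker from its noun, which shows that the separating material cannot be recovered from the embedded content alone.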

8.3 Informative attributes

Tokens address segments of the original document but also provide a level of abstraction with respect to this document, for instance over graphical or phonological variations that are not linguistically pertinent. The optional attributes form, phonetic, transcription, and transliteration may be used to perform this abstraction, providing, for instance, the phonetic transcription of a speech segment, the Roman transliteration of a Cyrillic word, the expansion of an abbreviation, the correction of a typographical error, or the choice of a normalized form in the presence of variations:

<token form="et cetera" id="t1">etc.</token>
<token form="tzar" id="t2">csar</token>
<token form="tzar" id="t3">tsar</token>
<token form="23/02/03" id="t4">February, 23rd 2003</token>
<token
  form="et cetera"
  phonetic="/etsettrə/"
  from="1251"
  to="1253"
  id="t5"/>
<token phonetic="/platto/" id="t6">plateau</token>

The abstraction provided by the attribute form is also adequate to handle contraction and agglutination, where two tokens may cover the same segment of the original document with distinct values (see Section 8.4.2).
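
A consumer of such tokens might read the abstraction as follows. This Python sketch is an assumption about usage, not something MAF mandates: the hypothetical helper normalized returns the value of form when present and falls back to the embedded text otherwise. The tokens wrapper element is again hypothetical.

```python
import xml.etree.ElementTree as ET

fragment = """
<token form="et cetera" id="t1">etc.</token>
<token form="tzar" id="t2">csar</token>
<token form="tzar" id="t3">tsar</token>
<token phonetic="/platto/" id="t6">plateau</token>
"""

# Hypothetical <tokens> wrapper so the fragment parses as one document.
root = ET.fromstring(f"<tokens>{fragment}</tokens>")


def normalized(tok: ET.Element) -> str:
    """Prefer the `form` attribute; fall back to the embedded text.

    The fallback policy is an assumption made for this sketch.
    """
    form = tok.get("form")
    return form if form is not None else (tok.text or "")


forms = [normalized(tok) for tok in root.iter("token")]
```

With this policy the two spelling variants map to the same normalized value, while a token without form keeps its surface text.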

8.4 Completing the embedding token notation

As mentioned above, the embedding token notation is less precise than the standoff one, in particular for making explicit the contiguity and the overlapping of tokens (both of which are easy to check from the document positions in the standoff notation).

8.4.1 Joining tokens

The embedding notation for tokens is complemented by the attribute join, which specifies how a token is joined with its sibling tokens. By default, two sibling tokens are considered to be separated by whatever separator is standard for the document language (for instance, a space in many languages). The attribute join makes it possible to indicate that a token is contiguous with its left sibling, its right sibling, or both.

It should be noted that a token may enclose material usually considered as a separator, such as spaces, newlines, dashes, apostrophes, ..., even if such tokens do not anchor linguistic units at the level of word forms.

An example, in Modern Greek, is provided by the idiomatic expression “καλοκαγαθὸς” (good and brave), which may be segmented into three agglutinated segments “καλὸς”, “και”, and “αγαθὸς” and represented by:

<token form="καλὸς" id="t0">καλο</token>
<token form="και" id="t1" join="left">κ</token>
<token form="αγαθὸς" id="t2" join="left">αγαθὸς</token>
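
The effect of join on the reconstructed text can be sketched in Python. The function detokenize is hypothetical, and so are the value names "right" and "both": the section only shows the value "left" and states that a token may be contiguous with its left sibling, its right sibling, or both. The default separator is assumed to be a single space.

```python
def detokenize(tokens, separator=" "):
    """Rebuild surface text from (text, join) pairs.

    join is None (default separator applies), "left", "right", or "both".
    "right" and "both" are assumed value names for this sketch.
    """
    out = []
    glue_next = False  # True when the previous token is joined to its right sibling
    for text, join in tokens:
        glue_here = glue_next or join in ("left", "both")
        if out and not glue_here:
            out.append(separator)
        out.append(text)
        glue_next = join in ("right", "both")
    return "".join(out)


greek = [("καλο", None), ("κ", "left"), ("αγαθὸς", "left")]
english = [("The", None), ("victim", None), ("'s", "left"), ("friends", None)]

greek_surface = detokenize(greek)
english_surface = detokenize(english)
```

The three Greek tokens are concatenated without separators, and the English possessive marker attaches to the preceding noun while the other tokens stay space separated.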

8.4.2 Overlapping tokens

Two tokens may overlap, for instance to denote an agglutinated or contracted form (in French, “des” may be seen as a contraction of “de les” [of the]), or to represent multi-speaker documents with overlapping discourse. In these cases, a token does not merely mark the realization of a typographical or vocal sequence, but expresses a deeper linguistic reality pertinent for segmenting a document. It is however still possible not to mention overlapping at the level of tokens and to defer the issue to the level of linguistic units, i.e. word forms.

The value overlap for the token attribute join may be used to denote overlapping at the level of embedding tokens. For instance, the following example illustrates the contraction of an abbreviation with a punctuation mark for “etc.”, in both the standoff and the embedding notations for the element token:

Standoff notation:
<token
  form="et cetera"
  id="t1"
  from="p1"
  to="p3"/>
<token
  form="#dot#"
  id="t2"
  from="p1"
  to="p3"/>
Embedding notation:
<token form="et cetera" id="t1">etc.</token>
<token form="#dot#" id="t2" join="overlap"/>
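
With the standoff notation, overlap and contiguity reduce to comparisons on document positions. The Python sketch below assumes integer positions and half-open intervals, which is only one possible reading of from and to (the example above uses symbolic positions such as p1 and p3). The function names overlaps and contiguous are hypothetical.

```python
def overlaps(a, b):
    """True when intervals a = (from, to) and b = (from, to) share a position.

    Assumption: positions are integers and intervals are half-open.
    """
    return a[0] < b[1] and b[0] < a[1]


def contiguous(a, b):
    """True when interval b starts exactly where interval a ends."""
    return a[1] == b[0]


# Two tokens covering the same span, as in the "etc." example above.
et_cetera = (1, 3)
dot = (1, 3)

# Two tokens that follow each other without separating material.
victim = (4, 10)
possessive = (10, 12)
```

Here overlaps(et_cetera, dot) holds because both tokens address the same interval, while victim and possessive are contiguous but do not overlap.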

8.5 Formal description: token

token: segment of the input document
  id
  from: left span boundary
  to: right span boundary
  join: relationship with neighbor tokens
  att.token.information: attributes used to provide additional information on the content of a token
    form: (possibly corrected) form of the token
    phonetic: phonetic transcription
    transcription: general transcription
    transliteration: transliteration to some other script
