
B  Detailed scientific and technical description of the project

B.1  Project goal

The main motivations of the PASSAGE project are twofold:
  1. to improve the accuracy and robustness of existing French parsers by using them on large-scale corpora (several million words) and
  2. to exploit the resulting syntactic annotations to create richer and more extensive linguistic resources.
The adopted methodology consists of a feedback loop between parsing and resource creation as follows:
  1. parsing is used to create syntactic annotations
  2. syntactic annotations are used to create or enrich linguistic resources such as lexicons, grammars or annotated corpora
  3. the linguistic resources created or enriched on the basis of the syntactic annotations are then integrated into the existing parsers
  4. the enriched parsers are used to create richer (e.g., syntactico-semantic) annotations
  5. etc.
More generally, the PASSAGE project should foster the emergence of linguistic processing chains exploiting richer lexical information, in particular semantic information.

PASSAGE will build upon the results of the EASy French parsing evaluation campaign (EASy/Evalda action, Technolangue program). This campaign has shown that several parsing systems are now available for French, but that robustness and accuracy can still be largely improved, especially on oral data.

Furthermore, although the initial plan was to combine the results produced by each participant to construct a treebank for French (a corpus annotated with syntactic information), the creation of this treebank is still to be achieved, and the expected output, while very valuable, remains relatively small (around 40K sentences, with a subset of around 4K sentences manually validated) compared to emerging international standards (10M to 100M words, i.e., 0.5M to 5M sentences).

PASSAGE aims at pursuing and extending the line of research initiated by the EASy campaign. The participation of around 10 parsing systems in a collective effort geared towards improving parsing robustness and acquiring linguistic knowledge from large-scale corpora is a rather unique event. We believe that the combination of so many sources of information over a relatively long period of adaptation gives the proposal good chances of success. The parsing systems will be provided by participants or contractors. This proposal clearly falls within the NLP part (Theme 5) of the 2006 ANR MDCA call. PASSAGE also addresses data warehouse issues such as large corpora processing (using grids) and large syntactic annotation banks.

B.2  Context and state of the art

At the international level, the last decade has seen the emergence of a very strong trend of research on statistical methods in Natural Language Processing. This trend has several causes, but one of them, in particular for English, is the availability of large annotated corpora, such as the Penn Treebank (1M words extracted from the Wall Street Journal, with syntactic annotations; 2nd release in 1995 [http://www.cis.upenn.edu/~treebank/]), the British National Corpus (100M words covering various styles, annotated with parts of speech [http://www.natcorp.ox.ac.uk/]), or the Brown Corpus (1M words with morpho-syntactic annotations). Such annotated corpora have proved very valuable for extracting stochastic grammars or parametrizing disambiguation algorithms. These successes have led to many similar corpus annotation proposals; a long (but non-exhaustive) list may be found at http://www.ims.uni-stuttgart.de/projekte/TIGER/related/links.shtml. However, the development of such treebanks is very costly in terms of human effort and represents a long-standing endeavor. The volume of data that can be manually annotated remains limited and is generally not sufficient to learn very rich information (sparse data phenomena). Furthermore, designing an annotated corpus involves choices that may block future experiments aimed at acquiring new kinds of linguistic knowledge. Last but not least, it is worth mentioning that even manually annotated corpora are not error-free.

We believe that a new option is becoming possible. The French parsing evaluation campaign EASy has shown that parsing systems are now available for French, implementing shallow to deep parsing. Some of these systems were neither based on statistics nor extracted from a treebank. While still needing improvements in robustness, coverage, and accuracy, these systems have nevertheless proved the feasibility of parsing medium amounts of data (40K sentences, 1M words). Preliminary experiments made by some of the participants with deep parsers [7] indicate that processing several thousand sentences is not a problem, especially when relying on clusters of machines. These figures can be increased even further for shallow parsers. In other words, there now exist in France several parsing systems that could parse (and re-parse if needed) large corpora of 10 to 100M words (around 500K to 5M sentences).

While the quality of the analyses produced by these parsers remains to be assessed and improved (an important objective of this proposal), it should already be possible to learn valuable linguistic knowledge from the analysis of a large corpus, for instance about subcategorization frames, selectional restrictions, or the statistical distribution of syntactic phenomena, as advocated by others (Deriving Linguistic Resources from Treebanks [http://www.computing.dcu.ie/~away/Treebank/treebank.html]; the LREC'02 workshop on “Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Data” [http://www.lrec-conf.org/lrec2002/lrec/wksh/CfP-WP16.html]). Furthermore, the knowledge thus acquired is meant to be integrated into the parsing systems to make them more accurate. Hopefully entering a virtuous circle, corpora may then be re-parsed to learn new knowledge.

The quality of the acquired linguistic knowledge will potentially be reinforced by combining (merging) the results produced by all the parsers involved in PASSAGE. To facilitate this merging, we will focus on dependency-based representations of the results, which also seem better adapted to the acquisition of linguistic knowledge. This combination process requires assessing the performance level of the different parsers involved, in order to compute a confidence factor associated with the information provided by each parser. One way to assess this confidence factor is through quantitative black-box evaluation, a paradigm that has been used successfully for more than 20 years in the speech processing community and is recognized as one of the factors that drove technological progress in that field. PASSAGE will be the opportunity to build on the exploratory effort of EASy in order to promote the evaluation paradigm for parsing natural language.

Human validation by expert linguists is required by the evaluation procedure and remains an important issue: first to assess the quality of the acquisition techniques, but also because fully unsupervised acquisition does not seem reasonable, since the improvement targets have to be provided by humans, a necessary condition for breaking through technological barriers. Furthermore, linguistic expertise and linguistic theories are necessary to guide the acquisition experiments. Part of the objectives of PASSAGE is to understand how this expertise may be brought in and used efficiently through adequate validation interfaces (for instance on the model of the one developed for error mining [http://atoll.inria.fr/perl/results5/errorscgi.pl]).

The parsers may be improved by acquiring probabilistic information on natural language (for reducing ambiguities), but also by improving and enriching their underlying linguistic resources, lexica or grammars. Thus, as a very important side effect of PASSAGE, we should obtain richer and more extensive linguistic resources (or, at least, improvements of existing ones).

Very recently, suggestions have been made to marry symbolic and statistical approaches [3]. Symbolic approaches provide ways to express linguistic knowledge and to move to richer levels of description (syntax, semantics), while statistics provides ways to capture the fact that languages are human artifacts partly characterized by their usage in a community. The PASSAGE proposal should help us move in this direction.

At the end of the project, the final set of syntactic annotations will also be made freely available to the community and should, hopefully, boost new acquisition experiments. A subpart of this annotation set (around 500K words) will be manually validated and also distributed.

B.2.1  ATOLL/INRIA

The INRIA project-team ATOLL (Software Tools for Linguistic Processing) is strongly involved in the exploration and development of parsing technologies for NLP, for various syntactic formalisms. Two parsing systems (FRMG/DyALog and SxLFG/Syntax), exploiting a common pre-parsing processing chain SXPIPE and a syntactic lexicon Lefff [10], have been developed for French and tried during the EASy campaign [20, 19]. These systems have since been used in large-scale parsing experiments (Monde Diplomatique, 300K to 600K sentences; botanical corpus, 100K sentences) [11, 7], partly relying on the use of clusters. The results of these experiments have already been used to mine errors [9, 8] in order to improve our linguistic resources. The Lefff lexicon developed by ATOLL is partly the result of experiments using linguistic knowledge acquisition techniques [21, 10], whose latest version has been applied to Slovak [15]. ATOLL is also involved in the current standardization efforts at the level of ISO TC37 SC4, on morpho-syntactic annotations (MAF) [18] and on syntactic annotations (SynAF). With Lefff, FRMG and SxLFG, through several national actions (LexSynt, MOSAIQUE, the GrammoFun proposal) and through the current proposal PASSAGE, ATOLL aims to contribute to the development of wide-coverage linguistic resources for French.

B.2.2  LIR/LIMSI

The “Langues, Information et Représentations” group at LIMSI has strong competence in the organization of evaluation campaigns, having been the scientific organizer both of the GRACE evaluation campaign for part-of-speech taggers and of EASy/EVALDA, the parsing evaluation campaign for French.

B.2.3  Langue & Dialogue / LORIA

Langue et Dialogue / LORIA's research focuses on computational semantics and, in particular, on the development of a computational infrastructure for the semantic processing of French. Within PASSAGE, Langue et Dialogue / LORIA will bring expertise on lexical information acquisition, parsing, semantic parsing and semantic annotation. Thus, Azim Roussanaly was one of the participants in the EASy campaign (with the LLP2 parser), and Claire Gardent has worked on extracting valency information from the LADL tables [17, 6, 12]. Further work of that group includes the development of a semantic processing chain for French, including a grammar-writing environment for tree-based grammars [], a Tree Adjoining Grammar for French integrating a semantic dimension [13], and the coupling of this grammar with two parsers [16] and a generator [5]. Finally, Langue et Dialogue / LORIA also has some experience in corpus annotation [4].

B.2.4  LIC2M / CEA-LIST

The LIC2M group of the CEA-LIST laboratory is specialized in multilingual information search and extraction. As such, it has developed a multilingual linguistic analyzer, whose French version participated in the EASy campaign. Its new version will also participate in the campaigns of the PASSAGE project. The LIC2M is also involved in other projects requiring substantial computing infrastructure, such as clusters or supercomputers. This expertise will be used within PASSAGE for building the handler of multiple parsers. Finally, our research on the acquisition of semantic data from semantic analysis will be used in PASSAGE and further developed thanks to our participation in this project.

B.2.5  ELDA

ELDA aims at collecting, commercializing and distributing Language Resources, as well as collecting and disseminating general information related to the field of Human Language Technologies, with the mission of providing a central clearing house for the players of the field. ELDA launched evaluation activities in the past few years, distributing the language resources appropriate to evaluation experiments in language engineering. ELDA is now involved in this field to a large extent, extending its skills to all aspects of evaluation: participation in the evaluation of systems and applications, and involvement in European evaluation projects such as CLEF. To make this new activity more concrete, ELDA changed its former name, “European Language Resources Distribution Agency”, into “Evaluation and Language Resources Distribution Agency”. In the same way as ELDA has made a great number of Language Resources available, and carries on by distributing them widely, it aims at building a permanent and long-lasting evaluation infrastructure. In PASSAGE, ELDA will act as a contractor.

B.2.6  Tagmatica

Tagmatica is an SME specialized in standardization for Natural Language Processing (www.tagmatica.com). Gil Francopoulo (from Tagmatica) has worked in lexicon management, sentence parsing and search engines for 20 years. He is the editor of the ISO standard dedicated to lexicons (LMF, aka CD24613) and participates in the ongoing work on the ISO syntactic annotation framework (SynAF, aka WD24615). In PASSAGE, Tagmatica will act as a contractor.

B.3  Project organization – description of the subprojects

B.3.1  General principles

The project development and evaluation strategy is based on a feedback loop between parsing and resource creation. More specifically, the output of parsing is used to extract linguistic information which, once validated, is integrated into the parsers, which are then re-run on large-scale corpora, thus providing the trigger for another acquisition-evaluation-improvement cycle. In this way, (i) parsing is used to acquire richer and larger linguistic resources and (ii) the acquired resources are used to improve parsing performance, coverage and precision.

Meetings will be held every 4 months to evaluate the progress of the project and to favor the exchange of information between the participants. The participants will have access to as much information, resources and tools as possible to conduct their parsing experiments and, in return, will contribute as much information (syntactic annotations) as possible to be shared with the others (possibly in an anonymized way).

B.3.2  Tasks

The project will be organized around the following tasks.

WP1 – Identification and preliminary preparation of corpora
Coordinator: ATOLL / INRIA (with a subcontract to ELDA)

Objectives: The aim of this work package is the creation of a freely available, large-scale corpus collection containing various text styles. The subtasks involved will include:
  1. The selection of several corpora covering various kinds of styles (including oral transcriptions), totaling 10M to 100M words. Corpora will be selected for their style but also for the possibility of making them freely available (or at least easily available at reasonable cost). Several natural candidates have already been identified.

  2. Corpus cleaning: while the original corpora should remain available, scripts should be provided to perform a minimum of normalization on their form (removal of some meta-data, removal of HTML/XML mark-up, possibly sentence segmentation, ...); see the sketch after this list.

  3. Corpus-annotation anchoring tools: for cases where the original corpora cannot be freely distributed, we will devise a method for distributing the original material on the one hand and the annotations plus anchoring information on the other. More generally, this implies that annotations have to clearly reference portions of the original corpus (using standoff annotations and robust, portable addressing schemes for the linguistic units referred to in the annotations); the sketch below also illustrates this anchoring.
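
By way of illustration for subtasks 2 and 3, here is a minimal Python sketch (the cleaning heuristics and all names are ours, not a project deliverable) that strips HTML/XML mark-up while recording, for each resulting sentence, its character offsets in the original document, so that standoff annotations can reference the source text without redistributing it:

  import re

  def clean_and_anchor(raw_text):
      """Strip mark-up and return (sentence, start, end) triples whose
      offsets point back into raw_text (standoff-style anchoring).
      Deliberately naive: real corpora need finer segmentation."""
      # Map each kept character back to its offset in the original text.
      kept, offsets = [], []
      in_tag = False
      for i, ch in enumerate(raw_text):
          if ch == "<":
              in_tag = True
          elif ch == ">":
              in_tag = False
          elif not in_tag:
              kept.append(ch)
              offsets.append(i)
      cleaned = "".join(kept)

      # Naive segmentation on sentence-final punctuation.
      sentences, start = [], 0
      for m in re.finditer(r"[.!?]", cleaned):
          seg = cleaned[start:m.end()]
          lead = len(seg) - len(seg.lstrip())
          if seg.strip():
              sentences.append((seg.strip(),
                                offsets[start + lead],
                                offsets[m.end() - 1] + 1))
          start = m.end()
      return sentences

  raw = "<p>Le chat dort.</p> <p>Il reve peut-etre.</p>"
  for sent, beg, end in clean_and_anchor(raw):
      print(beg, end, sent)   # offsets are valid in the *original* document

An annotation file can then reference the pair (document id, character span) instead of embedding the text itself.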
WP2 – Handling morpho-syntactic and syntactic annotations
Coordinator: LIR / LIMSI (with a subcontract to TAGMATICA)

Objectives: this work package will concentrate on defining a project-internal annotation standard for syntactic annotations, on making the link between this standard and other existing standards, and on setting up the basic computational infrastructure necessary to store, edit and search syntactically annotated corpora. More specifically, the aims will be:
  1. To propose formats for the annotations produced at the various stages, essentially for syntactic annotations but possibly also for morpho-syntactic annotations. These formats are meant to ease the distribution of annotations, but also the comparison and fusion of the results provided by the parsers. They will be defined in relation to the ongoing standardization efforts promoted by ISO TC37 SC4 and relayed by the former Technolangue action Normalangue/RNIL, the European action Lirics, and the ANR MDCA proposal Nortal (if accepted). In particular, the most important emerging standards for PASSAGE are MAF (“Morpho-syntactic Annotation Framework”) and SynAF (“Syntactic Annotation Framework”).

    It is not reasonable to hope for a complete SynAF proposal in the short term. It seems more realistic to aim for a relatively simple dependency-based model (possibly with a chunk level), extending the one proposed for the EASy campaign. The experience acquired during the EASy campaign should help with the comparison and fusion of parsing results.

  2. To develop format conversion tools between the internal format used by the project to combine the parsers' annotations and the standards defined by the standardization institutions as they will exist at the end of the project (a minimal illustration follows the note below).

  3. To identify technologies and tools to store, edit and search large repositories of syntactic annotations. In particular, we will use the ongoing modifications of the LIC2M search engine, which will be able to handle complex structured data.
Note: The selection and/or development of formats and tools will be guided by a preliminary investigation of existing treebank projects and annotation platforms (GATE, Linguistic Data Consortium [LDC], Alembic Workbench, MATE, TIGER, Prague Dependency Treebank, etc.).
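
As a purely illustrative sketch of what a simple project-internal dependency representation and its standoff serialization could look like (the element and attribute names below are invented, not the actual MAF/SynAF vocabulary):

  from xml.etree.ElementTree import Element, SubElement, tostring

  # Hypothetical internal format: one dependency = (head, relation, dependent),
  # where tokens are identified by ids defined in the morpho-syntactic layer.
  sentence = {
      "id": "s1",
      "tokens": [("t1", "le"), ("t2", "chat"), ("t3", "dort")],
      "deps": [("t3", "suj", "t2"), ("t2", "det", "t1")],
  }

  def to_xml(sent):
      """Serialize the dependencies as standoff XML edges over token ids."""
      root = Element("sentence", id=sent["id"])
      for head, rel, dep in sent["deps"]:
          SubElement(root, "edge", head=head, label=rel, dep=dep)
      return tostring(root, encoding="unicode")

  print(to_xml(sentence))
  # <sentence id="s1"><edge head="t3" label="suj" dep="t2" /> ... </sentence>

A conversion tool towards the final standards would then essentially be a mapping over such edges and their label vocabulary.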

WP3 – Large scale processing on clusters
Coordinator: ATOLL / INRIA

Objectives: This work package will aim at
  1. developing tools and environments to allow for cluster-based parsing and
  2. identifying potential clusters such as GRID 5000 (resulting from the ACI Grid).
Note: This task may require some adaptation of the processing chains so that they can run in a cluster-based environment. However, there is no obligation for a participant to use a cluster-based environment if performance on a single machine is good enough.
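
A minimal sketch of the intended batching scheme (the parser command is a placeholder): the corpus is split into chunks that are parsed in parallel; on a cluster, the local worker pool below would simply be replaced by the batch scheduler, one job per chunk.

  import subprocess
  from concurrent.futures import ProcessPoolExecutor

  PARSER_CMD = ["my_parser", "--format", "deps"]   # placeholder command

  def parse_chunk(chunk_path):
      """Run one parser process on one corpus chunk; return the output path."""
      out_path = chunk_path + ".parsed"
      with open(chunk_path) as src, open(out_path, "w") as dst:
          subprocess.run(PARSER_CMD, stdin=src, stdout=dst, check=True)
      return out_path

  def parse_corpus(chunk_paths, workers=8):
      # Chunks are independent, so parsing is embarrassingly parallel.
      with ProcessPoolExecutor(max_workers=workers) as pool:
          return list(pool.map(parse_chunk, chunk_paths))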

WP4 – Parser output comparisons and fusion
Coordinator: LIR / LIMSI

Objectives: To develop a merging protocol for the annotations provided by the various parsers. This protocol will use the performance assessments provided by the two evaluation campaigns of the project (one at the beginning and one towards the end) to define a confidence factor associated with each system's annotations. When needed, automatic error correction scripts will be developed to fix the most frequent and systematic errors in the data produced by individual parsers. This task will build on the experience gained during the EASy campaign.
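
As a rough illustration of such a confidence-weighted voting procedure (the edge representation, weights and threshold are illustrative assumptions, not the protocol to be defined in this work package):

  from collections import defaultdict

  def merge_annotations(outputs, confidence, threshold=0.5):
      """outputs: {parser: set of (head, relation, dependent) edges};
      confidence: {parser: weight} from the evaluation campaigns.
      Keep an edge when its normalized weighted vote exceeds threshold."""
      scores = defaultdict(float)
      total = sum(confidence.values())
      for parser, edges in outputs.items():
          for edge in edges:
              scores[edge] += confidence[parser] / total
      return {edge for edge, score in scores.items() if score > threshold}

  outputs = {
      "A": {("dort", "suj", "chat"), ("dort", "obj", "le")},
      "B": {("dort", "suj", "chat")},
      "C": {("dort", "suj", "chat"), ("chat", "det", "le")},
  }
  confidence = {"A": 0.9, "B": 0.6, "C": 0.7}
  print(merge_annotations(outputs, confidence))
  # only ("dort", "suj", "chat") gathers enough weight and is kept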

WP5 – Acquisition tasks
Coordinators: Langue et Dialogue / LORIA & ATOLL / INRIA

Objectives: Exploration of various techniques to extract information from the syntactically annotated corpora resulting from the parsing process. The work will mainly focus on exploring the creation of a knowledge-rich lexicon for French. In particular, one aim will be to use the information derived from a large-scale syntactically annotated corpus to create a lexicon that associates with each syntactic functor both valency and theta grid information. The idea is to first derive valency information, then use this information and its lexical distribution to create alternation classes in the style of Beth Levin, and finally to use these classes to systematically assign a common thematic grid to all verbs of a given class (see the sketch below). Corpus-derived information will be compared and combined with information made available by already existing resources such as the syntactic lexicon Lefff [10], the Synlex lexicon derived from the LADL tables [17, 6, 12] and Patrick Saint-Dizier's manually constructed alternation classes [].
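
To make the first step more concrete, the following sketch (dependency labels and threshold are illustrative assumptions) counts, for each verb, the subcategorization frames observed in a merged dependency bank and keeps the frequent ones as valency candidates:

  from collections import Counter, defaultdict

  def extract_valency(observations, min_rel_freq=0.05):
      """observations: iterable of (verb_lemma, set of dependent relations).
      Return, per verb, the frames kept as valency candidates."""
      frames = defaultdict(Counter)
      for verb, relations in observations:
          # A frame is the sorted tuple of the verb's syntactic functions.
          frames[verb][tuple(sorted(relations))] += 1
      lexicon = {}
      for verb, counter in frames.items():
          total = sum(counter.values())
          lexicon[verb] = [f for f, n in counter.items()
                           if n / total >= min_rel_freq]
      return lexicon

  data = [("donner", {"suj", "obj", "a-obj"}),
          ("donner", {"suj", "obj", "a-obj"}),
          ("donner", {"suj", "obj"}),
          ("dormir", {"suj"})]
  print(extract_valency(data))
  # {'donner': [('a-obj', 'obj', 'suj'), ('obj', 'suj')], 'dormir': [('suj',)]}

Verbs sharing similar frame distributions would then be grouped into alternation classes.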

Other kinds of information may also be acquired.
WP6 – Integration and validation tasks
Coordinator: Langue et Dialogue / LORIA

Objectives:
  1. To provide ways to validate the NLP lexicon provided by the acquisition task (WP5)
  2. To integrate some or all of this NLP lexicon into at least one parsing system (but the information will be available to all participants for integration within their systems).
This work package will focus on assessing the correctness of the syntactic and semantic information contained in the lexicon acquired in WP5. Specifically, we intend (i) to integrate this lexicon into a parsing system, (ii) to use the resulting parsing system to build a pilot corpus annotated with semantic information and (iii) to evaluate the results of the parsing system against a manually created gold standard.

The parsing system used will build on the SemTAG system developed in Nancy [16, 13] which consists of a lexicon, a Tree Adjoining Grammar integrating syntax and semantics and Eric de la Clergerie's DyALog parser. The lexicon will be extended with the syntactic (valency) and semantic (thematic grid) information acquired from corpora, the grammar coverage will be extended to deal with constructions not yet covered and corpus extracted probabilistic information will be used to disambiguate the parser results.

The resulting parsing system will then be used to semi-automatically create a PropBank-like corpus [http://www.cis.upenn.edu/~mpalmer/project_pages/ACE.htm] (that is, a corpus annotated with semantic functor/argument dependencies). Annotators will be asked to choose, from among the parser outputs, the parse yielding the most appropriate thematic grid for each basic clause. The resulting annotated corpus will then be compared against a manually created gold standard using standard precision and recall measures.
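
For reference, the measures alluded to above, computed here over sets of (clause, role, argument) triples (this triple representation is an assumption of the sketch, not a project specification):

  def precision_recall(system, gold):
      """system, gold: sets of (clause_id, role, argument) triples."""
      correct = len(system & gold)
      precision = correct / len(system) if system else 0.0
      recall = correct / len(gold) if gold else 0.0
      f1 = (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
      return precision, recall, f1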

Note: The aim here is not to construct and make available a PropBank-style corpus, but rather to conduct pilot experiments on the usefulness of the acquired information for constructing such a corpus. Specifically, the construction of a PropBank-style corpus will permit a first assessment of the quality of the valency and thematic grid information contained in the lexicon by addressing the question: does the constructed parsing system yield correct thematic grids in most cases?

WP7 – Administrative and Scientific Management of Evaluation campaign 1
Coordinator: LIR / LIMSI

Objectives: The partner will deploy an open evaluation campaign for parsers of French. The first evaluation campaign will take place at the beginning of the project. Building on the results of EASy, it will provide a performance assessment of the available parsing technology. In particular, it will gauge the progress made by the technology since the end of EASy. The performance information (confidence factor) associated with each parser will be used to drive the combination process (weighted voting procedure), and specific error analysis of the parsers' output will be used to improve the parsing quality of the corpus.

WP8 – Manually validated reference subcorpus
Coordinator: ATOLL / INRIA (with a subcontract to ELDA)

Objectives: Identification of a sub-corpus (500K words) and stabilization and validation of its syntactic annotations. Validation will be performed by human annotators and checked with inter-annotator agreement measures computed on randomly selected excerpts of the corpus (amounting to 10% of the corpus); see the sketch below. First, we will need to determine whether it is more efficient to correct automatically annotated material or to hand-annotate the original material anew (for instance, for speech transcriptions, if the error rate is too high, it is much cheaper to redo the transcription from scratch than to try to correct the existing one).
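
The agreement measure could, for instance, be Cohen's kappa computed over the doubly annotated excerpts; a minimal sketch for two annotators labelling the same sequence of units:

  from collections import Counter

  def cohen_kappa(labels_a, labels_b):
      """Cohen's kappa: observed agreement corrected for chance agreement."""
      assert len(labels_a) == len(labels_b) and labels_a
      n = len(labels_a)
      observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
      freq_a, freq_b = Counter(labels_a), Counter(labels_b)
      expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
      return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

  print(cohen_kappa(["suj", "obj", "suj"], ["suj", "suj", "suj"]))  # 0.0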

Methodology and specific software for hand annotation/correction will be investigated in order to optimize the annotation task (in particular, consistency checking tools). The tools developed, if any, will be made available to the community at the end of the project, along with the reference corpus, which should be composed of copyright-free material.

The dependency bank so produced will be used in the second evaluation campaign and will be an invaluable resource for assessing, in the future, the quality of parsers and the robustness of various acquisition tasks with respect to parsing errors (comparing the use of manually versus automatically annotated material).

WP9 – Administrative and Scientific Management of Evaluation campaign 2
Coordinator: LIR / LIMSI

Objectives: The partner will deploy an open evaluation campaign for parsers of French. The second evaluation campaign will take place at the end of the project. It will provide a performance assessment of the progress made during the project. The performance information (confidence factor) associated with each parser will be used to drive the combination process (weighted voting procedure) for the final release of the treebank. An evaluation package will be made freely available (free open source) with the treebank, to enable parsing technology developers to assess the performance of their systems against the annotations of the treebank.

WP10 – Preparation of final results for distribution
Coordinator: ATOLL / INRIA (with a subcontract to ELDA and the CNRS “Centre National de Ressources Textuelles et Lexicales” (CNRTL) of Nancy)

Objectives: based on the designed formats, ensure the possibility of distributing the results (individual and merged) to both the scientific community and industry. The “Centre de compétence” will be in charge of distributing the free material and managing contacts with the research community, while ELDA will provide a commercial interface to SMEs and industry.

B.3.3  Permanent members involved

ATOLL / INRIA


Name Status Involvement
De la Clergerie Éric CR1 30%
Pierre Boullier DR 15%


LIR / LIMSI


Name Status Involvement
Paroubek Patrick IR1 60%
Vilnat Anne MdC 10%
Robba Isabelle MdC 20%


Langue et Dialogue / LORIA


Name Status Involvement
Claire Gardent CR1 CNRS 20%
Azim Roussanaly MC Nancy 2 20%


LIC2M / CEA-LIST


Name Status Involvement
de Chalendar Gaël Chercheur 20%
Ferret Olivier Chercheur 10%


B.4  Main deliverables



  Label Type Lead partner(s) Date
1 Web Site web site ATOLL T02
2 Initial Repository online database ATOLL T06
3 SSR report ATOLL T06
4 Intermediary Documentation report ATOLL T12
5 Report evaluation camp. 1 report LIR T12
6 SLR report ATOLL T12
7 SSR report ATOLL T18
8 Reference subcorpus database LIR T24
9 Acquired Lexical Resources (intermediate version) database LED & ATOLL T24
10 Report on Acquisition report LED & ATOLL & LIC2M T24
11 SLR report ATOLL T24
12 SSR report ATOLL T30
13 Final Report report ATOLL T36
14 ROVER Corpus online database LIR T36
15 Treebank toolkit mgmt. software ATOLL T36
16 Report evaluation camp. 2 report LIR T36
17 Evaluation package software LIR T36
18 Report on Propbank experiment report LED T36
19 Semantic parser software LED T36
20 Acquired Lexical Resources database LED & ATOLL T36


B.5  Expected results – perspectives

The expected results of the PASSAGE project include the syntactic annotations themselves (a large dependency bank), a manually validated reference subcorpus, an enriched lexicon and a prototype PropBank. Even if incomplete or partially incorrect, all the linguistic resources resulting from PASSAGE should be very valuable for the French NLP community.

At a more prospective level, the emergence of several efficient and evaluated parsing systems for French, able to parse large corpora, should boost their use in industrial applications, especially information extraction. Furthermore, we believe that dependency-based representations of the parsers' output, as advocated in PASSAGE, should be a good basis for such applications and for moving towards more semantic representations.

The acquisition techniques explored in PASSAGE are meant to be reused and extended for other languages, in particular at the European level. Furthermore, strong expertise in both parsing technologies and acquisition techniques could open the way to linguistic knowledge acquisition through transfer, where multilingual aligned corpora (such as Europarl) and good parsers on one side may help design or improve parsers on the other side. Here again, by marrying symbolic and statistical techniques, we might move towards efficient transfer-based statistical translation techniques.

B.6  Intellectual property

We are looking forward to producing freely available annotations and derived linguistic resources, at least for academic purposes. Resources will be distributed through the “Centre de Compétence CNRS” (ATILF, Nancy) and through ELDA/ELRA.

