Our main lines of work are:
● Theoretical and practical research on grammar formalisms and parser types (among others, dependency analysis, categorial grammars, and HPSG grammars), in order to reach a consensus on the formalism and analyzer that will implement the project's results.
● Survey and integration of lexical resources from different sources and possibly different languages: integration of verbal lexicons, annotated corpora, extrapolation from annotated corpora in one language to another, and use of comparable corpora for terminology extraction.
● Feasibility study of applying machine learning techniques to the chosen formalism and the enriched lexicon.
In the groups involved in this project there are a number of undergraduate and graduate students working in the areas listed above.
Specific lines of work
LIGM (Takuya Nakamura) and UFSCar (Oto Vale) are collaborating on a comparison between lexicon-grammar tables of frozen expressions of French and of Brazilian Portuguese (Vale, 2001), in order to prepare their possible conversion into an LGLex syntactic lexicon and their integration into a parser. Lexicon-grammar tables (Gross, 1975) are currently one of the major sources of syntactic lexical information for French, and tables also exist for other languages, such as Italian, Brazilian Portuguese, Modern Greek, Korean, and Romanian.
Since October 1st, 2014, Eric Laporte has co-supervised the doctoral thesis of Aline Evers (UFRGS) on the application of multilingual resources to the clustering of texts in Portuguese.
Matthieu Constant uses supervised machine learning and dictionaries of multiword expressions (MWEs) in order to detect MWEs in the context of syntactic parsing. He has experimented on French (Candito & Constant, 2014) and Serbian (Constant et al., 2014), and the technique can be transferred to Spanish and Portuguese.
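As an illustration of the dictionary component of such approaches, the sketch below tags MWE occurrences with B/I/O labels by greedy longest-match lookup. The dictionary entries and example sentence are invented for the example; the actual systems combine such lookups with statistical models (a CRF or the parser itself) rather than relying on the dictionary alone.

```python
# Minimal sketch of dictionary-based MWE detection via greedy longest-match
# lookup. Dictionary and sentence are illustrative, not the real resources.

def tag_mwes(tokens, mwe_dict, max_len=5):
    """Tag tokens with B/I/O labels using greedy longest-match lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = 0
        # Try the longest candidate span first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwe_dict:
                match = n
                break
        if match:
            tags[i] = "B-MWE"
            for j in range(i + 1, i + match):
                tags[j] = "I-MWE"
            i += match
        else:
            i += 1
    return tags

mwe_dict = {("in", "spite", "of"), ("of", "course")}
tokens = "she came in spite of the rain".split()
print(tag_mwes(tokens, mwe_dict))
# ['O', 'O', 'B-MWE', 'I-MWE', 'I-MWE', 'O', 'O']
```

In a parsing pipeline, these B/I/O labels would typically be used as features or as a pre-grouping step rather than as the final decision.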
Bibliographical references
- Marie Candito, Matthieu Constant. 2014. "Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing". In Proceedings of ACL 2014, Baltimore, USA.
- Matthieu Constant, Cvetana Krstev, Duško Vitas. 2014. "Joint Compound/Named Entity Recognition and POS Tagging for Serbian: Preliminary Results". Poster, PARSEME meeting, Frankfurt.
- Maurice Gross. 1975. Méthodes en syntaxe : Régime des constructions complétives. Hermann, Paris, France.
- Oto Araújo Vale. 2001. "Transparência e opacidade de expressões cristalizadas". In Flávia B. M. Hirata-Vale (ed.), Anais do IV Seminário Nacional de Literatura e Crítica e do II Seminário Nacional de Lingüística e Língua Portuguesa. Goiânia: Gráfica e Editora Vieira, p. 240-246.
The PLN group at UdelaR, Montevideo, has been working on various issues related to this project's proposal. On the one hand, between 2000 and 2002 we worked on the construction of a partial parser that segments sentences into clauses, for Spanish (Clatex) and for French (Propos).
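A deliberately simplified sketch of this kind of clause segmentation is shown below: split a token stream at punctuation and at a finite list of conjunctions. The boundary list is an invented placeholder; Clatex and Propos rely on much richer linguistic rules.

```python
# Toy clause segmenter in the spirit of partial parsing: split a sentence
# at punctuation and a small (illustrative) list of boundary conjunctions.

BOUNDARIES = {",", ";", "because", "although", "when", "which", "and"}

def segment_clauses(tokens):
    """Split a token list into clause-like segments at boundary markers."""
    clauses, current = [], []
    for tok in tokens:
        if tok.lower() in BOUNDARIES and current:
            clauses.append(current)
            # Keep conjunctions with the new clause; drop bare punctuation.
            current = [tok] if tok.isalpha() else []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    return clauses

tokens = "he left early because the meeting was cancelled".split()
print(segment_clauses(tokens))
# [['he', 'left', 'early'], ['because', 'the', 'meeting', 'was', 'cancelled']]
```

A real partial parser would also use part-of-speech information and verb detection to decide whether a candidate boundary actually opens a new clause.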
The group will be working on the analysis of the main grammatical formalisms applicable to our interests in NLP (constituent grammars, dependency grammars, HPSG grammars and categorial grammars), together with a study of existing parsers and their performance. This has been an area of interest to the group, and several ongoing theses address these issues. In particular, one master's thesis involves the development of a statistical parser based on HPSG.
On the other hand, another line investigated in the framework of different projects within the group is the identification of events. Identifying events in texts shares some aspects with parsing, because a reference to an event tends to correspond closely to a clause at the syntactic and semantic analysis levels, since the distinctive element of an event is the predicative element of a linguistic utterance. In this sense, the identification of nominal events influences the syntactico-semantic analysis of the text.
MoDyCo is currently working on a methodology for annotating variations in enunciative and modal commitment in a text. We are developing a corpus of French newswire texts automatically annotated with enunciative and modal commitment information. The annotation scheme we propose is based on the detection of predicative cues - referring to an enunciative and/or modal variation - and of their scope at sentence level. The evaluation results show that the most challenging task is not finding the predicative cues but delimiting their scopes and, beyond this delimitation question, defining how to assess whether a scope is correct or not. The next step of our work is to launch a larger annotation campaign involving more human annotators and a bigger corpus. In this second step, our model will integrate discursive cues that can impact more than a single sentence.
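To make the cue/scope setting concrete, the sketch below pairs a toy cue detector (with the scope heuristically taken as everything after the cue) with a token-overlap F1, one possible way to quantify how "correct" a predicted scope is. The cue list, the scope heuristic, and the metric are illustrative assumptions, not the MoDyCo scheme itself.

```python
# Toy cue detection and scope evaluation for commitment annotation.
# Cue list and scope heuristic are illustrative assumptions.

CUES = {"according", "believes", "probably", "might", "claims"}

def find_cue_and_scope(tokens):
    """Return (cue_index, scope): the scope is heuristically taken to be
    the indices of every token following the first cue in the sentence."""
    for i, tok in enumerate(tokens):
        if tok.lower() in CUES:
            return i, list(range(i + 1, len(tokens)))
    return None, []

def scope_f1(gold, predicted):
    """Token-overlap F1 between gold and predicted scopes: one way to
    assess whether a predicted scope is 'correct enough'."""
    gold, predicted = set(gold), set(predicted)
    if not gold or not predicted:
        return 0.0
    tp = len(gold & predicted)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

tokens = "the minister claims the reform will reduce costs".split()
cue, scope = find_cue_and_scope(tokens)
print(cue, scope)                     # 2 [3, 4, 5, 6, 7]
print(scope_f1(scope, [3, 4, 5, 6]))  # partial overlap, F1 below 1.0
```

Soft metrics of this kind are one answer to the scope-assessment question raised above: they reward partially correct scopes instead of scoring them as outright failures, which matters when annotators themselves disagree on exact boundaries.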