SIMPLE Annual Report 1999

 

 

SIMPLE is part of a strategic policy which aims at providing a core set of language resources for the EU languages.

The SIMPLE project, which can be considered a follow-up to PAROLE, aims at adding a semantic layer to a subset of the existing morphological and syntactic layers.

The semantic lexicons (covering about 10,000 word meanings: 7,000 for nouns, 2,000 for verbs and 1,000 for adjectives) are being built in a harmonised way for all the 12 languages covered by PAROLE: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish.

The main types of information to be encoded for nouns, verbs, and adjectives are: domain information, the semantic type of the head (with a structured semantic type), the semantic type of the arguments of predicates (to be defined at different levels of granularity), a short lexicographic gloss, and the linking with syntax.
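The kinds of information listed above can be pictured as a single record per word meaning. The following is only a minimal sketch in Python, not the SIMPLE model itself: every class name, field name, identifier and semantic-type label is a hypothetical placeholder chosen for illustration and is not taken from the SIMPLE Ontology or from the PAROLE/SIMPLE DTD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ArgumentSlot:
    """One argument of a predicate, with its (hypothetical) semantic type."""
    position: int
    semantic_type: str


@dataclass
class SemanticUnit:
    """Illustrative record for one word meaning; all field names are assumptions."""
    lemma: str
    pos: str                      # "noun", "verb" or "adjective"
    semantic_type: str            # semantic type of the head
    gloss: str                    # short lexicographic gloss
    domain: Optional[str] = None  # domain information, if any
    arguments: List[ArgumentSlot] = field(default_factory=list)
    syntactic_unit_id: Optional[str] = None  # link to the syntactic layer

    def is_predicative(self) -> bool:
        # In this sketch, a unit counts as predicative if it has arguments.
        return len(self.arguments) > 0


# Made-up example entry; the type labels and the identifier are placeholders.
example = SemanticUnit(
    lemma="give",
    pos="verb",
    semantic_type="EXAMPLE_TRANSFER_EVENT",
    gloss="to hand something over to someone",
    arguments=[
        ArgumentSlot(0, "EXAMPLE_HUMAN"),
        ArgumentSlot(1, "EXAMPLE_ENTITY"),
        ArgumentSlot(2, "EXAMPLE_HUMAN"),
    ],
    syntactic_unit_id="SYNU_give_1",
)
```

In this sketch a non-predicative noun would simply have an empty argument list, while the link to syntax is kept as an identifier rather than an embedded structure, so that the semantic and syntactic layers stay separate.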

The semantic lexicons will mostly be corpus-based: each language will exploit the harmonised and representative corpora built within PAROLE. This will make the semantic encoding aware of actual corpus distinctions and not only of generalisations based on the introspection of linguists and lexicographers, which can sometimes be misleading.

 

 

Summary of 1999 Activities

 

The first months of the project have been dedicated to completing the specification for the SIMPLE semantic lexicons. Consistently with the activity and the results achieved in the first year of the project, the formal representation of the ‘conceptual core’ of the lexicons has been designed. Moreover, the core structured set of meaning-types (the SIMPLE Ontology) has been identified; it is to be used as a common starting point and as a shared device for building the harmonised language-specific semantic lexicons.

The specification phase has not focused only on preparing the top ontology of semantic types and the list of formal tools for the analysis of lexical items. Rather, much effort has gone into the design of clusters of structured semantic information which can guide lexicographers through the encoding phase, ensuring a high degree of consistency both during the project and in future developments of the SIMPLE lexicons. The result of this work is the set of templates which have been provided together with the guidelines. These templates will facilitate the encoding procedure, and at the same time will guarantee that a uniform amount of information is encoded in word senses across different languages. The 1999 specification phase has particularly focused on the design of the semantic types and semantic representation for verbs and adjectives.

During 1999, the encoding of the semantic units and their linking with the PAROLE syntactic layer started. The first phase of encoding mostly concerned non-predicative nouns. This allowed lexicographers to familiarise themselves with the model, and allowed the co-ordinators to devise strategies to check and enhance the consistency of the encoding among the various partners. The results have shown that the SIMPLE system of templates allows lexicographers to maintain a better average encoding speed, while also guaranteeing the general consistency of the work.

The second phase of the encoding (September-December) concerned predicative nouns and verbs, which are harder cases than non-predicative semantic units as far as the specification of their semantic information and the linking with the PAROLE syntactic layer are concerned. At the end of the year, the total number of encoded units corresponds to approximately two-thirds of the overall number.

Various internal check-points were fixed during the year, so as to ensure the highest degree of uniformity and to resolve any difficulties arising in the encoding. On October 8th-9th, a general workshop of the SIMPLE project was held in Graz to discuss and finalise common strategies for the development of the lexicons.

 

 

Important work areas

 

·        Technology outlook and innovative features

 

Semantics is the crucial and critical issue of the coming years. Given the ever-growing importance of the so-called ‘content industry’, every application that has to deal with information calls for systems which go beyond the syntactic level to understand the ‘meaning’.

Many theoretical approaches are tackling different aspects of semantics, but in general they still have to be tested (i.) with really large-size implementations, and (ii.) with respect to their actual usefulness and usability in real-world systems.

SIMPLE aims at addressing directly point (i.) above, while providing the necessary platform to allow future projects to address point (ii.).

Semantics is also at stake when we consider the multilingual aspect, with its problems and challenges, which is ultimately a critical issue in Europe. We cannot hope to solve the multilingual aspect without some solution to the semantic aspect (unless we use only statistical techniques). For the addition of a multilingual layer (multilingual links) to available language resources (LR), it is essential to have a harmonised set of semantic lexicons tackling in a uniform way the core of what is needed for NLP, i.e. semantic typing of heads and arguments.

 

·        Resulting product profile

 

The SIMPLE project represents, to our knowledge, the first attempt to tackle the encoding of semantic (argument) frames on a large scale, i.e. for so many languages and with rather wide coverage. Even though it is a real lexicon building project, it must also be seen as having challenging research aspects, and it will provide a framework for testing and evaluating the maturity of the current state of the art in lexical semantics grounded on, and connected to, a syntactic foundation.

 

The availability of rather large, uniformly structured lexical resources in so many EU languages will offer the users the benefits of a standardized base.

According to the subsidiarity concept, the process started at the EU level will be continued at the national level. This is already happening for a number of languages. The PAROLE Lexicons and Corpora will be enlarged in the framework of a number of National Projects, e.g. for Danish, Dutch, Greek, Italian, Swedish, Spanish and Catalan. These national initiatives show that the goal of the EC LR projects, namely to provide a core set of resources to be extended with national support, is starting to be met.

The fact that all these LRs will be based on the existing models and standards defined at the European level will create a really large infrastructure of harmonised LR throughout Europe. This achievement is of major importance in a multilingual area like Europe, where all the difficulties connected with the task of building LRs are multiplied by the language factor. This would have been impossible without the fundamental role played by the EC LR and standards projects.

 

·        User requirements

 

In the specification phase we have taken into account the requirements of NLP applications (parsing, generation, word sense disambiguation, Information Retrieval, etc.), as also stated in the EAGLES report of the Lexicon Semantics WG (Sanfilippo et al., 1998, web address: www.ilc.pi.cnr.it/EAGLES96/rep2/rep2.html), e.g. for the decision on the basic notions to be encoded. This is of utmost importance given the objectives of the PAROLE/SIMPLE lexicons.

 

As for PAROLE, the SIMPLE Lexicons, based on the results of EAGLES and GENELEX, must be declarative, and as far as possible “application independent”. Only in this way will they be able to evolve easily, for example to incorporate other levels of information (more application or domain dependent) or to become multilingual. This approach - which answers the requisites of genericity, explicitness, and variability of granularity - will guarantee large-scale reusability. The model - with a high level of precision in the description - is in fact designed to ensure that application-dependent data models and application dictionaries can be derived from this repository of information, by mapping the generic model onto the application model.

One dichotomy at stake is that between the genericity of a LR and its usefulness for applications. It is important to avoid the expensive duplication of effort that comes from building new specific LR from scratch for each different application. This is only possible by designing and building a common repository of general lexical information which can be customised and tuned for different applications, with a large economy of scale: there exists a large core of information that can be shared by many application uses, and this leads to the concept of “reusable” LRs, which is at the basis of the PAROLE/SIMPLE projects.
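The idea of deriving application dictionaries from one generic repository can be illustrated with a small sketch. It is not the PAROLE/SIMPLE mapping procedure: the entry content, the field names and the function derive_view are hypothetical and serve only to show how one generic entry might be projected onto the narrower data model of a given application.

```python
from typing import Any, Dict, Iterable

# A made-up generic entry; every key and value here is a placeholder.
GENERIC_ENTRY: Dict[str, Any] = {
    "lemma": "bank",
    "pos": "noun",
    "domain": "finance",
    "semantic_type": "EXAMPLE_INSTITUTION",
    "gloss": "an organisation that keeps and lends money",
    "syntactic_unit_id": "SYNU_bank_1",
}


def derive_view(entry: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Project a generic entry onto the fields an application needs.

    Raises KeyError if the application asks for a field that the
    generic entry does not provide.
    """
    wanted = list(fields)
    missing = [name for name in wanted if name not in entry]
    if missing:
        raise KeyError(f"fields not in generic entry: {missing}")
    return {name: entry[name] for name in wanted}


# Two hypothetical applications reuse the same entry with different needs.
ir_view = derive_view(GENERIC_ENTRY, ["lemma", "domain"])
wsd_view = derive_view(GENERIC_ENTRY, ["lemma", "semantic_type", "gloss"])
```

The generic entry is never modified; each application obtains its own reduced copy, which is the sense in which a single repository could be customised without rebuilding the resource from scratch.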

 

·        Format of the lexicons

 

The exchange format for the lexicons is SGML: all the semantic lexicons share the same DTD (as for the morphological and syntactic layers).

Moreover, the use of common specifications for lexicon management tools is a guarantee that all lexicons fully conform to the model. The use of these tools is a precondition of an industrial level of quality for the volume of data (in so many languages) that SIMPLE has to deliver.

 


 

User Group, Promotion and Awareness

 

The final output of SIMPLE should be used as an initial start-up kit to promote the development of semantically annotated databases and to prove the usability of such resources. The main goal of the dissemination activities undertaken by the SIMPLE consortium is therefore to raise awareness of the availability of this sample, with special emphasis on the potential of the model, including the kind of information encoded, the format and the reusability of the contents. The whole dissemination plan thus focuses on gathering enough information about the potential of the model to use it as a way of seeking support to enlarge this first sample and to start providing cross-lingual linking.

The consortium has already started informing the community of potentially interested parties about the work to be done and its potential interest for the applications now being developed. The goal of this first distribution of information is to get feedback from outside and to create a first group of potential users of the SIMPLE results, who will be informed about the progress of the project and invited to validate the results.

SIMPLE is expected to involve users from the following sectors:

 

-         Developers and integrators of IT applications

-         Research Laboratories in the area of IT technologies

-         Publishing companies

-         Academia

 

The SIMPLE consortium is also considering organising a special meeting with the user group, in order to give the interested parties more detailed information, especially in view of the validation exercise in which they will be invited to participate. Past experience has shown that validation procedures demand a high degree of understanding of the rationale behind the proposed specifications. The workshop will give an opportunity to discuss semantic encoding. To that end, experts from related projects such as EuroWordNet and EAGLES will be invited, and other well-known experts could also be invited to talk about the conclusions reached in exercises such as SENSEVAL/ROMANSEVAL. The workshop will also be an opportunity to continue the series of actions that the PAROLE/SIMPLE Consortium is planning, together with ELRA and ELSNET, to stimulate cooperation among the national activities in the field of LR in the different European countries.

 

1) Dissemination

 

The project was presented at the SIGLEX99 ACL Workshop, Maryland, June 21-22 1999. The SIMPLE Consortium is also actively cooperating in the organisation of the Second International Conference on Language Resources and Evaluation (LREC), to which several papers have been submitted.

 

2) Use of the SIMPLE lexicons

 

As agreed in the Technical Annex, the PAROLE/SIMPLE lexicons will be fully available to the R & D community. Each partner will be free to promote and distribute the Language Resources produced, adopting the strategy best suited to the needs, conditions and requirements specific to its language. Each partner is, however, also committed to distributing the lexicons through ELRA at a fair price.

Moreover, the EAGLES/ISLE Working Group on Computational Lexicon, focusing its activity on the Multilingual Lexicon, has decided to use a sample of SIMPLE entries to test the recommendations of EAGLES for a few languages.

 

3) National Projects which are direct spin-offs of PAROLE/SIMPLE

 

As noted above, one of the main functions of the PAROLE resources is to stimulate local efforts which will extend the lexicons and the corpora.

Beyond our expectations, some national authorities (e.g. Italy, Denmark, Greece, etc.) have already launched, during the current project life, various types of “national projects” which are explicitly intended to enlarge the PAROLE/SIMPLE initial nuclei. Of course, the form and the nature of these initiatives differ according to the different national situations and funding structures.

 

 

 

Future Work

 

In 2000, the SIMPLE lexicons will be completed. A validation exercise is also envisaged, as well as the public release of the linguistic specifications that form the SIMPLE model. The dissemination and awareness activities will be intensified, so as to inform the Language Engineering scientific and industrial community of the project results and of the potential that the developed resources offer at both the research and the application levels.