
SIMPLE Annual Report 1999
SIMPLE is part of a strategic policy which aims at providing a core set of
language resources for the EU languages.
The SIMPLE project, which can be considered a follow-up to PAROLE, aims at
adding a semantic layer to a subset of the existing morphological and syntactic
layers.
The semantic
lexicons (covering about 10,000 word meanings: 7,000 for nouns, 2,000 for verbs
and 1,000 for adjectives) are being built in a harmonised way for all
12 languages covered by PAROLE: Catalan, Danish, Dutch, English, Finnish,
French, German, Greek, Italian, Portuguese, Spanish, Swedish.
The main types of information to be encoded for nouns,
verbs, and adjectives are: domain information, the semantic type of the head
(with a structured semantic type), the semantic type of the arguments of
predicates (to be defined at different levels of granularity), a short
lexicographic gloss, and the linking with syntax.
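The information types listed above can be pictured as the fields of a single record per word meaning. The following is only a minimal sketch under stated assumptions: the class names, field names and example values (`SemanticUnit`, `Argument`, "Activity", "Animate", "Food") are hypothetical and chosen for illustration; they do not reproduce the actual SIMPLE model or its DTD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Argument:
    """One argument of a predicate with its semantic type (hypothetical)."""
    role: str            # illustrative label, e.g. "agent"
    semantic_type: str   # may be coarse ("Entity") or fine-grained ("Human")


@dataclass
class SemanticUnit:
    """Illustrative record for one word meaning; not the real SIMPLE schema."""
    lemma: str
    pos: str                       # "noun", "verb" or "adjective"
    domain: Optional[str]          # domain information, if any
    semantic_type: str             # semantic type of the head
    gloss: str                     # short lexicographic gloss
    arguments: List[Argument] = field(default_factory=list)
    syntactic_unit_id: Optional[str] = None  # link to the syntactic layer


def missing_fields(unit: SemanticUnit) -> List[str]:
    """Return the names of the information types left empty in a unit."""
    missing = []
    if not unit.semantic_type:
        missing.append("semantic_type")
    if not unit.gloss:
        missing.append("gloss")
    if unit.syntactic_unit_id is None:
        missing.append("syntactic_unit_id")
    return missing


# Invented example entry; the type labels are assumptions, not SIMPLE types.
eat = SemanticUnit(
    lemma="eat", pos="verb", domain=None,
    semantic_type="Activity",
    gloss="take in solid food",
    arguments=[Argument("agent", "Animate"), Argument("patient", "Food")],
)
print(missing_fields(eat))  # the syntactic link has not been set yet
```

A check of this kind is one simple way an encoding tool might flag entries that still lack the link to the syntactic layer.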
The semantic lexicons will mostly be corpus-based: each language will exploit
the harmonised and representative corpora built within PAROLE. This will make
the semantic encoding aware of actual corpus distinctions and not only of
generalisations based on linguist/lexicographer introspection, which can
sometimes be misleading.
Summary of 1999 Activities
The first months of
the project have been dedicated to the completion of the specification for the
SIMPLE semantic lexicons. Consistently with the activity and the results
achieved in the first year of the project, the formal representation of the
‘conceptual core’ of the lexicons has been designed. Moreover, the core
structured set of meaning-types (the SIMPLE Ontology) has been identified; it
is to be used as a common starting point and as a shared device for building
the harmonised language-specific semantic lexicons.
The specification phase has not focused only on preparing the top ontology of
semantic types and the list of formal tools for the analysis of lexical items.
Much effort has also gone into the design of clusters of structured semantic
information which can guide lexicographers through the encoding phase, ensuring
a high degree of consistency both during the project and in future developments
of the SIMPLE lexicons. The result of this work is the set of templates which
have been provided together with the guidelines. These templates will
facilitate the encoding procedure and at the same time will guarantee that a
uniform amount of information is encoded in word senses across the different
languages. The 1999 specification phase has focused in particular on the design
of the semantic types and the semantic representation for verbs and adjectives.
During 1999, the encoding of the semantic units and their linking with the
PAROLE syntactic layer started. The first phase of encoding mostly concerned
non-predicative nouns. This allowed lexicographers to familiarise themselves
with the model, and allowed the co-ordinators to devise strategies to check and
enhance the consistency of the encoding among the various partners. The results
have shown that the SIMPLE system of templates allows lexicographers to
maintain a higher average encoding speed, while guaranteeing the general
consistency of the work.
The second phase (September-December) of the encoding concerned predicative
nouns and verbs, which are harder cases than non-predicative semantic units as
far as the specification of their semantic information and the linking with the
PAROLE syntactic layer are concerned. At the end of the year, the total number
of encoded units corresponds to approximately two-thirds of the overall
number.
Various internal check-points were fixed during the year, so as to ensure the
highest degree of uniformity and to solve any difficulties arising in the
encoding. On October 8th-9th, a general workshop of the SIMPLE project was held
in Graz to discuss and finalise common strategies for the development of the
lexicons.
Important work areas
· Technology outlook and innovative features
Semantics is the crucial issue for the coming years. Given the ever growing
importance of the so-called ‘content industry’, every application that has to
deal with information calls for systems which go beyond the syntactic level to
understand the ‘meaning’.
Many theoretical approaches
are tackling different aspects of semantics, but in general they still have to
be tested (i.) with really large-size implementations, and (ii.) with respect
to their actual usefulness and usability in real-world systems.
SIMPLE
aims at addressing directly point (i.) above, while providing the necessary
platform to allow future projects to address point (ii.).
Semantics is again at stake when we consider the multilingual aspect, with its
problems and challenges, which is ultimately a critical issue in Europe. We
cannot hope to solve the multilingual aspect without some solution to the
semantic aspect (unless we use only statistical techniques). For the addition
of a multilingual layer (multilingual links) to available language resources
(LR) it is essential to have a harmonised set of semantic lexicons tackling in
a uniform way the core of what is needed for NLP, i.e. semantic typing of heads
and arguments.
· Resulting product profile
The SIMPLE project represents, to our knowledge, the first attempt to tackle
the encoding of semantic (argument) frames on a large scale, i.e. for so many
languages and with rather wide coverage. Even though it is a real lexicon
building project, it must also be seen as having challenging research aspects,
and it will provide a framework for testing and evaluating the maturity of the
current state of the art in lexical semantics grounded on, and connected to, a
syntactic foundation.
The availability of rather large, uniformly
structured lexical resources in so many EU languages will offer the users the
benefits of a standardized base.
According to the subsidiarity
concept, the process started at the EU level will be continued at the national
level. This is already happening for a number of languages. The PAROLE Lexicons
and Corpora will be enlarged in the framework of a number of National Projects,
e.g. for Danish, Dutch, Greek, Italian, Swedish, Spanish and Catalan. These
national initiatives show that the goal of the EC LR projects, namely to
provide a core set of resources to be extended with national support, is
beginning to be met.
The fact that all these LRs will be based on the existing models and standards
defined at the European level will create a really large infrastructure of
harmonised LRs throughout Europe. This achievement is of major importance in a
multilingual area like Europe, where all the difficulties connected with the
task of building LRs are multiplied by the language factor. This would have
been impossible without the fundamental role played by the EC LR and standards
projects.
· User requirements
In the specification phase we have taken into account the requirements of NLP
applications (parsing, generation, word sense disambiguation, Information
Retrieval, etc.), also as stated in the EAGLES report of the Lexicon Semantics
WG (Sanfilippo et al., 1998, web address:
www.ilc.pi.cnr.it/EAGLES96/rep2/rep2.html), e.g. for the decision on the basic
notions to be encoded. This is of utmost importance given the objectives of the
PAROLE/SIMPLE lexicons.
As with PAROLE, the SIMPLE Lexicons, based on the results of EAGLES and
GENELEX, must be declarative and as far as possible “application independent”.
Only in this way will they be able to evolve easily, for example to incorporate
other levels of information (more application or domain dependent) or to become
multilingual. This approach, which answers the requisites of genericity,
explicitness, and variability of granularity, will guarantee large-scale
reusability. The model, with a high level of precision in the description, is
in fact designed to ensure that application-dependent data models and
application dictionaries can be derived from this repository of information, by
mapping the generic model onto the application model.
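The idea of deriving an application dictionary from a generic repository can be sketched as a simple projection: select the entries an application needs and keep only the fields its own data model uses. This is an illustrative sketch only; the entry layout, the sample entries and the function `derive_application_lexicon` are assumptions made for the example and are not part of the SIMPLE specifications or tools.

```python
from typing import Dict, Iterable, List, Optional

# A generic, application-independent entry, represented here as a plain dict.
Entry = Dict[str, str]

# Invented sample data; the semantic type labels are assumptions.
GENERIC_LEXICON: List[Entry] = [
    {"lemma": "bank", "pos": "noun", "domain": "finance",
     "semantic_type": "Institution", "gloss": "financial institution"},
    {"lemma": "bank", "pos": "noun", "domain": "geography",
     "semantic_type": "Location", "gloss": "land beside a river"},
]


def derive_application_lexicon(
    lexicon: Iterable[Entry],
    keep_fields: List[str],
    domain: Optional[str] = None,
) -> List[Entry]:
    """Map generic entries onto a smaller, application-specific model.

    Entries are filtered by domain (if one is given) and projected onto
    the fields that the application model actually uses.
    """
    derived = []
    for entry in lexicon:
        if domain is not None and entry.get("domain") != domain:
            continue
        derived.append({name: entry[name] for name in keep_fields if name in entry})
    return derived


# A hypothetical finance glossary needs only lemmas and glosses.
finance_glossary = derive_application_lexicon(
    GENERIC_LEXICON, ["lemma", "gloss"], domain="finance"
)
print(finance_glossary)
```

The generic repository is left untouched, so several applications can each derive their own view from the same data.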
One tension at stake is that between the genericity of an LR and its usefulness
for applications. It is important to avoid the expensive duplication of effort
connected with the practice of building new specific LRs from scratch for each
different application. This is only possible by designing and building a common
repository of general lexical information which can be customised and tuned for
different applications, with a large economy of scale: there exists a large
core of information that can be shared by many applications, and this leads to
the concept of “reusable” LRs, which is at the basis of the PAROLE/SIMPLE
projects.
· Format of the lexicons
The exchange format
for the lexicons is SGML: all the semantic lexicons share the same
DTD (as for the morphological and syntactic layers).
Moreover, the use of common specifications for lexicon management tools is a
guarantee that all lexicons fully conform to the model. The use of these tools
is a precondition for an industrial level of quality, given the volume of data
(in so many languages) that SIMPLE has to deliver.
User Group, Promotion and Awareness
The final output of SIMPLE should be used as an initial start-up kit to promote
the development of semantically annotated databases and to prove the usability
of such resources. The main goal of the dissemination activities undertaken by
the SIMPLE consortium is therefore to raise awareness of the availability of
this sample, with special emphasis on the potential of the model, including the
kind of information encoded, the format and the reusability of the contents.
Accordingly, the whole dissemination plan focuses on how to gather enough
information about the potential of the model to use it as a way of seeking
support to enlarge this first sample and to start providing cross-lingual
linking.
The consortium has already started informing the community of potentially
interested parties about the work to be done and its potential interest for the
applications now being developed. The goal of this first distribution of
information is to get feedback from outside the project and to create a first
group of possible users of the SIMPLE results, who will be informed about the
progress of the project and invited to take part in the validation of the
results.
SIMPLE is expected to involve users from the following sectors:
- Developers and integrators of IT applications
- Research Laboratories in the area of IT technologies
- Publishing companies
- Academia
The SIMPLE consortium is also considering the possibility of organizing a
special meeting with the user group that has been created, so as to supply the
interested parties with more detailed information, especially in view of the
validation exercise in which they will be invited to participate. Past
experience has shown that validation procedures demand a high degree of
understanding of the rationale behind the proposed specifications.
The workshop will give an opportunity to discuss semantic encoding. To that
end, experts from other related projects such as EuroWordNet and EAGLES will be
invited, and other well-known experts could also be invited to talk about the
conclusions drawn from exercises such as SENSEVAL/ROMANSEVAL. The workshop will
also be an opportunity to continue the series of actions that the PAROLE/SIMPLE
Consortium is planning, together with ELRA and ELSNET, to stimulate cooperation
among the national activities in the field of LR in the different European
countries.
1) Dissemination
The project was presented at the SIGLEX99 ACL Workshop, Maryland, June 21-22,
1999. The SIMPLE Consortium is also actively cooperating in the organisation of
the Second International Conference on Language Resources and Evaluation
(LREC), to which several papers have been submitted.
2) Use of the SIMPLE lexicons
As agreed in the Technical Annex, the PAROLE/SIMPLE lexicons will be fully
available to the R&D community. Each partner will be free to promote and
distribute the Language Resources produced, adopting the strategy best suited
to the needs, conditions and requirements specific to its language. Each
partner is, however, also committed to distributing the lexicons through ELRA
at a fair price.
Moreover, the
EAGLES/ISLE Working Group on Computational Lexicon, focusing its activity on
the Multilingual Lexicon, has decided to use a sample of SIMPLE entries to test
the recommendations of EAGLES for a few languages.
3) National Projects which are direct spin-offs of PAROLE/SIMPLE
As mentioned above, one of the main functions of the PAROLE resources is to
stimulate local efforts which will extend the lexicons and the corpora.
Some national authorities (e.g. Italy, Denmark, Greece) have, beyond our
expectations, already launched during the current project life various types of
“national projects” which are explicitly intended to enlarge the PAROLE/SIMPLE
initial nuclei. Of course, the form and the nature of these initiatives differ
according to the different national situations and funding structures.
Future Work
In 2000, the SIMPLE lexicons will be completed. A validation exercise is also
envisaged, as well as the public release of the linguistic specifications that
form the SIMPLE model. The dissemination and awareness activities will be
intensified, so as to inform the Language Engineering scientific and industrial
community of the project results and of the potential that the developed
resources offer at both the research and the application level.