* * *
| Document first version date | 28/04/00 |
| Document date | 12/05/00 |
| Document ID | WP1 Swedish Simple-Lexicon Documentation |
| Version | 02 |
| Doc. type | D_03_WP03.12_2_doc.htm |
| Document status | |
| Validation type | |
| Comments | |

| | Name | Organisation | Purpose |
| From | Maria Toporowska Gronostaj | GOT | documentation |
| | Karin Warmenius | | |
| To | | | information |
The EU project SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) is a follow-up to PAROLE (Preparatory Action for Linguistic Resources Organisation for Language Engineering). It has aimed at the development of wide-coverage semantic lexicons for 12 European languages. The SIMPLE lexicons add a semantic layer to the morphological and syntactic layers encoded in the PAROLE lexicons. All the lexicons share the same formal representation model, an Entity/Relationship model, and the same DTD (Document Type Definition), implemented in SGML. The theoretical and formal model of SIMPLE is documented in the Linguistic Guidelines (Lenci et al. 1998), and prior knowledge of the guidelines is presupposed here.
Språkdata, Göteborg University, has participated in both projects. The building of the PAROLE and SIMPLE lexicons for Swedish has been supported to a large extent by the lexical resources developed at the Department of Swedish Language, Språkdata, Göteborg University. The machine-readable monolingual dictionaries Svenska ord (1992) and Nationalencyklopedins ordbok (1995), and the lexical database Göteborgs lexikaliska databas (GLDB), provided a considerable amount of lexical data concerning morphology, syntax and semantics. GLDB is a monolingual lexical database with a focus on modern Swedish. It is the most exhaustive source of information on modern Swedish. The PAROLE corpora and the corpora available in the Swedish Language Bank, Språkbanken, have also been consulted frequently. Moreover, the lexicographic and lexicological research carried out at Språkdata has proved helpful with regard to questions concerning morphological, syntactic and semantic information in the PAROLE and SIMPLE lexicons (Gellerstam 1988, Toporowska Gronostaj 1996, Järborg 1989, Malmgren 1988).
To provide some background for this documentation, we start with a short overview of the conceptual model underlying the SIMPLE project, followed by some introductory notes on the Swedish PAROLE lexicon. The core section of this report documents the Swedish SIMPLE lexicon (hereafter Swe-S). We conclude the core section by presenting ways to extend the content of Swe-S automatically and by showing its usability for different NLP applications. In Appendix A we reprint a paper on this subject that will be presented at the LREC 2000 conference in Athens (Kokkinakis et al. 2000). In Appendix B we provide some examples from the PAROLE and SIMPLE lexicons for Swedish, to give an insight into the description modes and format used in these lexicons for encoding the morphology, syntax and semantics of nouns, verbs and adjectives.
The notion of semantic type is central to the SIMPLE model and its ontology. It corresponds roughly to a word sense assigned to a lexical item. There are 139 semantic types distinguished in the SIMPLE ontology. Each semantic type is defined as a cluster of structured semantic information significant for a given word sense. Information on semantic class, domain, argument structure of predicative expressions and selectional restrictions on arguments, as well as qualia roles, constitutes a relevant part of the semantic type specification (Lenci et al. 1998, Calzolari 1999, Pedersen & Keson 1999). The SIMPLE ontology is multidimensional, as it is based on the principle of orthogonal inheritance (Pustejovsky 1995), and in this respect it contrasts with the LexiQuest semantic class ontology, which is based on a standard, monodimensional approach. The latter ontology includes 95 semantic classes. Both ontologies are hierarchically structured. The above-mentioned ontologies were designed for the semantic typing of verbs and nouns. For the semantic typing of adjectives, an ontology consisting of 14 semantic types has been used.
The Swedish PAROLE lexicon is a language engineering resource which provides access to the lexically determined morphological and syntactic information on 20.000 Swedish words. It follows the conceptual and formal model formulated within the PAROLE project (Guimier, Ogonowski, 1998). A subset of the entries in the Swedish PAROLE lexicon, that is 5.621 words, has been enriched with the semantic information encoded in Swe-S.
The encoding of morphosyntactic data for Swedish has followed the model presented in the PAROLE report (WP-4.2.2b), Morphosyntactic Description of Swedish (Danielsson & Järborg 1996). In the Swedish PAROLE morphological lexicon there are 19.985 morphological units described according to 259 morphological modes. The morphological modes capture the types of inflectional relations a morphological unit can display. Their descriptions vary according to the part of speech of a morphological unit and the morphological properties associated with its morphosyntactic type. To account for these, a technical stem and a set of relevant affixes have been postulated for each morphological mode.
A syntactic unit (Usyn) is a basic access point to the syntactic layer. In the Swedish PAROLE syntactic lexicon, there are 28.882 syntactic units linked to 19.985 morphological units. These syntactic units have their syntactic behaviour described by means of 438 syntactic descriptions. In the Swedish syntactic lexicon the following aspects of lexically determined syntactic behaviour have been accounted for:
This lexically determined information has been treated as criterial for distinguishing different types of subcategorization frames in Swedish, or in other words, for postulating 438 Description objects.
Lexicon population in the Swe-S lexicon can be characterized as composed of two sets. The first one consists of 8.571 fully encoded lexicon items, which have been manually encoded. It covers nouns, verbs and adjectives. The second one consists of 25.000 partially encoded nouns, which have been automatically extracted from corpora, given 1.800 fully encoded words from the Swe-S as input. This set covers 22.000 compound- and 3.000 simplex-nouns. For details on the methodology used for extending the coverage of the Swe-S lexicon see Appendix A.
Overall statistics for the fully encoded SemUs in Swe-S
Number of SemUs linked to Usyns and Ums: 8.571
SemUs per category:
| Category | SemUs | SynUs | MuSs |
| Noun | 6.177 | 4.799 | 3.624 |
| Verb | 1.943 | 1.628 | 1.546 |
| Adjective | 451 | 253 | 127 |
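As a quick consistency check on the figures above, the per-category SemU counts can be summed and compared with the stated total of 8.571. The following is only a small sketch; the dictionary layout and the helper name `column_total` are illustrative assumptions, and the numbers are copied from the table with the thousands separators removed.

```python
# Counts copied from the statistics table (thousands separators removed).
counts = {
    "Noun": {"SemUs": 6177, "SynUs": 4799, "MuSs": 3624},
    "Verb": {"SemUs": 1943, "SynUs": 1628, "MuSs": 1546},
    "Adjective": {"SemUs": 451, "SynUs": 253, "MuSs": 127},
}


def column_total(table, column):
    """Sum one column of the table over all parts of speech."""
    return sum(row[column] for row in table.values())


semu_total = column_total(counts, "SemUs")
# The per-category SemU counts should add up to the stated overall total.
assert semu_total == 8571
print("SemUs in total:", semu_total)
```

The same helper can be applied to the other two columns to obtain the number of syntactic and morphological units that the fully encoded SemUs are linked to.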
While populating the Swe-S lexicon manually, preference was given to those lexical items that were considered Swedish equivalents of the base concepts. The base concept set, common to all SIMPLE partners, is a selection of concepts derived from EuroWordNet by means of two operational criteria, namely the number of relations (general or limited to hyponymy) and the high position of the concept in a hyperonym-hyponym hierarchy. Frequency was another significant factor taken into consideration.
As far as the population of the sample lexicon for Swedish is concerned, the selection of items included there aimed at:
The fully encoded semantic data presented in the Swedish SIMPLE lexicon provide the following information:
In the SIMPLE model, a set of templates provides guidelines concerning sense assignment, as each template is a structured set of semantic information which highlights relevant aspects of word meaning.
GLDB's approach to word sense discrimination is a relational one, which means that sense readings are internally structured and that a kernel sense and its subsenses are related to each other by a set of lexical transfer rules. The lexical information conveyed by the hierarchical microstructure of a lexicon entry in GLDB is mapped onto the SIMPLE ontology and the semantic information pertaining to a semantic type. In consequence, links between these two resources are established and the lexical data can be freely accessed. In the Swe-S lexicon, the links between the two systems are shown by means of the marking convention used for SemUs and glosses. While SemUs are numbered consecutively, GLDB's lemma-lexeme-subsense numbers point to the readings in GLDB and thus provide access to all the syntactic and semantic information linked to a reading there.
Word sense discrimination has been assumed to be a point of departure for the linking of syntactic and semantic layers, as it is the word meaning that determines its syntactic behavior, argument structure, argument type (true, default or shadow) as well as selectional preferences and semantic roles of its arguments. This information, with the exception of semantic roles that are not specified there, is mapped in Swe-S onto positions at the syntactic layer and consequently it is linked to information on (i) number and optionality of positions in subcategorization pattern, (ii) syntactic functions and their morphosyntactic and lexical realisations, (iii) semantic preferences on arguments with regard to the feature 'animatness'. As a consequence, recurrent mapping patterns can be distinguished.
Let us now turn to the SGML objects. The linking between syntactic positions and semantic arguments is specified in the Correspondence object. For non-predicative nouns, a syntactic unit is linked to the one or more SemUs to which it corresponds. Three types of links have been used in Swe-S:
krokus 'crocus'
The one-to-many relation is illustrated below for the non-predicative polysemous noun trea 'three, third'. The syntactic unit DN0 is linked to six SemUs, assigned to the following semantic types: 1-Human, 2-Building, 3-Vehicle, 4-Sign, 5-Unit_of_measurment and 6-Number.
trea noun meaning 'three, third'
For verbs and deverbal nouns taking arguments, the correspondence is established between particular positions in the subcategorization frame and their arguments. Thus the two arguments a0 and a1 of the verb brevväxla 'correspond (with sb)' are linked to the subject and prepositional object positions respectively in the subcategorization frame for the syntactic unit D01P11POMED. The same verb is also linked to the syntactic unit D01PL, with one position occupied by a plural subject. In this case one can talk about reduced correspondence. These two show semantic affinity, and are therefore encoded with the same SemU number, except that the latter is indexed by 'a' (alternation).
brevväxla 'correspond (with sb)'
In Swe-S all the verbs, deverbal nouns and relational nouns have their positions linked to the arguments. In most cases these correspondences are isomorphic. Non-isomorphic cases occur, for example, when a shadow argument is part of a predicative representation, e.g. with weather verbs, or in other constructions where the number of syntactic positions does not match the number of semantic arguments. As an example of the latter, one can mention the syntactic unit DACVASUPR, which is a predicative construction with an adjective as head. The lack of correspondence follows from the fact that we have postulated an argument on the semantic layer in order to encode the selectional preferences imposed by an adjectival head on the subject. These are needed in order to motivate the semantic type of an adjective in predicative constructions.
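The difference between isomorphic and reduced correspondence can be made concrete with a small data-structure sketch. This is not the SGML Correspondence object itself: the class, its field names and the position labels are hypothetical, and the way the reduced frame D01PL distributes both arguments over the single plural subject is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Correspondence:
    """Hypothetical stand-in for a link between arguments and positions."""
    usyn: str         # identifier of the syntactic unit
    positions: tuple  # positions in the subcategorization frame
    links: dict       # semantic argument -> syntactic position

    def is_isomorphic(self) -> bool:
        """True when arguments and positions pair off one-to-one."""
        linked = list(self.links.values())
        no_shared_position = len(set(linked)) == len(linked)
        every_position_used = set(linked) == set(self.positions)
        return no_shared_position and every_position_used


# brevväxla 'correspond (with sb)': a0 -> subject, a1 -> prepositional object.
full = Correspondence(
    usyn="D01P11POMED",
    positions=("subject", "prepositional_object"),
    links={"a0": "subject", "a1": "prepositional_object"},
)

# Reduced correspondence: one plural subject position.
# ASSUMPTION: both arguments are taken to be realised by that position.
reduced = Correspondence(
    usyn="D01PL",
    positions=("plural_subject",),
    links={"a0": "plural_subject", "a1": "plural_subject"},
)

print(full.usyn, full.is_isomorphic())
print(reduced.usyn, reduced.is_isomorphic())
```

Under these assumptions the first frame counts as isomorphic and the second does not, which mirrors the distinction drawn in the text between regular and reduced correspondence.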
The hierarchically structured domain list, elaborated by ERLI/LexiQuest, has been the point of departure for assigning domain information to the readings. This repository of specific domain values has been extended with the value General, to cover general, underspecified readings. While encoding domains, attention has been paid to the following:
Template Type assignments were based on semantic specifications included in templates. The content of these specifications was compared to the information provided by glosses in GLDB. To make the assignment as exact as possible we have used both the core and recommended ontological types.
Since word senses differ in their lexical complexity and transparency, the assignment task varied accordingly. Concrete nouns could be assigned to the template types without problems in most cases; abstract nouns were more problematic. Verbs with transparent meanings were easier to assign than verbs with underspecified, vague meanings. Thus, the more abstract and vague the word senses were, the less obvious their assignment was. The same can be said about the semantic class assignments.
In the ERLI/LexiQuest classification there are 95 semantic classes for classifying the readings of nouns and verbs. Since this classification differs from the one postulated by the SIMPLE ontology as to the number and distribution of categories, the mapping between categories is either one-to-one, one-to-many or many-to-one. For instance, the semantic classes Instrument, Apparatus, Measuring_instrument and Musical_instrument all map onto the template Instrument. We have taken advantage of such differences in order to nuance sense assignments.
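The many-to-one case mentioned above can be sketched as a simple lookup table. The dictionary below contains only the four classes named in the text; the function names are hypothetical, and the sketch says nothing about how the mapping was actually stored or applied in the project.

```python
from collections import defaultdict

# Semantic class (ERLI/LexiQuest) -> template type (SIMPLE).
# Only the example given in the text is included.
CLASS_TO_TEMPLATE = {
    "Instrument": "Instrument",
    "Apparatus": "Instrument",
    "Measuring_instrument": "Instrument",
    "Musical_instrument": "Instrument",
}


def classes_per_template(mapping):
    """Invert the mapping: template type -> set of semantic classes."""
    inverse = defaultdict(set)
    for sem_class, template in mapping.items():
        inverse[template].add(sem_class)
    return dict(inverse)


def mapping_kind(mapping, template):
    """Classify how semantic classes map onto one template type."""
    n = len(classes_per_template(mapping).get(template, ()))
    if n == 0:
        return "unmapped"
    return "one-to-one" if n == 1 else "many-to-one"


print(mapping_kind(CLASS_TO_TEMPLATE, "Instrument"))
```

Keeping both labels on a reading preserves the finer distinction: two words may share the template Instrument while differing in semantic class.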
As far as the task of assigning template types to adjectives is concerned, we have used all 14 options provided by the ontology. It may be of interest to note that the meaning components made explicit in this ontology have had a double function, as they have served both as meaning components and as labels for semantic classes for adjective types. The list of semantic classes for adjectives has been extended with class categories taken from the set of semantic classes for nouns and verbs, e.g. Attribute and Emotion, whenever these were semantically motivated.
The set of 139 ontological types postulated in the guidelines has proved to be relevant for Swedish, as illustrated below (R stands for recommended ontological types):
| 1. | TELIC | mål, syfte, målsättning |
| 2. | AGENTIVE | motiv, orsak, skäl, anledning |
| 2.1. | CAUSE | vålla, orsaka, föranleda |
| 3. | CONSTITUTIVE | beståndsdel, enhet, molekyl |
| 3.1. | PART | arm, framdel, baksida |
| 3.1.1. | BODY_PART | arm, huvud, lillfinger |
| 3.2. | GROUP | skock, stim, bukett |
| 3.2.1. | HUMAN_GROUP | skara, sekt, sextett |
| 3.3. | AMOUNT | flaska, glas, matsked |
| 4. | ENTITY | |
| 4.1. | CONCRETE_ENTITY | |
| 4.1.1. | LOCATION | badort |
| 4.1.1.1. | 3_D_location | berg, dal, grotta |
| 4.1.1.2. | Geopolitical_location | badort, bygd, förort |
| 4.1.1.3. | Area | område, trakt, åker |
| 4.1.1.4. | Opening | dörr, fönster, hål |
| 4.1.1.5. | Building | byggnad, foajé, hus, kyrka |
| 4.1.1.6. | Artifactual_area (R) | boulevard, cykelbana, hållplats, motorväg |
| 4.1.2. | MATERIAL | glas, material, stoff |
| 4.1.3. | ARTIFACT | papper, armbandsur, dörr, ficklampa |
| 4.1.3.1. | Artifactual_material | mjöl, nylon, papper, damast |
| 4.1.3.2. | Furniture | fåtölj, ribbstol, sekretär |
| 4.1.3.3. | Clothing | skjorta, slips, T-shirt |
| 4.1.3.4. | Container | kopp, glas, väska, burk |
| 4.1.3.5. | Artwork | bataljmålning, bild, opera |
| 4.1.3.6. | Instrument | persondator, piano, pistol |
| 4.1.3.7. | Money | dollart, enkrona, pund |
| 4.1.3.8. | Vehicle | bil, cykel, buss |
| 4.1.3.9. | Semiotic_artifact | papper, bok, bibel |
| 4.1.4. | FOOD | lingon, potatis, champinjon |
| 4.1.4.1. | Artifact_Food (R) | pepparkaka, bröd |
| 4.1.4.2. | Flavouring (R) | paprika, curry, dressing |
| 4.1.5. | PHYSICAL_OBJECT | fossil, gallsten, himlakropp |
| 4.1.6. | ORGANIC_OBJECT | hinna, kokong, svulst |
| 4.1.7. | LIVING_ENTITY | |
| 4.1.7.1. | Animal | däggdjur, fauna, |
| 4.1.7.1.1. | Earth_animal (R) | dromedar, får, hund |
| 4.1.7.1.2. | Air_animal (R) | bofink, duva, fjäril |
| 4.1.7.1.3. | Water_animal (R) | delfin, bläckfisk, gädda |
| 4.1.7.2. | Human | bekanting, fadderbarn |
| 4.1.7.2.1. | People | indier, svensk, svenska |
| 4.1.7.2.2. | Role | anhängare, apostel, dagbarn |
| 4.1.7.2.2.1. | Ideo | marxist, pacifist, katolik |
| 4.1.7.2.2.2. | Kinship | svåger, bror, son |
| 4.1.7.2.2.3. | Social_status | adjunkt, advokat, premiärminister |
| 4.1.7.2.3. | Agent_of_temporary_activity | pristagare, bedragare |
| 4.1.7.2.4. | Agent_of_persistent_activity | cellist, diktare, jägare |
| 4.1.7.2.5. | Profession | advokat, ambassadör, cellist |
| 4.1.7.3. | Vegetal_entity | champinjon, kantarell, murkla |
| 4.1.7.3.1. | Plant | lingon, potatis, rädisa |
| 4.1.7.3.2. | Flower | aster, begonia, krokus |
| 4.1.7.3.3. | Fruit | lingon, persika, nypon |
| 4.1.7.4. | Micro-organism | plankton, virus, bacill |
| 4.1.8. | SUBSTANCE | parfym, balsam, cellgift |
| 4.1.8.1. | Natural_substance | alabaster, cesium, diamant |
| 4.1.8.2. | Substance_food (R) | fasan, filé, forell |
| 4.1.8.3. | Drink (R) | dricksvatten, dryck |
| 4.1.8.3.1 | Artifactual_drink (R) | milkshake, läskedryck, saft |
| 4.2. | PROPERTY | |
| 4.2.1. | QUALITY | fräckhet, godhet, hövlighet |
| 4.2.2. | PSYCH_PROPERTY | intellekt, intuition, karisma |
| 4.2.3. | PHYSICAL_PROPERTY | massa, spänst, storlek |
| 4.2.3.1. | Physical_power (R) | färdighet, kondition, känsel |
| 4.2.3.2. | Color (R) | terrakotta, |
| 4.2.3.3. | Shape (R) | kub, klot, klyfta |
| 4.2.4. | SOCIAL_PROPERTY (R) | makt, självständighet, välgörenhet |
| 4.3. | ABSTRACT_ENTITY | värde, begrepp, huvudsak |
| 4.3.1. | DOMAIN | område, slätt, strand |
| 4.3.2. | TIME | säsong, söndag, telefontid |
| 4.3.3. | MORAL_STANDARDS (R) | jämlikhet, likaberättigande, solidaritet |
| 4.3.4. | COGNITIVE_FACT | tanke, teori, åsikt |
| 4.3.5. | MOVEMENT_OF_THOUGHT | funktionalism, hinduism, idealism |
| 4.3.6. | INSTITUTION | kyrka, riksdag, sjukhus |
| 4.3.7. | CONVENTION | traktat, överenskommelse, avtal |
| 4.4. | REPRESENTATION | verb, kliché, nummer |
| 4.4.1. | LANGUAGE | svenska, italienska, danska |
| 4.4.2. | SIGN | apostrof, bestämningsord, glossa |
| 4.4.3. | INFORMATION | papper, handbok, meddelande |
| 4.4.4. | NUMBER (R) | miljard, nittiotal, sjua |
| 4.4.5. | UNIT_OF_MEASUREMENT | centiliter, decennium, dygn |
| 4.5. | EVENT | |
| 4.5.1. | PHENOMENON | |
| 4.5.1.1. | Weather_verbs (R) | regna, snöa, åska |
| 4.5.1.2. | Disease (R) | astma, cancer |
| 4.5.1.3. | Stimuli (R) | väsen |
| 4.5.2. | ASPECTUAL | sluta, fortgå, pågå |
| 4.5.2.1. | Cause_aspectual | sluta, börja |
| 4.5.3. | STATE | dåsa, blänka |
| 4.5.3.1. | Exist | förekomma, existera |
| 4.5.3.2. | Relational_state | toppar, utgöra, företräda |
| 4.5.3.2.1. | Identificational_state (R) | symbolisera, likna, |
| 1. | INTENSIONAL | |
| 1.1. | Modal | trolig, omöjlig |
| 1.2. | Temporal | tidig, daglig |
| 1.3. | Emotive | liten, dyr |
| 1.4. | Manner | svart, rättslig |
| 1.5. | Object-related | medicinsk, stark |
| 1.6. | Emphasizer | farlig, fattig |
| 2. | EXTENSIONAL | |
| 2.1. | Physical_property | liten, svart, stark |
| 2.2. | Psychological_property | seriös, stark |
| 2.3. | Social_property | svart, demokratisk |
| 2.4. | Temporal_property | tidig, färdig |
| 2.5. | Intensifying_property | liten, stark |
| 2.6. | Relational_property | likartad |
The semantic information encoded in the Swe-S lexicon can be automatically sorted, clustered and accessed in a number of different ways. We consider the following types of semantic classes to be of interest for NLP and NLU applications, not to mention their relevance for checking the internal consistency of the SIMPLE lexicons:
The classes of semantic data derived from sorting according to semantic type, domain and/or semantic class form semantic sublexicons, which can easily be enriched with other semantic, syntactic or morphological data available in the PAROLE-SIMPLE lexicon and GLDB. Examples of tripartite semantic classes are given below, with information on template type, domain and semantic type as input:
Such tripartite classes support semantic annotating, information extraction, machine translation and natural language understanding in an effective way.
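A tripartite class of this kind amounts to grouping entries on a three-part key. The sketch below shows the idea on a handful of words taken from the ontology table; the record layout and the function name are hypothetical, and the domain and semantic class values attached to each word are illustrative assumptions, not values read from Swe-S.

```python
from collections import defaultdict

# (word, template type, domain, semantic class)
# ASSUMPTION: domain and semantic class values are invented for illustration.
entries = [
    ("piano", "Instrument", "Music", "Musical_instrument"),
    ("persondator", "Instrument", "General", "Apparatus"),
    ("pistol", "Instrument", "General", "Instrument"),
    ("bil", "Vehicle", "General", "Vehicle"),
    ("cykel", "Vehicle", "General", "Vehicle"),
]


def tripartite_classes(records):
    """Group words by (template type, domain, semantic class)."""
    classes = defaultdict(list)
    for word, template, domain, sem_class in records:
        classes[(template, domain, sem_class)].append(word)
    # Sort the members so that the output is stable.
    return {key: sorted(words) for key, words in classes.items()}


for key, words in sorted(tripartite_classes(entries).items()):
    print(key, words)
```

Each resulting group is a small sublexicon that could then be joined with syntactic or morphological data through the identifiers shared by the PAROLE and SIMPLE layers.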
Access to data on the types of polysemous classes instantiated in a language has proved to be indispensable for the consistent encoding of meanings in the Swe-S lexicon. Below we list some frequent types of regular polysemous classes instantiated in the Swe-S lexicon. It should be observed that polysemy relations are not explicitly coded in the Swe-S lexicon, as to a large extent they can be automatically extracted from GLDB. In the list below, the first two columns display pairs of semantic types which stand in a polysemous relation; the third provides examples. The list is alphabetically sorted.
| Agent of persistent activity | Profession | cellist |
| Animal | Artifactual material | krokodil |
| Animal | Substance Food | krabba |
| Building | Human group | administration |
| Building | Institution | kyrka |
| Artwork | Information | illustration |
| Container | Amount | burk |
| Flower | Colour | aprikos |
| Geopolitical location | Human group | stad |
| Human group | Institution | administration |
| Human group | Artwork | trio |
| Location | Human group | avdelning |
| Number | Building | trea |
| Number | Human | trea |
| Number | Sign | trea |
| Number | Unit of measurement | trea |
| Number | Time | femtiotal |
| Number | Vehicle | trea |
| Opening | Artifact | dörr |
| People | Language | svenska |
| Plant | Fruit | avocado |
| Plant | Flavouring | pepparrot |
| Plant | Substance | bomull |
| Profession | Social status | advokat |
| Semiotic artifact | Information | bibel |
| Semiotic artifact | Convention | traktat |
| Substance | Colour | terrakotta |
| Water-Animal | Substance Food | krabba |
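Candidate pairs like those in the list can be derived mechanically once each word is associated with the set of semantic types of its readings. The following is a minimal sketch of that idea, not the extraction procedure used with GLDB: the function name is hypothetical, and the input is a small hand-made dictionary built from examples in this document.

```python
from collections import defaultdict
from itertools import combinations

# Word -> semantic types of its readings (examples taken from the text).
READINGS = {
    "trea": {"Human", "Building", "Vehicle", "Sign",
             "Unit_of_measurement", "Number"},
    "kyrka": {"Building", "Institution"},
    "administration": {"Building", "Human_group", "Institution"},
    "krabba": {"Water_animal", "Substance_food"},
    "burk": {"Container", "Amount"},
}


def polysemy_patterns(readings):
    """Map each unordered pair of semantic types to the words showing it."""
    patterns = defaultdict(list)
    for word in sorted(readings):
        for pair in combinations(sorted(readings[word]), 2):
            patterns[pair].append(word)
    return dict(patterns)


patterns = polysemy_patterns(READINGS)
# Pairs instantiated by more than one word are candidates for regular polysemy.
recurrent = {pair: words for pair, words in patterns.items() if len(words) > 1}
print(recurrent)
```

Note that the sketch over-generates: a word with six readings yields fifteen pairs, whereas the list above records only the pairs judged to be genuine polysemy relations, so the output should be read as a set of candidates for manual review.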
There is no doubt that reaching a consensus on how to deal with polysemy relations is of great relevance for: (i) checking consistency in lexicons, (ii) determining criteria for sense discrimination, (iii) structuring lexicon entries with regard to the distinction between core senses and their subsenses, (iv) text disambiguation, (v) multilingual linking of lexicons.
A subset of words in the Swe-S lexicon carries information on their predicative representation and selectional restrictions on arguments. These representations create semantic frames whose selectional restrictions on arguments are specified by reference to semantic types. This information seems to be especially helpful for extending the coverage of the lexicon, as new entries can by default inherit information on the semantic type assignment. To make the assignment as exact as possible, the information on semantic representation should be integrated with information on syntax. For example, verbs of the type Natural transition show a preference for a frame with a Living entity in the subject position; the same can be said of the Cooperative speech act verbs. Thus their meanings determine the semantic types of the arguments, and this information can be reused for applications concerning annotation, disambiguation, lexical acquisition, NLU and automatic extension of the coverage of the Swe-S lexicon.
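How a selectional restriction stated in terms of semantic types can be checked against a candidate filler is sketched below. The parent links are read off the numbering of the ontology table in this document, but reducing the ontology to a single-parent tree is a simplifying assumption (SIMPLE itself uses orthogonal inheritance), and the frame layout and function names are hypothetical.

```python
# Child type -> parent type, following the numbering of the ontology table.
# ASSUMPTION: a single-parent tree; the real ontology is multidimensional.
PARENT = {
    "Earth_animal": "Animal",
    "Air_animal": "Animal",
    "Water_animal": "Animal",
    "Animal": "Living_entity",
    "Human": "Living_entity",
    "Plant": "Vegetal_entity",
    "Vegetal_entity": "Living_entity",
    "Living_entity": "Concrete_entity",
    "Building": "Location",
    "Location": "Concrete_entity",
    "Concrete_entity": "Entity",
}


def is_a(sem_type, ancestor):
    """True if sem_type equals ancestor or lies below it in the tree."""
    current = sem_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def satisfies(frame, fillers):
    """Check every restricted position of a frame against its filler type."""
    return all(
        position in fillers and is_a(fillers[position], required)
        for position, required in frame.items()
    )


# Restriction mentioned in the text for verbs of type Natural transition.
natural_transition = {"subject": "Living_entity"}

print(satisfies(natural_transition, {"subject": "Earth_animal"}))  # e.g. hund
print(satisfies(natural_transition, {"subject": "Building"}))      # e.g. hus
```

The same check can be run in the other direction during lexicon extension: an unknown noun that repeatedly fills a restricted position becomes a candidate for the semantic type that position requires.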
In the SIMPLE lexicon model the hyponym-hyperonym relation is part of the qualia encoding, namely Formal. In the Swe-S lexicon this information has not been encoded, because it is already available for all the noun and verb entries in GLDB, through the application called SEMNET, which parses definitions and finds genus proximum. Access to this knowledge is indispensable for a number of NLP and NLU tasks.
Having already hinted at the relevance of predicative representation for different application tasks, we give below an example showing how the selectional restrictions on verb arguments are encoded for the verb stimmar (1. 'be noisy', 2. 'shoal'). Selectional restrictions on the arguments in a predicative representation in Swe-S are encoded by pointing either to one or more template types, as is the case in the examples below, or to a particular SemU. Whenever a predicative expression implies a specific referent, its reference is encoded by means of a SemU (e.g. bark, quack).
The issue of usability of the semantic data encoded in Swe-S has been discussed in an article that will be published in the proceedings from LREC 2000 conference. The article is reprinted in Appendix A.
In the following, we show the distribution of nouns, verbs and adjectives with respect to template types. We list only those template types that have more than 10 words encoded per template.
Nouns:
| Count | Template type | Count | Template type |
| 337 | Human | 46 | Disease |
| 266 | Profession | 45 | Agent_of_persistent_activity |
| 238 | Semiotic_artifact | 45 | Body_part |
| 196 | Relational_act | 45 | Clothing |
| 186 | People | 45 | Food |
| 174 | Abstract_entity | 43 | Container |
| 134 | Agent_of_temporary_activity | 42 | Fruit |
| 129 | Artifact | 40 | Quality |
| 127 | Social_status | 39 | Movement_of_thought |
| 124 | Artwork | 38 | Artifactual_area |
| 122 | Time | 38 | Experience_event |
| 120 | Instrument | 37 | Location |
| 110 | Information | 36 | Number |
| 105 | Representation | 35 | Area |
| 103 | Convention | 35 | Cause_constitutive_change |
| 97 | Unit_of_measurment | 34 | Act |
| 95 | Role | 34 | Property |
| 92 | Language | 33 | Substance_Food |
| 92 | Natural_substance | 33 | Water_animal |
| 91 | Human_group | 32 | Substance |
| 85 | Plant | 30 | Artifactual_drink |
| 80 | Artifactual_material | 29 | Ideo |
| 88 | Kinship | 28 | Air_animal |
| 87 | Cognitive_fact | 28 | Geopolitical_location |
| 84 | Building | 28 | Physical_object |
| 80 | Artifactual_material | 27 | Identificational_state |
| 79 | Part | 26 | Opening |
| 78 | Amount | 26 | Speech_act |
| 72 | Psych_property | 25 | Furniture |
| 72 | Relational_state | 24 | Shape |
| 67 | Psychological_event | 22 | Change_of_state |
| 64 | Institution | 22 | Moral_standard |
| 62 | Sign | 21 | Modal_event |
| 61 | Constitutive | 20 | Flower |
| 61 | State | 19 | Artifactual_food |
| 58 | Group | 18 | Money |
| 58 | Vehicle | 18 | Symbolic_creation |
| 57 | Purpose_act | 16 | Flavouring |
| 56 | Cause_change_of_state | 15 | Entity |
| 55 | Domain | 15 | Transaction |
| 54 | Phenomenon | 14 | Cause_change_of_location |
| 53 | Artifact_Food | 14 | Change_of_location |
| 53 | Cooperative_activity | 14 | Material |
| 52 | Cognitive_event | 14 | Move |
| 51 | Earth_animal | 12 | Reporting_event |
| 48 | Physical_property | 10 | Micro-organism |
Verbs:
| Count | Template type |
| 419 | Cause_change_of_state |
| 157 | Purpose_act |
| 148 | Non_relational_act |
| 122 | Relational_act |
| 99 | Move |
| 70 | Cause_constitutive_change |
| 61 | Change_of_state |
| 58 | Speech_act |
| 51 | Cause_experience_event |
| 50 | Cause_change_location |
| 37 | Cognitive_event |
| 35 | Reporting_event |
| 32 | Give_knowledge |
| 32 | Cause_change_of_value |
| 30 | Cooperative_activity |
| 29 | Relational_state |
| 27 | State |
| 26 | Symbolic_creation |
| 26 | Transaction |
| 24 | Change_of_location |
| 22 | Change_of_possession |
| 23 | Psychological_event |
| 23 | Expressive_speech_act |
| 20 | Physical_creation |
| 16 | Mental_creation |
| 15 | Acquire_knowledge |
| 15 | Perception |
| 14 | Stative_location |
| 13 | Cause_motion |
| 13 | Change_of_value |
| 13 | Copy_creation |
| 11 | Aspectual |
| 10 | Directive_speech_act |
| 10 | Cause_aspectual |
| 10 | Cause_natural_transition |
| 10 | Cooperative_speech_act |
Adjectives:
| Count | Template type |
| 123 | Psychological_property |
| 109 | Physical_property |
| 48 | Extensional |
| 38 | Social_property |
| 27 | Object-related |
| 19 | Manner |
| 15 | Modal |
| 12 | Intensifying_property |
| 10 | Relational_property |
A sample lexicon that covers 100 fully described words is enclosed with the documentation. The morphological and syntactic information encoded in the Swedish PAROLE lexicon is enriched with the semantic information encoded in the Swe-S lexicon. Three entries from that sample are reprinted in Appendix B: blomma 'flower', kastanj 'chestnut' and liten 'little'.
Calzolari, N. 1999. SIMPLE: Harmonised Semantic Lexicons for the European Languages. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA
Gellerstam, M., 1988. Verb Syntax in a Dictionary for Second-Language Learning. In Computer-Aided Lexicology, Almqvist & Wiksell, Stockholm.
Guimier, E., Ogonowski, A., 1998. Report on the Syntactic layer, ftp://parole:yaooakgn@www.erli.fr
Holmes, Ph., Hinchliffe, I., 1994. Swedish, London: Routledge Grammar.
Järborg, J., 1989. Betydelseanalys och betydelsebeskrivning i Lexikalisk databas. Göteborg: Inst. för språkvetenskaplig databehandling
Järborg, J., 1990. Användning av SynTag. Göteborg: Inst. för språkvetenskaplig databehandling.
Nationalencyklopedins ordbok, 1995-96. Utarbetad vid Språkdata, Göteborgs universitet. Höganäs: Bra Böcker.
Svenska ord med uttal och förklaringar. 1992. Stockholm: Norstedts.
Lenci, A., F. Busa, N. Ruimy, E. Gola, M. Monachini, N. Calzolari, A. Zampolli, 1999. Linguistic Specifications, D2.1.
Malmgren, S., 1988. On Regular Polysemy in Swedish. In Studies in Computer-Aided Lexicography, Almqvist & Wiksell, Stockholm.
Pedersen, B.S. and Keson, B. (1999). SIMPLE - Semantic Information for Multifunctional Plurilingual Lexica: Some Examples of Danish Concrete Nouns. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA
Pustejovsky, J., 1995. The Generative Lexicon. Cambridge, MA: The MIT Press.
Toporowska Gronostaj, M. 1996. Integrerad valensbeskrivning. Mot ett formaliserat verbvalenslexikon. Göteborgs universitet: Inst. för svenska språket.
Appendix A
Annotating, Disambiguating & Automatically Extending
the Coverage of the Swedish SIMPLE Lexicon
Kokkinakis D., Toporowska Gronostaj M., Warmenius K.
Språkdata, Göteborg University
Box 200, SE-405 30,
Sweden
{svedk, svemt, svekws}@svenska.gu.se
Abstract
In recent years the development of high-quality lexical resources for real-world Natural Language Processing (NLP) applications has gained a lot of attention from many research groups around the world, and from the European Union through the promotion of language engineering projects dealing directly or indirectly with this topic. In this paper, we focus on ways to extend and enrich such a resource, namely the Swedish version of the SIMPLE lexicon, in an automatic manner. The SIMPLE project (Semantic Information for Multifunctional Plurilingual Lexica) aims at developing wide-coverage semantic lexicons for 12 European languages, though on a rather small scale for practical NLP, namely fewer than 10,000 entries. Consequently, our intention is to explore and exploit various (inexpensive) methods to progressively enrich the resources and, subsequently, to annotate texts with the semantic information encoded within the framework of SIMPLE, and enhanced with the semantic data from the Gothenburg Lexical DataBase (GLDB) and from large corpora.
In recent years there has been increased interest in acquiring high-quality semantic lexicons on a large scale: McKeown & Hatzivassiloglou (1993), Dorr & Jones (1996), Hearst & Schütze (1996), Tokunaga et al. (1997), Roventini et al. (1998), Viegas et al. (1998). The methodology behind these approaches is usually corpus-driven. It is based on the (re-)use of machine-readable resources of various types, and on the application of cost-effective ways to eliminate the acquisition bottleneck, such as derivational morphology, customization of off-the-shelf resources and statistical techniques. The approach adopted here for the extension task is in line with the methodologies mentioned.
In this paper, we focus on ways to extend and enrich the coverage of the Swedish semantic lexicon automatically, as far as possible, by taking into consideration compounding, a distinctive feature of the Swedish language, and semantic similarity in noun phrases of enumerative type. With the support of semantic data from the Swedish SIMPLE lexicon (Semantic Information for Multifunctional Plurilingual Lexica, LE4-8346), the Gothenburg Lexical DataBase (GLDB) and large corpora, both raw and exposed to shallow parsing, we enhance the incorporation of new semantic entries into the SIMPLE lexicon. We expect to be able to extend the 6,000 entries in the Swedish SIMPLE lexicon to over 120,000 entries. Our assumption is based on the results obtained from the tests carried out so far on input data of 1,000 entries, which became 25,000 (22,000 through compounding and 3,000 through semantic similarity).
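One simple way in which compounding can feed lexicon extension is to let a compound inherit the semantic type of its head, i.e. its final segment. The sketch below illustrates that general idea only; it is an assumption about how such a step could look, not the segmentation procedure described in this paper. The function name is hypothetical, the seed entries are taken from the ontology table of the documentation, and the compounds are illustrative inputs.

```python
# Seed lexicon: fully encoded simplex nouns -> semantic type.
ENCODED = {
    "bil": "Vehicle",
    "bok": "Semiotic_artifact",
    "hus": "Building",
}


def infer_from_head(compound, encoded):
    """Return (head, semantic type) for the longest encoded proper suffix.

    Returns None when no proper suffix of the word is an encoded noun.
    """
    # Start at 1 so that at least one character is left for the modifier.
    for start in range(1, len(compound)):
        head = compound[start:]
        if head in encoded:
            return head, encoded[head]
    return None


# ASSUMPTION: these compounds are illustrative inputs, not corpus data.
for word in ["polisbil", "telefonbok", "sjukhus", "bil"]:
    print(word, infer_from_head(word, ENCODED))
```

Plain suffix matching is crude: it would assign Building to sjukhus, which the documentation lists under Institution, so any type proposed in this way is best treated as a partial encoding that needs validation against corpus evidence.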
Furthermore, we semantically annotate texts with all the available material, and we apply Machine Learning techniques for the disambiguation of ambiguous readings. The annotation task provides an excellent opportunity to evaluate the usability of the semantic information encoded in SIMPLE.
This paper is organized as follows: first we give a brief presentation of the SIMPLE project and particularly of the Swedish lexicon; then we present how compounding and semantic similarity in enumerative phrases (under certain conditions) can contribute to the augmenting and enrichment of the lexicon, when subjected to compound segmentation and shallow parsing; we continue by describing a practical application of the semantic lexicon, namely semantic annotation and disambiguation; we then give some general remarks on the usability of the SIMPLE model, while conclusions end the presentation.
The EU-financed SIMPLE project aims at developing wide-coverage semantic lexicons for 12 European languages. The Swedish SIMPLE lexicon (hereafter Swe-S) is one of these. All lexicons share a common semantic model and a common encoding formalism in SGML. The semantic data in the SIMPLE lexicons is being linked to the morphological and syntactic data in their respective PAROLE lexicons, developed within the EU project PAROLE, (Preparatory Action for Linguistic Resources Organisation for Language Engineering). Out of the 20,000 words in the PAROLE lexicons, a subset of about 6,000 words, or approximately 10,000 senses, has been enriched with semantic descriptions in the SIMPLE counterpart. The content and the design of the SIMPLE model are documented in Lenci et al. (1998).
The notion of semantic type is central to the SIMPLE model and its ontology. It corresponds to a word sense assigned to a lexical item. There are 139 semantic types distinguished in the SIMPLE ontology. Each semantic type is defined as a cluster of structured semantic information significant for a given word sense. Information on semantic class, domain, argument structure of predicative expressions and selectional restrictions on arguments, as well as qualia roles, constitutes a relevant part of the semantic type specification (Calzolari (1999), Pedersen & Keson (1999)). The SIMPLE ontology is multidimensional, as it is based on the principle of orthogonal inheritance (Pustejovsky 1995), and in this respect it contrasts with LexiQuest's semantic class ontology, which is based on a standard, monodimensional approach. The latter ontology includes 95 semantic classes. Both ontologies are hierarchically structured.
The theoretical and formal design of the Swe-S lexicon conforms to the SIMPLE linguistic guidelines presented by the specification group, Lenci et al. (1998). In the Swe-S lexicon, there are about 10,000 semantic units (hereafter Usems) encoded, comprising 7,000 noun, 2,000 verb and 1,000 adjective Usems. These 10,000 units are mapped onto 6,000 entries. Usems are described with respect to the following information:
In the course of building the Swedish SIMPLE (and PAROLE) lexicons we have, to a large extent, reused lexical data from GLDB which is the most comprehensive source of lexical information on contemporary Swedish, and information from the SO (1992) and NEO (1996).
The Swe-S resources are not quantitatively sufficient for realistic, large-scale Natural Language Processing (NLP) tasks, such as semantic annotation, and need to be extended. For this particular task, we take advantage of the productive compounding characteristic of Swedish and the use of raw and partially parsed corpora.
We assume that a considerable number of casual compounds in Swedish, i.e. compounds created on the fly, can inherit relevant parts of the semantic information that the Swe-S lexicon provides for their heads and can thus be incorporated into the lexicon. By relevant parts, we mean in the first place the information concerning semantic type, domain and semantic class. To avoid errors, we exclude the information on argument structure from the inheritance, as the argument structure can undergo alternations in the process of compounding. This is the case when verbs and verbal nouns build compounds with either an obligatory or an optional argument in the non-head position. The occurrence of an adjunct in the non-head position does not usually alter the predicative structure.
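The inheritance rule described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Usem` record, its field names and the values given for klocka are hypothetical placeholders; only the rule itself (copy semantic type, class and domain from the head, never the argument structure) comes from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Usem:
    """Hypothetical, simplified stand-in for a Swe-S semantic unit."""
    lemma: str
    semantic_type: str
    semantic_class: str
    domain: str
    argument_structure: Optional[Tuple[str, ...]] = None


def inherit_from_head(compound: str, head: Usem) -> Usem:
    """Build an entry for a casual compound from the entry of its head.

    Semantic type, semantic class and domain are copied from the head.
    The argument structure is deliberately left empty, because it may
    change in the process of compounding.
    """
    return Usem(
        lemma=compound,
        semantic_type=head.semantic_type,
        semantic_class=head.semantic_class,
        domain=head.domain,
        argument_structure=None,
    )


# Placeholder values, invented for illustration only.
klocka = Usem("klocka", "Instrument", "Instrument", "General", ("arg0",))
digitalklocka = inherit_from_head("digitalklocka", klocka)
```

Keeping the argument structure empty rather than guessing it mirrors the cautious choice made in the text: a missing value can be filled in later, whereas a wrongly inherited one would have to be found and corrected.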
The fact that over 70%, or approximately 80,000, of all the entries in the SAOL (1998) are compound forms casts light not only on an immense lexical repository, which is available for this particular extension task, but also on the need to design effective tools and routines for compound segmentation, as new, casual compounds are created constantly in Swedish. Most of these casual compounds are relatively transparent, which implies that their meaning is a function of the meanings of their components, related to each other by an implied predicative functor. For instance, brödkniv (bröd[X] + kniv[Y], 'bread knife') implies 'Y for (cutting) X', and bärsaft (bär[X] + saft[Y], 'juice from berries') implies 'Y which contains X'. In Swedish, compounds are written as single orthographic units, and nouns are the most frequent modifiers occurring in non-head positions.
A combination of various heuristic methods is used for the extension. Compound segmentation is applied to compound noun tokens in large corpora, and lists of new nouns are produced. To ensure quality and compatibility with the rest of the data in the lexicon, further heuristics are applied to the content of the noun lists produced. To avoid the generation of incorrect data, these heuristics inspect the modifying component of a compound in order to determine its characteristics, such as its part-of-speech and semantic category (if any). These characteristics of the modifier, when combined with the corresponding characteristics of the compound head, provide data for a preliminary estimation of the correctness of the heuristics. A few examples will illustrate this point.
If the part-of-speech of the compound modifier is an adjective, a new Swe-S entry, which will not cause semantic anomalies in the derived lexical set, can be created with great confidence. The inheritance criterion applies here and the compounds are hyponyms of the head. For example, the lemma klocka bell/watch can be extended with compounds of type [ADJ-MODIFIER]+HEAD: [digital]klocka, [guld]klocka, [lill]klocka, [silver]klocka, [stor]klocka, where the adjectival modifying parts in these examples are digital, gold, little, silver and big. Similar results are obtained if the modifying part is a proper noun. For instance, anhängare supporter with modifiers such as Berisha, Hammarby, Hitler, Likud and Mobutu yields unambiguous compounds.
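The segmentation tool itself is not described in the text, so the following is only a naive sketch of the idea under stated assumptions: a compound is split into modifier + head by looking for the longest known head at the end of the word. The function name, the `min_part` threshold and the toy lexicon are hypothetical, and linking elements (such as a binding -s-) are not handled.

```python
def segment_compound(word, head_lexicon, min_part=2):
    """Split `word` into (modifier, head), or return None.

    Tries split points from left to right, so the first hit is the
    longest suffix of `word` that is a known head. Both parts must be
    at least `min_part` characters long.
    """
    for i in range(min_part, len(word) - min_part + 1):
        modifier, head = word[:i], word[i:]
        if head in head_lexicon:
            return modifier, head
    return None


# Toy head lexicon, invented for illustration.
heads = {"glas", "klocka", "feber"}

vinglas = segment_compound("vinglas", heads)
digitalklocka = segment_compound("digitalklocka", heads)
simplex = segment_compound("glas", heads)
```

The modifier returned by such a split is what the subsequent heuristics inspect for part-of-speech and semantic category.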
It is well known that the heuristics perform with varying success on different types of compounds, and that some simple constraints are needed to exclude segmentation and interpretation errors, particularly where the part-of-speech of the modifying part of a compound is a noun (e.g. NOUN-MODIFIER[kultur]fråga cultural question) or a verb (e.g. VERB-MODIFIER[betal]teve pay-TV). These constraints are implemented as subroutines which check derived compounds against different lists to eliminate incorrect data. The lists of bound morphemes and lexicalized compounds, extracted from the GLDB, allow such compounds to be excluded from the derived sets. Such constraints have proven to be a cheap way to automatically limit the overgeneration of new entries in the lexicons.
For instance, when using large corpora, over 40 compounds with feber fever as head could be extracted. However, it became evident that not all of them belong to the semantic class of Illness, e.g. resfeber excitement before a journey. Thus, in some cases, additional inspection seems unavoidable if we want to prevent the automatic incorporation of lexicalised compounds with idiomatic, metaphoric or metonymic meanings. This inspection can be performed automatically by simply checking whether a given compound is included as a separate entry in GLDB. If this is the case, it means that the compound is lexicalised and should not be subjected to automatic inheritance. Manual inspection is needed only if the derived compound shows diverging semantic and/or morphological patterns and the word is neither in a bound morpheme list nor in the lexicalised compound list.
Moreover, the content of the Swe-S has been used as a means of bootstrapping the process. For instance, glas glass can be extended with compounds having Substance as a modifier in the compound form. Consequently, the [NOUN-MODIFIER{Substance}]+HEAD compounds [vatten]glas, [vin]glas, [öl]glas, [likör]glas all have Substance as the modifier part, namely water, wine, beer and liqueur.
A large number of already disambiguated compounds has also been extracted from GLDB, since the Swe-S entries are linked to the various senses and sub-senses in GLDB, and subsequently to the morphological examples of every entry (alias compounds). For instance, Swe-S encodes the non-compound lemma ämne as having four senses (marked with 1/1-1/4), which are disambiguated here by means of their assignment to the following semantic types and semantic classes:
Each of these senses is exemplified in GLDB with a number of compounds, 26 compounds in total with ämne as the head. Some of these are listed in the right column of table (1). Since there is only one compound with that head in the Swe-S lexicon (grundämne element), incorporating new, disambiguated compounds was straightforward.
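The filtering logic described above can be sketched as a small triage function. It is an illustrative reading of the text, not the authors' subroutines: the function name, the return labels, the assumption that the bound-morpheme list is checked against the modifier, and the toy lists are all hypothetical.

```python
def triage_compound(compound, modifier, gldb_entries, bound_morphemes,
                    diverging=False):
    """Decide what to do with a derived compound.

    - A compound that is a separate GLDB entry counts as lexicalised
      and is excluded from automatic inheritance.
    - A compound whose modifier is on the bound-morpheme list is
      excluded as well (assumed reading of the list check).
    - Otherwise the compound inherits from its head, unless it has been
      flagged as showing diverging patterns, in which case it is sent
      to manual inspection.
    """
    if compound in gldb_entries:
        return "exclude: lexicalised"
    if modifier in bound_morphemes:
        return "exclude: bound morpheme"
    if diverging:
        return "manual inspection"
    return "inherit"


# Toy lists, invented for illustration.
gldb_entries = {"resfeber"}
bound_morphemes = {"jätte"}

decision_resfeber = triage_compound("resfeber", "res", gldb_entries, bound_morphemes)
decision_new = triage_compound("tropikfeber", "tropik", gldb_entries, bound_morphemes)
```

Ordering the checks this way keeps manual work to the residue that neither list can settle, which is the cost argument made in the text.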
Swe-S | GLDB |
ämne:1/1:MATTER | färgämne:1/1 hornämne:1/1 |
ämne:1/2:SUBSTANCE | yxämne:1/2 fruktämne:1/2 |
ämne:1/3:ABSTRACT | predikoämne:1/3 uppsatsämne:1/3 |
ämne:1/4:NOTION | läroämne:1/4 skolämne:1/4 |
grundämne:1/1:MATTER | |
Table 1: ämne in Swe-S, and GLDB compounds with ämne as head.
So far we have addressed the problem of the acquisition of compound nouns based on the content of the Swe-S lexicon, by applying heuristics, filters, and manual inspection, in some cases, in order to guarantee consistency. But how can we cope with the rest of the vocabulary?
Wilson and Thomas (1997:55-57) argue that one of the conditions that a semantic system should satisfy is that it should be able to account exhaustively for the whole vocabulary in the corpus, not just for a part of it. We have experimented with a corpus-based approach, using a cascaded finite-state syntactic parser (CASS-SWE), based on work done by Kokkinakis & Johansson Kokkinakis (1999), which seems a plausible way of progressively enriching the Swedish semantic resources.
An advantage of CASS-SWE is its ability to identify noun phrases with high accuracy, a property that we consider crucial for aiding the "discovery" of new semantic entries. Essentially the approach, which has similarities to naive clustering, is as follows. Gather large corpora (here 13 million tokens), part-of-speech tag them, and then parse them with CASS-SWE (the parser uses part-of-speech annotated input); from the resulting forest of analyzed chunks we select long noun phrases, namely those containing three or more common nouns. Finally, the overlap between the nouns in the NPs produced and the entries in Swe-S is measured. If at least two of the nouns (a figure arbitrarily chosen) are also entries in the Swe-S, with the same semantic class, then there is a strong indication that the rest of the nouns are co-hyponyms, and thus semantically similar to the two already encoded in Swe-S. Accordingly, we take advantage of the transitivity aspect of hyponymy, and of the fact that two lexical items X and Y are co-hyponyms if: (i) they are disjuncts and therefore complementary; and (ii) have a common superordinate, e.g. animal is superordinate of cat, dog, horse and camel, cf. Sanfilippo et al. (1999).
Similarity plays an important role in word acquisition, and preliminary results have shown that the simple overlap works fairly well for the majority of the cases examined. However, the noise which is produced can be eliminated if the semantic tags of all the words in a phrase are compared. Caution should be taken in cases where different semantic classes are involved in an enumerative NP, e.g.:
The unclassified husdjur in the first example, should not be assigned to a class Bio since there is another class involved in the same NP, namely Furniture. Similarly, no action should be taken in the second example, since two semantically ambiguous words with distinct classes are involved.
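The overlap heuristic and the two cautions above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the lexicon layout (noun mapped to a set of semantic classes) and the toy entries are hypothetical, and the treatment of ambiguous nouns is a conservative reading of the text (any ambiguous known noun blocks the assignment).

```python
def propose_classes(np_nouns, lexicon, min_known=2, min_nouns=3):
    """Propose a semantic class for the unknown nouns of an enumerative NP.

    `lexicon` maps a noun to the set of semantic classes it has in the
    (toy) lexicon. A proposal is made only if the NP has at least
    `min_nouns` common nouns, at least `min_known` of them are already
    encoded, every known noun is unambiguous, and all known nouns share
    one and the same class. Otherwise an empty dict is returned.
    """
    if len(np_nouns) < min_nouns:
        return {}
    known = [lexicon[n] for n in np_nouns if n in lexicon]
    unknown = [n for n in np_nouns if n not in lexicon]
    if len(known) < min_known or not unknown:
        return {}
    if any(len(classes) != 1 for classes in known):
        return {}  # an ambiguous word is involved: no action
    shared = set().union(*known)
    if len(shared) != 1:
        return {}  # different classes in the same NP: no action
    (semantic_class,) = shared
    return {noun: semantic_class for noun in unknown}


# Toy lexicon, invented for illustration.
toy_lexicon = {
    "katt": {"Animal"},
    "hund": {"Animal"},
    "stol": {"Furniture"},
}

clean_np = propose_classes(["katt", "hund", "kamel"], toy_lexicon)
mixed_np = propose_classes(["katt", "hund", "stol", "husdjur"], toy_lexicon)
```

Comparing the tags of all known words in the phrase, rather than stopping at the first two that agree, is what removes the kind of noise discussed above.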
The best results were achieved for the semantic classes: Phenomena (Illness and Psychological-Feature), Occupation, Animal and Human (Bio, Ethnos and Occupation-Agent). Some examples of the last mentioned class are given below; these are NPs taken from the parsed corpus. In these examples, (*) marks an original Swe-S entry, (+) marks an entry incorporated through the compound analysis, (N) marks a completely new entry and (?) marks errors:
Using the previously described heuristics and observations, the relatively limited inventory of semantic information in Swe-S, has been extended to a large semantic resource, appropriate for a large number of intermediate NLP tasks, i.e. simpler processes which are carried out to help final tasks.
Regarding the use of the compounds for extending the entries, an estimated average of 20-25 compounds per Swe-S entry has been extracted by combining information from large corpora and the GLDB. Thus, by using only 1,000 nouns we could increase the total vocabulary size to over 22,000 semantic entries. For some entries, having both concrete and abstract senses, the number of compounds extracted from large corpora runs into several hundred. Table (2) shows the top-10 non-compound entries richest in compound variants.
Swe-S Entry | Occ. |
program programme, program | 469 |
arbete work, employment | 402 |
chef chief | 390 |
bok book | 357 |
verksamhet activity, operation | 299 |
skola school | 275 |
man man | 273 |
rum room, space | 244 |
kort card, photo | 231 |
bolag company | 217 |
Table 2: Swe-S entries richest in compound variants
Turning now to the shallow parsing of the 13-million-token corpus, over 15,600 NPs with the content we were interested in, namely three or more common nouns, could be extracted. Approximately 3,000 new noun entries to the Swe-S could be identified without any further processing (bootstrapping the compound analysis). However, as mentioned in the previous section, some noise was produced, and for this reason we do not use these new nouns for the semantic annotation discussed in the next section until we find more reliable ways to eliminate the limited number of errors produced.
Semantic tagging is appealing since it is believed to contribute to improving the performance and robustness of NLP systems, cf. Resnik & Yarowsky (1997). The appropriate content from the core Swe-S, i.e. "semantic class", "domain" and "template type" information, has been extracted and implemented as finite-state machines suitable for semantic tagging, i.e. the task of assigning semantic categories or clusters of semantically related concepts to words. These machines are then applied sequentially to lemmatized textual data, resulting in all possible annotations for the tokens matched.
Testing was performed using 1,800 nouns from the Swe-S, of which approximately 150 could be ambiguous, in the sense that more than one semantic label (class, domain and template) could be associated with a single token. For instance, the Swedish noun administration administration is semantically classified for four different semantic classes: Agency, Functional-Space, Human and Operation, while the noun affär shop, business, affair is classified for: Functional-Space, Operation, State and Event.
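The tagging step, which returns every possible annotation for a matched token, can be illustrated with a plain dictionary lookup. Note that this swaps the finite-state machines mentioned in the text for a much simpler mechanism; the function name and data layout are hypothetical, while the classes listed for administration and affär are those given above.

```python
def tag_lemmas(lemmas, class_lexicon):
    """Attach all semantic classes known for each lemma.

    Returns a list of (lemma, classes) pairs. Lemmas that are not in
    the lexicon get an empty list; ambiguous lemmas keep all their
    classes, to be resolved by a later disambiguation step.
    """
    return [(lemma, sorted(class_lexicon.get(lemma, ()))) for lemma in lemmas]


class_lexicon = {
    "administration": {"Agency", "Functional-Space", "Human", "Operation"},
    "affär": {"Functional-Space", "Operation", "State", "Event"},
}

tagged = tag_lemmas(["en", "ny", "affär"], class_lexicon)
ambiguous = [lemma for lemma, classes in tagged if len(classes) > 1]
```

Tokens that end up with more than one class, such as affär here, are exactly the cases handed over to the disambiguation component.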
We adopted Machine Learning (ML), particularly Memory Based Learning, for the disambiguation of the semantic annotation of text samples.
Memory-Based Learning (MBL) is a supervised, inductive, classification-based method originating from the field of machine learning (ML), Mitchell (1997). MBL has several practical advantages, such as: (i) it has produced state-of-the-art results in many natural language ambiguity problems (cf. Cardie & Mooney (1999)); (ii) the MBL method is not sensitive to sparse or low-frequency data, as low-frequency cases are not discarded but are kept in memory, hence useful information can also be extrapolated from them; and (iii) learning is fast and incremental: new instances can be added to the memory, improving the performance of the system. The software used for the experiments with the Swedish data has been developed at the University of Tilburg, by Daelemans et al. (1999).
MBL is closely based on the assumption that "performance in cognitive tasks is based on reasoning on the basis of similarity of new situations to stored representations of earlier experiences", Daelemans et al. (1999). An MBL system consists of two components: a learning component, which is memory-based, adding training instances to memory, and a performance component, in which the product of the learning component is used for performing the classification of the input.
It is rather difficult to give an exact number of examples required for an adequate description of noun senses. Intelligent example selection for supervised learning is an important issue in ML, an issue that we have not explored. However, from the (human) lexicographical point of view, an experienced scholar would need, roughly, a hundred arbitrarily chosen excerpts for each word in order to cover the majority of sense distinctions (Jerker Järborg personal communication). For a machine, that figure should be higher, although we have not empirically tested the validity of this statement.
We have automatically created a large set of non-lemmatized training data, taken from concordances, and then manually classified the training instances. The deliberate choice of non-lemmatized material should be emphasized here, as our experiments showed that noun morphology supports sense disambiguation, both for compound and non-compound forms in Swedish.
For instance, plural forms of finska or tyska, Finnish, German, refer almost exclusively to Ethnos (denoting a person) while the base form is ambiguous between Ethnos and Abstract (denoting either the person or language). Likewise, plural forms of begåvning talented (person), talent refer almost exclusively to Situ, while its base form refers to Psychological-Feature.
For the training and test instances we organized the near context of the ambiguous semantic entries into fixed-length vectors of n symbolic feature-value pairs (in the experiments in this paper n=12), which consist of the left and right context of the word under investigation, its part-of-speech and its byte-offset in the discourse, and a field containing the classification of that particular feature-value vector. Unknown features are marked with a question mark (?), while long context is truncated. Moreover, we took advantage of the syntactic examples in GLDB, given for almost every lemma in the database, and in this way we could complement the training material automatically with already classified training instances. This last point can be illustrated by two syntactic examples provided by the GLDB for the noun medicin medicine. Since these are already disambiguated, as indicated by their sense number, they can be directly mapped onto the respective Swe-S semantic classes for that particular word:
GLDB: medicin:1:studera medicin
MBL: byte-offs noun ? ? ? studera medicin ? ? ? ? Occupation
medicine:sense1:study medicine
GLDB: medicin:2:skriva ut recept på en bra medicin
MBL: byte-offs noun recept på en bra medicin ? ? ? ? Substance
medicine:sense2:write a prescription on a good medicine
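The construction of such instances can be sketched as below. The layout (byte-offset, part-of-speech, four words of left context, the focus word, four words of right context, class) is inferred from the two MBL lines above and is an assumption, as are the function name and its parameters.

```python
def make_instance(tokens, focus_index, pos, offset, label, window=4):
    """Build a fixed-length instance around tokens[focus_index].

    Context shorter than `window` is padded with "?", longer context is
    truncated. With window=4 the result has 12 fields:
    offset, pos, 4 left, focus, 4 right, label.
    """
    left = tokens[max(0, focus_index - window):focus_index]
    right = tokens[focus_index + 1:focus_index + 1 + window]
    left = ["?"] * (window - len(left)) + left
    right = right + ["?"] * (window - len(right))
    return [str(offset), pos] + left + [tokens[focus_index]] + right + [label]


# The two GLDB examples for medicin; "byte-offs" stands in for a real offset.
first = make_instance("studera medicin".split(), 1, "noun", "byte-offs", "Occupation")
second = make_instance(
    "skriva ut recept på en bra medicin".split(), 6, "noun", "byte-offs", "Substance"
)
```

Joining either result with spaces gives back the corresponding MBL line shown above, including the truncation of skriva ut in the second example.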
During classification an unseen example X, a test instance, is presented to the system and a distance metric D between X and each instance Y in the memory is calculated, D(X,Y). Various implemented algorithms (variants of the k-nearest neighbour algorithm) try to find the nearest training instance for X and return its class as the prediction for the class of the test instance.
At present, the standard for the evaluation of sense disambiguation algorithms is the "exact match" (or accuracy) criterion. Specifically for ML, a system must perform significantly better than the most-frequent-semantic-class classifier to be worthy of serious consideration.
Table (3) summarizes the results for a few of the ambiguous cases examined. In every case we try to improve on the baseline for the semantic entry we want to disambiguate. Here by baseline is meant the most frequent class attached to an ambiguous token in the test sample.
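As a rough illustration of the classification step, the sketch below implements a plain k-nearest-neighbour classifier with an unweighted overlap distance (the number of mismatching feature values). This is a simplification chosen for clarity and is not necessarily the metric or configuration used in the experiments; the function names and the tiny memory are hypothetical.

```python
from collections import Counter


def overlap_distance(x, y):
    """D(X, Y): number of feature positions where X and Y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def classify(instance, memory, k=1):
    """Predict the class of `instance` from the k nearest stored instances.

    `memory` is a list of (features, label) pairs. Ties in distance are
    resolved by storage order, since Python's sort is stable.
    """
    ranked = sorted(memory, key=lambda item: overlap_distance(instance, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy memory with three features: left word, focus word, right word.
memory = [
    (("studera", "medicin", "?"), "Occupation"),
    (("bra", "medicin", "?"), "Substance"),
]

prediction = classify(("bra", "medicin", "mot"), memory)
```

Adding a new training instance is just an append to `memory`, which reflects the incremental-learning property mentioned earlier.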
Our experiments using the MBL approach returned 84.8% correct disambiguation, tested on 25 ambiguous entries (with 20-25 test instances in each case), with an average baseline of 69.4%.
Swe-S | (Swe-S) Class | Tr. Data | Baseline | Acc. |
administration administration | Agency-39 Human-18 Operation-81 Functional-Space-5 | 143 | 56.6% | 76% |
affär shop, affair, business | Event-102 Operation-175 State-2 Functional-Space-83 | 362 | 48.3% | 92% |
danska Danish | Ethnos-258 abstract-32 | 290 | 88.9% | 88% |
klyfta segment, cleft, rift | Phenomenon-67 Form-33 Alternation-3 | 119 | 56.3% | 88% |
medicin medicine | Substance-380 Occupation-103 Substance-Occup.-6 | 489 | 77.7% | 72% |
område area, zone, field | Location-168 abstract-151 | 319 | 52.6% | 84% |
vatten water | Substance-396 Substance-Loc.-110 Location-61 | 567 | 69.8% | 100% |
teater theatre, play-acting | Abstract-105 Agency-103 Function.-Space-102 Human-12 Activity-7 | 329 | 31.9% | 85% |
Table 3: Data used by MBL for semantic disambiguation
(Tr. Data: amount of training data, Acc.: accuracy based on the MBL approach)
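A most-frequent-class baseline can be computed as in the sketch below. The function name is hypothetical. The text defines the baseline over the test sample; the example instead uses the class counts listed for administration in Table 3, which sum to the 143 training instances and happen to reproduce the 56.6% shown in the table.

```python
def most_frequent_baseline(class_counts):
    """Return (label, share) for the most frequent class.

    `class_counts` maps each semantic class to its number of instances.
    The share is the accuracy obtained by always predicting that class.
    """
    total = sum(class_counts.values())
    label, count = max(class_counts.items(), key=lambda kv: kv[1])
    return label, count / total


administration = {
    "Agency": 39,
    "Human": 18,
    "Operation": 81,
    "Functional-Space": 5,
}

label, share = most_frequent_baseline(administration)
print(label, f"{share:.1%}")  # prints: Operation 56.6%
```

A disambiguation result is only interesting to the extent that it exceeds this figure, which is why each accuracy value in Table 3 is reported next to its baseline.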
In this section, we reflect on the usability of the SIMPLE model for different NLP tasks which require access to semantic information. Many NLP applications can be actively supported by the SIMPLE lexicon, which offers multiple access points to the semantic data. The 10,000 word senses can be accessed either directly, or by means of selective information searches starting with the 139 ontological categories provided by the SIMPLE ontology, the 95 semantic class categories and the 364 domain specifications. Since the first two capture somewhat different aspects of word meaning for a number of cases, the double ontological specifications not only provide more precise information, but also increase the granularity of semantic description.
The ontological information cluster can be extended with information on domains. The domain information, indispensable for text-recognition tasks, can support disambiguation of senses with identical ontological clusters. For example, the word grad degree, grade has nine senses assigned, and four of these denote different units of measurement representative of domains such as Geometry, Earth-Sciences, Typography and Meteorology. Since those four display identical ontological categorization, the domain information supports disambiguation in a relevant way. In consequence, a tripartite cluster including both ontological and domain information seems to be preferred. The explicit specification of domain information in the SIMPLE lexicon makes it possible to generate domain-based sublexicons, which are basic for text-recognition tasks.
The attempt to harmonize the encoding of data makes it possible to multilink the SIMPLE lexicons for different languages, which is essential for building the lexicon modules for machine-aided translation.
Since the content of the Swe-S lexicon is linked to the GLDB database, the information exchange can proceed in two directions, which promotes development of both resources. These two resources describe and formalize lexical information concerning a word's morphology, syntax and semantics, which is a prerequisite for advanced NLP tasks. As was already hinted, the SIMPLE project has aimed at harmonization of lexical resources by using a common lexicon model and formalism for 12 EU languages. This initiative has opened new prospects for further developments within the language engineering field.
This paper has discussed means to automatically extend the lexical inventory of the Swe-S semantic lexicon, by profiting from the productive compounding characteristic for Swedish, the semantic similarity in the enumerative noun phrases, by accessing corpora both in raw and parsed form, and the morphological, syntactic and semantic content of GLDB. Using a combination of all the available data, a relatively limited inventory of semantic information, such as the Swe-S, can be extended to a large semantic resource appropriate for a large number of intermediate NLP tasks. Moreover, its compatibility with the manually developed Swe-S lexicon, can be guaranteed and its high quality maintained, as we applied heuristics that do not try to overproduce semantically anomalous entries. We have also used the Swe-S resource for semantic annotation of texts, while for the disambiguation, we employed Machine Learning techniques, supported by manually created large portions of training data for a small number of ambiguous semantic entries. Work within the SIMPLE project was still in progress when writing this paper, so a future task would be to extend the rest of the material using the same methodology, and even to devise better ways to eliminate the noise produced by the syntactic parsing. Reliable extraction of similar words from text corpora opens up many exciting opportunities for further linguistic analysis.
We thank three anonymous reviewers for some useful comments on a previous draft. The first author is also indebted to the "Birgit & Gad Rausings" foundation for providing financial support for the participation at the conference.
Blåberg, O. (1988). A Study of Swedish Compounds. Report 29, General Linguistics, Umeå university, Sweden
Calzolari, N. (1999). SIMPLE: Harmonised Semantic Lexicons for the European Languages. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA
Cardie, C. and Mooney, R.J. (1999). Guest Editors' Introduction: Machine Learning and Natural Language. In Journal of Machine Learning, Special Issue on Natural Language Learning, Vol. 34, pp. 1-5, Kluwer AP
Daelemans, W., Zavrel, J., van der Sloot, K. and van den Bosch, A. (1999). TiMBL: Tilburg Memory Based Learner, version 2.0, Reference Guide. ILK Technical Report 99-01. Paper available from: http://ilk.kub.nl/~ilk/papers/ilk9901.ps.gz
Dorr, B. and Jones, D. (1996). Acquisition of Semantic Lexicons: Using Word Sense Disambiguation to Improve Precision. In Proceedings of the SIGLEX Workshop "Breadth and Depth of Semantic Lexicons", pp. 42-50, Santa Cruz, California, USA
Hearst, M.A. and Schütze, H. (1996). Customizing a Lexicon to Better Suit a Computational Task. In Corpus Processing for Lexical Acquisition, pp. 77-94, Boguraev B. and Pustejovsky J. (eds.). MIT Press
Kokkinakis, D. and Johansson Kokkinakis, S. (1999). A Cascaded Finite-State Parser for Syntactic Analysis of Swedish. In Proceedings of the 9th EACL, pp. 245-248, Bergen, Norway. Paper available from: http://svenska.gu.se/~svedk/publics/eaclKokk.ps
Lenci, A. et al., (1998). SIMPLE WP2, Linguistics Specifications. Deliverable 2.1, Pisa
McKeown, K. and Hatzivassiloglou, V. (1993). Augmenting Lexicons Automatically: Clustering Semantically Related Adjectives. In Proceedings of the ARPA HLT Workshop, pp. 272-277, Princeton, NJ
Mitchell, T. M. (1997). Machine Learning. Series on Computer Science, McGraw-Hill
NEO, (1996). Nationalencyklopedins ordbok. Volumes 1-3, Språkdata & Bra Böcker AB
Pedersen, B.S. and Keson, B. (1999). SIMPLE - Semantic Information for Multifunctional Plurilingual Lexica: Some Examples of Danish Concrete Nouns. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA
Resnik P. and Yarowsky D. (1997). A Perspective on Word Sense Disambiguation, Methods and their Evaluation. In Proceedings of the Workshop: "Tagging Text with Lexical Semantics. Why, What and How?", pp. 79-86, Washington D.C., USA
Roventini, A., Peters, C., Calzolari, N. and Bertagna, F. (1998). Building a Semantic Network for Italian Using Existing Lexical Resources. In Proceedings of the 1st LREC, Vol. 1, pp. 377-383, Granada, Spain
Sanfilippo, A. et al. (1999). Preliminary Recommendations on Lexical Semantic Encoding. EAGLES LE3-4244, Draft version
SAOL, (1998). Svenska Akademiens Ordlista över Svenska Språket (The Swedish Academy Word-List). Norstedts Förlag & Svenska Akademien
SO, (1992). Svenska Ord. Statens Skolverk, Norstedts Förlag
Takunaga, T., Fujii, A, Iwayama, M., Sakurai, N. and Tanaka, H. (1997). Extending a Thesaurus by Classifying Words. In Proceedings of the Workshop: "Automatic Information Extraction and Building of Lexical Semantic Resources", Vossen P., Adriaens G., Calzolari N., Sanfilippo A. and Wilks Y. (eds), pp. 16-21, Madrid, Spain
Viegas, E., Ruelas, A., Beale, S. and Nirenburg, S. (1998). Extending a Core lexicon Using On-Line Language Resources with Savoir-Faire. In Proceedings of the 1st LREC, Vol. 1, pp. 97-104, Granada, Spain
Three entries taken from the 100-entry sample of the Swedish SIMPLE lexicon are reprinted below. These are blomma 'to flower', kastanj 'chestnut' and liten 'little'.