SIMPLE LE4-8346


SWEDISH SIMPLE — LEXICON DOCUMENTATION



* * *



Document first version date: 28/04/00
Document date: 12/05/00
Document ID: WP1 Swedish Simple-Lexicon Documentation
Version: 02
Doc. type: D_03_WP03.12_2_doc.htm
Document status:
Validation type:
Comments:

        Name                          Organisation    Purpose
From    Maria Toporowska Gronostaj    GOT             documentation
        Karin Warmenius
To                                                    information

Introduction

The EU project SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) is a follow-up to PAROLE (Preparatory Action for Linguistic Resources Organisation for Language Engineering). It aims at the development of wide-coverage semantic lexicons for 12 European languages. The SIMPLE lexicons add a semantic layer to the morphological and syntactic layers encoded in the PAROLE lexicons. All the lexicons share the same formal representation model, an Entity/Relationship model, and the same DTD (Document Type Definition), implemented in SGML. The theoretical and formal model of SIMPLE is documented in the Linguistic Guidelines (Lenci et al. 1998), and prior knowledge of the guidelines is presupposed here.

Språkdata, Göteborg University, has participated in both projects. The task of building the PAROLE and SIMPLE lexicons for Swedish has been supported to a large extent by the lexical resources developed at the Department of Swedish Language, Språkdata, Göteborg University. The machine-readable monolingual dictionaries Svenska ord (1992) and Nationalencyklopedins ordbok (1995), together with the lexical database Göteborgs lexikaliska databas (GLDB), provided a considerable amount of lexical data on morphology, syntax and semantics. GLDB is a monolingual lexical database with a focus on modern Swedish. It is the most exhaustive source of information on modern Swedish. The PAROLE corpora and the corpora available in the Swedish Language Bank, Språkbanken, have also been consulted frequently. Moreover, the lexicographic and lexicological research carried out at Språkdata has proved helpful with regard to questions concerning morphological, syntactic and semantic information in the PAROLE and SIMPLE lexicons (Gellerstam 1988, Toporowska Gronostaj 1996, Järborg 1989, Malmgren 1988).

To provide some background for this documentation, we start with a short overview of the conceptual model underlying the SIMPLE project, followed by some introductory notes on the Swedish PAROLE lexicon. The core section of this report documents the Swedish SIMPLE lexicon (hereafter Swe-S). We conclude the core section by presenting ways to extend the content of Swe-S automatically and by showing its usability for different NLP applications. Appendix A reprints a paper on this subject that will be presented at the LREC 2000 conference in Athens (Kokkinakis et al. 2000). Appendix B provides some examples from the PAROLE and SIMPLE lexicons for Swedish, to give an insight into the description modes and format used in these lexicons for encoding the morphology, syntax and semantics of nouns, verbs and adjectives.

0.1 The SIMPLE project

The notion of semantic type is central to the SIMPLE model and its ontology. It corresponds roughly to a word sense assigned to a lexical item. There are 139 semantic types distinguished in the SIMPLE ontology. Each semantic type is defined as a cluster of structured semantic information significant for a given word sense. Information on semantic class, domain, argument structure of predicative expressions, selectional restrictions on arguments and qualia roles constitutes a relevant part of the semantic type specification (Lenci et al. 1998, Calzolari 1999, Pedersen & Keson 1999). The SIMPLE ontology is multidimensional, as it is based on the principle of orthogonal inheritance (Pustejovsky 1995), and in this respect it contrasts with LexiQuest's semantic class ontology, which is based on a standard, monodimensional approach. The latter ontology includes 95 semantic classes. Both ontologies are hierarchically structured, and both were designed for the semantic typing of verbs and nouns. For the semantic typing of adjectives, an ontology consisting of 14 semantic types has been used.

0.2 The Swedish PAROLE lexicon

The Swedish PAROLE lexicon is a language engineering resource which provides access to the lexically determined morphological and syntactic information on 20.000 Swedish words. It follows the conceptual and formal model formulated within the PAROLE project (Guimier, Ogonowski, 1998). A subset of the entries in the Swedish PAROLE lexicon, that is 5.621 words, has been enriched with the semantic information encoded in Swe-S.

The encoding of morphosyntactic data for Swedish has followed the model presented in the PAROLE report (WP-4.2.2b), Morphosyntactic Description of Swedish (Danielsson & Järborg 1996). In the Swedish PAROLE morphological lexicon there are 19.985 morphological units, described according to 259 morphological modes. The morphological modes capture the types of inflectional relations a morphological unit can display. Their descriptions vary according to the part of speech of a morphological unit and the morphological properties associated with its morphosyntactic type. To account for these, a technical stem and a set of relevant affixes have been postulated for each morphological mode.
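The idea of a morphological mode as a technical stem combined with a set of affixes can be sketched as follows. This is only an illustration: the class name, the feature labels and the affix set are assumptions made for the example and are not the encoding used in the PAROLE lexicon.

```python
from dataclasses import dataclass, field


@dataclass
class MorphologicalMode:
    """Hypothetical inflection pattern: a name plus the affixes it licenses."""
    name: str
    # Maps a feature label (illustrative) to the suffix added to the technical stem.
    affixes: dict = field(default_factory=dict)


def inflect(stem: str, mode: MorphologicalMode) -> dict:
    """Attach every affix of the mode to the technical stem."""
    return {features: stem + suffix for features, suffix in mode.affixes.items()}


# Illustrative mode for a common-gender noun such as 'krokus' (assumed, not from the lexicon).
NOUN_EN_AR = MorphologicalMode(
    name="noun_en_ar",
    affixes={
        "sg.indef": "",
        "sg.def": "en",
        "pl.indef": "ar",
        "pl.def": "arna",
    },
)

forms = inflect("krokus", NOUN_EN_AR)
for features, form in forms.items():
    print(features, form)
```

Because one mode is shared by all units that inflect alike, a lexicon of this kind only needs to store the stem and a pointer to the mode for each morphological unit.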

A syntactic unit (Usyn) is the basic access point to the syntactic layer. In the Swedish PAROLE syntactic lexicon, there are 28.882 syntactic units linked to 19.985 morphological units. The syntactic behaviour of these units is described by means of 438 syntactic descriptions. In the Swedish syntactic lexicon the following aspects of lexically determined syntactic behaviour have been accounted for:

This lexically determined information has been considered criterial for distinguishing different types of subcategorization frames in Swedish or, in other words, for postulating the 438 Description objects.


1 General design information

1.1 Lexicon population

The population of the Swe-S lexicon consists of two sets. The first consists of 8.571 fully encoded lexicon items, which have been encoded manually. It covers nouns, verbs and adjectives. The second consists of 25.000 partially encoded nouns, which have been automatically extracted from corpora, given 1.800 fully encoded words from Swe-S as input. This set covers 22.000 compound nouns and 3.000 simplex nouns. For details on the methodology used for extending the coverage of the Swe-S lexicon, see Appendix A.

Overall statistics covering the fully encoded Semus in Swe-S

Number of Semus linked to Usyns and Ums: 8.571

Semus per category:

                  SemUs    SynUs    MuSs
     Noun         6.177    4.799    3.624
     Verb         1.943    1.628    1.546
     Adjective      451      253      127

While populating the Swe-S lexicon manually, preference was given to those lexical items which were considered Swedish equivalents of the base concepts. The base concept set, common to all SIMPLE partners, is a selection of concepts derived from EuroWordNet by means of two operational criteria, namely the number of relations (general or limited to hyponymy) and the high position of the concept in a hyperonym-hyponym hierarchy. Frequency was another significant factor taken into consideration.

As far as the population of the sample lexicon for Swedish is concerned, the selection of items included there aimed at:

  1. presenting a representative sample of semantic types encoded in the Swe-S lexicon,
  2. exhibiting semantic cohesion within the lexicon (e.g. blomma 'to bloom', krokus 'crocus'),
  3. linking the syntactic and semantic information in the Swedish PAROLE and SIMPLE lexicons,
  4. showing the distribution of nouns, verbs and adjectives among the semantic types.

1.2 Current lexicon contents

The fully encoded semantic data presented in the Swedish SIMPLE lexicon provide the following information:


2. Semantic encoding

2.0 Links from the Swe-S lexicon to GLDB

In the SIMPLE model, a set of templates provides guidelines concerning sense assignment, as each template is a structured set of semantic information which highlights relevant aspects of word meaning.

GLDB's approach to word sense discrimination is a relational one, which means that sense readings are internally structured and that a kernel sense and its subsenses are related to each other by a set of lexical transfer rules. The lexical information conveyed by the hierarchical microstructure of a lexicon entry in GLDB is mapped onto the SIMPLE ontology and the semantic information pertaining to a semantic type. In consequence, links between these two resources are established and the lexical data can be freely accessed. In the Swe-S lexicon, the links between the two systems are shown by means of the marking convention used for SemUs and glosses. While SemUs are numbered consecutively, GLDB's lemma-lexeme-subsense numbers point to the readings in GLDB and thus give access to all the syntactic and semantic information linked to a reading there.
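The marking convention can be made concrete with a small parser for attribute values such as "GLDB:trea1/1/4:...", as they appear in the example and freedefinition attributes of the entries quoted in this document. This is a sketch only: the reading of the three slash-separated fields as lemma, lexeme and subsense follows the description above, while the function and field names are assumptions introduced for the example.

```python
import re
from typing import NamedTuple


class GldbRef(NamedTuple):
    """A GLDB pointer as found in Swe-S attribute values (field names are illustrative)."""
    lemma: str      # lemma with its number, e.g. "trea1"
    lexeme: int     # lexeme number within the lemma
    subsense: int   # subsense number, 0 for the kernel sense
    text: str       # the example or definition text that follows the pointer


_GLDB = re.compile(
    r"^GLDB:(?P<lemma>[^/:]+)/(?P<lexeme>\d+)/(?P<subsense>\d+):(?P<text>.*)$",
    re.S,
)


def parse_gldb(value: str) -> GldbRef:
    """Split a 'GLDB:lemma/lexeme/subsense:text' value into its parts."""
    match = _GLDB.match(value)
    if match is None:
        raise ValueError(f"not a GLDB reference: {value!r}")
    return GldbRef(
        match["lemma"],
        int(match["lexeme"]),
        int(match["subsense"]),
        match["text"],
    )


ref = parse_gldb("GLDB:trea1/1/4:tredje årskursen::t. ex. i grundskolan el. gymnasiet")
print(ref.lemma, ref.lexeme, ref.subsense)
```

The text part is kept verbatim, including the "::" separators, since their internal structure is not described here.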

2.1 Criteria for syntax-semantic linking

Word sense discrimination has been assumed to be a point of departure for the linking of syntactic and semantic layers, as it is the word meaning that determines its syntactic behavior, argument structure, argument type (true, default or shadow) as well as selectional preferences and semantic roles of its arguments. This information, with the exception of semantic roles that are not specified there, is mapped in Swe-S onto positions at the syntactic layer and consequently it is linked to information on (i) number and optionality of positions in subcategorization pattern, (ii) syntactic functions and their morphosyntactic and lexical realisations, (iii) semantic preferences on arguments with regard to the feature 'animatness'. As a consequence, recurrent mapping patterns can be distinguished.

Let us now turn to the SGML objects. The linking between syntactic positions and semantic arguments is specified in the Correspondence object. Non-predicative nouns are linked by relating a syntactic unit to the one or more Semus to which it corresponds. Three types of links have been used in Swe-S:

krokus   'crocus'

<SynU
    id="US8376_9223"
    description="DN0">
    <CorrespSynUSemU
     targetsemu="USEMn_krokus1">
</SynU>
<SemU
    id="USEMn_krokus1"
    naming="krokus"
    example="GLDB:krokus1/1/0:saffrans&, vår&"
    comment="BC:No Base Concept"
    freedefinition="GLDB:krokus1/1/0:typ av liten liljeväxt med stora, trattlika blommor"
    weightvalsemfeaturel="
        WVSFTemplateFlowerPROT
        TSVP_BOTANY_TS_domaine_D
        TSVP_FLOWER_TS_classificateur_de_nom_C">
</SemU>

The one-to-many relation is illustrated below for the non-predicative polysemous noun trea 'three, third'. The syntactic unit DN0 is linked to six Semus, assigned to the following semantic types: 1-Human, 2-Building, 3-Vehicle, 4-Sign, 5-Unit_of_measurement and 6-Number.

trea   noun meaning 'three, third'

<SynU
    id="US17191_19079"
    description="DN0">
    <CorrespSynUSemU
        targetsemu="USEMn_trea1">
    <CorrespSynUSemU
        targetsemu="USEMn_trea2">
    <CorrespSynUSemU
        targetsemu="USEMn_trea3">
    <CorrespSynUSemU
        targetsemu="USEMn_trea4">
    <CorrespSynUSemU
        targetsemu="USEMn_trea5">
    <CorrespSynUSemU
        targetsemu="USEMn_trea6">
</SynU>
<SemU
    id="USEMn_trea1"
    naming="trea"
    example="GLDB:trea1/1/0:han kom & och fick alltså brons"
    comment="BC:No Base Concept"
    freedefinition="GLDB:trea1/1/0:person eller föremål med naturlig anknytning till talet 3::spec. om person som kommit på tredje plats i tävling"
    weightvalsemfeaturel="
        WVSFTemplateHumanPROT
        TSVP_SITU_TS_classificateur_de_nom_C">
</SemU>
<SemU
    id="USEMn_trea2"
    naming="trea"
    example="GLDB:trea1/1/1:§CX§"
    comment="BC:No Base Concept"
    freedefinition="GLDB:trea1/1/1:trerumslägenhet"
    weightvalsemfeaturel="
        WVSFTemplateBuildingPROT
        TSVP_FUNCTIONAL_SPACE_TS_classificateur_de_nom_C">
</SemU>
<SemU
    id="USEMn_trea3"
    naming="trea"
    example="GLDB:trea1/1/2:§CX§"
    comment="BC:No Base Concept"
    freedefinition="GLDB:trea1/1/2:buss eller spårvagn nummer 3"
    weightvalsemfeaturel="
        WVSFTemplateVehiclePROT
        TSVP_TRANSPORT_TS_domaine_D
        TSVP_VEHICLE_TS_classificateur_de_nom_C">
</SemU>
<SemU
    id="USEMn_trea4"
    naming="trea"
    example="GLDB:trea1/1/3:§X§"
    comment="BC:No Base Concept"
    freedefinition="GLDB:trea1/1/3:äv. om det påbjudna genomsnittsbetyget i skolans relativa betygssättning"
    weightvalsemfeaturel="
        WVSFTemplateSignPROT
        TSVP_PRIMARY_AND_SECONDARY_EDUCATION_TS_domaine_D
        TSVP_NUMBER_TS_classificateur_de_nom_C">
</SemU>
<SemU
    id="USEMn_trea5"
    naming="trea"
    example="GLDB:trea1/1/4:komma upp i &n"
    comment="BC:No Base Concept"
    freedefinition="GLDB:trea1/1/4:tredje årskursen::t. ex. i grundskolan el. gymnasiet"
    weightvalsemfeaturel="
        WVSFTemplateUnitofmeasurementPROT
        TSVP_PRIMARY_AND_SECONDARY_EDUCATION_TS_domaine_D
        TSVP_MEASURE_UNIT_TS_classificateur_de_nom_C">
</SemU>
<SemU
    id="USEMn_trea6"
    naming="trea"
    example="GLDB:trea1/1/5:§CX§"
    comment="BC:No Base Concept"
    freedefinition="GLDB:trea1/1/5:siffran 3:::spec. äv."
    weightvalsemfeaturel="
        WVSFTemplateNumberPROT
        TSVP_NUMBER_TS_classificateur_de_nom_C">
</SemU>
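The one-to-one and one-to-many links shown above can be collected mechanically. The sketch below uses regular expressions rather than an XML parser, because the excerpts are SGML in which the CorrespSynUSemU elements carry no end tags. The sample is abridged from the entries above; the function name and the choice of regular expressions are assumptions made for the illustration.

```python
import re

# Abridged from the krokus and trea entries quoted above.
SAMPLE = """
<SynU
    id="US8376_9223"
    description="DN0">
    <CorrespSynUSemU
     targetsemu="USEMn_krokus1">
</SynU>
<SynU
    id="US17191_19079"
    description="DN0">
    <CorrespSynUSemU
        targetsemu="USEMn_trea1">
    <CorrespSynUSemU
        targetsemu="USEMn_trea2">
</SynU>
"""

SYNU_BLOCK = re.compile(r"<SynU\b(.*?)</SynU>", re.S)
SYNU_ID = re.compile(r'\bid="([^"]+)"')
TARGET = re.compile(r'\btargetsemu="([^"]+)"')


def synu_links(sgml: str) -> dict:
    """Map each SynU id to the list of SemU ids it points to."""
    links = {}
    for block in SYNU_BLOCK.finditer(sgml):
        body = block.group(1)
        synu_id = SYNU_ID.search(body)
        if synu_id is not None:
            links[synu_id.group(1)] = TARGET.findall(body)
    return links


links = synu_links(SAMPLE)
# A syntactic unit with more than one target SemU is a one-to-many link.
one_to_many = sorted(synu for synu, targets in links.items() if len(targets) > 1)
print(links)
print(one_to_many)
```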

For verbs and deverbal nouns taking arguments, the correspondence is established between particular positions in the subcategorization frame and their arguments. Thus the two arguments a0 and a1 of the verb brevväxla 'correspond (with sb)', are linked to the subject and prepositional object positions respectively in the subcategorization frame for the syntactic unit D01P11POMED. The same verb is also linked to the syntactic unit D01PL, with one position occupied by a plural subject. In this case one can talk about reduced correspondence. These two show semantic affinity, and therefore are encoded with the same Semu number, except that the latter is indexed by 'a' (alternation).

brevväxla    'correspond (with sb)'

<SynU
    id="US2133_2323"
    description="D01PL">
    <CorrespSynUSemU
        targetsemu="USEMv_brevväxlar1a"
        correspondence="ISOmonovalent">
</SynU>
<SynU
    id="US2133_2323_1"
    description="D01P11POMED">
    <CorrespSynUSemU
        targetsemu="USEMv_brevväxlar1"
        correspondence="ISObivalent">
</SynU>

<SemU
    id="USEMv_brevväxlar1"
    naming="brevväxlar"
    example="GLDB:brevväxlar1/1/0:hon -ade med sin moster"
    comment="BC:No Base Concept"
    freedefinition="GLDB:brevväxlar1/1/0:regelmässigt skriva brev till och få brev från::viss person"
    weightvalsemfeaturel="
        WVSFTemplatecooperativeActivityPROT
        TSVP_COMMUNICATION_TS_classificateur_de_verbe">
    <PredicativeRepresentation
        typeoflink="MASTER"
        predicate="PRED_brevväxlar1">
</SemU>

<SemU
    id="USEMv_brevväxlar1a"
    naming="brevväxlar"
    example="GLDB:brevväxlar1/1/0:hon -ade med sin moster"
    comment="BC:No Base Concept"
    freedefinition="GLDB:brevväxlar1/1/0:regelmässigt skriva brev till och få brev från::viss person"
    weightvalsemfeaturel="
        WVSFTemplatecooperativeActivityPROT
        TSVP_COMMUNICATION_TS_classificateur_de_verbe">
    <PredicativeRepresentation
        typeoflink="MASTER"
        predicate="PRED_brevväxlar1a">
</SemU>

In Swe-S all the verbs, deverbal nouns and relational nouns have their positions linked to the arguments. In most cases these correspondences are isomorphic. Non-isomorphic cases occur, for example, when a shadow argument is part of a predicative representation, e.g. with weather verbs, or in other constructions where the number of syntactic positions does not match the number of semantic arguments. As an example of the latter, one can mention the syntactic unit DACVASUPR, which is a predicative construction with an adjective as head. The lack of correspondence follows from the fact that we have postulated an argument on the semantic layer in order to encode the selectional preferences imposed by an adjectival head on the subject. These are needed in order to motivate the semantic type of an adjective in predicative constructions.

2.2 Criteria for assigning domain features

The hierarchically structured domain list, elaborated by ERLI/LexiQuest, has been the point of departure for assigning domain information to the readings. This repository of specific domain values has been extended with the value General, to cover general, underspecified readings. While encoding domains, attention has been paid to the following:

2.3 Criteria for assigning Semantic Class and Template Type

Template Type assignments were based on semantic specifications included in templates. The content of these specifications was compared to the information provided by glosses in GLDB. To make the assignment as exact as possible we have used both the core and recommended ontological types.

Since word senses differ in their lexical complexity and transparency, the assignment task varied accordingly. Concrete nouns could in most cases be assigned to the template types without problems; abstract nouns were more problematic. Verbs with transparent meaning were easier to assign than verbs with underspecified, vague meanings. Thus, the more abstract and vague the word senses were, the less obvious their assignment was. The same can be said about the semantic class assignments.

In the ERLI/LexiQuest classification there are 95 semantic classes for classifying the readings of nouns and verbs. Since this classification differs from the one postulated by the SIMPLE ontology as to the number and distribution of categories, the mapping between categories is one-to-one, one-to-many or many-to-one. For instance, the semantic classes Instrument, Apparatus, Measuring_instrument and Musical_instrument all map onto the template Instrument. We have taken advantage of such differences in order to nuance sense assignments.
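The many-to-one case mentioned above can be pictured as a simple lookup table. The sketch below contains only the four classes named in the text; the dictionary and function names are assumptions made for the illustration, and the full mapping between the 95 classes and the template types is not reproduced here.

```python
from collections import defaultdict

# Semantic class -> template type, restricted to the example given in the text.
CLASS_TO_TEMPLATE = {
    "Instrument": "Instrument",
    "Apparatus": "Instrument",
    "Measuring_instrument": "Instrument",
    "Musical_instrument": "Instrument",
}


def classes_by_template(mapping: dict) -> dict:
    """Invert the mapping to see which semantic classes refine each template."""
    inverse = defaultdict(list)
    for sem_class, template in mapping.items():
        inverse[template].append(sem_class)
    return {template: sorted(classes) for template, classes in inverse.items()}


refinements = classes_by_template(CLASS_TO_TEMPLATE)
print(refinements["Instrument"])
```

Read in this direction, the semantic class acts as a finer-grained label inside one template, which is how the two classifications can be combined to nuance a sense assignment.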

As far as the task of assigning template types to adjectives is concerned, we have used all the 14 options provided by the ontology. It may be of interest to note that the meaning components explicated in this ontology have had a double function, as they have served both as meaning components and as labels for semantic classes for adjective types. The list of semantic classes for adjectives has been extended with class categories taken from the set of semantic classes for nouns and verbs, e.g. Attribute and Emotion, whenever these were semantically motivated.

2.3.1 Language specific typing

The set of 139 ontological types postulated in the guidelines has proved to be relevant for Swedish, as illustrated below (R stands for recommended ontological types):

General ontology for nouns and verbs

1. TELIC mål, syfte, målsättning
2. AGENTIVE motiv, orsak, skäl, anledning
2.1. CAUSE vålla, orsaka, föranleda
3. CONSTITUTIVE beståndsdel, enhet, molekyl
3.1. PART arm, framdel, baksida
3.1.1. BODY_PART arm, huvud, lillfinger
3.2. GROUP skock, stim, bukett
3.2.1. HUMAN_GROUP skara, sekt, sextett
3.3. AMOUNT flaska, glas, matsked
4. ENTITY  
4.1. CONCRETE_ENTITY  
4.1.1. LOCATION badort
4.1.1.1. 3_D_location berg, dal, grotta
4.1.1.2. Geopolitical_location badort, bygd, förort
4.1.1.3. Area område, trakt, åker
4.1.1.4. Opening dörr, fönster, hål
4.1.1.5. Building byggnad, foajé, hus, kyrka
4.1.1.6. Artifactual_area (R) boulevard, cykelbana, hållplats, motorväg
4.1.2. MATERIAL glas, material, stoff
4.1.3. ARTIFACT papper, armbandsur, dörr, ficklampa
4.1.3.1. Artifactual_material mjöl, nylon, papper, damast
4.1.3.2. Furniture fåtölj, ribbstol, sekretär
4.1.3.3. Clothing skjorta, slips, T-shirt
4.1.3.4. Container kopp, glas, väska, burk
4.1.3.5. Artwork bataljmålning, bild, opera
4.1.3.6. Instrument persondator, piano, pistol
4.1.3.7. Money dollart, enkrona, pund
4.1.3.8. Vehicle bil, cykel, buss
4.1.3.9. Semiotic_artifact papper, bok, bibel
4.1.4. FOOD lingon, potatis, champinjon
4.1.4.1. Artifact_Food (R) pepparkaka, bröd
4.1.4.2. Flavouring (R) paprika, curry, dressing
4.1.5. PHYSICAL_OBJECT fossil, gallsten, himlakropp
4.1.6. ORGANIC_OBJECT hinna, kokong, svulst
4.1.7. LIVING_ENTITY  
4.1.7.1. Animal däggdjur, fauna,
4.1.7.1.1. Earth_animal (R) dromedar, får, hund
4.1.7.1.2. Air_animal (R) bofink, duva, fjäril
4.1.7.1.3. Water_animal (R) delfin, bläckfisk, gädda
4.1.7.2. Human bekanting, fadderbarn
4.1.7.2.1. People indier, svensk, svenska
4.1.7.2.2. Role anhängare, apostel, dagbarn
4.1.7.2.2.1. Ideo marxist, pacifist, katolik
4.1.7.2.2.2. Kinship svåger, bror, son
4.1.7.2.2.3. Social_status adjunkt, advokat, premiärminister
4.1.7.2.3. Agent_of_temporary_activity pristagare, bedragare
4.1.7.2.4. Agent_of_persistent_activity cellist, diktare, jägare
4.1.7.2.5. Profession advokat, ambassadör, cellist
4.1.7.3. Vegetal_entity champinjon, kantarell, murkla
4.1.7.3.1. Plant lingon, potatis, rädisa
4.1.7.3.2. Flower aster, begonia, krokus
4.1.7.3.3. Fruit lingon, persika, nypon
4.1.7.4. Micro-organism plankton, virus, bacill
4.1.8. SUBSTANCE parfym, balsam, cellgift
4.1.8.1. Natural_substance alabaster, cesium, diamant
4.1.8.2. Substance_food (R) fasan, filé, forell
4.1.8.3. Drink (R) dricksvatten, dryck
4.1.8.3.1 Artifactual_drink (R) milkshake, läskedryck, saft
4.2. PROPERTY  
4.2.1. QUALITY fräckhet, godhet, hövlighet
4.2.2. PSYCH_PROPERTY intellekt, intuition, karisma
4.2.3. PHYSICAL_PROPERTY massa, spänst, storlek
4.2.3.1. Physical_power (R) färdighet, kondition, känsel
4.2.3.2. Color (R) terrakotta,
4.2.3.3. Shape (R) kub, klot, klyfta
4.2.4. SOCIAL_PROPERTY (R) makt, självständighet, välgörenhet
4.3. ABSTRACT_ENTITY värde, begrepp, huvudsak
4.3.1. DOMAIN område, slätt, strand
4.3.2. TIME säsong, söndag, telefontid
4.3.3. MORAL_STANDARDS (R) jämlikhet, likaberättigande, solidaritet
4.3.4. COGNITIVE_FACT tanke, teori, åsikt
4.3.5. MOVEMENT_OF_THOUGHT funktionalism, hinduism, idealism
4.3.6. INSTITUTION kyrka, riksdag, sjukhus
4.3.7. CONVENTION traktat, överenskommelse, avtal
4.4. REPRESENTATION verb, kliché, nummer
4.4.1. LANGUAGE svenska, italienska, danska
4.4.2. SIGN apostrof, bestämningsord, glossa
4.4.3. INFORMATION papper, handbok, meddelande
4.4.4. NUMBER (R) miljard, nittiotal, sjua
4.4.5. UNIT_OF_MEASUREMENT centiliter, decennium, dygn
4.5. EVENT  
4.5.1. PHENOMENON  
4.5.1.1. Weather_verbs (R) regna, snöa, åska
4.5.1.2. Disease (R) astma, cancer
4.5.1.3. Stimuli (R) väsen
4.5.2. ASPECTUAL sluta, fortgå, pågå
4.5.2.1. Cause_aspectual sluta, börja
4.5.3. STATE dåsa, blänka
4.5.3.1. Exist förekomma, existera
4.5.3.2. Relational_state toppar, utgöra, företräda
4.5.3.2.1. Identificational_state (R) symbolisera, likna,

General ontology for adjectives

1. INTENSIONAL  
1.1. Modal trolig, omöjlig
1.2. Temporal tidig, daglig
1.3. Emotive liten, dyr
1.4. Manner svart, rättslig
1.5. Object-related medicinsk, stark
1.6. Emphasizer farlig, fattig
     
2. EXTENSIONAL  
2.1. Physical_property liten, svart, stark
2.2. Psychological_property seriös, stark
2.3. Social_property svart, demokratisk
2.4. Temporal_property tidig, färdig
2.5. Intensifying_property liten, stark
2.6. Relational_property likartad

2.4 Classes derived from encoding

The semantic information encoded in the Swe-S lexicon can be automatically sorted, clustered and accessed in a number of different ways. We consider the following types of semantic classes to be of interest for NLP and NLU applications, not to mention their relevance for checking the internal consistency of the SIMPLE lexicons:

The classes of semantic data derived by sorting according to semantic type, domain and/or semantic class form semantic sublexicons that can easily be enriched with other semantic, syntactic or morphological data available in the PAROLE-SIMPLE lexicon and GLDB. Examples of tripartite semantic classes are given below, with information on template type, domain and semantic class as input:

WVSF_TEMPLATE_Instrument_PROT
TSVP_MUSIC_TS_domaine_D
TSVP_APPARATUS_TS_classificateur_de_nom_C:
grammofon1  jukebox1  kassettdäck1

WVSF_TEMPLATE_Instrument_PROT
TSVP_MUSIC_TS_domaine_D
TSVP_MUSICAL_INSTRUMENT_TS_classificateur_de_nom_C:
fiol1  flöjt1  gitarr1  piano1  trombon1  trumpet1  tuba1  violin1  violoncell1  xylofon1

Such tripartite classes support semantic annotating, information extraction, machine translation and natural language understanding in an effective way.
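Deriving tripartite classes amounts to grouping SemUs by the triple of template type, domain and semantic class. The sketch below does this for a few of the entries listed above. The tuple layout and the function name are assumptions made for the illustration; the feature names are shortened to their distinguishing part.

```python
from collections import defaultdict

# (SemU, template type, domain, semantic class), taken from the two classes above.
ENTRIES = [
    ("grammofon1", "Instrument", "MUSIC", "APPARATUS"),
    ("jukebox1", "Instrument", "MUSIC", "APPARATUS"),
    ("kassettdäck1", "Instrument", "MUSIC", "APPARATUS"),
    ("fiol1", "Instrument", "MUSIC", "MUSICAL_INSTRUMENT"),
    ("piano1", "Instrument", "MUSIC", "MUSICAL_INSTRUMENT"),
    ("tuba1", "Instrument", "MUSIC", "MUSICAL_INSTRUMENT"),
]


def tripartite_classes(entries) -> dict:
    """Group SemUs under their (template, domain, semantic class) key."""
    classes = defaultdict(list)
    for semu, template, domain, sem_class in entries:
        classes[(template, domain, sem_class)].append(semu)
    return dict(classes)


classes = tripartite_classes(ENTRIES)
for key, members in classes.items():
    print(key, members)
```

Dropping one element of the key gives the coarser groupings by template and domain, or by template alone.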

Polysemy

Access to data on the types of polysemous classes instantiated in a language has proved to be indispensable for the consistent encoding of meanings in the Swe-S lexicon. Below we list some frequent types of regular polysemous classes instantiated in the Swe-S lexicon. It should be observed that polysemy relations are not explicitly coded in the Swe-S lexicon, as to a large extent they can be automatically extracted from GLDB. In the list below, the first two columns display pairs of semantic types which stand in a polysemous relation, and the third provides examples. The list is alphabetically sorted.

Semantic type                   Semantic type           Example
Agent of persistent activity    Profession              cellist
Animal                          Artifactual material    krokodil
Animal                          Substance Food          krabba
Building                        Human group             administration
Building                        Institution             kyrka
Artwork                         Information             illustration
Container                       Amount                  burk
Flower                          Colour                  aprikos
Geopolitical location           Human group             stad
Human group                     Institution             administration
Human group                     Artwork                 trio
Location                        Human group             avdelning
Number                          Building                trea
Number                          Human                   trea
Number                          Sign                    trea
Number                          Unit of measurement     trea
Number                          Time                    femtiotal
Number                          Vehicle                 trea
Opening                         Artifact                dörr
People                          Language                svenska
Plant                           Fruit                   avocado
Plant                           Flavouring              pepparrot
Plant                           Substance               bomull
Profession                      Social status           advokat
Semiotic artifact               Information             bibel
Semiotic artifact               Convention              traktat
Substance                       Colour                  terrakotta
Water-Animal                    Substance Food          krabba

There is no doubt that reaching a consensus on how to deal with polysemy relations is of great relevance for: (i) checking consistency in lexicons, (ii) determining criteria for sense discrimination, (iii) structuring lexicon entries with regard to the distinction between core senses and their subsenses, (iv) text disambiguation, (v) multilingual linking of lexicons.
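A rough approximation of such polysemy pairs can be computed from the semantic type assignments alone, by pairing the types of all SemUs that share a lemma. This is a deliberately naive sketch and not the GLDB-based extraction referred to above: it treats every co-occurrence of two types on one lemma as a candidate pair, without distinguishing core senses from subsenses. The data layout and function name are assumptions; the type assignments for trea and krabba follow the examples in this document.

```python
from collections import defaultdict
from itertools import combinations

# (lemma, semantic type) for each SemU; trea as in section 2.1, krabba as in the list above.
ASSIGNMENTS = [
    ("trea", "Human"),
    ("trea", "Building"),
    ("trea", "Vehicle"),
    ("trea", "Sign"),
    ("trea", "Unit_of_measurement"),
    ("trea", "Number"),
    ("krabba", "Water_animal"),
    ("krabba", "Substance_food"),
]


def polysemy_pairs(assignments) -> dict:
    """Map each unordered pair of semantic types to the lemmas that instantiate it."""
    types_by_lemma = defaultdict(set)
    for lemma, sem_type in assignments:
        types_by_lemma[lemma].add(sem_type)

    pairs = defaultdict(set)
    for lemma, types in types_by_lemma.items():
        # Sorting makes each pair appear in one canonical order.
        for first, second in combinations(sorted(types), 2):
            pairs[(first, second)].add(lemma)
    return dict(pairs)


pairs = polysemy_pairs(ASSIGNMENTS)
for pair in sorted(pairs):
    print(pair, sorted(pairs[pair]))
```

Pairs instantiated by many lemmas are candidates for regular polysemy, while pairs instantiated by a single lemma may point to an idiosyncratic reading or to an inconsistent encoding.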

Semantic frame classes

A subset of the words in the Swe-S lexicon carries information on their predicative representation and the selectional restrictions on their arguments. These representations create semantic frames whose selectional restrictions on arguments are specified by reference to semantic types. This information seems to be especially helpful for extending the coverage of the lexicon, as new entries can by default inherit information on the semantic type assignment. To make the assignment as exact as possible, the information on semantic representation should be integrated with information on syntax. For example, verbs of the type Natural transition show a preference for a frame with a Living entity in the subject position; the same can be said about the Cooperative speech act verbs. Thus their meanings determine the semantic types of the arguments, and this information can be reused for applications concerning annotation, disambiguation, lexical acquisition, NLU and the automatic extension of the coverage of the Swe-S lexicon.

Hyponym - hyperonym classes

In the SIMPLE lexicon model the hyponym-hyperonym relation is part of the qualia encoding, namely Formal. In the Swe-S lexicon this information has not been encoded, because it is already available for all the noun and verb entries in GLDB, through the application called SEMNET, which parses definitions and finds genus proximum. Access to this knowledge is indispensable for a number of NLP and NLU tasks.


2.5 Representation of predicative information

Having already hinted at the relevance of predicative representation for different application tasks, we give below an example showing how the selectional restrictions on verb arguments are encoded for the verb stimmar (1. 'be noisy', 2. 'shoal'). Selectional restrictions on the arguments in a predicative representation in Swe-S are encoded by pointing either to one or more template types, as in the examples below, or to a particular SemU. Whenever a predicative expression implies a specific referent, its reference is encoded by means of a SemU (e.g. bark, quack).


<Predicate
    id="PRED_stimmar1"
    naming="stimmar"
    type="LEXICAL"
    multilingual="NO"
    argumentl="Arg0PRED_stimmar1">

<Predicate
    id="PRED_stimmar2"
    naming="stimmar"
    type="LEXICAL"
    multilingual="NO"
    argumentl="Arg0PRED_stimmar2">

<Argument
    id="Arg0PRED_stimmar1"
    semanticrolel="RoleNotCoded"
    informargl="InfArgTemplateHuman_CHECK">

<Argument
    id="Arg0PRED_stimmar2"
    semanticrolel="RoleNotCoded"
    informargl="InfArgTemplateWater-Animal_CHECK">

<InformArg
    id="InfArgTemplateHuman_CHECK"
    comment="'ontological' InformArg"
    status="CHECK"
    weightvalsemfeaturel="WVSFTemplateHumanPROT">

<InformArg
    id="InfArgTemplateWater-Animal_CHECK"
    comment="'ontological' InformArg"
    status="CHECK"
    weightvalsemfeaturel="WVSFTemplateWater-AnimalPROT">
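The chain of references in the objects above (Predicate to Argument to InformArg to template) can be followed with three lookups. The sketch below mirrors the identifiers of the stimmar example in plain dictionaries; the dictionary layout and the function name are assumptions made for the illustration and do not reflect how the SGML objects are stored or processed in practice.

```python
# Predicate id -> argument ids (the argumentl attribute).
PREDICATES = {
    "PRED_stimmar1": ["Arg0PRED_stimmar1"],
    "PRED_stimmar2": ["Arg0PRED_stimmar2"],
}

# Argument id -> InformArg id (the informargl attribute).
ARGUMENTS = {
    "Arg0PRED_stimmar1": "InfArgTemplateHuman_CHECK",
    "Arg0PRED_stimmar2": "InfArgTemplateWater-Animal_CHECK",
}

# InformArg id -> template feature (the weightvalsemfeaturel attribute).
INFORM_ARGS = {
    "InfArgTemplateHuman_CHECK": "WVSFTemplateHumanPROT",
    "InfArgTemplateWater-Animal_CHECK": "WVSFTemplateWater-AnimalPROT",
}


def selectional_restrictions(predicate_id: str) -> list:
    """Return the template feature required of each argument, in argument order."""
    return [INFORM_ARGS[ARGUMENTS[arg_id]] for arg_id in PREDICATES[predicate_id]]


for predicate_id in PREDICATES:
    print(predicate_id, selectional_restrictions(predicate_id))
```

The two readings of stimmar differ only in the template reached at the end of the chain, Human for 'be noisy' and Water-Animal for 'shoal'.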

 

3 Usability of the semantic data encoded in Swe-S

The issue of usability of the semantic data encoded in Swe-S has been discussed in an article that will be published in the proceedings from LREC 2000 conference. The article is reprinted in Appendix A.


4. Statistics

In the following, we show the distribution of nouns, verbs and adjectives with respect to template types. We list only those template types that have at least 10 words encoded per template.

 

NOUNS:
The total number of fully encoded noun Semus: 6.177
   
337  Human                          46  Disease
266  Profession                     45  Agent_of_persistent_activity
238  Semiotic_artifact              45  Body_part
196  Relational_act                 45  Clothing
186  People                         45  Food
174  Abstract_entity                43  Container
134  Agent_of_temporary_activity    42  Fruit
129  Artifact                       40  Quality
127  Social_status                  39  Movement_of_thought
124  Artwork                        38  Artifactual_area
122  Time                           38  Experience_event
120  Instrument                     37  Location
110  Information                    36  Number
105  Representation                 35  Area
103  Convention                     35  Cause_constitutive_change
 97  Unit_of_measurement            34  Act
 95  Role                           34  Property
 92  Language                       33  Substance_Food
 92  Natural_substance              33  Water_animal
 91  Human_group                    32  Substance
 85  Plant                          30  Artifactual_drink
 80  Artifactual_material           29  Ideo
 88  Kinship                        28  Air_animal
 87  Cognitive_fact                 28  Geopolitical_location
 84  Building                       28  Physical_object
 80  Artifactual_material           27  Identificational_state
 79  Part                           26  Opening
 78  Amount                         26  Speech_act
 72  Psych_property                 25  Furniture
 72  Relational_state               24  Shape
 67  Psychological_event            22  Change_of_state
 64  Institution                    22  Moral_standard
 62  Sign                           21  Modal_event
 61  Constitutive                   20  Flower
 61  State                          19  Artifactual_food
 58  Group                          18  Money
 58  Vehicle                        18  Symbolic_creation
 57  Purpose_act                    16  Flavouring
 56  Cause_change_of_state          15  Entity
 55  Domain                         15  Transaction
 54  Phenomenon                     14  Cause_change_of_location
 53  Artifact_Food                  14  Change_of_location
 53  Cooperative_activity           14  Material
 52  Cognitive_event                14  Move
 51  Earth_animal                   12  Reporting_event
 48  Physical_property              10  Micro-organism

 

VERBS:
The total number of fully encoded verb Semus: 1.943
   
419 Cause_change_of_state
157 Purpose_act
148 Non_relational_act
122 Relational_act
99 Move
70 Cause_constitutive_change
61 Change_of_state
58 Speech_act
51 Cause_experience_event
50 Cause_change_location
37 Cognitive_event
35 Reporting_event
32 Give_knowledge
32 Cause_change_of_value
30 Cooperative_activity
29 Relational_state
27 State
26 Symbolic_creation
26 Transaction
24 Change_of_location
22 Change_of_possession
23 Psychological_event
23 Expressive_speech_act
20 Physical_creation
16 Mental_creation
15 Acquire_knowledge
15 Perception
14 Stative_location
13 Cause_motion
13 Change_of_value
13 Copy_creation
11 Aspectual
10 Directive_speech_act
10 Cause_aspectual
10 Cause_natural_transition
10 Cooperative_speech_act

 

ADJECTIVES:
Total number of fully encoded adjective Semus: 451
   
123 Psychological_property
109 Physical_property
48 Extensional
38 Social_property
27 Object-related
19 Manner
15 Modal
12 Intensifying_property
10 Relational_property


5. Examples

A sample lexicon that covers 100 fully described words is enclosed with the documentation. The morphological and syntactic information encoded in the Swedish PAROLE lexicon is enriched with the semantic information encoded in the Swe-S lexicon. Three entries from that sample are reprinted in Appendix B: blomma 'flower', kastanj 'chestnut' and liten 'little'.



Bibliography

Calzolari, N. 1999. SIMPLE: Harmonised Semantic Lexicons for the European Languages. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA

Gellerstam, M., 1988. Verb Syntax in a Dictionary for Second-Language Learning. In Computer-Aided Lexicology, Stockholm: Almqvist & Wiksell.

Guimier, E., Ogonowski, A., 1998. Report on the Syntactic layer, ftp://parole:yaooakgn@www.erli.fr

Holmes, Ph., Hinchliffe, I., 1994. Swedish, London: Routledge Grammar.

Järborg, J., 1989. Betydelseanalys och betydelsebeskrivning i Lexikalisk databas. Göteborg: Inst. för språkvetenskaplig databehandling

Järborg, J., 1990. Användning av SynTag. Göteborg: Inst. för språkvetenskaplig databehandling.

Nationalencyklopedins ordbok, 1995-96. Compiled at Språkdata, Göteborg University. Höganäs: Bra Böcker.

Svenska ord – med uttal och förklaringar. 1992. Stockholm: Norstedts.

Lenci, A., F. Busa, N. Ruimy, E. Gola, M. Monachini, N. Calzolari, A. Zampolli, 1999. Linguistic Specifications, D2.1.

Malmgren, S. (1988). ’On Regular Polysemy in Swedish’, in: Studies in Computer-Aided Lexicography, Almqvist & Wiksell, Stockholm.

Pedersen, B.S. and Keson, B. (1999). SIMPLE - Semantic Information for Multifunctional Plurilingual Lexica: Some Examples of Danish Concrete Nouns. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA

Pustejovsky, J., 1995. The Generative Lexicon. Cambridge, MA: The MIT Press.

Toporowska Gronostaj, M. 1996. Integrerad valensbeskrivning. Mot ett formaliserat verbvalenslexikon. Göteborgs universitet: Inst. för svenska språket.

 

 

 

 

Appendix A


Annotating, Disambiguating & Automatically Extending the Coverage of the Swedish SIMPLE Lexicon

Kokkinakis D., Toporowska Gronostaj M., Warmenius K.

Språkdata, Göteborg University
Box 200, SE-405 30,
Sweden
{svedk, svemt, svekws}@svenska.gu.se

 

Abstract

In recent years, the development of high-quality lexical resources for real-world Natural Language Processing (NLP) applications has attracted considerable attention from research groups around the world and from the European Union, which has promoted language engineering projects dealing directly or indirectly with this topic. In this paper, we focus on ways to automatically extend and enrich one such resource, namely the Swedish version of the SIMPLE lexicon. The SIMPLE project (Semantic Information for Multifunctional Plurilingual Lexica) aims at developing wide-coverage semantic lexicons for 12 European languages, though on a scale that is rather small for practical NLP, namely fewer than 10,000 entries. Consequently, our intention is to explore and exploit various (inexpensive) methods to progressively enrich the resources and, subsequently, to annotate texts with the semantic information encoded within the framework of SIMPLE, enhanced with semantic data from the Gothenburg Lexical DataBase (GLDB) and from large corpora.

 

1. Introduction

In recent years there has been increased interest in the large-scale acquisition of high-quality semantic lexicons; see McKeown & Hatzivassiloglou (1993), Dorr & Jones (1996), Hearst & Schütze (1996), Takunaga et al. (1997), Roventini et al. (1998), Viegas et al. (1998). The methodology behind these approaches is usually corpus-driven. It is based on the (re-)use of machine-readable resources of various types, and on the application of cost-effective ways to eliminate the acquisition bottleneck, such as derivational morphology, customization of off-the-shelf resources and statistical techniques. The approach adopted here for the extension task is in line with the methodologies mentioned.

In this paper, we focus on ways to extend and enrich the coverage of the Swedish semantic lexicon as automatically as possible, by taking into consideration compounding, a distinctive feature of the Swedish language, and semantic similarity in noun phrases of the enumerative type. With the support of semantic data from the Swedish SIMPLE lexicon (Semantic Information for Multifunctional Plurilingual Lexica, LE4-8346), the Gothenburg Lexical DataBase (GLDB) and large corpora, both raw and shallow-parsed, we facilitate the incorporation of new semantic entries into the SIMPLE lexicon. We expect to be able to extend the 6,000 entries in the Swedish SIMPLE lexicon to over 120,000 entries. This expectation is based on the results of the tests carried out so far on input data of 1,000 entries, which grew to 25,000 (22,000 through compounding and 3,000 through semantic similarity).

Furthermore, we semantically annotate texts with all the available material, and we apply Machine Learning techniques for the disambiguation of ambiguous readings. The annotation task provides an excellent opportunity to evaluate the usability of the semantic information encoded in SIMPLE.

This paper is organized as follows: first we give a brief presentation of the SIMPLE project and particularly of the Swedish lexicon; then we present how compounding and semantic similarity in enumerative phrases (under certain conditions) can contribute to the augmenting and enrichment of the lexicon, when subjected to compound segmentation and shallow parsing; we continue by describing a practical application of the semantic lexicon, namely semantic annotation and disambiguation; we then give some general remarks on the usability of the SIMPLE model, while conclusions end the presentation.

 

2. The SIMPLE Project

The EU-financed SIMPLE project aims at developing wide-coverage semantic lexicons for 12 European languages. The Swedish SIMPLE lexicon (hereafter Swe-S) is one of these. All lexicons share a common semantic model and a common encoding formalism in SGML. The semantic data in the SIMPLE lexicons is being linked to the morphological and syntactic data in their respective PAROLE lexicons, developed within the EU project PAROLE, (Preparatory Action for Linguistic Resources Organisation for Language Engineering). Out of the 20,000 words in the PAROLE lexicons, a subset of about 6,000 words, or approximately 10,000 senses, has been enriched with semantic descriptions in the SIMPLE counterpart. The content and the design of the SIMPLE model are documented in Lenci et al. (1998).

The notion of semantic type is central for the SIMPLE model and its ontology. It corresponds to a word sense assigned to a lexical item. There are 139 semantic types distinguished in the SIMPLE ontology. Each semantic type is defined as a cluster of structured semantic information significant for a given word sense. Information on semantic class, domain, argument structure of predicative expressions and selectional restrictions on arguments as well as qualia roles constitute a relevant part of the semantic type specification; (Calzolari (1999), Pedersen & Keson (1999)). The SIMPLE ontology is multidimensional as it is based on the principle of orthogonal inheritance (Pustejovsky 1995), and in this respect, it contrasts with the LexiQuest’s semantic class ontology which is based on a standard, monodimensional approach. The latter ontology includes 95 semantic classes. Both ontologies are hierarchically structured.

 

3. The Swedish SIMPLE Lexicon

The theoretical and formal design of the Swe-S lexicon conforms to the SIMPLE linguistic guidelines presented by the specification group, Lenci et al. (1999). The Swe-S lexicon encodes about 10,000 semantic units (hereafter Usems), comprising 7,000 noun, 2,000 verb and 1,000 adjective Usems. These 10,000 units are mapped onto 6,000 entries. Usems are described with respect to the following information:

In the course of building the Swedish SIMPLE (and PAROLE) lexicons we have, to a large extent, reused lexical data from GLDB which is the most comprehensive source of lexical information on contemporary Swedish, and information from the SO (1992) and NEO (1996).

 

4. Extending the Coverage of the Swe-S

The Swe-S resources are not quantitatively sufficient for realistic, large-scale Natural Language Processing (NLP) tasks, such as semantic annotation, and need to be extended. For this particular task, we take advantage of the productive compounding characteristic of Swedish and the use of raw and partially parsed corpora.

We assume that a considerable number of casual compounds in Swedish, i.e. compounds created on the fly, can inherit relevant parts of the semantic information that the Swe-S lexicon provides for their heads and can thus be incorporated into the lexicon. By relevant parts, we mean in the first place the information concerning semantic type, domain and semantic class. To avoid errors, we exclude the information on argument structure from the inheritance, as the argument structure can undergo alternations in the process of compounding. This is the case when verbs and verbal nouns build compounds with either an obligatory or optional argument in the non-head position. The occurrence of an adjunct in the non-head position does not usually alter the predicative structure.

4.1 Compounding

The fact that over 70%, or approximately 80,000, of all the entries in the SAOL (1998) are compound forms highlights not only the immense lexical repository available for this particular extension task, but also the need to design effective tools and routines for compound segmentation, as new, casual compounds are created constantly in Swedish. Most of these casual compounds are relatively transparent, which implies that their meaning is a function of the meanings of their components, related to each other by an implied predicative functor. For instance, brödkniv (bröd = X, kniv = Y) ’bread knife’ implies ’Y for (cutting) X’, and bärsaft (bär = X, saft = Y) ’juice from berries’ implies ’Y which contains X’. In Swedish, compounds are written as single orthographic units, and nouns are the most frequent modifiers occurring in non-head positions.

A combination of various heuristic methods is used for the extension. Compound segmentation is applied to compound noun tokens in large corpora and lists of new nouns are produced. To ensure quality and compatibility with the rest of the data in the lexicon, further heuristics are applied to the content of the noun lists produced. To avoid generation of incorrect data, these heuristics inspect the modifying component of a compound in order to determine its characteristics, such as its part-of-speech and semantic category (if any). These characteristics of the modifier, combined with the corresponding characteristics of the compound head, provide data for a preliminary estimation of the correctness of the heuristics. A few examples will illustrate this point.

If the compound modifier is an adjective, a new Swe-S entry that will not cause semantic anomalies in the derived lexical set can be created with great confidence. The inheritance criterion applies here and the compounds are hyponyms of the head. For example, the lemma klocka ’bell/watch’ can be extended with compounds of type [ADJ-MODIFIER]+HEAD: [digital]klocka, [guld]klocka, [lill]klocka, [silver]klocka, [stor]klocka, where the adjectival modifying parts in these examples are ’digital, gold, little, silver’ and ’big’. Similar results are obtained if the modifying part is a proper noun. For instance, anhängare ’supporter’ combined with modifiers such as Berisha, Hammarby, Hitler, Likud and Mobutu yields unambiguous compounds.

It is well known that the heuristics perform with varying success on different types of compounds, and that some simple constraints are needed to exclude segmentation and interpretation errors, particularly where the modifying part of a compound is a noun (e.g. NOUN-MODIFIER[kultur]fråga ’cultural question’) or a verb (e.g. VERB-MODIFIER[betal]teve ’pay-TV’). These constraints are implemented as subroutines that check derived compounds against different lists in order to eliminate incorrect data. The lists of bound morphemes and lexicalized compounds, extracted from the GLDB, allow such compounds to be excluded from the derived sets. Such constraints have proven to be a cheap way to automatically limit the overgeneration of new entries in the lexicons.

For instance, when using large corpora, over 40 compounds with feber ’fever’ as head could be extracted. However, it became evident that not all of them belong to the semantic class Illness, e.g. resfeber ’excitement before a journey’. Thus, in some cases, additional inspection seems unavoidable if we want to prevent the automatic incorporation of lexicalised compounds with idiomatic, metaphoric or metonymic meanings. This inspection can be performed automatically by simply checking whether a given compound is included as a separate entry in GLDB. If this is the case, the compound is lexicalised and should not be subjected to automatic inheritance. Manual inspection is needed only if the derived compound shows diverging semantic and/or morphological patterns and the word is neither in the bound morpheme list nor in the lexicalised compound list.
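
The inheritance-with-filter idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routines used in the project: the function name inherit_from_head, the toy dictionaries and the longest-suffix segmentation are hypothetical, and scharlakansfeber ’scarlet fever’ is an example chosen here rather than one from the corpus. Only feber, resfeber, the class Illness and the check against separate GLDB entries come from the text.

```python
# Hypothetical stand-in for Swe-S: head lemma -> semantic class.
SWE_S = {"feber": "Illness"}

# Hypothetical stand-in for GLDB: compounds with an entry of their own
# are regarded as lexicalised.
GLDB_ENTRIES = {"resfeber"}


def inherit_from_head(compound, lexicon=SWE_S, lexicalised=GLDB_ENTRIES):
    """Return (modifier, head, semantic_class), or None if nothing is inherited."""
    if compound in lexicon or compound in lexicalised:
        # Already encoded, or lexicalised: no automatic inheritance.
        return None
    # Naive segmentation (an assumption): try the longest suffix first and
    # accept the first one that is a known head.
    for start in range(1, len(compound)):
        head = compound[start:]
        if head in lexicon:
            return compound[:start], head, lexicon[head]
    return None


print(inherit_from_head("scharlakansfeber"))  # ('scharlakans', 'feber', 'Illness')
print(inherit_from_head("resfeber"))          # None
```

A real segmenter would also have to handle linking elements and bound morphemes, which this sketch simply leaves attached to the modifier.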

Moreover, the content of the Swe-S has been used as a means of bootstrapping the process. For instance, glas ’glass’, can be extended with compounds having Substance as a modifier in the compound form. Consequently the [NOUN-MODIFIER{Substance }]+HEAD compounds [vatten]glas, [vin]glas, [öl]glas, [likör]glas all have Substance as the modifier part, namely ’water, wine, beer’ and ’liqueur’.

A large number of already disambiguated compounds have also been extracted from GLDB, since the Swe-S entries are linked to the various senses and sub-senses in GLDB, and subsequently to the morphological examples of every entry (i.e. compounds). For instance, Swe-S encodes the non-compound lemma ämne as having four senses (marked 1/1-1/4), which are disambiguated here by means of their assignment to the following semantic types and semantic classes:


Material: Matter ’material’
Substance: Substance ’stoff’
Part: Abstract ’topic’
Domain: Notion ’subject, discipline’

Each of these senses is exemplified in GLDB with a number of compounds, 26 in total, with ämne as the head. Some of these are listed in the right column of table (1). Since there is only one compound with that head in the Swe-S lexicon (grundämne ’element’), incorporating new, disambiguated compounds was straightforward.

Swe-S                   GLDB
ämne:1/1:MATTER         färgämne:1/1, hornämne:1/1
ämne:1/2:SUBSTANCE      yxämne:1/2, fruktämne:1/2
ämne:1/3:ABSTRACT       predikoämne:1/3, uppsatsämne:1/3
ämne:1/4:NOTION         läroämne:1/4, skolämne:1/4
grundämne:1/1:MATTER

Table 1: ämne in Swe-S, and GLDB compounds with ämne as head.

4.2 Heuristic Incorporation of New Entries through Shallow Parsing

So far we have addressed the problem of the acquisition of compound nouns based on the content of the Swe-S lexicon, by applying heuristics, filters, and manual inspection, in some cases, in order to guarantee consistency. But how can we cope with the rest of the vocabulary?

Wilson and Thomas (1997:55-57) argue that one of the conditions that a semantic system should satisfy is that it should be able to account exhaustively for the whole vocabulary in the corpus, not just for a part of it. We have experimented with a corpus-based approach, using a cascaded finite-state syntactic parser (CASS-SWE), based on work done by Kokkinakis & Johansson Kokkinakis (1999), which seems a plausible way of progressively enriching the Swedish semantic resources.

An advantage of CASS-SWE is its ability to identify noun phrases with high accuracy, a property that we consider crucial here for aiding the "discovery" of new semantic entries. Essentially the approach, which has similarities to naive clustering, is as follows. Gather large corpora (here 13 million tokens), part-of-speech tag them, and then parse them with CASS-SWE (the parser uses part-of-speech annotated input); from the resulting forest of analyzed chunks we extract the long noun phrases, namely those containing three or more common nouns. Finally, the overlap between the nouns in the NPs produced and the entries in Swe-S is measured. If at least two of the nouns (an arbitrarily chosen figure) are also entries in the Swe-S, with the same semantic class, then there is a strong indication that the rest of the nouns are co-hyponyms, and thus semantically similar to the two already encoded in Swe-S. Accordingly, we take advantage of the transitivity of hyponymy, and of the fact that two lexical items X and Y are co-hyponyms if: (i) they are disjuncts and therefore complementary; and (ii) they have a common superordinate, e.g. animal is a superordinate of cat, dog, horse and camel, cf. Sanfilippo et al. (1999).

Similarity plays an important role in word acquisition, and preliminary results have shown that the simple overlap works fairly well for the majority of the cases examined. However, the noise which is produced can be eliminated if the semantic tags of all the words in a phrase are compared. Caution is needed in cases where different semantic classes are involved in an enumerative NP, e.g.:


kvinnor:Bio, barn:Bio, husdjur:? och möbler:Furniture
’women, children, pets and furniture’

immiga flaskor:Container#Artifact, feta cigarrer:?, och tangodansande kvinnor:Bio#Situ
’steamy bottles, fat cigars and tango-dancing women’

The unclassified husdjur in the first example should not be assigned to the class Bio, since another class, namely Furniture, is involved in the same NP. Similarly, no action should be taken in the second example, since two semantically ambiguous words with distinct classes are involved.
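
The overlap heuristic and the two cautionary cases can be summarised in a small Python sketch. It is a simplified reading of the description above, not the actual implementation: the function name infer_class, the toy lexicon keyed by surface forms and the assignment of the nationality nouns to Ethnos are assumptions made for illustration.

```python
# Hypothetical stand-in for Swe-S: noun -> set of semantic classes.
TOY_LEXICON = {
    "kvinnor": {"Bio"},
    "barn": {"Bio"},
    "möbler": {"Furniture"},
    "italienare": {"Ethnos"},
    "finländare": {"Ethnos"},
    "greker": {"Ethnos"},
}


def infer_class(np_nouns, lexicon, min_known=2):
    """Propose a class for the unknown nouns of one enumerative NP.

    Returns {noun: class} for the unknown nouns, or {} when no action
    should be taken.
    """
    if len(np_nouns) < 3:
        return {}  # only NPs with three or more common nouns are considered
    known = [n for n in np_nouns if n in lexicon]
    unknown = [n for n in np_nouns if n not in lexicon]
    # A semantically ambiguous known noun blocks the inference.
    if any(len(lexicon[n]) != 1 for n in known):
        return {}
    classes = {next(iter(lexicon[n])) for n in known}
    # Require at least `min_known` known nouns that all share one class.
    if len(known) < min_known or len(classes) != 1:
        return {}
    shared = classes.pop()
    return {n: shared for n in unknown}


print(infer_class(["italienare", "finländare", "jugoslaver", "greker"], TOY_LEXICON))
# {'jugoslaver': 'Ethnos'}
print(infer_class(["kvinnor", "barn", "husdjur", "möbler"], TOY_LEXICON))
# {}
```

Comparing the classes of all known nouns in the phrase, rather than only the first two that match, is what keeps husdjur from being assigned to Bio in the second call.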

The best results were achieved for the semantic classes: Phenomena (Illness and Psychological-Feature), Occupation, Animal and Human (Bio, Ethnos and Occupation-Agent). Some examples of the last-mentioned class, taken from NPs in the parsed corpus, are given below. In these examples, (*) marks an original Swe-S entry, (+) marks an entry incorporated through the compound analysis, (N) marks a completely new entry and (?) marks errors:


italienare*, finländare*, jugoslaverN, greker*
’Italians, Finns, Yugoslavs and Greeks’

amerikanerN, japanerN, tyskar* och italienare*
’Americans, Japanese, Germans and Italians’

jurister*, läkare*, optiker*, psykologerN, sjukgymnasterN
’lawyers, doctors, opticians, psychologists, physiotherapists’

läkare*, psykologer* och andra brottsutredareN
’doctors, psychologists and other crime investigators’

några läkare*, präster* och socialarbetare+
’some doctors, priests and social workers’

samtliga politiska partier?, läkare*, jurister*
’all political parties, doctors, lawyers’

advokater*, psykiaterN, specialistläkare+ m.m.
’lawyers, psychiatrists, specialist doctors etc.’

4.3 Quantitative Results

Using the previously described heuristics and observations, the relatively limited inventory of semantic information in Swe-S, has been extended to a large semantic resource, appropriate for a large number of intermediate NLP tasks, i.e. simpler processes which are carried out to help final tasks.

Regarding the use of compounds for extending the entries, an estimated average of 20-25 compounds per Swe-S entry has been extracted by combining information from large corpora and the GLDB. Thus, by using only 1,000 nouns we could increase the total vocabulary size to over 22,000 semantic entries. For some entries with both concrete and abstract senses, the number of compounds extracted from large corpora ran into several hundreds. Table (2) shows the ten non-compound entries richest in compound variants.

 

Swe-S Entry                         Occ.
program ’programme, program’        469
arbete ’work, employment’           402
chef ’chief’                        390
bok ’book’                          357
verksamhet ’activity, operation’    299
skola ’school’                      275
man ’man’                           273
rum ’room, space’                   244
kort ’card, photo’                  231
bolag ’company’                     217

Table 2: Swe-S entries richest in compound variants

Turning to the shallow parsing of the 13-million-token corpus, over 15,600 NPs with the content we were interested in, namely three or more common nouns, could be extracted. Approximately 3,000 new noun entries for the Swe-S could be identified without any further processing (bootstrapping the compound analysis). However, as mentioned in the previous section, some noise was produced, and for this reason we do not use these new nouns for the semantic annotation discussed in the next section until we find more reliable ways to eliminate the limited number of errors produced.

5. Annotating with Swe-S (Semantic Tagging)

Semantic tagging is appealing since it is believed to improve the performance and robustness of NLP systems, cf. Resnik & Yarowsky (1997). The appropriate content from the core Swe-S, i.e. "semantic class", "domain" and "template type" information, has been extracted and implemented as finite-state machines suitable for semantic tagging, i.e. the task of assigning semantic categories or clusters of semantically related concepts to words. These machines are then applied sequentially to lemmatized textual data, yielding all possible annotations for the tokens matched.

Testing was performed using 1,800 nouns from the Swe-S, approximately 150 of which are potentially ambiguous in the sense that more than one semantic label (class, domain and template) can be associated with a single token. For instance, the Swedish noun administration ’administration’ is classified under four different semantic classes: Agency, Functional-Space, Human and Operation, while the noun affär ’shop, business, affair’ is classified under Functional-Space, Operation, State and Event.
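
What the tagger produces before disambiguation can be illustrated as follows. The sketch deliberately swaps the finite-state machines for a plain dictionary lookup, so it shows the output rather than the mechanism; the names SEMANTIC_CLASSES, tag and ambiguous are hypothetical, and only the two nouns and their classes are taken from the text.

```python
# Toy lexicon holding the two ambiguous nouns cited above.
SEMANTIC_CLASSES = {
    "administration": ["Agency", "Functional-Space", "Human", "Operation"],
    "affär": ["Functional-Space", "Operation", "State", "Event"],
}


def tag(lemmas, lexicon=SEMANTIC_CLASSES):
    """Attach every possible semantic class to each lemma (no disambiguation)."""
    return [(lemma, lexicon.get(lemma, [])) for lemma in lemmas]


def ambiguous(tagged):
    """Lemmas that received more than one semantic class."""
    return [lemma for lemma, classes in tagged if len(classes) > 1]


tagged = tag(["en", "affär"])
print(tagged)             # [('en', []), ('affär', ['Functional-Space', 'Operation', 'State', 'Event'])]
print(ambiguous(tagged))  # ['affär']
```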

 

6. Supervised Learning

We adopted Machine Learning (ML), particularly Memory Based Learning, for the disambiguation of the semantic annotation of text samples.

6.1 Memory Based Learning

Memory-Based Learning (MBL) is a supervised, inductive, classification-based method originating from the field of machine learning (ML), Mitchell (1997). MBL has several practical advantages: (i) it has produced state-of-the-art results in many natural language ambiguity problems (cf. Cardie & Mooney (1999)); (ii) it is not sensitive to sparse or low-frequency data, as low-frequency cases are not discarded but kept in memory, so that useful information can also be extrapolated from them; and (iii) learning is fast and incremental: new instances can be added to the memory, improving the performance of the system. The software used for the experiments with the Swedish data was developed at the University of Tilburg by Daelemans et al. (1999).

MBL is closely based on the assumption that "performance in cognitive tasks is based on reasoning on the basis of similarity of new situations to stored representations of earlier experiences", Daelemans et al. (1999). An MBL system consists of two components: a learning component, which is memory-based, adding training instances to memory, and a performance component, in which the product of the learning component is used for performing the classification of the input.

6.2 Training Material

It is rather difficult to give an exact number of examples required for an adequate description of noun senses. Intelligent example selection for supervised learning is an important issue in ML, an issue that we have not explored. However, from the (human) lexicographical point of view, an experienced scholar would need, roughly, a hundred arbitrarily chosen excerpts for each word in order to cover the majority of sense distinctions (Jerker Järborg, personal communication). For a machine, that figure should be higher, although we have not empirically tested the validity of this statement.

We automatically created a large set of non-lemmatized training data, taken from concordances, and then manually classified the training instances. The deliberate choice of non-lemmatized material should be emphasized here, as our experiments showed that noun morphology supports sense disambiguation, both for compound and non-compound forms in Swedish.

For instance, plural forms of finska or tyska, ’Finnish, German’, refer almost exclusively to Ethnos (denoting a person) while the base form is ambiguous between Ethnos and Abstract (denoting either the person or language). Likewise, plural forms of begåvning ’talented (person), talent’ refer almost exclusively to Situ, while its base form refers to Psychological-Feature.

For the training and test instances we organized the near context of the ambiguous semantic entries into fixed-length vectors of n symbolic feature-value pairs (in the experiments in this paper n=12), which consist of the left and right context of the word under investigation, its part-of-speech and its byte-offset in the discourse, and a field containing the classification of that particular feature-value vector. Unknown features are marked with a question mark ’?’, while long context is truncated. Moreover, we took advantage of the syntactic examples in GLDB, given for almost every lemma in the database, and in this way we could complement the training material automatically with already classified training instances. This last point can be illustrated by the use of two syntactic examples provided by the GLDB for the noun medicin ’medicine’. Since these are already disambiguated, designated by their sense number, they can be directly mapped onto the respective Swe-S semantic classes for that particular word:

GLDB: medicin:1:studera medicin
MBL: byte-offs noun ? ? ? studera medicin ? ? ? ? Occupation
’medicine:sense1:study medicine’

GLDB: medicin:2:skriva ut recept på en bra medicin
MBL: byte-offs noun recept på en bra medicin ? ? ? ? Substance
’medicine:sense2:write a prescription on a good medicine’
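
How such a fixed-length instance might be built from a GLDB example can be sketched as below. The function make_instance and its parameters are hypothetical; the window of four tokens on each side is inferred from the two instances shown above (2 + 4 + 1 + 4 + 1 = 12 fields), and the literal string "byte-offs" is kept as a placeholder for the real offset.

```python
def make_instance(tokens, focus, pos, label, offset="byte-offs", window=4):
    """Build one instance: offset, POS, left context, focus word,
    right context and class label."""
    left = tokens[max(0, focus - window):focus]        # truncate long left context
    right = tokens[focus + 1:focus + 1 + window]       # truncate long right context
    left = ["?"] * (window - len(left)) + left         # pad missing left context
    right = right + ["?"] * (window - len(right))      # pad missing right context
    return [offset, pos] + left + [tokens[focus]] + right + [label]


example = "skriva ut recept på en bra medicin".split()
print(" ".join(make_instance(example, 6, "noun", "Substance")))
# byte-offs noun recept på en bra medicin ? ? ? ? Substance
```

Padding with ’?’ and truncating to the window are what keep every instance at the same length regardless of where the word occurs in the example.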

During classification an unseen example X, a test instance, is presented to the system and a distance metric D between X and each instance Y in the memory is calculated, D(X,Y). Various implemented algorithms (variants of the k-nearest neighbour algorithm) try to find the nearest training instance for X and return its class as the prediction for the class of the test instance.
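
The classification step can be illustrated with a bare-bones nearest-neighbour sketch. It is not the TiMBL software used in the experiments: it assumes an unweighted overlap distance (the number of mismatching feature positions) and omits any feature weighting, and the names overlap_distance and classify are hypothetical.

```python
from collections import Counter


def overlap_distance(x, y):
    """D(X, Y): number of feature positions at which two instances differ."""
    return sum(a != b for a, b in zip(x, y))


def classify(features, memory, k=1):
    """Return the majority label among the k stored instances nearest to `features`.

    memory: list of (feature_list, label) pairs, kept verbatim.
    """
    ranked = sorted(memory, key=lambda item: overlap_distance(features, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Memory built from the two GLDB examples for medicin (class field split off).
memory = [
    ("noun ? ? ? studera medicin ? ? ? ?".split(), "Occupation"),
    ("noun recept på en bra medicin ? ? ? ?".split(), "Substance"),
]
query = "noun ? ? ? läsa medicin ? ? ? ?".split()
print(classify(query, memory))  # Occupation
```

In this sketch two ’?’ values count as a match, which is a simplification; the query differs from the first stored instance in one position and from the second in four.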

6.3 Results

At present, the standard criterion for evaluating sense disambiguation algorithms is the "exact match" (or accuracy) criterion. Specifically for ML, an approach must perform significantly better than the most-frequent-class classifier to be worthy of serious consideration.

Table (3) summarizes the results for a few of the ambiguous cases examined. In every case we try to improve on the baseline for the semantic entry we want to disambiguate. By baseline we mean here the most frequent class attached to an ambiguous token in the test sample.

Our experiments using the MBL approach returned 84.8% correct disambiguation, tested on 25 ambiguous entries (with 20-25 test instances in each case), with an average baseline of 69.4%.
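
As a small worked example of the baseline, the share of the most frequent class can be computed directly from per-class counts. The function name baseline_accuracy is hypothetical; the counts are those listed for administration in Table (3), for which the result matches the baseline reported there.

```python
def baseline_accuracy(class_counts):
    """Accuracy obtained by always choosing the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


administration = {"Agency": 39, "Human": 18, "Operation": 81, "Functional-Space": 5}
print(f"{baseline_accuracy(administration):.1%}")  # 56.6%
```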

 

Swe-S                             (Swe-S) Class         Tr. Data  Baseline  Acc.
administration ’administration’   Agency-39             143       56.6%     76%
                                  Human-18
                                  Operation-81
                                  Functional-Space-5
affär ’shop, affair, business’    Event-102             362       48.3%     92%
                                  Operation-175
                                  State-2
                                  Functional-Space-83
danska ’Danish’                   Ethnos-258            290       88.9%     88%
                                  abstract-32
klyfta ’segment, cleft, rift’     Phenomenon-67         119       56.3%     88%
                                  Form-33
                                  Alternation-3
medicin ’medicine’                Substance-380         489       77.7%     72%
                                  Occupation-103
                                  Substance-Occup.-6
område ’area, zone, field’        Location-168          319       52.6%     84%
                                  abstract-151
vatten ’water’                    Substance-396         567       69.8%     100%
                                  Substance-Loc.-110
                                  Location-61
teater ’theatre, play-acting’     Abstract-105          329       31.9%     85%
                                  Agency-103
                                  Function.-Space-102
                                  Human-12
                                  Activity-7

Table 3: Data used by MBL for semantic disambiguation

(Tr. Data: amount of training data, Acc.: accuracy based on the MBL approach)

 

7. Usability of the SIMPLE Model

In this section, we reflect on the usability of the SIMPLE model for different NLP tasks that require access to semantic information. Many NLP applications can be actively supported by the SIMPLE lexicon, which offers multiple access points to the semantic data. The 10,000 word senses can be accessed either directly, or by means of selective information searches starting with the 139 ontological categories provided by the SIMPLE ontology, the 95 semantic class categories and the 364 domain specifications. Since the first two capture somewhat different aspects of word meaning in a number of cases, the double ontological specifications not only provide more precise information, but also increase the granularity of semantic description.

The ontological information cluster can be extended with information on domains. The domain information, indispensable for text-recognition tasks, can support disambiguation of senses with identical ontological clusters. For example, the word grad ’degree, grade’ has nine senses assigned, and four of these denote different units of measurement representative of domains such as Geometry, Earth-Sciences, Typography and Meteorology. Since those four display identical ontological categorization, the domain information supports disambiguation in a relevant way. In consequence, a tripartite cluster including both ontological and domain information seems to be preferred. The explicit specification of domain information in the SIMPLE lexicon makes it possible to generate domain-based sublexicons, which are basic for text-recognition tasks.
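
The role of the domain field can be made concrete with a short sketch. The record layout, the sense identifiers s1-s4 and the function names are hypothetical; only the lemma grad, the four domains and the fact that the four senses share one ontological categorization come from the text.

```python
from collections import defaultdict

# Hypothetical sense records: (lemma, sense id, ontological type, domain).
SENSES = [
    ("grad", "s1", "Unit_of_measurement", "Geometry"),
    ("grad", "s2", "Unit_of_measurement", "Earth-Sciences"),
    ("grad", "s3", "Unit_of_measurement", "Typography"),
    ("grad", "s4", "Unit_of_measurement", "Meteorology"),
]


def sublexicons(senses):
    """Group sense records by domain, giving one sublexicon per domain."""
    by_domain = defaultdict(list)
    for lemma, sense_id, sem_type, domain in senses:
        by_domain[domain].append((lemma, sense_id, sem_type))
    return dict(by_domain)


def disambiguate(lemma, sem_type, domain, senses):
    """Sense ids matching the lemma, the ontological type and the domain."""
    return [s for (l, s, t, d) in senses if (l, t, d) == (lemma, sem_type, domain)]


print(disambiguate("grad", "Unit_of_measurement", "Typography", SENSES))  # ['s3']
print(sorted(sublexicons(SENSES)))
# ['Earth-Sciences', 'Geometry', 'Meteorology', 'Typography']
```

Without the domain argument all four records would match, which is the ambiguity the tripartite cluster is meant to resolve.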

The attempt to harmonize the encoding of data makes it possible to multilink the SIMPLE lexicons for different languages, which is essential for building the lexicon modules for machine-aided translation.

Since the content of the Swe-S lexicon is linked to the GLDB database, the information exchange can proceed in two directions, which promotes development of both resources. These two resources describe and formalize lexical information concerning a word’s morphology, syntax and semantics, which is a prerequisite for advanced NLP tasks. As was already hinted, the SIMPLE project has aimed at harmonization of lexical resources by using a common lexicon model and formalism for 12 EU languages. This initiative has opened new prospects for further developments within the language engineering field.

 

8. Conclusions and Further Research

This paper has discussed means to automatically extend the lexical inventory of the Swe-S semantic lexicon by exploiting the productive compounding characteristic of Swedish, the semantic similarity in enumerative noun phrases, corpora in both raw and parsed form, and the morphological, syntactic and semantic content of GLDB. Using a combination of all the available data, a relatively limited inventory of semantic information, such as the Swe-S, can be extended to a large semantic resource appropriate for a large number of intermediate NLP tasks. Moreover, its compatibility with the manually developed Swe-S lexicon can be guaranteed and its high quality maintained, as we applied heuristics that avoid producing semantically anomalous entries. We have also used the Swe-S resource for semantic annotation of texts, while for the disambiguation we employed Machine Learning techniques, supported by large, manually created portions of training data for a small number of ambiguous semantic entries. Work within the SIMPLE project was still in progress at the time of writing, so a future task would be to extend the rest of the material using the same methodology, and to devise better ways to eliminate the noise produced by the syntactic parsing. Reliable extraction of similar words from text corpora opens up many exciting opportunities for further linguistic analysis.

 

Acknowledgements

We thank three anonymous reviewers for their useful comments on a previous draft. The first author is also indebted to the "Birgit & Gad Rausings" foundation for providing financial support for participation in the conference.

 

References

Blåberg, O. (1988). A Study of Swedish Compounds. Report 29, General Linguistics, Umeå University, Sweden

Calzolari, N. (1999). SIMPLE: Harmonised Semantic Lexicons for the European Languages. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA

Cardie, C. and Mooney, R.J. (1999). Guest Editors’ Introduction: Machine Learning and Natural Language. In Journal of Machine Learning, Special Issue on Natural Language Learning, Vol. 34, pp. 1-5, Kluwer AP

Daelemans, W., Zavrel, J., van der Sloot, K. and van den Bosch, A. (1999). TiMBL: Tilburg Memory Based Learner, version 2.0, Reference Guide. ILK Technical Report 99-01. Paper available from: http://ilk.kub.nl/~ilk/papers/ilk9901.ps.gz

Dorr, B. and Jones, D. (1996). Acquisition of Semantic Lexicons: Using Word Sense Disambiguation to Improve Precision. In Proceedings of the SIGLEX Workshop "Breadth and Depth of Semantic Lexicons", pp. 42-50, Santa Cruz, California, USA

Hearst, M.A. and Schütze, H. (1996). Customizing a Lexicon to Better Suit a Computational Task. In Corpus Processing for Lexical Acquisition, pp. 77-94, Boguraev B. and Pustejovsky J. (eds.). MIT Press

Kokkinakis, D. and Johansson Kokkinakis, S. (1999). A Cascaded Finite-State Parser for Syntactic Analysis of Swedish. In Proceedings of the 9th EACL, pp. 245-248, Bergen, Norway. Paper available from: http://svenska.gu.se/~svedk/publics/eaclKokk.ps

Lenci, A. et al. (1998). SIMPLE WP2, Linguistics Specifications. Deliverable 2.1, Pisa

McKeown, K. and Hatzivassiloglou, V. (1993). Augmenting Lexicons Automatically: Clustering Semantically Related Adjectives. In Proceedings of the ARPA HLT Workshop, pp. 272-277, Princeton, NJ

Mitchell, T. M. (1997). Machine Learning. Series on Computer Science, McGraw-Hill

NEO, (1996). Nationalencyklopedins ordbok. Volumes 1-3, Språkdata & Bra Böcker AB

Pedersen, B.S. and Keson, B. (1999). SIMPLE - Semantic Information for Multifunctional Plurilingual Lexica: Some Examples of Danish Concrete Nouns. In Proceedings of the SIGLEX-99 Workshop: "Standardizing Lexical Resources", Maryland, USA

Resnik, P. and Yarowsky, D. (1997). A Perspective on Word Sense Disambiguation, Methods and their Evaluation. In Proceedings of the Workshop: "Tagging Text with Lexical Semantics. Why, What and How?", pp. 79-86, Washington D.C., USA

Roventini, A., Peters, C., Calzolari, N. and Bertagna, F. (1998). Building a Semantic Network for Italian Using Existing Lexical Resources. In Proceedings of the 1st LREC, Vol. 1, pp. 377-383, Granada, Spain

Sanfilippo, A. et al. (1999). Preliminary Recommendations on Lexical Semantic Encoding. EAGLES LE3-4244, Draft version

SAOL, (1998). Svenska Akademiens Ordlista över Svenska Språket (The Swedish Academy Word-List). Norstedts Förlag & Svenska Akademien

SO, (1992). Svenska Ord. Statens Skolverk, Norstedts Förlag

Tokunaga, T., Fujii, A., Iwayama, M., Sakurai, N. and Tanaka, H. (1997). Extending a Thesaurus by Classifying Words. In Proceedings of the Workshop: "Automatic Information Extraction and Building of Lexical Semantic Resources", Vossen P., Adriaens G., Calzolari N., Sanfilippo A. and Wilks Y. (eds), pp. 16-21, Madrid, Spain

Viegas, E., Ruelas, A., Beale, S. and Nirenburg, S. (1998). Extending a Core lexicon Using On-Line Language Resources with Savoir-Faire. In Proceedings of the 1st LREC, Vol. 1, pp. 97-104, Granada, Spain

 

APPENDIX B

Three entries taken from the sample of 100 entries of the Swedish SIMPLE lexicon are reprinted below. These are blomma 'to flower', kastanj 'chestnut' and liten 'little'.
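
The inflection pattern in the first entry (GInP VB_711) can be read as a set of string rules applied to the stem blomma. The following Python sketch only illustrates that reading and is not part of the PAROLE/SIMPLE tooling. The names Cif, VB_711, inflect and paradigm are hypothetical, and the sketch assumes that Removal strips a suffix from the stem, while AddedBefore and AddedAfter are prepended and appended respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cif:
    """One inflection rule, mirroring the sub-elements of a <Cif> element."""
    removal: str = ""       # <Removal>: assumed to be a suffix stripped from the stem
    added_before: str = ""  # <AddedBefore>: assumed to be prepended to the stem
    added_after: str = ""   # <AddedAfter>: assumed to be appended to the stem


# Rules copied from <GInP id="VB_711"> in the blommar entry, keyed by combmf id.
VB_711 = {
    "PRA": Cif(added_after="r"),       # present, active
    "PRP": Cif(added_after="s"),       # present, passive
    "PAA": Cif(added_after="de"),      # past, active
    "PAP": Cif(added_after="des"),     # past, passive
    "INF": Cif(),                      # infinitive
    "IMP": Cif(),                      # imperative
    "SUP": Cif(added_after="t"),       # supine
    "PAPART": Cif(added_after="d"),    # past participle
    "PEP": Cif(added_after="nde"),     # present participle
}


def inflect(stem: str, rule: Cif) -> str:
    """Apply a single rule to a stem and return the inflected form."""
    base = stem
    if rule.removal:
        if not stem.endswith(rule.removal):
            raise ValueError(f"stem {stem!r} does not end in {rule.removal!r}")
        base = stem[: -len(rule.removal)]
    return rule.added_before + base + rule.added_after


def paradigm(stem: str, pattern: dict) -> dict:
    """Apply every rule of an inflection pattern to the same stem."""
    return {tag: inflect(stem, rule) for tag, rule in pattern.items()}


if __name__ == "__main__":
    # The stem comes from <GStem range="1"> in the entry.
    for tag, form in paradigm("blomma", VB_711).items():
        print(tag, form)
```

Under these assumptions the PRA rule yields blommar, which agrees with the Spelling given for the Gmu in the entry.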


blommar 'flower'

<Parole
    lexiconname="Example_blommar"
    language="SWEDISH"
    version="MorfSynSem">
<ParoleMorpho>
    <MuS
        id="VB1871_2043"
        gramcat="VERB"
        autonomy="YES"
        synulist="US1871_2043">
        <Gmu
            range="0"
            naming="blommar"
            inp="VB_711">
            <Spelling>blommar</Spelling>
            <GStem
                range="1">
                <Spelling>blomma</Spelling>
            </GStem>
        </Gmu>
    </MuS>
    <GInP
        id="VB_711"
        comment="Verbs inflected like älska"
        example="älska">
        <CombMFCif
            combmf="PRA">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter>r</AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="PRP">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter>s</AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="PAA">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter>de</AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="PAP">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter>des</AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="INF">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter></AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="IMP">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter></AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="SUP">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter>t</AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="PAPART">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter>d</AddedAfter>
            </Cif>
        </CombMFCif>
        <CombMFCif
            combmf="PEP">
            <Cif
                stemind="1">
                <Removal></Removal>
                <AddedBefore></AddedBefore>
                <AddedAfter>nde</AddedAfter>
            </Cif>
        </CombMFCif>
    </GInP>
    <CombMF
        id="PAPART"
        mood="PASTPART">
    <CombMF
        id="SUP"
        mood="SUPINO">
    <CombMF
        id="PAP"
        tense="PAST"
        voice="PASSIVE">
    <CombMF
        id="PAA"
        tense="PAST"
        voice="ACTIVE">
    <CombMF
        id="PRP"
        tense="PRESENT"
        voice="PASSIVE">
    <CombMF
        id="PRA"
        tense="PRESENT"
        voice="ACTIVE">
    <CombMF
        id="PEP"
        mood="PRESPART">
    <CombMF
        id="INF"
        mood="INFINITIVE">
    <CombMF
        id="IMP"
        mood="IMPERATIVE">
</ParoleMorpho>
<ParoleSyntaxe>
    <SynU
        id="US1871_2043"
        description="D02">
        <CorrespSynUSemU
            targetsemu="USEMv_blommar2/1"
            correspondence="ISOmonovalent">
        <CorrespSynUSemU
            targetsemu="USEMv_blommar2/2"
            correspondence="ISOmonovalent">
        <CorrespSynUSemU
            targetsemu="USEMv_blommar2/3"
            correspondence="ISOmonovalent">
    </SynU>
    <Description
        id="D02"
        example="x slocknar"
        representativemu="slockna"
        self="S1SVB"
        construction="CXSE1">
    <Self
        id="S1SVB"
        intervconst="INTERVCONST0">
    <IntervConst
        id="INTERVCONST0"
        syntagmatl="V1SVB">
    <Construction
        id="CXSE1"
        syntlabel="CLAUSE"
        selfinsertion="1">
        <InstantiatedPositionC
            range="0"
            optional="NOO"
            positionc="NPSUBX">
    </Construction>
    <PositionC
        id="NPSUBX"
        naming="inanimate subjectnominal phrase or pronoun"
        function="SUBJECT"
        throle="AGENT"