SIMPLE LE4-8346

 

 

 

DANISH SIMPLE - LEXICON DOCUMENTATION

 

* * *

Document first version date: 28/8/1999
Document date: 25/4/2000
Document ID: Danish Simple-Lexicon Documentation Prefinal version prepared for evaluation May 19
Version: 02
Doc. type:
Document status: prefinal
Validation type:
Comments:

        Name               Organisation   Purpose
From    Bolette Pedersen   COP            Documentation
        Sanni Nimb
        Sussi Olsen
To      evaluation panel
 

 

 

1. General design information

1.1. Lexicon population

The Danish SIMPLE lexicon adds semantic descriptions to 8,200 of the 20,000 entries in the Danish PAROLE lexicon. Because of polysemy and homonymy, these 8,200 morphological entries amount to 10,000 semantic units: 7,000 nouns, 2,000 verbs and 1,000 adjectives. As of April 25, 9,700 semantic units (semu's) had been encoded.

The entries to be encoded in SIMPLE have been chosen on the basis of three different criteria:

In the case of nouns, we have aimed at a relatively ‘closed approach’ to lexicon population, so that all relevant readings of the particular words are encoded. We have primarily based our reading distinction strategy on a medium-sized monolingual lexicon as well as on corpus examinations (i.e. in some cases we have deviated from the lexicon because the corpus revealed either fewer or different ambiguities than the ones represented in the lexicon).

In the case of verbs, a closed approach has not been feasible, first of all because the Danish PAROLE lexicon did not adopt such an approach when describing the syntax of Danish verbs. For instance, Danish is characterised by a very frequent use of phrasal verb constructions (see also Section 2.7), and not all of these have been encoded in syntax.

In relation to lexicon population it is important for us to stress that the elaboration of a Danish computational lexicon does not stop with the PAROLE/SIMPLE project. An ongoing project at Center for Sprogteknologi is concerned with scaling up the PAROLE/SIMPLE lexicon to 100,000 semantic units (see Braasch et al. 1998). With respect to phrasal verbs in particular, our aim is to extend the existing phrasal verb descriptions so that they correspond better to the presence of phrasal verbs in Danish corpora.

1.2. Background resources

Two background resources have played an important role in the building of the Danish SIMPLE data, namely corpora and a medium-sized Danish lexicon. First of all, the decision was made very early in the project that all data should be described on the basis of corpus examinations and that each semantic unit should be supported by an illustrative example from the corpus. This means that if a meaning of a word occurs with significant frequency in the corpus, we represent it in the SIMPLE lexicon, even if the particular meaning is not represented in the traditional dictionary we use as our other important background resource (for instance the metaphorical meaning of puslespil (puzzle)). Conversely, if a meaning is represented in the dictionary but has no occurrences in the corpus, it has in most cases been omitted.

Our corpus examinations are primarily based on two corpora. The most important is the Berlingske corpus of about 20 mill. tokens, consisting of newspaper articles concerning various topics. In the cases where there are few or no examples of a given word in this corpus, the DK-korpus (Bergenholtz 1990), a balanced corpus of 4 mill. words composed of novels, newspapers, journals, magazines and miscellaneous, is used. We have chosen the corpus tool Xkwic (Christ 1993) for our corpus examinations. Xkwic is part of the IMS corpus toolbox developed at the University of Stuttgart and available through the internet.

Nudansk Ordbog is a medium-sized Danish lexicon with a rather consistent reading distinction policy. We have obtained the right to exploit this resource as long as the material is not used for commercial purposes. Almost all definitions have been extracted from an electronic version of this source. All encoded words in our lexicon include a definition; in cases where we did not find an appropriate definition in Nudansk Ordbog, either because the word was not represented or because the definition was for some reason inappropriate, we have written one ourselves. It has been of great help to have this resource as a reference point.

 

1.3. Material selected for the evaluation

For adjectives we have so far encoded only required information (cf. Lenci et al. 2000), whereas for nouns and verbs we have encoded both recommended and, in several cases, optional information. The two latter word classes, which represent 9/10 of the Danish SIMPLE material, therefore show much more of the Danish SIMPLE lexicon, and 50 noun meanings and 50 verb meanings have been selected for evaluation purposes. Owing to time constraints the words have not been selected very carefully. On the other hand, this rather random selection illustrates the lexicon as a whole well, since the material presented here has not been specially elaborated. As can be seen below, the material nevertheless represents different aspects of the SIMPLE ontology: concrete, abstract and event nouns are represented, as well as a large set of the ontological verb types.

The selected words are shown in the two lists below. Note that only 31 morphological noun units and 15 verb units have been selected, but that these spread into 100 different meanings owing to homonymy and polysemy. The SGML files eval_nouns_DK.sgml and eval_verbs_DK.sgml contain the morphological, syntactic and semantic units as well as all other SGML objects referred to in the entries. Both files have been successfully parsed using the English version of the DTD delivered by LexiQuest.

NOUNS: (eval_nouns_DK.sgml)

pige (girl)

koreaner (Korean)

bror (brother)

søn (son)

republikaner (republican)

forbruger (consumer)

opdrætter (breeder)

oplæser (reciter, newsreader)

biologi (biology)

årsag (cause)

marina (marina)

lystbådehavn (marina)

land (country)

skole (school)

bibliotek (library)

minut (minute)

aften (evening)

hjelm (helmet)

styrthjelm (crash helmet)

visir (visor)

menukort (menu)

spisekort (menu)

paraply (umbrella)

skadedyr (vermin)

papegøje (parrot)

enebær (juniper berry)

nellike (pink (flower), clove)

kanin (rabbit)

sild (herring)

storm (storm)

tordenvejr (thunder)

VERBS: (eval_verbs_DK.sgml)

finde (find, think of, think, stand (I can't stand it))

disponere (take the necessary steps, predispose, have disposal of)

bryde (break, change)

sende (send)

spille (act, play)

såre (hurt)

så (sow)

behandle (treat, cure)

bede (ask, pray)

male (paint, grind)

tale (speak, talk)

regne (rain, calculate, count on)

stoppe (stop, end)

læse (study, read)

springe (jump, explode)

The easiest way to ‘follow’ a word from morphology to semantics is simply to search for the word form throughout the file. For a verb like læse (study, read) this gives the results below. Note that the original Danish PAROLE lexicon covers 20,000 morphological units and around 60,000 syntactic units, so not all links are necessarily encoded in the semantic part of the lexicon, which covers only 10,000 semantic units:

MORPHOLOGY

<MuS (morphological unit)

id="UM029573"

naming="LÆSE"

gramcat="VERB"

gramsubcat="MAIN"

synulist="Usyn12 Usyn3796 Usyn3797 Usyn3798 Usyn3800 Usyn3801 Usyn3802 Usyn3803">

<Gmu

attestation="RO86"

inp="MFG0131">

<Spelling>læse</Spelling></Gmu></MuS>

 

SYNTAX

<SynU (syntactic unit)

id="Usyn3797"

naming="læse"

attestation="cn"

description="Dv2P-i"><correspSynUSemU

targetsemu="USEM_V_læse_COE_1"

correspondence="arg12i"></SynU>

 

<SynU

id="Usyn12"

naming="læse"

attestation="cn"

description="Dv2N0"><correspSynUSemU

targetsemu="USEM_V_læse_COE_1"

correspondence="arg12"><correspSynUSemU

targetsemu="USEM_V_læse_COE_3"

correspondence="arg12"></SynU>

<SynU

id="Usyn3800"

naming="læse"

attestation="cn"

description="Dv2P-paa"></SynU>

<SynU

id="Usyn3801"

naming="læse"

attestation="cn"

description="Dv2P-til"><correspSynUSemU

targetsemu="USEM_V_læse_COE_2"

correspondence="arg12til"></SynU>

<SynU

id="Usyn3802"

naming="læse"

attestation="cn"

description="Dv2xP0-op-til"><correspSynUSemU

targetsemu="USEM_V_læse_op_COE_1"

correspondence="arg12til"></SynU>

<SynU

id="Usyn3796"

naming="læse"

attestation="cn"

description="Dv3N0P0-for">

<correspSynUSemU

targetsemu="USEM_V_læse_SPE_1"

correspondence="arg122P"></SynU>

<SynU

id="Usyn3803"

naming="læse"

attestation="cn"

description="Dv2t"><correspSynUSemU

targetsemu="USEM_V_læse_COE_1"

correspondence="arg12t"></SynU>

<SynU

id="Usyn3798"

naming="læse"

attestation="cn"

description="Dv2xN0-op">

<correspSynUSemU

targetsemu="USEM_V_læse_op_SPE_1"

correspondence="arg12"></SynU>

SEMANTICS

COGNITIVE EVENTS

<SemU

id="USEM_V_læse_COE_1"

naming="læse"

example=" Det er ikke en bog , man gider at læse to gange , men sjov er den . "

comment="full BSP"

freedefinition="se på og forstå en tekst (NDONY)" /look at and understand a text/

weightvalsemfeaturel="

WVSFTemplateCognitiveEventPROT

WVSFTemplateSuperTypePsychologicalEventPROT

WVSFEventTypeProcessPROT

TSVP_Cognition_TS_classificateur_de_verbe">

<PredicativeRepresentation

typeoflink="MASTER"

predicate="PREDhumsem_COE_1">

/selectional restrictions ARG1=human ARG2=semiotic /

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_erkendelsesproces_COE_1"

semr="SRIsa">

</SemU>

<SemU

id="USEM_V_læse_op_COE_1"

naming="læse_op (til)"

example=" På en videregående uddannelse kan man ikke , som på gymnasiet , bare læse op til eksamen "

comment="full BSP"

freedefinition="forberede sig til en eksamen" /prepare for an exam/

weightvalsemfeaturel="

WVSFTemplateCognitiveEventPROT

WVSFTemplateSuperTypePsychologicalEventPROT

WVSFEventTypeProcessPROT

TSVP_Cognition_TS_classificateur_de_verbe">

<PredicativeRepresentation

typeoflink="MASTER"

predicate="PREDhum_COE_1">

/selectional restriction ARG1=human ARG2=unrestricted/

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_erkendelsesproces_COE_1"

semr="SRIsa">

</SemU>

 

<SemU

id="USEM_V_læse_COE_2"

naming="læse"

example=" En ordentlig arbejder , der ville frem i geledderne måtte helst læse til cand.polit "

comment="full BSP"

freedefinition=" være ved at tage en boglig uddannelse i noget (NDONY)" /take an education to become something/

weightvalsemfeaturel="

WVSFTemplateCognitiveEventPROT

WVSFTemplateSuperTypePsychologicalEventPROT

WVSFEventTypeProcessPROT

TSVP_Cognition_TS_classificateur_de_verbe">

<PredicativeRepresentation

typeoflink="MASTER"

predicate="PREDhumprof_COE_1">

/selectional restriction ARG1=human ARG2=profession/

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_erkendelsesproces_COE_1"

semr="SRIsa">

</SemU>

<SemU

id="USEM_V_læse_COE_3"

naming="læse"

example="Han trådte som 20-årig ind i redemtoristordenen og læste teologi hos Mauterne i Østrig "

comment="full BSP"

freedefinition=" være ved at tage en boglig uddannelse i noget (NDONY)"

weightvalsemfeaturel="

WVSFTemplateCognitiveEventPROT

WVSFTemplateSuperTypePsychologicalEventPROT

WVSFEventTypeProcessPROT

TSVP_Cognition_TS_classificateur_de_verbe">

<PredicativeRepresentation

typeoflink="MASTER"

predicate="PREDhumdom_COE_1">

/selectional restriction ARG1=human ARG2=domain /

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_erkendelsesproces_COE_1"

semr="SRIsa">

</SemU>

SPEECH ACTS

<SemU

id="USEM_V_læse_op_SPE_1"

naming="læse_op"

example="jeg er heller ikke i stand til at læse op , hvad mine medarbejdere skriver"

comment="full SN"

freedefinition="udtale noget skrevet, så andre kan høre det (NDONY)" /read aloud/

weightvalsemfeaturel="

WVSFTemplateSpeechActPROT

WVSFTemplateSuperTypeActPROT

WVSFEventTypeProcessPROT

TSVP_COMMUNICATION_TS_classificateur_de_verbe">

<PredicativeRepresentation

typeoflink="MASTER"

predicate="PRED2hum_sem_SPE_1">

/selectional restriction ARG1=human ARG2=semiotic /

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_talehandling_SPE_1"

semr="SRIsa">

</SemU>

 

<SemU

id="USEM_V_læse_SPE_1"

naming="læse"

example="han læste for pigen "

comment="full SN"

freedefinition="læse højt af en tekst for nogen" /read aloud to somebody /

weightvalsemfeaturel="

WVSFTemplateSpeechActPROT

WVSFTemplateSuperTypeActPROT

WVSFEventTypeProcessPROT

TSVP_COMMUNICATION_TS_classificateur_de_verbe">

<PredicativeRepresentation

typeoflink="MASTER"

predicate="PRED3hum_sem_hum_SPE_1">

/selectional restrictions ARG1=human ARG2=semiotic (can be omitted) ARG3=human/

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_talehandling_SPE_1"

semr="SRIsa">

</SemU>
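The walk from morphology to semantics illustrated above for læse can also be sketched programmatically. The following Python fragment is only a minimal illustration, not part of the project's tooling: it assumes a simplified one-tag-per-line layout of the entries, uses regular expressions instead of a real SGML parser, and the function name follow is hypothetical. The attribute values in the sample are copied from the læse entries above (abbreviated to three syntactic units).

```python
import re

# Abbreviated excerpt of the læse entries shown above. The compact layout
# is an assumption made for this sketch; the values come from the document.
SAMPLE = '''
<MuS id="UM029573" naming="LÆSE" synulist="Usyn12 Usyn3801 Usyn3800">
<SynU id="Usyn12" naming="læse" description="Dv2N0">
<correspSynUSemU targetsemu="USEM_V_læse_COE_1" correspondence="arg12">
<correspSynUSemU targetsemu="USEM_V_læse_COE_3" correspondence="arg12">
</SynU>
<SynU id="Usyn3800" naming="læse" description="Dv2P-paa"></SynU>
<SynU id="Usyn3801" naming="læse" description="Dv2P-til">
<correspSynUSemU targetsemu="USEM_V_læse_COE_2" correspondence="arg12til">
</SynU>
'''

def follow(text, mus_id):
    """Return {SynU id: [target SemU ids]} for one morphological unit."""
    # Locate the start tag of the morphological unit with the given id.
    mus = re.search(r'<MuS\b[^>]*\bid="%s"[^>]*>' % re.escape(mus_id), text)
    if mus is None:
        return {}
    attr = re.search(r'synulist="([^"]*)"', mus.group(0))
    syn_ids = attr.group(1).split() if attr else []
    links = {}
    for syn_id in syn_ids:
        # A SynU block runs from its start tag to the first </SynU>.
        block = re.search(
            r'<SynU\b[^>]*\bid="%s".*?</SynU>' % re.escape(syn_id),
            text, re.S)
        links[syn_id] = (re.findall(r'targetsemu="([^"]*)"', block.group(0))
                         if block else [])
    return links

links = follow(SAMPLE, "UM029573")
```

A syntactic unit with an empty target list, such as Usyn3800 here, corresponds to a syntactic description that has no link into the semantic part of the lexicon.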

We also include as evaluation material two papers (Nimb & Pedersen 2000, Pedersen & Nimb 2000) where we focus on metaphoric senses and on phrasal verbs, respectively. These papers give a more thorough description as well as the linguistic background of the specific phenomena that have required special attention during the Danish lexicon encoding.

1.4. Current Lexicon Contents

Table 1: Overall statistics

Number of full semu's linked to syntax and morphology          9,700 (by April 25)
Number of predicative semu's                                   2,035
Semu's per category:
    Nouns (required, recommended and optional information)     6,700
    Verbs (required, recommended and optional information)     2,000
    Adjectives (required information only)                     1,000
Number of dummies                                              approx. 1,000

The following schemas show the templates represented in the lexicon.

CONCRETE NOUN TEMPLATES REPRESENTED:

Part

Body part

Group

Human group

Concrete entity

Location

3D location

Geopol

Area

Openings

Building

Artifactual area

Material

Artifact

Artifact material

Furniture

Clothing

Container

Artwork

Instrument

Money

Vehicle

Semiotic artifact

Food

Artifact food

Flavouring

Physical object

Organic object

Animal

Earth

Air

Water

Human

People

Ideo

Kinship

Social status

Agent of temporary activity

Agent of persistent activity

Profession

Vegetal

Plant

Flower

Fruit

Microorganism

Substance

Natural Substance

Substance food

Drink

Artifactual drink

CONCRETE NOUN TEMPLATES NOT REPRESENTED

Entity

Living entity

Role

ABSTRACT NOUN TEMPLATES REPRESENTED:

Quality

Social property

Psychical property

Physical property

Colour

Physical power

Shape

Representation

Information

Language

Number

Sign

Unit of measurement

Abstract

Cognitive fact

Convention

Domain

Institution

Moral standards

Time

ABSTRACT NOUN TEMPLATES, NOT REPRESENTED:

Property

Movement of thought

 

EVENT TEMPLATES REPRESENTED:

Event

Weather

Cause Aspectual

Aspectual

State

Exist

Relational state

Identificational state

Constitutive state

Stative location

Stative possession

Act

Non-relational act

Relational act

Purpose act

Move

Caused Motion

Speech act

Reporting event

Commissives

Cognitive event

Judgment

Caused experience event

Perception

Change

Relational change

Change possession

Change Location

Natural transition

Change of State

Change of Value

Acquire knowledge

Cause Change

Creation

Physical creation

Mental creation

Symbolic creation

Copy creation

Cause relational change

Cause Change of State

Cause change of value

Cause change of location

Cause natural transition

EVENT TEMPLATES, NOT REPRESENTED

Disease

Stimuli

Cooperative Act

Cause Act

Cooperative Speech act

Directives

Expressives

Declaratives

Psychological event

Experience Event

Modal event

Constitutive change

Cause constitutive change

Give knowledge

 

PROPERTY TEMPLATES REPRESENTED

Modal

Temporal

Emotive

Manner

Emphasizer

Physical property

Psychological property

Social property

Temporal property

Relational property

Intensional

PROPERTY TEMPLATES NOT REPRESENTED

Object-related

Intensifying property

Extensional

 

1.5. Validation

In order to check the formal consistency of our encoded SGML templates we have adapted an SGML parser which validates our files against the document type definition (DTD).

 

Apart from the validation performed by the SGML parser, we have developed a few Unix procedures which help check for other sources of mistakes. One procedure compares ‘id’ and ‘naming’ and produces a list of the semantic units where the two do not agree. Another produces a list of the target semu’s referred to via the semantic relations in the qualia structure and checks these against the entries already encoded. This list is essentially a list of dummy candidates (i.e. words that have not been fully encoded yet and should therefore be established as dummy semu’s), but it is checked manually so that wrong references, misspellings, empty targets and other mistakes are sorted out. This is possible only because every ‘id’ carries an abbreviation of the ontological type to which the unit belongs (e.g. USEM_V_bevæge_sig_MOV_1). Only when a word has more than one sense within the same ontological type do the different senses receive consecutive reading numbers (e.g. USEM_N_kort_SEM_1 vs. USEM_N_kort_SEM_2).

As regards purely linguistic consistency checking, a great deal of work remains. Although the lexical guidelines (Lenci et al. 2000) have ensured a large degree of consistency between the different parts of the lexicon by providing templates for each ontological type, many cases of inconsistency can still be found. A browser helps us ensure that the use of relations is appropriate; for instance, hyponyms and hyperonyms are checked on the lexicon material in order to discover whether the members of a homogeneous semantic class refer to the same hyperonym, and whether the hyperonyms of a given hyponym really are hyperonyms at the same level of analysis.
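The two checks described above can be illustrated with a short Python sketch. It is not the Unix procedures actually used in the project; the id pattern is inferred from the examples in this document (USEM, part of speech, lemma, three-letter ontological type, reading number), and all function names as well as the misspelled sample entry are hypothetical.

```python
import re

# Pattern inferred from ids such as USEM_V_bevæge_sig_MOV_1 (an assumption).
ID_PATTERN = re.compile(r'^USEM_(N|V|ADJ)_(.+)_([A-Z]{3})_(\d+)$')

def parse_id(semu_id):
    """Split a semu id into its parts, or return None if it does not fit."""
    m = ID_PATTERN.match(semu_id)
    if m is None:
        return None
    pos, lemma, onto_type, reading = m.groups()
    return {"pos": pos, "lemma": lemma, "type": onto_type,
            "reading": int(reading)}

def naming_mismatches(semus):
    """Return the (id, naming) pairs whose lemma part differs from naming."""
    flagged = []
    for semu_id, naming in semus:
        parsed = parse_id(semu_id)
        if parsed is None or parsed["lemma"] != naming:
            flagged.append((semu_id, naming))
    return flagged

def dummy_candidates(encoded_ids, targets):
    """Return referenced targets that have no encoded entry yet."""
    return sorted(set(targets) - set(encoded_ids))

# Illustrative data: the second naming is deliberately misspelled.
semus = [("USEM_N_kort_SEM_1", "kort"), ("USEM_N_trappe_ART_1", "trapppe")]
flagged = naming_mismatches(semus)
candidates = dummy_candidates(
    ["USEM_N_trappe_ART_1"],
    ["USEM_N_trin_ART_1", "USEM_N_trappe_ART_1"])
```

As in the procedure described above, the output of such a script is only a list of candidates for manual inspection; a target that fails the id pattern may be a misspelling or simply an entry that follows a different naming scheme.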

1.6. Remaining work

Within the scope of the SIMPLE project, 300 nouns need to be encoded, linked to syntax and parsed. Further linguistic validation of the whole lexicon material is also foreseen in the last phase of the project.

2. Semantic encoding

2.1. Criteria for Syntax-Semantic linking

Non-predicative nouns are linked simply by relating each syntactic unit to the semantic unit(s) to which it corresponds. In the case of adresse (address), two links are established from one syntactic unit: one to a ‘representation’ interpretation, as in brevet skal være forsynet med navn og adresse på bagsiden (the letter should be supplied with name and address on the back), and one to a ‘location’ interpretation, as in folk afstår fra at flytte ind på visse adresser (people refrain from moving into certain addresses):

<SynU

id="Usyn10003"

naming="adresse"

attestation="ns"

description="Dn0">

<CorrespSynUSemU

targetsemu="USEM_N_adresse_REP_1">

<CorrespSynUSemU

targetsemu="USEM_N_adresse_LOC_1"></SynU>

For events, a linking procedure between syntactic complements and semantic arguments has also been established. Here we have followed the LINDA specifications (Underwood et al. 1996), which give a principled analysis of the argument structure of Danish verbs and nouns. For a further description of the argument structure applied in this lexicon we therefore refer to that manual.

In the syntactic unit below for ride (ride) we can see how the valency pattern Dv2P-paa in syntax is mapped onto the semantic frame arg12paa by means of the feature ‘correspondence’:

<SynU

id="Usyn4713"

naming="ride"

attestation="n"

description="Dv2P-paa"><correspSynUSemU

targetsemu="USEM_V_ride_MOV_1"

correspondence="arg12paa"></SynU>

The correspondence feature is further specified below where it can be seen how each complement (position) in syntax is linked to an argument in semantics; thus subject is linked to ARG1 and the valency bound prepositional phrase to ARG2:

<Correspondence

id="arg12paa"

naming="mapping for divalent verb with prepositional object"

corresargpos1="ARG1_P_CNPrsubj ARG2_P_CPP-paa">
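To make the mapping concrete, the value of corresargpos1 can be read as a list of argument/position pairs. The Python sketch below assumes, on the basis of this single example only, that each token has the form argument, then the separator "_P_", then the syntactic position; the function name is hypothetical and the real encoding may allow other token shapes.

```python
def parse_correspondence(value):
    """Map each semantic argument to its syntactic position.

    Assumes tokens of the form '<ARG>_P_<position>', as in the
    arg12paa example; this format is inferred, not specified.
    """
    mapping = {}
    for token in value.split():
        arg, sep, position = token.partition("_P_")
        if not sep:
            raise ValueError("unexpected token: %r" % token)
        mapping[arg] = position
    return mapping

arg12paa = parse_correspondence("ARG1_P_CNPrsubj ARG2_P_CPP-paa")
```

For arg12paa this yields ARG1 paired with the subject position and ARG2 paired with the prepositional phrase headed by på, matching the description given above.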

In some cases more than one description is given in a syntactic unit, and it is then sometimes necessary to specify which description links to which semantic unit. The example below shows bevæge (Dv4NPa0Pa0-fra-til; ‘move’, causative) and bevæge sig (Dv3refNPa0Pa0-fra-til; ‘move’, reflexive, decausative). The two descriptions link to the semantic templates CAUSED MOTION and MOVE, respectively:

<SynU

id="Usyn3515"

naming="bevæge"

attestation="cn"

description="Dv4NPa0Pa0-fra-til"

descriptionl="Dv3refNPa0Pa0-fra-til">

<correspSynUSemU

targetsemu="USEM_V_bevæge_sig_MOV_1"

correspondence="arg1_ADJ_ADJfratil"

description="Dv3refNPa0Pa0-fra-til">

<correspSynUSemU

targetsemu="USEM_V_bevæge_CAM_1"

correspondence="arg12_ADJ_ADJfratil"

descriptionl="Dv4NPa0Pa0-fra-til">

</SynU>

2.2. Criteria for assigning Domain Features

Most of the vocabulary for this deliverable belongs to the domain General. Specific readings belonging to particular domains have been assigned an appropriate domain from the domain list. With respect to domain assignment we have to a large degree followed the encodings made in Nudansk Ordbog. See Section 3 for the statistics for Domain.

2.3. Criteria for assigning Semantic Class and Template Type

Semantic Class and Template Type have been assigned according to the guidelines given by the Specification Group. In most cases the templates are so well defined in the guidelines that assigning templates to words has been more or less unproblematic. In some cases, however, the features proposed in the templates have been too specific to account for all the words that would naturally fit into the template. This is particularly the case for events. For example, the template CHANGE_LOCATION has the event type ‘transition’ as a type-defining feature. In the Danish lexicon, however, we have encountered several ‘change of location’ verbs that denote processes rather than transitions, such as falde (fall) and dale (descend), where the result phase is not implicitly expressed. One could argue that such verbs should rather be encoded under the template MOVE. But the ‘change of location’ feature seems so essential for these two verbs that it does not seem appropriate to encode them as ‘manner of motion’ verbs either.

For the group of abstract nouns, too, we have sometimes found it difficult to assign templates to the words. Too many words did not seem to fit into the seven more specific abstract template types and therefore simply had to be assigned the mother node "abstract entity". In this template group we therefore find very different words such as alibi (alibi), fødekæde (food chain) and harmoni (harmony), which do not share much meaning content. We also found it somewhat difficult to distinguish between the groups "Moral Standards" and "Cognitive Fact", for instance in the case of the word holdning (attitude), which on the one hand just means a way of thinking about something, but on the other hand could be considered a matter of morals. In the template group "Cognitive Fact" we have encoded words of ‘thinking’ such as tanke (thought) and viden (knowledge), but also words of ‘feeling’ such as jalousi (jealousy) and henrykkelse (delight), though one could discuss whether these words of ‘feeling’ are events rather than cognitive facts.

2.3.1. Language specific typing

In accordance with the guidelines, argument structure and selectional restrictions are encoded as language specific typing. For the basis of selectional restrictions we have applied the template ontology by stating restrictions as follows:

<InformArg

id="ArgHuman"

comment="human"

status="CHECK"

weightvalsemfeaturel="WVSFTemplateHumanPROT">

<InformArg

id="ArgAnimal"

comment="animal"

status="DEFAULTCHECK"

weightvalsemfeaturel="WVSFTemplateAnimalPROT">

<InformArg

id="ArgHumanAnimal"

comment="animal or human"

status="CHECK"

weightvalsemfeaturel="WVSFTemplateHumanPROT WVSFTemplateAnimalPROT">

<InformArg

id="ArgHumanVehicle"

comment="human or vehicle"

status="CHECK"

weightvalsemfeaturel="WVSFTemplateHumanPROT WVSFTemplateVehiclePROT">
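As an illustration of how such restrictions might be consulted, the Python sketch below reads the template names out of the weightvalsemfeaturel values shown above and tests whether a candidate argument filler is admitted. It is a simplified reading, not the project's implementation: it treats several listed templates as alternatives (as the comments "animal or human" and "human or vehicle" suggest), it ignores the template hierarchy, and the function names are hypothetical.

```python
# Values copied from the InformArg objects above.
INFORM_ARGS = {
    "ArgHuman": "WVSFTemplateHumanPROT",
    "ArgAnimal": "WVSFTemplateAnimalPROT",
    "ArgHumanAnimal": "WVSFTemplateHumanPROT WVSFTemplateAnimalPROT",
    "ArgHumanVehicle": "WVSFTemplateHumanPROT WVSFTemplateVehiclePROT",
}

PREFIX, SUFFIX = "WVSFTemplate", "PROT"

def allowed_templates(feature_list):
    """Extract the template names from a weightvalsemfeaturel value."""
    names = set()
    for token in feature_list.split():
        if token.startswith(PREFIX) and token.endswith(SUFFIX):
            names.add(token[len(PREFIX):-len(SUFFIX)])
    return names

def satisfies(arg_id, filler_template):
    """True if the filler's template is one of the listed alternatives.

    Exact-match only: subtypes of a listed template are not considered.
    """
    return filler_template in allowed_templates(INFORM_ARGS[arg_id])

vehicle_ok = satisfies("ArgHumanVehicle", "Vehicle")
animal_ok = satisfies("ArgHuman", "Animal")
```

A fuller check would also have to accept fillers whose template is a subtype of a listed one (for instance Profession under Human), which this sketch deliberately leaves out.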

We have found it rather inconvenient that optionality is encoded here as a value to the feature ‘status’. Since optionality is already stated in syntax we see no reason for encoding it again here (although we have followed the guidelines in this respect).

2.3.2. Template subtyping for language specific encoding

The very large number of semantic units represented under the template ARTIFACT (457 semu’s) indicates that this category may require further splitting. We have felt the need for an additional subtemplate denoting electronic or mechanical devices.

The interesting thing about electronic and mechanical devices is that they exhibit a different distribution from other artifacts, in the sense that they can ‘work by themselves’ and can thus often fill selectional slots in much the same way as human beings. This holds in particular for computers; consider for example the following corpus excerpt:

Så spørger computeren om cyklisten holder rigtigt og børnene skal så ved hjælp af musen klikke på enten ‘ja’ eller ‘nej’

(then the computer asks whether the biker is in the right place and the kids are then to click on either ‘yes’ or ‘no’ with the mouse)

2.3.3. Criteria for encoding Semantic Relations

We have focused on linguistically relevant semantic relations. All type-defining, obligatory semantic relations have been encoded. Apart from this some essential relations have been encoded in cases where we believed them to have strong linguistic relevance. In most cases, we have followed the definition given in Nudansk Ordbog. This means that when a feature has been represented as part of the definition for a given word, we have included this feature as a semantic relation in the formal part of the semantic unit.

Consider the relation ‘has_as_parts’. In many cases this is a semantic relation which describes what we would call a ‘world-knowledge’ aspect of a word. For instance, we would not encode a ‘has_as_parts’ relation on the noun hus (house), since we believe that it is not linguistically crucial for this word that a house contains walls, a roof, floors, windows etc. This hypothesis is supported by the definition in Nudansk Ordbog for the word hus: en bygning som udgør en selvstændig enhed, og som anvendes til beboelse (a building which constitutes an independent unit and which is used for habitation). In contrast, for the noun trappe (staircase) the definition does imply a ‘has_as_parts’ relation: et antal sammenhængende trin som man kan gå op el. ned ad (a number of connected steps which one can go up or down); thus this word is encoded with the relation trappe ‘has_as_parts’ trin:

<SemU

id="USEM_N_trappe_ART_1"

naming="trappe"

example=" Ruten i Leeds er uhyggelig hård - indeholder således en lang trappe, der skal forceres med cyklen på ryggen"

comment="full BSP"

freedefinition=" et antal sammenhængende trin som man kan gå op el. ned ad (NDO)"

weightvalsemfeaturel="

WVSFTemplateArtifactPROT

WVSFUnificationPathConcreteentity-Agentive-TelicPROT

TSVP_ARTIFACT_TS_classificateur_de_nom_C">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_genstand_ENT_1"

semr="SRIsa">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_V_fremstille_1"

semr="SRCreatedby">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation, gå op og ned"

target="USEM_V_gå_1"

semr="SRUsedfor">

<RWeightValSemU

weight="ESSENTIAL"

comment="Semantic relation"

target="USEM_N_trin_ART_1"

semr="SRHasaspart">

</SemU>

A similar situation can be found with many compounds in Danish. Here an essential (non-type-defining) feature can often be used to express exactly the relation that holds between the two parts of the compound; consider for instance the examples below of two kinds of containers in Danish, vinflaske (wine bottle), which ‘contains’ vin (wine), and blikdåse (tin can), which is ‘made of’ blik (tin):

<SemU

id="USEM_N_vinflaske_CON_1"

naming="vinflaske"

example="en vinflaske kan genbruges syv til otte gange"

comment="full BKK"

freedefinition="flaske til vin"

weightvalsemfeaturel="

WVSFTemplateContainerPROT

WVSFUnificationPathConcreteentity-ArtifactAgentive-TelicPROT

TSVP_NOTION_TS_classificateur_de_nom_C">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_flaske_CON_1"

semr="SRIsa">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_V_fremstille_1"

semr="SRCreatedby">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_V_indeholde_1"

semr="SRUsedfor">

<RWeightValSemU

weight="ESSENTIAL"

comment="Semantic relation"

target="USEM_N_vin_ARD_1"

semr="SRContains">

</SemU>

 

<SemU

id="USEM_N_blikdåse_CON_1"

naming="blikdåse"

example="en urtepotteunderskål, hvori man omvendt har sat en tom blikdåse, som fyldes med vand"

comment="full BKK"

freedefinition="dåse lavet af blik"

weightvalsemfeaturel="

WVSFTemplateContainerPROT

WVSFUnificationPathConcreteentity-ArtifactAgentive-TelicPROT

TSVP_NOTION_TS_classificateur_de_nom_C">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_dåse_CON_1"

semr="SRIsa">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_V_fremstille_1"

semr="SRCreatedby">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_V_indeholde_1"

semr="SRUsedfor">

<RWeightValSemU

weight="ESSENTIAL"

comment="Semantic relation"

target="USEM_N_blik_ARS_1"

semr="SRMadeof">

</SemU>

In general, we have applied a template-driven approach in the sense that each encoder has been responsible for a specific set of templates, in order to ensure as large a degree of consistency among encoders as possible as regards the semantic relations to be applied within a template type. For instance, we have striven for a homogeneous level of specificity as well as a consensus on which of the more general target semu’s to apply for each relation.

2.3.4. Criteria for encoding Derivation Relations.

Derivation relations are not encoded in the Danish lexicon.

2.3.5. Encoding of synonymy and polysemy relations

Synonymy

We have chosen to give information on synonyms in the cases where a synonym is mentioned in the Danish dictionary we use to retrieve our definitions, as long as the synonym is represented in the PAROLE dictionary.

An example is given below by the two verbs knække and brække (both meaning "cause to break"), encoded under the template "cause change of state":

1) <SemU

id="USEM_V_brække_CCS_1"

naming="brække"

example="Jeg var målløs. Han sparkede på bilen, knuste lygterne og brækkede antennen"

comment="full BC 200203548 SN"

freedefinition="få noget til at brække(NDO)"

weightvalsemfeaturel="

WVSFTemplateCauseChangeofStatePROT

WVSFTemplateSuperTypeCauseRelationalChangePROT

WVSFEventTypeTransitionPROT

TSVP_CHANGE_TS_classificateur_de_verbe_C">

<PredicativeRepresentation

typeoflink="MASTER"

predicate="PRED_brække_CCS_1">

<RWeightValSemU

semr="SRAgentiveCause"

target="USEM_V_ændre_CCS_1"

weight="PROTOTYPICAL">

<RWeightValSemU

semr="SRResultingState"

target="USEM_ADJ_itu_QUA_1"

weight="PROTOTYPICAL">

<RWeightValSemU

weight="ESSENTIAL"

comment="Synonym relation"

target="USEM_V_knække_CCS_1"

semr="SRSynonym">

</SemU>

2)

<SemU

id="USEM_V_knække_CCS_1"

naming="knække"

example="hvis man knækker skaftet udleveres en ny spade"

comment="full BC 200203548 SN"

freedefinition="få noget til at knække (NDO)"

weightvalsemfeaturel="

WVSFTemplateCauseChangeofStatePROT

WVSFTemplateSuperTypeCauseRelationalChangePROT

WVSFEventTypeTransitionPROT

TSVP_CHANGE_TS_classificateur_de_verbe_C">

<PredicateRepresentation

typeoflink="MASTER"

predicate="PRED_knække_CCS_1">

<RWeightValSemU

semr="SRAgentiveCause"

target="USEM_V_ændre_CCS_1"

weight="PROTOTYPICAL">

<RWeightValSemU

semr="SRResultingState"

target="USEM_ADJ_itu_QUA_1"

weight="PROTOTYPICAL">

<RWeightValSemU

weight="ESSENTIAL"

comment="Synonym relation"

target="USEM_V_brække_CCS_1"

semr="SRSynonym">

</SemU>

We imagine that links between synonyms in the dictionary could be very useful for many purposes, for instance in information retrieval applications. Encoding synonyms together also speeds up the encoding process, since the entries for two, or sometimes even three, synonymous words can easily be made at the same time.
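As a rough illustration of how such synonym links might be used, the following Python sketch stores the two SRSynonym links from the entries above in a plain dictionary, checks that every link has a matching link in the opposite direction, and expands a query term with its synonyms. The dictionary layout and the function names missing_back_links and expand_query are assumptions made for this example; they are not part of the SIMPLE model or of any tool used in the project.

```python
# Hypothetical, simplified view of the synonym links shown above:
# each semantic unit id maps to the ids it points to via SRSynonym.
synonym_links = {
    "USEM_V_brække_CCS_1": {"USEM_V_knække_CCS_1"},
    "USEM_V_knække_CCS_1": {"USEM_V_brække_CCS_1"},
}


def missing_back_links(links):
    """Return (source, target) pairs whose reverse synonym link is absent."""
    missing = []
    for source, targets in sorted(links.items()):
        for target in sorted(targets):
            if source not in links.get(target, set()):
                missing.append((source, target))
    return missing


def expand_query(term_id, links):
    """Return the unit itself plus its directly linked synonyms."""
    return {term_id} | links.get(term_id, set())


if __name__ == "__main__":
    print(missing_back_links(synonym_links))
    print(sorted(expand_query("USEM_V_brække_CCS_1", synonym_links)))
```

A check of this kind could be run after each encoding session, since the two synonym entries are typically written at the same time.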

 

Polysemy

Regular polysemy - when groups of related words display the same ambiguity - is handled in a uniform way in the SIMPLE model via the identification of a set of well-established regular semantic classes for nouns, which are adjusted for each of the languages involved. While unsystematically ambiguous readings of a word are represented as totally unrelated semantic units, regular polysemous senses can be encoded as interlinked semantic units. This is represented by the information slot complex, whose value is the polysemous class to which the semantic unit belongs, as seen below for Dragør (a Danish village) in the semantic unit for the human group sense of the word:

<SemU

id="USEM_N_Dragør_HUG_1"

naming="Dragør"

example=" Dragør må i år af med godt 31 mill. kr. til den kommunale udligning"

/This year Dragør must pay approx. 31 mill. crowns to the community equalization /

comment="full BSP"

freedefinition="de mennesker der bor i Dragør eller som træffer belutningerne der"

weightvalsemfeaturel="

WVSFTemplateHumanGroupPROT

WVSFTemplateSuperTypeGroupPROT

TSVP_GROUP_NAMES_TS_classificateur_de_nom_C">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_befolkning_HUG_1"

semr="SRIsa">

<RWeightValSemU

weight="PROTOTYPICAL"

comment="Type-defining semantic relation"

target="USEM_N_indbygger_HUM_1"

semr="SRHasasmember">

<RWeightValSemU

weight="PROTOTYPICAL"

target="USEM_N_Dragør_GEO_1"

semr="SRPolysemyHumanGroup-GeopoliticalLocation">

</SemU>
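The polysemy link in the entry above is expressed through the relation name itself. The following Python sketch shows one way such a name could be split into its two classes and turned into the name of the opposite relation. It assumes that the name is always the prefix SRPolysemy followed by two class names joined by a hyphen, and that the first class is that of the unit carrying the relation; both points are inferred from the single example shown here, and the function names are hypothetical.

```python
POLYSEMY_PREFIX = "SRPolysemy"


def split_polysemy_relation(semr):
    """Split a polysemy relation name into (first class, second class)."""
    if not semr.startswith(POLYSEMY_PREFIX):
        raise ValueError(f"not a polysemy relation: {semr}")
    first, second = semr[len(POLYSEMY_PREFIX):].split("-", 1)
    return first, second


def inverse_relation(semr):
    """Build the relation name with the two classes swapped."""
    first, second = split_polysemy_relation(semr)
    return f"{POLYSEMY_PREFIX}{second}-{first}"


if __name__ == "__main__":
    name = "SRPolysemyHumanGroup-GeopoliticalLocation"
    print(split_polysemy_relation(name))
    print(inverse_relation(name))
```

The swapped name corresponds to the entry GeopoliticalLocation-HumanGroup in the list of polysemy relations applied; whether the geopolitical unit for Dragør actually carries it is not shown in this document.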

In the Danish lexicon the most productive cases of regular polysemy involving concrete nouns are the following:

Other well-known polysemous pairs are not productive in Danish, as for example 'people / language' and 'flower / colour', where only a few examples of each can be found. This difference relates to the distinction made by Apresjan (apud Malmgren, 1988) between productive and regular polysemy. Here productive polysemy refers to cases where more or less the whole group of nouns within a semantic class display the same polysemy relations, whereas regular polysemy refers to cases where at least two words - but not the whole class - follow the same polysemy pattern.
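The distinction between productive and regular polysemy can be stated as a simple decision rule. The Python sketch below is only an illustration of that rule: the threshold productive_share, which stands in for "more or less the whole group", is an arbitrary assumption, and the function name classify_polysemy is hypothetical. No counts for Danish are implied.

```python
def classify_polysemy(words_in_class, words_with_pattern, productive_share=0.8):
    """Classify a polysemy pattern within one semantic class.

    productive_share is an assumed cut-off for 'more or less the whole
    group'; the source gives no numeric threshold.
    """
    if words_in_class <= 0:
        raise ValueError("words_in_class must be positive")
    if not 0 <= words_with_pattern <= words_in_class:
        raise ValueError("words_with_pattern must lie between 0 and words_in_class")
    if words_with_pattern < 2:
        # A pattern followed by fewer than two words is not regular at all.
        return "not regular"
    if words_with_pattern / words_in_class >= productive_share:
        return "productive"
    return "regular"


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    print(classify_polysemy(50, 48))
    print(classify_polysemy(50, 3))
```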

A more extensive, empirically based study of regular polysemous classes of Danish nouns has not yet been carried out. However, the corpus-oriented approach used during the encoding of the Danish SIMPLE lexicon facilitates the identification of new polysemous classes, since differences in the distributional patterns of the encoded word senses are a good indication of whether a regular polysemy relation could be involved. It should be noted, however, that the common polysemy classes established in the project are not totally unproblematic in this respect. One would expect the established classes to exhibit different distributional patterns in the corpus; however, this is not always the case. A well-established test for examining such patterns is the so-called zeugma test: two different senses of a word are expected to create a zeugma (i.e. nonsense) if they are put together in the same phrase, as is the case for the regular polysemy class that holds between geopolitical location and human group:

*Danmark, som er et fladt og grønt land, nedlagde veto mod forslaget i Europakommissionen

(Denmark, which is a flat and green country, vetoed the proposal in the European Commission)

Nevertheless, for the semiotic artifact/information polysemy relation this is not the case as seen in the example below which clearly combines the two senses in one construction:

menukortet, der var dekoreret med en kopi af Arne Haugen Sørensens maleri ‘Skovkentaur med dame’, var varieret og ganske indbydende

(the menu, which was decorated with a copy of Arne Haugen Sørensen's painting ‘Forest centaur with lady’, was varied and rather appetising)

This example leads to a discussion of the constraints that should be satisfied in order to establish two semantic units. If they are not distinguished in the corpus by different distributions, what then are the criteria for defining two senses? In the particular case of semiotic artifact/information we are tempted to believe that the phenomenon should be categorised as a case of semantic vagueness rather than polysemy, since in a given context we can refer to either meaning aspect or to both at the same time.

We have not encoded regular polysemy relations for verbs. It is characteristic of Danish that it has far fewer cases of regular polysemy for verbs than e.g. English, and we found that it would require a more detailed investigation to decide which of the many classes described in the guidelines would be relevant in the Danish lexicon. However, this work is foreseen in the Danish follow-up lexicon project.

2.5. Representation of Predicative information.

Words which take arguments all include a predicate object in which the argument structure is described. In the Danish lexicon we have chosen to name the arguments in accordance with the LINDA specifications (Underwood et al. 1996), where a principled analysis is given of the argument structure of Danish verbs and nouns. The grammatical subject is in most cases assigned ARG1, the grammatical or prepositional object ARG2, and weakly bound prepositional complements are assigned the function ADJUNCT. ARG0 is reserved for semantically empty subjects in the LINDA specifications, as in constructions like det regner ("it is raining"); this kind of argument is not described in a predicate, but only taken care of in the syntactic description (at the syntactic level).

As regards selectional restrictions, we apply ontological types only. When we want to express that an argument can refer to human groups only, we simply refer to the ontological type ‘human group’ via the so-called InformArg objects:

<InformArg

id="ArgHumanGroup"

comment="human"

status="CHECK"

weightvalsemfeaturel="WVSFTemplateHumanGroupPROT">
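To make the role of such an object concrete, the Python sketch below checks whether a semantic unit satisfies the restriction carried by ArgHumanGroup. It is a simplification under stated assumptions: the restriction is reduced to a single required template feature, exact feature matching is used instead of any inheritance in the ontology, and the dictionary layout and the function name satisfies are hypothetical. The feature values are copied from entries shown elsewhere in this document.

```python
# Template features of two semantic units, as listed in their entries.
semu_features = {
    "USEM_N_Dragør_HUG_1": {
        "WVSFTemplateHumanGroupPROT",
        "WVSFTemplateSuperTypeGroupPROT",
    },
    "USEM_V_knække_CCS_1": {
        "WVSFTemplateCauseChangeofStatePROT",
        "WVSFTemplateSuperTypeCauseRelationalChangePROT",
        "WVSFEventTypeTransitionPROT",
    },
}

# Assumed reduction of an InformArg object to the one feature it requires.
inform_args = {"ArgHumanGroup": "WVSFTemplateHumanGroupPROT"}


def satisfies(semu_id, arg_id):
    """True if the unit carries the feature required by the InformArg."""
    return inform_args[arg_id] in semu_features[semu_id]


if __name__ == "__main__":
    print(satisfies("USEM_N_Dragør_HUG_1", "ArgHumanGroup"))
    print(satisfies("USEM_V_knække_CCS_1", "ArgHumanGroup"))
```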

The semantic roles are assigned to each argument according to the list in the guidelines on events.

We have only felt the need to introduce one additional role, "NonProtoAgent", for subjects of the type flaget vajer (the flag waves).

 

2.7. Linguistic problems

Phrasal verbs

Phrasal verbs have caused several problems during the encoding phase. Phrasal verbs are very frequent in Danish and therefore it is important to strive towards a principled treatment of these.

In traditional Danish lexicography, we distinguish between two kinds of phenomena, namely phrasal constructions and phrasal verbs. The basic criterion for this distinction is transparency: if either the verb or the particle is not transparent in meaning, i.e. diverges from its original or prototypical meaning, we prefer a lexicalised interpretation, which in traditional lexicography means that we would establish a sublemma under the verb in question. This is the case for vaske op (lit. ‘wash up’, meaning ‘do the dishes’), where vaske more or less preserves its original meaning whereas op (‘up’) clearly does not. In contrast, if the meaning is more or less predictable on the basis of the original meaning of the two words, we prefer a valency interpretation of the particle, as in for instance grave noget op/ned (‘dig something up/down’) (see Braasch & Pedersen in press).

In the Danish Parole syntax such a distinction has not been established mainly due to the fact that the syntax does not really allow for such a distinction: irrespective of the internal nature of the particle construction, the particle is always expressed in the so-called ‘self’. This gives an overall splitting strategy as follows:

MORPHOLOGY    SYNTAX        SEMANTICS

grave         grave         grave (dig)
              grave ned     grave ned (dig down)
              grave op      grave op (dig up)

vaske         vaske         vaske (wash)
              vaske op      vaske op (do the dishes)

 

 

We interpret this as a kind of lexicalisation, with the consequence that all phrasal verbs/constructions in Danish are treated as lexicalisations. This lack of distinction causes problems when dealing with semantics. As it is now, we have been forced to encode different semantic units for what is basically the same meaning of a word, since the particles in such cases are not assigned a valency function but are rather considered part of the lemma.

Consider the example below for the verb løbe. Two syntactic units have been established: the first describes a construction like han løb (fra Roskilde) (til København) (he ran from Roskilde to Copenhagen); the second a construction like han løb ud (he ran out). Semantically, we would prefer to treat these as one semantic unit with a directional adjunct which can be expressed either as a PP or as a directional particle. However, as it is now, we are forced to encode - apart from the ‘basic’ sense of løbe - a phrasal verb construction løbe ud/ind/op/ned (run out/in/up/down) which is fully predictable in meaning and which furthermore is considered to take only one argument (since the directional particle is considered to be a lexicalised part of the lexeme løbe).

<SynU

id="Usyn2016"

naming="løbe"

attestation="cn"

description="Dv3Pa0Pa0v-fra-til"><correspSynUSemU

targetsemu="USEM_V_løbe_MOV_1"

correspondence="arg1_ADJ_ADJ" ></SynU>

<SynU

id="Usyn5152"

naming="løbe"

attestation="cn"

description="Dv1xdv-dir"><correspSynUSemU

targetsemu="USEM_V_løbe_ud_ind_op_ned_MOV_1"

correspondence="arg1" ></SynU>

We would have preferred a valency interpretation of all particles at the syntactic level, leaving it to the semantics to consider whether the meaning is predictable or not. This would also fit nicely into the ‘split late’ strategy adopted in the project and would leave the semantic distinction where it belongs: in semantics. Consider the figure below, where such an approach is adopted for grave and vaske respectively:

 

MORPHOLOGY    SYNTAX            SEMANTICS

grave         grave (+part)     grave (part) (dig)

vaske         vaske (+part)     vaske (wash)
                                vaske op (do the dishes)

 

Such a strategy would also be convenient for the really complex cases (which again are rather frequent in Danish) where both a predictable and a non-predictable meaning is found, as for gå op which can mean either ‘go up’ or ‘cancel out’:

MORPHOLOGY    SYNTAX        SEMANTICS

gå            gå (part)     gå (part) (walk)
                            gå op (cancel out)

Here the predictable sense (go up) is treated as one semantic unit together with the normal sense with an optional directional adjunct, whereas the ‘cancel out’ sense has its own SemU belonging to a different node in the event ontology.
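The preferred ‘split late’ treatment can be pictured as a lookup that is resolved in semantics rather than in syntax. The following Python sketch is an illustration only: the three small tables are built from the glosses given in this section, the set of verbs that accept a directional adjunct is an assumption read off the figures, and the function name readings is hypothetical. It is not a description of the PAROLE or SIMPLE data structures.

```python
# Illustrative lexicon fragments taken from the examples in this section.
BASE_SENSES = {"grave": "dig", "vaske": "wash", "gå": "walk"}
LEXICALISED = {("vaske", "op"): "do the dishes", ("gå", "op"): "cancel out"}
DIRECTIONAL_PARTICLES = {"op": "up", "ned": "down", "ud": "out", "ind": "in"}
# Assumption: only these verbs take an optional directional adjunct.
TAKES_DIRECTIONAL_ADJUNCT = {"grave", "gå"}


def readings(verb, particle=None):
    """Return candidate readings as (sense, directional adjunct or None)."""
    base = BASE_SENSES[verb]  # raises KeyError for verbs not in the fragment
    if particle is None:
        return [(base, None)]
    result = []
    # Predictable reading: base sense plus the particle as a directional adjunct.
    if verb in TAKES_DIRECTIONAL_ADJUNCT and particle in DIRECTIONAL_PARTICLES:
        result.append((base, DIRECTIONAL_PARTICLES[particle]))
    # Non-predictable reading: the combination has its own semantic unit.
    if (verb, particle) in LEXICALISED:
        result.append((LEXICALISED[(verb, particle)], None))
    return result


if __name__ == "__main__":
    for verb, particle in [("grave", "ned"), ("vaske", "op"), ("gå", "op")]:
        print(verb, particle, readings(verb, particle))
```

In this arrangement the syntax only records that a particle is present; whether the result is one reading or two is decided by the semantic tables.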

In the longer term, we will consider such a reorganisation of our lexicon; however, within the scope of SIMPLE, we are not able to make such a large change to the PAROLE lexicon.

Figurative senses

When using a corpus to find the distribution of the different meanings of the words being encoded, we have noticed that in many cases the concrete meaning of a word is rarely represented in the texts, whereas a figurative sense of the word is highly frequent. We have not systematically encoded these figurative senses, which we find somewhat problematic, since the figurative sense seems to be the most frequent use of many words in written language. As an example, the word stormvejr (stormy weather) is only encoded as a weather phenomenon, although in the corpus it is mostly used in the sense of a hectic situation. Often these kinds of meanings are not described in the dictionary we use as our resource, and since they are very abstract, they are quite difficult to place in the semantic hierarchy, at least in the case of abstract nouns. As regards verbs, though, the event ontology seems to cover the figurative senses very well. We have only missed one ontological type, namely one to cover the metaphoric event senses ‘to move in time’ or ‘time passing’, which we found to be quite common figurative senses of motion verbs in our corpus. One example is the verb passere (pass), which is encoded with the concrete sense ‘Change of location’, but which also has a figurative sense ‘to move in time’:

vi skal passere år 2000 , før alle danske biler kører med katalysator

(we will have to pass the year 2000 before all Danish cars run with a catalytic converter)

 

In a future extension of the Danish SIMPLE lexicon, we do feel a need for developing the treatment of figurative senses of words in order to be able to cover written text in a better way.

 

  1. Statistics

Domains applied in the encodings:

agriculture

air_transport

arts

astronomy

baby_care

biochemistry

botany

bus_transport

business

car_transport

chemistry

civil_law

commerce

computing

diplomacy

drink

economics

education

entomology

ethnology

fashion

film

finance

fishing

food

freshwater_fishing

furnishing

geography

geology

geometry

gymnastics

health

history

home_and_garden

hotel_business

inland_waterway_transport

law

librarianship

life sciences

linguistics

livestock_farming

logic

mail

mathematics

mechanical_engineering

media

medicine

military

mineralogy

music

ornithology

physical sciences

physics

physiology

poetics

politics

politics and government

psychology

publishing

rail_transport

religion

restauration

road_transport

sailing_yachting_and_boating

sciences

sea_transport

ship_building

sociology

sports and leisure

subway_transport

taxation

transport

trucking

zoology

Semantic Classes applied in the encodings:

ABSTRACT

AGENCY

AMPHIBIAN

ANIMAL

ARTIFACT

ATTRIBUTE

BIO

BIRD

BODY

BODY_PART

BUILDING

CHANGE

COGNITION

COGNITIVE_FACT

COLOR

COMMUNICATION

COMPETITION

CONCRETE

CONSUMPTION

CONTACT

COULEUR

CREATION

CURRENCY

DAY

EMOTION

ETHNOS

FISH

FLOWER

FORM

FRUIT

FURNITURE

GARMENT

GEOG

GEOGRAPHY

GROUP_NAMES

HUMAN

IDEO

INANIMATE

INSECT

INSTRUMENT

LETTER

LIVING_BEING

LOCATION

MAMMAL

MATTER

MEASURE_UNIT

MICROORGANISM

MOLLUSC

MONTH

MOTION

MUSHROOM

NOTION

OBJECT

OCCUPATION

OCCUPATION_AGENT

PART

PERCEPTION

PERIOD

PERIODE

PLANT

POSSESSION

PSYCHOLOGICAL_FEATURE

REPTILE

SHRUB

SOCIAL

STATIVE

SUBSTANCE

TIME_PERIOD

TREE

VEHICLE

WEATHER

 

List of Polysemy Relations applied:

 

Agentofpersistentactivity-Profession

Animal-Food

Animal-Material

Area-Humangroup

Area-Institution

Building-HumanGroup

Building-Institution

Container-Amount

Convention-Semioticartifact

Flavouring-Plant

Flower-Colour

Flower-Plant

Food-Animal

Fruit-Plant

GeopoliticalLocation-HumanGroup

HumanGroup-Building

HumanGroup-GeopoliticalLocation

HumanGroup-Institution

Information-Semioticartifact

Institution-Building

Institution-HumanGroup

Language-People

Location-HumanGroup

Material-Animal

Material-Plant

Opening-Artifact

People-Language

Plant-Flavouring

Plant-Flower

Plant-Fruit

Plant-Material

Plant-Substance

Plant-Substancefood

Semioticartifact-Container

Semioticartifact-Information

Substance-Colour

Substance-Plant

List of Semantic Relations applied in the encodings:

Agentive

AgentiveCause

Concerns

Constitutiveactivity

Contains

Createdby

Derivedfrom

Hasascolour

Hasasmember

Hasaspart

Indirecttelic

Instrument

Isa

Isafollowerof

Isamemberof

Isapartof

Isin

Istheabilityof

Istheactivityof

Isthehabitof

Livesin

Madeof

Measuredby

Objectoftheactivity

Producedby

Produces

Propertyof

Purpose

Quantifies

Relates

Relatedto

ResultingState

Resultof

Successor

Successorof

Synonym

Telic

Usedas

Usedby

Usedfor

 

 

 

 

Bibliography

Braasch, A., A. B. Christensen, S. Olsen & B.S. Pedersen (1998) 'A Large-Scale Lexicon for Danish in the Information Society', in: Proceedings from First International Conference on Language Resources & Evaluation, Granada 1998.

Braasch, A. & B. Pedersen (1999). ‘En stor sprogteknologisk ordbog for dansk - med særligt fokus på håndtering af flertydighed i den niveaudelte ordbog’, in: P. Widell (ed.) 7. Møde om Udforskning af Dansk Sprog, Århus Universitet.

Bergenholtz, H. (1990). 'DK87-DK90: Dansk korpus med almensproglige tekster', in: M. Kunøe & Erik Larsen (eds.) 3. Møde om Udforskning af Dansk Sprog, Aarhus Universitet.

Boje, F. & L. Schøsler (eds.) (1992). ‘DISEM - A Semantic MT-Component’, in: CST Working Papers no. 1, Center for Sprogteknologi, Copenhagen.

Christ, O. (1993). The Xkwic User Manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.

Kjærulff Nielsen: Engelsk-Dansk Ordbog, Gyldendal, Copenhagen.

Malmgren, S. (1988). ‘On Regular Polysemy in Swedish’, in: Studies in Computer-Aided Lexicography, Almqvist & Wiksell, Stockholm.

Nimb, S. & B. Pedersen (2000). ‘Treating Metaphoric Senses in a Danish Computational Lexicon – different cases of regular polysemy’, in: EURALEX 2000, Stuttgart, Germany.

Pedersen, B. & S. Nimb (2000). ‘Semantic Encoding of Danish Verbs in SIMPLE – Adapting a verb-framed model to a satellite-framed Language’, in: Second International Conference on Language Resources and Evaluation, Athens, Greece.

Politikens Store Nye Nudansk Ordbog, Version 2.1, Politikens Forlag, Copenhagen.

Underwood, N., C. Povlsen, P. Paggio, A. Neville, B.S. Pedersen, L. Jørgensen, B. Ørsnes, A. Braasch (1996). LINDA, Linguistic Specifications for Danish, Technical Report, Center for Sprogteknologi, Copenhagen.