Topic 1-B

 

Language-Specific Adaptation to a Multilingual Lexicon Model

- some aspects of the Danish SIMPLE-lexicon

Sanni Nimb & Bolette Sandford Pedersen,

Center for Sprogteknologi

Njalsgade 80, DK-2300 S, DENMARK,

email: sanni@cst.ku.dk bolette@cst.ku.dk

 

Abstract

The aim of the EU project SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) is to provide harmonized semantic lexicons for Natural Language Processing for 12 of the European languages. The language-specific encodings are performed on the basis of a unified, ontology-based semantic model representing an extended Qualia Structure. In this paper we focus on two aspects of the Danish lexicon where systematic solutions have been required in order to adapt the universal model to empirical data. Because our encodings are based on corpus data, two aspects in particular have come into focus, both of which expose a complex system of meaning derivation and thereby challenge the borderlines of a pre-defined semantic model. The first aspect concerns a case of regular polysemy, namely the high frequency of figurative senses in real texts. Here we propose a treatment where parts of the qualia structure of the concrete sense are mapped systematically into the corresponding figurative sense. Secondly, we look into the relation between verb meanings and their corresponding phrasal verb constructions - a relation which can be seen as much less regular but which still requires a systematic approach in order to be coherently related to syntax as well as appropriately represented in semantics.

 

 

 

1 Introduction

One of the fundamental assumptions behind the SIMPLE model is that word senses differ in terms of their internal complexity and that this complexity can be described on the basis of an ontology established along different dimensions (Lenci et al. 1998). Some word senses can be described by means of simple types which means that they inherit their information from only one mother node in the ontology; others are more complex and thus inherit information from several mother nodes following the principle of orthogonal inheritance. These multiple dimensions of meaning are represented in SIMPLE by means of an extended Qualia Structure model based on Pustejovsky (1995) encompassing a set of semantic relations such as is_a, used_for, part_of, has_as_parts, is_the_result_of etc. for each qualia role (see also Alonge et al. 1998 for the use of similar semantic relations in EuroWordNet). Furthermore, regular polysemous classes are represented in SIMPLE via the additional type: complex which establishes a link between systematically related senses.

In this paper we look at two kinds of sense change and examine how such changes can be represented in a predefined, multilingual model like SIMPLE. Since our encodings are based on corpus data, the first issue concerns a case of regular polysemy, namely the relatively high frequency of figurative senses in real texts. Should these figurative senses - when they are frequent in the corpus - be represented in the SIMPLE dictionary even if they are absent from the traditional dictionaries that we use as our basis? And if so, what is the systematic relation between the meaning components (i.e. qualia structure) of a concrete sense and those of a figurative sense, and how can it be represented in the encodings? Relating our findings to the polysemy classes identified by Malmgren (1988), we suggest a representation of figurative senses which reflects the fact that meaning components from the concrete meaning often map onto a similar, although vaguer or broader, meaning component in the figurative sense. The second issue concerns phrasal verb constructions, which we estimate constitute more than half of the verb senses to be described. Danish is a typical satellite-framed language in Talmy’s terms (Talmy 1985): it makes extensive use of satellites (particles, prepositions, adverbs) and expresses a large part of the core meaning schema through them (see among others Harder, Heltoft & Thomsen 1996, Herslund 1993, Braasch & Pedersen in press, as well as Weilgaard 1997). Representing these characteristics in a lexicon is a challenge not only for traditional lexicography but perhaps even more for computational lexicography, which has a long tradition of a modular composition of the lexicon distinguishing strictly between morphology, syntax and semantics.
In this paper we establish a distinction between phrasal constructions and phrasal verbs - the first being transparent in meaning, the second not - and we show how this distinction relates to syntax. Furthermore, we claim that for phrasal verbs there seems to be no systematicity with respect to shared meaning components, and we therefore suggest that phrasal verbs be treated as unrelated to their base verbs at the semantic level in the SIMPLE lexicon.

2 Representing figurative senses in SIMPLE

First, for an illustration of the multi-dimensional semantic model applied in SIMPLE, consider below the four meaning components of the concrete sense of the Danish word puslespil (puzzle):

formal         constitutive        telic               agentive

spil (game)    brikker (pieces)    samle (assemble)    fremstille (produce)

                        puslespil (puzzle)

Four components are involved: (i) the formal role, which provides information that distinguishes an entity within a larger set (in this case is_a), (ii) the constitutive role, which expresses a variety of relations concerning the internal constitution of an entity (in this case has_as_parts), (iii) the telic role, which concerns the typical function of an entity (here used_for), and (iv) the agentive role, which concerns the origin of an entity (in this case created_by). These elements, plus a long list of additional information types such as definition, domain, corpus example, polysemy relations etc., are represented in the lexical entry as follows:

Semantic Unit:       puslespil_ART (puzzle - artifact reading)
Definition:          et spil med træ- el. papbrikker i forskellige faconer som skal lægges sammen så de danner et hele (NDO) (a game with wood or cardboard pieces in different shapes which must be assembled so that they make a whole)
Corpus example:      nu var hun næsten ved at være færdig med det puslespil, hun var begyndt på lige efter påske (now she had almost finished the puzzle she started right after Easter)
Semantic type:       Artifact
Unification Path:    Concrete_Entity|Agentive|Telic
Domain:              General
Semantic Class:      Artifact
Formal quale:        is_a = spil (game)
Agentive quale:      created_by = fremstille (produce)
Telic quale:         used_for = samle (assemble)
Constitutive quale:  has_as_parts = brikker (pieces)
Complex:             ArtifactAbstract = puslespil_ABS (puzzle - abstract reading)
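As a minimal sketch, and not part of the SIMPLE specification, the entry above could be held in a small Python record. The class name SemanticUnit, its field names and the helper relation() are assumptions introduced here purely for illustration; only the values are taken from the entry.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SemanticUnit:
    """Hypothetical container for one SIMPLE-style semantic unit."""
    name: str
    semantic_type: str
    unification_path: str
    domain: str
    # qualia role -> {semantic relation: target lemma}
    qualia: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # name of the systematically related sense (the "Complex" field), if any
    complex_link: Optional[str] = None

    def relation(self, role: str) -> Optional[str]:
        """Return the semantic relation encoded for a qualia role, or None."""
        entry = self.qualia.get(role)
        return next(iter(entry)) if entry else None


# Values copied from the puslespil_ART entry above.
puslespil_art = SemanticUnit(
    name="puslespil_ART",
    semantic_type="Artifact",
    unification_path="Concrete_Entity|Agentive|Telic",
    domain="General",
    qualia={
        "formal": {"is_a": "spil"},
        "agentive": {"created_by": "fremstille"},
        "telic": {"used_for": "samle"},
        "constitutive": {"has_as_parts": "brikker"},
    },
    complex_link="puslespil_ABS",
)

print(puslespil_art.relation("telic"))
```

Keeping the four qualia roles in one mapping makes it easy to ask which relation fills a given role, which is the information the figurative reading builds on.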

Below we show the results of a small investigation of a group of concrete nouns that produce figurative meanings (including the puzzle example from above), namely words belonging to the group of artifacts in the SIMPLE ontology. The table below shows how often a figurative sense was represented in a corpus of 20 million tokens compared to its concrete counterpart:

 

                            CONCRETE ARTIFACTS  FIGURATIVES  FIGURATIVE SENSE IN EXISTING DICTIONARY  TELIC ROLE OF CONCRETE SENSE

vindue (window)             92 %                8 % (15)     no                                       used_for: se (to look)
sovepude (sleeping pillow)  0 %                 100 % (14)   yes                                      used_for: sove opad (to sleep upon)
piedestal (pedestal)        25 %                75 % (12)    yes                                      used_for: placere højt (to put in high place)
vifte (fan)                 10 %                90 % (72)    no                                       used_for: afkøle (to cool)
panser (armour)             40 %                60 % (10)    yes                                      used_for: beskytte (to protect)
springbræt (springboard)    0 %                 100 % (38)   yes                                      used_for: sætte af (to take off)
skyklapper (blinkers)       0 %                 100 % (14)   yes                                      used_for: afskærmning (limitation of visual field)
bombe (bomb)                50 %                50 % (150)   no                                       used_for: ødelægge (to destroy)
spændetrøje (straitjacket)  20 %                80 % (34)    yes                                      used_for: fastholde (to keep in place)
våben (weapon)              90 %                10 % (100)   no                                       used_for: kæmpe (to fight)
glidebane (slide)           20 %                80 % (12)    no                                       used_for: glide (to slide)
bro (bridge)                75 %                25 % (75)    yes                                      used_for: forbinde (to connect)
puslespil (puzzle)          20 %                80 % (67)    no                                       used_for: samle (to assemble/put together)
rygstød (back of a seat)    11 %                89 % (16)    yes                                      used_for: læne (to lean)
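The figures above can be screened mechanically. The following Python sketch is only an illustration: the percentages and the dictionary column are copied from the table, while the 50 % threshold and all variable names are assumptions made here, not criteria stated in the paper.

```python
# (word, concrete %, figurative %, figurative sense listed in existing dictionary)
rows = [
    ("vindue", 92, 8, False),
    ("sovepude", 0, 100, True),
    ("piedestal", 25, 75, True),
    ("vifte", 10, 90, False),
    ("panser", 40, 60, True),
    ("springbræt", 0, 100, True),
    ("skyklapper", 0, 100, True),
    ("bombe", 50, 50, False),
    ("spændetrøje", 20, 80, True),
    ("våben", 90, 10, False),
    ("glidebane", 20, 80, False),
    ("bro", 75, 25, True),
    ("puslespil", 20, 80, False),
    ("rygstød", 11, 89, True),
]

THRESHOLD = 50  # assumed cut-off for "frequent"; not a value given in the paper

# Figurative senses that are frequent in the corpus but missing from the dictionary.
frequent_but_unlisted = [
    word for word, _, fig, listed in rows if fig >= THRESHOLD and not listed
]

# Words for which only the figurative sense occurs in the corpus.
figurative_only = [word for word, concrete, _, _ in rows if concrete == 0]

print(frequent_but_unlisted)
print(figurative_only)
```

Under these assumptions the first list collects the candidates that a corpus-based lexicon would add beyond the traditional dictionary, and the second the words whose concrete sense is not attested at all.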

As can be seen from the table, figurative senses are very frequent even when they are not mentioned in the existing dictionary that we use. In fact, in several cases only the figurative sense is found in the corpus. This clearly indicates that such senses cannot be ignored in a computational lexicon meant for processing real texts. The last column of the table shows the telic role of the concrete senses. It is remarkable how the verbs or verbal nouns which constitute the targets of the semantic relation used_for more or less create the meaning of the figurative sense, as the following corpus examples of the figurative use of skyklapper (blinkers) and puslespil (puzzle) illustrate:

(1) valutahandlerne har skyklapper på i øjeblikket og vil kun se på de faktorer som vil føre til en styrket dollar

(the currency dealers are wearing blinkers at the moment and only want to look at the factors which will lead to a strengthened dollar)

(2) Det har været et puslespil at få udstillingen på benene

(it has been a puzzle to arrange the exhibition)

In the first example the "limiting of visual field" is what creates the new sense of skyklapper: the currency dealers' sight is limited by their preoccupation with the dollar to such an extent that they cannot see anything else; they are blinded, so to speak. In the second example, the metaphor puslespil is used to indicate all the sub-events which need to fall into place in order to set up an exhibition, so again the telic quale plays a central role.

The question is now to what extent we can directly map this qualia role into the semantic structure of the figurative sense. In our view, the qualia structure with its four meaning dimensions is best suited for the description of concrete nouns, especially for nouns denoting artifacts. This can also be seen from the fact that fewer type-defining qualia have to be expressed obligatorily in the abstract part of the ontology (see Lenci et al. 1998 for an overview of the complete SIMPLE ontology including its type-defining qualia). For instance, for the node abstract no type-defining qualia are predefined apart from the formal role (is_a). This, we believe, is not just a particular problem of the SIMPLE model, but rather a general problem relating to the fact that abstract nouns are much more difficult to classify coherently and thus to assign type-defining semantic components to. However, in the case of the figurative senses that we are dealing with here - which all originate from artifact senses - the relevant qualia roles can be mapped more or less systematically onto the figurative senses; not as type-defining for the entire type ‘abstract’, but as an essential feature of these particular senses. Nevertheless, the used_for relation is too restricted for the abstract senses because it indicates a volitional act with the concrete sense as the object. We therefore suggest broadening the quale and applying the more general semantic relation object_of_the_activity. The semantic relation is in any case rather vague and differs slightly depending on whether the figurative sense denotes something negative or something positive; what seems most important is the information given by the verb or the verbal noun that constitutes the value of the relation. The resulting figurative lexicon entry is shown below for puslespil:

Semantic Unit:            puslespil_ABS (puzzle - abstract reading)
Definition:               en kompleks sag der består af enkeltdele (a complex matter which consists of separate parts)
Corpus example:           Det har været et puslespil at få udstillingen på benene (it has been a puzzle to arrange the exhibition)
Semantic type:            Abstract
Unification Path:         Entity
Domain:                   General
Semantic Class:           Abstract
Formal quale:             is_a = sag (matter)
Agentive quale:           Nil
Telic quale (essential):  object_of_the_activity = samle (assemble)
Constitutive quale:       has_as_parts = dele (parts)
Complex:                  AbstractArtifact = puslespil_ART (puzzle - artifact reading)
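The mapping from the concrete to the figurative entry can be sketched as a small Python function. This is an illustrative assumption about how the encoding step might be automated, not the procedure used in the project: the function name derive_figurative, the dictionary layout and the parameters are hypothetical, and the vaguer formal and constitutive values (sag, dele) are still supplied by the lexicographer.

```python
def derive_figurative(concrete: dict, name: str,
                      formal_value: str, parts_value: str) -> dict:
    """Build an abstract-reading entry from an artifact-reading entry.

    The telic value is carried over, but under the broader relation
    object_of_the_activity instead of used_for; the agentive quale is dropped.
    """
    telic_value = concrete["qualia"]["telic"]["used_for"]
    return {
        "name": name,
        "semantic_type": "Abstract",
        "unification_path": "Entity",
        "domain": concrete["domain"],
        "qualia": {
            "formal": {"is_a": formal_value},
            "agentive": None,  # Nil in the entry above
            "telic": {"object_of_the_activity": telic_value},
            "constitutive": {"has_as_parts": parts_value},
        },
        # link back to the concrete sense (the "Complex" field)
        "complex": {"AbstractArtifact": concrete["name"]},
    }


# Concrete entry reduced to the fields the sketch needs.
puslespil_concrete = {
    "name": "puslespil_ART",
    "domain": "General",
    "qualia": {"telic": {"used_for": "samle"}},
}

puslespil_figurative = derive_figurative(
    puslespil_concrete, "puslespil_ABS", formal_value="sag", parts_value="dele"
)
print(puslespil_figurative["qualia"]["telic"])
```

Only the telic value is inherited automatically here, which mirrors the observation that the target of used_for is what largely creates the figurative meaning.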

 

3 Representing phrasal verbs in the SIMPLE-model

As a starting point for a semantic treatment of constructions with verb particles, we find it convenient to distinguish between two kinds of constructions involving a verb and a particle, namely phrasal constructions and phrasal verbs. The basic criterion for this distinction is transparency: if either the verb or the particle is not transparent in meaning, i.e. diverges from its original or prototypical meaning, then we suggest a lexicalised interpretation, which in traditional lexicographic terms means that a sublemma to the verb in question should be established. This is the case for vaske op (lit: ‘wash up’, meaning ‘do the dishes’), where vaske more or less preserves the original meaning whereas op (‘up’) clearly does not. In contrast, if the meaning is more or less predictable on the basis of the original meaning of the two words (i.e. compositional in meaning), then we suggest an unbounded adjunct interpretation of the particle, as for instance in grave noget op/ned (‘dig something up/down’), resulting in the following syntactic pattern of arguments: SUBJ OBJ DIRECTIONAL. Thus we consider vaske op a phrasal verb, and grave ned/op a phrasal construction.

The PAROLE/SIMPLE lexicons are strictly modular in the sense that they contain morphological, syntactic and semantic units which are then coherently linked to each other (see Ruimy et al. 1998). Consequently, the model makes it possible to distinguish different syntactic behaviours on purely syntactic criteria, independently of whether they share the same meaning or not. In the case of phrasal verbs, however, it is a matter of dispute whether a phrasal verb like vaske op should be lexicalised at the morphological level and thus be treated as a completely different lexeme from vaske:

MORPHOLOGY      SYNTAX          SEMANTICS

vaske           vaske           vaske (wash)

vaske op        vaske op        vaske op (do the dishes)

or whether a ‘split late’ strategy should be adopted, meaning that the distinction is made at the semantic level and that the particle is treated as an optional complement at the syntactic level:

MORPHOLOGY      SYNTAX          SEMANTICS

vaske           vaske (op)      vaske (wash)

                                vaske op (do the dishes)

To consider op an optional complement to vaske may be controversial from a syntactic point of view but it has the advantage of leaving the discussion of whether we are dealing with a phrasal verb or a phrasal construction for the semantic level where it actually belongs since it is basically a semantic distinction. Especially in cases of ambiguity (i.e. where both a phrasal verb and a phrasal construction interpretation are possible as in gå op which in the phrasal construction interpretation means ‘walk upwards’ and in the phrasal verb interpretation means ‘cancel out’) this is convenient since it prevents unnecessary overgeneration at earlier levels and allows for a unified syntactic description of directionals at the syntactic level (irrespective of whether these are expressed as particles or as prepositional phrases):

MORPHOLOGY      SYNTAX          SEMANTICS

gå              gå (DIR)        gå (DIR) (walk)

                                gå op (cancel out)
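The ‘split late’ strategy can be illustrated with a small Python sketch in which syntax treats the particle as an optional complement and the choice between readings is made only at the semantic level. The table names, the sample set of directionals and the function readings() are assumptions introduced for illustration; they do not reproduce the PAROLE/SIMPLE data format.

```python
from typing import List, Optional

# Assumed sample of particles that can fill the directional (DIR) slot.
DIRECTIONALS = {"op", "ned", "ud", "hen"}

# Base senses, and whether the base sense accepts a directional.
BASE_SENSES = {"vaske": "vaske (wash)", "gå": "gå (walk)"}
TAKES_DIRECTIONAL = {"gå"}

# Lexicalised phrasal verbs: listed only at the semantic level.
PHRASAL_VERBS = {
    ("vaske", "op"): "vaske op (do the dishes)",
    ("gå", "op"): "gå op (cancel out)",
}


def readings(verb: str, particle: Optional[str] = None) -> List[str]:
    """Return every semantic reading compatible with verb (+ optional particle)."""
    if particle is None:
        return [BASE_SENSES[verb]]
    result = []
    # Phrasal construction: compositional base sense plus a directional.
    if verb in TAKES_DIRECTIONAL and particle in DIRECTIONALS:
        result.append(f"{BASE_SENSES[verb]} + DIR={particle}")
    # Phrasal verb: non-transparent, separately listed sense.
    if (verb, particle) in PHRASAL_VERBS:
        result.append(PHRASAL_VERBS[(verb, particle)])
    return result


print(readings("gå", "op"))
print(readings("vaske", "op"))
print(readings("gå", "ud"))
```

In this sketch the ambiguity of gå op surfaces only as two semantic readings of one syntactic pattern, so no duplicate entries are needed at the morphological or syntactic level.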

This generalisation also represents an advantage at the semantic level. A look at the semantic representation of the motion interpretation of gå (walk) illustrates this:

Semantic unit:             gå_MOV (walk - move reading)
Definition:                komme frem ved at sætte den ene fod foran den anden (NDO) (proceed by putting one foot in front of the other)
Corpus example:            Vi skal gå hen til telefaxen, vente mens den kalder op osv. (we have to walk over to the fax machine, wait while it makes the call etc.)
Semantic type:             Move
Sem. Supertype:            Act
Event type:                Process
Domain:                    General
Predicative rep:           ARG1 (DIR)
Selectional restrictions:  ARG1 = Human/animal; DIR = Concrete
Formal quale:              is_a = bevæge sig (move)
Agentive quale:            Nil
Telic quale:               Nil
Constitutive quale:        Manner = yes
Complex:                   Nil

One semantic unit can now account for gå in phrases like han gik 2 km, han gik ud, han gik op, han gik hen til telefaxen etc. (he walked 2 km, he walked out, he walked up, he walked over to the fax machine), since no lexicalisations are involved, only phrasal constructions. The phrasal verb gå op, on the other hand, receives a completely different representation in the semantics since it is assigned the type ‘Identificational State’ from the semantic model:

Semantic unit:             gå op_IDS (come out/cancel out)
Definition:                (om regnestykke og kabale) løses så der ikke bliver nogen rest (NDO) ((of calculations and games of patience) be solved so that there is no remainder)
Corpus example:            ..men for at få regnestykket til at gå op måtte han inddrage naboens grund (but in order to make the calculation come out right he had to include the neighbour’s garden)
Semantic type:             Identificational state
Sem. Supertype:            Relational state
Event type:                State
Domain:                    Mathematics
Predicative rep:           ARG1
Selectional restrictions:  ARG1 = representational
Formal quale:              is_a = relation (relation)
Agentive quale:            Nil
Telic quale:               Nil
Constitutive quale:        Relates = numbers
Complex:                   Nil

As can be seen from the two semantic representations above, no attempt is made at the current stage of the lexicon to establish a connection between the meaning of a verb like gå and that of its phrasal verb counterpart (the feature Complex being left unspecified). This reflects the fact that these derivations are considered to be more or less unsystematic, based on idiosyncratic conventions rather than on predictable, regular relations between meaning components. The following small list of phrasal verbs extracted at random from the corpus illustrates this idiosyncrasy quite well: gå under (die, lit: walk under), gå an (be appropriate, lit: walk to), gå op i (be interested in, lit: walk up into), gå af (be pensioned, lit: walk off), gå efter (examine, lit: walk after), gå over (cease, lit: walk over).

4 Conclusions

In this paper we have looked at two aspects of the Danish SIMPLE lexicon which illustrate some of the language-specific adaptations that are necessary when elaborating a multilingual semantic lexicon with a pre-defined model as its common backbone. Since corpora are our most important information source in the Danish SIMPLE lexicon, two examples of very frequent ‘real text’ meaning deviations have been the focus of this paper: figurative uses of concrete senses, and the extensive use of phrasal verbs and phrasal constructions in Danish. We have suggested a systematic solution to the treatment of figurative senses: if these are frequent in the corpus, they should be represented in our lexicon even if they are not represented in the traditional dictionary we use as our other important background resource. Furthermore, we have chosen to map the telic role from the concrete senses into a more general telic role, realised by the semantic relation object_of_the_activity. Thereby we systematically account for the parallelism that holds between the concrete and figurative senses. As for verb constructions with particles, it has been crucial for us to establish a ‘split late’ strategy, where the distinction between constructions that are transparent (i.e. compositional) in meaning and constructions that are not (phrasal verbs) is made at the level where it belongs: in semantics. Furthermore, we have illustrated why phrasal verbs must be considered unsystematic, and we therefore argue that no systematic semantic relation can be established between a verb and its phrasal verb counterpart.

As a concluding remark, it is important to stress - considering the current status of language technology for the 'small' European languages - that the scope of the SIMPLE project makes it a truly pioneering project for Danish. The development of these harmonized large-scale semantic lexicons is a first step in the right direction for creating advanced language technology also for less widely spoken European languages.

References

Alonge, A., N. Calzolari, P. Vossen, L. Bloksma, I. Castellon, M.M. Marti, W. Peters (1998). ‘The Linguistic Design of the EuroWordNet Database’, in: P. Vossen (ed.) EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, Boston, London.

Apresjan, J. (1980). Semantyka leksykalna. Wroclaw (apud Malmgren 1988).

Braasch, A. & B. Pedersen (in press). ‘En stor sprogteknologisk ordbog for dansk - med særligt fokus på håndtering af flertydighed i en niveaudelt ordbog’, in: 7. Møde om Udforskning af Dansk Sprog, Århus University.

Harder, P., L. Heltoft & O.N. Thomsen (1996). ‘Danish directional adverbs, content syntax and complex predicates: A case for host and co-predicates’, in: E. Engberg-Pedersen et al. (eds.) Content, Expression and Structure. Studies in Danish Functional Grammar. John Benjamins, Amsterdam.

Herslund, M. (1993). ‘Transitivity and the Danish Verbs’, in: LAMBDA no. 18, Copenhagen Business School, Copenhagen.

Malmgren, S. (1988). ‘On Regular Polysemy in Swedish’, in: Studies in Computer-Aided Lexicography, Almqvist & Wiksell, Stockholm.

Pustejovsky, J. (1995). The Generative Lexicon, Cambridge, MA, The MIT Press.

Ruimy, N., O. Corazzari, E. Gola, A. Spanu, N. Calzolari, A. Zampolli (1998). ‘The European LE-PAROLE Project: The Italian Syntactic Lexicon’, in: First International Conference on Language Resources & Evaluation, Granada, Spain.

Talmy, L. (1985). ‘Lexicalization Patterns: Semantic Structure in Lexical Forms’, in: T. Shopen (ed.) Language Typology and Syntactic Description, Vol. 3: Grammatical Categories and the Lexicon, Cambridge University Press, Cambridge.

Weilgaard, L. (1997). ‘Danske Partikelverber’, in: B. Maegaard & B.S. Pedersen (eds.) UDOG Rapport nr. 6, Center for Sprogteknologi, Copenhagen.