LE-PAROLE
WP3.2
Catalan Lexicon Documentation
(t24 Resources Report)
1. General Design Information
The strategy followed in the construction of the Catalan Lexicon is statistically based. Entries have been selected from a Dictionary of Frequencies based on a corpus of 29 million words. This Dictionary has recently been published (on paper and in electronic form) by our organisation. Some corrections have been applied to the purely statistical results, such as taking into account the distribution of a lemma within the different parts of the corpus, and the presence of a lemma in the existing lexicographical sources.
A certain amount of information from the DCC-Corpus has been added to this, in order to include proper nouns and acronyms. The final number of such items to be included in the data was fixed according to their statistical presence within the corpus, and the selection of entries was also statistically based.
The morphological information has been taken from two existing sources, as described in report 4.2.2a of MLAP-PAROLE: from the DCC-Lexicon we have taken the inflectional models for the verbal category and the set of lexical entries related to them; from the Servei de Tractament Informatitzat de Textos Catalans (University of Barcelona) we have taken the same information for non-verbal categories. The internal structure of both sources determined the use of alternant stems for the non-verbal categories and the use of a single stem for the verbal category, as two different strategies within the Genelex-Parole model for describing the inflectional behaviour of lexical entries.
As for syntax, the strategy consisted in starting with verbs, once a preliminary list of the required descriptions and framesets had been established, to be improved during the encoding procedure.
2. Current Lexicon Contents
2.1 Morphological layer
Number of simple morphological units | 20994
Number of compound morphological units | 430
Number of affix morphological units | 0
Number of agglutinated morphological units | 3
Number of graphical morphological units | 21038
Number of simple inflection modes | 222
Number of compound inflection modes | 83
Category | Subcategory | Number of Units
WITHOUTC | WITHOUTSC | 76
NOUN | COMMON | 13779
NOUN | PROPER | 261
VERB | AUX | 2
VERB | MAIN | 3065
ADJECTIVE | QUALI | 3123
PRONOUN | DEMONSTRATIVE | 7
PRONOUN | POSSESSIVE | 7
PRONOUN | INTERROGATIVE | 4
PRONOUN | PERSONAL | 5
PRONOUN | RELATIVE | 4
PRONOUN | INDEFINITE | 22
PRONOUN | EN | 1
PRONOUN | HI | 1
PRONOUN | HO | 1
ADVERB | GENERAL | 707
ADPOSITION | PREPOSITION | 85
CONJUNCTION | COORDINATIVE | 20
CONJUNCTION | SUBORDINATIVE | 22
NUMERAL | ORDINAL | 23
NUMERAL | CARDINAL | 109
DETERMINER | DEMONSTRATIVE | 3
DETERMINER | POSSESSIVE | 10
DETERMINER | INTERROGATIVE | 1
DETERMINER | EXCLAMATIVE | 1
DETERMINER | RELATIVE | 1
DETERMINER | INDEFINITE | 28
ARTICLE | PERSONAL | 1
ARTICLE | INDEFINITE | 1
ARTICLE | DEFINITE | 1
INTERJECTION | WITHOUTSC | 49
2.2 Syntactic layer
At this level we encode all those aspects of the syntactic behaviour of words which cannot be directly deduced from their category or subcategory. The PAROLE syntactic level allows us to describe the insertion context of a lexical entry with a variable degree of granularity, depending on its morphosyntactic category.
From a syntactic perspective there is a clear distinction between predicative categories (verbs, predicative nouns, adjectives or adverbs) and non-predicative categories (non-predicative nouns, adjectives and adverbs, minor categories). In the first case we describe the subcategorization pattern of the head, while in the second we describe, up to a certain degree, the insertion context of the entry.
More often than not, one word has more than one possible syntactic construction; that is, it has several alternative surface realizations.
The PAROLE model provides several ways of dealing with surface variation, depending on how we use and combine the different syntactic objects: Syntactic Unit, Description, Self, Construction, Position and Syntagma.
A Syntactic Unit is equivalent to a syntactic entry or reading, and has one base description and possibly several derived descriptions.
A Description consists of a Self and a Construction, which in turn, consists of an ordered list of Positions, which may be realized by several syntagmatic categories.
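The object hierarchy just described can be pictured with a minimal Python sketch. This is only an illustration of the containment relations named in the text, not the PAROLE DTD: the class and field names are assumptions, the Construction is simplified to a plain ordered list of Positions, and the sample entry is a hypothetical transitive pattern.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Syntagma:
    """A syntagmatic category that may realize a Position, e.g. NP or CL."""
    category: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Position:
    """One slot of a Construction, realized by one or more Syntagmas."""
    function: str
    syntagmas: List[Syntagma]


@dataclass
class Self:
    """The lexical entry itself as it appears inside a Description."""
    category: str


@dataclass
class Description:
    """A Self plus a Construction (simplified to an ordered list of Positions)."""
    id: str
    self_: Self
    construction: List[Position]


@dataclass
class SyntacticUnit:
    """One syntactic reading: a base Description and possibly derived ones."""
    id: str
    base: Description
    derived: List[Description] = field(default_factory=list)


# Hypothetical transitive pattern: a subject NP and an object NP around a verb.
dvsnon = Description(
    id="DVSnOn",
    self_=Self(category="VERB"),
    construction=[
        Position("SUBJECT", [Syntagma("NP")]),
        Position("OBJECT", [Syntagma("NP")]),
    ],
)
unit = SyntacticUnit(id="LLEGIR", base=dvsnon)
print([p.function for p in unit.base.construction])
```

A Position holding more than one Syntagma would correspond to an alternation of categories in that slot.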
2.2.1 Encoding of subcategorization properties of verbs
2.2.1.1 Design of a framing system
Since the model is not restrictive in this respect, one of the central tasks of syntactic encoding is to decide how the different surface realizations are modelled into subcategorization frames and how the resulting set of frames belonging to a particular lexical entry is organized into Syntactic Units.
This is illustrated by the verb ‘cremar’ (‘burn’), exemplified below:
(1) Un piròman ha cremat el palau (A pyromaniac has burned the palace)
(2) El palau està cremant (The palace is burning)
(3) La carn s'ha cremat (The meat has burned)
(4) La sopa crema (The soup is burning hot)
(5) El palau va ser cremat pel piròman (The palace was burnt down by the pyromaniac)
As a convenient design option, we have chosen to organize this set of syntactic realizations into a smaller set of frames. A frame corresponds, at a deep syntactic level, to the argument structure of the head and, therefore, does not include modifying elements.
Under this perspective, a verb -or any other predicative word- has as many Syntactic Units (or readings) as frames, and each of these may include one or more syntactic Descriptions.
For this purpose, we have restricted the choices the model offers: verbal Descriptions may in some cases show alternations of category in a given position, but never allow for optional positions.
The Derived Descriptions in the description list of one given Syntactic Unit must be related, either to their Base Description or to another derived description inside the list. This relation is expressed by means of a FrameSet. FrameSets were introduced by EAGLES to the original GENELEX model. They relate Descriptions (and Positions inside them) and are used to express deep syntactic relations like decausativization, passivization, etc.
This strategy provides us with a lexicon that is not merely descriptive but equipped with internal structure: on the one hand, the set of descriptions gives us the syntactic combining properties of the language, that is, the different possible complementation patterns in Catalan, independently of the subcategorization properties of the heads.
On the other hand, the combination of these descriptions in the Syntactic Units (alone or related via FrameSets) may be used to establish a typological classification of Catalan verbs.
2.2.1.2 Use of FrameSets
Deep syntactic phenomena like ergative alternation, pronominalisation, etc. have generally been encoded by using FrameSets.
We have defined six types of FrameSets (or relations between descriptions):
As an example of how we have used FrameSets, let's take a look at the ergative verb 'albergar' (to lodge). This verb has three different surface realizations and has been encoded using only one SynU with three different Descriptions, related by means of two different types of FrameSets: optional and ergative.
(6) La dona albergà el pobre a casa seva. (The woman lodged the beggar in her house)
(7) La dona albergà el pobre. (The woman lodged the beggar)
(8) El pelegrí albergà en un hostal. (The pilgrim lodged in a hostel)
These sentences are described by means of the following Descriptions:
(a) Snp+V+Onp+ADVloc
(b) Snp+V+Onp
(c) Snp+V+ADVloc
The three Descriptions are collected in one Syntactic Unit and are related via Framesets.
<SynU id="USALBERGAR" description="DVSnOnADVloc" <- Base Description
descriptionl="DVSnOn DVSnADVloc" <- Derived Descriptions
framesetl="FSOp18 FSErg02"> <- List of FrameSets
FSOp18 is a FrameSet expressing the optionality of one complement. It relates Description (b) (DVSnOn) to the Base Description (a) (DVSnOnADVloc).
<FrameSet id="FSOp18" example="col·locar" descriptionl="DVSnOnADVloc DVSnOn">
FSErg02 expresses the ergative relation between the Base Description and Description (c) (DVSnADVloc).
The second Position (P1) in DVSnOnADVloc is related to the first Position (P0) in DVSnADVloc. This accounts for the fact that these Positions impose the same kind of restrictions on the NP.
<FrameSet id="FSErg02" example="albergar" descriptionl="DVSnOnADVloc DVSnADVloc">
<Related>
<RelElement1
description="DVSnOnADVloc">
<WayToPosition
targetposition="1">
<RelElement2
description="DVSnADVloc">
<WayToPosition
targetposition="0">
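The requirement that every Derived Description be related to the Base Description, or to another Description in the list, can be checked mechanically. The following Python sketch does this for the 'albergar' entry, using the identifiers given above. The dictionary layout, the function name and the choice to treat a FrameSet as a symmetric link between two Descriptions are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def unrelated_descriptions(
    base: str,
    derived: List[str],
    framesets: Dict[str, Tuple[str, str]],
) -> List[str]:
    """Return the derived Descriptions that no chain of FrameSets links to base."""
    reachable = {base}
    changed = True
    while changed:
        changed = False
        for first, second in framesets.values():
            # Assumption: a FrameSet links its two Descriptions in both directions.
            for source, target in ((first, second), (second, first)):
                if source in reachable and target not in reachable:
                    reachable.add(target)
                    changed = True
    return [d for d in derived if d not in reachable]


# Data copied from the USALBERGAR example.
framesets = {
    "FSOp18": ("DVSnOnADVloc", "DVSnOn"),
    "FSErg02": ("DVSnOnADVloc", "DVSnADVloc"),
}
derived = ["DVSnOn", "DVSnADVloc"]

print(unrelated_descriptions("DVSnOnADVloc", derived, framesets))
```

With both FrameSets present the list of unrelated Descriptions is empty; dropping FSErg02 from the dictionary would leave DVSnADVloc unrelated.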
2.2.1.3 Criteria for splitting Syntactic Units
The following surface variations are not gathered under a frame but lead to splitting SynUs.
Thus, we will have -at least- two readings for ‘oblidar’ (forget) to cope with:
(9) He oblidat les claus (I forgot the keys)
(10) He oblidat que arriba avui (I forgot that he arrives today)
<SynU id=OBLIDAR1
description=VSnpOnp>
<SynU id=OBLIDAR2
description=VSnpClind>
E.g. a) saber fer-ho (lit. know to do it)
b.1) saber que ho faria (know he/she would do it)
b.2) saber si ho faria (know whether he/she would do it)
b.3) saber qui ho faria (know who would do it)
There is no equi relation between a) and b.1). On the contrary, all examples in b) belong to the same SynU. That amounts to having two SynUs: saber_1, saber_2.
2.2.1.4 Encoding of optionality
The optionality of a position in a complementation frame can derive from
(i) the lexical idiosyncrasy of certain heads,
(ii) the 'inherent' properties of certain positions in certain constructions and, finally,
(iii) from grammar principles.
The optionality of the direct object in object deletion verbs such as llegir (to read) as exemplified in:
(11) A la Maria li agrada llegir revistes del cor (Maria likes to read gossip magazines)
(12) A la Maria li agrada llegir (Maria likes to read)
does not derive from the optional nature of direct objects in Catalan, but rather from the lexical entailments of certain verbs which, for whatever semantic reasons, allow object deletion. By no means can we generalize this into the claim that the direct object position is optional in Catalan.
On the contrary, the optionality of determiners in nominal constructions does not depend on the specific noun, but may be generalised to all such nouns.
(13) L'atac a la ciutat es va produir de matinada. (The attack on the city took place in the morning)
(14) Les probabilitats d'atac a la ciutat són escasses. (The probabilities of (an) attack on the city are low)
Finally, the absence of a subject in the surface structure, known as subject Pro-drop, is usually regarded as a general grammar principle of languages such as Catalan.
(15) Avui anem a la platja (Today go_we to the beach)
The way we encode optionality depends on the kind of phenomenon we are dealing with.
(i) Thus, when the optionality of a given position derives from the lexical semantic entailments of the head, we define a single Syntactic Unit with two Descriptions related by means of the relevant FrameSet. In the case of object deletion verbs, a Deletion FrameSet relates a transitive and an intransitive description within a Syntactic Unit.
Syntactic_Unit_id: LLEGIR
    Description_id: DVSnOn
    Description_id: DVSn
    Frameset_id: FSOp15
Frameset_id: FSOp15
    DVSnOn "related to" DVSn
This approach makes explicit the fact that there exists a natural class of verbs (called "object deletion verbs") which have the ability to drop their direct object.
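Because the FrameSet is recorded on the Syntactic Unit, such natural classes can be read off the lexicon by grouping entries on the FrameSets they use. The Python sketch below shows the idea on a toy table. The table layout and the function name are assumptions, and the entry MENJAR is hypothetical, added only so that the class has more than one member.

```python
from collections import defaultdict
from typing import Dict, List


def verbs_by_frameset(synus: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group Syntactic Unit ids by each FrameSet id they refer to."""
    classes = defaultdict(list)
    for synu_id, frameset_ids in synus.items():
        for frameset_id in frameset_ids:
            classes[frameset_id].append(synu_id)
    return dict(classes)


synus = {
    "LLEGIR": ["FSOp15"],
    "USALBERGAR": ["FSOp18", "FSErg02"],
    "MENJAR": ["FSOp15"],  # hypothetical entry, not taken from the lexicon
}

classes = verbs_by_frameset(synus)
print(classes["FSOp15"])
```

Every key of the resulting dictionary is one candidate verb class; entries sharing FSOp15 are the ones that relate a transitive and an intransitive description.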
(ii) When optionality is intrinsic to a construction, we mark the position as optional. This is the case of the determiner position in nominal constructions, the than-complement in adjectival comparative constructions, etc.
Description_id=DNclinfDE
Construction
P0 [Function=NDETERMINATIVE] [Optional=YES]
(Cat=NP)
P1 [Function=NCLAUSCOMP]
(Cat=CL, Prep=DE)
Self [Function=HEAD]
(Cat=NOUN)
(iii) Finally, optionality due to a grammar principle such as subject Pro-drop is not explicitly encoded. Thus, all verbs (except weather verbs) are lexically specified as bearing a subject and all verbal constructions have an obligatory subject position. It is the grammar rules that allow this subject not to appear.
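For case (ii), a position marked Optional=YES stands for two surface patterns, one with the position and one without it. The Python sketch below expands a construction into the patterns it covers, using the two positions of DNclinfDE. It is a simplified illustration under the assumption that a position can be reduced to a (function, optional) pair; the function name is hypothetical.

```python
from itertools import chain, product
from typing import List, Tuple


def surface_patterns(positions: List[Tuple[str, bool]]) -> List[Tuple[str, ...]]:
    """Expand (function, optional) pairs into every pattern they license."""
    options = []
    for function, optional in positions:
        present = (function,)
        # An optional position may be realized or left out; an obligatory one may not.
        options.append([present, ()] if optional else [present])
    return [tuple(chain.from_iterable(combo)) for combo in product(*options)]


# Positions of DNclinfDE: optional determiner, obligatory clausal complement.
dnclinfde = [("NDETERMINATIVE", True), ("NCLAUSCOMP", False)]

for pattern in surface_patterns(dnclinfde):
    print(pattern)
```

A construction with n optional positions expands to 2^n patterns, which is one reason to mark optionality on the position rather than list each pattern as a separate Description.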
2.2.1.5 Encoding of alternation of syntagmas in a given position
Alternative syntagmatic realisations of the same Syntactic Function are sometimes expressed in the same description, and sometimes in different ones, depending on the potential for generalizing the alternation:
preposition ‘de’, but very frequently this preposition is optional. This fact is described by means of an alternation of "marked" (with preposition) and "unmarked" Infinitival Clause in the subject position.
See the example of "agradar" (like):
P0 (Function=SUBJECT)
(CL (Mood=INFINITIVE))
(CL (Prep=DE)(Mood=INFINITIVE))
P1 (Function=INDIRECTOBJECT)
(NP (Prep=A))
depending on which verb, all of them or only a subset may be selected. That is why we had initially decided to encode this alternation by using different Descriptions, each describing only one realization of the Attribute. After the encoding task was finished, we realized that whenever a verb subcategorizes for an Adjective attribute, it may also take a PP attribute. For this reason we included the alternation in the same description:
2.2.1.6 Encoding of control relations
Control verbs are typified as SUBJECTCONTROL, OBJECTCONTROL or RAISING, and the controller is coindexed with the subject of the Infinitive by means of the feature Coref, in this way:
P0 (Function=SUBJECT)
(NP (Coref=I))
P1 (Function=OBJECT)
(CL (Mood=INFINITIVE)(Coref=I))
For EQUI verbs, we have defined the FrameSetEqui, which relates the Description with the Infinitive Clause to the Description with the Finite Completive Clause. Thus, ‘voler’ (want) would be encoded using only one SynU, in this way:
<SynU id=VOLER
description=VSnpClsubj (clause in subjunctive mood)
descriptionl=VSnpClinf (clause in infinitive mood)
framesetl=FSEqui >
We have also used the feature Coref with negative values (NOI, NOJ) to block the possible coreference of the controller with the explicit subject of the finite clause, for cases where this is impossible, like:
*Jo vull que jo surti (I want that I go out)
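The effect of the negative Coref values can be sketched in Python as follows. This is only one possible reading of the feature, stated as an assumption: an index such as I and its negative counterpart NOI are taken to be incompatible, and every other pair of values is taken to be compatible. The function name is hypothetical.

```python
def may_corefer(controller: str, subject: str) -> bool:
    """Return False when one Coref value is the negative counterpart of the other."""
    return subject != "NO" + controller and controller != "NO" + subject


# Infinitive clause: controller and understood subject share the index I.
print(may_corefer("I", "I"))
# Finite clause of 'voler': the explicit subject is marked NOI, so coreference is blocked.
print(may_corefer("I", "NOI"))
```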
2.2.2 Encoding of syntactic information for non-verbal categories
The syntactic properties of non-predicative categories can be expressed by means of the description of the insertion context or by using specially defined features. In what follows we give an overview of the kind of information which has been encoded for each non-verbal category:
2.2.2.1 Nouns
We describe whether or not a noun can be determined.
The subcategorization properties of deverbal or deadjectival nouns, as well as the complements of relational nouns, are also described.
Since nominal complements are always optional, a predicative noun will always function as a simple noun too. For this reason, there are always at least two entries (SynU) for each predicative noun: a predicative reading with complements and a non-predicative or simple reading without complements.
Simple nouns, or simple readings of nouns, are classified into four classes based on two factors that we call enumerability and fractionability.
Class | ENUMERABILITY | FRACTIONABILITY
Count | + | –
No count | – | –
Mass | – | +
Variable | + | +
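The four noun classes are fully determined by the two binary factors, so the classification can be written as a simple lookup. The Python sketch below encodes the table as given; the names NOUN_CLASSES and noun_class are assumptions chosen for illustration.

```python
# (enumerability, fractionability) -> class label, copied from the table.
NOUN_CLASSES = {
    (True, False): "Count",
    (False, False): "No count",
    (False, True): "Mass",
    (True, True): "Variable",
}


def noun_class(enumerable: bool, fractionable: bool) -> str:
    """Return the class of a simple noun reading from its two factors."""
    return NOUN_CLASSES[(enumerable, fractionable)]


print(noun_class(enumerable=False, fractionable=True))
```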
2.2.2.2 Adjectives
In adjectives, as in nouns, we distinguish between predicative adjectives with argument structure and simple adjectives. Again as with nouns, we consider that complements are always optional and we distinguish the simple reading (and its syntactic properties) from the predicative reading.
Thus, the adjective digne (worthy) may take a de-complement, as in:
(16) Un home digne de crèdit (A man worthy of credit)
or function as a simple adjective, as in:
(17) És un home molt digne (He's a very worthy man)
Nevertheless, we have assumed that all adjectives (even the simple ones) bear an internal empty argument which is coindexed with its antecedent noun. This noun may be either the head of the NP which is modified by the adjective or the subject (or object) of a copulative verb. The internal argument may be used to impose semantic restrictions on this noun or to deal with control adjectives.
Simple (non-predicative) adjectives are classified with respect to the following contextual information:
2.2.2.3 Adverbs
The lexical information relevant for adverbs includes:
There are a few adverbs which need special descriptions, like:
Some information about surface order is encoded, basically concerning the relative position of the adverb with respect to the element it modifies. Since the position of adverbs in Catalan is quite free, the order we indicate is the most frequent or, in other words, the least marked.
For instance, adverbs modifying an adjective tend to precede it (ex. 18), while certain adverbs that modify negative VPs tend to appear after the verb (ex. 19).
(18) És una ferida terriblement dolorosa (It's a terribly painful wound)
(19) No m'he cansat gaire (I haven't got very tired)
The following descriptions show the use of Self insertion to account for word order information:
[Description_id : DAdvAP
comment : Adverbs that modify an adjective.
[Construction (CAdvAP):
P0 [Optional=NO] [Function=HEAD]
(AP)
Self inserted before P0
SyntlabelNT: AP]
[Self [Function=AMODIFIER]
SyntlabelT: ADV] ]
[Description_id : DAdvVPnegint
comment : Adverbs that modify a negative VP.
[Construction (CAdvVPnegint):
P0 [Optional=NO] [Function=HEAD]
(VP [Synsubcat=SSINTERROGATIVE])
(VP [Negative=YES])
Self inserted after P0
SyntlabelNT: VP]
[Self [Function=VMODIFIER]
SyntlabelT: ADV ] ]
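How the Self insertion point yields the least marked order can be sketched in Python. The mapping from Description id to insertion side is taken from the two descriptions above; the function name and the reduction of the head phrase to a single string are simplifying assumptions.

```python
from typing import List

# Side on which Self (the adverb) is inserted relative to P0, per Description.
SELF_INSERTION = {"DAdvAP": "before", "DAdvVPnegint": "after"}


def linearize(description_id: str, adverb: str, head: str) -> List[str]:
    """Order the adverb and the phrase it modifies as the Description states."""
    if SELF_INSERTION[description_id] == "before":
        return [adverb, head]
    return [head, adverb]


print(linearize("DAdvAP", "terriblement", "dolorosa"))
print(linearize("DAdvVPnegint", "gaire", "no m'he cansat"))
```

The two calls reproduce the orders of examples (18) and (19).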
2.2.2.4 Conjunctions
There are two types of conjunctions, distinguished by means of the feature Morphsubcat, with values COORDINATIVE and SUBORDINATIVE.
For each coordinated category (CL, NP, ...) we have a different Description but, given that most conjunctions may coordinate more than one category, we give only one SynU for each conjunction, which contains all the Descriptions required.
Distributive conjunctions (ara ... ara ...) are also described.
2.2.2.5 Determinant Group
The categories which form the determinant group (articles, determiners and numerals) are encoded taking into account the internal co-occurrence restrictions of the complex determinant group, i.e. which elements may precede or follow each other.
Basically we define three fields:
(i) a quantifier element (tot), which is placed furthest to the left;
(ii) a central field where we may find articles, possessives and demonstratives;
(iii) numerals and indefinites, which appear to the right of the central field.
However, the insertion context is not the DETP but the NP, so that the restrictions on the noun may be expressed (count, mass,...).
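The three fields impose a left-to-right order on the members of the determinant group. The Python sketch below checks only that order; it does not model finer co-occurrence restrictions or the restrictions on the noun. The category labels used as keys and the function name are assumptions made for illustration.

```python
from typing import List

# Field index per element type: 0 = quantifier, 1 = central, 2 = right field.
FIELD = {
    "QUANTIFIER": 0,
    "ARTICLE": 1,
    "POSSESSIVE": 1,
    "DEMONSTRATIVE": 1,
    "NUMERAL": 2,
    "INDEFINITE": 2,
}


def respects_field_order(categories: List[str]) -> bool:
    """True when no element appears to the left of an element from an earlier field."""
    fields = [FIELD[category] for category in categories]
    return all(left <= right for left, right in zip(fields, fields[1:]))


print(respects_field_order(["QUANTIFIER", "ARTICLE", "NUMERAL"]))
print(respects_field_order(["NUMERAL", "ARTICLE"]))
```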
2.2.2.6 Pronouns
This category is further subdivided into nine subtypes (by means of the Morphsubcat feature): demonstrative, indefinite, interrogative, personal, possessive, relative, en, hi, ho.
They are always the head of an NP. In most cases the NP does not contain any other element, with the exception of possessives, which need to be preceded by an article, and certain indefinites, which may be followed by a partitive complement, as in the example:
(20) qualsevol dels nois (any of the boys)
2.2.2.7 Prepositions
For prepositions we encode two types of information:
(21) La Maria va romandre a casa (Mary stayed at home)
(22) un moble sense pintar (an unpainted piece of furniture)
Prepositional phrases modifying a predicative head behave -semantically- like adverbs; while prepositional phrases modifying a non-predicative head behave like adjectives.
2.2.2.8 Interjections
Although most interjections have an independent syntactic behaviour, some of them may be accompanied by a "complement", like ai in:
(23) Ai de tu si et torno a veure per aquí! (Beware, if I find you again around here!)
2.2.3 List of syntactic functions used
Values used for verbs:
SUBJECT, OBJECT, SUBJPRED, OBJPRED, INDIRECTOBJECT, PREPOBJ, ADVERBIAL, PERIPHTERM, VGTERM, VMODIFIER
Values used for nouns:
NCOMP, NSUBJ, NPREPCOMP, NAPPOSITION, NADJUNCT, NCLAUSCOMP, NDETERMINATIVE, NATTRIBUTIVE, NMODIFIER
Values used for adjectives:
ACOMP, APREPCOMP, ACLAUSCOMP, AMODIFIER
Values used for adverbs, determiners, prepositions and conjunctions:
ADVMODIFIER, DETMODIFIER, PREPDEPENDENT, CONJDEPENDENT
Value used for all categories:
HEAD
2.2.4 Validation of the data
The process of building the syntactic Catalan lexicon is divided into three main steps:
1- formalization of the syntactic properties of a given category (verb, noun, adjective, etc.) in a conventional code which is close to the PAROLE conceptual model, i.e. design of the Descriptions and FrameSets objects. These Descriptions and Framesets are given mnemonic labels which are going to be used in step #3;
2- (semiautomatic) translation of the syntactic objects designed in step #1 to SGML format;
3- encoding of the lexical entries in a relational DB (ACCESS) using a table of Description names and (possibly) a table of Framesets. This step generates a table of Usyns containing all the relevant information for each entry, including attestation, example and comment fields. Later, this table is (automatically) converted into SGML format.
The process of validation of the data must, therefore, include these three types of data: conceptual models, SGML code, lexical encoding in ACCESS.
With respect to SGML, the DTD allows us some formal checking: labels that appear twice, undefined elements, etc. However, there are some things, like the assignment of values to attributes (for the SyntFeatClosed), that are only verified when loading the SGML into the Object Store tool. Other things are not checked at all, like similar objects with different names.
The mnemonic labels that are given to all objects make the manual verification of the internal consistency of the SGML code easier (e.g. checking that a position that has to contain an NP does not contain an ADVP).
The validation of step #3 is the most important, because the mass of data is bigger. However, the three phases are intimately connected. The revision of the lexical encoding may supply feedback for the revision of step #1 (and, therefore, #2); i.e. the syntactic model previously defined may need some adjustments on the basis of lexical evidence.
This revision of #3 is not performed in a "linear" way (i.e. in alphabetical order) but is organized following some informed linguistic criteria. This is facilitated by the ACCESS support, which allows specific searches and queries, cross-checks, etc.
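Two of the formal checks mentioned above, labels that appear twice and references to undefined objects, can also be run on the tables before conversion to SGML. The Python sketch below is an illustration only: it does not reproduce the DTD validation or the ACCESS schema, and the function names, the table layout and the faulty sample data are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def duplicate_labels(labels: List[str]) -> List[str]:
    """Return, in sorted order, every label that is defined more than once."""
    return sorted(label for label, count in Counter(labels).items() if count > 1)


def undefined_references(
    defined: List[str], synus: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Map each Syntactic Unit to the Description labels it uses but nobody defines."""
    known = set(defined)
    problems = {}
    for synu_id, references in synus.items():
        missing = sorted(set(references) - known)
        if missing:
            problems[synu_id] = missing
    return problems


# Hypothetical sample with one duplicated label and one misspelled reference.
descriptions = ["DVSnOn", "DVSn", "DVSnOn"]
synus = {"LLEGIR": ["DVSnOn", "DVSn"], "OBLIDAR1": ["DVSnOnn"]}

print(duplicate_labels(descriptions))
print(undefined_references(descriptions, synus))
```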
Number of Syntactic descriptions | 339
Number of framesets | 154
Number of Syntactic Units | 31435
Appendix - state of validation by the AlethGD tool
Nb of Morphological Units | 21427
Nb of Simple Morphological Units | 20994
Nb of Graphical Morphological Units | 21038
Number of simple words inflexion modes | 222
Number of agglutinated morphological units | 3
Number of compound words | 430
Number of compound words inflexion modes | 83

Number of syntactic Units | 31435
Number of constructions | 281
Category | Subcategory | Simple Morphological Units | Compound Morphological Units | Total | Example
WITHOUTC | WITHOUTSC | 77 | 0 | 77 | amagatons
NOUN | COMMON | 13646 | 133 | 13779 | àbac
NOUN | PROPER | 247 | 14 | 261 | Abel
VERB | AUX | 2 | 0 | 2 | anar
VERB | MAIN | 3065 | 0 | 3065 | abaixar
ADJECTIVE | QUALI | 3101 | 22 | 3123 | basc
PRONOUN | DEMONSTRATIVE | 7 | 0 | 7 | això
PRONOUN | POSSESSIVE | 7 | 0 | 7 | llur
PRONOUN | INTERROGATIVE | 4 | 0 | 4 | qui
PRONOUN | PERSONAL | 5 | 0 | 5 | jo
PRONOUN | RELATIVE | 4 | 0 | 4 | que
PRONOUN | INDEFINITE | 22 | 0 | 22 | tot
PRONOUN | EN | 1 | 0 | 1 | en
PRONOUN | HI | 1 | 0 | 1 | hi
PRONOUN | HO | 1 | 0 | 1 | ho
ADVERB | GENERAL | 511 | 196 | 707 | no
ADPOSITION | PREPOSITION | 42 | 43 | 85 | de
CONJUNCTION | COORDINATIVE | 16 | 4 | 20 | i
CONJUNCTION | SUBORDINATIVE | 8 | 14 | 22 | que
NUMERAL | ORDINAL | 23 | 0 | 23 | catorzè
NUMERAL | CARDINAL | 109 | 0 | 109 | dos
DETERMINER | DEMONSTRATIVE | 3 | 0 | 3 | aquest
DETERMINER | POSSESSIVE | 10 | 0 | 10 | llur
DETERMINER | INTERROGATIVE | 1 | 0 | 1 | quin
DETERMINER | EXCLAMATIVE | 1 | 0 | 1 | quin
DETERMINER | RELATIVE | 1 | 0 | 1 | qual
DETERMINER | INDEFINITE | 28 | 0 | 28 | mateix
ARTICLE | PERSONAL | 1 | 0 | 1 | en
ARTICLE | INDEFINITE | 1 | 0 | 1 | un
ARTICLE | DEFINITE | 1 | 0 | 1 | el
INTERJECTION | WITHOUTSC | 49 | 0 | 49 | oh
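Each row of the table above should satisfy Simple + Compound = Total. The Python sketch below checks this for the rows that have a non-zero number of compound units, with the figures copied from the table; the function name and the tuple layout are assumptions, and the sketch is not part of the AlethGD tool.

```python
from typing import List, Tuple

Row = Tuple[str, str, int, int, int]  # category, subcategory, simple, compound, total

rows: List[Row] = [
    ("NOUN", "COMMON", 13646, 133, 13779),
    ("NOUN", "PROPER", 247, 14, 261),
    ("ADJECTIVE", "QUALI", 3101, 22, 3123),
    ("ADVERB", "GENERAL", 511, 196, 707),
    ("ADPOSITION", "PREPOSITION", 42, 43, 85),
    ("CONJUNCTION", "COORDINATIVE", 16, 4, 20),
    ("CONJUNCTION", "SUBORDINATIVE", 8, 14, 22),
]


def inconsistent_rows(table: List[Row]) -> List[Tuple[str, str]]:
    """Return the (category, subcategory) pairs whose total is not simple + compound."""
    return [
        (category, subcategory)
        for category, subcategory, simple, compound, total in table
        if simple + compound != total
    ]


print(inconsistent_rows(rows))
```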