Language use and open, linked data
Assessore alla Cultura, Università e Ricerca Regione Toscana
Many branches of theoretical and applied linguistics investigate language(s) in terms of the uses made of it (them) in different environments, and of how non-linguistic dimensions such as space, time, class, gender, mother tongue, education and migration influence patterns of language use.
Different research questions may require different research methods: quantitative, qualitative or mixed. In every case, the empirical observations result in data, a “more or less systematic collection of instances of language in use” (Coupland & Jaworski 1997: 70). As Stubbs (2004: 111) claims: “No linguist can now ignore corpus data”. Some databases on language use also consist of data from censuses, surveys, polls and questionnaires.
The advent of computerized corpora has brought about something of a paradigm shift in linguistic description and use, and in the application of corpus and database findings (Sinclair, 1991).
But there are still questions and problems to be solved:
- Many corpora and databases have been designed to be available, but much data is still published in proprietary, closed formats and is not made available on the web. Building valid and reliable corpora or databases is time- and resource-consuming work: transcribing a one-hour interview can take a day or longer, and designing and operationalizing a questionnaire can take months or years. As a result, many scholars decide not to make their data available.
- Furthermore, corpora are often annotated (tagged, parsed) with additional information, and different annotation strategies can be employed, as different standards exist. Likewise, censuses and surveys use different questions, so it is sometimes impossible to compare results.
- Even for corpora and databases that have been designed to be available, data and relevant metadata are spread among different repositories, and it is currently impossible to query all these repositories in an integrated fashion, as they use different data models and vocabularies.
Recently, this has begun to change. Experts and practitioners in language resources have started recognizing the benefits of open data and, in particular, of the linked data (LD) paradigm for the representation and exploitation of linguistic data on the Web. This is leading to an emerging ecosystem of multilingual open resources in which linguistic datasets are interconnected and represented following common vocabularies, which makes linguistic information easier to discover, integrate and access (Chiarcos, Hellmann, Nordhoff, 2011).
The aim of my presentation is to discuss these issues and to stress that research can benefit from the interoperability and integration of linguistic resources and data.