Linked data and terminologies

Recently (October 2018), my colleague Elena Montiel-Ponsoda (Universidad Politécnica de Madrid) and I were invited to give a seminar at the European Commission, Directorate-General for Translation, on the topic “Linguistic Linked Open Data for Terminology”.

The audience consisted of translators, terminologists and trainees from the European Commission and the European Parliament. We were pleasantly surprised by their interest and enthusiastic attitude! The presentation had two parts: a first one about “linked data” in general, and a second one on “linguistic linked data”, particularly focused on terminologies and dictionaries and including some hands-on exercises. The audience asked very challenging and relevant questions, for instance regarding the potential of linked data for IATE, or about quality issues derived from the Web-centred nature of linked data.

This experience (the invitation itself and the positive attitude of terminologists and translators) illustrates the growing interest in linked data shown by people working in terminology and related fields such as lexicography. In fact, as shown by projects and initiatives such as the LIDER project, the Linguistic Linked Open Data Cloud, or the W3C Ontology Lexica Community Group, linked data technologies fit the representation needs of linguistic data very well when interoperability and distribution need to be taken into consideration.

Linked data is no more than a set of recommended best practices for publishing data on the Web. In short, (i) resources are identified on the Web via HTTP URIs (Uniform Resource Identifiers), (ii) once a resource is accessed via its URI, information about it is obtained, and (iii) such information contains links to other resources. Such resources can be practically anything, from Web documents (images, websites, audio files, …) to real-world entities (cities, artists, writers, …), and information about them can be stated in RDF (Resource Description Framework) by following the “subject-predicate-object” pattern. For instance (omitting the whole URIs for simplicity): Spain hasCapitalCity Madrid. As a result, a “Web of Data” is emerging in which links are at the level of data, as a counterpart to the “traditional” Web, in which links are established at the level of documents (e.g. hyperlinks between web pages).
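
As a minimal sketch of the subject-predicate-object pattern, the following Python snippet stores a few triples as plain tuples and follows a link from one resource to another. The http://example.org/ URIs, the property names and the helper objects_of are made up for illustration; real datasets use their own vocabularies and would normally be handled with an RDF library.

```python
# Toy illustration of RDF-style triples using only the standard library.
# All URIs below are hypothetical placeholders, not real identifiers.
EX = "http://example.org/"

triples = {
    (EX + "Spain", EX + "hasCapitalCity", EX + "Madrid"),
    (EX + "Madrid", EX + "locatedIn", EX + "Spain"),
    (EX + "Madrid", EX + "label", "Madrid"),
}

def objects_of(subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

# Look up Spain, follow the link to its capital, then read the capital's label.
capital = objects_of(EX + "Spain", EX + "hasCapitalCity")[0]
capital_label = objects_of(capital, EX + "label")[0]
print(capital_label)
```

The point of the sketch is that the object of one statement (Madrid) is itself a resource with its own statements, which is what makes the data "linked".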

An interesting aspect is that linked data can also be applied to represent and interlink words, lexical senses, grammatical categories, and any other type of linguistic data. For instance, one can state that “network” is a “lexical entry” and has part of speech “noun” as follows (the example is expressed in the Turtle serialisation of RDF):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix terminoteca: <http://linguistic.linkeddata.es/id/terminoteca/>  .

# "/" is not allowed unescaped in a Turtle prefixed name, so it is written as "\/".
terminoteca:lexiconEN\/network-n-en rdf:type ontolex:LexicalEntry ;
                   lexinfo:partOfSpeech lexinfo:noun ;
                   ontolex:lexicalForm [ ontolex:writtenRep "network"@en ] .

or graphically:

[Figure: the lexical entry above drawn as a graph]

Notice how different sources that model lexical information and datasets have been combined to build the example (the lemon-ontolex model, the lexinfo catalogue of grammatical categories, and Terminoteca RDF). The graph can be further expanded, for instance to associate such a lexical entry with a concept, with its translations into other languages, or with equivalent terms in other datasets or ontologies.
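
To make that expansion concrete, here is a rough Python sketch that represents the entry as tuples and adds two further links. ontolex:denotes and vartrans:translatableAs are meant as the OntoLex-Lemon properties of those names; the concept identifier, the Spanish entry identifier and the describe helper are invented for this example and do not come from Terminoteca RDF.

```python
# Sketch only: prefixed names are kept as plain strings, no RDF library involved.
ENTRY = "terminoteca:lexiconEN/network-n-en"

graph = [
    # The statements from the Turtle example (its blank node is named _:form1 here).
    (ENTRY, "rdf:type", "ontolex:LexicalEntry"),
    (ENTRY, "lexinfo:partOfSpeech", "lexinfo:noun"),
    (ENTRY, "ontolex:lexicalForm", "_:form1"),
    ("_:form1", "ontolex:writtenRep", '"network"@en'),
    # Hypothetical extensions: a concept and a Spanish translation.
    (ENTRY, "ontolex:denotes", "ex:concept/Network"),
    (ENTRY, "vartrans:translatableAs", "terminoteca:lexiconES/red-n-es"),
]

def describe(resource):
    """Group all outgoing links of a resource by predicate."""
    description = {}
    for s, p, o in graph:
        if s == resource:
            description.setdefault(p, []).append(o)
    return description

entry_view = describe(ENTRY)
print(entry_view["vartrans:translatableAs"])
```

Calling describe on the form node or on the Spanish entry instead would give a view centred on that element, with no change to the underlying data.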

Terminoteca RDF is a cool demonstrator of linked terminologies that we started when I was part of the Ontology Engineering Group (Universidad Politécnica de Madrid). It focuses on converting a number of multilingual terminologies in Spain into linked data, following the lemon-ontolex model. We started with Terminesp and Termcat, two widely known multilingual terminologies in Spain. As a result, we obtained a unified graph in which terminological data that was initially disconnected is enriched and easily discoverable with simple queries. For instance, Terminesp does not contain any acronyms, but Termcat does. Now that both are connected, we can easily retrieve acronyms for terms in Terminesp through a single query that traverses the unified graph and reaches the corresponding acronyms in Termcat. This and other example queries can be found here. More technical details on Terminoteca RDF can be found in [1].
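
The kind of traversal behind the acronym example can be sketched in plain Python over a toy graph. Everything in it is an assumption made for illustration: the identifiers, the ex:sameTermAs and ex:hasAcronym link names, and the sample acronym do not reflect the actual Terminoteca RDF modelling or data.

```python
# Toy unified graph; identifiers and link names are invented for this sketch.
triples = [
    ("terminesp:term-1", "ex:inLexicon", "lexicon:ES"),
    ("termcat:term-9", "ex:inLexicon", "lexicon:ES"),
    ("terminesp:term-1", "ex:sameTermAs", "termcat:term-9"),
    ("termcat:term-9", "ex:hasAcronym", "T9"),
    ("terminesp:term-2", "ex:inLexicon", "lexicon:ES"),  # no Termcat counterpart
]

def objects_of(subject, predicate):
    """Return the objects of all triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def acronyms_for(terminesp_term):
    """Hop from a Terminesp term to linked Termcat terms, then collect acronyms."""
    found = []
    for counterpart in objects_of(terminesp_term, "ex:sameTermAs"):
        found.extend(objects_of(counterpart, "ex:hasAcronym"))
    return found

print(acronyms_for("terminesp:term-1"))
print(acronyms_for("terminesp:term-2"))
```

Against a real RDF dataset, the same two-hop pattern would typically be written in a graph query language such as SPARQL rather than hand-coded.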

The following picture illustrates the unified graph in Terminoteca RDF. Every circle represents a monolingual lexicon. On the left-hand side you can see the lexicons that come from Terminesp, and on the right-hand side the ones from Termcat. The shared lexicons in the middle (French, Spanish, and English) enable the connection between both data sources.

For a few more thoughts on linked data and terminology, you can read this interview by Olga Vamvaka for TermCoord (European Parliament). Let me finish with some reflections that I included there about the potential impact of linked data on terminologies and dictionaries:

I think that dictionaries and terminologies must get rid of their physical boundaries to become natively digital. Although there are many electronic dictionaries out there, most of them still stick to the printed version and mimic the hierarchical structures that one can find on paper. But this is only one of many possible arrangements of lexical information. In the Linked Data paradigm, any element of the lexicon (lexical entry, lexical sense, translation, form, etc.) can be a “first class citizen” and become the centre of a graph-based structure, which will allow for many other possible arrangements of and views on the information. Linked Data has proved to be useful for language resources in general, particularly when it comes to terminologies and dictionaries. By means of such technologies, we foresee more unified/linked graphs of terminologies and dictionaries on the Web, enriched through their linkage to other resources.

 

Further reading

[1] J. Bosque-Gil, E. Montiel-Ponsoda, J. Gracia, and G. Aguado-de Cea, “Terminoteca RDF: a gathering point for multilingual terminologies in Spain,” in Proc. of 12th International Conference on Terminology and Knowledge Engineering (TKE’16), Copenhagen (Denmark), H. E. Thomsen, A. Pareja-Lora, and B. N. Madsen, Eds. Copenhagen Business School, Jun. 2016, pp. 136-145. [Online]. Available: http://openarchive.cbs.dk/handle/10398/9323

[2] J. Bosque-Gil, J. Gracia, and A. Gómez-Pérez, “Linked data in lexicography,” Kernerman Dictionary News, pp. 19-24, Jul. 2016. [Online]. Available: http://kdictionaries.com/kdn/kdn24.pdf#page=19

[3] J. Gracia, I. Kernerman, and J. Bosque-Gil, “Toward linked data-native dictionaries,” in Electronic lexicography in the 21st century. Proc. of eLex 2017 conference, Leiden, Netherlands. Lexical Computing CZ s.r.o., Sep. 2017, pp. 550-559. [Online]. Available: https://elex.link/elex2017/wp-content/uploads/2017/09/paper33.pdf

[4] J. Gracia, “Multilingual dictionaries and the web of data,” Kernerman Dictionary News, no. 23, pp. 1-4, Jun. 2015. [Online]. Available: https://www.kdictionaries.com/kdn/kdn23_2015.pdf
