The Apertium dictionaries on the Web of Data

Let me summarise one of our recent developments at OEG: the release of Apertium RDF, the linked data version of the Apertium family of bilingual dictionaries. The initial migration into RDF from their LMF version resulted from our collaboration with Marta Villegas and Núria Bel, after several short visits I made to IULA (Universitat Pompeu Fabra) in 2014. The linking to BabelNet was the result of a visit I made to Roberto Navigli at Università di Roma “La Sapienza” in November/December 2014.

What is Apertium?

Apertium is a free, open-source machine translation platform. The system was initially created by the Universitat d’Alacant and is released under the terms of the GNU General Public License. At its core, Apertium relies on a set of bilingual dictionaries, developed by a community of contributors, which cover more than 40 language pairs.

Why is it useful to have it on the Web of Data?

Converting the Apertium data into RDF and publishing it as linked data (LD) on the Web has several advantages:
  1. The Apertium dictionaries, initially developed in isolation, are interlinked and accessible on the Web as part of a single graph.
  2. The data can be accessed in a unified manner through standard query means (e.g. SPARQL). A single query can now span several dictionaries, which would be much harder if each isolated original source had to be queried separately.
  3. Indirect translations can be obtained between language pairs that were initially unconnected.
  4. Further, direct connections to other datasets in the Web of Data are possible so the original information can be enriched with additional relevant data.

How were the data migrated into RDF?

The migration of the Apertium dictionaries has served as a motivating use case for the guidelines for LD generation of bilingual dictionaries that the W3C Best Practices for Multilingual Linked Open Data (BPMLOD) community group has recently proposed. In particular, the guidelines identify five steps: (i) vocabulary selection, (ii) modelling, (iii) URI design, (iv) generation, and (v) publication. See the guidelines for more details.
In our conversion we took as our starting point the LMF version of the Apertium dictionaries developed in the context of the METANET4U Project. To represent the lexical information contained in the original dictionaries, we relied on the LExicon Model for ONtologies (lemon), a de facto standard for representing ontology lexica. We used the lemon translation module to represent explicit translations between languages.
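
To make the modelling step more concrete, here is a minimal Python sketch of the kind of triples that a single translation could give rise to. It is an illustration, not our conversion code: the example.org base URI, the URI patterns and the helper name translation_triples are hypothetical, and the lemon and translation-module terms (lemon:LexicalEntry, lemon:lexicalForm, lemon:writtenRep, lemon:sense, tr:Translation, tr:translationSource, tr:translationTarget) are my reading of those vocabularies and should be checked against their specifications.

```python
# Namespaces as I understand them; verify against the vocabularies themselves.
LEMON = "http://lemon-model.net/lemon#"
TR = "http://purl.org/net/translation#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/apertium/"  # hypothetical base URI, not the real one


def translation_triples(source, source_lang, target, target_lang):
    """Return (subject, predicate, object) triples linking two word senses."""
    triples = []
    senses = []
    for word, lang in ((source, source_lang), (target, target_lang)):
        # Hypothetical URI pattern: one entry, one form and one sense per word.
        entry = f"{BASE}lexicon{lang.upper()}/{word}-{lang}"
        form = f"{entry}-form"
        sense = f"{entry}-sense"
        triples += [
            (entry, RDF_TYPE, LEMON + "LexicalEntry"),
            (entry, LEMON + "lexicalForm", form),
            (form, LEMON + "writtenRep", f'"{word}"@{lang}'),
            (entry, LEMON + "sense", sense),
        ]
        senses.append(sense)
    # The translation itself is a resource that points at both senses.
    trans = f"{BASE}translation/{source}_{source_lang}-{target}_{target_lang}"
    triples += [
        (trans, RDF_TYPE, TR + "Translation"),
        (trans, TR + "translationSource", senses[0]),
        (trans, TR + "translationTarget", senses[1]),
    ]
    return triples


for triple in translation_triples("bank", "en", "banco", "es"):
    print(triple)
```

The point of the sketch is that a translation connects senses rather than words, so each lexical entry can live in its own monolingual lexicon and be reused by several translation sets.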

What does the result look like?

We published 22 Apertium bilingual dictionaries (those with an LMF version) as linked data on the Web. The result, which we call Apertium RDF, groups the data of the (originally disparate) Apertium bilingual dictionaries in the same graph, interconnected through the common lexical entries of the monolingual lexicons that they share.

The figure illustrates this unified graph. The nodes are the languages and the edges are the translation sets between them. The darker the colour, the more connections a node has.

All the generated information is accessible on the Web both for humans (via a Web interface) and software agents (with SPARQL). All the datasets are documented in Datahub.

If you are interested in playing with the Apertium RDF SPARQL endpoint, we have predefined some queries that you can find here and use as a starting point.

What about linking to other sources?

Of course, linked data is about “linking” data. In our case we defined links from Apertium to lexinfo, a catalog of lexical categories from which we took the representation of the different parts of speech. Even more interestingly, we linked to BabelNet, the award-winning large multilingual encyclopaedic database. In that way we could expand the candidate translations that one can get from Apertium with the babelsynsets associated with those translations, from which extra linguistic information, such as glosses, can be obtained. Around 270,000 links were established between Apertium senses and babelsynsets.

The Apertium RDF dictionaries and their links to these other datasets are depicted in the LLOD cloud.

Some examples

Example 1. A query can be built to get the direct translations of the English term “bank” into other languages. An excerpt of the result follows (see here for the full results):

Translated written repr. | Part of speech
“banc”@ca | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“riba”@ca | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“banco”@es | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“orilla”@es | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“ribera”@es | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“beira”@gl | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“banco”@gl | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“ourela”@gl | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“orela”@gl | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“banku”@eu | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“erribera”@eu | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“ertz”@eu | http://www.lexinfo.net/ontology/2.0/lexinfo#noun
“amuntegar”@ca | http://www.lexinfo.net/ontology/2.0/lexinfo#verb
“agolpar”@es | http://www.lexinfo.net/ontology/2.0/lexinfo#verb
“amontonar”@es | http://www.lexinfo.net/ontology/2.0/lexinfo#verb
“apelotonar”@es | http://www.lexinfo.net/ontology/2.0/lexinfo#verb
“hacinar”@es | http://www.lexinfo.net/ontology/2.0/lexinfo#verb
…
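
For readers who prefer to see the shape of such a query, here is a hedged sketch in Python that only assembles the SPARQL text; it does not contact the endpoint. The function name direct_translation_query and the variable names are mine, the lexinfo namespace is the one visible in the table above, and the lemon and tr property names are assumptions based on the lemon model and its translation module. The predefined query on the endpoint may well be written differently.

```python
# Prefixes: lexinfo is taken from the results table; lemon and tr are assumptions.
PREFIXES = """\
PREFIX lemon: <http://lemon-model.net/lemon#>
PREFIX tr: <http://purl.org/net/translation#>
PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#>
"""


def direct_translation_query(term, lang):
    """Build a SPARQL query for the direct translations of term@lang."""
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    literal = '"' + escaped + '"@' + lang
    return PREFIXES + f"""\
SELECT DISTINCT ?written_rep ?pos
WHERE {{
  ?source_form lemon:writtenRep {literal} .
  ?source_entry lemon:lexicalForm ?source_form ;
                lemon:sense ?source_sense .
  ?translation tr:translationSource ?source_sense ;
               tr:translationTarget ?target_sense .
  ?target_entry lemon:sense ?target_sense ;
                lemon:lexicalForm ?target_form ;
                lexinfo:partOfSpeech ?pos .
  ?target_form lemon:writtenRep ?written_rep .
}}
"""


print(direct_translation_query("bank", "en"))
```

The query walks from the written form to its lexical entry and sense, crosses the translation, and walks back from the target sense to its written form, picking up the part of speech on the way.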

Example 2. This query gets the indirect translations of “bank” from English to Portuguese using Spanish as pivot language.

Pivot translation written representation | Indirect translation written representation
“banco”@es | “banco”@pt
“orilla”@es | “orla”@pt
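
The idea behind pivot translation can be sketched in a few lines of Python, independently of RDF. This is a toy illustration under stated assumptions: the two small dictionaries contain only a handful of pairs taken from the tables in this post, the function name pivot_translations is hypothetical, and senses and parts of speech are ignored, whereas in Apertium RDF the join goes through the shared lexical entries of the pivot language.

```python
def pivot_translations(source_to_pivot, pivot_to_target, term):
    """Return (pivot, target) pairs reachable from term through the pivot language."""
    pairs = []
    for pivot in source_to_pivot.get(term, []):
        # A pivot word with no entry in the second dictionary yields nothing.
        for target in pivot_to_target.get(pivot, []):
            pairs.append((pivot, target))
    return pairs


# Toy data only: a few English-Spanish and Spanish-Portuguese pairs.
en_es = {"bank": ["banco", "orilla", "ribera"]}
es_pt = {"banco": ["banco"], "orilla": ["orla"]}

print(pivot_translations(en_es, es_pt, "bank"))
# [('banco', 'banco'), ('orilla', 'orla')]
```

Because the join is made on the surface word alone, a polysemous pivot such as “banco” can carry a translation across to the wrong sense, which is one reason to treat indirect translations as candidates rather than as validated pairs.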

Example 3. Finally, we show an excerpt of the result of a query that gives not only the written representation of the direct translations of “bank”@en but also their associated babelsynsets. We also add the result of looking up the English gloss of each babelsynset in BabelNet.

Translated written repr. | BabelSynset | BabelNet gloss
“banco”@es | http://babelnet.org/rdf/s00008371n | “A building in which the business of banking transacted”
“banco”@es | http://babelnet.org/rdf/s00008366n | “An arrangement of similar objects in a row or in tiers”
“banco”@es | http://babelnet.org/rdf/s15346085n | “An ocean bank, sometimes referred to as a fishing bank or simply bank, …”
“orilla”@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)”
“ribera”@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)”

Further readings

KDictionaryNews23

Another paper, jointly written with Marta Villegas, Núria Bel and Asun Gómez-Pérez, describing the technicalities of Apertium RDF, is under preparation and will appear (hopefully!) in the near future.
