Let me summarise one of our recent developments at OEG: the release of Apertium RDF, the linked data version of the Apertium family of bilingual dictionaries. The initial migration into RDF from their LMF version resulted from our collaboration with Marta Villegas and Núria Bel after several short visits I made to IULA (Universitat Pompeu Fabra) in 2014. The linking to BabelNet was the result of a visit I made to Roberto Navigli at Università di Roma “La Sapienza” in November/December 2014.
What is Apertium?
Apertium is a free, open-source machine translation platform. The system was initially created by the Universitat d’Alacant and is released under the terms of the GNU General Public License. At its core, Apertium relies on a set of bilingual dictionaries, developed by a community of contributors, which cover more than 40 language pairs.
Why is it useful to have it on the Web of Data?
- The Apertium dictionaries, initially developed in isolation, are interlinked and accessible on the Web as part of a single graph.
- The data can be accessed in a uniform way through standard query mechanisms (e.g. SPARQL). Queries can now span several dictionaries, which would be harder if each of the original data sources had to be queried in isolation.
- Indirect translations can be obtained between language pairs that were initially unconnected.
- Further, direct connections to other datasets in the Web of Data are possible, so the original information can be enriched with additional relevant data.
How were the data migrated into RDF?
What does the result look like?
We published 22 Apertium bilingual dictionaries (those with an LMF version) as linked data on the Web. The result, which we call Apertium RDF, groups the data of the (originally disparate) Apertium bilingual dictionaries in the same graph, interconnected through the common lexical entries of the monolingual lexicons that they share.
The picture on the left illustrates this unified graph. The nodes in the figure are the languages and the edges are the translation sets between them. The darker the colour, the more connections a node has.
All the generated information is accessible on the Web both for humans (via a Web interface) and software agents (with SPARQL). All the datasets are documented in Datahub.
If you are interested in playing with the Apertium RDF SPARQL endpoint, we have predefined some queries that you can find here and use as a starting point.
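To give a rough idea of what such a query might look like, here is a minimal Python sketch that builds a SPARQL query string for the direct translations of a term. Please treat it as an illustration only: the prefixes and property names (`lemon:writtenRep`, `lemon:lexicalForm`, `lemon:isSenseOf`, `tr:translationSense`) are my assumptions about how the lemon model and its translation module are used here, and the function name `direct_translation_query` is hypothetical. The predefined queries remain the reference.

```python
# Sketch only: the prefixes and property names below are ASSUMPTIONS about the
# lemon and translation-module vocabularies; check them against the
# predefined queries before running anything against the endpoint.
QUERY_TEMPLATE = """\
PREFIX lemon: <http://www.lemon-model.net/lemon#>
PREFIX tr: <http://purl.org/net/translation#>

SELECT DISTINCT ?written_rep
WHERE {{
  ?form_a   lemon:writtenRep {literal} .
  ?entry_a  lemon:lexicalForm ?form_a .
  ?sense_a  lemon:isSenseOf ?entry_a .
  ?trans    tr:translationSense ?sense_a ;
            tr:translationSense ?sense_b .
  FILTER (?sense_a != ?sense_b)
  ?sense_b  lemon:isSenseOf ?entry_b .
  ?entry_b  lemon:lexicalForm ?form_b .
  ?form_b   lemon:writtenRep ?written_rep .
}}
"""


def direct_translation_query(term: str, lang: str) -> str:
    """Return a SPARQL query string asking for the direct translations of term@lang."""
    # Escape backslashes and double quotes so the term is a valid SPARQL literal.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    literal = f'"{escaped}"@{lang}'
    return QUERY_TEMPLATE.format(literal=literal)


if __name__ == "__main__":
    print(direct_translation_query("bank", "en"))
```

The sketch only produces the query text; sending it to the endpoint is left out on purpose, since the endpoint address is not given in this post.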
What about linking to other sources?
Of course, linked data is about “linking” data. In our case we defined links from Apertium to LexInfo, a catalog of lexical categories from which we took the representation of the different parts of speech. Even more interestingly, we linked to BabelNet, the award-winning large multilingual encyclopaedic database. In that way we could expand the candidate translations that one can get from Apertium with the babelsynsets associated with such translations, from which extra linguistic information, such as glosses, can be obtained. Around 270,000 links were established between Apertium senses and babelsynsets.
The Apertium RDF dictionaries and their links to these other datasets are depicted in the LLOD cloud.
Some examples
Example 1. A query can be built to get the direct translations of the English term “bank” into other languages. An excerpt of the result is (see here for the whole results):
Example 2. This query gets the indirect translations of “bank” from English to Portuguese using Spanish as pivot language.
Pivot translation written representation | Indirect translation written representation |
“banco”@es | “banco”@pt |
“orilla”@es | “orla”@pt |
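The idea behind pivot translation can be sketched in a few lines of Python. This is not how the SPARQL query is implemented; it is a toy in-memory illustration that uses only the words shown in the excerpts of this post. The dictionaries `EN_ES` and `ES_PT` and the function `pivot_translations` are hypothetical names introduced for the example.

```python
# Toy data copied from the excerpts in this post; the real dictionaries are far larger.
EN_ES = {"bank": ["banco", "orilla", "ribera"]}
# "ribera" has no entry here only because the excerpt above shows none.
ES_PT = {"banco": ["banco"], "orilla": ["orla"]}


def pivot_translations(term, source_to_pivot, pivot_to_target):
    """Return (pivot, target) pairs reachable from term through the pivot language."""
    pairs = []
    for pivot in source_to_pivot.get(term, []):
        for target in pivot_to_target.get(pivot, []):
            pairs.append((pivot, target))
    return pairs


print(pivot_translations("bank", EN_ES, ES_PT))
# → [('banco', 'banco'), ('orilla', 'orla')]
```

Note that chaining two dictionaries in this way can produce wrong candidates when the pivot word is ambiguous, so indirect translations are best treated as candidates rather than as confirmed translations.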
Example 3. Finally, we show an excerpt of the result of a query that gives not only the written representation of the direct translations of “bank”@en but also their associated babelsynsets. We also add the result of searching for the English gloss of such babelsynsets in BabelNet.
Translated written repr. | BabelSynset | BabelNet gloss |
“banco”@es | http://babelnet.org/rdf/s00008371n | “A building in which the business of banking transacted” |
“banco”@es | http://babelnet.org/rdf/s00008366n | “An arrangement of similar objects in a row or in tiers” |
“banco”@es | http://babelnet.org/rdf/s15346085n | “An ocean bank, sometimes referred to as a fishing bank or simply bank, …” |
… | … | … |
“orilla”@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)” |
“ribera”@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)” |
Further reading
J. Gracia, “Multilingual dictionaries and the web of data,” Kernerman Dictionary News, no. 23, pp. 1-4, Jun. 2015.
[UPDATE (2018):] Another paper, jointly written with Marta Villegas, Núria Bel and Asun Gómez-Pérez, describes Apertium RDF from a more technical perspective:
Gracia, J., Villegas, M., Gómez-Pérez, A. and Bel, N.: “The Apertium Bilingual Dictionaries on the Web of Data”, Semantic Web Journal, Vol. 9, No. 2. (January 2018), pp. 231-240.