Bilingual Dictionary Construction

Bilingual dictionaries hold great potential for emerging research areas such as machine translation and human-aided translation. Unfortunately, constructing bilingual dictionaries by hand is expensive, and new or domain-specific terminology is difficult to cover. A great deal of research has therefore been conducted on the automatic extraction of bilingual dictionaries. Extraction from large parallel corpora (bitexts) in particular has achieved impressive results. However, parallel corpora are available only for selected text domains and language pairs, so the potential of other resources is being explored as well.

We propose extracting bilingual terminology from large multilingual encyclopedias such as Wikipedia in order to complement bilingual dictionaries with accurate term-translation pairs for languages and text domains where no parallel corpora exist. Wikipedia is a very promising resource: the continuously growing encyclopedia already contains more than 16 million articles in over 270 languages, has a dense link structure, and covers a wide variety of topics.

In Wikipedia, there are many links between articles in different languages. If we regard the titles of Wikipedia articles as terminology, it is easy to extract translation relations by analyzing the interlanguage links, assuming that if two articles are connected by an interlanguage link, their titles are translations of each other.
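The interlanguage-link assumption can be illustrated with a minimal Python sketch. It assumes the links have already been parsed into a mapping from article title to linked titles per language; that mapping, the function name `extract_pairs`, and the sample entries are hypothetical and are not taken from the actual Wikipedia dump format or from our implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: source article title -> {language code: linked title}.
InterlanguageLinks = Dict[str, Dict[str, str]]


def extract_pairs(links: InterlanguageLinks, target_lang: str) -> List[Tuple[str, str]]:
    """Return (source title, target title) pairs for one target language.

    Each pair relies on the assumption that two articles connected by an
    interlanguage link have titles that are translations of each other.
    """
    pairs = []
    for title, linked_titles in links.items():
        target = linked_titles.get(target_lang)
        if target:  # skip articles without a link into the target language
            pairs.append((title, target))
    return sorted(pairs)


# Made-up sample data for illustration only.
sample_links: InterlanguageLinks = {
    "Dictionary": {"de": "Wörterbuch", "ja": "辞書"},
    "Machine translation": {"de": "Maschinelle Übersetzung"},
    "Bitext": {},
}

german_pairs = extract_pairs(sample_links, "de")
japanese_pairs = extract_pairs(sample_links, "ja")
```

Articles with no link into the target language, such as "Bitext" in the sample, simply produce no pair, which is one reason the coverage of this baseline is limited.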

We have developed a method that analyzes not only interlanguage links in Wikipedia but also redirect pages and anchor texts, increasing the number of term-translation pairs in the dictionary while maintaining relatively high accuracy. Since not all term-translation pairs extracted by our method are correct, we use supervised learning to estimate the correctness of each extracted pair based on various characteristics (features).
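As a rough illustration of how redirect pages can enlarge the candidate set, the following Python sketch combines a title pair with the redirect titles on both sides and computes a few simple features per candidate. The redirect mappings, the function names, and the features are all assumptions chosen for illustration; the features actually used by our method are not listed here, and anchor texts and the classifier itself are left out.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def expand_with_redirects(
    pair: Pair,
    src_redirects: Dict[str, Set[str]],
    tgt_redirects: Dict[str, Set[str]],
) -> List[Pair]:
    """Combine a title pair with the redirect titles pointing to each article."""
    src, tgt = pair
    src_terms = [src] + sorted(src_redirects.get(src, set()))
    tgt_terms = [tgt] + sorted(tgt_redirects.get(tgt, set()))
    # Every source-side term is paired with every target-side term.
    return list(product(src_terms, tgt_terms))


def candidate_features(candidate: Pair, base_pair: Pair) -> Dict[str, float]:
    """Hypothetical features that a supervised classifier could consume."""
    src, tgt = candidate
    longest = max(len(src), len(tgt), 1)  # guard against empty strings
    return {
        "is_title_pair": float(candidate == base_pair),
        "src_is_redirect": float(src != base_pair[0]),
        "tgt_is_redirect": float(tgt != base_pair[1]),
        "length_ratio": min(len(src), len(tgt)) / longest,
    }


# Made-up sample data for illustration only.
base = ("United States", "Vereinigte Staaten")
candidates = expand_with_redirects(
    base,
    src_redirects={"United States": {"USA", "U.S."}},
    tgt_redirects={"Vereinigte Staaten": {"USA"}},
)
features = candidate_features(("USA", "USA"), base)
```

Pairing every source-side term with every target-side term increases coverage but also yields incorrect candidates, which is why each candidate is subsequently checked by a learned classifier rather than accepted outright.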


Journal Papers:

International Conferences and Workshops:

Domestic Conferences and Workshops:

Demonstrations and Posters:



Training/Test Data: