Bilingual Dictionary Construction
Bilingual dictionaries hold great potential for research areas such as machine translation and human-aided translation. Unfortunately, constructing bilingual dictionaries manually is expensive, and new or domain-specific terminology is difficult to cover. Therefore, much research has been conducted on the automatic extraction of bilingual dictionaries. Extraction from large parallel corpora (bitexts), in particular, has achieved impressive results. However, parallel corpora are available only for selected text domains and language pairs. For that reason, the potential of other resources is being explored as well.
We propose the extraction of bilingual terminology from large multilingual encyclopedias such as Wikipedia in order to complement bilingual dictionaries with accurate term-translation pairs for languages and text domains where no parallel corpora exist. Wikipedia is a very promising resource: the continuously growing encyclopedia already contains more than 16 million articles in over 270 languages, has a dense link structure, and covers a wide variety of topics.
In Wikipedia, there are many links between articles in different languages. If we regard the titles of Wikipedia articles as terminology, it is easy to extract translation relations by analyzing the interlanguage links, assuming that if two articles are connected by an interlanguage link, their titles are translations of each other.
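The idea above can be illustrated with a minimal sketch. It assumes the interlanguage links have already been read from a Wikipedia dump into a plain mapping from source-language title to target-language title; the variable `interlanguage_links`, the function `extract_title_pairs`, and the three German-English entries are hypothetical toy values, not data from our dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: German article title -> English article title
# reached through an interlanguage link. Real input would be parsed
# from a Wikipedia dump.
interlanguage_links: Dict[str, str] = {
    "Hund": "Dog",
    "Katze": "Cat",
    "Haus": "House",
}


def extract_title_pairs(links: Dict[str, str]) -> List[Tuple[str, str]]:
    """Treat every interlanguage link as one (term, translation) pair."""
    return sorted(links.items())


pairs = extract_title_pairs(interlanguage_links)
for term, translation in pairs:
    print(f"{term} -> {translation}")
```

Each article title yields at most one translation here, which is why coverage stays limited when only interlanguage links are used.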
We have developed a method that analyzes not only interlanguage links in Wikipedia but also redirect pages and anchor texts to extend the number of term-translation pairs in the dictionary while maintaining a relatively high accuracy. Since not all term-translation pairs extracted by our method are correct, we use supervised learning to analyze the correctness of each extracted term-translation pair based on various characteristics (features).
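The following sketch shows one way the candidate generation described above could look. It is an illustration under stated assumptions, not our actual implementation: redirects and anchor texts are assumed to be available as simple dictionaries, every name (`term_variants`, `candidate_pairs`, the feature keys) is hypothetical, and the toy data is invented. For each interlanguage link, the alternative names of both articles are collected and combined, and every candidate pair receives a small feature dictionary that a supervised classifier could later use to judge correctness.

```python
from typing import Dict, Tuple

Features = Dict[str, int]

# Hypothetical toy data (not taken from a real dump).
interlanguage_links = {"Vereinigte Staaten": "United States"}
# redirect page title -> title of the article it points to
source_redirects = {"USA": "Vereinigte Staaten"}
target_redirects = {"United States of America": "United States",
                    "US": "United States"}
# article title -> {anchor text used in links to it: number of uses}
source_anchor_counts: Dict[str, Dict[str, int]] = {}
target_anchor_counts = {"United States": {"United States": 120,
                                          "America": 30,
                                          "here": 1}}


def term_variants(title: str, redirects: Dict[str, str],
                  anchor_counts: Dict[str, Dict[str, int]]) -> Dict[str, Features]:
    """Map each alternative name of `title` to the evidence that produced it."""
    def empty() -> Features:
        return {"is_title": 0, "is_redirect": 0, "anchor_count": 0}

    found = {title: empty()}
    found[title]["is_title"] = 1
    for redirect, target in redirects.items():
        if target == title:
            found.setdefault(redirect, empty())["is_redirect"] = 1
    for text, count in anchor_counts.get(title, {}).items():
        found.setdefault(text, empty())["anchor_count"] = count
    return found


def candidate_pairs() -> Dict[Tuple[str, str], Features]:
    """Combine the variants on both sides of every interlanguage link."""
    candidates: Dict[Tuple[str, str], Features] = {}
    for src_title, tgt_title in interlanguage_links.items():
        src = term_variants(src_title, source_redirects, source_anchor_counts)
        tgt = term_variants(tgt_title, target_redirects, target_anchor_counts)
        for s, s_feat in src.items():
            for t, t_feat in tgt.items():
                features = {f"src_{k}": v for k, v in s_feat.items()}
                features.update({f"tgt_{k}": v for k, v in t_feat.items()})
                candidates[(s, t)] = features
    return candidates


candidates = candidate_pairs()
for pair, features in candidates.items():
    print(pair, features)
```

In this toy example the anchor text "here" produces incorrect candidates such as ("USA", "here"), which illustrates why a correctness check on the extracted pairs is needed.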
Publications
Journal Papers:
- M. Erdmann, K. Nakayama, T. Hara, S. Nishio: Extraction of Bilingual Terminology from a Multilingual Web-based Encyclopedia, IPSJ Journal of Information Processing (Jul. 2008).
- M. Erdmann, K. Nakayama, T. Hara, S. Nishio: Improving the Extraction of Bilingual Terminology from Wikipedia, ACM Transactions on Multimedia Computing, Communications and Applications (Oct. 2009).
- K. Nakayama, M. Ito, M. Erdmann, M. Shirakawa, T. Michishita, T. Hara, S. Nishio: Wikipedia Mining - Challenge for Realizing Early Profits (Japanese), JSAI Journal (Oct. 2009).
- K. Nakayama, M. Ito, M. Erdmann, M. Shirakawa, T. Michishita, T. Hara, S. Nishio: Wikipedia Mining - A Survey on Wikipedia Research (Japanese), IPSJ Journal of Information Processing (Dec. 2009).
International Conferences and Workshops:
- M. Erdmann, K. Nakayama, T. Hara, S. Nishio: An Approach for Extracting Bilingual Terminology from Wikipedia, Proc. of DASFAA (Mar. 2008).
- K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, S. Nishio: Wikipedia Mining - Wikipedia as a Corpus for Knowledge Extraction, Proc. of Wikimania (Jul. 2008).
- M. Erdmann, K. Nakayama, T. Hara, S. Nishio: Using an SVM Classifier to Improve the Extraction of Bilingual Terminology from Wikipedia, Proc. of IJCAI workshop (Jul. 2009).
Domestic Conferences and Workshops:
- M. Erdmann, M. Miyamae, Y. Kishino, T. Terada, S. Nishio: Design and Implementation of a Rule-based Navigation Framework for Wearable Computing Environments, Techn. Reports of Wearable Computer Research and Development Organization (Jun. 2005).
- M. Erdmann, K. Nakayama, T. Hara, S. Nishio: Wikipedia Link Structure Analysis for Extracting Bilingual Terminology, Proc. of DBWS (Jul. 2007).
Demonstrations and Posters:
- M. Erdmann, K. Nakayama, T. Hara, S. Nishio: A Bilingual Dictionary Extracted from the Wikipedia Link Structure, Proc. of DASFAA Demo Session (Mar. 2008).
Prizes:
- DBWS Student Encouragement Award (Jul. 2007).
Download
Training/Test Data:
- term-translation-pairs-English-Japanese.sql (890 labeled English-Japanese term-translation pairs with scores)
- term-translation-pairs-German-English.sql (6266 labeled German-English term-translation pairs with features)