Proceedings of the
Tibetan Information Technology Panel
(Paper Abstracts)
Paul G. Hackett
University of Maryland at College Park
Automatic Segmentation and Part-Of-Speech Tagging for Tibetan: A First Step Towards Machine Translation
In recent years, great advances have been made in the field of Natural Language Processing (NLP). Much of this research has focused on dominant world languages such as English, French, German, Spanish, and Chinese. There has been limited research, however, on less commonly studied languages such as Tibetan. This paper reports on the current status of a research project to produce a fully automated Machine Translation (MT) utility for Tibetan. Of the three conceptually distinct components of an MT system, namely part-of-speech (POS) tagging, parsing, and generation, the first, POS tagging, has been successfully completed. The combined POS tagger / word segmenter was manually constructed as a rule-based multi-tagger relying on the Wilson formulation of Tibetan grammar. Partial parsing was also performed in combination with POS-tag sequence disambiguation. The component was evaluated on the task of document indexing for Information Retrieval (IR). Although segmentation is application-specific, preliminary analysis indicated performance slightly better than (though statistically comparable to) n-gram-based approaches on a known-item IR task, with error analysis placing segmentation accuracy at 99%. The accuracy of the POS tagger is also estimated at 99%, based on IR error analysis and random sampling.
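To make the contrast between the two indexing strategies concrete, the following is a minimal sketch, not the tagger described in the abstract. It assumes that syllables are delimited by the tsheg (U+0F0B), uses romanized placeholder syllables, and substitutes a greedy longest-match lookup against a hypothetical lexicon for the rule-based, POS-driven segmentation the abstract reports. The function names and the lexicon are illustrative assumptions.

```python
TSHEG = "\u0f0b"  # Tibetan inter-syllable mark (tsheg)


def syllables(text):
    """Split a string into syllables on the tsheg, dropping empty pieces."""
    return [s for s in text.split(TSHEG) if s]


def syllable_ngrams(sylls, n=2):
    """Index terms for an n-gram approach: every run of n adjacent syllables."""
    return [tuple(sylls[i:i + n]) for i in range(len(sylls) - n + 1)]


def longest_match_segment(sylls, lexicon, max_len=4):
    """Greedy longest-match segmentation against a lexicon of syllable tuples.

    This is a simplified stand-in (an assumption), not the rule-based
    multi-tagger from the abstract. Unknown syllables become one-syllable words.
    """
    words = []
    i = 0
    while i < len(sylls):
        for k in range(min(max_len, len(sylls) - i), 0, -1):
            candidate = tuple(sylls[i:i + k])
            if k == 1 or candidate in lexicon:
                words.append(candidate)
                i += k
                break
    return words


# Romanized placeholder input; real text would use Tibetan script.
text = TSHEG.join(["bod", "skad", "dang", "yi", "ge"]) + TSHEG
lexicon = {("bod", "skad"), ("yi", "ge")}  # hypothetical entries

sylls = syllables(text)
print(syllable_ngrams(sylls, 2))
print(longest_match_segment(sylls, lexicon))
```

The n-gram index emits overlapping terms that cross word boundaries, whereas the segmenter emits one term per word; this is the difference that the known-item retrieval evaluation compares.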
Karma Monlam
Tibetan Software Keymap
[Abstract unavailable.]
Tashi Tsering
University of Virginia
A Structural Design and Programming for the Project of Electronic Publication of the Tibetan Canon Kanjur and Tanjur
In 1986, the National Centre for Tibetan Studies in Beijing began to collect and collate the different editions of the Tanjur. Taking the Derge edition as the base, the project compared it sentence by sentence with three other Tanjur editions, recording each difference as an annotation appended to the end of each text. For the Kanjur, eight woodblock editions have been collected, and for the Tanjur, three other woodblock editions; the work of collation will be finished next year. Because electronic publication offers low cost, search and retrieval capability, and ease of storage and delivery, and because electronic publication technology and the Internet are advancing rapidly, the National Press for Tibetology is planning to publish the Kanjur and Tanjur in electronic form. It is also hoped that in the future the Kanjur and Tanjur can be made available worldwide via the Internet.
Christopher E. Walker
University of Chicago
Keymaps into Tibet: Anthropological Perspectives on the Unicode Computer Standard
[Abstract unavailable.]
Last updated: 27 May 2012