Sanskrit Computing Research
"If you don't need a new technique,
what you're saying is probably not new."
-- Philip Glass
Introduction
- Despite strong end-user communities and scholarly interest, electronic resources for Indo-Tibetan research have
remained scarce for a variety of reasons. Consequently, the compilation of large e-text corpora and lexical resources
has lagged behind that of other language groups.
- Two key areas to which I have devoted time in an effort to overcome these lacunae are Devanagari and Tibetan OCR, and
Sanskrit-Tibetan lexicon construction. I discuss some of my work in these areas below.
Devanagari OCR
- Because Indo-Tibetan Buddhist religion and philosophy are heavily grounded in texts,
a key resource in contemporary research is searchable e-text. While there are several
ongoing efforts to manually key Tibetan and Sanskrit texts, "Optical Character Recognition" (OCR)
technology for Tibetan and Devanagari scripts has lagged behind. To address this, I have been training
commercially available software (ABBYY FineReader) for Devanagari OCR for several years (and hope to train the software for Tibetan in the future).
- The end result of this effort is a system that offers 90-95% accuracy (testing and evaluation are still ongoing)
at the task of Devanagari OCR, producing Unicode e-text. Although the software has certain systemic weaknesses, I hope these will be resolved in a future version of the program, thereby improving accuracy. In the meantime, the system
offers a viable means of rapidly producing Sanskrit e-text for research purposes.
- After some additional training, I intend to make the recognition files freely available to scholars and researchers.
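The text above does not say how the 90-95% figure is measured. As one possible illustration only, the Python sketch below computes a character-level accuracy: one minus the edit distance between the OCR output and a hand-keyed reference, divided by the length of the reference. The function names and the choice of metric are assumptions made for this example, not a description of the evaluation actually used.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ocr_text, reference_text):
    """Hypothetical metric: 1 - (edit distance / reference length), floored at 0."""
    # Normalize both sides so equivalent Unicode sequences compare equal.
    ocr = unicodedata.normalize("NFC", ocr_text)
    ref = unicodedata.normalize("NFC", reference_text)
    if not ref:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - levenshtein(ocr, ref) / len(ref))


# One wrong code point out of four in the reference word.
print(char_accuracy("धर्न", "धर्म"))  # 0.75
```

Counting Unicode code points rather than visual aksharas is a simplification; a conjunct misread as a whole would count as several errors under this measure.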
Tibetan-Sanskrit Lexicons
- While many key dictionaries for Sanskrit and Tibetan have been published over the past 150 years,
migration of these resources into an electronic environment has been slow.
- In conjunction with Robert Chilton of the Asian Classics Input Project (ACIP), I have been
processing one of these resources, Lokesh Chandra's Tibetan-Sanskrit Dictionary, for import into an electronic
lexicon, both for general reference and for the refinement of a larger Tibetan Machine Translation lexicon (a subset of this data has been published in: A Tibetan Verb Lexicon).
- Lokesh Chandra's original dictionary has been scanned, and also electronically keyed by ACIP staff.
- I have written a series of post-processing programs to automatically tag the dictionary for importing into a database.
Although this work is on-going, a preliminary sample of the resulting data is given below:
<entry>
  <headword>KA</headword>
</entry>

<entry>
  <headword>KA</headword>
  <sense>
    <term lang="skt" scr="rom-dev">stambha</term>
    <term lang="skt" scr="rom-tib">STAM BHA</term>
    <citation>
      <citeRef>bo.ca.5.40kha</citeRef>
    </citation>
  </sense>
</entry>

<entry>
  <headword>KA KA</headword>
  <altheadword>K'A KA</altheadword>
  <sense>
    <term lang="skt" scr="rom-dev">ka#ka</term>
    <term lang="skt" scr="rom-tib">K'A KA</term>
    <citation>
      <citeRef>s*a.da#</citeRef>
    </citation>
  </sense>
</entry>

<entry>
  <headword>KA KA NI</headword>
  <sense>
    <senseNo>1</senseNo>
    <term lang="skt" scr="rom-dev">kapardaka</term>
    <term lang="skt" scr="rom-tib">KA PAR DA KA</term>
    <citation>
      <citeRef>s*a.da#</citeRef>
    </citation>
  </sense>
  <sense>
    <senseNo>2</senseNo>
    <term lang="skt" scr="rom-dev">ka#kan%i</term>
    <term lang="skt" scr="rom-tib">K'A KA nI</term>
    <citation>
      <citeRef>ma.vyu.9375</citeRef>
    </citation>
    <term lang="skt" scr="rom-dev">ka#kin%i#</term>
    <term lang="skt" scr="rom-tib">K'A KI n'I</term>
    <citation>
      <citeRef>s*a.da#</citeRef>
    </citation>
  </sense>
</entry>

Having secured distribution rights from Lokesh Chandra, ACIP intends to distribute this data (in both romanization and Unicode script)
for the benefit of the research community.
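To illustrate how tagged entries like the sample above might be prepared for database import, here is a minimal Python sketch that flattens each Sanskrit equivalent into one row. It is not the post-processing code described above. The `<lexicon>` wrapper element, the `flatten` function, and the row field names are assumptions introduced for this example; the sample shows entries without a root element, so one is added here to make the text parseable as XML.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the sample; the <lexicon> root is a hypothetical wrapper.
SAMPLE = """<lexicon>
<entry>
  <headword>KA</headword>
  <sense>
    <term lang="skt" scr="rom-dev">stambha</term>
    <term lang="skt" scr="rom-tib">STAM BHA</term>
    <citation><citeRef>bo.ca.5.40kha</citeRef></citation>
  </sense>
</entry>
<entry>
  <headword>KA KA NI</headword>
  <sense>
    <senseNo>1</senseNo>
    <term lang="skt" scr="rom-dev">kapardaka</term>
    <term lang="skt" scr="rom-tib">KA PAR DA KA</term>
    <citation><citeRef>s*a.da#</citeRef></citation>
  </sense>
  <sense>
    <senseNo>2</senseNo>
    <term lang="skt" scr="rom-dev">ka#kan%i</term>
    <term lang="skt" scr="rom-tib">K'A KA nI</term>
    <citation><citeRef>ma.vyu.9375</citeRef></citation>
    <term lang="skt" scr="rom-dev">ka#kin%i#</term>
    <term lang="skt" scr="rom-tib">K'A KI n'I</term>
    <citation><citeRef>s*a.da#</citeRef></citation>
  </sense>
</entry>
</lexicon>"""


def flatten(xml_text):
    """Return one dict per scr="rom-dev" term, with the term and citation that follow it."""
    rows = []
    for entry in ET.fromstring(xml_text).iter("entry"):
        headword = entry.findtext("headword")
        for sense in entry.findall("sense"):
            sense_no = sense.findtext("senseNo")  # None when the sense is unnumbered
            current = None
            for child in sense:  # children are visited in document order
                if child.tag == "term" and child.get("scr") == "rom-dev":
                    # A rom-dev term opens a new row; later siblings attach to it.
                    current = {"headword": headword, "sense_no": sense_no,
                               "term_rom_dev": child.text, "term_rom_tib": None,
                               "cite_refs": []}
                    rows.append(current)
                elif current is None:
                    continue  # skip senseNo and anything before the first term
                elif child.tag == "term" and child.get("scr") == "rom-tib":
                    current["term_rom_tib"] = child.text
                elif child.tag == "citation":
                    current["cite_refs"].extend(
                        ref.text for ref in child.findall("citeRef"))
    return rows


for row in flatten(SAMPLE):
    print(row)
```

Grouping by position rather than by nesting reflects the sample, where a single sense can hold several term pairs, each followed by its own citation. Entries with no sense, such as the first one in the sample, would produce no rows under this approach.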