Natural language processing (NLP) is the application of computational techniques to the analysis and synthesis of human language, both speech and text. The development of a corpus, which is a collection of machine-readable text sampled to be representative of a particular language, is an essential step in building NLP systems for that language. Such corpora exist for languages such as English, German, Chinese, Hindi, Bengali and Punjabi. However, not all of these corpora are easily accessible. In English, the most widely used corpus is the British National Corpus (BNC), which is popular among researchers because of its accessibility.
Where Khasi is concerned, no such corpus has been publicly available, and hence it is referred to as a resource-poor language in so far as the application of NLP is concerned. A major contribution in this field has been made with the release of the Khasi annotated corpus titled “Tham Khasi annotated corpus”, which is freely accessible through the European Language Resources Association (ELRA) via the link http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/. The corpus is manually tagged using the BIS (Bureau of Indian Standards) POS (Part-of-Speech) tagset formulated for Khasi, to ensure that its tagging is standardised with that of other Indian languages. The corpus was developed by Dr. Medari Janai Tham, who was recently awarded a Ph.D. by the Department of Computer Science and Engineering, Assam Don Bosco University, for her thesis ‘Shallow Parsing for Khasi’, under the supervision of Prof. Pushpak Bhattacharyya of IIT Bombay.
The details of the corpus, including the annotation scheme and the development of the Khasi NLP tools, are described in research papers published as part of her Ph.D. and available at https://grammarkhasi.in, which is also the companion website of the book “Ka Grammar Khasi Da Ka Jingdro” by the same author, published by Macmillan Education, India. The other contributions made by the scholar include the BIS Khasi tagset, a hybrid Khasi POS tagger, an HMM Khasi POS tagger, an NLTK Khasi POS tagger, an HMM Khasi shallow parser, a Khasi shallow parser using a bidirectional gated recurrent unit, and a seminar report on ‘Towards Standardization of Khasi language for Computational Purposes’, all available on the above-mentioned website. Some of the NLP tools for Khasi are available online at https://medaritham.pythonanywhere.com, where users and researchers can run any Khasi sentence and verify the response of the taggers and parser.