Central Institute of Indian Languages [CIIL] MISSION STATEMENT: Annotated, quality language data (both text and speech) and tools in Indian languages for individuals, institutions and industry for research and development, created in-house, through outsourcing and acquisition.

1.    Introduction

The Linguistic Data Consortium for Indian Languages (LDC-IL) is an Eleventh Plan Project of the Department of Higher Education, Ministry of Human Resource Development, allotted to the Central Institute of Indian Languages for implementation. The Project came into existence on April 1, 2007. However, real implementation with proper human resources commenced only in June 2008. As of today its staff have the following mother tongues: Assamese, Bengali, Bhojpuri, Gujarati, Hindi, Kannada, Maithili, Malayalam, Manipuri, Nepali, Oriya, Punjabi, Tamil, Telugu and Urdu.

It should be noted that the Indian Constitution recognizes 22 Scheduled Languages: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu. According to the Census of India 2001, there are also 100 non-scheduled languages, each spoken by not fewer than ten thousand persons in India. They are Adi, Afghani / Kabuli / Pashto, Anal, Angami, Ao, Arabic / Arbi, Balti, Bhili / Bhilodi, Bhotia, Bhumij, Bishnupuriya, Chakhesang, Chakru / Chokri, Chang, Coorgi / Kodagu, Deori, Dimasa, English, Gadaba, Gangte, Garo, Gondi, Halabi, Halam, Hmar, Ho, Jatapu, Juang, Kabui, Karbi / Mikir, Khandeshi, Kharia, Khasi, Khezha, Khiemnungan, Khond / Kondh, Kinnauri, Kisan, Koch, Koda / Kora, Kolami, Kom, Konda, Konyak, Korku, Korwa, Koya, Kui, Kuki, Kurukh / Oraon, Ladakhi, Lahauli, Lahnda, Lakher, Lalung, Lepcha, Liangmei, Limbu, Lotha, Lushai / Mizo, Malto, Maram, Maring, Miri / Mishing, Mishmi, Mogh, Monpa, Munda, Mundari, Nicobarese, Nissi / Dafla, Nocte, Paite, Parji, Pawi, Persian, Phom, Pochury, Rabha, Rai, Rengma, Sangtam, Savara, Sema, Sherpa, Shina, Simte, Tamang, Tangkhul, Tangsa, Thado, Tibetan, Tripuri, Tulu, Vaiphei, Wancho, Yimchungre, Zeliang, Zemi, Zou.

Among these, the following languages have a considerable number of speakers: Ao, Bhili / Bhilodi, Bhotia, Coorgi / Kodagu, English, Garo, Gondi, Ho, Khandeshi, Khasi, Kolami, Konyak, Kurukh / Oraon, Ladakhi, Lushai / Mizo, Malto, Miri / Mishing, Mundari, Nissi / Dafla, Rabha, Tangkhul, Tulu. Attempts shall be made to create the same kind of language resources in these languages. If there is demand from the research community, other languages shall also be taken up.

The mandate of the LDC-IL is to cover as many languages as possible in its endeavor to help Indian languages to absorb technology and develop to become vehicles of modern thought.


2.    Objectives

The major objectives of the consortium are:

  1. Become a repository of linguistic resources in the form of text and speech for all Indian languages.
  2. Set appropriate standards for data collection and storage of language corpora for different research and development activities.
  3. Support language technology development and sharing of tools for language-related data collection and management.
  4. Facilitate creation of such databases by different organizations which could contribute and enrich the main LDC-IL repository.
  5. Facilitate training and human resources development in these areas through workshops, seminars etc.
  6. Create and maintain the LDC-IL web-based services that would be the primary gateway for accessing resources.
  7. Design or provide help in creation of appropriate language technology,  based on the linguistic data for mass use, and
  8. Provide the necessary linkages between academic institutions, individual researchers and the general public.


3.    Major Areas of Creation of Language Resources

  • Text Corpora: monolingual and multilingual (including parallel)
  • Speech corpora
  • Sign language corpora
  • Corpora for Character Recognition
  • By-products like lexicon, thesauri, WordNet etc.


4.    Tasks

When we look at the major areas to be covered for developing the Language Resources and the broad objectives, the following tasks emerge before us.

  1. Establishing standards
  2. Creating language resources
  3. Annotating language data
  4. Building systems or helping to build systems
  5. Creating human resources through training, workshops, hands-on experience etc.
  6. Co-ordinating activities for language resources development


5.    Coverage

So far, most technology work on Indian languages has focused on machine translation, which is an important application in a multilingual country like India. What has remained on the back burner is the analysis of language data using computational techniques to understand the languages and their genres themselves. Different aspects of computational linguistics provide deeper insights into the structure and functioning of each language, and this power needs to be exploited. All languages deserve attention from researchers for their analysis and interpretation. Creating corpora and tools for their analysis in these languages will help in comparing and contrasting the structure and functioning of Indian languages. So, corpora for minor languages will also be collected, to the tune of around 3 to 5 million words in each language depending upon the availability of text. The same will be used for language development purposes.


6.    Text Corpora - Monolingual and Multilingual (Parallel)

Monolingual / Parallel Corpora (in millions of words): Scheduled Languages

Sl. No.  Language    1st Year  2nd Year  3rd Year  4th Year  5th Year  Total
1.       Assamese    2         2         2         2         2         10
2.       Bengali     2         2         2         2         2         10
3.       Bodo        0.6       0.6       0.6       0.6       0.6       3
4.       Dogri       0.6       0.6       0.6       0.6       0.6       3
5.       Gujarati    2         2         2         2         2         10
6.       Hindi       2         2         2         2         2         10
7.       Kannada     2         2         2         2         2         10
8.       Kashmiri    1         1         1         1         1         5
9.       Konkani     1         1         1         1         1         5
10.      Maithili    1         1         1         1         1         5
11.      Malayalam   2         2         2         2         2         10
12.      Manipuri    1         1         1         1         1         5
13.      Marathi     2         2         2         2         2         10
14.      Nepali      2         2         2         2         2         10
15.      Oriya       2         2         2         2         2         10
16.      Punjabi     2         2         2         2         2         10
17.      Sanskrit    0.4       0.4       0.4       0.4       0.4       2
18.      Santali     0.6       0.6       0.6       0.6       0.6       3
19.      Sindhi      0.6       0.6       0.6       0.6       0.6       3
20.      Tamil       2         2         2         2         2         10
21.      Telugu      2         2         2         2         2         10
22.      Urdu        2         2         2         2         2         10
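
The yearly targets in the table above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary simply restates the per-year figures from the table, and the names `YEARLY_TARGET` and `five_year_total` are introduced here for the example.

```python
from decimal import Decimal

# Per-year targets in millions of words, restated from the table above.
# Stored as strings so that Decimal keeps 0.6 and 0.4 exact.
YEARLY_TARGET = {
    "Assamese": "2", "Bengali": "2", "Bodo": "0.6", "Dogri": "0.6",
    "Gujarati": "2", "Hindi": "2", "Kannada": "2", "Kashmiri": "1",
    "Konkani": "1", "Maithili": "1", "Malayalam": "2", "Manipuri": "1",
    "Marathi": "2", "Nepali": "2", "Oriya": "2", "Punjabi": "2",
    "Sanskrit": "0.4", "Santali": "0.6", "Sindhi": "0.6", "Tamil": "2",
    "Telugu": "2", "Urdu": "2",
}
YEARS = 5


def five_year_total(language):
    """Five-year target for one language, in millions of words."""
    return Decimal(YEARLY_TARGET[language]) * YEARS


# Sum of the five-year targets over all 22 Scheduled Languages.
grand_total = sum(five_year_total(lang) for lang in YEARLY_TARGET)
```

With the figures in the table, the per-language totals match the last column, and the overall target for the Scheduled Languages comes to 164 million words over five years.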

Here, the text corpora include:

  • Both balanced and full text corpora covering various domains of language use. Domains have already been identified and are expanded as and when the need arises.
  • Parallel corpora across different languages
  • Newspaper corpora
  • Historical/inscriptional databases of Indian languages, which are important not only as living documents of Indian history but also for tracing the historical linguistics of Indian languages.
  • Comparative/descriptive/reference grammars shall be considered as corpora of databases.


7.    Tools for Corpora Management and Analysis

These are part and parcel of corpora creation and management process. Some of them are already done and others are in the pipeline. Also, as and when the need arises, necessary ones shall be prepared. Some of them are:

  1. Frequency analyzers for characters, words, sentences etc.
  2. KWIC and KWOC retrievers.
  3. Tool to gather Indian language text (HTML) data automatically from the internet.
  4. Tools for automatic transliteration across all Indian language scripts as well as Roman.
  5. Parallel corpora tools for text alignment, including sentence alignment tool and chunk alignment tool as well as an interface for aligning corpora.
  6. Tools for
      1. Speech Data warehousing
      2. Morphological analysis
      3. POS tagging
      4. Semantic tagging
      5. Syntactic tree bank
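
As an illustration of the first two tools in the list above, here is a minimal sketch of a word frequency analyzer and a KWIC (keyword in context) retriever. It is not the LDC-IL implementation. The function names and the sample sentence are invented for the example, and tokens are split on whitespace rather than with a `\w+` pattern, on the assumption that a regular-expression word class may break Indian-script words at dependent vowel signs.

```python
import string
from collections import Counter

# ASCII punctuation plus the danda and double danda used in several Indian scripts.
PUNCT = string.punctuation + "\u0964\u0965"


def tokenize(text):
    """Whitespace tokenizer that trims punctuation from token edges."""
    tokens = (tok.strip(PUNCT) for tok in text.split())
    return [tok for tok in tokens if tok]


def word_frequencies(tokens):
    """Count how often each word form occurs."""
    return Counter(tokens)


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, used only to exercise the functions.
sample = "the corpus is large. the corpus is balanced, and the tools read the corpus."
tokens = tokenize(sample)
freq = word_frequencies(tokens)
hits = kwic(tokens, "corpus", width=2)
```

`freq.most_common(10)` then gives a frequency list, and each entry of `hits` can be printed as one concordance line with the keyword in the middle.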


8.    Computational Grammars for Indian Languages

Language technology work carried out in Indian languages to date, including dictionaries, thesauri, wordnets, morphological analyzers and generators, spell checkers and POS tagging, is primarily at the level of words. Sentences, however, show rich and varied structures, and it is very important to be able to analyze, capture and utilize the syntactic structure of natural language sentences.

A syntactic system consists of a grammar and a parser. A grammar is a declarative specification of all valid structures. A parser is the procedural counterpart, whose task is to apply the given grammar to a given natural language sentence and produce a description of the structure of the sentence. Going beyond words and handling relevant linguistic phenomena at the level of sentences is essential for advanced language technology applications such as automatic translation, question-answering and automatic summarization. So, developing wide coverage computational grammars or robust parsing systems/deep parsers for Indian languages is the need of the hour.

A computational grammar must be so precise that a computing machine can mechanically apply it for parsing and generation. Human beings have a great deal of world knowledge and commonsense, which machines lack completely; machines therefore need a large quantity of data. In this context, the challenge is to develop grammars that do not require commonsense or world knowledge. Also, grammars meant for human users often discuss only special situations, exceptions etc., assuming that readers already know the usual basic rules. In contrast, a computational grammar must be comprehensive and include extensive and thorough knowledge of all possible grammatical real-world sentences, both simple and complex.

The process of developing a computational grammar involves the following tasks and subtasks which can be accomplished in phases:

Task 1:  Hierarchical POS tagset
Task 2:  Dictionary - (a) closed class words and (b) open class words
Task 3:  Morphological analyzer and generator
Task 4:  Manual POS annotation and development of an automatic tagger
Task 5:  Semantic tagging
Task 6:  Chunker
Task 7:  Treebanking
Task 8:  Shallow parser, which will eventually turn into a deep parser

Task 1: Hierarchical POS tagset - Indian languages exhibit a very rich system of morphology: words are long and complex, with several levels of suffixes and complex morpho-phonemic and morpho-syntactic changes at the junctures. A large, hierarchical tagset allows such complexity to be captured. A hierarchical tagset is morphologically rich and effective in reducing the confusions and inconsistencies that naturally come up with a large tagset. A hierarchical design gives greater flexibility, extensibility and re-usability. Since hierarchical tagsets are amenable to partial analysis too, the same tagset can be used at all levels - dictionary, morphological analysis, POS tagging, chunking and deep parsing.
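
The idea that a hierarchical tagset supports partial analysis can be sketched in a few lines. The dotted tag notation and the tag names below (`N.common.sg`, `V.finite.past`) are hypothetical, chosen only for illustration; they are not taken from ILPOST or any other actual tagset.

```python
def parse_tag(tag):
    """Split a dotted tag such as 'N.common.sg' into its levels."""
    return tuple(tag.split("."))


def subsumes(general, specific):
    """True if `general` is `specific` or one of its ancestors in the hierarchy."""
    g, s = parse_tag(general), parse_tag(specific)
    return s[:len(g)] == g


def coarsen(tag, depth):
    """Keep only the top `depth` levels of a tag (a partial analysis)."""
    return ".".join(parse_tag(tag)[:depth])
```

Under this assumed notation, a dictionary entry might carry only `N`, a morphological analyzer might refine it to `N.common.sg`, and a chunker that needs only the top-level category can call `coarsen(tag, 1)` without a separate tagset.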

Two major efforts in this direction have been taken up so far: one is ILPOST, initiated by Microsoft Research Lab, India, and the other is by the University of Hyderabad. These need to be compared and tested for comprehensiveness and efficiency. The most appropriate one would be adopted with necessary modifications (if any). It will then be tested for accuracy on a sample corpus of at least 10,000 words in as many languages as possible. This experience shall be consolidated into an annotation manual for tagging Indian languages across language families.

Task 2: Dictionary - Dictionaries of (a) closed class words and (b) open class words should be developed. The dictionary of closed class words should be comprehensive and complete. The dictionary of open class words can then be developed using available electronic dictionaries and text corpora as basic raw materials. Noun and verb morphology should be worked out in full detail. A Transfer Lexicon and Grammar (TransLexGram), or multilingual dictionary, containing rich word-level and verb-argument structure information will be developed in this way. It will be used as a parallel resource across all Indian languages. At each stage of development, the dictionary-morphology system should be tested on corpus data to verify coverage and performance both qualitatively and quantitatively.

Task 3: Morphological analyzers and generators - these would be developed using Finite State Automata.
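
A minimal sketch of the finite-state idea follows. The stem lexicon is stored as a character trie, which is a deterministic finite-state acceptor; for brevity, suffixes are looked up in a table rather than compiled into the automaton. The lexicon is a toy English one invented for readability, and all names (`STEMS`, `SUFFIXES`, `analyze`, `generate`) are assumptions of this example rather than part of any LDC-IL tool.

```python
# Toy lexicon, invented for illustration; not data from any LDC-IL resource.
STEMS = {"walk": "V", "talk": "V", "book": "N"}
SUFFIXES = {
    "V": {"": "base", "s": "3sg", "ed": "past", "ing": "prog"},
    "N": {"": "sg", "s": "pl"},
}
END = None  # accepting-state marker; a key that can never collide with a character


def build_trie(stems):
    """Build a character trie, i.e. a deterministic acceptor for the stems."""
    root = {}
    for stem in stems:
        node = root
        for ch in stem:
            node = node.setdefault(ch, {})
        node[END] = stem
    return root


TRIE = build_trie(STEMS)


def analyze(word):
    """Return every (stem, category, feature) reading of `word`."""
    readings = []
    node = TRIE
    for i, ch in enumerate(word):
        if ch not in node:
            break
        node = node[ch]
        if END in node:  # a stem ends here; check the remaining suffix
            stem = node[END]
            category = STEMS[stem]
            suffix = word[i + 1:]
            if suffix in SUFFIXES[category]:
                readings.append((stem, category, SUFFIXES[category][suffix]))
    return readings


def generate(stem, feature):
    """Inverse of analyze: build the surface form for a stem and feature."""
    category = STEMS[stem]
    for suffix, feat in SUFFIXES[category].items():
        if feat == feature:
            return stem + suffix
    raise ValueError(f"no {feature!r} form for {stem!r}")
```

Because the same tables drive both directions, `analyze` and `generate` stay consistent with each other. A real system for a morphologically rich language would also need rules for the sound changes at morpheme junctures, which this sketch leaves out.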

Task 4: Manual POS tagging and development of an automatic tagger - Once reasonably complete dictionaries and a high performance morphological analyzer are available, POS tagging can be done. Since in Indian languages POS information comes mostly from the morphology and not so much from the syntactic context, residual ambiguities could be studied and suitable (rule-based, statistical or hybrid) models for full POS tagging can be taken up.

A large POS-tagged corpus will be developed side by side for faster and more accurate training of the automatic tagger.

Task 5:  Semantic tagging - First a semantic tagset has to be evolved. A semantically tagged corpus needs to be developed for a word sense disambiguation system. But this task itself involves many sub-tasks, such as internal indexing of the wordnets and creation of sense-tagged corpora.

Task 6: Chunker - In Indian languages, chunks are often single words, but not always. Noun and Verb groups are usually multi-word chunks. Finite State Grammars have to be developed for chunkers. The aim should be to develop wide coverage, high performance, and robust chunkers. This will in turn force refinements and enhancements to the dictionary, morphology and POS tagging components.
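
A finite-state chunker over POS tags can be sketched with regular expressions, since a regular expression over a tag sequence describes a finite-state grammar. The tag names (`DET`, `ADJ`, `NOUN`, `AUX`, `VERB`, `ADV`), the two chunk rules and the English example are hypothetical and serve only to show the mechanism.

```python
import re

# Hypothetical chunk rules: each is a regular expression over bracketed tags.
CHUNK_RULES = [
    ("NP", re.compile(r"(<DET>)?(<ADJ>)*(<NOUN>)+")),  # noun group
    ("VG", re.compile(r"(<AUX>)*<VERB>")),             # verb group
]


def chunk(tagged):
    """Group a list of (word, tag) pairs into labelled chunks."""
    tags = [f"<{tag}>" for _, tag in tagged]
    tag_string = "".join(tags)
    # offsets[i] is the position in tag_string where token i's tag begins.
    offsets, pos = [], 0
    for t in tags:
        offsets.append(pos)
        pos += len(t)

    chunks, i = [], 0
    while i < len(tagged):
        for label, pattern in CHUNK_RULES:
            m = pattern.match(tag_string, offsets[i])
            if m and m.end() > m.start():
                n = m.group(0).count("<")  # number of tokens consumed
                chunks.append((label, [w for w, _ in tagged[i:i + n]]))
                i += n
                break
        else:
            # No rule applies: the token forms a single-word chunk.
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks


sentence = [("the", "DET"), ("old", "ADJ"), ("teacher", "NOUN"),
            ("has", "AUX"), ("been", "AUX"), ("reading", "VERB"),
            ("books", "NOUN"), ("quietly", "ADV")]
result = chunk(sentence)
```

Here the noun and verb groups come out as multi-word chunks, while `books` and `quietly` are single-word chunks. Errors found at this stage would point back to gaps in the dictionary, morphology or POS tagging components.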

Task 7: Treebanking - Creation of a treebank requires a high level of linguistic expertise, especially in syntactic structure. It is usually a slow process.

Task 8: Parsing systems - Initially, the aim is to develop a full parsing system for simple sentences and the simpler types of complex sentences, which are amenable to basic rules of clause structure. The focus at this stage will be mainly on sub-categorization frames of verbs and selectional restrictions. Careful evaluation on large-scale test data is essential for planning the further course of action to develop wide coverage, high performance grammars and syntactic parsing systems. A rich syntactic treebank or parsed corpus is required for automatic learning of the parsing system. A semantically tagged corpus is also required for a robust parsing system. In this process, shallow parsing can be resorted to for the analysis of complicated sentences. The aim is to develop wide coverage and robust partial parsing systems.

A tentative agenda for the preparation of computational grammars of Indian languages is given below. In implementing it, the tasks listed above form the horizontals and the languages form the verticals. It is possible that many horizontals will continue simultaneously. The table below gives the year in which work on each language shall start intensely. It is intended to have a shallow parser version at the end of three years, and a deep parser at the end of five years, from the year of commencement of work on any particular language.

I Year          II Year         III Year
1. Bengali      4. Assamese     7. Malayalam
2. Hindi        5. Nepali       8. Manipuri
3. Tamil        6. Urdu         9. Punjabi

The intention is not to duplicate work that other teams in the country may have done in the above areas. So, an attempt will be made to obtain the tools they have developed and assess the viability of utilizing them in the computational grammars proposed here. If found suitable for the purpose, they shall be used.

Since wide experience shall be gained by working on these languages, other scheduled languages shall be taken up from the fourth year onwards at the rate of three languages per year.


9.    Linguistic Research

The text corpora could also be used for research in

  • Phonology/phonetics
  • Lexical studies
  • Semantics
  • Pragmatics & Discourse analysis
  • Sociolinguistics
  • Dialectology & Variation studies
  • Stylistics
  • Language teaching
  • Historical linguistics
  • Psycholinguistics
  • Social psychology
  • Cultural studies

a.    Phonology

  • The corpora will provide the largest-ever sample of Indian language data, extending over a wide range of variation, e.g. age, gender, class and genre. Because the corpus will be representative, this will help in making generalizations about the language and make large-scale quantitative analyses possible.
  • The corpora will be based on natural language and hence, the findings are more likely to reflect “real life” language.

b.    Lexical studies

  • The corpora will be useful for preparing dictionaries and for examining the linguistic and non-linguistic associations of specific words: how common different words and their senses are, whether words have systematic associations with other words, and whether these relate to any particular register or dialect. Loan words can also be explored using the corpus. In lexicology, it can be used for studies of words, their meanings, elements, relations between words, word groups and the whole lexicon.
  • Textual and sociolinguistic information like register, genre, domain, gender, class, age etc. allows a more accurate description of the use of a particular lexical item. Part-of-speech and word sense annotation allows a precise grouping of words (i.e. phraseology). The corpus is also very helpful in studying collocational meanings.
  • Frequency counts can be produced and words divided on various bases, such as which words are the most common and the most uncommon, and where a particular word fits on the continuum from very common to very uncommon. This helps in understanding the patterns of use associated with a word.

c.    Semantics

  • The corpus establishes an objective approach to semantics that takes account of indeterminacy and gradience. When assigning meanings to linguistic terms, the corpus can provide objective criteria, as meanings depend on various syntactic, morphological and prosodic contexts. Thus, an empirical, objective indicator can be arrived at.
  • The notion of gradience is firmly established by using the corpus, as categories are not usually clear-cut and there are gradients of membership connected with frequency of inclusion.
  • Aspect, a core area of semantics, can be investigated using the corpus. It can provide a more refined aspectual classification of verb classes and situation types.

d.    Pragmatics & discourse analysis

  • Conversational corpora help in understanding how conversation works, with respect to lexical items and phrases that have conversational functions. They aid in identifying relevant discourse structures and in understanding the function of many lexical and grammatical features in larger discourse contexts. Historical pragmatics also relies heavily on corpora. Thus, corpora support quantitative, corpus-based approaches to pragmatics, which have been little explored so far.

e.    Sociolinguistics

  • Most corpus-based studies have been in the area of language and gender. Studies of other areas like sexism, femininity and sexual identity also rely on the corpus. The conclusions can be tested against existing sociolinguistic theories. General issues like language planning, language shift, language contact and language variety are also studied using the corpus. Patterns in the use of linguistic features along dimensions like sex, age and class can also be analyzed. In a multilingual environment, a corpus is very useful for all the above studies.

f.     Dialectology & variation studies

  • Language varieties from different geographical regions, and complex co-occurrence patterns among features in different dialects and registers, can be compared as well as described. The corpus can also be used for testing theories of language variation. As the corpus is representative, the conclusions drawn will represent the whole population. This answers questions such as how speech differs from writing, how texts vary with respect to linguistic variation, and how internal parts of texts vary within a single register.

g.    Stylistics

  • Stylistics is closely related to genres and language varieties, as features associated with situations of use are observed in stylistic shifts. The corpus therefore plays an important role in investigating these broader issues. Moreover, it helps in examining an author’s particular style (vocabulary, sentences etc.), in comparing one author’s work with another’s, and in investigating variation within a single author’s work. This can be extended to comparing language varieties too. As stylistics requires quantification to support its judgments, a corpus is not only useful but essential. The differences between written and spoken language can be analyzed with the help of the corpus, and variations between genres can be studied using subsamples of corpora. By taking account of frequent words, sequences and collocations, authorship attribution techniques can be refined to improve their results.

h.    Language teaching

  • The corpus exposes language learners to the kind of sentences they will encounter in real-life situations in different target settings; in other words, it gives a more exact description of language use. Existing syllabuses and teaching materials can be critically examined, as these materials need to be based on empirical and authentic evidence rather than tradition. Researchers can analyze relevant constructions or vocabulary both in textbooks and in the language itself. Teaching materials would then focus on common word forms, usage patterns and combinations. This research will feed into the development of teaching resources and the training of teachers.
  • Apart from this, the corpora can also help in exploring domain-specific language use and professional communication. Corpus use can address issues like theories, methods, problems and solutions in language teaching. Computer-assisted language learning becomes easier too. Finally, open-access online language data from different regional backgrounds will be an invaluable resource for researchers, tutors and students.

i.     Historical linguistics

  • The corpus can be used to look at a particular phenomenon diachronically and analyze its patterns. It helps in studying the evolution of the language through time and thus language change. It also aids in extending the “corpus” of a language.

j.     Psycholinguistics

  • The corpus helps in the study of speech errors in natural conversational language acquired from different sources. For hypothesizing about human language processing, a corpus is useful for analyzing language pathologies using data from linguistically impaired people. The corpus of normal children’s language helps in theorizing about the process of language development in humans.

k.    Social psychology

  • Social psychologists need naturalistic data against which to test theories. Though most human interaction takes place through the medium of speech, diaries, newspapers, company reports etc. also form part of naturally occurring texts. Moreover, conversational data can also be taken for analysis in social psychology. Thus, the corpus helps in studying how and why people attempt to explain things and how they regard their environment.

l.     Cultural studies

  • Corpus of a language suggests a picture of that specific culture. It shows how words reveal commonalities and differences between cultures. The frequencies of concepts in different categories can actually tell us about different cultures. Occurrences of words in different domains can also tell us a lot about that culture.

The above areas illustrate the range of corpus studies. The LDC-IL shall attempt some research into some of these areas as case studies, but it does not wish to fix targets.


10.  Speech Corpora

The objective of speech data collection is primarily to build speech recognition and synthesis systems for Indian languages. Voice user interfaces for IT applications and services are very important and are valued for their ease of access, especially in telephony-based applications. In a country like India, where the rate of literacy is low, Indian language speech interfaces can provide the masses with access to IT applications and services through the internet and/or telephones, so that people in semi-urban and rural parts of India will be able to access a wide range of services and information on health, agriculture, travel, etc. However, for this to become a reality, computers should be able to accept speech input in the user’s language and provide speech output. Also, in multilingual India, if speech technology is coupled with translation systems between the various Indian languages, services and information can be provided across languages more easily. Robust applications have not yet been developed because of the lack of appropriately annotated speech databases in Indian languages. Here, the focus is to:

  1. Develop tools that facilitate collection of high quality speech data

  2. Collect data that can be used for building speech recognition, speech synthesis and speech-to-speech translation from one language to another language spoken in India (including Indian English). As with text corpora, the main efforts on speech corpora so far have been on the engineering side. So, efforts shall also be made to collect
    • Child language corpora
    • Pathological speech/language data and
    • Speech error Data


Speech Recognition and Speech Synthesis

  • Speech to Speech translation for a pair of Indian languages
  • Command and control applications 
  • Speech-To-Text processing (Word processor or emails, Transcription of speech into mobile text messages)
  • Real time voice recognition (Railway reservation / enquiry, ATM, weather information, music on demand)
  • Screen Readers for Visually Impaired
  • Health care (Medical Transcription)
  • Multimodal interfaces to the computer in Indian languages
  • Real-time Voice Writing (Court reporting, speech in the Parliament, Press conferences)
  • Information delivery to people who are illiterate or do not know a language (illiterate people cannot read or write but can listen and understand; the same holds for people who cannot read or write a particular language)
  • Telephony (SMS reading on cell phones, e-mail reading on cell phones; an SMS cannot be sent to a fixed-line phone, but a TTS system can convert the text into speech and deliver it to a fixed-line phone)
  • Pronunciation evaluation in computer-aided language learning applications
  • Simple data entry (entering a credit card number, dialing a number)
  • Automatic translation (Speech to speech translation)
  • Robotics
  • Speech enabled Office Suite etc.

The LDC-IL shall collect, transcribe (at text, phonemic and phonetic levels) and annotate speech data and make it available to the community for research and development purposes.

Data set - 1: The speech data set consisting of the following components for each language shall be created for collecting speech corpora for speech recognition:

  1. Phonetically Balanced Vocabulary
  2. Phonetically Balanced Sentences
  3. Connected Text created using phonetically balanced vocabulary
  4. Date Format
  5. Command and Control Words
  6. Proper Nouns: 500 place names and 500 person names
  7. Most Frequent Words: 1000
  8. Form and Function Words
  9. News domain: news, editorial, essay - each text not less than 500 words
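
One simple way to approximate phonetic coverage when drawing up such sentence lists is greedy selection: repeatedly pick the candidate sentence that adds the most sound units not yet covered. The sketch below is an assumption about how this could be done, not the LDC-IL procedure. It covers only the presence of units, not their relative frequencies, and it expects each candidate to be already transcribed into a list of phoneme symbols. The sentence identifiers and symbols are invented.

```python
def greedy_select(candidates, max_sentences):
    """Pick sentences that together cover as many distinct units as possible.

    candidates: dict mapping a sentence id to its list of phoneme symbols.
    Returns the chosen ids (in selection order) and the set of covered units.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_sentences:
        # The sentence contributing the most not-yet-covered units wins;
        # ties go to the earliest candidate.
        best = max(remaining, key=lambda s: len(set(remaining[s]) - covered))
        gain = set(remaining[best]) - covered
        if not gain:
            break  # nothing new can be added
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Invented candidates, for illustration only.
candidates = {
    "s1": ["k", "a", "m"],
    "s2": ["a", "m"],
    "s3": ["t", "i", "k", "u"],
    "s4": ["p", "a"],
}
chosen, covered = greedy_select(candidates, max_sentences=10)
```

In this toy run `s2` is never chosen because it adds no unit that the other sentences have not already covered.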

Number of speakers: Data will be collected from a minimum of 300 speakers (150 male and 150 female) of each language. In addition, natural discourse data from various domains shall also be collected for Indian languages for research into spoken language.

Data set - 2: The speech data set consisting of the following components for each language shall be created for collecting speech corpora for speech synthesis.

  1. Phonetically Balanced Vocabulary
  2. Phonetically Balanced Sentences
  3. Connected Text created using phonetically balanced vocabulary
  4. Other texts from stories etc.

Data for speech synthesis shall be collected from a limited number of speakers - 3 male and 3 female - in a studio environment. They shall invariably have very good voice quality and be professional voice artists or media announcers.

Annotation of data: (a) Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels; (b) data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word and phrase levels. This will be in accordance with annotation guidelines already developed for the purpose.

Annotation tools: Tools will be developed for semiautomatic annotation of speech data and Praat will be used for manual annotation. These tools will also be useful for annotating speech synthesis databases.

Text Annotation Tools: POS taggers, phrase boundary markers, intonation markers etc.

LDC-IL Coding Standard: LDC-IL has developed its own coding standard for developing tools on the .NET framework, and will develop further coding standards for different Linux-based frameworks.

Telephone based data collection: Data set - 1 will be replicated using the telephone as the recording medium.

Coverage of languages: The priority is to cover all the Scheduled Languages and then take up non-scheduled languages. The list of languages is already given in the Introduction. After data collection, annotation at various levels shall be taken up.

I Year          II Year         III Year        IV Year
1. Bengali      7. Manipuri     13. Maithili    19. Sindhi
2. Hindi        8. Malayalam    14. Dogri       20. Oriya
3. Tamil        9. Punjabi      15. Bodo        21. Marathi
4. Telugu       10. Urdu        16. Konkani     22. Khasi
5. Assamese     11. Kannada     17. Santali     23. Tulu
6. Nepali       12. Gujarati                    24. Kodava

Pronunciation dictionaries and sound dictionaries based on the defined formats shall also be created in all the languages listed above.

Other uses of speech databases

In addition to the data sets described above, the data set for these uses shall include the word list and sentence list that the Central Institute of Indian Languages has created and standardized for the purpose. Major areas of application are given below.

a.    Phonetics and phonology

Annotated speech databases will provide insights into the production, acoustics and perception of spoken language. Corpus-based phonological analysis typically involves defining a formal model, systematically testing it against data, and comparing it with other models. (In some cases, the model may be incorporated into a software system, e.g. for generating natural intonation in a text-to-speech system.) In this exploration and analysis - sorting, searching, tabulating, defining, testing and comparing - the principal task is computational.

b.    Speech pathology

Developmental language disorders occur in children who do not develop functional language skills. Language disorders belong to a broad category of disorders called communication disorders that also include speech and hearing disorders. Receptive language impairments refer to a difficulty understanding language at the level of meaning. The vocabulary range is usually very limited. The purpose of simple grammatical constructions is also not properly understood.

In this field various types of pathological speech are studied, ranging from mild disorders such as hoarseness to severe disorders such as aphasia. The aim of most studies of pathological speech is to find therapies that can alleviate or cure the speech disorders of interest. Aphasia can also be a subject of psycholinguistic studies, because such language disorders can shed some light on underlying mental processes. Corpora of pathological speech are very useful for these purposes. These corpora are also useful for developing automatic classification of speech pathologies.

Tools based on speech corpora are particularly useful for people who have one or more of the following features.

  1. Sound substitutions in words, and difficulty in processing sounds into syllables and words.
  2. Inappropriate use of forms, limited vocabulary development and inability to follow directions.

Speech corpora help to list the most frequent words, phones and phonemes. With the help of multimedia-based annotated speech data and other speech analysis tools (like CATSCAR tools), training can be given to people with speech disorders and to their family members. People with developmental language disorders can benefit from special education programs, usually monitored by a speech pathologist, based on speech corpora tools.
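Listing the most frequent words, phones or phonemes amounts to a simple count over transcribed data. The following is a minimal sketch in Python, assuming the corpus is already available as lists of tokens; the function name and the sample transcriptions are hypothetical and do not come from any LDC-IL tool.

```python
from collections import Counter


def most_frequent(transcripts, n=3):
    """Return the n most frequent units across a list of token sequences.

    Each transcript is a list of units (words, phones or phonemes).
    """
    counts = Counter()
    for tokens in transcripts:
        counts.update(tokens)
    return counts.most_common(n)


# Hypothetical phoneme-level transcriptions, for illustration only.
phoneme_transcripts = [
    ["k", "a", "m", "a", "l"],
    ["m", "a", "l", "a"],
    ["k", "a", "l"],
]

top_two = most_frequent(phoneme_transcripts, n=2)
print(top_two)
```

The same function works for word-level transcripts; only the unit of tokenization changes.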

c.    Language learning and language teaching

A speech corpus can help in grammar and error diagnosis. It can aid conversational interaction and pronunciation learning. A spoken corpus shows us how we really use language, not how we are supposed to use it or how we use it when we are writing. It reveals how very different the spoken word is from the written word. Thus, a speech corpus can be used in constructing teaching materials, so that students study the language through materials that represent the language as it really is. One can also build automatic scoring of pronunciation quality for non-native speech. Methods from pattern and speech recognition are applied to develop appropriate feature sets for sentence-level and word-level scoring. Besides this, phoneme accuracy, duration score and recognition accuracy are computed. This can be part of computer assisted pronunciation training (CAPT), which is specially designed for second language learning. The goal of CAPT is to provide instantaneous, individualized feedback to the user on the overall quality of pronunciation, since teachers in classrooms are sometimes unable to focus on individuals. It can correct the learner’s segmental and prosodic mistakes. CAPT also helps in providing a private, stress-free and self-paced learning environment.
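One common way to obtain a phoneme accuracy figure is to compare the phoneme sequence a learner was expected to produce with the sequence a recognizer actually heard, using edit distance. The sketch below illustrates that idea only; it is an assumption about how such a score could be computed, not the method of any particular CAPT system, and the phoneme symbols are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # phoneme in ref missing from hyp
                curr[j - 1] + 1,     # extra phoneme in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_accuracy(ref, hyp):
    """Accuracy in [0, 1]: 1 minus the error rate relative to the reference."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


# Hypothetical target and recognized sequences for one word.
expected = ["n", "a", "m", "a", "s", "t", "e"]
recognized = ["n", "a", "m", "a", "s", "t", "a"]
print(round(phoneme_accuracy(expected, recognized), 3))
```

A duration score or recognition confidence would need timing and acoustic information from the recognizer, which this sketch does not model.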

d.    Socio-phonetics

Variation exists at every level of linguistic representation. It is now well known that the study of socially conditioned variation concentrates more on phonetics. The phonetic realization of any word varies according to the speaker, the context (linguistic and social), the topic, the intention and other factors. Thus, phonetic realizations are layered with social meaning. This is the object of study in socio-phonetics (the study of socially conditioned phonetic variation in speech), which makes full use of speech data. Speech data can be used for acoustic phonetic analysis and for analyzing variables, voice quality and discourse markers in different contexts. It helps towards an integrated understanding of how speech is produced, performed and perceived in its social context.

These are some of the illustrations of the use of speech databases. If time is available, efforts shall be made by the LDC-IL faculty to do some research into these aspects of speech.


11.  Indian Sign Language Corpora

The systematic linguistic study of sign language is about fifty years old. For a long time, it was generally believed that all sign languages are the same, gestural in origin, based on spoken language, and mutually intelligible to all deaf communities. On these accounts, the sign language used by deaf people was excluded from the league of natural human languages. However, pioneering work on American Sign Language established that sign language is independent of spoken language and is not mutually intelligible across deaf communities. This work also showed that sign language follows linguistic rules in its structure, organization and principles. Further research shows that the phonetic inventory varies from one sign language to another, and demonstrates linguistic constraints on sign formation as well as the rule-governed nature of sign language in its sub-lexical, lexical and sentential/clausal structure. The linguistic study of sign language has formally acknowledged and established sign language as a natural human language, the only difference being that it is expressed in a different modality.

As a result of large-scale descriptive studies of various sign languages, typological characteristics of sign language, language universals and modality universals have surfaced, contributing significantly to the understanding of natural human language. The cross-linguistic typological characteristics show that large aspects of sign languages are universal, yet cross-linguistic variations among sign languages are also observed. To mention a few: simultaneous morphology, verb agreement, classifier incorporation, aspectual modulation, reduplication, etc. are universal characteristics of sign languages, whereas fingerspelling, word order, constituent doubling, information packaging strategies, etc. show cross-linguistic variation. In addition, the linguistic study of sign language has contributed tremendously to shaping policies and practices regarding deafness and sign language in education as well as in socio-economic participation. In other words, the study of sign language has made a difference in the lives of deaf people. Two of the most significant pursuits have been research on and propagation of Indian Sign Language as a national sign language, and its use as a medium of deaf education.

Indian Sign Language is an SOV language, with asymmetry between embedded and matrix clauses in the information-neutral word order. It is a present vs. non-present language lacking overt tense markings and resorts to NP adverbs of time, tense neutralization and aspectual marking for time reference; wh-phrase is always clause final.

In the contemporary academic scenario, an increasing number of sign language researchers are seeking to create large corpora of sign language digital video data. Projects have begun in Australia, Ireland, the Netherlands, the United Kingdom, Greece, France, Spain and Germany, for example, and more are underway or being planned in other parts of the world.

In the coming years the Consortium plans to create quality, annotated sign (and text) data of Indian Sign Language. It is pioneering work not only in India but also in the Asian continent. However, for such a grand mission, sociolinguistic and methodological considerations are of crucial importance. Owing to the visuo-spatial modality employed in sign, as well as the individual history of sign language acquisition, and because the deaf community is to a large extent monolingual, the researcher’s knowledge of sign language is a primary requisite. The methodology employed for the study is deductive, followed by the elicitation of data through fieldwork in the various regions of India in association with deaf clubs, associations and schools. Along with the elicited data, observations from discourse and production data are also taken into account in building corpora of Indian Sign Language. Native signers’ judgment is taken into account for grammaticality facts. The criteria for a consultant/informant are as follows:

  1. Exposure to sign by three years of age.
  2. Capability and comfort with judging whether a sentence is grammatical or not.
  3. Daily contact with the signed language in the deaf community (for 10+ years).

The fieldwork will be carried out according to the following plan. Each field trip is expected to yield 3000 lexical items, 500 sentences and 10 production data items, enriching the corpora not only numerically but also in terms of regional, social, educational and other parameters of linguistic variation:

1. Northern India: Delhi, 1st year

2. Southern India: Mysore, 2nd year

3. North-eastern India: Shillong, 3rd year

4. Western India: Ichalkaranji, 4th year

5. Eastern India: Kolkata, 5th year

In order to avoid any erroneous external influence and to assure quality, the data is verified with a bilingual consultant. The data is recorded using digital videography. Later, the edited data is timeline-tagged and annotated so that it meets the basic requirements of applications and tools as well as of academic and pedagogical research.

Total Indian sign language corpora set at the end of five years shall be:

Lexical items         




Production data     


Standards for collecting and recording data shall be evolved in consultation with other teams working in the area.


12. Corpora for Character Recognition


Character Recognition refers to the conversion of printed or handwritten characters to a machine-interpretable form. The term has been used to address three very distinct language technologies with different applications.

“Online” handwriting recognition or Online HWR refers to the interpretation of handwriting captured dynamically using a handheld or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and also for imparting of handwriting skills using computers.
“Offline” handwriting recognition or Offline HWR refers to the interpretation of handwriting captured statically as an image. It can be used for the interpretation of handwriting already recorded on paper, ranging from filled-in forms to handwritten manuscripts.

Optical character recognition or OCR refers to the interpretation of printed text captured as an image. It can be used for conversion of printed or typewritten material such as books and documents into electronic form.

These different areas of language technology require different algorithms and linguistic resources. However, for convenience, they have been combined under the “character recognition” umbrella. They are all hard research problems because of the variety of writing styles and fonts encountered.


The LDC-IL aims at the following in this area:

  • Development of standards, tools and linguistic resources (datasets) for the fields of Online HWR, Offline HWR and OCR.
  • Promotion of development of these technologies.
  • Promotion of development of important and challenging applications of these technologies in the context of Indic languages and scripts.

This will be achieved in a variety of ways:

  • Standards development will primarily be via a mixture of email discussions and face-to-face meetings of working group members organized under the aegis of LDC-IL.
  • Tool development will be given as projects to technology institutions with the necessary inclination, skills and resources.
  • Linguistic data collection, annotation and validation will be given as projects to linguistics/computational linguistics departments of institutes and universities with the necessary inclination, skills and resources. However, for each linguistic resource developed, validation will be performed by a different institution from the one doing the collection and annotation.
  • Use of the linguistic resources for technology development will be promoted by arranging periodic competitions (for example, for recognition of online handwritten words in specific scripts) and by objective evaluation of performance.

Implementation Phases

Language data collection will be done in two phases.

Phase I (year 1-3)

  • Development of standards - Standards are key to the creation of shared linguistic resources. The LDC-IL will adopt established processes for proposing and advancing standards, working with international standards bodies wherever applicable. Standards will be proposed for datasets of online handwriting, offline handwriting and documents, and for printed characters.
  • Development of tools for data collection - The availability of good tools will allow researchers to start collecting data in different Indian scripts, and contribute data to LDC-IL. They are a must in order to extend support to all Indian scripts quickly. The design and development of tools for data collection and dataset creation in all three target technologies will be done.
  • Promotion of technology development for specific tasks in selected scripts - The LDC-IL will promote the development and implementation of technology for Online HWR, Offline HWR and OCR in the context of specific tasks and selected scripts. The tasks could be to interpret a:
    • line of handwriting captured using a handheld device or a tablet
    • form that has been filled in and scanned
    • page from a book

    Though all major Indian scripts are objects of research, Devanagari, Kannada, Bengali and Gujarati will be addressed to begin with. These offer considerable variety in terms of visual complexity (and hence in the challenge for recognition). Other scripts will be taken up in due course of time.
  • Development of linguistic resources in selected scripts - The working group will drive the creation of significant linguistic resources for the tasks and scripts outlined above. Some examples of linguistic resources are:
    • Online handwritten word samples from at least 500 writers in each script.
    • Samples of handwritten characters extracted from forms representing at least 500 writers, and at least 500 samples of each handwritten character in each script.
    • Synthetic data covering all printed characters and at least 1000 pages in each script

Phase II (year 4-5)

  • Refinement of standards - Since standardization requires consensus among creators and users of linguistic resources, it is expected that the process of standardization would continue as an activity beyond the first three years.
  • Refinement of tools - The tools created in the first phase will be continuously refined during this second phase, as more and more researchers start to use them and provide feedback and suggestions for improvement.
  • Extension of technology tasks and linguistic resources to remaining scripts - The technologies developed for the initial set of scripts will be adapted for other scripts during the second phase. As in the first phase, technology development in these scripts will be supported by the creation of linguistic resources, subject to budget constraints and interest from researchers working on those scripts.
  • Promotion of significant applications - A major activity during the second phase will be the promotion of significant applications with high potential impact on society. These will typically involve solving of challenging problems, multiple years of concerted effort, and close interaction between participating institutions and other researchers in India and abroad.
    It is envisaged that these applications will be developed for selected languages and scripts such as Hindi, and the same will be extended to other languages and scripts with participation from researchers from all over India in due course of time.


The following list of applications is meant to be indicative rather than exhaustive.

Handwriting Interface to Computers

Indian scripts are complex and not suitable for keyboard-based entry. Replacing the keyboard with a simpler and more natural interface based on handwriting would make computers much more accessible to the common man and to educators in particular.
Imagine that the keyboard is replaced with a special writing pad for handwriting input. As one writes, the writing is converted using HWR technology into words and entered into the target application. The solution would also need to support numerals, punctuation, and editing gestures, and functionally replace the keyboard.

Handwriting Tutor

The solution described above can also be adapted to provide computer-based instruction in handwriting to improve writing skills of school children, improve literacy as part of adult education programs, or allow literate adults to learn new scripts.

Automatic Forms Processing/Educational Testing

With millions of application forms filled in every year in Indian languages, especially in the education sector, a solution for automatically reading handwritten entries from scanned images of forms is clearly very valuable. As a result of a growing school-going population, manual evaluation of answer papers has become very difficult. By using Offline HWR technology, there is the possibility of automatically reading and evaluating responses (at least for fill-in-the-blanks questions, where there are one or a few correct answers).

The proposed solution is a complete forms-processing system that can be used to read handwriting from a scanned image of a paper form. The interpreted results can be stored into a database (for applications) or compared with correct responses (for educational testing).
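The comparison step for educational testing can be sketched as follows, assuming an Offline HWR engine has already produced a text string for each blank. The recognizer itself is outside this sketch, and the function names, question IDs and sample answers are hypothetical. Unicode normalization is included because the same Indic text can be encoded in more than one way.

```python
import unicodedata


def normalize(text):
    """Apply NFC normalization and collapse whitespace before comparing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def score_responses(recognized, answer_key):
    """Mark each question True if the recognized text is an accepted answer.

    recognized: {question_id: text produced by the HWR engine}
    answer_key: {question_id: set of accepted answers}
    """
    results = {}
    for question_id, accepted in answer_key.items():
        response = normalize(recognized.get(question_id, ""))
        results[question_id] = response in {normalize(a) for a in accepted}
    return results


# Hypothetical answer key and recognizer output.
answer_key = {"q1": {"दिल्ली"}, "q2": {"4", "four"}, "q3": {"Mysore"}}
recognized = {"q1": "  दिल्ली ", "q2": "5"}

print(score_responses(recognized, answer_key))
```

Exact matching after normalization is the simplest possible policy; a real system would also need to handle recognition confidence and near-miss spellings.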

Depending upon the availability of expertise and collaborative initiatives from outside, the work on character recognition shall be initiated.


13.  By-products like Lexicon, Thesauri, WordNet etc

  1. Creation of frequency dictionaries - five per year

1st Year | 2nd Year | 3rd Year | 4th Year | 5th Year

other Indian languages

  2. Multilingual, multi-directional dictionary - an ongoing process
  3. Aiding wordnet creation and collaborating with others for the same - an ongoing process
  4. A document management and data tracking system will be created to manage all of the data collected, processed, annotated and distributed for LDC-IL. Each document will be assigned a unique document ID, and there will be a link between the collected document and its associated files. This database will track information such as document ID, languages, data type, data partition, data medium, author information, date, location, source type, file format, file size, and others as required. The database shall also be upgraded so that the existing data is ready for use in new systems and techniques. However, backward compatibility of the database will be of major importance.
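The tracking record described in the last item can be pictured as a small data structure. The sketch below is one possible in-memory shape, written in Python for illustration only: the class names, field names and the use of random UUIDs for document IDs are all assumptions, and a production system would sit on a real database.

```python
from dataclasses import dataclass, field
from datetime import date
from uuid import uuid4


@dataclass
class DocumentRecord:
    """One tracked document; the fields mirror the metadata listed above."""
    languages: list[str]
    data_type: str          # e.g. "speech" or "text"
    data_partition: str     # e.g. "training" or "test"
    data_medium: str
    author: str
    created: date
    location: str
    source_type: str
    file_format: str
    file_size_bytes: int
    # Assumption: a random UUID serves as the unique document ID.
    document_id: str = field(default_factory=lambda: uuid4().hex)
    # Links from the collected document to its associated files.
    associated_files: list[str] = field(default_factory=list)


class DocumentRegistry:
    """Minimal in-memory registry keyed by document ID."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        if record.document_id in self._records:
            raise ValueError(f"duplicate document ID: {record.document_id}")
        self._records[record.document_id] = record
        return record.document_id

    def link_file(self, document_id, path):
        self._records[document_id].associated_files.append(path)

    def find_by_language(self, language):
        return [r for r in self._records.values() if language in r.languages]
```

A brief usage example with made-up values:

```python
registry = DocumentRegistry()
doc_id = registry.add(DocumentRecord(
    languages=["Bengali"], data_type="speech", data_partition="training",
    data_medium="digital audio", author="field team", created=date(2008, 6, 1),
    location="Kolkata", source_type="read speech", file_format="wav",
    file_size_bytes=1_048_576,
))
registry.link_file(doc_id, "annotations/sample.txt")
```

Keeping the ID and the file links on the record itself is what allows later format upgrades without losing the connection between a document and its annotations.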


14.  Licensing Policy

Licensing is an important issue for LDC-IL. The draft licensing policy shall be evolved through discussions within one year. It shall be finalized within one more year, by the time the annotated data is available for delivery.

15.  Evaluation

The data that the LDC-IL creates and obtains has to be evaluated. For each kind of data, tool, etc., metrics have to be evolved. Benchmarks, good standards, etc. have to be developed.

Within a one-year time frame, this shall be accomplished for the first set of tools. In the following year or years, the same shall be developed for other data and tools.

16.  Beyond Roadmap

Above all, and in addition to what LDC-IL has projected in the roadmap, the LDC-IL will respond positively to the specific language data needs of individuals, institutions and industry by taking up their requests on a priority basis for licensing purposes. In the beginning, the derivatives of the databases shall be licensed; after all the licensing issues are resolved, the databases themselves shall also be licensed.

Copyright © LDC-IL,
Central Institute of Indian Languages
Department of Higher Education
Ministry of Human Resource Development
Government of India
Manasagangothri, Hunsur Road, Mysore-570006, Karnataka, India.
Tel: (0821) 2515820 (Director)
Reception/PABX : (0821) 2345000
Fax: (0821) 2515032 (Off)