Road map | Official Website of Linguistic Data Consortium for Indian Languages

Road map

1. Introduction

The Linguistic Data Consortium for Indian Languages (LDC-IL) is an Eleventh Plan project of the Department of Higher Education, Ministry of Human Resource Development, allotted to the Central Institute of Indian Languages for implementation. The project came into existence on April 1, 2007. However, real implementation with proper human resources commenced only in June 2008. As of today, its staff includes speakers of the following mother tongues: Assamese, Bengali, Bhojpuri, Gujarati, Hindi, Kannada, Maithili, Malayalam, Manipuri, Nepali, Oriya, Punjabi, Tamil, Telugu and Urdu.

Note that the Indian Constitution recognizes 22 Scheduled Languages: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu. According to the Census of India 2001, there are also 100 non-scheduled languages, each spoken by not less than ten thousand persons in India. They are Adi, Afghani / Kabuli / Pashto, Anal, Angami, Ao, Arabic / Arbi, Balti, Bhili / Bhilodi, Bhotia, Bhumij, Bishnupuriya, Chakhesang, Chakru / Chokri, Chang, Coorgi / Kodagu, Deori, Dimasa, English, Gadaba, Gangte, Garo, Gondi, Halabi, Halam, Hmar, Ho, Jatapu, Juang, Kabui, Karbi / Mikir, Khandeshi, Kharia, Khasi, Khezha, Khiemnungan, Khond / Kondh, Kinnauri, Kisan, Koch, Koda / Kora, Kolami, Kom, Konda, Konyak, Korku, Korwa, Koya, Kui, Kuki, Kurukh / Oraon, Ladakhi, Lahauli, Lahnda, Lakher, Lalung, Lepcha, Liangmei, Limbu, Lotha, Lushai / Mizo, Malto, Maram, Maring, Miri / Mishing, Mishmi, Mogh, Monpa, Munda, Mundari, Nicobarese, Nissi / Dafla, Nocte, Paite, Parji, Pawi, Persian, Phom, Pochury, Rabha, Rai, Rengma, Sangtam, Savara, Sema, Sherpa, Shina, Simte, Tamang, Tangkhul, Tangsa, Thado, Tibetan, Tripuri, Tulu, Vaiphei, Wancho, Yimchungre, Zeliang, Zemi, Zou.

Among these, the following languages have a considerable number of speakers: Ao, Bhili / Bhilodi, Bhotia, Coorgi / Kodagu, English, Garo, Gondi, Ho, Khandeshi, Khasi, Kolami, Konyak, Kurukh / Oraon, Ladakhi, Lushai / Mizo, Malto, Miri / Mishing, Mundari, Nissi / Dafla, Rabha, Tangkhul, Tulu. Attempts shall be made to create the same kind of language resources in these languages. If there is demand from the research community, other languages shall also be taken up.

The mandate of the LDC-IL is to cover as many languages as possible in its endeavor to help Indian languages to absorb technology and develop to become vehicles of modern thought.

2. Objectives

The major objectives of the consortium are:

  1. Become a repository of linguistic resources in the form of text and speech for all Indian languages.
  2. Set appropriate standards for data collection and storage of language corpora for different research and development activities.
  3. Support language technology development and sharing of tools for language-related data collection and management.
  4. Facilitate creation of such databases by different organizations which could contribute and enrich the main LDC-IL repository.
  5. Facilitate training and human resources development in these areas through workshops, seminars etc.
  6. Create and maintain the LDC-IL web-based services that would be the primary gateway for accessing resources.
  7. Design or provide help in creation of appropriate language technology, based on the linguistic data for mass use, and
  8. Provide the necessary linkages between academic institutions, individual researchers and the general public.

3. Major Areas of Creation of Language Resources

4. Tasks

When we look at the major areas to be covered for developing the Language Resources and the broad objectives, the following tasks emerge before us.

  1. Establishing standards
  2. Creating language resources
  3. Annotating language data
  4. Building systems / helping with system building
  5. Creating human resources through training, workshops, hands-on experience, etc.
  6. Co-ordinating activities for language resources development

5. Coverage

Major work in India on technology for Indian languages has so far focused on machine translation, which is an important application in a multilingual country like India. What has remained on the back burner is the analysis of language data using computational techniques to understand the languages and their genres themselves. Different aspects of computational linguistics provide deeper insights into the structure and functioning of each language, and this power needs to be exploited. All languages deserve attention from researchers for their analysis and interpretation. Creating corpora, and tools for their analysis, in these languages will help in comparing and contrasting the structure and functioning of Indian languages. Corpora for minor languages will therefore also be collected, depending upon the availability of text, and will be used for language development purposes.

6. Text Corpora - Monolingual and Multilingual (Parallel)

Monolingual / Parallel Corpora (in millions of words): Scheduled Languages

Sl. no.  Language   1st year  2nd year  3rd year  4th year  5th year  Total
1.       Assamese   2         2         2         2         2         10
2.       Bengali    2         2         2         2         2         10
3.       Bodo       0.6       0.6       0.6       0.6       0.6       3
4.       Dogri      0.6       0.6       0.6       0.6       0.6       3
5.       Gujarati   2         2         2         2         2         10
6.       Hindi      2         2         2         2         2         10
7.       Kannada    2         2         2         2         2         10
8.       Kashmiri   1         1         1         1         1         5
9.       Konkani    1         1         1         1         1         5
10.      Maithili   1         1         1         1         1         5
11.      Malayalam  2         2         2         2         2         10
12.      Manipuri   1         1         1         1         1         5
13.      Marathi    2         2         2         2         2         10
14.      Nepali     2         2         2         2         2         10
15.      Odia       2         2         2         2         2         10
16.      Punjabi    2         2         2         2         2         10
17.      Sanskrit   0.4       0.4       0.4       0.4       0.4       2
18.      Santali    0.6       0.6       0.6       0.6       0.6       3
19.      Sindhi     0.6       0.6       0.6       0.6       0.6       3
20.      Tamil      2         2         2         2         2         10
21.      Telugu     2         2         2         2         2         10
22.      Urdu       2         2         2         2         2         10
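
The arithmetic behind the table can be checked with a short script. This is only an illustrative sketch: the names `YEARLY_TARGET`, `YEARS` and `five_year_total` are hypothetical and not part of any LDC-IL tool. The per-year figures are copied from the table, and decimal arithmetic is used so that fractional targets such as 0.6 add up exactly.

```python
from decimal import Decimal

# Per-year targets in millions of words, copied from the table above.
# Each language has the same target in each of the five years.
YEARLY_TARGET = {
    "Assamese": "2", "Bengali": "2", "Bodo": "0.6", "Dogri": "0.6",
    "Gujarati": "2", "Hindi": "2", "Kannada": "2", "Kashmiri": "1",
    "Konkani": "1", "Maithili": "1", "Malayalam": "2", "Manipuri": "1",
    "Marathi": "2", "Nepali": "2", "Odia": "2", "Punjabi": "2",
    "Sanskrit": "0.4", "Santali": "0.6", "Sindhi": "0.6", "Tamil": "2",
    "Telugu": "2", "Urdu": "2",
}
YEARS = 5  # length of the plan period in the table


def five_year_total(language: str) -> Decimal:
    """Return the five-year target for one language, in millions of words."""
    return Decimal(YEARLY_TARGET[language]) * YEARS


# Combined target across all 22 scheduled languages, in millions of words.
grand_total = sum(five_year_total(lang) for lang in YEARLY_TARGET)

if __name__ == "__main__":
    for lang in YEARLY_TARGET:
        print(f"{lang:<10} {five_year_total(lang)}")
    print("All languages:", grand_total)
```

Summed this way, the per-language totals in the last column come to 164 million words across the 22 scheduled languages.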

Here, the text corpora include:

7. Tools for Corpora Management and Analysis

These tools are an integral part of the corpora creation and management process. Some of them are already complete and others are in the pipeline; further tools shall be prepared as and when the need arises. Some of them are:

  1. Frequency analyzers for characters, words, sentences, etc.
  2. KWIC and KWOC retrievers.
  3. Tool to gather Indian language text (HTML) data automatically from the internet.
  4. Tools for automatic transliteration across all Indian language scripts as well as Roman.
  5. Parallel corpora tools for text alignment, including sentence alignment tool and chunk alignment tool as well as an interface for aligning corpora.
  6. Tools for
    1. Speech Data warehousing
    2. Morphological analysis
    3. POS tagging
    4. Semantic tagging
    5. Syntactic tree bank
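
As an illustration of the first two kinds of tool in the list above, a frequency analyzer and a KWIC (keyword in context) retriever can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the LDC-IL implementation: the function names are hypothetical, and tokenization is naive whitespace splitting with edge punctuation (including the danda) stripped, which a real tool for Indian languages would need to refine.

```python
from collections import Counter

# Punctuation stripped from token edges; the danda (U+0964) is included
# because it ends sentences in several Indian scripts. This set is an
# assumption made for illustration only.
EDGE_PUNCTUATION = ".,;:!?\"'()\u0964"


def tokenize(text: str) -> list[str]:
    """Split on whitespace and strip punctuation from the token edges."""
    tokens = (word.strip(EDGE_PUNCTUATION) for word in text.split())
    return [tok for tok in tokens if tok]


def word_frequencies(text: str) -> Counter:
    """Count how often each word form occurs."""
    return Counter(tokenize(text))


def char_frequencies(text: str) -> Counter:
    """Count how often each non-space character occurs."""
    return Counter(ch for ch in text if not ch.isspace())


def kwic(text: str, keyword: str, width: int = 2) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every occurrence."""
    tokens = tokenize(text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    sample = "the cat sat on the mat. the dog sat too."
    print(word_frequencies(sample).most_common(2))
    for left, key, right in kwic(sample, "sat", width=1):
        print(f"{left:>10} [{key}] {right}")
```

A KWOC (keyword out of context) listing could reuse the same matches, printing the keyword as a heading followed by the full sentence instead of a fixed-width window.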

8. Linguistic Research

The text corpora could also be used for research in

  a. Phonology
  b. Lexical studies
  c. Semantics
  d. Pragmatics & discourse analysis
  e. Sociolinguistics
  f. Dialectology & variation studies
  g. Stylistics
  h. Language teaching
  i. Historical linguistics
  j. Psycholinguistics
  k. Social psychology
  l. Culture studies

The areas above illustrate the scope of corpus studies. The LDC-IL shall attempt research in some of these areas as case studies, but it does not intend to fix targets for this work.

9. Speech Corpora

The primary objective of speech data collection is to build speech recognition and synthesis systems for Indian languages. Voice user interfaces for IT applications and services are very important and are valued for their ease of access, especially in telephony-based applications. In a country like India, where the rate of literacy is low, Indian language speech interfaces can give the masses access to IT applications and services through the internet and/or telephones. People in semi-urban and rural parts of India will then be able to use telephones and the internet to access a wide range of services and information on health, agriculture, travel, etc. However, for this to become a reality, computers should be able to accept speech input in the user’s language and provide speech output. Also, in multilingual India, if speech technology is coupled with translation systems between the various Indian languages, services and information can be provided across languages more easily. Robust applications have not yet been developed because of the lack of appropriately annotated speech databases in Indian languages. Here, the focus is to:

  1. Develop tools that facilitate collection of high quality speech data
  2. Collect data that can be used for building speech recognition, speech synthesis and speech-to-speech translation from one language to another language spoken in India (including Indian English). As with applications in the area of text corpora, the main efforts in speech corpora are on the engineering side. Beyond these, efforts shall also be made to collect
    • Child language corpora
    • Pathological speech/language data and
    • Speech error data

Applications

Speech Recognition and Speech Synthesis

The LDC-IL shall collect, transliterate (text, phonemic and phonetic level) and annotate speech data and make it available for research and development purposes to the community.

Data set - 1: The speech data set consisting of the following components for each language shall be created for collecting speech corpora for speech recognition:

  1. Phonetically Balanced Vocabulary
  2. Phonetically Balanced Sentences
  3. Connected Text created using phonetically balanced vocabulary
  4. Date Format
  5. Command and Control Words
  6. Proper Nouns: 500 place names and 500 person names
  7. Most Frequent Words: 1000
  8. Form and Function Words
  9. News domain: news, editorial, essay - each text not less than 500 words

Number of speakers: Data will be collected from a minimum of 300 speakers (150 male and 150 female) of each language. In addition, natural discourse data from various domains shall be collected for Indian languages for research into spoken language.

Data set - 2: The speech data set consisting of the following components for each language shall be created for collecting speech corpora for speech synthesis.

  1. Phonetically Balanced Vocabulary
  2. Phonetically Balanced Sentences
  3. Connected Text created using phonetically balanced vocabulary
  4. Other texts from stories, etc.

Data for speech synthesis shall be collected from a limited number of speakers, 3 male and 3 female, in a studio environment. They shall invariably have very good voice quality and be professional voice artists or media announcers.

Annotation of data:

(a) Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels

(b) Data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word, and phrase level. This will be in accordance with annotation guidelines already developed for the purpose.

Annotation tools: Tools will be developed for semiautomatic annotation of speech data and Praat will be used for manual annotation. These tools will also be useful for annotating speech synthesis databases.

Text Annotation Tools: POS taggers, phrase boundary markers, intonation markers, etc.

LDC-IL Coding Standard: LDC-IL has developed its own coding standard for the development of tools on the .NET framework, and will develop further coding standards for different Linux-based frameworks.

Telephone based data collection: The data set - 1 will be replicated using telephone as medium of recording.

Coverage of languages: The priority is to cover all the Scheduled languages and then take up other non-scheduled languages. The list of languages is already given in introduction. After data collection, annotation at various levels shall be taken up.

I year        II year        III year      IV year
1. Bengali    7. Manipuri    13. Maithili  19. Sindhi
2. Hindi      8. Malayalam   14. Dogri     20. Odia
3. Tamil      9. Punjabi     15. Bodo      21. Marathi
4. Telugu     10. Urdu       16. Konkani   22. Khasi
5. Assamese   11. Kannada    17. Santali   23. Tulu
6. Nepali     12. Gujarati   18. Kashmiri  24. Kodava

Pronunciation dictionaries and sound dictionaries based on the defined formats shall also be created in all the languages listed above.

Other uses of speech databases

In addition to the data sets described above, the data set for these uses shall include the word list and sentence list that the Central Institute of Indian Languages has created and standardized for the purpose. Major areas of application are given below.

  1. Phonetics and phonology

    An annotated speech database will provide insights into the production, acoustics and perception of spoken language. Corpus-based phonological analysis typically involves defining a formal model, systematically testing it against data, and comparing it with other models. (In some cases, the model may be incorporated into a software system, e.g. for generating natural intonation in a text-to-speech system.) In this exploration and analysis (sorting, searching, tabulating, defining, testing and comparing) the principal task is computational.

  2. Speech pathology

    Developmental language disorders occur in children who do not develop functional language skills. Language disorders belong to a broad category of disorders called communication disorders, which also includes speech and hearing disorders. Receptive language impairments refer to a difficulty in understanding language at the level of meaning. The vocabulary range is usually very limited, and the purpose of simple grammatical constructions is also not properly understood.

    In this field, various types of pathological speech are studied, ranging from mild disorders such as hoarseness to severe disorders such as aphasia. The aim of most studies of pathological speech is to find therapies that can alleviate or cure the speech disorders of interest. Aphasia can also be a subject of psycholinguistic studies, because such language disorders can shed some light on underlying mental processes. Corpora of pathological speech are very useful for these purposes. These corpora are also useful for the development of automatic classification of speech pathologies.

    Speech corpora based tools are particularly useful for people who have one or more of the following features:

    1. Sound substitutions in words, difficulty in processing sounds into syllables and words.
    2. Inappropriate use of forms, limited vocabulary development and inability to follow directions.

    Speech corpora are helpful for listing the most frequent words, phones and phonemes. With the help of multimedia-based annotated speech data and other speech analysis tools (like CATSCAR tools), training can be given to people with speech disorders and their family members. People with developmental language disorders can benefit from special education programs, usually monitored by a speech pathologist, based on the speech corpora tools.

  3. Language learning and language teaching

    A speech corpus can help in grammar and error diagnosis. It can aid conversational interaction and pronunciation learning. A spoken corpus shows us how we really use language, not how we are supposed to use it or how we use it when we are writing. It reveals how very different the spoken word is from the written word. Thus, a speech corpus can be used in constructing teaching materials, and students are able to study the language through materials that represent the language as it really is. One can also build automatic scoring of pronunciation quality for non-native speech. Methods from pattern and speech recognition are applied to develop appropriate feature sets for sentence- and word-level scoring. Besides this, phoneme accuracy, duration score and recognition accuracy are identified. This can be part of computer-assisted pronunciation training (CAPT), which is specially designed for second language learning. The goal of CAPT is to provide instantaneous, individualized feedback to the user on the overall quality of pronunciation, since teachers in classrooms are sometimes unable to focus on individuals. It is able to correct learners’ segmental and prosodic mistakes. CAPT also helps by providing a private, stress-free and self-paced learning environment.

  4. Socio-phonetics

    Variation exists at every level of linguistic representation. It is now a well-known fact that the study of socially conditioned variation concentrates more on phonetics. The phonetic realization of any word varies according to the speaker, the context (linguistic and social), the topic, the intention and other factors. Thus, phonetic realizations are richly layered with social meaning. This is the object of study in socio-phonetics (the study of socially conditioned phonetic variation in speech), which makes full use of speech data. Such data can be used for acoustic phonetic analysis and for analyzing variables, voice quality and discourse markers in different contexts. It helps build an integrated understanding of how speech is produced, performed and perceived in its social context.

These are some illustrations of the use of speech databases. If time is available, efforts shall be made by the LDC-IL faculty to do some research into these aspects of speech.

10. Indian Sign Language Corpora

The systematic linguistic study of sign language is about fifty years old. For a long time, it was generally believed that all sign languages are the same, are gestural in origin and based on spoken language, and are mutually intelligible across all deaf communities. On these grounds, the sign language used by deaf people was excluded from the class of natural human languages. However, pioneering work on American Sign Language established that sign language is independent of spoken language and is not mutually intelligible across all deaf communities. This work also showed that sign language follows linguistic rules in its structure, organization and principles. Further research has shown that the phonetic inventory varies from one sign language to another, and has documented linguistic constraints on sign formation as well as the rule-governed nature of sign language in its sub-lexical, lexical and sentential/clausal structure. The linguistic study of sign language has thus formally acknowledged and established sign language as a natural human language, the only difference being that it is expressed in a different modality.

As a result of large-scale descriptive studies of various sign languages, typological characteristics of sign language, language universals and modality universals have surfaced, contributing significantly to the understanding of natural human language. Cross-linguistic typological work shows that large aspects of sign languages are universal, yet cross-linguistic variation among sign languages is also observed. To mention a few: simultaneous morphology, verb agreement, classifier incorporation, aspectual modulation, reduplication, etc. are universal characteristics of sign languages, whereas fingerspelling, word order, constituent doubling, information packaging strategies, etc. show cross-linguistic variation. In addition, the linguistic study of sign language has contributed tremendously to shaping policies and practices regarding deafness and sign language in education as well as in socio-economic participation. In other words, the study of sign language has made a difference in the lives of deaf people. Two of the most significant pursuits have been research on and propagation of Indian Sign Language as a national sign language, and its use as a medium of deaf education.

Indian Sign Language is an SOV language, with asymmetry between embedded and matrix clauses in the information-neutral word order. It is a present vs. non-present language lacking overt tense markings and resorts to NP adverbs of time, tense neutralization and aspectual marking for time reference; wh-phrase is always clause final.

In the contemporary academic scenario, an increasing number of sign language researchers are seeking to create large corpora of sign language digital video data. Projects have begun in Australia, Ireland, the Netherlands, the United Kingdom, Greece, France, Spain and Germany, for example, and more are underway or being planned in other parts of the world.

In the coming years, the Consortium plans to create quality annotated sign (and text) data of Indian Sign Language. This is pioneering work not only in India but also in the Asian continent. For such a grand mission, however, sociolinguistic and methodological considerations are of crucial importance. Because of the visuo-spatial modality employed in sign and the individual history of sign language acquisition, the researcher’s knowledge of sign language is a primary requisite, as the deaf community is to a large extent monolingual. The methodology employed for the study is deductive, followed by the elicitation of data through fieldwork in the various regions of India in association with deaf clubs, associations and schools. Along with the elicited data, observations from discourse and production data are also taken into account in building the corpora of Indian Sign Language. Native signers’ judgments are taken into account for grammaticality facts. The criteria for a consultant/informant are as follows:

  1. Exposure to sign by three years of age.
  2. Capability and comfort with judging whether a sentence is grammatical or not.
  3. Daily contact with the signed language in the deaf community (for 10+ years).

The fieldwork will be carried out according to the following plan. Each round of fieldwork is expected to yield 3000 lexical items, 500 sentences and 10 items of production data, enriching the corpora not only numerically but also in terms of regional, social, educational and other parameters of linguistic variation:

1. Northern India: Delhi, 1st year
2. Southern India: Mysore, 2nd year
3. North-eastern India: Shillong, 3rd year
4. Western India: Ichalkaranji, 4th year
5. Eastern India: Kolkata, 5th year

To avoid any extraneous or erroneous influence and to assure quality, the data is verified with a bilingual consultant. The data is recorded using digital videography. Later, the edited data is timeline-tagged and annotated so that it meets the basic requirements of the applications and tools as well as of academic and pedagogical research.

Total Indian sign language corpora set at the end of five years shall be:

Lexical items 15000
Sentences 2500
Production data 50
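
These five-year totals follow directly from the per-fieldwork yield and the five planned sites. The short check below is an illustrative sketch only; the dictionary and variable names are hypothetical.

```python
# Expected yield of one round of fieldwork, as stated in the plan above.
PER_FIELDWORK = {"lexical_items": 3000, "sentences": 500, "production_data": 10}

# One round of fieldwork per year, at the five sites listed in the plan.
FIELDWORK_SITES = ["Delhi", "Mysore", "Shillong", "Ichalkaranji", "Kolkata"]

# Five-year totals: per-round yield multiplied by the number of rounds.
totals = {kind: count * len(FIELDWORK_SITES) for kind, count in PER_FIELDWORK.items()}

if __name__ == "__main__":
    for kind, count in totals.items():
        print(kind, count)
```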

Standards for collecting and recording data, etc. shall be evolved in consultation with other teams working in the area.

11. By-products like Lexicon, Thesauri, WordNet etc

  1. Creation of frequency dictionaries - five per year

    1st year  2nd year  3rd year  4th year   5th year
    Bengali   Bodo      Assamese  Kashmiri   other Indian languages
    Hindi     Dogri     Gujarati  Malayalam
    Kannada   Maithili  Odia      Marathi
    Manipuri  Nepali    Punjabi   Sanskrit
    Urdu      Konkani   Tamil     Santali

  2. Multilingual multi directional dictionary - an ongoing process
  3. Aiding wordnet creation and collaborating with others for the same - an ongoing process
  4. A document management and data tracking system will be created to manage all of the data collected, processed, annotated and distributed for LDC-IL. Each document will be assigned a unique document ID, and there will be a link between the collected document and its associated files. This database will track information such as document ID, languages, data type, data partition, data medium, author information, date, location, source type, file format, file size and others as required. The database shall also be upgraded so that the existing data is ready for use in new systems and techniques, while backward compatibility of the database will remain of major importance.
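
The tracking record described in the last item could be modelled along the following lines. This is a minimal in-memory sketch, not the LDC-IL database design: the class names `DocumentRecord` and `DocumentTracker`, all field names, and the use of a random UUID as the unique document ID are assumptions made for illustration. Only the list of tracked properties comes from the text.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentRecord:
    """One tracked document; the fields mirror the properties listed above."""
    languages: list[str]
    data_type: str          # e.g. "text" or "speech" (hypothetical values)
    data_partition: str
    data_medium: str
    author: str
    collected_on: date
    location: str
    source_type: str
    file_format: str
    file_size_bytes: int
    # Unique document ID; a random UUID is an assumption of this sketch.
    document_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Links from the collected document to its associated files.
    associated_files: list[str] = field(default_factory=list)


class DocumentTracker:
    """In-memory stand-in for the tracking database."""

    def __init__(self) -> None:
        self._records: dict[str, DocumentRecord] = {}

    def register(self, record: DocumentRecord) -> str:
        """Store a record and return its ID; duplicate IDs are rejected."""
        if record.document_id in self._records:
            raise ValueError(f"duplicate document ID: {record.document_id}")
        self._records[record.document_id] = record
        return record.document_id

    def link_file(self, document_id: str, path: str) -> None:
        """Attach an associated file (annotation, transcript, ...) to a document."""
        self._records[document_id].associated_files.append(path)

    def by_language(self, language: str) -> list[DocumentRecord]:
        """Return every record that lists the given language."""
        return [r for r in self._records.values() if language in r.languages]
```

Keeping the document ID separate from file paths means that associated files can be added or moved without changing the identity of the record, which is one way to support the backward compatibility the text calls for.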

12. Licensing Policy

Licensing is an important issue for LDC-IL. A draft licensing policy shall be evolved through discussions within one year. It shall be finalized within one further year, by which time the annotated data will be available for delivery.

13. Evaluation

The data that the LDC-IL creates and obtains has to be evaluated. For each kind of data, tool, etc., metrics have to be evolved. Benchmarks, gold standards, etc. have to be developed.

Within a one-year time frame, this shall be accomplished for the first set of tools. In the following year(s), the same shall be developed for other data and tools.

14. Beyond Roadmap

Above all, and in addition to what is projected in this roadmap, LDC-IL will respond positively to the specific language data needs of individuals, institutions and industry by taking up their requests on a priority basis for licensing purposes. In the beginning, derivatives of the databases shall be licensed; after all the licensing issues are resolved, the databases themselves shall also be licensed.

15. Future plans of LDC-IL (2024-2029)

LDC-IL is the national repository of linguistic data. The future focus of the Linguistic Data Consortium for Indian Languages is on expanding and enriching linguistic data resources for Indian languages, implementing language technologies and supporting advancements in Natural Language Processing. LDC-IL aims to enhance high-quality, diverse linguistic data resources for all Indian languages, including scheduled and non-scheduled languages. It is promoting research and development in natural language processing, artificial intelligence, machine learning and related fields by providing access to text and speech corpora, annotated corpora, tools and services for Indian languages.

The consortium is building language resources and helping to preserve all Indian languages for future generations. These resources are designed to support the development of applications such as automated translation, speech recognition, sentiment analysis and so on. Through the development of linguistic and technological resources, LDC-IL is ensuring that Indian languages are well represented and supported in the digital era.

LDC-IL is fostering collaboration among academic institutions, government agencies and industry to improve language technology. It wishes to create a network for Indian language technology. By connecting academic institutions and industry, LDC-IL can develop robust technologies in Indian languages. LDC-IL also aims to build a pool of skilled researchers and developers in the area of NLP by organizing workshops, conferences and training programs. This collaborative approach will address critical language barriers and eventually help all Indian languages succeed in the digital age.

16. LDC-IL Goals

Increase Linguistic Resources:

Enhance Data Quality and Standardization:

Improve Data Accessibility:

Make the LDC-IL website more user-friendly and accessible

Organize Workshops and Training Programs:

Develop Language Technologies: