Language resources | Official Website of Linguistic Data Consortium for Indian Languages

Language resources

Introduction

The primary aim of the Linguistic Data Consortium for Indian Languages (LDC-IL) is to collect high-quality language data for research and development. So far, LDC-IL has released 56 datasets in both text and speech formats.

In addition to the 22 major Indian languages, there are hundreds of minor and tribal languages that require researchers' attention for analysis and interpretation. Creating corpora for these languages will make it easier to compare and contrast the structure and functioning of Indian languages. Therefore, corpora will be collected for as many minor languages as possible, each containing approximately 3 to 5 million words, depending on the availability of text.

A scientific method will be employed to review the collected data at specified intervals. This review process ensures the accuracy and quality of the data. Moreover, the targets for data collection may be adjusted based on the linguistic complexity of each language and the specific research needs. If a language exhibits greater complexity or if there is a high demand for more extensive data, the collection goals will be increased accordingly to provide a more comprehensive dataset for analysis.

Corpora development at LDC-IL

LDC-IL is developing various types of corpora, including raw and annotated data. All of these datasets are available free of charge to researchers and at a reduced cost for industrial use. The types of data under development are described below.

Comprehensive Corpora Development at LDC-IL

LDC-IL creates general text corpora and maintains a standardized list of categories for text collection. It identifies six major categories: Aesthetics, Commerce, Mass Media, Official Documents, Science and Technology, and Social Sciences. These are further divided into 128 minor categories (sub-categories) to cover various domains comprehensively.

LDC-IL is collecting two types of speech data: read speech and spontaneous speech. The data is developed for automatic speech recognition (ASR) and text-to-speech (TTS) purposes. Read speech data is valuable for training ASR systems because it provides audio samples with known transcriptions or texts. By collecting read speech data across various Indian languages, LDC-IL aims to create robust ASR models. Spontaneous speech recordings capture natural, unscripted conversations or monologues. This type of data reflects the natural flow of language in day-to-day life, including pauses, repetitions, intonation, and errors. Spontaneous speech data is essential for developing text-to-speech systems that can generate natural-sounding speech output.