Released Datasets | Official Website of Linguistic Data Consortium for Indian Languages

Released Datasets of LDC-IL and their Prices

LDC-IL has released 58+ datasets so far. The released datasets are listed below, along with their prices for commercial users.

Sl. No.	Dataset Name	Price
1	Assamese Sentence Aligned Speech Corpus	217004
2	Bengali Sentence Aligned Speech Corpus	437866
3	Hindi Sentence Aligned Speech Corpus	464357
4	Kannada Sentence Aligned Speech Corpus	697297
5	Konkani Sentence Aligned Speech Corpus	546368
6	Maithili Sentence Aligned Speech Corpus	279020
7	Malayalam Sentence Aligned Speech Corpus	816153
8	Marathi Sentence Aligned Speech Corpus	265318
9	Nepali Sentence Aligned Speech Corpus	298711
10	Odia Sentence Aligned Speech Corpus	441013
11	Tamil Sentence Aligned Speech Corpus	538655
12	Urdu Sentence Aligned Speech Corpus	328755
13	Indian English-Bengali variant Sentence Aligned Speech Corpus	48555
14	Indian English-Kannada variant Sentence Aligned Speech Corpus	61959
15	Chhattisgarhi Raw Speech Corpus	375592

These datasets are distributed for both commercial and non-commercial usage.

Please note that the datasets are free of charge for bona fide non-commercial and academic use. The requester must be a bona fide student, faculty member, or employee of a government-funded research institute, or a government entity.

Additional discounts are available for startups, MSMEs, and entities from SAARC countries. For more details about the discounts and the procedure for procuring the datasets, please log in to the Data Distribution portal and see the FAQ page.

Linguistic Data Consortium for Indian Languages (LDC-IL)

Ministry of Education, Government of India
