Central Institute of Indian Languages [CIIL]
MISSION STATEMENT: Annotated, quality language data (both text and speech) and tools in Indian Languages to Individuals, Institutions and Industry for Research & Development - Created in-house, through outsourcing and acquisition.
Text Corpora

Status of Text Corpora (as of July 2014):

SL. No  LANGUAGE   CIIL CORPUS   MAGAZINE  NEWS TYPED   NEWS WEB        TOTAL
1.      ASSAMESE     4,086,443  2,292,435   3,225,595          -    9,604,473
2.      BENGALI      4,255,966          -   2,228,296          -    6,484,262
3.      BODO           607,079          -   2,266,072          -    2,873,151
4.      DOGRI          580,302          -     225,825          -      806,127
5.      ENGLISH      2,383,616          -           -          -    2,383,616
6.      GUJARATI     4,019,288          -     398,169  1,095,887    5,513,344
7.      HINDI       24,397,208          -           -  4,931,792   29,329,000
8.      KANNADA      4,152,847     95,837   2,644,996          -    6,893,680
9.      KASHMIRI       994,873          -      60,060          -      104,933
10.     KONKANI      1,158,064    781,934   1,392,591          -    3,332,589
11.     KODAVA         182,741          -           -          -      182,741
12.     MAITHILI     2,688,637    855,430   1,259,453          -    4,803,520
13.     MALAYALAM    3,319,101          -           -  2,684,201    6,003,302
14.     MANIPURI     3,966,015    103,897   1,411,090          -    5,481,002
15.     MARATHI      1,872,101          -           -    290,162    2,162,263
16.     NEPALI       4,079,275    351,922   1,762,937    269,606    6,463,740
17.     ORIYA          689,754          -     831,427          -    1,521,181
18.     PUNJABI      5,445,720     13,497   3,683,559    368,551    9,511,327
19.     SANTHALI        76,547          -           -          -       76,547
20.     SANSKRIT       517,642          -           -          -      517,642
21.     TAMIL        8,050,077  1,163,065           -  1,057,047   10,270,189
22.     TELUGU       2,913,937          -           -     28,469    2,942,406
23.     URDU         4,025,781          -           -    990,012    5,015,793
24.     YARAVA          13,904          -           -          -       13,904

Total Word count                                                  121,878,014
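
Each TOTAL figure in the status table is presumably the sum of the four source columns for that language. A minimal sketch of that arithmetic check follows, using a few rows copied from the table. The names `rows`, `check_row` and `mismatches` are hypothetical helpers introduced only for this illustration, and treating "-" as 0 is an assumption.

```python
# Figures copied from the status table; "-" (no data) is entered as 0 (assumption).
# Column order: CIIL corpus, magazine, news typed, news web, reported total.
rows = {
    "ASSAMESE": (4_086_443, 2_292_435, 3_225_595, 0, 9_604_473),
    "GUJARATI": (4_019_288, 0, 398_169, 1_095_887, 5_513_344),
    "KASHMIRI": (994_873, 0, 60_060, 0, 104_933),
    "NEPALI": (4_079_275, 351_922, 1_762_937, 269_606, 6_463_740),
}


def check_row(values):
    """Return (computed_sum, reported_total, matches) for one table row."""
    *parts, reported = values
    computed = sum(parts)
    return computed, reported, computed == reported


def mismatches(table):
    """List the languages whose source columns do not add up to the reported total."""
    return [lang for lang, values in table.items() if not check_row(values)[2]]


if __name__ == "__main__":
    for lang, values in rows.items():
        computed, reported, ok = check_row(values)
        status = "ok" if ok else "differs"
        print(f"{lang:<10} computed={computed:>12,} reported={reported:>12,} {status}")
```

With the four rows entered here, the Kashmiri row is the one the check reports: 994,873 + 60,060 = 1,054,933, while the listed total is 104,933. The other three rows add up to their listed totals.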


Sample Files of Text Corpora:

Sl. No.  Language   PDF files  Doc files
1.       Assamese   1 2 3      1 2 3
2.       Bengali    1 2 3      1 2 3
3.       Bodo       1 2 3      1 2 3
4.       Dogri      1 2 3      1 2 3
5.       Gujarati   1 2 3      1 2 3
6.       Hindi      1 2 3      1 2 3
7.       Kannada    1 2 3      1 2 3
8.       Kashmiri   1 2 3      1 2 3
9.       Konkani    1 2 3      1 2 3
10.      Maithili   1 2 3      1 2 3
11.      Malayalam  1 2 3      1 2 3
12.      Manipuri   1 2 3      1 2 3
13.      Marathi    1 2 3      1 2 3
14.      Nepali     1 2 3      1 2 3
15.      Oriya      1 2 3      1 2 3
16.      Punjabi    1 2 3      1 2 3
17.      Sanskrit   1 2 3      1 2 3
18.      Santhali   1 2 3      1 2 3
19.      Sindhi     1 2 3      1 2 3
20.      Tamil      1 2 3      1 2 3
21.      Telugu     1 2 3      1 2 3
22.      Urdu       1 2 3      1 2 3


Developed & Maintained by:
LDC-IL, CIIL
Copyright © LDC-IL,
Central Institute of Indian Languages
Central Institute of Indian Languages
Department of Higher Education
Ministry of Human Resource Development
Government of India
Manasagangothri, Hunsur Road, Mysore-570006, Karnataka, India.
Tel: (0821) 2515820 (Director)
Reception/PABX : (0821) 2345000
Fax: (0821) 2515032 (Off)