Central Institute of Indian Languages [CIIL]
MISSION STATEMENT: Annotated, quality language data (both text and speech) and tools in Indian languages, made available to individuals, institutions and industry for research and development; created in-house, through outsourcing and acquisition.
Text Corpora

Status of Text Corpora (as of July 2014):

Sl. No.  Language    CIIL Corpus   News Typed     Magazine     News Web        Total
1.       ASSAMESE      4,113,771    3,273,037    2,424,912            -    9,811,720
2.       BENGALI       4,240,779    2,694,357            -            -    6,935,136
3.       BODO            603,360    2,260,890       13,958            -    2,878,208
4.       DOGRI           576,884      225,825            -            -      802,709
5.       ENGLISH       2,383,616            -            -            -    2,383,616
6.       GUJARATI      4,019,286      398,171            -    1,116,605    5,534,062
7.       HINDI        24,765,268            -            -    5,001,264   29,766,532
8.       KANNADA       4,543,950            -      250,825    2,644,996    7,439,771
9.       KASHMIRI        999,536       60,060            -            -    1,059,596
10.      KONKANI       1,192,719    2,011,807      791,090            -    3,995,616
11.      KODAVA          182,741            -            -            -      182,741
12.      MAITHILI      3,025,468    1,259,246    1,031,838            -    5,316,552
13.      MALAYALAM     3,635,002       92,749            -    2,684,188    6,411,939
14.      MANIPURI      4,693,426    1,415,306       39,566            -    6,148,298
15.      MARATHI       1,865,046            -            -      292,033    1,865,046
16.      NEPALI        3,972,402    2,339,826      475,690      269,606    7,057,524
17.      ODIA            688,205      900,564            -            -    1,588,769
18.      PUNJABI       5,615,489    3,818,587      337,824      369,865   10,141,765
19.      SANTHALI         76,546            -            -            -       76,546
20.      SANSKRIT        517,642            -            -            -      517,642
21.      TAMIL         8,068,759    1,805,164            -    1,059,561   10,933,484
22.      TELUGU        2,982,155            -            -       28,838    3,010,993
23.      URDU          4,167,023            -            -      992,594    5,159,617
24.      YARAVA           13,904            -            -            -       13,904

Total Word count                                                         126,225,916
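Since each row's total is the sum of its four source counts, the figures can be cross-checked mechanically. The following is a minimal Python sketch, not part of the original page. It assumes that a "-" cell means zero words, and the names `ROWS` and `mismatched_rows` are hypothetical, introduced only for illustration.

```python
# Rows transcribed from the status table:
# (language, CIIL corpus, news typed, magazine, news web, stated total).
# Assumption: a "-" cell in the table is entered here as 0.
ROWS = [
    ("ASSAMESE", 4113771, 3273037, 2424912, 0, 9811720),
    ("BENGALI", 4240779, 2694357, 0, 0, 6935136),
    ("BODO", 603360, 2260890, 13958, 0, 2878208),
    ("DOGRI", 576884, 225825, 0, 0, 802709),
    ("ENGLISH", 2383616, 0, 0, 0, 2383616),
    ("GUJARATI", 4019286, 398171, 0, 1116605, 5534062),
    ("HINDI", 24765268, 0, 0, 5001264, 29766532),
    ("KANNADA", 4543950, 0, 250825, 2644996, 7439771),
    ("KASHMIRI", 999536, 60060, 0, 0, 1059596),
    ("KONKANI", 1192719, 2011807, 791090, 0, 3995616),
    ("KODAVA", 182741, 0, 0, 0, 182741),
    ("MAITHILI", 3025468, 1259246, 1031838, 0, 5316552),
    ("MALAYALAM", 3635002, 92749, 0, 2684188, 6411939),
    ("MANIPURI", 4693426, 1415306, 39566, 0, 6148298),
    ("MARATHI", 1865046, 0, 0, 292033, 1865046),
    ("NEPALI", 3972402, 2339826, 475690, 269606, 7057524),
    ("ODIA", 688205, 900564, 0, 0, 1588769),
    ("PUNJABI", 5615489, 3818587, 337824, 369865, 10141765),
    ("SANTHALI", 76546, 0, 0, 0, 76546),
    ("SANSKRIT", 517642, 0, 0, 0, 517642),
    ("TAMIL", 8068759, 1805164, 0, 1059561, 10933484),
    ("TELUGU", 2982155, 0, 0, 28838, 3010993),
    ("URDU", 4167023, 0, 0, 992594, 5159617),
    ("YARAVA", 13904, 0, 0, 0, 13904),
]


def mismatched_rows(rows):
    """Return the languages whose four source counts do not sum to the stated total."""
    return [lang for lang, *parts, total in rows if sum(parts) != total]


mismatches = mismatched_rows(ROWS)
sum_of_row_totals = sum(row[-1] for row in ROWS)

print("Rows whose parts differ from the stated total:", mismatches)
print("Sum of the stated row totals:", f"{sum_of_row_totals:,}")
```

On the figures as listed, the Marathi row is the only one flagged: its stated total equals the CIIL Corpus count alone and does not include the News Web count. The printed sum of row totals can be compared in the same way with the stated Total Word count of 126,225,916. The sketch only reports such differences; it does not determine which figure is the correct one.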


Sample Files of Text Corpora:

Sl. No.  Language    PDF files   Doc files
1.       Assamese    1 2 3       1 2 3
2.       Bengali     1 2 3       1 2 3
3.       Bodo        1 2 3       1 2 3
4.       Dogri       1 2 3       1 2 3
5.       Gujarati    1 2 3       1 2 3
6.       Hindi       1 2 3       1 2 3
7.       Kannada     1 2 3       1 2 3
8.       Kashmiri    1 2 3       1 2 3
9.       Konkani     1 2 3       1 2 3
10.      Maithili    1 2 3       1 2 3
11.      Malayalam   1 2 3       1 2 3
12.      Manipuri    1 2 3       1 2 3
13.      Marathi     1 2 3       1 2 3
14.      Nepali      1 2 3       1 2 3
15.      Odia        1 2 3       1 2 3
16.      Punjabi     1 2 3       1 2 3
17.      Sanskrit    1 2 3       1 2 3
18.      Santhali    1 2 3       1 2 3
19.      Sindhi      1 2 3       1 2 3
20.      Tamil       1 2 3       1 2 3
21.      Telugu      1 2 3       1 2 3
22.      Urdu        1 2 3       1 2 3


Developed & Maintained by: LDC-IL, CIIL
Copyright © LDC-IL, Central Institute of Indian Languages
Central Institute of Indian Languages
Department of Higher Education
Ministry of Human Resource Development
Government of India
Manasagangothri, Hunsur Road, Mysore-570006, Karnataka, India.
Tel: (0821) 2515820 (Director)
Reception/PABX : (0821) 2345000
Fax: (0821) 2515032 (Off)