Central Institute of Indian Languages [CIIL]
MISSION STATEMENT: Annotated, quality language data (both text and speech) and tools in Indian languages to individuals, institutions and industry for research and development, created in-house, through outsourcing and acquisition.
Current Status

POS Tagged Corpus

We have developed an automatic POS tagger for Indian languages using a hybrid approach. Its precision at present is 86.2% (LDC-IL tagset 84.2%, BIS tagset 88.2%), and it is expected to rise after further rounds of fine-tuning.
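The page does not say how precision is measured. As a minimal sketch, assuming it means token-level agreement between the tagger's output and a manually checked gold standard, the figure could be computed as follows. The function name and the sample tags are hypothetical. The last lines only observe that the overall 86.2% matches the unweighted mean of the two per-tagset figures; whether it was actually derived that way is an assumption.

```python
def tagging_precision(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hypothetical five-token example; the tag labels are illustrative only.
gold = ["N", "V", "N", "PSP", "V"]
predicted = ["N", "V", "JJ", "PSP", "V"]
score = tagging_precision(gold, predicted)  # 4 of the 5 tags agree

# Per-tagset figures reported on this page, in percent.
reported = {"LDC-IL": 84.2, "BIS": 88.2}
# Assumption: the overall figure is the unweighted mean of the two.
overall = round(sum(reported.values()) / len(reported), 1)

print(f"sample precision: {score:.1%}")
print(f"mean of reported figures: {overall}%")
```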

The following table shows the number of words annotated as per the LDC-IL POS tagset.

Words tagged as per LDC-IL POS tagset

| Sl. No. | Language | 2008-09 | 2009-10 | 2010-11 | Total words tagged |
|---|---|---|---|---|---|
| 1 | Assamese | Tag set creation | 30,000+ | ~50,000 | 85,390 |
| 2 | Bengali | Tag set creation | 25,000+ | ~50,000 | 75,397 |
| 3 | Bodo | Tag set creation | 30,000+ | ~50,000 | 83,453 |
| 4 | Gujarati | Tag set creation | 30,000+ | ~50,000 | 83,435 |
| 5 | Hindi | Tag set creation | 30,000+ | ~50,000 | 84,962 |
| 6 | Malayalam | Tag set creation | 30,000+ | ~50,000 | 82,897 |
| 7 | Manipuri | Tag set creation | 30,000+ | ~50,000 | 83,439 |
| 8 | Nepali | Tag set creation | 29,000+ | ~50,000 | 86,616 |
| 9 | Oriya | Tag set creation | 30,000+ | ~50,000 | 79,159 |
| 10 | Punjabi | Tag set creation | 28,000+ | ~50,000 | 78,053 |
| 11 | Tamil | Tag set creation | 30,000+ | ~50,000 | 88,086 |
| 12 | Urdu | Tag set creation | 26,000+ | ~50,000 | 76,996 |


The following table shows the number of words annotated as per the BIS POS tagset.

Words tagged as per Bureau of Indian Standards (BIS) POS tagset

| S. No. | Language | Previous data (before 31.3.2013) | Current data (1.4.2013 onwards) | Total |
|---|---|---|---|---|
| 1 | Assamese | 55,066 | 52,185 | 107,251 |
| 2 | Bengali | 212,426 | 67,450 | 279,876 |
| 3 | Bodo | 133,887 | 115,136 | 249,023 |
| 4 | Gujarati | 265,789 | 324,305 | 590,094 |
| 5 | Hindi | 233,347 | 410,406 | 643,753 |
| 6 | Kannada | 228,181 | 289,229 | 517,410 |
| 7 | Kashmiri | 99,906 | 0 | 99,906 |
| 8 | Konkani | 0 | 106,016 | 106,016 |
| 9 | Maithili | 62,750 | 172,347 | 235,097 |
| 10 | Malayalam | 392,689 | 762,870 | 1,155,559 |
| 11 | Manipuri | 102,116 | 284,724 | 386,840 |
| 12 | Nepali | 0 | 189,159 | 189,159 |
| 13 | Odia | 159,474 | 341,583 | 501,057 |
| 14 | Punjabi | 251,250 | 680,994 | 932,244 |
| 15 | Tamil | 174,488 | 1,202,369 | 1,376,857 |
| 16 | Urdu | 184,791 | 480,461 | 665,252 |
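In the BIS table, each Total is the sum of the two period columns. The short check below, with figures copied from the table, confirms this row by row and adds up the totals across languages. The names `BIS_ROWS` and `inconsistent_rows` are illustrative and not part of any LDC-IL tool.

```python
# (language, words tagged before 31.3.2013, words tagged from 1.4.2013, reported total)
BIS_ROWS = [
    ("Assamese", 55066, 52185, 107251),
    ("Bengali", 212426, 67450, 279876),
    ("Bodo", 133887, 115136, 249023),
    ("Gujarati", 265789, 324305, 590094),
    ("Hindi", 233347, 410406, 643753),
    ("Kannada", 228181, 289229, 517410),
    ("Kashmiri", 99906, 0, 99906),
    ("Konkani", 0, 106016, 106016),
    ("Maithili", 62750, 172347, 235097),
    ("Malayalam", 392689, 762870, 1155559),
    ("Manipuri", 102116, 284724, 386840),
    ("Nepali", 0, 189159, 189159),
    ("Odia", 159474, 341583, 501057),
    ("Punjabi", 251250, 680994, 932244),
    ("Tamil", 174488, 1202369, 1376857),
    ("Urdu", 184791, 480461, 665252),
]


def inconsistent_rows(rows):
    """Return the languages whose reported total differs from previous + current."""
    return [lang for lang, prev, cur, total in rows if prev + cur != total]


mismatches = inconsistent_rows(BIS_ROWS)
grand_total = sum(total for _, _, _, total in BIS_ROWS)

print("rows with inconsistent totals:", mismatches)
print("words tagged across all languages:", grand_total)
```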



Words tagged as per LDC-IL LEX tag set

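| Sl. No. | Language | Words |
|---|---|---|
| | Major Scheduled Languages | |
| 1 | Assamese | 7,253 |
| 2 | Bengali | 8,338 |
| 3 | Gujarati | 11,552 |
| 4 | Hindi | 3,973 |
| 5 | Malayalam | 6,088 |
| 6 | Marathi | 7,645 |
| 7 | Nepali | 7,719 |
| 8 | Oriya | 5,783 |
| 9 | Punjabi | 7,818 |
| 10 | Tamil | 15,742 |
| 11 | Urdu | 4,955 |
| | Minor Scheduled Languages | |
| 12 | Manipuri | 2,754 |
| 13 | Dogri | 9,257 |
| 14 | Konkani | 9,190 |
| 15 | Bodo | 6,822 |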



POS & Chunking

| Task | Languages |
|---|---|
| Prepared LDC-IL 0.4 version tagset | Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu |
| Prepared Chunk tagset | Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu |
| Done pilot chunking for 200 sentences | Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu |



INDIAN SIGN LANGUAGE (ISL)

The ISL corpus has been collected at LDC-IL in the recording studio of CIIL. Segmentation and annotation of this corpus are currently in progress. The corpus consists of the following categories:

| S. No. | Category |
|---|---|
| 1 | Short stories: Thirsty crow, rabbit and tortoise |
| 2 | Frequent words |
| 3 | Question and answering |

An ISL corpus has also been collected by RKMVU, Coimbatore. It consists of basic vocabulary, self-introduction, and information regarding family members, friends, activities/hobbies, food habits, travel, etc.



Developed & Maintained by:
LDC-IL, CIIL
Copyright © LDC-IL,
Central Institute of Indian Languages
Department of Higher Education
Ministry of Human Resource Development
Government of India
Manasagangothri, Hunsur Road, Mysore-570006, Karnataka, India.
Tel: (0821) 2515820 (Director)
Reception/PABX : (0821) 2345000
Fax: (0821) 2515032 (Off)