POS Tagged Corpus
We have developed Automatic POS Tagger for Indian Languages using hybrid approach. The precision at present is 86.2% (LDC-IL Tagset 84.2%, BIS Tagset 88.2%) but it is expected to go higher after more rounds of fine tuning.

The following table shows the number of words annotated as per the LDC-IL POS tagset.
Words tagged as per LDC-IL POS tagset |
Sl.No. |
Language |
2008-09 |
2009-10 |
2010-11 |
Total Words tagged |
1 |
Assamese |
Tag set creation |
30,000 + |
~ 50,000 |
85390 |
2 |
Bengali |
Tag set creation |
25,000 + |
~ 50,000 |
75397 |
3 |
Bodo |
Tag set creation |
30,000 + |
~ 50,000 |
83453 |
4 |
Gujarati |
Tag set creation |
30,000 + |
~ 50,000 |
83435 |
5 |
Hindi |
Tag set creation |
30,000 + |
~ 50,000 |
84962 |
6 |
Malayalam |
Tag set creation |
30,000 + |
~ 50,000 |
82897 |
7 |
Manipuri |
Tag set creation |
30,000 + |
~ 50,000 |
83439 |
8 |
Nepali |
Tag set creation |
29,000 + |
~ 50,000 |
86616 |
9 |
Oriya |
Tag set creation |
30,000 + |
~ 50,000 |
79159 |
10 |
Punjabi |
Tag set creation |
28,000 + |
~ 50,000 |
78053 |
11 |
Tamil |
Tag set creation |
30,000 + |
~ 50,000 |
88086 |
12 |
Urdu |
Tag set creation |
26,000 + |
~ 50,000 |
76996 |
The following table shows the number of words annotated as per the BIS POS tags
Words tagged as per Bureau Of Indian Standard (BIS) POS tagset |
S.No |
Language |
Previous Data
(before 31.3.2013) |
Current Data
( 1.4.2013 onwards) |
Total |
1 |
Assamese |
55066 |
52185 |
107251 |
2 |
Bengali |
212426 |
67450 |
279876 |
3 |
Bodo |
133887 |
115136 |
249023 |
4 |
Gujarati |
265789 |
324305 |
590094 |
5 |
Hindi |
233347 |
410406 |
643753 |
6 |
Kannada |
228181 |
289229 |
517410 |
7 |
Kashmiri |
99906 |
0 |
99906 |
8 |
Konkani |
0 |
106016 |
106016 |
9 |
Maithili |
62750 |
172347 |
235097 |
10 |
Malayalam |
392689 |
762870 |
1155559 |
11 |
Manipuri |
102116 |
284724 |
386840 |
12 |
Nepali |
0 |
189159 |
189159 |
13 |
Odia |
159474 |
341583 |
501057 |
14 |
Punjabi |
251250 |
680994 |
932244 |
15 |
Tamil |
174488 |
1202369 |
1376857 |
16 |
Urdu |
184791 |
480461 |
665252 |
Words tagged as per LDC-IL LEX tag set |
Sl. No. |
Languages |
Words |
Major Scheduled languages |
1 |
Assamese |
7,253 |
2 |
Bengali |
8,338 |
3 |
Gujarati |
11,552 |
4 |
Hindi |
3,973 |
5 |
Malayalam |
6,088 |
6 |
Marathi |
7,645 |
7 |
Nepali |
7,719 |
8 |
Oriya |
5,783 |
9 |
Punjabi |
7,818 |
10 |
Tamil |
15,742 |
11 |
Urdu |
4,955 |
Minor Scheduled Languages |
12 |
Manipuri |
2,754 |
13 |
Dogri |
9,257 |
14 |
Konkani |
9,190 |
15 |
Bodo |
6,822 |
POS & Chunking |
Task |
Languages |
Prepared LDC-IL 0.4 version tagset |
Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu |
Prepared Chunk tagset |
Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu |
Done pilot chunking for 200 sentences |
Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu |
INDIAN SIGN LANGUAGE (ISL)
The ISL corpus has been collected at the LDC-IL in the recording studio of CIIL. Segmentation and annotation of this corpus is presently going on. This corpus consists of the following categories:
S.No |
Category |
1 |
Short stories : Thirsty crow, rabbit and tortoise |
2 |
Frequent words |
3 |
Question and answering |
The ISL corpus has also been collected by the RKMVU, Coimbatore. It consists of basic vocabulary, self introduction, information regarding family members, friends, activities/hobbies, food habit, travel etc.
|