MINUTES OF THE SECOND PROJECT ADVISORY COMMITTEE MEETING OF THE LINGUISTIC DATA CONSORTIUM FOR INDIAN LANGUAGES (LDC-IL) HELD ON JUNE 9, 2008 AT 11.30 a.m.
I. Welcome
Prof. Udaya Narayana Singh, Director, Central Institute of Indian Languages and Chairperson, Linguistic Data Consortium for Indian Languages (LDC-IL), welcomed the Members to the Second Project Advisory Committee meeting. He explained the constraints that had prevented another meeting from being held during the past financial year. The Director also took this opportunity to brief the members about the new projects sanctioned by the Ministry, especially the National Translation Mission, which will have a bearing on the LDC-IL Project.
II. Agenda Items
After the welcome, the agenda items were taken up in order.
1. The Minutes of the First Project Advisory Committee Meeting of the Linguistic Data Consortium for Indian Languages (LDC-IL) held on June 5, 2007 were confirmed.
2. The following Mission Statement of the LDC-IL was approved: “Annotated, quality language data (both text and speech) and tools in Indian languages to Individuals, Institutions, Industry etc., for Research and Development - Created in-house, through outsourcing and acquisition”.
3. Dr. B. Mallikarjun, Reader cum Research Officer & Head, LDC-IL, made a presentation on the Action Taken Report on the recommendations of the First PAC meeting and on the progress made in the work of LDC-IL from June 5, 2007 to June 8, 2008, and proposed certain targets for the year 2008-09. The proposed targets and the details of progress made are given in tabular form in Annexure - II.
III. Action Taken Report in respect of Working Groups
(a) The Working Groups on Licensing Issues, Natural Language Processing, Speech and Speech Deficiencies met on August 3, 2007 at Pune, August 6, 2007 at Hyderabad, and November 29, 2007 at Mysore respectively and deliberated on various issues relating to LDC-IL. The Working Group on Licensing Issues is expected to have another meeting to make specific and concrete recommendations. The Natural Language Processing Group has standardized the POS tag set and the XML tag set (given in Annexure - 3). The group has assigned its members the task of preparing write-ups on future directions, and will study these individual drafts in order to prepare a document on future directions.
(b) The scholars of the Speech Group have met several times and arrived at standards for speech data capture and annotation. The Speech/Language Development Group has met, and some of the personnel have sent their projects to the LDC-IL for grant-in-aid. However, they have been asked to recast them.
(c) The Character Recognition Group could not meet due to various reasons. Prof. B.B. Chaudhuri, the Chairman of the Group, said that he had held informal discussions with members of the group and that they would try to provide the scanned texts to the LDC-IL. It was also agreed that this Working Group will meet on the sidelines of the next PRSG of the MCIT Meeting to make specific recommendations regarding the tasks to be undertaken by the LDC-IL in this area.
(d) The following Standards for Language Data were presented, discussed and accepted.
Text Corpora
- Text in UNICODE
- Markup: SGML standard
- POS tagging: Extendable and expandable tag set, as decided by the NLP Group on August 6, 2007.
Speech Corpora
- Rate of sampling - Multiples of 8 kHz. The purpose and the rate of sampling to be uniform.
- Transliteration scheme - LDC-IL standard
- Annotation - PRAAT, WaveSurfer
- Pronunciation Dictionary – Format (All placed before the PAC)
(For the convenience of the PAC members absent from the meeting, the full text of the standards for speech is enclosed as Annexure - 4.)
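The sampling-rate standard above lends itself to a mechanical check. The following is a minimal sketch in Python, assuming recordings are stored as WAV files. The function names are hypothetical and are not part of the LDC-IL standards; only the "multiples of 8 kHz" rule comes from the list above.

```python
import io
import wave

BASE_RATE_HZ = 8000  # the standard above specifies multiples of 8 kHz


def is_valid_sampling_rate(rate_hz: int) -> bool:
    """Return True if rate_hz is a positive whole multiple of 8 kHz."""
    return rate_hz > 0 and rate_hz % BASE_RATE_HZ == 0


def wav_sampling_rate(source) -> int:
    """Read the sampling rate (frames per second) from a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def make_silent_wav(rate_hz: int, n_frames: int = 80) -> io.BytesIO:
    """Build a small in-memory mono 16-bit WAV, used here only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for rate in (8000, 16000, 44100, 48000):
        measured = wav_sampling_rate(make_silent_wav(rate))
        print(measured, is_valid_sampling_rate(measured))
```

Under this rule, 8, 16 and 48 kHz recordings would pass, while a 44.1 kHz recording would need to be resampled before inclusion.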
(e) Copies of the first versions of the Training Modules prepared by Prof. Dipti Misra et al. on POS Tagging and Chunking, Prof. Amba Kulkarni on Morphological Analyzer, Prof. Pushpak Bhattacharya on Sense Tagging and Prof. Peri Bhaskararao on Collecting Speech Data were given to the Project Advisory Committee. These modules will be used by the LDC-IL for in-house training as well as for the training that it will conduct elsewhere for collecting and annotating language data.
(f) It was noted that the following programmes were conducted and sponsored by LDC-IL, and that their reports were placed before the PAC:
- Winter School on Speech and Audio Processing (WiSSAP 2008) from 2nd to 5th January 2008 held at IIT, Madras.
- Workshop on ‘Advanced Course in Computational Linguistics’ held from 16th to 25th March 2008 by the Dravidian University, Kuppam.
- Workshop on ‘Speech Sciences’ held at CIIL, Mysore from 10th - 21st March 2008.
- Selection Workshop/Test conducted to recruit staff for the LDC-IL Project from 17th - 21st March 2008 at CIIL, Mysore.
- LDC-IL staff training from May 19 to June 13, 2008.
(g) A list of tasks that were recommended by the First Project Advisory Committee but not conducted by the LDC-IL was also provided to the PAC.
(h) In the absence of specifically appointed staff for the LDC-IL, the Institute has created monolingual text corpora, parallel corpora and speech corpora using resource persons in workshop mode. In doing so, only the availability of resource persons was taken into consideration; language priority was not considered. The statistical details were presented before the Committee as a part of the progress report.
IV. Recommendations
The members deliberated on these matters and made the following recommendations:
- The LDC-IL has a national role and has to fulfil its assigned responsibility as a nodal agency. Therefore, the Linguistic Data Consortium for Indian Languages (LDC-IL) could be visualized as a repository of language/linguistic resources and tools for Indian languages. An attempt should be made by the LDC-IL to contact all the NLP groups and institutions and collect these resources and tools. Once collected, they have to be tested and validated. The resources and tools that are up to the standard can be licensed by the LDC-IL under its Licensing Policy.
- For the speech corpus of every language, the number of hours has to be 20 hrs. and not 10 hrs.
- While preparing and procuring data, end users' needs have to be kept in mind. The focus has to be on the end user.
- For the evaluation process for each kind of data, tool etc., matrices have to be evolved. Benchmarking, good standards etc. have to be developed.
- Indian Sign Language Vocabulary has to be developed.
- Dr. Anupam Basu of IIT Kharagpur is to be co-opted into the NLP group.
V. Other Matters
(a) The existing 10 vacant positions will be filled by academic persons and they will be re-designated as Research Assistants (Senior) and Research Assistants (Junior). The existing academic persons under Technical positions will also be re-designated as Research Assistant (Senior).
(b) The member representing IBM, Shri Abhijit Dutta, said that the Workshop on Speech Recognition that IBM had intended to conduct in the previous financial year would be conducted in the next few months.
(c) The members present were requested to send proposals for Seminars, events, workshops, training programmes etc.
(d) Working Group on Speech Deficiency will be renamed as Working Group on Speech Language Development.
(e) Grant-in-Aid: If a grantee does not show progress commensurate with the funds released, further release of funds will be stopped and action will be taken as per the agreement.
(f) The next meeting of the LDC-IL PAC will be held at Mysore in the last week of November 2008. All the members shall keep this in mind.
The meeting ended with thanks to the Chair.
(UDAYA NARAYANA SINGH)
Chairperson & Director, Linguistic Data
Consortium for Indian Languages,
Central Institute of Indian Languages, Mysore
Annexure - I
Minutes of the Second Project Advisory Committee Meeting of the Linguistic Data Consortium for Indian Languages (LDC-IL) held on June 9, 2008 at 11.30 a.m. under the Chairmanship of Prof. Udaya Narayana Singh, Director, Central Institute of Indian Languages, Mysore.
MEMBERS PRESENT
1. Prof. Udaya Narayana Singh, Director, Central Institute of Indian Languages, Manasagangotri, Hunsur Road, Mysore - 570 006 (Chairperson)
2. Mrs. Rashmi Chowdhary, Director (L), Ministry of Human Resource Development, Department of Higher Education, Desk IV (L), Shastri Bhawan, 'C' Wing, New Delhi - 110 001 (Member, representing Language Bureau)
3. Shri S. Mohan, Director, Finance Division, Ministry of Human Resource Development, Department of Higher Education, Shastri Bhavan, 'C' Wing, New Delhi - 110 001 (Member, representing Language Bureau)
4. Director, Indian Institute of Technology Bombay, P.O. IIT, Powai, Mumbai – 400 076 (Member, represented by Dr. Pushpak Bhattacharya, Dept. of Computer Science)
5. Director, Indian Institute of Technology Madras, I.I.T Post Office, Chennai - 600 036 (Member, represented by Prof. Hema Murthy, Dept. of Computer Science)

SPECIAL INSTITUTIONAL INVITEES
6. Director, Indian Institute of Technology Kharagpur, Kharagpur – 721 302 (Member, represented by Prof. Anupam Basu, Dept. of Computer Science)
7. Director, C-DAC, Pune (Member, represented by Dr. Hemant Darbari, Programme Coordinator, AAI Group)
8. Prof. Vijayalakshmi Basavaraj, Director, All India Institute of Speech & Hearing, Manasagangotri, Mysore – 570 006 (Member)

SPECIAL INDIVIDUAL INVITEES
9. Prof. C.N. Krishnan, Director, AU-KBC Research Centre for Internet & Telecom Technology, Anna University MIT Campus, Chromepet, Chennai – 600 044 (Member)
10. Prof. B.B. Chaudhuri, Head, CVPR Unit, Project Coordinator, MIT Resource Centre for Bangla, Indian Statistical Institute, Kolkata, 203 Barrackpore Trunk Road, Kolkata - 700 035, West Bengal (Member)
11. Prof. Peri Bhaskararao, Tokyo University of Foreign Studies, Speech Sciences, ILCAA, Tokyo, Japan (Member)
12. Ms. Rekha Sharma, Centre for Speech Sciences, Central Institute of Indian Languages, Manasagangotri, Hunsur Road, Mysore - 570 006 (Member)

SPECIAL INDUSTRY INVITEES
13. Microsoft, DLF Cybergreens, 9th Floor, Tower A, Sector 25 A, Gurgaon – 122 002, Haryana (Member, represented by Dr. Kalika Bali, Bangalore)
14. Motorola, Lake View Building, Bagmane Tech Park, C.V. Raman Nagar, Bangalore – 560 093 (Member, represented by Shri Shailesh Ramamurthy, Bangalore)
15. Google India, No. 3, RMZ Infinity - Tower E, Old Madras Road, 4th Floor, Bangalore - 560 016 (Member, represented by Dr. Prasad Ram, Bangalore)
16. IBM, DLF Silokhera, NH – 8, Sector – 30, Gurgaon – 122 002, Haryana (Member, represented by Shri Abhijit Dutta, Globalization Specialist)
17. Dr. B. Mallikarjun, Reader cum Research Officer, Central Institute of Indian Languages, Manasagangotri, Hunsur Road, Mysore - 570 006 (Member-Convener; Head, LDC-IL)
MEMBERS ABSENT
1. Director, Indian Institute of Science Bangalore, Bangalore - 560 012
2. Director, International Institute of Information Technology Hyderabad, Gacchibowli, Hyderabad - 500 019
3. Joint Secretary, Human Centered Computing Division, TDIL, Deptt. of Information Technology, Ministry of Communication & Information Technology, Room No. 3009, Electronics Niketan, 6, CGO Complex, New Delhi – 3 (Member)
4. Vice Chancellor & Professor of Law, National Law School University, P.O. Box No. 7201, Nagarbhavi, Bangalore - 560 072

SPECIAL INDIVIDUAL INVITEES
5. Prof. G. Umamaheshwar Rao, Centre for Applied Linguistics & Translation Studies (CALTS), University of Hyderabad, P.O. Central University Campus, Hyderabad – 500 046, A.P. (Special Invitee)
6. Prof. Yagnanarayana, International Institute of Information Technology Hyderabad, Gacchibowli, Hyderabad – 500 019

SPECIAL INDUSTRY INVITEES
7. Director of Science & Technology, Hewlett-Packard Labs. India, 24, Salarpuria Arena, Hosur Main Road, Adugodi, Bangalore – 560 030
8. Executive Director, MAIT (Manufacturers’ Association for Information Technology), PHD House, 4th Floor, Opposite Asian Games Village, New Delhi – 110 016
Annexure - II
TARGETS/PROPOSALS 2008-09
The following were presented as the tasks for the current financial year:
1. Five million word Corpus in 6 Languages: Assamese, Bengali, Gujarati, Manipuri, Nepali and Tamil.
2. a. Ten hours of Recording of Speech for Speech Corpus in languages: Assamese, Bengali, Gujarati, Kannada, Manipuri, Nepali and Tamil.
   b. Procuring of Speech Corpora to the tune of 50 hours in 8 languages.
3. Multilingual dictionaries in 5 Languages.
4. Frequency Dictionaries in 6 Languages.
5. Creation of Pronunciation Dictionary in 5 Languages.
6. Development of tools for analysis of the Text and Speech Corpus in Indian Languages.
7. Conduct of Project Advisory Committee Meetings (2), National, Regional level Training Programmes/Workshops/Meetings and Conferences (22).
8. a. Conduct of One International Seminar/Conference.
   b. The Institute has already agreed to be one of the sponsors of ICON 2008 (6th International Conference on Natural Language Processing) to be held at CDAC, Pune, India from December 20-22, 2008. It also made a Commitment to the SLT at Goa from December 15-19, 2008.
9. Conduct of faculty improvement programme for the staff of LDC-IL (4).
10. Giving Grants for the creation of language/lexical resources in Indian Languages.