LDC-IL

Natural Language Processing

Introduction

Some of the important language data resources required for various NLP applications in Indian languages are described below:

1. Parts-of-Speech Tagging:

PoS tagging is the process of marking each word in a text as corresponding to a particular part of speech, on the basis of its definition and its occurrence in a given context. The tagged data is meant to support the creation of appropriate language technology. Since each PoS tag is attached to a single word, preprocessing steps such as splitting and tokenization must already have been applied to clean the typeset raw corpus. This responds to the need for standardization across the Indian languages, which exhibit a very rich system of morphology: words tend to be long, with complex morpho-phonemic and morpho-syntactic changes at the junctures.
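The order of operations described above (tokenize first, then attach exactly one tag per token) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the tag labels, the toy lexicon, and the punctuation set are hypothetical and are not the LDC-IL TagSet or toolchain. A real tagger would resolve tags from context rather than from a lookup table.

```python
# Illustrative only: tag labels and lexicon are assumptions, not the LDC-IL TagSet.
PUNCTUATION = set(".,;:!?।")  # includes the Devanagari danda

TOY_LEXICON = {
    "राम": "NOUN",
    "ने": "POSTPOSITION",
    "फल": "NOUN",
    "खाया": "VERB",
}


def tokenize(sentence):
    """Split on whitespace, then peel trailing punctuation off as separate tokens."""
    tokens = []
    for piece in sentence.split():
        trailing = []
        while piece and piece[-1] in PUNCTUATION:
            trailing.append(piece[-1])
            piece = piece[:-1]
        if piece:
            tokens.append(piece)
        # Restore the original left-to-right order of the punctuation marks.
        tokens.extend(reversed(trailing))
    return tokens


def tag(tokens):
    """Attach exactly one tag to each token; unknown words get 'UNK'."""
    tagged = []
    for token in tokens:
        if token in PUNCTUATION:
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, TOY_LEXICON.get(token, "UNK")))
    return tagged


# Hindi: "Ram ate a fruit."
tagged_sentence = tag(tokenize("राम ने फल खाया।"))
```

The tokenizer deliberately avoids a `\w`-style regular expression and uses an explicit punctuation set instead, because vowel signs in Indic scripts are combining marks and can be split away from their base consonant by naive word patterns.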

Coverage of Languages:

The priority is to cover all the Scheduled languages first and then take up the non-scheduled languages. The third phase of work on PoS tagging covers the 22 Scheduled Languages: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, and Urdu.

PoS Tagging Guidelines:

To develop TagSets for the individual languages, the scheme follows the linguistic procedure laid down below:

Defining the traditional parts of speech along with the examples
Understanding the concept of Form and Function (Pronouns, Demonstratives, Numerals, etc.)
Recognizing the fuzzy boundaries between the grammatical classes, i.e., a lexical item may function as one category in one context and as a different category in another (Gerunds vs. Infinitives/Participles, etc.).
Working out the syntactic relation between the modifier and the modified (Adj-Noun; Participle-Noun).
Realizing the morpho-syntactic features that a particular lexical item carries in a given syntactic configuration (Person-Number-Gender/Case; Tense-Aspect-Mood/Mod).
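
Two of the points above, form versus function and morpho-syntactic features, can be made concrete with a small sketch. It assumes a hypothetical record type `AnnotatedToken`, hypothetical tag labels, and a three-word toy vocabulary; none of these come from the LDC-IL TagSets. The Hindi form "वह" is tagged as a demonstrative when it modifies a following noun and as a pronoun otherwise, and each token carries a feature dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedToken:
    """Hypothetical record: one word form, one tag, and its features."""
    form: str
    pos: str
    features: dict = field(default_factory=dict)


# Toy vocabulary (assumption for illustration only).
TOY_NOUNS = {"घर": {"gender": "m", "number": "sg"}}
TOY_VERBS = {"आया": {"gender": "m", "number": "sg", "aspect": "perfective"}}
FORM_FUNCTION_AMBIGUOUS = {"वह": {"person": "3", "number": "sg"}}


def annotate(tokens):
    """Tag each token, resolving the ambiguous form by the word that follows it."""
    annotated = []
    for i, form in enumerate(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if form in FORM_FUNCTION_AMBIGUOUS:
            # Same form, different function: modifier of a noun vs. stand-alone.
            pos = "DEMONSTRATIVE" if following in TOY_NOUNS else "PRONOUN"
            features = dict(FORM_FUNCTION_AMBIGUOUS[form])
        elif form in TOY_NOUNS:
            pos, features = "NOUN", dict(TOY_NOUNS[form])
        elif form in TOY_VERBS:
            pos, features = "VERB", dict(TOY_VERBS[form])
        else:
            pos, features = "UNK", {}
        annotated.append(AnnotatedToken(form, pos, features))
    return annotated


that_house = annotate(["वह", "घर"])   # "that house"
he_came = annotate(["वह", "आया"])     # "he came"
```

A single look-ahead rule is far too weak for real data; it is used here only to show why a tag cannot be assigned from the word form alone.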

2. Chunking:

Chunking is the process of annotating tagged tokens with structures in a non-hierarchical and non-recursive way. Segmentation and labeling are among the most common operations in language processing, and chunking is a typical segmentation process: it groups the tagged tokens into meaningful structures. Chunkers generally do not try to analyze entire sentences; they only try to build “chunks” of words. As a result, the rule system of a chunker is relatively simple, robust, and efficient.
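
A minimal rule-based sketch of this idea follows. It takes already-tagged tokens and groups them into a flat sequence of chunks, with no chunk inside another. The chunk labels (`NP`, `VG`, `O`), the tag labels, and the two rules are assumptions chosen for illustration and do not reproduce the LDC-IL Chunk Sets.

```python
# Tags that may precede the head noun inside a noun chunk (illustrative).
NOUN_PREMODIFIERS = {"DEMONSTRATIVE", "ADJECTIVE"}


def chunk(tagged):
    """Group (token, tag) pairs into a flat list of (label, tokens) chunks.

    Rules (assumed for this sketch):
      NP: premodifiers* NOUN POSTPOSITION*
      VG: VERB AUXILIARY*
      O : any single token not covered by the rules above
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        # Try the noun-chunk rule starting at position i.
        j = i
        while j < n and tagged[j][1] in NOUN_PREMODIFIERS:
            j += 1
        if j < n and tagged[j][1] == "NOUN":
            j += 1
            while j < n and tagged[j][1] == "POSTPOSITION":
                j += 1
            chunks.append(("NP", tagged[i:j]))
            i = j
            continue
        # Try the verb-group rule.
        if tagged[i][1] == "VERB":
            j = i + 1
            while j < n and tagged[j][1] == "AUXILIARY":
                j += 1
            chunks.append(("VG", tagged[i:j]))
            i = j
            continue
        # Fallback: the token stays outside any chunk.
        chunks.append(("O", [tagged[i]]))
        i += 1
    return chunks


# Hindi: "Ram ate a fruit."
example = [
    ("राम", "NOUN"),
    ("ने", "POSTPOSITION"),
    ("फल", "NOUN"),
    ("खाया", "VERB"),
    ("।", "PUNCT"),
]
example_chunks = chunk(example)
```

Because each rule only scans forward and every token ends up in exactly one chunk, the output is flat by construction, which is what keeps such a rule system simple and fast compared with full sentence parsing.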

Chunking Guidelines:

The scheme has adopted a set of linguistic norms to be followed by the Resource Persons working on chunking. The chunking of a linguistic expression is based purely on specific categorial labels, and the following linguistic guidelines are provided for the ease of annotators.

Identifying different chunk levels along with the typical examples.
Keeping in mind that minimal non-recursive phrases (nominal or verbal) should be captured.
Understanding that chunking operates on minimal non-recursive phrases and that there is no nested structure within such a minimal construction.
Making sure that nested non-recursive clusters are identified with their heads (Possessive Constructions, Spatial Relational Nouns, Nested Modifiers inside Noun Phrases).
Having knowledge of, and hands-on experience with, the linguistic phenomena that operate on the data of the language concerned, such as scrambling of lexical items, dislocated elements, and the spelling out of boundary elements realized as case markers or as tense, mood, and aspect markers between two expressions.
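
The requirement that chunks be minimal and free of nested structure can be checked mechanically. The sketch below assumes a hypothetical representation in which each chunk is a `(start, end, label)` span over token positions, with `end` exclusive; this representation and the function name are illustrative and are not part of the scheme.

```python
def find_chunk_problems(num_tokens, spans):
    """Return a list of problems; an empty list means the chunking is flat.

    spans: iterable of (start, end, label) with 0 <= start < end <= num_tokens.
    """
    problems = []
    previous_end = 0
    # Sorting by start position lets one pass detect overlap and nesting.
    for start, end, label in sorted(spans):
        if start < 0 or end > num_tokens or start >= end:
            problems.append(f"{label} [{start}, {end}) is empty or out of bounds")
            continue
        if start < previous_end:
            problems.append(f"{label} [{start}, {end}) overlaps or is nested in an earlier chunk")
        previous_end = max(previous_end, end)
    return problems


# Four tokens chunked as three flat chunks: no problems reported.
flat = [(0, 2, "NP"), (2, 3, "NP"), (3, 4, "VG")]

# A chunk placed inside another chunk violates the no-nesting guideline.
nested = [(0, 3, "NP"), (1, 2, "NP")]
```

A check of this kind only validates the shape of the annotation. Whether a boundary is linguistically correct still depends on the annotator's judgment under the guidelines above.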

PoS/Chunk Sets:

Based on the information outlined above, the scheme has developed PoS TagSets as well as Chunk Sets for all Indian languages, and the resource persons concerned carry out their annotation work on that basis.