Speech Recognition and Synthesis

1.   Introduction

The primary objective of this data collection effort is to build speech recognition and synthesis systems for Indian languages. Although automatic speech recognition (ASR) and text-to-speech (TTS) systems are available for a number of mainstream languages around the world, commercially viable speech systems for Indian languages are not available.

Voice user interfaces for IT applications and services have become increasingly prevalent for languages like English, and are valued for their ease of access, especially in telephony-based applications. In a country like India, where the majority of the population is not comfortable using English and literacy rates are relatively low, local-language speech interfaces can give the masses access to IT applications and services over the Internet and/or the telephone. If such technology is available in Indian languages, people in semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc. For this to become a reality, however, a computer has to be able to accept speech input in the user's language and provide speech output. Also, in multilingual India, if speech technology is coupled with translation systems between the various Indian languages, services and information can be provided across languages more easily.

Although speech technology has been a focus of research in India for a number of years and the technology itself has matured enough for real-world applications, the main obstacle to customizing it for the various Indian languages is the lack of appropriately annotated speech databases in these languages. The focus here is (i) to collect data that can be used for building speech-enabled systems in Indian languages and (ii) to develop tools that facilitate the collection of high-quality speech data.

2.   Background - Speech Recognition

Automatic speech recognition is the task of converting a speech signal into its orthographic representation. There are two categories of speech recognition systems:

  • Isolated word recognition and connected word systems, as in command and control applications.

  • Continuous speech recognition systems, which are further divided into two categories: read speech and spontaneous speech.


3.   Background - Speech Synthesis

The task of speech synthesis is to convert written text (its orthographic representation) into speech. The vocabulary should not be restricted, and the synthesized speech must be close to natural speech. To enable unrestricted speech synthesis, the sentence is normally converted into a sequence of basic units. Appropriate synthesis rules are then applied to produce natural-sounding speech.

The focus is primarily on building (a) vocabulary-independent speech-to-speech translation (for a pair of Indian languages) and (b) vocabulary-dependent isolated word recognition in the Indian languages.

4.   Short Term Goal

The grand vision of this project is to collect data to provide speech-to-speech translation from each and every language to each and every other language spoken in India (including Indian English). Such a system would include unlimited vocabulary speech synthesis and recognition systems for every Indian language coupled with machine translation systems between those languages. The block diagram given below describes the basic architecture of such a system.

Speech input in Language A
  -> Speech Recognition in Language A
  -> Recognized Text output in Language A


5.   Long Term Goal

To create databases for building (a) a bi-directional speech-to-speech translation system for read speech for a pair of Indian languages, namely Hindi and Telugu, and (b) a speech recognition system for Indian English. Further, the aim is to collect large-vocabulary isolated word data for the 22 Scheduled Indian languages.

Speech input in Language A
  -> Speech Recognition in Language A
  -> Recognized Text output in Language A
  -> Machine Translation from Language A to B
  -> Translated Text in Language B
  -> Text to Speech Conversion in Language B
  -> Speech output in Language B
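
The speech-to-speech architecture described in this document can be read as a composition of three stages: recognition, translation and synthesis. The following is a minimal sketch of that chaining only. The type aliases, the function speech_to_speech and the three stub stages are hypothetical names introduced for illustration; real ASR, MT and TTS engines would take their place.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions, not part of the project).
Recognizer = Callable[[bytes], str]    # speech in language A -> text in language A
Translator = Callable[[str], str]      # text in language A -> text in language B
Synthesizer = Callable[[str], bytes]   # text in language B -> speech in language B


def speech_to_speech(audio_a: bytes, asr: Recognizer,
                     mt: Translator, tts: Synthesizer) -> bytes:
    """Chain recognition, translation and synthesis in that order."""
    text_a = asr(audio_a)   # recognized text in language A
    text_b = mt(text_a)     # translated text in language B
    return tts(text_b)      # speech output in language B


# Toy stand-ins so the sketch runs end to end; they do no real signal processing.
def stub_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_mt(text: str) -> str:
    toy_lexicon = {"namaste": "namaskaram"}  # illustrative entry only
    return toy_lexicon.get(text, text)


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


output = speech_to_speech(b"namaste", stub_asr, stub_mt, stub_tts)
```

Keeping each stage behind a plain function signature means a recognizer, translator or synthesizer for a new language can be swapped in without touching the other two stages.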


6.   Methodology for Short Term Effort

Methodologies for data collection and development of tools required for the short-term and long-term goals are given below:

Data Collection Effort for Automatic Speech Recognition (ASR)


The data collection effort will involve collection of read and spontaneous speech.

Data required
Read speech corpora for two Indian languages and Indian English, recorded using:


  • Close talking microphone, on a desktop or laptop.
  • Telephone, both landline and mobile.

The data will be annotated at phoneme, syllable, word and sentence levels.
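
One way to picture annotation at several levels is as a set of time-aligned tiers over the same recording. The sketch below is an assumption about how such tiers could be held in memory, not a format prescribed by the project. The class names Segment and Utterance, the method names and the example labels and timings are all hypothetical.

```python
from dataclasses import dataclass, field

TIER_NAMES = ("phoneme", "syllable", "word", "sentence")


@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str


@dataclass
class Utterance:
    audio_file: str
    tiers: dict = field(default_factory=lambda: {name: [] for name in TIER_NAMES})

    def add(self, tier: str, start: float, end: float, label: str) -> None:
        """Append one labelled interval to the named tier."""
        if tier not in self.tiers:
            raise KeyError(f"unknown tier: {tier}")
        if end <= start:
            raise ValueError("segment must have positive duration")
        self.tiers[tier].append(Segment(start, end, label))

    def children(self, parent_tier: str, index: int, child_tier: str) -> list:
        """Return segments of child_tier that lie inside one parent segment."""
        parent = self.tiers[parent_tier][index]
        return [s for s in self.tiers[child_tier]
                if s.start >= parent.start and s.end <= parent.end]


# Illustrative values only; the file name, labels and timings are invented.
utt = Utterance("example.wav")
utt.add("word", 0.0, 0.4, "mala")
utt.add("syllable", 0.0, 0.2, "ma")
utt.add("syllable", 0.2, 0.4, "la")
syllables = [s.label for s in utt.children("word", 0, "syllable")]
```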

Data Collection for Isolated Speech Recognition


  • Close talking microphone, on a desktop or laptop
  • Telephone, both landline and mobile

10,000 words from 300 speakers (150 male, 150 female)


Data Collection for Text to Speech Synthesis

Data Required
Data will be collected as read speech from phonetically balanced text, which ensures coverage of all speech sounds of the language concerned in different prosodic and phonological contexts. The phonetically balanced text will be extracted from a large text corpus.
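
The document does not say how the phonetically balanced text is extracted. One common approach is greedy selection: repeatedly pick the sentence that adds the most speech units not yet covered. The sketch below assumes that approach. The function names are hypothetical, and character bigrams stand in for real phonetic units, which would normally come from a pronunciation lexicon or a grapheme-to-phoneme converter.

```python
def char_bigrams(sentence: str) -> set:
    """Stand-in for phonetic units: adjacent character pairs, spaces removed."""
    text = sentence.replace(" ", "")
    return {text[i:i + 2] for i in range(len(text) - 1)}


def select_balanced(sentences, units_of, max_sentences):
    """Greedily choose sentences that add the most uncovered units."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        # Pick the sentence contributing the most new units (first one on ties).
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


# Toy corpus, invented for illustration.
corpus = ["abab", "abcd", "cdcd", "ef"]
chosen, covered = select_balanced(corpus, char_bigrams, max_sentences=3)
```

Greedy selection does not guarantee the smallest possible sentence set, but it is simple and scales to large corpora, which is why it is a reasonable first choice here.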

Speech synthesis requires high-quality recordings made in an anechoic chamber with high-quality microphones and recording equipment.

6 speakers: 3 males and 3 females per language.

The data will be annotated at the phone, phoneme, syllable, word and phrase levels.

Tools Required for Data Collection and Annotation for ASR and TTS
Tools for data collection and annotation need to be standardized. A number of tools, such as EMULAB and PRAAT, are available for annotating speech data, so convergence on a common representation, annotation and storage format is required. The project will also focus on providing converters between the different formats.

Tools for Speech Recognition
The data will be annotated at phoneme, syllable, word and sentence levels. Tools need to be developed for semi-automatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases. One option is to adopt the LDC practice of recording data in the NIST format. This format is comprehensive in that it contains information about the recording environment, the speaker, the sampling rate, the number of channels, the number of bits per sample, etc.
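
To make the idea of a self-describing recording format concrete, the sketch below builds a plain-text header in the style of the NIST SPHERE format: a fixed-size block that starts with an identifier and the header size, lists typed "name -type value" fields, and ends with end_head. The layout reflects a general understanding of SPHERE and should be checked against the NIST documentation before use. The function name build_sphere_header, the space padding, and the speaker_id and recording_environment fields are assumptions made for illustration.

```python
def build_sphere_header(fields: dict, header_size: int = 1024) -> bytes:
    """Build a SPHERE-style text header padded to header_size bytes."""
    lines = ["NIST_1A", f"{header_size:>7d}"]
    for name, value in fields.items():
        if isinstance(value, bool):
            raise TypeError("boolean fields are not supported in this sketch")
        if isinstance(value, int):
            lines.append(f"{name} -i {value}")    # integer field
        elif isinstance(value, float):
            lines.append(f"{name} -r {value}")    # real-valued field
        else:
            text = str(value)
            size = len(text.encode("ascii"))
            lines.append(f"{name} -s{size} {text}")  # string field with its length
    lines.append("end_head")
    body = ("\n".join(lines) + "\n").encode("ascii")
    if len(body) > header_size:
        raise ValueError("header fields do not fit in header_size bytes")
    return body.ljust(header_size, b" ")  # padding byte is an assumption


header = build_sphere_header({
    "sample_rate": 16000,
    "channel_count": 1,
    "sample_n_bytes": 2,
    "sample_sig_bits": 16,
    "speaker_id": "spk001",             # hypothetical field and value
    "recording_environment": "office",  # hypothetical field and value
})
```

The audio samples would follow the header in the same file, so a recording carries its own metadata wherever it is copied.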

Tools for Speech Synthesis
Other than tools for annotating speech databases, text annotation is also required for speech synthesis:

Text Annotation Tools Required

  • PoS taggers, phrase boundary markers and intonation markers.
  • Identification and standardization of feature vectors for speech synthesis, such as vocalic/non-vocalic, pause break, distance from a pause break, duration of the break, and characteristics of the intonation across the entire sound unit.
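
The features listed above can be gathered into one record per sound unit and flattened into a numeric vector. The sketch below is only an illustration of that idea. The class name UnitFeatures, the field names, the choice of milliseconds and unit counts, and the three intonation labels are all assumptions, since the feature set is yet to be standardized.

```python
from dataclasses import dataclass

INTONATION_LABELS = ("flat", "rising", "falling")  # assumed label set


@dataclass
class UnitFeatures:
    is_vocalic: bool           # vocalic / non-vocalic
    at_pause_break: bool       # does the unit end at a pause break?
    distance_from_break: int   # units since the previous pause break (assumed measure)
    break_duration_ms: float   # duration of the break, 0.0 if none
    intonation: str            # coarse contour over the whole unit

    def to_vector(self) -> list:
        """Flatten the record into numbers, one-hot encoding the intonation."""
        if self.intonation not in INTONATION_LABELS:
            raise ValueError(f"unknown intonation label: {self.intonation}")
        one_hot = [1.0 if self.intonation == label else 0.0
                   for label in INTONATION_LABELS]
        return [float(self.is_vocalic), float(self.at_pause_break),
                float(self.distance_from_break), self.break_duration_ms] + one_hot


# Example values are invented for illustration.
vector = UnitFeatures(True, False, 3, 0.0, "rising").to_vector()
```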


7.   Applications

  • Speech-to-speech translation for a pair of Indian languages, namely Hindi and Telugu.
  • Command and control applications.
  • Multimodal interfaces to the computer in Indian languages.
  • E-mail readers over the telephone.
  • Readers for the visually disadvantaged.
  • Speech-enabled office suite.

The effort for both speech recognition and speech synthesis will be repeated across all 22 Scheduled languages. For speech recognition, spontaneous speech data will be collected along with read speech. For speech synthesis, data will be collected from professional speakers with very good voice quality. Additional speech data will be collected to develop models of prosody (intonation, duration, etc.) that improve the naturalness of synthesized speech. A database (lexicon) of proper names of Indian origin will be created, with the equivalent phonetic representation for each name.

