We have deployed a deep learning-based application for voice-specific Human-Machine Interaction in low-resource languages

CoRePooL stands for Corpus for Resource-Poor Languages. It is a voice-specific Human-Machine Interaction application that is accelerated by deep learning algorithms. The annotated, supervised corpus supports Speech-to-Text, Text-to-Speech, Translation, Gender Identification, and Speaker Identification. Since these algorithms are data-hungry, only a few resource-rich languages such as English, Ancient Greek, Latin, and Egyptian have a corpus large enough to accommodate deep learning algorithms. This is a major constraint for resource-poor languages in the speech domain, where data is limited: deep learning-based applications for low-resource languages often suffer from data scarcity, which results in poor model performance. Here we aim to produce a corpus for the resource-poor Badaga language and to set baselines for speech and text analysis with the help of deep learning-based foundation models.

The contributions of this paper

  • We collated 420 minutes of annotated and 968 minutes of unannotated Badaga speech in the CoRePooL corpus.
  • We released 420 minutes of speech data for Speech-to-Text, Text-to-Speech, Gender Identification, and Speaker Identification.
  • We released a Badaga-English parallel text corpus of 2100 sentences for translation.
  • We fine-tuned foundation models to set up baselines for all the tasks.
Code and Pre-trained models

Motivation

When deep learning models replaced conventional methods, they achieved successful results thanks to their capacity to learn non-linear relationships through architectures such as DNN, DAE, CNN, RNN, and LSTM. Since these algorithms are data-hungry, only a few resource-rich languages such as English, Ancient Greek, Latin, and Egyptian have a corpus large enough to accommodate deep learning algorithms. This is a major constraint for resource-poor languages in the speech domain, where data is limited. Deep learning-based applications for low-resource languages often suffer from data scarcity, which results in poor model performance. Data scarcity occurs when there is not enough supervised and unsupervised data to train the model. With this in mind, we developed the CoRePooL corpus, which includes 420 minutes of annotated and 968 minutes of unannotated Badaga speech for four speech analytics tasks and one text analytics task.

Badaga is a low-resource language from the Dravidian language family. It is closely related to Kannada and is spoken predominantly by the Badaga people of the Nilgiris district, which lies at the junction of Kerala, Karnataka, and Tamil Nadu. The Badaga language is spoken in almost 400 villages in the Nilgiris district. According to the 2011 census, there are 134,000 native Badaga speakers; according to the “Times of India”, there are around 2.5 lakh (250,000) native speakers.

CoRe-PooL

Since the Badaga language has no official native script, we manually created 2100 Badaga transcripts written in English letters, along with their corresponding English translations. Twelve native speakers read these texts aloud and were recorded. Both the annotated and unannotated corpora are released as WAV files with a sampling rate of 22050 Hz in 16-bit PCM format. The annotated corpus was constructed carefully to cover a varied vocabulary, including food items, utensils, proverbs, insects, flowers, animals, colours, numbers, days, weeks, months, fruits, common places, action words, and vehicle types. The unannotated corpus was collated from YouTube. Its genres include songs, short films, and talk shows.
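As a quick sanity check on the audio format described above, the following minimal sketch uses Python's standard `wave` module to confirm that a file is 22050 Hz, 16-bit PCM. The function name `check_wav_format` and any file path passed to it are illustrative assumptions and are not part of the released corpus tooling.

```python
import wave

# Format stated for the corpus: 22050 Hz sampling rate, 16-bit PCM.
EXPECTED_RATE_HZ = 22050
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def check_wav_format(path):
    """Return (ok, info) for a WAV file.

    Hypothetical helper for illustration: `ok` is True when the file
    matches the sampling rate and sample width stated above.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / rate
    ok = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH_BYTES
    info = {
        "rate": rate,
        "sample_width_bytes": width,
        "channels": channels,
        "duration_s": duration_s,
    }
    return ok, info
```

Calling `check_wav_format` on each downloaded file before training can help catch recordings that were resampled or re-encoded along the way.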

Table 1. CoRe-PooL Statistics

            Number of transcripts   Duration (seconds)               Number of      Total number
Split       Male    Female  Total   Male       Female     Total      unique words   of words
Train       3440    3457    6897    8295.75    8359.64    16655.39   1088           35436
Validation  734     736     1470    1759.63    1766.69    3526.32    775            7528
Test        734     736     1470    1777.13    1741.79    3518.92    789            7541
Total       4908    4929    9837    11832.51   11868.13   23700.65   2652           50505
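The transcript counts in Table 1 correspond to roughly a 70/15/15 train/validation/test partition. The short sketch below reproduces that arithmetic from the totals in the table. The dictionary and function names are illustrative assumptions.

```python
# Total transcript counts per split, copied from Table 1.
transcripts = {"train": 6897, "validation": 1470, "test": 1470}


def split_fractions(counts):
    """Return each split's share of the total number of transcripts."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


fractions = split_fractions(transcripts)
for name, frac in fractions.items():
    # Print each share as a percentage with one decimal place.
    print(f"{name}: {frac:.1%}")
```

The same function can be applied to the male and female columns separately to confirm that both genders follow a similar partition.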

Table 2. CoRe-PooL Variation

Variation     Total audio length (minutes)
Annotated     420
Unannotated   968

Gender Identification

Speech to Text

Text to Speech

Speaker Identification

Translator

To learn more, check out our GitHub and read our Benchmark, where you can find the fine-tuned models applied to the different tasks and the observed results.

Contacts

If you have queries about our work, contact us at:

research@rbg.ai