HB Barathi Ganesh, Jyothish Lal G, Soman K P, Jairam R & Kamal N S
CoRePooL stands for Corpus for Resource-Poor Languages. Voice-based human-machine interaction applications are accelerated by deep learning algorithms, and an annotated (supervised) corpus makes it possible to perform speech-to-text, text-to-speech, translation, gender identification, and speaker identification. Since these algorithms are data-hungry, only a few resource-rich languages such as English, Ancient Greek, Latin, and Egyptian have a corpus large enough to accommodate them. This is a major constraint for resource-poor languages in the speech domain, where data is limited: deploying deep learning based applications for low-resource languages often suffers from data scarcity, which results in poor model performance. Here we aim to produce a corpus for the resource-poor Badaga language and to set baselines for speech and text analysis with the help of deep learning based foundation models.
When deep learning models replaced conventional methods, successful results were obtained thanks to the non-linear learning capability of algorithms such as DNN, DAE, CNN, RNN, and LSTM. These models are data-hungry, however, and data scarcity occurs when there is not enough supervised or unsupervised data to train them. Having observed this, we developed the CoRePooL corpus, which includes 420 minutes of annotated and 968 minutes of unannotated Badaga speech for performing four speech analytics tasks and one text analytics task.
Badaga is a low-resource language of the Dravidian language family, closely related to Kannada. It is spoken predominantly by the Badaga people of the Nilgiris district, at the junction of Kerala, Karnataka, and Tamil Nadu, where almost 400 villages speak the language. According to the 2011 census there are 134,000 native Badaga speakers, while the Times of India reports around 2.5 lakh (250,000) native speakers.
Since the Badaga language has no official native script, we manually created 2,100 Badaga transcripts written in English letters, along with their corresponding English translations. Twelve native speakers read these texts aloud and were recorded. Both the annotated and the unannotated corpus are released as WAV files with a sampling rate of 22,050 Hz and 16-bit PCM encoding. The annotated corpus was constructed carefully to cover a range of vocabulary, including food items, utensils, proverbs, insects, flowers, animals, colours, numbers, days, weeks, months, fruits, common places, action words, and vehicle types. The unannotated corpus was collated from YouTube; its genres include songs, short films, and talk shows.
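The stated audio format (22,050 Hz, 16-bit PCM) can be checked per file before training. The following is a minimal sketch using only Python's standard `wave` module; the function names `check_wav_format` and `write_silent_wav` are hypothetical helpers introduced here for illustration and are not part of the CoRePooL release. The channel count is not stated in the corpus description, so the check reports it without enforcing it.

```python
import wave

EXPECTED_RATE_HZ = 22050        # sampling rate stated in the corpus description
EXPECTED_SAMPLE_WIDTH = 2       # bytes per sample, i.e. 16-bit PCM


def check_wav_format(path):
    """Read a WAV header and compare it with the stated corpus format.

    Returns (ok, info), where info holds the header fields that were read.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        info = {
            "rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),  # reported only, not enforced
            "duration_s": wav.getnframes() / rate,
        }
    ok = (
        info["rate_hz"] == EXPECTED_RATE_HZ
        and info["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH
    )
    return ok, info


def write_silent_wav(path, seconds=1, rate_hz=EXPECTED_RATE_HZ):
    """Write a mono, 16-bit silent clip (assumption: used only to try the check)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(EXPECTED_SAMPLE_WIDTH)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (EXPECTED_SAMPLE_WIDTH * rate_hz * seconds))
```

Summing `duration_s` over a folder of files would also give a quick cross-check of the per-split durations reported for the corpus.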
Split | Transcripts (Male) | Transcripts (Female) | Transcripts (Total) | Duration in seconds (Male) | Duration in seconds (Female) | Duration in seconds (Total) | Number of Unique Words | Total Number of Words |
---|---|---|---|---|---|---|---|---|
Train | 3440 | 3457 | 6897 | 8295.75 | 8359.64 | 16655.39 | 1088 | 35436 |
Validation | 734 | 736 | 1470 | 1759.63 | 1766.69 | 3526.32 | 775 | 7528 |
Test | 734 | 736 | 1470 | 1777.13 | 1741.79 | 3518.92 | 789 | 7541 |
Total | 4908 | 4929 | 9837 | 11832.51 | 11868.13 | 23700.65 | 2652 | 50505 |
Corpus variant | Total audio length (minutes) |
---|---|
Annotated | 420 |
Unannotated | 968 |
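
The split sizes in the annotated-corpus table can be summarised with a few lines of arithmetic. The sketch below copies the transcript counts and durations from that table into a hypothetical `SPLITS` dictionary and computes each split's share of transcripts and its duration in minutes; the names `SPLITS` and `split_summary` are illustrative only.

```python
# Transcript counts and durations (seconds) copied from the annotated-corpus table.
SPLITS = {
    "train": {"transcripts": 6897, "seconds": 16655.39},
    "validation": {"transcripts": 1470, "seconds": 3526.32},
    "test": {"transcripts": 1470, "seconds": 3518.92},
}


def split_summary(splits):
    """Return each split's share of all transcripts and its duration in minutes."""
    total_transcripts = sum(s["transcripts"] for s in splits.values())
    return {
        name: {
            "share": s["transcripts"] / total_transcripts,
            "minutes": s["seconds"] / 60.0,
        }
        for name, s in splits.items()
    }


if __name__ == "__main__":
    for name, row in split_summary(SPLITS).items():
        print(f"{name:<10} share={row['share']:.3f} minutes={row['minutes']:.1f}")
```

With these figures the split is roughly 70/15/15 by transcript count. The per-split durations sum to about 395 minutes, somewhat under the 420 minutes listed for the annotated corpus; the source does not say what accounts for the difference.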