Image Source: https://www.theguardian.com/education/gallery/2015/jan/23/a-language-family-tree-in-pictures. Image Credits: Minna Sundberg
Natural Language Processing (NLP) is a research area that marries linguistics with machine learning. My supervisor, Prof Pushpak Bhattacharyya, is fond of saying, “In NLP, linguistics is the eye, while computation is the body.” Our laboratory has been dedicated to language processing research for around 20 years, and has pushed the boundary in NLP research for Indian languages. For the past five years, we have been exploring the shared vocabulary among Indian languages, especially in terms of Cognates and False Friends.
Cognates are word pairs that share the same meaning and a similar spelling across languages; the French and English pair Liberté / Liberty is one example. In some cases, similar words share a meaning only in certain contexts, and such word pairs are called partial cognates. For instance, the word “police” in French can translate to “police”, “policy” or “font”, depending on the context.
On the other hand, False Friends are word pairs which share a similar spelling, but different meanings. Such a phenomenon is commonly observed by scholars who study diachronic linguistics or historical linguistics.
Cognates have a special place in Indian languages, as many Indian languages borrow words from Sanskrit. For such words, grammar also provides us with the category of Tatsama and Tadbhava words. There are dictionaries which even provide us with a plethora of such word-sets across Indian languages.
‘Tatsama’ words are easy to identify as they share identical spelling, whereas ‘Tadbhava’ words have drifted away from how they were initially spelt. Computational algorithms have traditionally relied on character sets or phonemes to identify such word pairs. This is where our work offers a more accurate alternative: because Tadbhava words differ in spelling and phonetics, we also incorporate similarity based on word meaning (semantics) and the context of a word. This lets us identify such words with much better precision (to be honest, the recall is higher too!).
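To make the limitation of spelling-only methods concrete, here is a minimal sketch of a character-based similarity score built on Levenshtein edit distance. It is an illustration of the general idea, not the method used in our work; the function names and the plain romanised spellings ("agni", "hasta", "hath") are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def spelling_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical spelling."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


# Tatsama-style pair: the spelling is unchanged, so the score is perfect.
print(spelling_similarity("agni", "agni"))
# Tadbhava-style pair (illustrative romanisation of Sanskrit "hasta" and
# Hindi "hath", both meaning "hand"): the score drops although the meaning is shared.
print(spelling_similarity("hasta", "hath"))
```

A score like this treats the second pair as a weaker match purely because letters changed over time, which is why a signal based on meaning is needed alongside it.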
NLP research has experienced a sudden boom in recent times in this sub-area, which models word semantics. Cross-lingual (across multiple languages) NLP research, however, is still in the nascent stages where modelling of semantics across languages is slowly maturing. Computational algorithms require a lot of data to train themselves so that they can accurately model and predict the correct word meaning, even when dealing with the same language.
We, graduate research scholars of the IITB-Monash Research Academy, study for a dually-badged PhD from IIT Bombay and Monash University, spending time at both institutions to enrich our research experience. The Academy is a collaboration between India and Australia that endeavours to strengthen relationships between the two countries. Its CEO, M S Unnikrishnan, says, “The IITB-Monash Research Academy represents an extremely important collaboration between Australia and India. Established in 2008, it is now a strong presence in the context of India-Australia collaborations.”
For Indian languages, building models that correctly predict similarity in meaning across languages remains a difficult problem. The lack of structured data for the task makes it a tough nut to crack. However, because most Indian languages are closely related to each other, the task at hand becomes relatively more tractable. We use knowledge graphs like WordNet, along with the Indian language texts that are available, to generate such word pairs in abundance. Using the context of such word pairs, we are able to say with a certain probability that these words carry the same meaning across languages. Our work also shows improvement when applied to the task of automatic translation. The results show promise that cognate detection and shared vocabulary can indeed improve NLP for Indian languages.
Both machine learning-based algorithms and modern deep learning-based techniques help us achieve better performance than previous approaches to the problem of cognate detection. Because previous approaches take only spelling and / or phonetics into account, they miss the inherent linguistic need to model the semantics of words across languages. We apply the same approach to detect false friends among Indian languages.
False friends hurt the accuracy of both translation and cross-lingual search, as computational algorithms predominantly take the spelling of words into account. Using our approach, we can classify false friends and generate a list of such word pairs. A direct translation of such words should be avoided as it can sometimes lead to disastrous results. For example, the word “gift” in the German language means “poison”, and I am sure you do not want to get an anniversary “gift” from your spouse. Interestingly, “gift” in the Swedish language also means “poison”, but in a different context, it could mean “married” as well. Now, we do not want our machines to tell us that “married = poison”; unfortunately, that is what they currently do.
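The idea of separating cognates from false friends can be sketched as a two-signal rule: a pair that looks alike and means alike is a cognate, while a pair that looks alike but means something different is a false friend. The snippet below is a toy illustration only, not our actual models. The two-dimensional "meaning vectors" and both thresholds are made-up assumptions; in practice the vectors would come from trained cross-lingual embeddings.

```python
import math
from difflib import SequenceMatcher


def spelling_similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine(u, v) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def classify_pair(word_a, vec_a, word_b, vec_b,
                  spelling_threshold=0.7, meaning_threshold=0.5):
    """Hypothetical rule; the thresholds are illustrative, not tuned values."""
    if spelling_similarity(word_a, word_b) < spelling_threshold:
        return "unrelated"
    if cosine(vec_a, vec_b) >= meaning_threshold:
        return "cognate"
    return "false friend"


# Made-up toy vectors standing in for cross-lingual word embeddings.
print(classify_pair("liberty", [0.8, 0.2], "liberté", [0.7, 0.3]))
print(classify_pair("gift", [0.9, 0.1], "Gift", [0.1, 0.9]))
```

With these toy values, the English and German "gift" pair is flagged as a false friend because the spellings match while the meaning vectors point in different directions.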
Our research for both cognates and false friends achieves more than 90% accuracy for more than 12 Indian languages.
We also use the notion of word semantics and apply it to historical texts in Sanskrit grammar. Our research findings help us gain an insight into how these ancient texts have been passed on in a grammatical tradition and can be traced back to a hypothetical root. Our insights also show us how, with time, multiple variants of the same document are generated.
Prof Malhar Kulkarni, my co-supervisor, says, “Texts are important sources of intellectual history and establishing a particular text using extant available resources is an important task for the historical linguistics community.” We create such a base for the most popular commentary on Panini’s Sanskrit grammar, known as Kāśikāvṛtti. With the help of data accumulated by philologists and computational semantic modelling, we generate a fairly accurate account of the lineage of this text. Given my overall experience in the area of semantics and the long-standing interest of the NLP research community, we hope to make further strides in helping computers understand language.
Research scholar: Diptesh Kanojia, IITB-Monash Research Academy
Project title: Computational Phylogenetics for Variant Manuscripts in Sanskrit
Supervisors: Prof Pushpak Bhattacharyya, Prof Malhar Kulkarni, Prof Reza Haffari
Contact details: firstname.lastname@example.org
This story was written by Diptesh Kanojia.
Copyright IITB-Monash Research Academy