Categorising data to match user intention in real time

It used to be that when students wanted information on a particular topic, or wanted to see the body of work in a field, they went to a library. By searching broadly they’d find what was already out there, and then narrow their search to more specific topics.

Today, with digital data growing at an unprecedented rate, there is a need for dynamic systems of categorization that can not only sort data quickly and efficiently, but also come up with refined subcategories on the fly.

While the problem is not new, Ramakrishna Bairi, a research scholar at the IITB-Monash Research Academy, is working on a new data mining algorithm that seems robust and efficient, even at this early stage. It can also help searchers tell at a glance what kind of data is out there, so that they can narrow down the search by subcategory.

Picture Credit: Ramakrishna Bairi

The core problem is that as organizations grow, they accumulate a lot of digital information, and a single search term can mean many things. For example, if you plug the term “java” into an internet search engine, you would get all kinds of answers: about the island, about coffee, and about the programming language.

Rather than returning every one of these senses, Bairi’s interface searches the organization’s database of documents, say, on an intranet, and presents several drill-down options. But here’s the differentiator: it not only figures out that since you’re at a research institute, you must be searching for the programming language, not coffee shops; it also indexes the results to give you a brand new, freshly rendered menu of subcategories. Now, would you like courses, blogs, books, dissertations, journal articles, or videos about Java programming?
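The drill-down behaviour described here can be sketched roughly as follows. This is a minimal illustration and not Bairi’s algorithm: it assumes each document already carries a hypothetical "sense" label (which meaning of the term it is about) and a "kind" label (course, blog, book and so on), picks the sense that dominates the organization’s own collection, and counts the matching documents by kind to build the menu.

```python
from collections import Counter

# Hypothetical toy records; the "sense" and "kind" labels are assumed
# to have been assigned already by some earlier processing step.
DOCS = [
    {"title": "Intro to Java generics", "sense": "programming language", "kind": "course"},
    {"title": "Java concurrency notes", "sense": "programming language", "kind": "blog"},
    {"title": "Effective Java review", "sense": "programming language", "kind": "book"},
    {"title": "Garbage collection in Java", "sense": "programming language", "kind": "journal article"},
    {"title": "Java streams tutorial", "sense": "programming language", "kind": "course"},
    {"title": "Campus cafe java menu", "sense": "coffee", "kind": "blog"},
]


def dominant_sense(docs):
    """Return the sense shared by the largest number of documents."""
    return Counter(d["sense"] for d in docs).most_common(1)[0][0]


def drill_down_menu(docs, query):
    """Return (sense, {kind: count}) for the dominant sense of the query."""
    hits = [d for d in docs if query.lower() in d["title"].lower()]
    if not hits:
        return None, {}
    sense = dominant_sense(hits)
    menu = Counter(d["kind"] for d in hits if d["sense"] == sense)
    return sense, dict(menu)


sense, menu = drill_down_menu(DOCS, "java")
print(sense, menu)
```

In this toy collection most “java” documents are about programming, so the coffee document is left out of the menu. A real system would have to infer the sense and kind labels rather than read them from the records.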

Reflecting on his research, Ramakrishna said, “Two things make our work different. First, the results are really representative. Categorization is comprehensive. And second, the tool is always evolving, it’s learning and building new categories dynamically in real time.”

In existing searches, results may not be fully representative. Categories are already mapped and information is added to preexisting slots. It’s akin to manually tagging books into existing genres that were codified years ago. If someday a hybrid genre were to come along, it would fit badly into a preexisting grouping. A new genre would have to be made up and then perhaps all the existing books would have to be rechecked to see if they might also fit into this new genre.

The limitation of a system that assigns data to preexisting categories is that, likewise, it is impossible to envisage future categories before they come into being. And manually revising the scheme to accommodate new categories is a tedious task.

In Bairi’s system, when new types of material are added to a database, fresh categories can organically materialize to accommodate them. These categories can then be presented in a way that captures user intention.
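One simple way to picture categories that appear on demand is sketched below. It is an assumption-laden toy and not the method used in the project: categories are plain keyword sets, similarity is Jaccard overlap, and the function name `assign`, the threshold of 0.2 and the provisional “new: …” label are all hypothetical. A document joins the closest existing category if the overlap is high enough; otherwise a new category is created for it.

```python
def jaccard(a, b):
    """Jaccard overlap between two keyword sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign(categories, doc_keywords, threshold=0.2):
    """Place a document in the closest category, or create a new one.

    `categories` maps a category name to its keyword set and is updated
    in place. The threshold is an illustrative assumption.
    """
    doc_keywords = set(doc_keywords)
    best = max(
        categories,
        key=lambda name: jaccard(categories[name], doc_keywords),
        default=None,
    )
    if best is not None and jaccard(categories[best], doc_keywords) >= threshold:
        categories[best] |= doc_keywords  # the category absorbs the new terms
        return best
    # Nothing fits well enough: create a provisionally named category.
    label = "new: " + ", ".join(sorted(doc_keywords)[:2])
    categories[label] = doc_keywords
    return label


categories = {
    "Programming": {"java", "code", "compiler"},
    "Databases": {"sql", "index", "query"},
}
first = assign(categories, ["java", "compiler", "jvm"])
second = assign(categories, ["podcast", "audio", "interview"])
print(first, "|", second)
```

The first document overlaps with the existing Programming category, while the second matches nothing and triggers a new category, without anyone having to define that category in advance.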

Picture Credit: LGPL

The algorithm accounts for the fact that while users may want fine-grained categories in some areas, they may not want them in others. For instance, a computer science institute may not wish to have categories in Bacteria or Genetics even though there might be documents in its digital library talking about classification algorithms for bacteria or genes. Instead, it may just want to categorize all these documents under a single Bio-Informatics category. But, keeping user intention in mind, there is probably a need for fine-grained categories in Machine Learning such as Classification, Clustering, Active Learning, Kernel Learning, etc. The reverse may be true for a biotechnology institute.
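The idea of showing fine-grained categories only where an organization needs them can be illustrated with a small sketch. The rule used here is an assumption chosen for illustration, not the project’s actual criterion: a parent category is expanded into its subcategories only if it holds at least a given share of the collection, and is otherwise shown as a single category. The document counts and the `min_share` value are made up.

```python
def visible_categories(tree, counts, min_share=0.2):
    """Expand a parent into its children only if it is large enough.

    `tree` maps each parent category to its subcategories, and `counts`
    maps each subcategory to a document count. The share-based rule is
    an illustrative assumption.
    """
    total = sum(counts.values())
    shown = []
    for parent, children in tree.items():
        parent_count = sum(counts.get(child, 0) for child in children)
        if total and parent_count / total >= min_share:
            shown.extend(children)   # well represented: fine-grained view
        else:
            shown.append(parent)     # sparse area: one coarse category
    return shown


TREE = {
    "Machine Learning": ["Classification", "Clustering", "Active Learning", "Kernel Learning"],
    "Bio-Informatics": ["Bacteria", "Genetics"],
}

# Made-up document counts for two kinds of institute.
CS_COUNTS = {"Classification": 40, "Clustering": 30, "Active Learning": 15,
             "Kernel Learning": 10, "Bacteria": 2, "Genetics": 3}
BIOTECH_COUNTS = {"Classification": 3, "Clustering": 2, "Active Learning": 0,
                  "Kernel Learning": 0, "Bacteria": 50, "Genetics": 45}

cs_view = visible_categories(TREE, CS_COUNTS)
biotech_view = visible_categories(TREE, BIOTECH_COUNTS)
print(cs_view)
print(biotech_view)
```

With these counts, the computer science collection sees the four Machine Learning subcategories and one Bio-Informatics category, while the biotechnology collection sees the reverse.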

“We handle a very large scale document classification problem that not only classifies documents into millions of categories, semi-automatically, but also suggests suitable categories for a collection of documents. The suggested categories capture the user intention,” said Bairi.

When the basic prototype of the user interface was tested in a closed group, test users showed a strong preference for this search tool over existing search tools, such as regular web search engines.

Picture Credit: Wikimedia

The system may not scale to the size of the World Wide Web, but it will be able to accommodate an organization’s personalized intranet. As of now, the algorithm has indexed a database of 5 million records. The only limitation is computing resources. With clusters of computers, the system could deal with billions of records in terabytes of data.

Initial pre-processing takes a bit of time, since the algorithm has to be trained on each organization’s document collection. Once the metadata has been generated, it takes under 5 seconds to show results with intelligent drill-down menus.

“As digital data grows, in the form of news, blogs, web pages, scientific articles, books, images, sound, video, social networks and so on, the need for effective categorization systems to organize, search and extract information becomes self-evident and imperative. Our work helps in automatic categorization of these digital data,” concluded Bairi. “This project is very inspiring for me as I am solving a real world problem.”

IITB-Monash Research Academy is a joint venture between IIT Bombay and Monash University. Research scholars study for a dually-badged PhD from both institutions, and enrich their research and build collaborative relationships by spending time in Australia and India over the course of their degree. Established in 2008, IITB-Monash Research Academy aims to enhance scientific collaborations between Australia and India.

Research scholar: Ramakrishna Bairi, IITB-Monash Research Academy

Project title: Enriching Information Retrieval through User Intention Detection and Suggestions

Supervisors: Professor Ganesh Ramakrishnan and Professor Mark Carman

Contact details: rkbairi@gmail.com

Contact research@iitbmonash.org for more information on this, and other projects.

The above story was written by Chhavi Sachdev based on inputs from the research student and IITB-Monash Research Academy.

Copyright IITB-Monash Research Academy