Information available in the digital world can be divided into two broad categories: structured and unstructured. Structured data is what you see in spreadsheets, tables, databases and the like. While this type of data is easy for machines to process, analyze and categorize, the vast majority of the content that we encounter, in the form of emails, blogs, web pages, articles and enterprise data, is in natural language, with little or no structure. This kind of content does not lend itself to easy processing by machines.
We would be overwhelmed by the sheer volume of information available today if it were not possible to organize and analyze it. Only the most relevant information should be selected and presented in a clear and coherent manner. This can be done by harnessing the processing power of information technology. The process of extracting the various entities, their attributes and relationships from unstructured data sources is termed Information Extraction (IE).
IE systems may be driven either by statistical models or by rules formulated by human developers. While each approach has its strengths, rule-driven systems are widely used in enterprise IE for two main reasons.
For one, rules are clear and transparent. Therefore, in the event that the results obtained using an IE system are not satisfactory, the cause of the errors or deviations can be easily identified. The rules can then be refined or rectified as required.
For another, the rules can be customized. This means that every time a new domain is addressed, the basic set of rules can be re-used with the required tweaks, so that the rule developer is not saddled with the burden of re-inventing the wheel each time.
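To make these two properties concrete, here is a minimal sketch of a rule-driven extractor in Python. It is not the system described in this article: the `Rule` class, the two base rules and the `extract` and `customize` helpers are all hypothetical names invented for illustration. Each extraction carries the name of the rule that produced it (transparency), and a base rule set can be adapted to a new domain by overriding or adding rules (customization).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str     # human-readable label, so each match can be traced to its rule
    label: str    # entity type the rule assigns
    pattern: str  # regular expression the rule matches


# Hypothetical base rule set, invented for this illustration.
BASE_RULES = [
    Rule("title_then_name", "PERSON",
         r"\b(?:Prof|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*"),
    Rule("org_suffix", "ORG",
         r"\b[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)* (?:University|Academy)\b"),
]


def extract(text, rules):
    """Return (matched text, label, rule name) triples, in rule order."""
    return [
        (m.group(0), rule.label, rule.name)
        for rule in rules
        for m in re.finditer(rule.pattern, text)
    ]


def customize(rules, extra):
    """Reuse a rule set for a new domain.

    Rules in `extra` replace base rules with the same name; rules with
    new names are appended. The base rule set itself is left untouched.
    """
    merged = {rule.name: rule for rule in rules}
    merged.update({rule.name: rule for rule in extra})
    return list(merged.values())
```

If a result looks wrong, the rule name in the third field of each triple points directly at the pattern to inspect. To cover a new domain, a developer might call `customize(BASE_RULES, [Rule("org_suffix", "ORG", ...)])` with a pattern that also accepts "Institute", rather than rewriting the whole rule set.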
IE has the potential to let unstructured documents be processed much as database applications handle structured data. The resulting presentation of the data contained therein can be rich and structured. However, even after a couple of decades of research in this area, the techniques that have been developed have yet to mature. This is due partly to the richness of natural language and partly to the pace at which the body of knowledge is growing worldwide.
These two factors make it extremely challenging to formulate rules to govern an IE system. On the one hand, manual involvement and effort are crucial to developing high-quality rules. On the other hand, in order to develop the right rules, the developer has to pore over copious volumes of data. The latter is a task more suited to machines, given their data-processing capabilities.
The best possible solution to the problem would be for man and machine to work together, each doing what it is best suited for: the rule developer remains integral to the whole process, while the machine performs the more laborious tasks. The result would be high-quality rules, with the developer spared the tedium of poring over huge volumes of data.
At IITB-Monash, research scholar Ajay Nagesh is working to develop just such a synergistic system to facilitate rule development. He has been working to bridge the gap between the best capabilities of man and machine: manual rules are highly intuitive but might fail to capture all the relevant patterns, whereas machines are capable of scanning far greater volumes of data but are not so good at creating highly readable results.
The IITB-Monash Research Academy is a joint venture between IIT Bombay, India, and Monash University, Australia. Opened in 2008, the Academy operates a graduate research program located in Mumbai that aims to enhance research collaboration between Australia and India. Students study for a dually-badged PhD from both institutions, and spend time during their research in both India and Australia.
Working under the guidance of Prof. Ganesh Ramakrishnan, Prof. Pushpak Bhattacharyya, Prof. Gholam Reza Haffari and Prof. Geoff Webb, Ajay has taken the first step towards the stated goal by developing a rule induction system which generates rules for identifying named entities in text.
Induction is the process of automatically discovering hypotheses on the basis of known examples in combination with background knowledge. It provides a set of techniques to tackle the problem of rule learning. The system takes as input an annotated document collection and a set of basic features, in the form of dictionaries and regular expressions, and induces a set of rules with reasonable accuracy.
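The input and output described above can be illustrated with a toy Python sketch. This is not the induction algorithm used in the project, which the article does not detail. It simply enumerates conjunctions of basic features and keeps those that are accurate on the annotated examples. The feature names, the small dictionary, the thresholds and the `featurize` and `induce_rules` helpers are all assumptions made for illustration.

```python
import re
from itertools import combinations

# Hypothetical basic features: one dictionary lookup and two regular
# expressions. Each takes the current token and the previous token.
FIRST_NAMES = {"ajay", "ganesh", "geoff"}
FEATURES = {
    "in_first_name_dict": lambda tok, prev: tok.lower() in FIRST_NAMES,
    "capitalized": lambda tok, prev: re.fullmatch(r"[A-Z][a-z]+", tok) is not None,
    "after_title": lambda tok, prev: re.fullmatch(r"(?:Prof|Dr|Mr|Ms)\.?", prev or "") is not None,
}


def featurize(tokens):
    """Return, for each token, the set of basic feature names that fire on it."""
    fired = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        fired.append(frozenset(name for name, test in FEATURES.items() if test(tok, prev)))
    return fired


def induce_rules(tokens, labels, min_precision=0.8, min_support=2, max_len=2):
    """Induce rules as conjunctions of basic features.

    `labels[i]` is 1 if the annotators marked `tokens[i]` as part of a
    person name, else 0. A candidate rule is kept if it covers at least
    `min_support` tokens and at least `min_precision` of them are labeled 1.
    Returns (feature conjunction, precision, support) triples, best first.
    """
    fired = featurize(tokens)
    rules = []
    for size in range(1, max_len + 1):
        for combo in combinations(sorted(FEATURES), size):
            covered = [lab for feats, lab in zip(fired, labels) if feats.issuperset(combo)]
            if len(covered) < min_support:
                continue
            precision = sum(covered) / len(covered)
            if precision >= min_precision:
                rules.append((combo, precision, len(covered)))
    return sorted(rules, key=lambda r: (-r[1], -r[2], r[0]))
```

Because each induced rule is just a named conjunction such as `("after_title",)`, a rule developer can read it, judge whether it makes sense, and refine it by hand, which is the kind of human and machine collaboration the project aims for. A practical system would need a far richer feature set and a smarter search than this exhaustive enumeration.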
The system would enable rule developers and annotators of textual data to develop rules faster. Developing rules from scratch is a laborious job. If the rule developer is automatically provided with hints along the way, development time is much shorter. Human intervention would in turn result in better annotation systems. With more annotated data, better models could be built to discover patterns automatically. Research in this direction will therefore benefit not only rule developers but also machine learning systems, which will be enriched by the availability of better annotated data.
Promising as the initial results have been, the team at IITB-Monash is clear that what has been achieved so far is only a beginning. A lot of ground needs to be covered before their goal is reached, but investigation in this direction will open up new research avenues on the synergy between human developers and induction systems.
Research scholar: Ajay Nagesh, IITB-Monash Research Academy
Project title: Induction for Information Extraction
Supervisors: Prof. Ganesh Ramakrishnan, Prof. Pushpak Bhattacharyya, Prof. Geoff Webb and Prof. Gholam Reza Haffari
Contact details: email@example.com
For more information and details on this technology, email firstname.lastname@example.org