The concept of text analytics is used more broadly than I realized, and, as defined in the opening keynote given by conference chair Tom Reamy, encompasses:
- Text mining, based on natural language processing, statistics, and machine learning
- Entity extraction, semantic technology that enables “fact extraction”
- Sentiment analysis, comprising various method to look for positive and negative words
- Auto-categorization, which is often rules-based
My presentation, “Taxonomies for Text Analytics and Auto-Indexing,” described how text analytics can be used with auto-categorization and taxonomies to achieve relatively high quality automated indexing results. Auto-categorization is a type of automated indexing that tends to make use of taxonomies, as categorization requires categories (taxonomy terms). Text analytics can be used as a technology to generate meaningful terms from texts, which in turn can be used auto-categorize content against a pre-existing taxonomy. Auto-categorization typically involves technologies of either complex rules to match terms or algorithms and machine learning. In either case, the terms picked up in auto-categorization would be more meaningful if they were first extracted with text analytics technologies based on natural language processing.
Another presentation looked at a different side to the relationship taxonomies and text analytics. Text analytics is also used as means to build taxonomies in the first place, by providing suggested terms that a taxonomist can then edit. Edee Edwards and Rena Morse of Silverchair Information Systems presented a case study on using text analytics to generate terms for taxonomy development. It required multiple iterations and refinements.
- Heather Edwards of the Associated Press explained how AP classifies the news using a custom-build taxonomy and rule-based auto-classification system.
- Evelyn Kent of MCT SmartContent also presented how news items are classified using a “context-based language” (taxonomy), and even demonstrated how the taxonomy is managed in the taxonomy tool (SmartLogic Semaphore Ontology Manager).
- Anna Divoli of Pingar presented survey results of taxonomy user interface preferences from cases that involved automatically generated hierarchical and faceted taxonomies.
- Alyona Medelyan also of Pingar discussed “controlled indexing” in her case study, which featured results of comparing human versus automated indexing (using machine learning and training sets) using the same taxonomy (the Agrovoc agriculture thesaurus of the FAO).
- Sarah Ann Berndt of the Johnson Space Center spoke about “automatic generation of semantic markup” in a presentation that turned out to be mostly about the application of a taxonomy.
Even if text analytics and taxonomies are combined in different ways, what is common is that combining techniques, tools, and technologies in more challenging situations achieves better results. Techniques, tools, and technologies in this field do not have to compete, but can complement each other.
Great post! Sums up the connection between taxonomy and text analytics really well. I notice a trend of companies moving away from rules based text analytics in favour of semantic machine learning technologies. I think this shift is intended to save resources required to constantly update taxonomy/categorization rules. I wonder if machines do better undertanding how to apply tags/or categorize content over time. It will be interesting to see if any studies emerge in the future.
Great summary. Thanks. A very useful steer.