TBC2008
Taxonomies for Human Vs Auto-Indexing by Heather Hedden
Anonymous — September 26, 2008 - 7:54am
Day 1 at Taxonomy Boot Camp covered a lot of basic taxonomy principles, such as planning and implementing a taxonomy, choosing taxonomy software and indexing principles. The talk by Heather covered the perennial issue of human vs. auto-indexing and whether it was possible to ascertain that one was better than the other. Ultimately, the choice of method depended on the purpose of the taxonomy and its use. It was emphasized that indexing was best used for search and retrieval.
Before the virtues and drawbacks of each indexing method were explored, Heather provided clarity on what indexing was and how it differed from tagging and categorisation. In a nutshell:
• Indexing is done by a trained indexer, preferably one with subject matter knowledge, and is largely used for browsing.
• Tagging can be done by anyone and is the application of labels to documents. These tags can then be used by a database.
• Categorisation is the grouping and placing of information in buckets in a systematic manner.
The differences in human and auto-indexing were covered in 3 broad areas, namely contents/materials handled, methodology and technology. In terms of contents, human indexing would be at its best if the contents were in manageable numbers and included a variety of formats and subjects/topics. On the other hand, auto-indexing would work well for very large numbers of documents, textual documents (no images!) and single subject areas.
Technology-wise, indexers (humans) use fairly simple and straightforward indexing tools, which were designed so that indexing could be carried out quickly and accurately. There was also the flexibility for indexers to input new terms when necessary. Training for indexers could be carried out with the use of indexing guidelines (both for development and quality checking). Auto-indexing was a little more complex, as it required an entity extractor, and text had to be mined and analysed. Although auto-indexing is done by machines, there still has to be human intervention in the form of rules building as well as the provision of sample documents to ‘train’ the automated indexing.
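To make the "rules building" part a little more concrete, here is a minimal sketch of rule-based auto-indexing in Python. It is only an illustration and not a method shown in the talk or the behaviour of any particular product: the term names, the patterns and the `auto_index` function are all made up for this example.

```python
import re

# Hypothetical rules: a taxonomy term is assigned to a document
# when any one of its patterns matches the text.
RULES = {
    "Taxonomies": [r"\btaxonom(?:y|ies)\b"],
    "Automatic indexing": [
        r"\bauto[- ]?index(?:ing|ed)?\b",
        r"\bautomat(?:ic|ed) indexing\b",
    ],
    "Metadata": [r"\bmeta ?data\b"],
}


def auto_index(text, rules=RULES):
    """Return the sorted list of terms whose rules match the text."""
    terms = set()
    for term, patterns in rules.items():
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            terms.add(term)
    return sorted(terms)


doc = "Auto-indexing works well when the taxonomy is narrow."
print(auto_index(doc))
```

The human work here is in writing and adjusting `RULES`; a real tool would also use sample documents and an entity extractor, neither of which this sketch attempts.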
Having covered the pros and cons of each, the next part of the talk focused on the differences in the terms. Terms indexed by humans and machines can be differentiated through their granularity, types of relationships, descriptions/notes and types of synonyms/variants. The main difference in term relationships between human and auto-indexing is that in human indexing, there are both hierarchical and associative relationships. In human indexing, there can also be more notes which are visible to the end user and indexer.
Heather also touched on the differences in synonyms/variants between humanly indexed terms and auto-indexed terms. For example, in human indexing, abbreviations are allowed for common terms whereas in auto-indexing, the machine will not be able to understand the abbreviations.
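As a rough sketch of what such a term record might hold, the Python below models one term with hierarchical (broader/narrower) and associative (related) relationships, a note and a list of variants. The `Term` class, the `build_lookup` helper and the sample terms are assumptions made for illustration, not something presented in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    broader: list = field(default_factory=list)    # hierarchical: parent terms
    narrower: list = field(default_factory=list)   # hierarchical: child terms
    related: list = field(default_factory=list)    # associative: "see also"
    variants: list = field(default_factory=list)   # synonyms, abbreviations
    scope_note: str = ""                           # note for indexer or end user


def build_lookup(terms):
    """Map every label and variant (lower-cased) to its preferred label."""
    lookup = {}
    for term in terms:
        for name in [term.label, *term.variants]:
            lookup[name.lower()] = term.label
    return lookup


terms = [
    Term("Knowledge management", narrower=["Taxonomies"], variants=["KM"]),
    Term(
        "Taxonomies",
        broader=["Knowledge management"],
        related=["Indexing"],
        scope_note="Use for controlled vocabularies with a hierarchy.",
    ),
]

lookup = build_lookup(terms)
print(lookup["km"])
```

A human indexer can resolve an abbreviation like "KM" from context, while a machine can only resolve it if the variant has been entered explicitly, as in the lookup table here.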
She concluded with a short description of the additional tasks that an indexer would have to do in both human and auto-indexing. Both would require human intervention; it's just that the tasks and extent of work are different. For human indexing, terms have to be checked and amended or added if terms are omitted or misused. In the case of auto-indexing, the work is more focused on the training documents and adjustments of the rules.
This was a very factual and descriptive presentation on both human and automated indexing. It was reiterated that no one method is better than the other and the choice of either one is simply dependent on the usage of the taxonomy. The use of the taxonomy should determine whether human or automated indexing should be done. Both will yield different results in terms of structure and terms created. Both will also require a different level of human intervention in rules building or policy development.
Heather’s website can be found at http://www.hedden-information.com
Taxonomy Boot Camp Keynote: La taxonomie est morte! Vive la taxonomie
Anonymous — September 25, 2008 - 9:55am
Today is the first day of Taxonomy Boot Camp, and Theresa Regli delivered the keynote address: "Taxonomies: Dying? Dead? Or Just Hitting Their Stride?"
Theresa began by asking how we, as taxonomists, remain relevant. There have been a lot of changes in the past few years, and a lot of paradigms that we need to let go of. Taxonomies are still relevant--enterprises still focus on and invest in taxonomies, but in different ways than in the past. There are some situations in which taxonomies aren't necessary, and we need to acknowledge that, but in most situations technology needs taxonomy to achieve best results.
Some of the signs that taxonomies aren't dead yet:
- More people attending Taxonomy Boot Camp this year.
- Taxonomy COP has around 1000 members.
- Taxonomy COP isn't limited to taxonomists and information architects; people from different backgrounds are joining, showing that taxonomists aren't as isolated from other groups as they once were.
Theresa reviewed what she called the "mullets" of taxonomy.
- Bob Boiko thinks the enterprise taxonomy is outdated. One all-encompassing taxonomy is too much for a large organization and is unmanageable. Smaller, more specific taxonomies are needed.
- Ron Daniel thinks that people who say they need a 3-level general business taxonomy need to go. The more general a taxonomy is, the less useful it is; targeted, focused taxonomies are more useful. You also can’t decide how a taxonomy should be structured until you understand the business problem.
- Seth Earley thinks site maps need to go. He also thinks that manual tagging projects are obsolete and we should utilize the improved technology for autocategorization.
- Theresa thinks that the idea of one classification that fits all needs to go. Taxonomists can't dictate how people might need to find information later; instead, we need to figure out how the user might need to find something later.
- Theresa also thinks that definitive categorization is usually obsolete. She gave the example of breakfast. She said to her breakfast means bacon and eggs; to me it means Coke and chocolate.
- Theresa's final mullet was bottom-up content analysis by humans. There is just too much content now to analyze. Content analysis software can give us a good starting point for a project because machines are better at finding the information. A human is better at figuring out the context of that information and how people will use it.
After talking about what taxonomists need to leave behind, Theresa focused on the new lifeblood of taxonomies:
- Application integration
- Creating smaller, more manageable taxonomies for the enterprise
- Understanding the context of information
- Metadata for dynamic navigation and filtered searches. Content has to be metadata-rich in order to be found, and this is no longer specific to e-commerce.
- Taxonomists acknowledging the importance of technology and working to understand that technology.
- Creating standards and teaching auto-categorization tools to make contextual distinctions.
Theresa then talked about the key ideas behind taxonomy. She doesn't like to think of taxonomy as hierarchical—it’s more about categories. Because people approach information in different ways, pieces of information should be more fluid. Taxonomies are about enriching content with metadata so that people can find it however they need. She then commented on folksonomies, which can be useful in some cases, but as she pointed out, aren't really the right path in areas of science, legal, and compliance. When millions of dollars are at stake, letting the masses pick the category isn't a good idea because they don't always pick the correct one. Theresa presented Joseph Busch's three basic principles:
- Metadata needs to be associated with content
- Topics should be divided into a few discrete facets
- Some facets are common to many applications and some need to be specific to each application.
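A small Python sketch of how these principles might look in practice: metadata attached to each piece of content, split into a few facets, and used for filtered search. The documents, the facet names and the `filter_by_facets` function are invented for illustration only.

```python
# Hypothetical content items, each carrying metadata split into facets.
# "topic" and "type" stand in for facets common to many applications;
# "jurisdiction" stands in for an application-specific facet.
documents = [
    {"title": "Q3 compliance report",
     "facets": {"topic": "compliance", "type": "report", "jurisdiction": "EU"}},
    {"title": "Compliance training slides",
     "facets": {"topic": "compliance", "type": "presentation"}},
    {"title": "Lab safety report",
     "facets": {"topic": "science", "type": "report"}},
]


def filter_by_facets(docs, **selected):
    """Return titles of documents whose facets match every selected value."""
    return [
        d["title"]
        for d in docs
        if all(d["facets"].get(name) == value for name, value in selected.items())
    ]


print(filter_by_facets(documents, type="report"))
print(filter_by_facets(documents, topic="compliance", type="report"))
```

Because each facet is independent, users can combine them in whatever order suits them, rather than following one fixed hierarchy.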
Theresa concluded her talk by relating our work to Isaac Asimov's short story "The Last Question." In the story, people ask the computer Multivac, "Can the workings of the second law of thermodynamics (used in the story as the increase of the entropy of the universe) be reversed?" The computer is unable to answer, and the question is repeated several times over thousands of years. Each time, the computer responds that it is unable to answer the question because all the data relationships for the information aren't available. Once all humans are gone, the computer is able to answer the question because it knows all the data relationships.
So this relates to us because we are working to define relationships among data, and once we have completed that work (far, far in the future), taxonomies will be obsolete. As we make computers smarter and technology better, we are working towards the death of taxonomies.
