Taxonomies for Human Vs Auto-Indexing by Heather Hedden

Day 1 at Taxonomy Bootcamp covered a lot of basic taxonomy principles such as planning and implementing a taxonomy, choosing taxonomy software and indexing principles. The talk by Heather covered the perennial issue of human vs. auto-indexing and whether it was possible to say that one was better than the other. Ultimately, the choice of method depended on the purpose of the taxonomy and its use. It was emphasized that indexing was best used for search and retrieval.

Before the virtues and drawbacks of each indexing method were explored, Heather provided clarity on what indexing was and how it differed from tagging and categorisation. In a nutshell:

• Indexing is done by a trained indexer, preferably with subject matter knowledge, and is largely used for browsing.
• Tagging can be done by anyone and is the application of labels to documents. These tags can then be used by a database.
• Categorisation is the grouping and placing of information in buckets in a systematic manner.

The differences in human and auto-indexing were covered in 3 broad areas, namely contents/materials handled, methodology and technology. In terms of contents, human indexing would be at its best if the contents were in manageable numbers and included a variety of formats and subjects/topics. On the other hand, auto-indexing would work well for very large numbers of documents, textual documents (no images!) and single subject areas.

Technology-wise, indexers (humans) use fairly simple and straightforward indexing tools, designed so that indexing can be carried out quickly and accurately. There is also the flexibility for indexers to input new terms when necessary. Training for indexers can be carried out with the use of indexing guidelines (both for development and quality checking). Auto-indexing is a little more complex, as it requires an entity extractor, and text has to be mined and analysed. Although auto-indexing is done by machines, there still has to be human intervention in the form of rules building as well as the provision of sample documents to ‘train’ the automated indexing.
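To make the idea of "rules building" a little more concrete, here is a minimal sketch of a rule-based auto-indexer in Python. It is not taken from the talk or from any particular product: the `RULES` table, the trigger words and the `min_hits` threshold are all hypothetical, and a real system would add entity extraction and far richer rules.

```python
import re
from collections import Counter

# Hypothetical rules: each taxonomy term lists single-word triggers and the
# minimum number of trigger hits needed before the term is assigned.
RULES = {
    "Taxonomies": {"triggers": ["taxonomy", "taxonomies"], "min_hits": 2},
    "Indexing": {"triggers": ["indexing", "indexer"], "min_hits": 1},
}

def auto_index(text, rules=RULES):
    """Return the taxonomy terms whose rules fire on the given text."""
    # Count lower-cased words so that matching ignores case.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    assigned = []
    for term, rule in rules.items():
        hits = sum(words[trigger] for trigger in rule["triggers"])
        if hits >= rule["min_hits"]:
            assigned.append(term)
    return assigned

doc = "The indexer applied the taxonomy once."
print(auto_index(doc))  # prints ['Indexing']
```

The human work described above shows up here as editing `RULES`: lowering a threshold or adding a trigger word changes which documents receive a term.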

With the pros and cons of each covered, the next part of the talk focused on the differences in the terms. Terms indexed by humans and by machines can be differentiated through their granularity, types of relationships, descriptions/notes and types of synonyms/variants. The main difference in term relationships is that in human indexing there are both hierarchical and associative relationships. In human indexing, there can also be more notes, which are visible to both the end user and the indexer.
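As an illustration of what hierarchical and associative relationships and notes look like on a single term, the following Python sketch models a taxonomy term. The `Term` class, its field names and the sample terms are assumptions made for this example; the BT/NT/RT abbreviations in the comments are the conventional thesaurus notation for broader, narrower and related terms.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    broader: list = field(default_factory=list)   # hierarchical (BT)
    narrower: list = field(default_factory=list)  # hierarchical (NT)
    related: list = field(default_factory=list)   # associative (RT)
    scope_note: str = ""                          # note shown to indexer/user

def add_narrower(parent, child):
    """Record a hierarchical link in both directions."""
    parent.narrower.append(child.label)
    child.broader.append(parent.label)

def add_related(a, b):
    """Record an associative link, which is symmetric."""
    a.related.append(b.label)
    b.related.append(a.label)

# Hypothetical sample terms.
vehicles = Term("Vehicles", scope_note="Use for road vehicles in general.")
cars = Term("Cars")
garages = Term("Garages")

add_narrower(vehicles, cars)
add_related(cars, garages)
print(cars.broader, cars.related)  # prints ['Vehicles'] ['Garages']
```

A taxonomy with no associative relationships would simply leave `related` empty on every term.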

Heather also touched on the differences in synonyms/variants between human-indexed terms and auto-indexed terms. For example, in human indexing, abbreviations are allowed for common terms, whereas in auto-indexing the machine will not be able to understand the abbreviations.
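The abbreviation problem can be seen in a small Python sketch of a literal variant matcher. This is only an illustration under stated assumptions: the `VARIANTS` table, the term "Knowledge management" and the abbreviation "KM" are hypothetical, and real auto-indexing tools are more sophisticated than whole-word string matching.

```python
import re

# Hypothetical variant list: preferred term -> text forms the matcher knows.
VARIANTS = {"Knowledge management": ["knowledge management"]}

def match_terms(text, variants):
    """Return preferred terms for which any listed variant occurs as whole words."""
    lowered = text.lower()
    found = []
    for term, forms in variants.items():
        if any(re.search(r"\b" + re.escape(form) + r"\b", lowered) for form in forms):
            found.append(term)
    return found

sentence = "Our KM programme starts in May."
print(match_terms(sentence, VARIANTS))  # prints []

# The abbreviation is only recognised once it is listed explicitly as a variant.
with_abbreviation = {"Knowledge management": ["knowledge management", "km"]}
print(match_terms(sentence, with_abbreviation))  # prints ['Knowledge management']
```

A human indexer reads "KM" in context and assigns the right term; the matcher only finds what has been listed for it.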

She concluded with a short description of the additional tasks that an indexer would have to do in both human and auto-indexing. Both require human intervention; it's just that the tasks and the extent of the work are different. For human indexing, the assigned terms have to be checked, and amended or added to where terms have been omitted or misused. In the case of auto-indexing, the work is more focused on the training documents and adjustments to the rules.

This was a very factual and descriptive presentation on both human and automated indexing. It was reiterated that neither method is inherently better than the other; the intended use of the taxonomy should determine whether human or automated indexing is chosen. The two will yield different results in terms of structure and terms created, and each requires a different level of human intervention in rules building or policy development.

Heather’s website can be found at http://www.hedden-information.com