Taxonomy Boot Camp Keynote: La taxonomie est morte! Vive la taxonomie

Today is the first day of Taxonomy Boot Camp, and Theresa Regli delivered the keynote address: "Taxonomies: Dying? Dead? Or Just Hitting Their Stride?"

Theresa began by asking how we, as taxonomists, remain relevant. There have been a lot of changes in the past few years, and a lot of paradigms that we need to let go of. Taxonomies are still relevant--enterprises still focus on and invest in taxonomies, but in different ways than in the past. There are some situations in which taxonomies aren't necessary, and we need to acknowledge that, but in most situations technology needs taxonomy to achieve best results.
Some of the signs that taxonomies aren't dead yet:

  • More people are attending Taxonomy Boot Camp this year.
  • The Taxonomy COP has around 1,000 members.
  • The Taxonomy COP isn't limited to taxonomists and information architects; people from different backgrounds are joining, showing that taxonomists aren't as isolated from other groups as they once were.

Theresa reviewed what she called the "mullets" of taxonomy.

  • Bob Boiko thinks the enterprise taxonomy is outdated. One all-encompassing taxonomy is too much for a large organization and is unmanageable. Smaller, more specific taxonomies are needed.
  • Ron Daniel thinks the stock request for a three-level general business taxonomy needs to go. The more general a taxonomy is, the less useful it is; targeted, focused taxonomies are more useful. You also can't decide how a taxonomy should be structured until you understand the business problem.
  • Seth Earley thinks site maps need to go. He also thinks that manual tagging projects are obsolete and we should utilize the improved technology for autocategorization.
  • Theresa thinks that the idea of one classification that fits all needs to go. Taxonomists can't dictate how people will look for information; instead, we need to figure out how the user might need to find something later.
  • Theresa also thinks that definitive categorization is usually obsolete. She gave the example of breakfast: to her, breakfast means bacon and eggs; to me, it means Coke and chocolate.
  • Theresa's final mullet was bottom-up content analysis by humans. There is just too much content now to analyze. Content analysis software can give us a good starting point for a project because machines are better at finding the information. A human is better at figuring out the context of that information and how people will use it.

After talking about what taxonomists need to leave behind, Theresa focused on the new lifeblood of taxonomies:

  • Application integration
  • Creating smaller, more manageable taxonomies for the enterprise
  • Understanding the context of information
  • Metadata for dynamic navigation and filtered searches. Content has to be metadata-rich in order to be found, and this is no longer specific to e-commerce.
  • Taxonomists acknowledging the importance of technology and working to understand that technology.
  • Creating standards and teaching auto-categorization tools to make contextual distinctions.

Theresa then talked about the key ideas behind taxonomy. She doesn't like to think of taxonomy as hierarchical - it's more about categories. Because people approach information in different ways, pieces of information should be more fluid. Taxonomies are about enriching content with metadata so that people can find it however they need. She then commented on folksonomies, which can be useful in some cases but, as she pointed out, aren't really the right path in scientific, legal, and compliance contexts. When millions of dollars are at stake, letting the masses pick the category isn't a good idea, because they don't always pick the correct one. Theresa presented Joseph Busch's three basic principles:

  • Metadata needs to be associated with content.
  • Topics should be divided into a few discrete facets.
  • Some facets are common to many applications and some need to be specific to each application.

Theresa concluded her talk by relating our work to Isaac Asimov's short story "The Last Question." In the story, people ask the computer Multivac whether the workings of the second law of thermodynamics (used in the story to mean the increase of entropy in the universe) can be reversed. The computer is unable to answer, and the question is repeated several times over thousands of years. Each time, the computer replies that it cannot answer because all the data relationships for the information aren't yet available. Only once all humans are gone does the computer know all the data relationships, and it is finally able to answer the question.
This relates to us because we are working to define relationships among data, and once we have completed that work (far, far in the future), taxonomies will be obsolete. As we make computers smarter and technology better, we are working towards the death of taxonomies.

Get Your Taxonomy Boot Camp Puzzles Now!


Today Taxonomy Boot Camp officially starts in San Jose, CA. In an effort to have 'fun' with controlled vocabularies, I created these domain-specific crossword puzzles. If you are at the conference, stop by and pick some up for your trip home on Friday!

These puzzles were created from taxonomies available for licensing via the Dow Jones Taxonomy Warehouse site at www.taxonomywarehouse.com. Each puzzle's theme covers a specific industry domain, and the puzzle answers are based on standard taxonomy development relationships such as:

RT = Related Term
BT = Broader Term
EQ = Equivalent
USE = Use
UF = Use For
NT = Narrower Term
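
If you'd like to play with these relationships outside a crossword grid, here is a minimal sketch of how they might be modeled in code. The terms are invented for illustration and aren't drawn from any Taxonomy Warehouse vocabulary.

```python
# Toy model of standard thesaurus relationships.
thesaurus = {
    "automobiles": {
        "BT": ["vehicles"],       # Broader Term
        "NT": ["sports cars"],    # Narrower Term(s)
        "RT": ["roads"],          # Related Term(s)
        "UF": ["cars", "autos"],  # Use For: non-preferred synonyms
    },
}

# USE is the inverse of UF: it points a non-preferred (equivalent)
# term at the preferred term to be used instead.
use_refs = {"cars": "automobiles", "autos": "automobiles"}

def preferred(term: str) -> str:
    """Resolve a term to its preferred form via USE references."""
    return use_refs.get(term, term)

print(preferred("autos"))              # -> automobiles
print(thesaurus["automobiles"]["BT"])  # -> ['vehicles']
```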

Download the Puzzles Now! [note: Adobe PDF reader required]

Stuck? Need some answers? View the puzzle answers here:

The Future of Search: A Keynote by Susan Feldman at Enterprise Search Summit West 2008

Sue Feldman has been a key analyst and researcher in the search space for a number of years. Her work at IDC as Vice President of Research, Search and Digital Marketplace is very highly regarded. (Sometimes I envy her job!) It's Wednesday morning in sunny San Jose, and Sue has just given the morning's keynote at the Enterprise Search Summit West.

Sue believes that we are seeing a convergence of tools in search, and thankfully the vendors are seeing a stronger market, which will motivate them to keep innovating. The future of search is not a platform based on transactions, as we have today. It will be a language-based foundation for a new platform - a knowledge platform that she predicts will gain equal place with transaction-based systems. The similarities with the evolution of the database platforms imply a parallel path.

We will continue to see development in categorization, text analytics and linguistic modules. This includes capabilities for identifying parts of speech; extracting entities, concepts, relationships, sentiment and geo-location; semantic understanding via dictionaries and taxonomies; and support for multiple languages. One of the biggest problems Sue thinks we need to solve in the search market is something taught to every library school student: selection. There is so much information, from so many sources - what can you trust? What are the valuable sources?
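
These capabilities are easy to get a hands-on feel for. Below is a minimal sketch of part-of-speech tagging and entity extraction using the open-source spaCy library - purely an illustration of this category of tooling, not anything named in the keynote, and it assumes the small English model is installed.

```python
# Parts of speech and entities with spaCy (illustrative stand-in).
# Setup assumption: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sue Feldman gave the keynote in San Jose on Wednesday.")

for token in doc:
    print(token.text, token.pos_)  # part-of-speech tags

for ent in doc.ents:
    print(ent.text, ent.label_)    # entities: PERSON, GPE, DATE, ...
```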

What are the market drivers?

  • New and updated business requirements.
  • Connecting the right information to the right people at the right time - we've heard that for years, and while it may be annoying, it's still valid!
  • Determining the state of the business despite the data sitting in separate silos.
  • Compliance - governments create and change those requirements regularly!
  • Controlling costs - think information workers and call centers; the faster a service rep can find information and finish the call, the lower the per-call cost and the more calls they can take, translating to happier, more loyal customers and the next driver: increased revenue.
  • In keeping with "Web 2.0," a better understanding of customers and improved communications with them.

eCommerce will require ever more sophisticated tools in the digital media search space, as will publishers as they continue to migrate online. The digital marketplace and government (DoD, NIH) have been early investors in this space - for ad matching, interaction improvements, rich media search, fraud and terrorist detection, access to information and more. The market, according to Sue, is realizing that it has tons of information NOT in its ERP and CRM systems. Transaction-based computing is no longer enough. User-centered computing requires re-thinking and new human-computer interaction models.

Sue believes we need to automate knowledge work, as we are no longer limited to working 40 hours a week in our offices - we work in bits, here and there, 24/7, on multiple devices and in many formats. We need personalized interaction models - and even more granular than just the individual level: at the level of the person's role - employee, volunteer, family member, friend. The personalization needs to address the user, the device, and the context. It needs to be flexible and adaptable, ad hoc, in real time. It needs to be secure and contiguous across user environments.

The challenges for search are:

  • How to unify access to all kinds of information from a single, contextual user interface
  • Improving human-computer interaction models
  • Identifying what is good in interaction design for information access

Sue believes that she will not have a market to forecast in 10 years. By then, search will be embedded in the platform and in the applications to provide interaction. Applications will use this search platform to personalize, filter and visualize. We will see task-specific applications in our work environments; in fact, some of these applications are already on the market. Search will be at the center of interactive computing, as search is now language-based, just as humans are.

Meaning-Based Computing: A Presentation by Autonomy at Enterprise Search Summit West

Like Daniela, I too am attending the 2008 KMWorld/Enterprise Search Summit West/Taxonomy Boot Camp meta-conference in San Jose. Shortly before lunch, we heard from Gary Szukalski, Vice President, Customer Relations, Autonomy. He spoke about Meaning-Based Computing. I've known Gary for a number of years and he did not disappoint - his message becomes more refined each time I see him speak.

Gary spoke of a "major paradigm shift" in the IT industry. For years, we (IT practitioners and vendors) have been forced, unnaturally, to aggregate, dumb down, and structure the mess of unstructured data that makes up approximately 80% of an organization's information assets. Why have we done this? Because that's how computers work - they need structure. We are moving into a world where we can stop forcing structure onto data, as computers will understand the semantics of what they are storing and indexing.

Now, he didn't say semantic web or semantic technologies. :) He talked about meaning - how do we teach our machines to disambiguate terms? He gave an Enron example: in the Enron corpus, "shred" means destroying paper documents, but it also refers to slicing vegetables - and it's a snowboarding term, too. How does the machine know? This is where Autonomy is heading.
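
As a toy illustration of the problem (and emphatically not Autonomy's actual technology, which is far more sophisticated), a simplified Lesk-style approach picks the sense whose hand-written gloss shares the most words with the surrounding context.

```python
# Disambiguate "shred" by overlap between context and sense glosses.
SENSES = {
    "destroy documents": {"destroy", "paper", "documents", "evidence", "audit"},
    "cut vegetables": {"slice", "cut", "vegetables", "cheese", "recipe"},
    "snowboard": {"snow", "snowboard", "mountain", "powder", "ride"},
}

def disambiguate(context: str) -> str:
    """Return the sense whose gloss overlaps the context the most."""
    words = set(context.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate("shred the audit documents before the lawyers arrive"))
# -> destroy documents
```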

Why would we care? Gary spoke of the amendments to the Federal Rules of Civil Procedure that took effect in December 2006. In a nutshell, these amendments made electronically stored information explicitly discoverable in litigation. There are definite ROI measures to be had for using the right discovery tools to protect organizations from legal trouble. This brought to my mind the Sedona Principles as well - legal guidelines regarding the importance of metadata.

Pan-enterprise search is the new buzzword. Rather than aggregating - federating - sources together, a search tool should now be able to index ALL objects, regardless of file or storage type. Glad to hear a top ES vendor saying that finally!

Now, I was a big Verity customer/user at a prior employer. I gave them a great deal of feedback on their tools. One thing that always gnawed at me, born from my library roots, was that the definitions of the categories and topics that improved search relevance were locked in the tools. My organization defined them, but we couldn't share them easily - only the evidence of their existence, by means of better search results and faceted browsing. But the critical thing about "meaning" is that it be shared! In the "shred" example above, I fully understood its importance in the Enron context. But my first thought on hearing the word was cooking, while the woman next to me thought of snowboarding.

How does an organization use the power of the tool to educate the users of the tool? Who is working on the UI part of this paradigm shift? And who is thinking about the UI in the context of information security? Secure search should provide access at the role, group, organization or public level. Is Autonomy using open standards to minimize the effort of integrating metadata pan-enterprise? For me, pan-enterprise is not just behind the firewall; it extends onto the web in the form of corporate messaging and consumer feedback. Are any of the enterprise search vendors using open methods to allow this kind of integration? I'm interested in hearing, as I left the search world behind a couple of years ago and have drifted towards the outer edges of the space.

This was one of the better presentations this morning, and I hope they post the slides somewhere soon!

Tag, You're It - Keynote From Enterprise Search Summit

I am at the opening day keynote for Enterprise Search Summit West in San Jose today, having rushed down from Pacifica on this beautiful morning, driving (ok, speeding) down 280 to make this early morning session. Obviously, if you have been following me for a while over on my blog, you know I have a 'thing' for social tagging and recently published an eBook on hybrid approaches to folksonomies and taxonomies in the enterprise, so I did not want to miss it.

The keynote is titled 'Tag, You're It: Social Tagging Strategies for the Enterprise' and is being led by Gene Smith, Principal, nForm User Experience. Gene is the author of the book 'Tagging: People-Powered Metadata for the Social Web'.

Notes:
Why are we here (at the conference)? To figure out how to find *the good* stuff.

The 19th century saw an explosion in paper records and a flourishing of patent filings for ways to store records and information. The one that emerged as the winner was vertical filing, with folders and tabs as a key piece. The tabs of vertical filing are still seen in today's web user interfaces.

Folders have been the dominant organizing principle - then links came onto the scene.

Instead of an information explosion, think of it as a stream - immersion in the flow.
The challenge is keeping track of things and finding what we need later on. Tags are:
- fast
- simple
- social
- good enough

A tag (word) can mean a lot of different things.

Looking at different tools and why they are interesting:

Zigtag - semantic social bookmarking
When you are about to tag something, you type and pick from a list that includes definitions.
They have millions of concepts - they mine public data sources for user-generated content and built an inference engine to provide the concepts.

LibraryThing
Any person can make any two tags equivalent - but they can also remove the equivalence. "Humor" and "humour" are the same word but carry different meanings in different cultures (America vs. the UK); the authors tagged with each are different. (A rough sketch of modeling tag equivalence in code follows below.)

Value chain of the LibraryThing features:
combine tags → tag mash search → tagsonomies (mapped to existing categories)
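
Here is that rough sketch: a union-find structure with one canonical tag per group of equivalents. This is a guess at a data structure, not LibraryThing's actual implementation (which isn't public), and this simple version can't undo an equivalence the way LibraryThing can.

```python
# Toy model of user-driven tag combination via union-find.
parent = {}

def find(tag: str) -> str:
    """Resolve a tag to its group's canonical tag."""
    parent.setdefault(tag, tag)
    while parent[tag] != tag:
        parent[tag] = parent[parent[tag]]  # path compression
        tag = parent[tag]
    return tag

def combine(a: str, b: str) -> None:
    """Declare two tags equivalent (b's group becomes canonical)."""
    parent[find(a)] = find(b)

combine("humour", "humor")
print(find("humour"))  # -> humor: both tags now retrieve as one
```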

The big problem is getting people to use the tools you provide for them!

Cold-Start Problem -

  • Create incentives - reward a person by identifying that they were the first person to tag something, or create social proof ('feature linker' - who doesn't like to see their name in lights?)
  • Try to pre-populate the tag box with tags other people have used (a minimal sketch of this follows below).
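
A minimal sketch of that pre-population idea, with invented data: rank the tags earlier users applied to the same item and seed the tag box with the most common ones.

```python
# Suggest tags for an item from what earlier users applied to it.
from collections import Counter

item_tags = {
    "http://example.com/article": ["taxonomy", "search", "taxonomy", "metadata"],
}

def suggest(item: str, n: int = 3) -> list[str]:
    """Return the n most common tags for an item, to seed the tag box."""
    return [t for t, _ in Counter(item_tags.get(item, [])).most_common(n)]

print(suggest("http://example.com/article"))  # -> ['taxonomy', 'search', 'metadata']
```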

Some other examples:
Wesabe - sticky tags are always applied to the item, but 'not sticky' (one-time) tags are also allowed. It shows you your spending habits by clustering your tags - giving users benefits from the tags they've used.

Dogear - built internally at IBM. It was architected so that it produces an RSS feed for every tag. What happened is that as people started using it, groups found interesting things to do with their RSS feeds, like displaying the content in other environments and creating mashups - allowing innovation on the tags so that value is created by the users' needs.
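
The feed-per-tag idea is easy to picture in code. Here's a minimal sketch using only the Python standard library - IBM's actual Dogear implementation isn't public, and the bookmark data is invented.

```python
# Generate a bare-bones RSS 2.0 feed for all bookmarks carrying a tag.
import xml.etree.ElementTree as ET

bookmarks = [
    {"title": "Enterprise tagging", "url": "http://example.com/a", "tags": ["tagging"]},
    {"title": "Search basics", "url": "http://example.com/b", "tags": ["search"]},
]

def rss_for_tag(tag: str) -> str:
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Bookmarks tagged '{tag}'"
    for b in bookmarks:
        if tag in b["tags"]:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = b["title"]
            ET.SubElement(item, "link").text = b["url"]
    return ET.tostring(rss, encoding="unicode")

print(rss_for_tag("tagging"))
```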

Although this specific slide deck does not seem to be there yet, Gene shares his slides over at SlideShare, and you can also follow Gene on Twitter @gsmith.

Thinking Outside Of The Synaptica Box

The Synaptica team likes to think outside the box - both with customers and with the use of our own tools - so I wanted to share with you some of the ways that we use Synaptica internally here at Dow Jones that might seem surprising if you think about taxonomy management tools as just that: development and management tools for controlled vocabularies.

Beyond straightforward taxonomy and thesaurus management, which of course is the primary purpose of Synaptica, we have incorporated a number of other vocabularies and projects into the tool and integrated them into our development and account management processes.

One way that we are using Synaptica is to manage accounts, contacts and even lists of competitors. Given the ease with which one can set up uniquely designed vocabularies, it is no problem to create one dedicated to storing any of the above information and the appropriate details. Once the vocabulary has been created and populated, it is just as easy to begin to link companies and individuals to account information, notations about sales status, and even suggested projects for enhancing Synaptica.

In the same vein, we use Synaptica to manage our ongoing projects and a general knowledge base for any issues that might be encountered by our user base. We can store a definition, the status and priority of the project, and link it to the company or individual that requested it. Using this method we can easily track every aspect of the work from start to finish. Each project assigned to a specific version release is marked using category labels and workflow flags, so we can very closely monitor and track where we are with each proposed project for ongoing releases.

We are applying the tools and functionality of Synaptica in ways that may not seem immediately apparent, but given the flexibility of the application there are many ways that one can find to creatively employ the features and functionality of Synaptica. Contact us today to find out more about how you can use Synaptica to assist with your internal processes, even in ways you may not have thought possible!

Is the Semantic Web the Internet's Equivalent of the Green Building Movement?

My kids are still in their "Bob the Builder" years. At some point Bob went "green" - left Bobsville to build a sustainable community in Sunflower Valley, complete with grass-growing roofs, hand-pumped water and solar powered trailers. He even got a new catchphrase - "Can we fix it? Yes we can!" has been joined by "Reduce, Recycle, Reuse." While it's a little over the top, I can't complain. Every reminder helps. This last weekend though, when Bob chanted his new mantra, I heard a different voice echo in my head - one saying re-use, re-mix... it sounded eerily like Eric Miller at SemTech keynotes, in consulting presentations, in print. I guess I got the message!

Reduce - for our environment we want to encourage people to think carefully about what we actually need to consume. We need to think about that notion in regards to our data stores as well. Are they full of clutter? Are we afraid to get rid of data in case we might need it later? I don't think it should be about ownership - we "own" data about as much as we "own" the earth - we are stewards, not controllers. Yes, there is data we need to more carefully manage, as we would our home and property; government regulations help us define our responsibilities there. But how much time and effort do we want to spend managing all of that data? How many resources - human and infrastructure - does it take? Can we acquire some of the data we need from a trusted source - a trusted steward who will commit to maintaining its integrity? I'm not suggesting we create monopolies - we need diversity of thought, and may need to subscribe to more than one data provider. It could be simple or complex data you outsource - do you really need to maintain your own list of US States or Countries for your online form? Do you NEED to download a local copy of a paper, or do you simply want to make notes on a certain paragraph? The costs can range from free to exorbitant.

Recycle - find new ways of using data. If we can turn billboards, magazines and plastic shopping bags into purses, we can transform mere metrics into compelling visualizations and startling statistics. One compelling example is Gapminder. Using data from the United Nations, the team under Hans Rosling has developed interactive charts and maps detailing global trends that provide greater impact for consumers in a world of information overload. The same data we usually see in boring columns and rows has new form: new impact.

Reuse - how many of us have taken a plastic container from the grocery store and used it later to store leftovers in the fridge? Or to carry our lunch to work where if it gets lost, no worries? We can do the same with our bits of data. Really, this notion of data reuse has taken off already - new mashups are being generated at an exponential rate! We can use this same creativity inside the firewall - let your geeks play with the data, give them sandboxes and free time to explore. At the very least, don't prevent them from trying new things when security restrictions have been met. Anthony Bradley of Gartner called this the MacGyver Principle at the Gartner Web Innovation Summit this past week in Los Angeles. MacGyver didn't go collect the right tools for the job - he used the resources already available. This is where serendipity occurs (and in MacGyver's case, survival!) - unanticipated uses by unanticipated users.

As to the technologies themselves - there are options, just as we have in the real world! Choose the right tool for your project. Wind, water, solar power - each option has a micro-climate for which it is best suited. Use the same consideration for your own information and technology projects. Will Microformats do the job? Do you need to use a W3C standard? Do you need a triple store, or can you start building on available databases? Think carefully, balancing the long-term return on the investment.
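
For a sense of how low the barrier to the triple-store option can be, here is a minimal sketch using the open-source rdflib library. The namespace and data are invented for illustration.

```python
# Build a tiny graph of triples, then query it with SPARQL.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.artist1, EX.name, Literal("Example Artist")))
g.add((EX.artist1, EX.playedOn, EX.radio1))

for row in g.query("SELECT ?n WHERE { ?a <http://example.org/name> ?n }"):
    print(row.n)  # -> Example Artist
```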

One site I've recently come across is the new BBC Music (Beta) site. They're taking data from their own radio play system, Musicbrainz, Wikipedia and more to create comprehensive profiles of musicians. I can't do it justice - read about it for yourself.

I think Bob and the gang would be proud. Heck, so would Al Gore! ;) What are your favorite ways of making the internet "green?"

Flickr image by Kate Blackport, found thanks to Search by Creative Commons

What is the Hardest Content to Classify?

This week I saw that my first blog post on Synaptica Central had been published! After a few seconds of enjoyment, as a wave of pride washed over me, I realized that I now need to post pretty frequently to give our readers something interesting to read! So, without further ado…

The first topic that came to mind is the whole area of classification of different types of content: text, sound, video and images.

I often speak to clients who have a range of item types stored in a number of repositories. They're often looking to classify new content, or to work on older content in order to improve its findability. They are always looking to get more value from their content.

In these circumstances a content audit is often called for, to answer the 'What do you have?' question. This then leads to a general discussion of the content types and the ways in which they can be classified, usually using a controlled vocabulary either applied by a machine, by a person, or by a mixture of the two.

One thing that often prompts questions is my fairly frequent assertion that images are easily the hardest item type to deal with.

Why are Images the Hardest Content to Classify?

- Textual items contain text. The use of auto-categorising software, free-text storage and access, etc., makes organising and finding textual items relatively easy.

-Sound can be digitised and turned into text.

- Video often has an audio track that can be turned into text too. Computers can be used to identify scenes. Breaking a video into scenes and linking in a synched and indexed soundtrack can provide pretty good access for many people (though there's a whole blog post in the many access points to video that these processes don't provide).

Images, on the other hand, have no text and no scenes; all you have are individual images, with the meaning and access points held in the visuals.

Some will say that this is really not a problem: all you need to do is use content-based image retrieval software to identify colours, textures and shapes in your images, and you'll soon be searching for images without any manual indexing. However, whilst this technology is promising, it leaves a lot to be desired.
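
To see why, here's a toy version of the kind of colour-histogram comparison such software performs, using Pillow and NumPy (the filenames are placeholders). Two images with similar colour distributions can score as near-identical even when they depict completely different subjects - which is exactly the gap manual indexing fills.

```python
# Compare two images by colour histogram intersection (toy CBIR).
import numpy as np
from PIL import Image

def histogram(path: str) -> np.ndarray:
    """8x8x8-bin RGB colour histogram, normalised to sum to 1."""
    img = Image.open(path).convert("RGB").resize((64, 64))
    h, _ = np.histogramdd(np.asarray(img).reshape(-1, 3),
                          bins=(8, 8, 8), range=((0, 256),) * 3)
    return h.ravel() / h.sum()

def similarity(a: str, b: str) -> float:
    """Histogram intersection: 1.0 means identical colour distributions."""
    return float(np.minimum(histogram(a), histogram(b)).sum())

# A sunset and a campfire may score high here despite depicting
# completely different subjects.
print(similarity("sunset.jpg", "campfire.jpg"))
```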

Today, the way to provide a wide and deep level of access to still images continues to be by using people to view images, write captions and assign keywords or tags to each image based on image 'depictions' and 'aboutness and attributes'. This manual process often requires the use of a controlled vocabulary to improve consistency and application.

However, how this indexing is done, and what structures support it, will be the subject of further posts - I just wanted to get my thoughts out there!

So stay tuned.
Ian

Why My Dog is Like a Search Engine Without a Taxonomy

A couple of weeks back I asked my 130-pound dog Townes if he wanted a 'biscuit' and he ignored me - I had forgotten that he isn't the smartest dog in the world and that he can't make the automatic association that a 'biscuit' is a 'cookie', so as I spoke, he just stared into another room. (Read: it really isn't that he's dumb, just that he wasn't trained.) Now he has become a bit of a taxonomy star, and it might be going to his head.

Today I ran into someone who had just recently read my personal blog for the first time, had seen my post titled "Why my dog is like a search engine without a taxonomy," and had loved the video I made. She told me that she used it in an internal discussion about taxonomies. She wasn't the first to say they loved the video, so I figured I would repost it here.

So that morning, when I asked him if he wanted a 'biscuit' instead of a 'cookie' and all I got was a blank stare, I told him he was "like a search engine without a taxonomy," and then he looked even more confused. Then we made this video. He is a bit bored in it because it is the second take, so the initial reaction is not 100%, but hopefully you get the idea!

The moral of the story? As humans we easily make associations between words - machines and dogs can't unless they are "trained".
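
For the programmers reading along, here is Townes in code: a toy search that misses 'biscuit' until it is 'trained' with an equivalence, much as a taxonomy's USE/UF relationships train an enterprise search engine. All the data is invented.

```python
# A search engine without a taxonomy: no synonym, no results.
documents = {"doc1": "the dog gets a cookie after his walk"}
synonyms = {}  # query term -> trained equivalent

def search(query: str) -> list[str]:
    terms = {query, synonyms.get(query, query)}  # expand with equivalents
    return [doc for doc, text in documents.items()
            if any(t in text for t in terms)]

print(search("biscuit"))        # -> []  (the blank stare)
synonyms["biscuit"] = "cookie"  # training
print(search("biscuit"))        # -> ['doc1']
```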

He is a four-year-old Great Dane, in case you are wondering - and yes, he got lots of biscuits and cookies that morning! (Like most days....)

Author Spotlight: Ian Davis, Global Project Delivery Manager, Taxonomy Delivery Team

My name is Ian Davis, and I'm a Global Project Delivery Manager working in the Client Solutions Taxonomy Delivery Team and based in our London office. I work to develop and deliver a range of content and information solutions for our global clients. Projects can include discovery assessments, taxonomy strategy and creation, taxonomy mapping, search support, information architecture and website development. I also assist in the marketing and deployment of Synaptica, a semantic management tool offered by Dow Jones, and the website www.taxonomywarehouse.com.

My particular areas of interest include: developing taxonomies, thesauri, and metadata schemas; manual and automated indexing of still and moving images; deploying and using Synaptica controlled vocabulary software; the challenges of managing teams of geographically dispersed information workers; website creation and development; and the localisation of content into multi-lingual environments.

I joined Dow Jones in February, 2006, after 13 years developing taxonomy and indexing solutions for still image libraries at both Corbis Corporation and Photonica (formerly part of Amana Japan and now part of Getty Images). At Corbis, I served as head of the UK division's image cataloguing department. At Photonica, I worked to create and implement the e-commerce website www.iconica.com and was responsible for the development of www.photonica.com. I also developed, implemented and maintained all vocabularies underpinning the classification and retrieval of Photonica's extensive digital image content. One aspect of this included creating an extensive English language thesaurus and managing the localisation of that controlled vocabulary into five European languages. I managed a team of ten still image indexers and five thesaurus developers. After leaving Photonica, I worked as an independent consultant for BUPA in the area of metadata and taxonomy creation and development, and the implementation of an enterprise search solution.

Most of my time is currently spent working on the delivery of a major client engagement in Asia. I'm managing a team of geographically dispersed staff who are working on the customisation of a large topical thesaurus and the creation of various browsable taxonomies. We're also creating a multi-lingual thesaurus by translating the large English thesaurus into three other languages and tying the whole lot together. If that wasn't enough, we're also involved in mapping the vocabularies we're working on to both legacy internal client vocabularies and third-party ones. We're also starting to consider how to move these thesauri and taxonomies into the world of ontologies.