Text Analytics Practical Applications: The Junk Drawer

Time to address the junk drawer. If you’re anything like me, you have a junk drawer somewhere in your home. Regardless of how neat and organized you may be, you have allotted a space for loose batteries, twist ties, some pennies, a keyring, a household gadget received as a White Elephant gift, random screws and nails, something metallic you are sure is an essential part of some other device in your house but aren’t sure what, a cable for electronic gear outdated long ago, half a pencil, and all the other whatnots that don’t neatly fit anywhere else. If you don’t have such a drawer, don’t quit reading, because I’ll bet there’s still a metaphor for you out there somewhere.


You have a junk drawer at work as well. In many organizations, this junk drawer is called a file share or shared drive. In this shared drive are thousands of folder “drawers”, some of which are highly organized and include document naming conventions and governance of the content, but many of which include documents unknown to anyone but the original creator and very likely abandoned there years ago. When you clean out the junk drawer at home, perhaps you tip it over and sift through the contents, carefully placing some objects in a more suitable location, disposing of others, and neatly replacing whatever else you find back into the junk drawer. Taking the same approach at work isn’t practical: there are issues of scale, in that there may be thousands of folders containing hundreds of documents; there are issues of access, as you may not have permissions to view the contents of these folders; there are issues of knowledge, because you don’t know what all those documents contain or whether they are worth retaining.


What do you do? First, you could archive it all. Archiving is a clever name for doing nothing with the content other than designating it a valuable resource to be accessed at your own peril. Second, you could delete it all. Deleting it all saves a lot of space, along with the time you would otherwise spend sorting through the content. I’m glad you are the one proposing the deletion, because I would not want to face the angry revolt among end users that would follow the loss of critical business content. Good luck! Third, you could gather a group of people, or contact each folder owner, to go through and sort all this content, saving some and deleting old and irrelevant information. Once you have bought all those doughnuts as bribes to get people to participate, you will have a very nice, clean folder system in just a few short years, only to start the process over again with all the mess created in the interim. Finally, you could point a search engine at it and search all the content. That’s a step in the right direction, in that you have sped up the process of looking inside these documents, but then you run into the garbage in, garbage out dilemma. The search engine doesn’t cast judgment on good versus bad content; it just indexes and retrieves keywords. You’d better know how to craft a good search query to get to exactly what you want.


At this point, you are probably secretly hoping there will be a catastrophic malfunction and all the content will simply vanish of its own accord, disappearing into the electronic ether with only an IT representative left behind shrugging his or her shoulders over your loss.


There are other options. While no software can go through the content and decide for you what it is worth, text analytics can speed up the process of identifying key information and help you apply predefined value judgments. One way to do this is to run a text analytics tool across the designated file share, picking out elements the business has defined as indicators of content value. For example, extracting the last-modified date, important business keywords, or hot topics allows you to identify content which clearly has current business value. Once identified, this content can be left in place or moved to another location and then made available to the corporate search engine. The rest of the content can be archived, or the process can be repeated with a focus on another set of value parameters. Likewise, for content repositories that are largely unknown to the business, where important concepts cannot be predefined, a content audit can begin by using text mining to identify the common or important concepts in the content. While this is only the first step in cleaning up the content, it’s much more practical than manually perusing each folder. This second method works well as the opening move of a content audit and evaluation in content and knowledge management applications.
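To make the two approaches above a little more concrete, here is a minimal sketch in Python rather than a description of any particular text analytics product. It assumes the files are plain text (real file shares hold Word documents, PDFs, and the like, which would need a text extraction step first), and the keyword list, the one-year cutoff, and the function names triage and common_terms are all hypothetical stand-ins for whatever rules your business would actually define.

```python
import re
import time
from collections import Counter
from pathlib import Path

# Hypothetical value rules; a real business would define its own.
KEYWORDS = {"contract", "invoice", "roadmap"}
MAX_AGE_DAYS = 365
# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "and", "for", "with", "that", "this"}


def read_text(path):
    """Read a file as lowercase text, ignoring undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="ignore").lower()


def has_current_value(path, keywords=KEYWORDS, max_age_days=MAX_AGE_DAYS):
    """Return True if the file was modified recently or mentions a keyword."""
    age_days = (time.time() - path.stat().st_mtime) / 86400
    if age_days <= max_age_days:
        return True
    text = read_text(path)
    return any(word in text for word in keywords)


def triage(root):
    """Split the files under root into (keep, review) lists of paths."""
    keep, review = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            (keep if has_current_value(path) else review).append(path)
    return keep, review


def common_terms(paths, top_n=10):
    """Count the most frequent words across files when no keywords are predefined."""
    counts = Counter()
    for path in paths:
        words = re.findall(r"[a-z]{3,}", read_text(path))
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)
```

Calling triage on a share path gives you one list of files that pass the value rules and one list left over for archiving or a second pass. Feeding that leftover list to common_terms is a crude stand-in for text mining: plain word counts will not find concepts the way a real tool does, but they can hint at what an unknown folder is about.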


Another option is to use text analytics with a well-maintained and up-to-date corporate taxonomy as the basis for metadata application. Rather than manually tagging all the content, a text analytics tool can auto-categorize content using values from the taxonomy and store this metadata in a search index. While this method doesn’t cleanse content, it does allow access through the most important and current terminology used by the business, narrowing the search parameters for end users and surfacing only that content which best matches descriptors important to the business.
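As a rough illustration of the idea above, the sketch below tags documents with categories from a small taxonomy and stores the result in a simple lookup structure. Everything in it is an assumption made for the example: the taxonomy entries, the document names, and the functions categorize and build_index are hypothetical, the matching is plain whole-word lookup rather than the linguistic rules a commercial tool would use, and a Python dictionary stands in for a real search index.

```python
import re
from collections import defaultdict

# Hypothetical taxonomy: category -> terms that signal it.
TAXONOMY = {
    "Human Resources": ["onboarding", "benefits", "payroll"],
    "Finance": ["invoice", "budget", "payroll"],
    "Legal": ["contract", "non-disclosure agreement"],
}


def categorize(text, taxonomy=TAXONOMY):
    """Return the sorted categories whose terms occur as whole words in text."""
    lowered = text.lower()
    matched = []
    for category, terms in taxonomy.items():
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms):
            matched.append(category)
    return sorted(matched)


def build_index(documents, taxonomy=TAXONOMY):
    """Map each category to the sorted ids of the documents tagged with it."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        for category in categorize(text, taxonomy):
            index[category].append(doc_id)
    return dict(index)


if __name__ == "__main__":
    docs = {
        "memo.txt": "Payroll changes for new benefits",
        "deal.txt": "Signed contract and invoice",
        "lunch.txt": "Pizza on Friday",
    }
    for category, doc_ids in build_index(docs).items():
        print(category, doc_ids)
```

Notice that the lunch memo matches nothing and so never appears under any category, which is the narrowing effect described above. Whole-word matching is deliberately strict here, so "invoices" would not match "invoice"; a real tool would handle plurals and other variants.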


Knowing you have a junk drawer, and knowing what is in it, is an important first step in handling large and ungoverned information repositories.