Unstructured Data
Unstructured data is a generic term for any information that is not stored in a database. Unstructured data can be textual or non-textual. Textual unstructured data is generated in media like email, PowerPoint, Word, instant messages, and social media. Non-textual unstructured data is generated in media like images, audio files, and videos.
If left unmanaged, unstructured data can become a costly liability: the information cannot be located when it is needed, or versions drift apart, either between a file and the database or just between multiple copies of the same file.
As our users become more tech literate, they expect to be able to use their skills to work with information in the tools they know. This introduces the version control issues I just mentioned, and it pits user expectations against IT governance controls and policies.
Let's face it, the user doesn't know, and doesn't care, that you have to report reliable and audited data to someone else.
These are some of the important issues in trying to get a handle on unstructured data. But deciding how to work with it, manage it, and parse the important data points from it is just as important.
Management of Unstructured Data
When working with unstructured data, and the files that contain the information, the natural reaction for a system administrator is to create an index. But the "index" is just one part of working with unstructured data.
The real trick is figuring out what is important about the unstructured data, and how, when, and why we have the unstructured data. This will help you decide how to index and what to do with "data fluff."
We know from our existing databases that indexes will evolve over time as users want to interact with the data differently. Traditionally, we use these indexes to speed up access to the data.
In this case, we are not talking about indexes in the sense of speeding up data retrieval, but indexing as in categorizing, exploring, and relating the unstructured data to existing structured data. Without this relationship, the unstructured data is just "data fluff."
What is "data fluff"? Data fluff is data that is stored because the designer or developer thought it would be important, but that has no real value or use to the business process. Every database has "data fluff." Sometimes it is just transitory data to indicate flags needed later or during a record correction, or audit information like date and time created, user who created it, and so forth.
Many log files can be considered "data fluff" because they are used only for debugging and not actually used to help define, isolate, or refine important business processes.
Don't get me wrong, "data fluff" can be important and useful information, but if it is not being used, it is just "fluff."
Overlooked Unstructured Data
The most overlooked unstructured data we have in the enterprise are emails, documents, faxes, images, and videos. In the good old days, all we really worried about was scanned images or "cold" stored printouts.
Now scanned images are only part of the problem we have to address. These days, emails are the biggest unstructured data source that people are ignoring. Chat and social media come after that, but they are all basically the same type of unstructured data as email.
If someone wants to manage their own email, they build a structure of folders that fits that person's need at the time. Fast forward a few years, and the structure doesn't handle the needs of that user, and the information they have "indexed" is unavailable to others who may need it.
The classic example of trying to relate this unstructured data to a structured data source is a CRM. A true CRM is designed to present anyone with relevant information about a customer. The keyword here is "anyone," not just the person directly involved with that customer.
What happens is the relevant information for that customer is in a salesperson's email or voicemail, but the email and voicemail systems are disconnected from our enterprise data. If someone in production needs access to information that was emailed to a salesperson, it is not available. Hence the development of the CRM, or the continued search for an adequate CRM that handles all aspects of the enterprise.
Another overlooked data source is our Windows file systems. Everyone is using Excel and Word these days for something: printing labels, creating warranty documents, and mining statistical data. Some of these files are duplicates or versions of the structured database data. Some are not, but are relevant to some aspect of the company. Even if a file matters only to the person who created it, they created it for a reason.
What to Do with Unstructured Data
We all can see the value of indexing emails and documents sent to customers and having them available from our CRMs. This is the indexing aspect of working with unstructured data: creating a relationship between the unstructured data source and the database's structured data.
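To make that relationship concrete, here is a minimal sketch in Python. It assumes the only link between a message and a customer is the sender's address; the customer IDs, the addresses, and the `index_email` helper are all made up for illustration and do not come from any particular CRM.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical structured data: customer ID keyed by a known contact address.
CUSTOMERS_BY_EMAIL = {
    "pat@example.com": "CUST1001",
    "lee@example.org": "CUST1002",
}

def index_email(raw_message, customers_by_email):
    """Return an index record relating one raw email to a customer, or None."""
    msg = message_from_string(raw_message)
    # parseaddr splits 'Pat Smith <pat@example.com>' into (name, address).
    _, sender = parseaddr(msg.get("From", ""))
    customer_id = customers_by_email.get(sender.lower())
    if customer_id is None:
        return None  # sender is not a known customer contact
    return {
        "customer_id": customer_id,
        "subject": msg.get("Subject", ""),
        "date": msg.get("Date", ""),
    }

# Sample message, invented for this sketch.
RAW = (
    "From: Pat Smith <Pat@Example.com>\n"
    "Subject: Warranty question\n"
    "Date: Mon, 03 Mar 2014 10:15:00 -0700\n"
    "\n"
    "Can you confirm the warranty terms on my last order?\n"
)

record = index_email(RAW, CUSTOMERS_BY_EMAIL)
print(record)
```

The record holds only the link and a few header fields. The message itself stays where it is, which is the point of indexing rather than copying.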
Now what do you do with it? Our production systems all have enterprise alerts for various things, like low quantities, out-of-date prices, and upcoming order statistics.
Let's look at some alerts and data you may be able to generate using email and instant message history. You can use this information with a predictive modeling API (Google has one) that would tell you if the mood, or sentiment, of the message is Good, Bad, or Ugly.
Wouldn't it be nice to know if the communications between your people and the customer are positive, negative, or indifferent? Or if the message that the user just received showed a critical issue or a negative tone that needs to be addressed?
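A hosted predictive modeling service would do the real scoring, but the shape of the alert can be sketched with a toy stand-in. The word lists and function names below are hypothetical, and counting words is far cruder than a trained model; treat it only as an illustration of where the score plugs into an alert.

```python
import re

# Hypothetical word lists; a real scoring service would replace these.
POSITIVE = {"thanks", "great", "happy", "pleased", "excellent"}
NEGATIVE = {"late", "broken", "angry", "cancel", "refund", "unacceptable"}

def sentiment(text):
    """Classify text as 'positive', 'negative', or 'indifferent' by word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "indifferent"

def needs_alert(text):
    """Flag messages negative enough to raise an enterprise alert."""
    return sentiment(text) == "negative"

for message in (
    "Thanks, the delivery was great",
    "The order is late and the unit is broken",
    "Please send the invoice",
):
    print(sentiment(message), needs_alert(message), message)
```

Swapping `sentiment` for a call to an external service would leave `needs_alert`, and the alert it feeds, unchanged.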
Another thing you can do with unstructured data is to map trends from your web logs — which customers visit, which ones don't.
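As a rough sketch of the "who visits, who doesn't" question, assume a simplified log where each line is a timestamp, a customer ID (or "-" for an anonymous visitor), and a path. That layout, the IDs, and the `visits_by_customer` helper are assumptions for this example; a real web server log would need its own parsing.

```python
from collections import Counter

# Hypothetical customer master list from the structured database.
KNOWN_CUSTOMERS = {"CUST1001", "CUST1002", "CUST1003"}

# Hypothetical log lines: timestamp, customer ID or "-", requested path.
LOG_LINES = [
    "2014-03-03T10:15:00 CUST1001 /products/widgets",
    "2014-03-03T10:17:42 CUST1001 /cart",
    "2014-03-03T14:02:10 CUST1003 /support",
    "2014-03-03T14:05:00 - /home",
]

def visits_by_customer(lines):
    """Count log entries per customer ID, skipping anonymous or malformed lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not in the assumed three-field layout
        customer_id = parts[1]
        if customer_id != "-":
            counts[customer_id] += 1
    return counts

visits = visits_by_customer(LOG_LINES)
# Customers in the database who never appear in the log.
absent = sorted(KNOWN_CUSTOMERS - set(visits))
print(dict(visits))
print(absent)
```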
Unstructured MultiValue Data
Let's back up a few steps and look at our existing database and the data we have collected.
While we generally assume our MultiValue databases are structured data stores, we likely have tons of unstructured data in them as well: for example, product descriptions, notes, and text blobs associated with customers and vendors. Much of this data can provide valuable information over and above the structured data it is associated with.
Also add into the design mix the fact that MultiValue databases are not inherently structured data sources.
Yes, we do create dictionaries to impose a structure, but we are not required to follow that structure when storing information. This feature gives the old referential integrity proponents fits.
Now add into the mix the correlatives, I-types, and virtual fields that allow us to create relationships to data in other locations, manipulate databases in the same location, or even parse, extract, and transform unstructured data blobs into structured contents. You now have a database with a slight identity crisis.
Parsing, Indexing, and Imposing Structure to Non-MultiValue Data
The type of unstructured data you are working with will dictate the structure you need, or want, to impose on it.
Let's say we have a web log, or analytic file, and we want to see the relationship between sales, time of day, and how many phone calls we have received from a customer. OK, I realize that this data is not exactly unstructured since it is a formatted log, but that is where you and the rest of the world differ. Parsing, extracting, and transforming information is the nature of our business, so this doesn't seem out of the ordinary.
To other people, this data is totally unstructured and takes time to manipulate. Or they just do it by hand.
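For those of us who parse for a living, relating a formatted log to sales by time of day might look like the following sketch. The log layout, the sample figures, and the helper names are invented for illustration; a phone call log could be bucketed by hour in the same way.

```python
from collections import Counter
from datetime import datetime

# Hypothetical web log lines: ISO timestamp, then the requested path.
WEB_HITS = [
    "2014-03-03T10:15:00 /products/widgets",
    "2014-03-03T10:17:42 /cart",
    "2014-03-03T14:02:10 /support",
]

# Hypothetical sales pulled from the structured database: (timestamp, amount).
SALES = [
    ("2014-03-03T10:40:00", 125.00),
    ("2014-03-03T14:30:00", 40.00),
    ("2014-03-03T16:05:00", 60.00),
]

def hour_of(timestamp):
    """Return the hour of day (0-23) from an ISO-format timestamp string."""
    return datetime.fromisoformat(timestamp).hour

def hourly_summary(hit_lines, sales):
    """Relate web hits to sales totals, keyed by hour of day."""
    hits = Counter(hour_of(line.split()[0]) for line in hit_lines)
    totals = Counter()
    for timestamp, amount in sales:
        totals[hour_of(timestamp)] += amount
    hours = sorted(set(hits) | set(totals))
    return {h: {"hits": hits[h], "sales": totals[h]} for h in hours}

summary = hourly_summary(WEB_HITS, SALES)
for hour, row in summary.items():
    print(hour, row)
```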
Let's look at a totally different type of unstructured data: emails, chat logs, CRM notes, and social media posts. All this data is in textual blobs. Extracting meaning, issues, or just searching for a specific conversation is important to your business.
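Before worrying about meaning, a first pass can simply pull recognizable data points out of the blob. The sketch below assumes order numbers look like "SO-" followed by digits and phone numbers look like 303-555-0142; both patterns, the sample note, and the `extract_fields` helper are made up for this example and would have to match your own conventions.

```python
import re

# Hypothetical CRM note, invented for this sketch.
NOTE = "Called 3/3 re: order SO-10482. Wants callback at 303-555-0142 before Friday."

# Assumed formats: 'SO-' plus digits for orders, NNN-NNN-NNNN for phones.
ORDER_RE = re.compile(r"\bSO-\d+\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def extract_fields(note):
    """Pull order numbers and phone numbers out of a free-text note."""
    return {
        "orders": ORDER_RE.findall(note),
        "phones": PHONE_RE.findall(note),
    }

fields = extract_fields(NOTE)
print(fields)
```

The extracted order numbers are what tie the note back to records in the structured database.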
While many documents, like Word, Excel, or video files, look unstructured, many really are not. Many of the files you want to create relationships with have metadata storage features in them. Yes, even images and video files have metadata storage.
What is metadata storage? It is the ability to enter information about the rest of the file, or blob, like author, customer, dates and times, and other structured information about the document. This provides links and relationships that you can access or populate from your structured database.
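As one example, newer Word and Excel files (.docx, .xlsx) are ZIP packages that carry their core properties in a small XML part named docProps/core.xml. The sketch below reads three of those properties with nothing but Python's standard library. The `read_core_properties` helper and the sample values are hypothetical, and the demo builds a tiny stand-in package in memory rather than opening a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

def read_core_properties(package_file):
    """Return author, title, and created date from a .docx/.xlsx style package."""
    with zipfile.ZipFile(package_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))

    def text(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "author": text("dc:creator"),
        "title": text("dc:title"),
        "created": text("dcterms:created"),
    }

# Demo only: a minimal in-memory package with made-up property values.
CORE_XML = (
    "<cp:coreProperties "
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/">'
    "<dc:title>Warranty for CUST1001</dc:title>"
    "<dc:creator>Lee Jones</dc:creator>"
    "<dcterms:created>2014-03-03T10:15:00Z</dcterms:created>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("docProps/core.xml", CORE_XML)
buffer.seek(0)

props = read_core_properties(buffer)
print(props)
```

Putting a customer ID in the title or another property, as the sample does, is one way to let the file itself carry the link back to the database.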
Summary
This article should have given you a starting point for thinking about how to work with unstructured data. Keep watch on the International Spectrum website and other magazine articles that talk about the "nuts and bolts" of accessing, parsing, and indexing various unstructured data.