What Does Document Data Management Have to Do with Big Data?

Document Data Management is the discipline that consists of the processes, tools, and techniques used to define, model, discover, extract, integrate, standardize, normalize, report, and govern the data embedded within documents.  It should not be confused with Document Management, which is mainly concerned with managing the actual documents within an organization.  Document Management is more interested in the original document and its preservation, while still being able to locate or classify it as needed.

To that end, Document Management uses metadata to describe documents through a few simple document attributes that help users classify and find them easily.  This approach is similar to how books are cataloged in a library.  Document Data Management, however, is more interested in the content or data within the documents than in their classification or archiving.

It is often said that 70 to 80 percent of the data inside an organization is unstructured.  That means this type of data is not usually stored in tables or even spreadsheets and cannot easily be abstracted into attributes and fields.  Unstructured data is deeply embedded in the text of many document types such as invoices, purchase orders, sales contracts, maintenance narratives, and many more.  Sometimes these documents are stored within databases as long unstructured text fields such as notes, comments, support case narratives, doctors’ notes, and others.

With the emergence of Big Data and the availability of tools and techniques to store, search, and mine unstructured data, there has been a great increase in demand for discovering and extracting this type of data from documents.  Companies are now trying to extract valuable data from huge volumes of call center cases to understand customer sentiment, identify product defects, detect fraud, and gain many other powerful insights.

However, so far all such efforts have been quite organic and limited to data mining rather than data management.  The main reason for this limitation is that there have been no well-defined disciplines or methodologies that describe how unstructured data within documents should be managed.

Document Data Management as a discipline tries to address this void by providing the methods, techniques, processes, and, in short, the science of managing document data.  I began working with document data in the late 1990s, when I was building my first form processing software, called FormBase, which extracted key data points from printed forms and stored them in a database.  It could also identify the type of a form among many different types and archive it in the correct location inside a document management system.

Today we are dealing with large volumes of fully unstructured data such as real estate county records, medical records, oil and gas maintenance records, and other exciting but challenging unstructured data to manage.  But this time around, we are trying to build the discipline, methods, processes, and tools that allow us to define, model, extract, integrate, report, and govern document data.  The goal is to turn unstructured document data into a structured form that can be fully integrated with the rest of the enterprise data and then used in operations and business intelligence.  Imagine enhancing the information breadth, depth, and insight of an organization by adding data that was until now untapped.
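
As a small illustration of what turning document text into a structured record can look like, here is a minimal Python sketch.  The sample maintenance note, the field names, and the regular expressions are all invented for this example and are not taken from any real system; real documents call for far more robust extraction than a handful of patterns.

```python
import re

# Hypothetical maintenance note, invented for illustration only.
NOTE = (
    "Work order WO-10482 completed on 2013-06-14. "
    "Technician replaced the seal on pump P-201 after a leak was reported."
)

# Assumed patterns for three illustrative fields (not a real schema).
PATTERNS = {
    "work_order": re.compile(r"\bWO-\d+\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "equipment": re.compile(r"\bpump\s+([A-Z]-\d+)\b"),
}


def extract_fields(text):
    """Return a dict mapping each field name to its first match, or None."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            record[name] = None
        else:
            # Use the capture group when the pattern defines one.
            record[name] = match.group(1) if pattern.groups else match.group(0)
    return record


record = extract_fields(NOTE)
print(record)
```

Once the values sit in a record like this, they can be loaded into a table and joined with the rest of the enterprise data, which is the kind of integration described above.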

In my next blog post I will write more about what each aspect of document data management entails and how organizations can incorporate this discipline into their overall data management strategy.

Big Data: New Hype or New Reality?

The IT industry is not immune to the “New Shining Object Syndrome” that affects other industries as well as consumers.  Every once in a while we get a new technology or idea that grabs the attention of IT professionals and CIOs alike.

There have been many examples of these new ideas, some of which have stood the test of time and some of which have not.  There was once huge excitement around Object Oriented Programming (OOP), Service Oriented Architecture (SOA), Business Intelligence (BI), Application Service Provider (ASP), Master Data Management (MDM), and now Big Data!  Some of these ideas, like OOP, have been absorbed into the way we write our software to the point that we no longer call it that.  ASP (not the same as Active Server Pages) has evolved into SaaS (Software as a Service), and hardly anyone remembers ASP anymore.

Big Data, however, has created a huge buzz lately.  The excitement is contagious and seems to have gotten the attention of the media much more than the previous trends did.  The reason may be that, unlike those other technical breakthroughs in the IT industry, this one has a strong business and even consumer impact.  Though many do not clearly understand what Big Data means, they by now have a vague notion that it somehow involves or affects them.

News ranging from the NSA (National Security Agency) tapping into the metadata of domestic and international phone communications to Google, Amazon, and other companies profiling consumers’ every click on the web has awakened both fear and excitement in ordinary consumers.  For this reason alone, it is no longer easy to dismiss Big Data as mere hype or a “new shining object”; it is perhaps a new reality that we are all forced to reckon with.

Big Data and its potential are analogous to the internet in its early days, when only a few understood its immediate potential and perhaps none could forecast its future potential.  Like the internet in its early days, Big Data today is accessible only to a few, and, like the internet, it will not fulfill its full potential unless it becomes readily available to the common man.

Therefore, if I were a betting man, I would bet on Big Data.  But one should be wary of the hype: as with the internet, Big Data could claim many dot-com-style victims who are blinded by the “new shining object” without understanding the risks and potential of what makes it shine!

