What Does Document Data Management Have to Do with Big Data?

Document Data Management is the discipline made up of the processes, tools, and techniques used to define, model, discover, extract, integrate, standardize, normalize, report, and govern the data embedded within documents.  It should not be confused with Document Management, which is concerned mainly with managing the actual documents within an organization.  Document management is more interested in the original document and its preservation, while being able to locate or classify it as needed.

To that end, document management uses metadata to describe documents through a few simple attributes that help users classify and find them easily.  This approach is similar to how books are cataloged in a library.  Document Data Management, by contrast, is more interested in the content or data within the documents than in their classification or archiving.

It is often said that 70 to 80 percent of the data inside an organization is unstructured.  That means this type of data is not usually stored in tables or even spreadsheets and is not readily abstracted into attributes and fields.  Unstructured data is deeply embedded in the text of many document types such as invoices, purchase orders, sales contracts, maintenance narratives, and many more.  Sometimes these documents are actually stored within databases as long unstructured text fields such as notes, comments, support case narratives, doctors’ notes, and others.

With the emergence of big data and the availability of tools and techniques to store, search, and mine unstructured data, there has been a great increase in demand for discovering and extracting this type of data from documents.  Companies are now trying to extract valuable data from huge volumes of call center cases to understand customer sentiment, identify product defects, detect fraud, and gain many more powerful insights.

However, so far all such efforts have been quite organic and limited to data mining rather than data management.  The main reason for this limitation is that there have been no well-defined disciplines or methodologies that describe how unstructured data within documents should be managed.

Document Data Management as a discipline tries to address this void by providing the methods, techniques, processes, and, in short, the science of managing document data.  I started working with document data in the late 1990s, when I was building my first form processing software, called FormBase, which extracted key data points from printed forms and stored them in a database.  It could also identify the type of a form among many different types and archive it in the correct location inside a document management system.

Today we are dealing with large volumes of fully unstructured data such as real estate county records, medical records, oil and gas maintenance records, and other exciting but challenging kinds of unstructured data.  But this time around, we are trying to build the discipline, methods, processes, and tools that allow us to define, model, extract, integrate, report, and govern document data.  The goal is to turn unstructured document data into a structured form that can be fully integrated with the rest of the enterprise data and then used in operations and business intelligence.  Imagine enhancing the information breadth, depth, and insight of an organization by adding data that was until now untapped.
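
To make the idea of turning unstructured text into a structured form a little more concrete, here is a minimal Python sketch.  Everything in it is an assumption made for illustration: the sample maintenance note, the MaintenanceRecord fields, and the regular expressions are invented, and real document data would need far more robust extraction than a few patterns.  It only shows the general shape of the step, where free text goes in and a record with named fields comes out.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaintenanceRecord:
    # Hypothetical target schema, invented for this example.
    asset_id: Optional[str]
    service_date: Optional[str]
    cost_usd: Optional[float]


# Hypothetical patterns for one made-up note format.
ASSET_RE = re.compile(r"\b(?:pump|valve|compressor)\s+([A-Z]{1,3}-\d+)", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
COST_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{2})?)")


def extract_record(note: str) -> MaintenanceRecord:
    """Pull a few data points out of a free-text note; missing ones stay None."""
    asset = ASSET_RE.search(note)
    date = DATE_RE.search(note)
    cost = COST_RE.search(note)
    return MaintenanceRecord(
        asset_id=asset.group(1).upper() if asset else None,
        service_date=date.group(1) if date else None,
        cost_usd=float(cost.group(1).replace(",", "")) if cost else None,
    )


# Made-up sample text, not taken from any real record.
note = (
    "On 2015-03-14 the crew replaced the seal on pump P-101. "
    "Parts and labor came to $1,250.00."
)
record = extract_record(note)
print(record)
```

Once the data points sit in named fields like these, they can be loaded into a table and joined with the rest of the enterprise data, which is the integration step described above.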

In my next blog post I will write more about what each aspect of document data management entails and how organizations can incorporate this discipline into their overall data management strategy.