"Data" is a critical piece of PCIC's day-to-day operation and is the foundation for all our analysis work. In this article, we take a look at one way to classify it – by the degree of organization of the data.
Unstructured data is all around us. It is data that does not follow a predefined structure, and it includes everything from emails, text documents and PDF files to notes and blog posts like this one! It is estimated that about 80% of all data in an organization is unstructured. It is extremely easy to create, but challenging to process and analyze in an automated manner.
Structured data, on the other hand, is stored in clearly defined tables, with a particular data type assigned to each column in the table – for example a patient's name (text), their age (numeric) and their date of birth (date). Semi-structured data is less strict and follows a more "self-describing" structure; examples include XML, JSON and HTML data files. These formats do not fit into a strict table structure, but they still use markers to separate and group data elements and, to a certain extent, to maintain the hierarchical relationships between records.
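To make the distinction concrete, here is a minimal Python sketch. The patient, the field names and the `PatientRow` class are all made up for illustration and do not describe any PCIC system. The first half models a structured table row with a fixed type per column; the second half parses a semi-structured JSON document whose keys describe their own contents and whose nesting does not fit a single flat table.

```python
import json
from dataclasses import dataclass
from datetime import date


# Structured: every record has the same columns, each with a fixed type,
# mirroring the name (text), age (numeric), date of birth (date) example.
@dataclass
class PatientRow:  # hypothetical table row, for illustration only
    name: str
    age: int
    date_of_birth: date


row = PatientRow(name="Jane Doe", age=42, date_of_birth=date(1983, 5, 14))

# Semi-structured: the document is self-describing. Keys act as markers,
# and the nested "contacts" list keeps a hierarchy that a single flat
# table row could not hold without extra tables or columns.
document = json.loads("""
{
  "name": "Jane Doe",
  "age": 42,
  "contacts": [
    {"type": "email", "value": "jane@example.org"},
    {"type": "phone", "value": "555-0100"}
  ],
  "notes": "Prefers morning appointments."
}
""")

# The markers still let a program pull out grouped elements reliably.
contact_types = [contact["type"] for contact in document["contacts"]]
```

The free-text `notes` value inside the JSON document is itself unstructured: a program can locate it easily through its key, but understanding what it says takes more than a lookup.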
Computers and computer programs like structure and rules, and have in the past had a hard time analyzing unstructured data. This, however, has changed in today's world of Big Data Analytics. Revolutionary technologies like NoSQL databases, including document databases, have made it possible to mine these unstructured data sources, bridging the gap that existed in the analysis of unstructured data and, in turn, producing better and more actionable intelligence. What we are seeing today is technology that blurs the boundaries between structured and unstructured data, creating platforms that allow users to easily search, stratify and analyze a mixed bag of content and information.
So, how do blurry boundaries in the data realm affect our work at PCIC? When we started building our technology infrastructure, we focused heavily on the structured side of the data spectrum. Our Master Client Index (which connects patients across multiple datasets from our different partner organizations), our Unified Care Continuum Portal (which hosts our patient care plans) and our Electronic Medical Record (EMR) system are all structured and linked databases. But, like most organizations, we too have our "80%" of unstructured data and information sitting in our emails, files in our SharePoint environment, patient care plans in Word documents, and meeting notes in Yammer.
We have started to shift our focus to connecting these pieces and analyzing them as a whole – using platforms like Microsoft Delve from Office 365 to search across these buckets of data, so that we do not miss key pieces of information when making decisions. On the application development side, we are integrating NoSQL technologies into our analysis stack and adding features to our EMR that support communication between our care coordinators and patients through emails, text messages and automated phone calls. We are also adding the ability to attach documents, images, video and audio to a patient record, all of which can be analyzed and included in developing the plan of care for our patients.
At PCIC, we continue to improve our analytics capabilities by expanding our structured datasets as well as bridging the gap between structured and unstructured data sources. We constantly push ourselves to ask new questions and understand the data better. For example: "Can we find insight in dark data?", a topic we hope to explore in a future blog post. Until then, structured or unstructured – we'll be analyzing it!