Governance [and metadata, privacy, security]
The goal is to tease apart, learn about, and explore the connections between the following (data-related items): governance, curation, stewardship, MDM, provenance, metadata, security, privacy.
As you know, the purpose of capturing and storing data is to process and benefit from it - this involves the use of statistics, data mining and machine learning.
But, that is not all there is to it! What about policies, procedures, rules, guidelines, practices... regarding the collection, storage and use of data?
In other words, there exists a whole host of "secondary" (NOT really!), 'meta', and COMPLEMENTARY aspects.
Regardless of collection procedures, analysis and usage, the ONE prime characteristic of data is QUALITY ('GIGO').
'Data curation' refers to the set of processes and technologies ("methods and tools") that are focused on maintaining high-quality data in an organization, for the purposes of:
Any (which means EVERY!) data-driven organization needs a 'data curation infrastructure' that supports curation practices and software.
In the context of Big Data, data curation can be seen to sit right in the middle of a 'value chain':
Below are important points to keep in mind, while undertaking a data curation effort:
Here is a description, by Cragin et al. [Cragin, M., Heidorn, P., Palmer, C. L.; Smith, Linda C., An Educational Program on Data Curation, ALA Science & Technology Section Conference, (2007)]: "Data curation is the active and on-going management of data through its lifecycle of interest and usefulness; ... curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time".
Curators are tasked with enabling data reuse. But, reuse is not a matter of simply bolting together existing data in new configurations. Data 'messiness' (quality issue) and heterogeneity possibly need to be taken into account - a lot like 'ETL' for data warehousing.
So, to enable reuse, it is good to think of data as 'raw material' that needs to be repurposed, rather than 'finished goods' that need to be repackaged. Note that reuse can be explicitly enabled both during data generation and during data consumption.
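To make the 'raw material' idea concrete, here is a minimal sketch of an ETL-like curation step in Python. The two sources, their field names, and the date formats are all hypothetical, chosen only to illustrate heterogeneity and messiness; a real curation pipeline would need far richer rules.

```python
from datetime import datetime

# Hypothetical raw records: two sources, different field names and date formats.
RAW_RECORDS = [
    {"source": "crm", "Name": "  Ada Lovelace ", "signup": "2021-03-05"},
    {"source": "web", "full_name": "ada lovelace", "signup_date": "03/05/2021"},
    {"source": "web", "full_name": "", "signup_date": "not a date"},
]

# Assumed set of date formats seen across sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return an ISO date string, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def curate(record):
    """Map a raw record onto one shared schema; return None if it fails quality checks."""
    name = (record.get("Name") or record.get("full_name") or "").strip().title()
    date = parse_date(record.get("signup") or record.get("signup_date") or "")
    if not name or date is None:
        return None
    return {"name": name, "signup_date": date}


def curate_all(records):
    """Split records into curated (reusable) ones and rejected ones kept for review."""
    clean, rejected = [], []
    for record in records:
        curated = curate(record)
        if curated is None:
            rejected.append(record)
        else:
            clean.append(curated)
    return clean, rejected


clean, rejected = curate_all(RAW_RECORDS)
```

Note that the rejected records are kept rather than silently dropped, so that a curator can inspect them later.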
A variety of individuals and groups can play a part in curating:
From Wikipedia: 'Data governance is a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data'.
In other words, governance == curation?
NOT REALLY: As per the DAMA International Data Management Book of Knowledge, "Data Governance (DG) is defined as the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets." In other words, Governance is about POLICIES, which can be seen as complementary to Curation.
So an organization would have Governance policies in place, which would help Curation produce customized business data.
DG, summarized:
Data stewardship is the management and oversight of an organization's data assets to help provide business users with high-quality data that is easily accessible in a consistent manner. [from whatis.com]
A data steward is a role within an organization responsible for utilizing an organization's data governance processes to ensure fitness of data elements - both the content and metadata. [from Wikipedia]
A data stewardship role is essentially that of a curator, possibly a more managerial one.
Provenance ~= "lineage".
'Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins.'
Provenance has to do with origins, while lineage has to do with tracing data's 'journey' to the current point of usage.
Provenance/lineage is a form of metadata that needs to be added to data, during curation.
Provenance helps establish TRUST (or lack thereof) in data. With scientific data for example, this is crucial [incorrect/invalid data would lead to the acceptance of incorrect hypotheses!].
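As a rough illustration of provenance-as-metadata, the Python sketch below wraps a piece of data with a record of where it came from, which process produced it, and which inputs it was derived from. The field names ("source", "process", "inputs") and the helper functions are hypothetical, not taken from any provenance standard; the hash is just one simple way to detect later changes to the data.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data):
    """Stable SHA-256 hash of JSON-serializable data."""
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def with_provenance(data, source, process, parents=()):
    """Wrap data with a provenance record (hypothetical field names)."""
    return {
        "data": data,
        "provenance": {
            "source": source,    # where the data originated
            "process": process,  # what produced this version
            # Lineage: hashes of the wrapped inputs this data was derived from.
            "inputs": [p["provenance"]["sha256"] for p in parents],
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "sha256": fingerprint(data),
        },
    }


def is_unmodified(wrapped):
    """True if the data still matches the hash recorded in its provenance."""
    return fingerprint(wrapped["data"]) == wrapped["provenance"]["sha256"]


raw = with_provenance([3, 1, 2], source="sensor-A", process="ingest")
derived = with_provenance(sorted(raw["data"]), source="pipeline",
                          process="sort", parents=[raw])
```

Following the "inputs" hashes backwards from `derived` to `raw` traces the data's 'journey', and `is_unmodified` gives a basic reason to trust (or distrust) a dataset.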
Here is an interesting list of provenance-related issues related to scientific research (just fyi).
MDM is the management of "master data" (similar to a master key): In business, master data management (MDM) is a method used to define and manage the critical data of an organization to provide, with data integration, a single point of reference. [Wikipedia]
The idea is to maintain a single (meaningful, accurate, complete, timely) reference for data that is shared - this is done in order to maintain consistency. The alternative (to replicate such data for each request) would be highly problematic - would lead to inconsistency, errors, wasted disk space, increased network traffic, etc.
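The single-point-of-reference idea can be sketched in a few lines of Python. The `MasterRegistry` class below is a hypothetical toy stand-in for an MDM hub, not a real product's API: other datasets store only an ID and look up shared attributes in the master record, rather than each keeping its own copy.

```python
class MasterRegistry:
    """Holds one authoritative record per entity ID (a toy stand-in for an MDM hub)."""

    def __init__(self):
        self._records = {}

    def upsert(self, entity_id, **fields):
        """Create the master record, or update some of its fields."""
        self._records.setdefault(entity_id, {}).update(fields)

    def get(self, entity_id):
        """Return a copy, so callers cannot mutate the master record directly."""
        return dict(self._records[entity_id])


master = MasterRegistry()
master.upsert("cust-001", name="Ada Lovelace", email="ada@example.com")

# Hypothetical downstream datasets: they reference the customer by ID only.
orders = [{"order_id": 1, "customer_id": "cust-001"}]
tickets = [{"ticket_id": 7, "customer_id": "cust-001"}]


def customer_email(record):
    """Resolve the shared attribute through the master record."""
    return master.get(record["customer_id"])["email"]


# One update in one place; every consumer sees the same new value.
master.upsert("cust-001", email="ada@newmail.example")
```

Had `orders` and `tickets` each stored their own copy of the email address, the update would have had to be repeated in both, which is exactly where inconsistency creeps in.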
Here is more on MDM.
Security breaches are almost 'normal' - Yahoo, Home Depot, Facebook, Uber, Equifax... what is going on?
Governance is not being followed properly - policies for handling data and accountability, culture of/training in handling data, (pro)actively managing (sensitive) data - these are missing.
Privacy and security breaches are very costly, literally - lost revenue in the form of customer attrition, fines (eg. levied by SEC), lawsuit awards... These losses are monumental, compared to investing in technologies and policies that guard against breaches!
Note that maintaining security and privacy both involve minimizing RISK - something that every business ought to be concerned about.
Good DG not only leads to data reuse, interoperability etc, it also results in enhanced privacy and security [graphic from associationanalytics.com]:
Data security/protection has to do with (preventing) UNAUTHORIZED access to data. Data privacy on the other hand, has to do with (limiting) AUTHORIZED access to data - related, but not identical!
Protection/security is a technical issue, related to protecting servers, encrypting data, restricting access (eg via passwords or biometrics), etc; privacy compliance on the other hand is a legal issue.
Also: data needs to be protectable first, before privacy can be ensured!
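The distinction can be sketched in Python as two separate checks. Everything here (the API keys, roles, visibility policy, and patient record) is hypothetical and deliberately simplistic; real systems would use proper credential storage and a policy derived from legal requirements.

```python
# Hypothetical credentials, roles, and field-visibility policy.
API_KEYS = {"key-nurse": "nurse", "key-billing": "billing"}
VISIBLE_FIELDS = {
    "nurse": {"name", "diagnosis"},
    "billing": {"name", "balance"},
}
PATIENT = {"name": "Ada Lovelace", "diagnosis": "flu", "balance": 120.0}


def authenticate(api_key):
    """Security: reject callers that hold no valid credential at all."""
    role = API_KEYS.get(api_key)
    if role is None:
        raise PermissionError("unauthorized")
    return role


def view_record(api_key, record):
    """Privacy: even authorized callers see only the fields their role permits."""
    role = authenticate(api_key)  # the security check has to come first
    allowed = VISIBLE_FIELDS[role]
    return {key: value for key, value in record.items() if key in allowed}


nurse_view = view_record("key-nurse", PATIENT)
billing_view = view_record("key-billing", PATIENT)
```

The ordering inside `view_record` mirrors the point above: the privacy filter is only meaningful once unauthorized callers have already been kept out.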
Effective 5/25/18, the EU has a SINGLE, binding set of privacy rules for its member countries and citizens, called GDPR (General Data Protection Regulation), which requires businesses to protect the "personal data and privacy of EU citizens for transactions that occur within the EU."
Here is more on GDPR.
Interestingly (or not so), the US has a maze of regulations when it comes to digital privacy. To be fair, so did Europe, pre-GDPR.
Here in CA, we do have the CCPA (California Consumer Privacy Act).