In today’s search for bigger & bigger Big Data, I fear metadata is getting overlooked.
Put that way, it sounds ironic. Given that many organisations are almost desperate to mine their data assets for advantage, why would anyone intentionally overlook metadata?
But in fact this is nothing new. Thinking back to when I led my first data warehouse project in the 1990s, it was an uphill struggle to make progress on metadata even then.
Those of a certain age may recall lots of enthusiasm for data warehouses & data mining. But the less glamorous aspects, data quality management & metadata, were all too often de-scoped to ‘save money’ or to hit deadlines.
At its most basic this is just human nature of course. We get bored/distracted easily & crave reinforcement or short-term reward to persist with tasks. But metadata just might turn round & ‘bite you on the bum’ if you ignore it too long.
Let me explain, briefly, why I think this matters:
The domain knowledge gap
Data Science & Analytics work relies upon not just robust coding & appropriate use of statistics, but also an understanding of the real world being explored through the proxy of data. Too many projects fail to have any impact in organisations because their interpretation or recommendations were naive or irrelevant (something obvious to those who actually knew what was going on around them). Put simply, metadata is just data about data. Knowing what variables mean really does matter when designing & interpreting analysis.
In fact it can help to get your Data Scientists or Analysts closer to the real data issues as part of their induction. I’ve previously encouraged secondments or induction periods within the data management team. Understanding the data landscape, perennial problems & causes of systemic data quality pitfalls, can greatly improve their later analysis. At the least it opens eyes to possible data sources & people with expertise to ask.
Short-termism always robs effectiveness & often efficiency
Any apparent time-saving (or boredom avoidance) that comes from skipping the work to create/maintain data dictionaries & reference data is normally short-term. Over the longer term, you often see repeated work or further costs incurred through fixes, because the initial analysis lacked a proper understanding of what data items mean. At the most extreme, findings can be directionally wrong & misleading if built on the shaky foundation of misinterpreted data items.
However, I should also point out that metadata is not just an arduous maintenance task; it can be a goldmine of opportunity & time-saving in the medium term. Information about data, easily updated by those who use those data items & themselves discover meaning/problems/gaps/workarounds, can be not just time-saving but feel like life-saving in some cases. Empowering your analysts to share, in a collaborative working ecosystem, the most up-to-date understanding of what each data item means, which data quality issues to avoid & any workarounds or alternative data to use, is very powerful.
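To make that concrete, here is a minimal sketch of what one entry in such a shared, analyst-maintained data dictionary might capture. All of the field names & the example values are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DataDictionaryEntry:
    """One collaboratively maintained record about a single data item."""
    item_name: str                  # column name in the source system (illustrative)
    meaning: str                    # plain-language definition agreed by users
    known_issues: list = field(default_factory=list)  # quality pitfalls to avoid
    workarounds: list = field(default_factory=list)   # fixes or alternative sources
    updated_by: str = ""            # who last improved this entry
    updated_on: date = None         # when it was last improved

# An analyst who discovers a pitfall records it once, for everyone:
entry = DataDictionaryEntry(
    item_name="cust_start_dt",
    meaning="Date the customer relationship began (not the account open date)",
    known_issues=["Defaulted to 1900-01-01 in records loaded before migration"],
    workarounds=["Fall back to the earliest account open date when defaulted"],
    updated_by="analyst_a",
    updated_on=date(2017, 3, 1),
)
```

The point is less the data structure than the workflow: whoever hits a pitfall updates the shared entry, so the next analyst inherits the understanding instead of rediscovering the problem.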
GDPR may force a metadata revival
Although this topic has been buzzing around in my brain for months, I was prompted to post on it by an article I read in Data IQ magazine. Their flamboyant editor, David Reed, rightly explains that one of the implications of the EU’s General Data Protection Regulation (GDPR) will be a need for better metadata. Both for permission evidence & to enable rights like ‘the right to be forgotten’, data owners/controllers/processors will need date & time stamps for data items. They will also need better records of the meaning of data items & how they were obtained, probably with expiry dates as well.
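A hedged sketch of the kind of per-item metadata record this implies: a date & time stamp for the permission, a record of how the data was obtained, an expiry date, & a flag for when erasure has been exercised. The class name, fields & the two-year consent window are my own illustrative assumptions, not anything mandated by the regulation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ConsentMetadata:
    """Illustrative evidence a controller might keep alongside a data item."""
    data_item: str                  # which personal data item this describes
    meaning: str                    # what the item means
    source: str                     # how the data was obtained
    consent_given_at: datetime      # date & time stamp of the permission
    expires_at: Optional[datetime]  # when the permission lapses, if ever
    erased: bool = False            # set when 'the right to be forgotten' is exercised

    def is_usable(self, now: datetime) -> bool:
        """Usable only if not erased & still within the consent window."""
        if self.erased:
            return False
        return self.expires_at is None or now < self.expires_at

record = ConsentMetadata(
    data_item="email_address",
    meaning="Customer contact email",
    source="Web sign-up form, marketing opt-in checkbox",
    consent_given_at=datetime(2017, 5, 1, 12, 0),
    expires_at=datetime(2017, 5, 1, 12, 0) + timedelta(days=730),
)
```

With records like this, answering “may we still use this item, & why?” becomes a lookup rather than an archaeology project.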
So, whether it’s to avoid costly mistakes, help your analysts be more efficient, or to ready your organisation for GDPR, please reconsider your need for metadata. It just might be the biggest improvement to your data that you can make.
What about your experience? Have you seen the benefits of accurate metadata, or do you have war stories caused by lack of metadata? Please do share.