Edvancer's Knowledge Hub

Metadata and Its Importance

Manu Jeevan 21/08/2018

Though the word ‘metadata’ is familiar to most of us, many of us are not very clear about what exactly metadata is, or why it is as important as the data itself. Let us try to figure this out by plunging in headlong! Some common concepts about metadata are:

Metadata is data that provides information about other data.
Because metadata summarizes basic information about data, it makes finding and working with particular instances of data easier.
Metadata can be created manually, which tends to make it more accurate, or automatically, in which case it usually contains more basic information.

This establishes the importance of metadata. As to what metadata is, it can be explained as a shorthand representation of the data being referred to. Metadata can be thought of as references to data. How do you usually go about a Google search? There are endless possibilities for describing things, and you start the search with the metadata you have in mind. It could be a word, place name, phrase, slang, meme, or anything else. While metadata schemas can be complex or simple, they all have a few things in common.

METADATA HAVE BEEN AROUND FOR QUITE SOME TIME

GOOD OLD PROVENANCE: PROTO-METADATA

In the early 90s, networking offered little beyond basic TCP/IP, and people did their jobs without digital aids. Let us take the example of archaeologists of that period, or earlier. The artefacts they discovered had to have a provenance: each artefact's origins, nature, ownership and characteristics were recorded as a guide to its authenticity and quality. What happens to the scientific value of an artefact if it is removed from the excavation site, that is, taken out of context? That depends on how well its provenance was described, and on whether the right keywords and organizational principles were used to categorize, describe, analyze and curate similar objects and artefacts. That is why the plundering of archaeological sites is so damaging. The loss is not just of the artefact, because even if it is recovered, it has lost its provenance and, with it, its meaning! This shows that data about the data is as important as the data itself. Without context, data has little reuse value.

METADATA IS COMPARABLE TO DATA IN VALUE

Going back to archaeology, an object loses its scientific value if it loses its provenance, or metadata. Each and every artefact is methodically tagged and bagged, with a numerical reference on the bag that corresponds to notes in a log.
More often than not, photos and sketches are made of the artefact in situ (in its original place) for future research. While archaeology is not exactly about treasure hunting, Open Data is not just about storytelling. Both can be a lot of fun and excitement. The useful side of both Open Data and archaeology is the amount of reuse that can be extracted from our objects, whether stones and bones or mammoth datasets.

DEFINING METADATA USING MULTIPLE SOURCES

Having deliberated on the basics of “what is metadata”, we shall now study two important definitions as a reference: one from the International Organization for Standardization (ISO), the other from the White House Roundtables in the context of Data Quality and Open Data for Public Private Collaboration. There are a few subtle differences between the two definitions. First, provenance in the White House context is defined as the metadata of a dataset. Second, there is no “timeliness” dimension in the ISO definition of Data Quality. The ISO definition predates the widespread adoption of Open Data; perhaps timeliness will become a part of it in the future. ISO provides a semantic definition of Data Quality, which serves as the metadata requirement. To simplify things, we will combine the definitions of provenance and semantics into what we will call metadata.

CREATING OUR OWN DEFINITION OF WHAT IS METADATA

In their paper, “A Semiotic Framework for Analyzing Data Provenance Research,” Liu and Ram mention that the word ‘provenance’, used in the context of data, has different meanings for different people. In this paper, and in many other works, Liu and Ram go on to define the semantic model of provenance as a seven-piece conceptual model. They conceptualize data provenance as comprising seven simple, interconnected elements: when, where, what, which, how, who and why. These form the elements of several metadata frameworks.
Most metadata schemas ask these seven questions of their data.

THE W7 ONTOLOGICAL MODEL OF METADATA

Thus, if we combine provenance and semantics into metadata, we imply that metadata gives the following information about the data it models or represents:

What
When
Where
Who
How
Which
Why
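
To make the seven questions concrete, here is a minimal Python sketch of a metadata record with one field per question. The class name W7Record, the helper missing_answers and all the example values are assumptions made purely for illustration; they are not taken from Liu and Ram's paper or from any metadata standard.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class W7Record:
    """Illustrative record holding one free-text answer per W7 question."""
    what: str
    when: str
    where: str
    who: str
    how: str
    which: str
    why: str


def missing_answers(record: W7Record) -> List[str]:
    """Return the names of the W7 questions that were left blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Hypothetical values describing an imaginary building-permit dataset.
permits = W7Record(
    what="Building permits issued by a city",
    when="2017-01-01 to 2017-12-31",
    where="Example City",
    who="Example City planning department",
    how="Exported from the permit-tracking system",
    which="",  # deliberately left blank
    why="Published as Open Data for reuse",
)

print(missing_answers(permits))
```

A check like missing_answers is one simple way to spot a dataset whose provenance is incomplete before it is published.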

It may be of interest to note that OpenDataSoft natively uses a subset of DCAT to describe datasets. The metadata available are: title, description, language, theme, keyword, license, publisher and references. It is also possible to activate the full ‘DCAT’ template, which adds the following metadata: created, issued, creator, contributor, accrual periodicity, spatial, temporal, granularity and data quality. A full ‘INSPIRE’ template is also available and can be activated on demand. It is also possible to create a fully customized metadata template.

USE OF METADATA TO BOOST DATA REUSE

Much of the discussion on data quality and data discoverability has focused on metadata and ontologies. Ontologies are descriptions and definitions of relationships. They include some or all of the following:

Classes (general things, types of things)
Instances (individual things)
Relationships among things
Properties of things
Functions, processes, constraints, and rules relating to things.

It is because of ontologies that we can better understand the relationships between things. For example, an “Android phone” is a subclass of the class “cell phone”. Some people refer to an “ontology spectrum” that describes some frameworks as weak and others as strong. The “spectrum” captures the range of opinions as to what an ontology really is.
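
The phone example can be sketched in a few lines of Python. This is only a toy illustration, not a real ontology language: the “electronic device” class, the “my handset” instance and the function names are assumptions added for the example.

```python
# Class hierarchy: each class maps to its parent class (hypothetical entries).
SUBCLASS_OF = {
    "android phone": "cell phone",
    "cell phone": "electronic device",
}

# Instances: each individual thing maps to the class it belongs to.
INSTANCE_OF = {
    "my handset": "android phone",
}


def ancestors(cls):
    """Return every class above cls, nearest parent first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(thing, cls):
    """True if thing (an instance or a class) falls under cls."""
    start = INSTANCE_OF.get(thing, thing)
    return start == cls or cls in ancestors(start)


print(is_a("my handset", "cell phone"))
print(is_a("cell phone", "android phone"))
```

The relationship only runs one way: every Android phone is a cell phone, but not every cell phone is an Android phone, which is exactly the kind of knowledge an ontology records.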

ENHANCING DISCOVERABILITY IN METADATA WITH ONTOLOGIES

Suppose we have a dataset of building permits. Naturally, we would like to compare our dataset of permits with another dataset of permits. As luck would have it, a standard for permit data is emerging, called BILDS. On browsing the BILDS website, we find a specification and see that as many as nine municipalities are using it. The set of requirements for a permit dataset can be seen on the BILDS GitHub account; in this regard, one may also like to have a look at the Core Permits Requirements. If the schemas of those nine municipalities match that of our dataset, the datasets can be said to be interoperable. But there is still a need to add some discoverable metadata around them. This is not so difficult, since all of these datasets share a similar schema. Our metadata could provide a standard definition for each column header, so that all nine datasets would be more discoverable as well, helping us to know what to look for.
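
A rough Python sketch of that idea follows. The column names and definitions below are placeholders invented for the example; they are not taken from the BILDS specification, so treat them as assumptions and substitute the real required fields.

```python
# Hypothetical shared schema: column name -> standard definition.
REQUIRED_COLUMNS = {
    "permit_id": "Unique identifier of the permit",
    "issued_date": "Date on which the permit was issued",
    "address": "Street address the permit applies to",
}


def missing_columns(headers, required=REQUIRED_COLUMNS):
    """Return the required columns that a dataset's headers do not provide."""
    normalized = {h.strip().lower() for h in headers}
    return sorted(set(required) - normalized)


def interoperable(*datasets):
    """True if every dataset supplies all of the required columns."""
    return all(not missing_columns(headers) for headers in datasets)


def describe(headers, required=REQUIRED_COLUMNS):
    """Attach the shared definition to each column header, where one exists."""
    return {
        h: required.get(h.strip().lower(), "no shared definition")
        for h in headers
    }


ours = ["Permit_ID", "Issued_Date", "Address", "Contractor"]
theirs = ["permit_id", "address"]

print(missing_columns(theirs))
print(interoperable(ours, theirs))
print(describe(ours))
```

Once every dataset passes the same check, the shared definitions double as discoverable metadata: anyone searching for an issue date knows which column to look for in all of them.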

OUR DATA ENRICHED WITH VALUABLE METADATA

At the outset we discussed Open Data and Data Quality. It was also emphasized that metadata are as valuable as the data itself. Later in the article we touched upon some of the anatomy and definitions of metadata, ontologies, schemas, and standards. The provenance of the data is connected to Data Quality. In the absence of metadata to provide provenance, we would have a dataset without context. Data without context, like an artefact, television, common salt, or any other random object, has little value. It is frustrating to work on Open Data projects without any metadata. Metadata all by itself can also be extremely useful. Even without the actual data, metadata can provide pointers to datasets. We can put together an organizational chart of the data that exists for a given topic.
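
As a small illustration of metadata acting as pointers, the Python sketch below builds an outline of datasets from their metadata alone. The field names title, theme, keyword and publisher come from the DCAT subset listed earlier in the article; the catalog entries and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical catalog: metadata records only, no actual data attached.
CATALOG = [
    {"title": "Building permits 2017", "theme": "Housing",
     "keyword": ["permits", "construction"], "publisher": "Example City"},
    {"title": "Zoning districts", "theme": "Housing",
     "keyword": ["zoning", "planning"], "publisher": "Example City"},
    {"title": "Bus stop locations", "theme": "Transport",
     "keyword": ["transit", "planning"], "publisher": "Example Transit Agency"},
]


def outline_by_theme(records):
    """Group dataset titles under their theme, sorted for stable output."""
    outline = defaultdict(list)
    for record in records:
        outline[record["theme"]].append(record["title"])
    return {theme: sorted(titles) for theme, titles in sorted(outline.items())}


def find_by_keyword(records, keyword):
    """Return the titles of datasets tagged with the given keyword."""
    wanted = keyword.lower()
    return [r["title"] for r in records
            if wanted in (k.lower() for k in r["keyword"])]


print(outline_by_theme(CATALOG))
print(find_by_keyword(CATALOG, "planning"))
```

Nothing here touches the datasets themselves; the outline is built entirely from the records that describe them.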

Manu Jeevan

Manu Jeevan is a self-taught data scientist and loves to explain data science concepts in simple terms. You can connect with him on LinkedIn, or email him at manu@bigdataexaminer.com.
