It’s time to make your data speak for itself!

By David Thoumas on Feb 23, 2019

Today, we are very excited to introduce our semantic chatbot. It is designed to help your data gain and retain meaning and reusability by semantizing it, without you even having to understand what data semantization is!

You have probably heard a lot of people use fancy words such as AI, machine learning or even petabytes while discussing their data. Chances are they are the only people who know what their data consists of, how to describe it and how to make sense of it. Data semantization is often perceived as an abstract, jargon-heavy concept, which makes its meaning and purpose difficult to grasp. Even once it becomes a bit more comprehensible, it remains a tricky and demanding task that only a few can carry out. And yet, semantizing one’s data is what will preserve its existence, relevance and uses over time.

At OpenDataSoft, we are trying to change that situation and make data semantization accessible to all. We have put together a semantic chatbot (https://chatbot.opendatasoft.com/) that guides data producers through the process of describing their data with great accuracy, and linking it to other data around the world, in a matter of minutes.

Show me what you got

First, let us tell you how our semantic chatbot works.

It learns from an algorithm we have created which analyzes your data. Based on the knowledge it gathers on semantized data, the algorithm suggests ontologies to describe your data, and relations to describe how the objects in your data are linked.
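To make the idea of suggesting ontology terms more concrete, here is a toy sketch in Python. It is not the algorithm behind the chatbot: it simply compares column names with a small, hand-written catalog of labels using fuzzy string matching. The catalog `KNOWN_TERMS`, the function `suggest_terms` and the threshold value are all assumptions made for illustration; the URIs are existing schema.org and DBpedia terms picked as examples.

```python
from difflib import SequenceMatcher

# Hypothetical mini-catalog: vocabulary term URI -> labels people commonly use.
# Hand-written for this sketch; a real system would learn it from semantized data.
KNOWN_TERMS = {
    "http://schema.org/name": ["name", "title", "label"],
    "http://schema.org/latitude": ["latitude", "lat"],
    "http://schema.org/longitude": ["longitude", "lon", "lng"],
    "http://dbpedia.org/ontology/populationTotal": [
        "population", "population total", "inhabitants",
    ],
}


def normalize(text):
    """Lowercase a name and treat underscores and hyphens as spaces."""
    return text.lower().replace("_", " ").replace("-", " ").strip()


def suggest_terms(column_name, top_n=3, threshold=0.6):
    """Rank known terms by how closely any of their labels matches the column name."""
    column = normalize(column_name)
    scored = []
    for uri, labels in KNOWN_TERMS.items():
        best = max(
            SequenceMatcher(None, column, normalize(label)).ratio()
            for label in labels
        )
        if best >= threshold:
            scored.append((best, uri))
    scored.sort(reverse=True)  # best score first
    return [uri for _, uri in scored[:top_n]]


for column in ["Name", "lat", "total_population", "zzz"]:
    print(column, "->", suggest_terms(column))
```

A column with no plausible match gets an empty list, which is the point where a human would step in and pick a term by hand.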

The chatbot aims to make the search for commonly used ontologies as simple as possible. This means you can be sure you are using the same language as everybody else who publishes data on the same topic as yours.

Approving the suggested ontologies and relations is what creates the links between your data and those made available by other data producers around the world.

At the end of this process, you will be able to generate what’s called an RDF mapping file. This file is the best way to semantically describe a dataset. The task took 5 minutes of your time. You did not have to learn anything about the RDF format or about the ontologies themselves. Furthermore, you did not spend hours reading a bunch of W3C papers. Just like that, thanks to our AI-based algorithms and your willingness, your data quality has significantly improved!
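For readers curious about what such a mapping does, here is a minimal Python sketch. It does not use the real RML syntax: `MAPPING` is a simplified, hypothetical stand-in that says which class each row belongs to, how its URI is built and which property describes each column. The sample rows, the `example.org` URIs and the choice of schema.org and DBpedia terms are assumptions made for illustration.

```python
import csv
import io

# Made-up sample rows standing in for a real dataset export.
CSV_TEXT = """county_id,county_name,population
c1,Example County,1000
c2,Sample County,2500
"""

# Simplified, hypothetical stand-in for an RDF mapping (not actual RML).
MAPPING = {
    "subject_template": "http://example.org/county/{county_id}",
    "class": "http://schema.org/AdministrativeArea",
    "columns": {
        "county_name": "http://schema.org/name",
        "population": "http://dbpedia.org/ontology/populationTotal",
    },
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Minimal escaping for backslashes and double quotes in a literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text, mapping):
    """Apply the mapping to each CSV row and return N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["subject_template"].format(**row)
        lines.append(f"<{subject}> <{RDF_TYPE}> <{mapping['class']}> .")
        for column, predicate in mapping["columns"].items():
            # Values are emitted as plain string literals to keep the sketch short.
            lines.append(f'<{subject}> <{predicate}> "{escape_literal(row[column])}" .')
    return lines


triples = rows_to_ntriples(CSV_TEXT, MAPPING)
print("\n".join(triples))
```

Each row becomes a handful of statements that any RDF-aware tool can read, which is what makes the dataset self-describing.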

Wait... why would I want to semantize my data in the first place?

That’s a fair question. In the past, semantic mapping has mostly been used by the research community or major web companies such as Wikipedia and Google.

Semantizing your data means describing it in a way the rest of the world can understand. It means using a common language to refer to any unit of information, whether it’s the GDP of a country, the level of fine particles measured in an area, or the Roman emperor ruling in 250 A.D.

Once correctly described:
- your data is easier to find, as search engines no longer rely exclusively on metadata;
- your data is easier to understand, as it has become self-explanatory;
- your data is easier to maintain: the person who will be managing your data in 5 years will not have to look for a PDF document you once uploaded to your company document archive;
- your data is easier to enrich. For example, if you know you are dealing with U.S. counties, it becomes faster to gather missing information such as population size, unemployment rates or geo-shapes;
- your data is used more often, a direct consequence of the previous point: people can use your data to enrich theirs.

But why has nobody semantized their data yet?

Because, well, it’s not that trivial. Most of the time it is actually quite complicated. You need to know what ontologies are (sets of concepts and categories related to a subject area or domain, describing their properties and the relations between them), and how to use them. You have to at least know where to find them. In some cases, you may even have to develop your own ontology because what is already out there is not satisfactory.
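If the definition of an ontology still feels abstract, this small Python sketch may help. It models a miniature ontology as a set of classes arranged in a hierarchy, plus properties that each apply to one class (the domain) and point to a kind of value (the range). The `Ontology` and `Property` types and the `Place` and `County` vocabulary are invented for illustration; real ontologies are published in formats such as RDFS or OWL.

```python
from dataclasses import dataclass, field


@dataclass
class Property:
    name: str
    domain: str   # class the property applies to
    range_: str   # class or datatype of its values


@dataclass
class Ontology:
    classes: dict = field(default_factory=dict)     # class name -> parent (or None)
    properties: dict = field(default_factory=dict)  # property name -> Property

    def add_class(self, name, parent=None):
        self.classes[name] = parent

    def add_property(self, name, domain, range_):
        self.properties[name] = Property(name, domain, range_)

    def ancestors(self, name):
        """Walk up the class hierarchy and return every parent class."""
        chain = []
        parent = self.classes.get(name)
        while parent is not None:
            chain.append(parent)
            parent = self.classes.get(parent)
        return chain

    def properties_of(self, class_name):
        """Properties usable on a class, including those inherited from parents."""
        applicable = {class_name, *self.ancestors(class_name)}
        return sorted(
            p.name for p in self.properties.values() if p.domain in applicable
        )


# Invented example vocabulary.
geo = Ontology()
geo.add_class("Place")
geo.add_class("County", parent="Place")
geo.add_property("name", domain="Place", range_="string")
geo.add_property("population", domain="County", range_="integer")
geo.add_property("locatedIn", domain="County", range_="Place")

print(geo.properties_of("County"))
print(geo.properties_of("Place"))
```

A county inherits the properties of a place and adds its own, which is the kind of relation an ontology makes explicit.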

Semantizing data takes time and skills when it really should not.

I’m a semantic expert, how is this of any help?

Our algorithm is far from perfect, and you may have spotted that little β sending the clear message that we plan to improve it. Your help and feedback are more than welcome!

Also, you probably know from your own experience that semantizing data is anything but a restful task.

That is why we suggest that you refer to the Pareto Principle to balance your efforts: use our bot to get the most obvious 80% of your mapping and use the time gained here to focus on the rest of your task.

And then what?

If you are an OpenDataSoft user, here is what you can do. Once your RDF Mapping Language (RML) mapping is in your hands, you can add it to your dataset’s metadata where it will become ready to use for you and others within the semantic community.

You can, for example, use SPARQL queries (https://en.wikipedia.org/wiki/SPARQL) to explore it through our TPF server (https://help.opendatasoft.com/apis/tpf/#authentication).
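TPF stands for Triple Pattern Fragments: a server of this kind answers simple requests of the form "subject, predicate, object" in which any part may be left open, and clients combine those answers to evaluate SPARQL queries. The Python sketch below imitates that basic operation on a few in-memory triples. It does not call the OpenDataSoft API, whose request format is described in the documentation linked above; the triples, the prefixes and the `match` function are assumptions made for illustration.

```python
# Made-up triples standing in for a semantized dataset.
TRIPLES = [
    ("ex:c1", "rdf:type", "schema:AdministrativeArea"),
    ("ex:c1", "schema:name", "Example County"),
    ("ex:c2", "rdf:type", "schema:AdministrativeArea"),
    ("ex:c2", "schema:name", "Sample County"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# Roughly equivalent to: SELECT ?s ?name WHERE { ?s schema:name ?name }
names = match(TRIPLES, predicate="schema:name")

# Every statement about a single resource.
about_c1 = match(TRIPLES, subject="ex:c1")

print(names)
print(about_c1)
```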

Soon, your data portal users and reusers will have the possibility to filter your catalog using not only the metadata from each dataset but also what each dataset contains, in a click.

You will discover that a whole new world opens up when you semantize your data as you broaden your ecosystem of users, and allow more reuses.

Contact us if you want more information, or wish to try our chatbot with your own data. The chatbot’s source code is open (https://github.com/opendatasoft/ontology-mapping-chatbot), under the MIT License. Feel free to use it for your own projects, to open issues or to open pull requests. We can’t wait for you to share your thoughts about this data semantization tool, as well as your suggestions to improve it!
