Spotlight Story: Urban Data Mining

Over the coming weeks, we’ll be introducing members of the Data Science Group and delving deep into their work on the MK Data Hub.

Today we talk to Carlo Allocca, a Research Associate in Urban Data Mining, who’s working on a solution that will allow users of the MK Data Hub to reformulate the same query over different sources—a task that’s currently pretty laborious.

What is the Data Science Group?

We’re researchers and scientists based in the Knowledge Media Institute at The Open University.

We study, investigate and attempt to understand the world around us. Our goal is to figure out what’s going on and, in the process, to contribute.

We are also technologists. We build digital tools to help make some tasks easier.

We create systems that can manipulate, integrate, federate, analyse and visualise data. We research the interaction between people and data, to find ways to make it more effective and more valuable.

The MK Data Hub, which brings together data from a range of different sensors around Milton Keynes, is a great practical example.

What’s urban data mining?

Urban data mining is the term for the processes involved in taking raw data and turning it into usable knowledge for the end user.

In the context of the MK Data Hub, data is coming into the hub in raw form. Through urban data mining, we’re converting this raw data into intelligible output for the end user: the kind of output that enables developers to build smart applications, and government and research bodies to compile reports on the management of resources in the city.

What are you working on right now?

In simple terms, I’m working on supporting the integration of the data that’s coming into the MK Data Hub from a range of different sources so that users can generate queries relating to multiple different data points.

In technical terms, I’m working on SQUIRE, a SPARQL—pronounced ‘sparkle’—query recommendation tool that reformulates a SPARQL query that is satisfiable with regard to a source RDF dataset into queries that are satisfiable with regard to a target RDF dataset.
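At a high level, reformulation can be thought of as mapping the vocabulary of the source dataset onto that of the target, so that a query written for one dataset becomes answerable over the other. The sketch below is a deliberately simplified illustration of that idea, assuming a known one-to-one term mapping; all dataset terms here are hypothetical, and SQUIRE’s actual logic—discovering and ranking candidate reformulations—is far more involved.

```python
import re

# Hypothetical mapping from source-dataset terms to target-dataset terms.
TERM_MAP = {
    "src:CrimeReport": "tgt:Incident",
    "src:inArea": "tgt:location",
}

def reformulate(query: str, term_map: dict) -> str:
    """Rewrite a SPARQL query string by substituting each source term
    with its target-dataset equivalent (whole-term matches only)."""
    for src_term, tgt_term in term_map.items():
        query = re.sub(re.escape(src_term) + r"\b", tgt_term, query)
    return query

source_query = """
SELECT ?crime WHERE {
  ?crime a src:CrimeReport ;
         src:inArea "Bletchley" .
}"""

# The rewritten query uses the target dataset's vocabulary throughout.
print(reformulate(source_query, TERM_MAP))
```

A real reformulation also has to account for structural differences between the two dataset models, not just vocabulary, which is where the recommendation logic comes in.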

Why is this work important?

Reformulating a query over many similar RDF datasets is usually laborious: it takes time to explore and understand the target RDF dataset’s model and content, and you then need to iteratively reformulate and test SPARQL queries until you reach a formulation that answers your question. As you can imagine, it’s time-consuming.

Can you give an example?

Right now, you can use the data hub to query crime in an individual area of Milton Keynes, but if you wanted to compare it with crime in another area of the city then you would have to write a separate query manually. This is because the data is coming from two different sources in two different formats.

My aim is to find a way to automatically generate queries across different data sources, such that the system behind the data hub will rewrite the query for you.
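As a toy illustration of why that manual step exists (all field names invented): two sources might publish comparable crime figures under different schemas, so a query written against one cannot read the other until the records are mapped into a common shape—which is what automatic rewriting would hide from the user.

```python
# Two hypothetical crime feeds with different schemas (invented fields).
source_a = [{"area": "Bletchley", "offence": "burglary"}]
source_b = [{"location_name": "Wolverton", "crime_type": "burglary"}]

def normalise_a(rec):
    """Map a source-A record into the common schema."""
    return {"area": rec["area"], "category": rec["offence"]}

def normalise_b(rec):
    """Map a source-B record into the common schema."""
    return {"area": rec["location_name"], "category": rec["crime_type"]}

# Once both feeds share a schema, a single query can span them.
unified = [normalise_a(r) for r in source_a] + [normalise_b(r) for r in source_b]
burglaries = [r["area"] for r in unified if r["category"] == "burglary"]
print(burglaries)  # one query, results from both sources
```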

When will this be in place?

We’re expecting to have the first version available later this year. We’ve already presented a demo of the solution to the European Semantic Web Forum, and we’re now working on making the system more robust. Once the solution’s fully up and running, it will allow you to query anything that uses data integrated within the MK Data Hub.

Is this useful beyond the context of the MK Data Hub?

Absolutely. It will be a universal solution, not restricted to Milton Keynes’ data. It will allow you to compare Milton Keynes’ data with that of other smart cities, and indeed with any data at all, making it the first solution of its kind.