Exploring the Use of Semantic Technologies for Big Data Integration
Problem Description
Data, both structured and unstructured, has grown exponentially in volume and availability. Big Data has become the ubiquitous term for collections of datasets so large that they are difficult to process with traditional database and software techniques. Big Data may comprise petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data, consisting of billions to trillions of records about millions of people, all from different sources (e.g. Web, sales, customer contact centers, social media, mobile data). The data is typically loosely structured and often incomplete and inaccessible. Big Data is transforming science, engineering, medicine, healthcare, finance, business, and ultimately society itself. Massive amounts of data are available to be harvested for competitive business advantage, sound government policies, and new insights in a broad array of applications (including healthcare, biomedicine, energy, smart cities, genomics, and transportation). Yet most of this data is inaccessible to users, because technology and tools are needed to find, transform, analyze, and visualize it before it can be used for decision-making. The research community also agrees that it is important to engineer Big Data meaningfully. Meaningful data integration in a schema-less and complex Big Data world of databases remains a big open challenge.
Project Description & Objectives
The goal of this project is to explore the use of semantic technologies to connect, link, and load data into a data warehouse. The specific objectives included:
(i) creation of a semantic data model via ontologies to provide a basis for integration and understanding knowledge from multiple sources;
(ii) creation of integrated semantic data using Resource Description Framework (RDF) as the graph data model;
(iii) extracting useful knowledge and information from the combined web of data for use in innovative applications.
More specifically, this project addressed two domains of data:
(i) user data from video game platforms, including badges, trophies earned, and time spent playing a game;
(ii) real-estate and crime rate data.
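To illustrate why a graph data model helps with the second domain, here is a minimal sketch in plain Python of how two sources can be combined once they share an identifier. The namespace, property names, and all values are made up for illustration and are not taken from the project's datasets or ontologies; the small `match` helper only mimics a single triple pattern and is not a SPARQL engine.

```python
# Hypothetical namespace and property names, for illustration only.
EX = "http://example.org/housing#"

# Triples from a made-up real-estate source: (subject, predicate, object).
listings = [
    (EX + "listing1", EX + "locatedIn", EX + "Northside"),
    (EX + "listing1", EX + "askingPrice", 250000),
    (EX + "listing2", EX + "locatedIn", EX + "Riverside"),
    (EX + "listing2", EX + "askingPrice", 310000),
]

# Triples from a made-up crime-statistics source.
crime = [
    (EX + "Northside", EX + "crimesPer1000", 12.5),
    (EX + "Riverside", EX + "crimesPer1000", 31.0),
]


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def listings_below_crime_rate(graph, threshold):
    """Join listings to crime rates through the shared neighborhood URI."""
    results = []
    for listing, _, area in match(graph, p=EX + "locatedIn"):
        for _, _, rate in match(graph, s=area, p=EX + "crimesPer1000"):
            if rate < threshold:
                results.append(listing)
    return sorted(results)


# Merging two sets of triples is simply their union.
merged = listings + crime
safe = listings_below_crime_rate(merged, 20.0)
```

Because both sources refer to a neighborhood by the same URI, no schema mapping step is needed in this toy example; in practice, agreeing on such shared identifiers is the role the ontologies play.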
Process
The project was conducted in the following phases:
(i) Literature review: This phase involved reading and discussing research articles in the field of data integration, identifying public datasets available in various domains, identifying innovative apps that could improve quality of life, and learning semantic technologies such as OWL (Web Ontology Language), Protégé (OWL editor), RDF (semantic data format), and SPARQL (semantic query language).
(ii) Requirement analysis: At the beginning of this phase, two domains were picked for the research project: a home-buying advisor system and a video game suggestion system. A SurveyMonkey survey was conducted to understand the most important factors people look for when buying a house. A competitive analysis of game recommender systems was carried out, and the systems were categorized into online quizzes, forums, and suggestion software.
(iii) Ontology design: This phase involved reusing existing schemas and ontologies and extending them to create ontologies for the two domains: video games and user metadata, and real estate and crime data. The ontologies were designed and created in OWL using the Protégé visual ontology editor.
(iv) Semantic data generation: This phase involved converting the public datasets identified earlier into semantic data in Resource Description Framework (RDF) format.
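The conversion step in phase (iv) can be sketched as follows. This is only an illustration using the Python standard library, not the tooling the team actually used: the namespace, column names, and sample rows are hypothetical, and every value is emitted as a plain string literal in N-Triples syntax rather than being typed against an ontology.

```python
import csv
import io

BASE = "http://example.org/games#"  # hypothetical namespace

# Made-up sample rows standing in for a public dataset.
SAMPLE_CSV = """user_id,game,hours_played,trophies
u1,Game A,12.5,3
u2,Game B,40,10
"""


def escape_literal(text):
    """Escape the characters that must be escaped in N-Triples string literals."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(row):
    """Turn one CSV row into one triple per non-key column.

    Assumes user_id contains only characters that are safe inside an IRI.
    """
    subject = f"<{BASE}user/{row['user_id']}>"
    lines = []
    for column, value in row.items():
        if column == "user_id":
            continue  # the key column becomes the subject, not a property
        predicate = f"<{BASE}{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


def csv_to_ntriples(text):
    """Convert CSV text into a list of N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        triples.extend(row_to_ntriples(row))
    return triples


ntriples = csv_to_ntriples(SAMPLE_CSV)
```

Each row yields one subject URI and one triple per remaining column. A fuller pipeline would map columns to the properties defined in the ontology and attach datatypes to numeric values.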
Acknowledgments:
The team would like to thank the CREU program, supported by CDC and CRA-W, for making this project possible through its generous research grant.
This site provides updates on the progress of the CREU project.