Machine-interpretable semantic descriptions are enhancing various computing and data resources on the Web, making data easier to search, discover, and integrate. This interconnected metadata constitutes the Semantic Web, whose main principles, languages, frameworks, and best practices are set by the World Wide Web Consortium (W3C).
The W3C's metadata acquisition languages include the Resource Description Framework (RDF), RDF in attributes (RDFa), RDF Schema, and the Web Ontology Language (OWL), which government, academia, and industry have embraced for capturing and sharing metadata. As a result, new data-intensive, semantics-enabled applications require efficient management of RDF data. Many researchers have proposed using relational databases to store and query large RDF datasets. Such systems, called relational RDF stores, are now in production.
More recently, researchers have started exploring distributed technologies deployed in the cloud, such as Hadoop and HBase, for distributed and scalable RDF data management. Here, we study and compare two approaches to distributed RDF data management, one based on emerging cloud computing technologies and the other on traditional relational database clustering. We elaborate on the design of distributed RDF data storage and querying schemes for HBase and MySQL Cluster, and we conduct an empirical comparison of these approaches on a cluster of commodity machines using datasets and queries from the Lehigh University Benchmark (LUBM). We focus on HBase and MySQL Cluster because they are open source and frequently used in industry and academia for both prototyping and production. Note that HBase is an open source implementation of Google's Bigtable. The experiments revealed interesting patterns in query evaluation and showed which approach had the edge for scalable Semantic Web data management.
RDF and SPARQL in a Nutshell
The RDF data model is similar to classic conceptual modelling approaches such as entity-relationship or class diagrams, as it is based on the idea of making statements about resources (in particular, web resources) in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource, and the predicate denotes a trait or aspect of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion "The sky has the color blue" in RDF is as the triple: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue".
RDF thus uses the term "subject" for what the classical entity-attribute-value notation of object-oriented design calls the "object": object (sky), attribute (color), and value (blue). RDF is an abstract model with several serialization formats (i.e., file formats), so the particular way in which a resource or triple is encoded varies from format to format. This mechanism for describing resources is a major component of the W3C's Semantic Web activity: an evolutionary stage of the World Wide Web in which automated software can store, exchange, and use machine-readable information distributed throughout the Web, in turn enabling users to deal with the information with greater efficiency and certainty.
RDF's simple data model and its ability to model disparate, abstract concepts have also led to its increasing use in knowledge management applications unrelated to Semantic Web activity. A collection of RDF statements intrinsically represents a labelled, directed multi-graph. As such, an RDF-based data model is more naturally suited to certain kinds of knowledge representation than the relational model and other ontological models.
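To make the graph view concrete, the following minimal sketch stores statements as (subject, predicate, object) tuples and groups them into labelled, directed edges. The `ex:` identifiers and the helper name `to_multigraph` are hypothetical and chosen only for illustration; they do not come from any RDF vocabulary or library. Two edges with different labels connect the same pair of nodes, which is why the structure is a multi-graph rather than a simple graph.

```python
from collections import defaultdict, namedtuple

# One RDF statement: subject --predicate--> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical identifiers for illustration only.
triples = [
    Triple("ex:sky", "ex:hasColor", "ex:blue"),
    Triple("ex:sky", "ex:associatedWith", "ex:blue"),  # parallel edge, different label
    Triple("ex:blue", "rdf:type", "ex:Color"),
]


def to_multigraph(statements):
    """Group triples into adjacency lists: subject -> [(predicate, object), ...]."""
    graph = defaultdict(list)
    for t in statements:
        graph[t.subject].append((t.predicate, t.object))
    return graph


graph = to_multigraph(triples)
for predicate, obj in graph["ex:sky"]:
    print(f"ex:sky --{predicate}--> {obj}")
```

The same list of tuples is both the serialized "set of triples" view and, once grouped by subject, the graph view; no information is added or lost in the conversion.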
The RDF data model is a directed, labelled graph that can also be serialized and viewed as a set of triples. A running example we use throughout this article is an RDF graph with 10 triples that use the LUBM vocabulary to describe this article's authors. Each triple consists of a subject, predicate, and object and defines a relationship between the subject and the object. In the figure below, the brackets and quotation marks denote resource identifiers and literals, respectively. For example, the first three triples state that a resource with identifier C is a student named Craig who is a member of IEEE.
This sample dataset can be queried using SPARQL, a standard query language for RDF. SPARQL uses triple patterns and graph patterns that are matched over RDF data. For example, query Q14 from LUBM contains one triple pattern, which returns all undergraduate student identifiers as bindings of variable X. More details on SPARQL features and semantics can be found in the W3C's SPARQL specification. SPARQL was made a standard by the RDF Data Access Working Group (DAWG) of the W3C and is recognized as one of the key technologies of the Semantic Web.
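The idea of matching a single triple pattern can be sketched in a few lines of plain Python. This is an illustration of the matching semantics only, not how a SPARQL engine or the systems compared in this article evaluate queries. The helper names `is_variable` and `match_pattern`, the `ex:` identifiers, and the four sample triples are hypothetical and are not the article's running example; the pattern mirrors the shape of LUBM Q14, with the `rdf:` and `ub:` prefixes left unexpanded.

```python
def is_variable(term):
    """SPARQL-style variables are written with a leading question mark."""
    return term.startswith("?")


def match_pattern(pattern, triples):
    """Return one dict of variable bindings per triple that matches the pattern."""
    results = []
    for triple in triples:
        bindings = {}
        for p_term, t_term in zip(pattern, triple):
            if is_variable(p_term):
                # A variable that occurs twice must bind to the same value.
                if bindings.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                break  # a constant in the pattern did not match
        else:
            results.append(bindings)
    return results


# Hypothetical sample data for illustration only.
data = [
    ("ex:s1", "rdf:type", "ub:UndergraduateStudent"),
    ("ex:s1", "ub:name", '"Alice"'),
    ("ex:s2", "rdf:type", "ub:GraduateStudent"),
    ("ex:s3", "rdf:type", "ub:UndergraduateStudent"),
]

# Same shape as LUBM Q14: ?X rdf:type ub:UndergraduateStudent
q14_pattern = ("?X", "rdf:type", "ub:UndergraduateStudent")
bindings = match_pattern(q14_pattern, data)
print([b["?X"] for b in bindings])  # prints ['ex:s1', 'ex:s3']
```

A graph pattern with several triple patterns would join such binding sets on their shared variables; that step is omitted here to keep the sketch short.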