Benchmarking D2RQ v0.2

Previous Version:
Benchmarking D2RQ v0.1
Author:
Richard Cyganiak (Freie Universität Berlin, Germany)

Abstract

This document is an update to Benchmarking D2RQ v0.1 taking into account the improvements made to D2RQ between versions 0.1 and 0.2.

Table of Contents


1. Introduction

As Semantic Web technologies are getting mature, there is a growing need for RDF applications to access the content of huge, live, non-RDF, legacy databases without having to replicate the whole database into RDF. D2RQ is a mapping language that allows RDF-aware applications to access SQL databases, and an implementation of this language as a plugin for the Jena framework. D2RQ is intended for use with large databases. See D2RQ Website for details.

D2RQ v0.2 is the second release of the mapping language and its implementation as Jena plugin. We focused on extending the mapping language. New features, including conditional mappings and translation tables, provide more flexibility and help to cover a wider range of applications. A second goal was to improve performance by addressing the issues identified in the previous version of this report. This document presents the results of this effort.

The dataset used for the benchmarks is a XML dump of the DBLP computer science bibliography . We converted the XML dataset into an SQL database and into an RDF representation. We loaded the RDF triples into a database-backed Jena model. Then, we executed queries against both the Jena model and a D2RQ mapping that maps them onto the SQL database. The SQL dataset contains several tables with up to 500,000 records. The RDF dataset contains 1.6M triples. For detailed methodology notes, please refer to the previous report.

2. Comparision to Jena's ModelRDB

The performance of the previous version of D2RQ, v0.1, was on par with Jena's ModelRDB database backend for most queries. There were large differences for queries with lots of result triples (D2RQ was slower), and when either had to search through a database column where it couldn't make use of an index (the “worst case”, where both are unacceptably slow).

D2RQ v0.2 is slightly faster than Jena's database backend for most queries. The issue with many result triples has been fixed. Triggering the worst case is now harder, and there are several new tools for map authors that allow them to avoid it in even more cases.

In most cases, D2RQ does the optimal translation from a find query into one or more SQL queries. The cases where it is slower than Jena RDB are caused by the fact that some find queries cannot be translated into a single SQL query, but require several of them.

Some direct comparisions follow. Times are in milliseconds. The first number in the brackets shows how often the query was run. The second is the number of triples in the query result.

successful find(s p o) (contains) query on large table

Query: http://dblp.uni-trier.de/inproceeding/16655 http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingsAuthor http://dblp.uni-trier.de/person/23100

Jena 1273, D2RQ 1197            (500 x 1 result triples)


find(s ? ?) on large table

Query: http://dblp.uni-trier.de/inproceeding/16655 ANY ANY

Jena 1522, D2RQ 2201            (500 x 6 result triples)



find(s ? ?) with non-existing subject

Query: http://nope.example.net/ ANY ANY

Jena 915, D2RQ 6                (500 x 0 result triples)



find(? p ?), no index for jena (worst case)

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/seriesTitle ANY

Jena 42431, D2RQ 72             (1 x 24 result triples)



find(? p o), no index for D2RQ (worst case)

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingPages 1299-1313:http://www.w3.org/2001/XMLSchema#string

Jena 116, D2RQ 23164            (50 x 1 result triples)



find(? p o), 14 results

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingsAuthor http://dblp.uni-trier.de/person/1

Jena 970, D2RQ 498              (500 x 14 result triples)



find(? ? o) with o being a resource that doesn't match pattern

Query: ANY ANY http://www.example.org/

Jena 862, D2RQ 1615             (500 x 0 result triples)



find(? ? o) with o being a resource that matches pattern, 110 results

Query: ANY ANY http://dblp.uni-trier.de/proceeding/436

Jena 4130, D2RQ 2699            (500 x 110 result triples)



find(? ? o) with o being a resource that matches pattern, 5 results

Query: ANY ANY http://dblp.uni-trier.de/proceeding/12

Jena 1068, D2RQ 744             (500 x 5 result triples)

Most interesting queries are more complicated than a single find query. Interesting queries can be broken down into several find queries, or can be expressed in a higher-level query language like RDQL. Our goal for the next version of D2RQ is to directly support RDQL queries, instead of using Jena's default query handler to break them down into find queries.

3. Overview of speed-affecting changes

The following list presents some changes that had a significant impact on query execution speed.

Avoiding the worst case: URI columns disjoint from URI patterns

D2RQ has a performance worst case which is triggered when a column without an index is searched. Especially find(?,?,o) and find(?,p,o) queries are dangerous. In D2RQ v0.2, if a given URI matches a d2rq:uriPattern, then we assume that it won't be present in any d2rq:uriColumn. In many cases, this small change eliminates the need to search through non-indexed d2rq:uriColumns, thus giving a speed boost of several orders of magnitude.

Avoiding the worst case: Optimization hints

The d2rq:valueContains, d2rq:valueMaxLength and d2rq:valueRegex properties give map authors the chance to avoid the worst case in many more situations. By asserting that all values of a database column satisfy some constraint, the author gives D2RQ the chance to exclude the column from some queries. Again, this can improve query time by some orders of magnitude.

Corrected query combination algorithm

This change makes some queries slower than in v0.1. That version had a bug that would cause missing data for some mappings that use joins. The corrected version is slower for Fetch queries (s,?,?), but works correctly in all cases.

Dropping unnecessary joins

A d2rq:join specified in the map file is not necessary for all kinds of queries because, by definition, both sides of the join have the same value. Dropping these joins makes query execution slightly faster. This comes at a cost: D2RQ v0.2 makes some assumptions with regard to referential integrity that could cause problems with some pathological database schemas.

Dropping the DISTINCT

D2RQ is designed to work with non-normalized database schemas. This means that it has to cope with duplicate rows. v0.1 does this by adding DISTINCT to all SELECT queries. In v0.2, map authors must mark all non-normalized class maps with d2rq:containsDuplicates. This makes map authoring harder, but lets D2RQ decide whether the DISTINCT is necessary or not.

Optimizing triple delivery

v0.2 has improved handling of duplicates and a fixed query combination algorithm. These changes made the checks for duplicate triples in D2RQ's result delivery stage unnecessary. Queries with lots of result triples scale much better now with regard to execution time and memory.

Doing more stuff at startup time

By preparing as much work as possible at startup time, time can be saved at query time. v0.2 does this in many places, but there's still some room for improvement.

New brakes

Some of the new additions to the mapping language are nice for flexibility, but bad for speed. This is especially true for d2rq:condition. But these new features are necessary because they enable scenarios where D2RQ couldn't be used otherwise.

4. Direct comparision between v0.1 and v0.2

Some direct comparisions between the two releases follow. Notable changes are the slower fetch queries in 0.2, the improvement for (? ? o) queries because of the disjoint URI patterns and URI columns optimization, and the improvement for queries with many result triples.

find(s ? ?) on large table

Query: http://dblp.uni-trier.de/inproceeding/16655 ANY ANY

v0.1  2158, v0.2  2201            (500 x 6 result triples)



find(s ? ?) on small table

Query: http://dblp.uni-trier.de/proceeding/160 ANY ANY

v0.1  1317, v0.2  3106            (500 x 8 result triples)



find(s ? ?) with non-existing subject

Query: http://nope.example.net/ ANY ANY

v0.1    16, v0.2     6            (500 x 0 result triples)



find(? p o), no index for D2RQ, 50 times

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingPages 1299-1313:http://www.w3.org/2001/XMLSchema#string

v0.1 24970, v0.2 23164            (50 x 1 result triples)



find(? p o), D2RQ can use index

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingTitle BSP in CSP: Easy as ABC.:http://www.w3.org/2001/XMLSchema#string

v0.1   941, v0.2   697            (500 x 1 result triples)



find(? p o), 14 results

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingsAuthor http://dblp.uni-trier.de/person/1

v0.1  1416, v0.2   498            (500 x 14 result triples)



find(? p o), 1 result

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingsAuthor http://dblp.uni-trier.de/person/5

v0.1   990, v0.2   596            (500 x 1 result triples)



find(? p o), 1 result, no joins

Query: ANY http://dblp.uni-trier.de/xml/dplp.dtd/seriesTitle Topics in Information Systems:http://www.w3.org/2001/XMLSchema#string

v0.1   666, v0.2   455            (500 x 1 result triples)



find(? ? o), no index for D2RQ, 50 times

Query: ANY ANY foo:http://www.w3.org/2001/XMLSchema#string

v0.1 25491, v0.2 23937            (50 x 0 result triples)



find(? ? o) with o being a resource, 110 results

Query: ANY ANY http://dblp.uni-trier.de/proceeding/436

v0.1 15598, v0.2  2699            (500 x 110 result triples)



find(? ? o) with o being a resource, 5 results

Query: ANY ANY http://dblp.uni-trier.de/proceeding/12

v0.1  8402, v0.2   744            (500 x 5 result triples)