Benchmarking D2RQ v0.1

Latest Version:: Benchmarking D2RQ v0.2
Author:: Richard Cyganiak (Freie Universität Berlin, Germany)

Abstract

Together with the release of D2RQ v0.1, a series of benchmark tests were performed. This document describes the goals, process and some results of these benchmarks, as well as some recommendations based on the results.

1. Introduction
2. Performed tests
- 2.1. find(s ? ?)
- 2.2. find(? p o)
- 2.3. find(? p ?)
- 2.4. find(? ? o)
3. Results and Recommendations
- 3.1. Optimizing database schema and queries
- 3.2. Approaches for optimization in D2RQ
4. Methodology

1. Introduction

Update: “Benchmarking D2RQ v0.2” is an update to this document taking into account the changes since the first version of D2RQ.

Together with the release of D2RQ v0.1, a series of benchmark tests were performed. D2RQ is intended for use on large databases. The goals of the tests were:

to see how well D2RQ performs on large databases,
to identify bottlenecks and targets for performance optimization,
to compare D2RQ performance to Jena's support for relational database persistence.

You may skip directly to the results or read on for more detail.

We obtained a large XML dataset and converted it to an SQL database and into an RDF representation. We loaded the RDF triples into a database-backed Jena model. Then, we executed queries against both the Jena model and a D2RQ mapping that maps them onto the SQL database.

The SQL dataset contains several tables with up to 500,000 records. The RDF dataset contains 1.6M triples.

So far, we have only tested simple triple pattern matches (find(s,p,o) queries). We intend to repeat the test with RDQL queries.

2. Performed tests

For this set of tests, we have executed find(s, p, o) queries. Find queries return all triples matching a simple pattern consisting of a subject, a predicate and an object, each of whom may be an URI (or literal for the object) or a wildcard matching anything.

This section lists the triple patterns we tested, each with a short comment.

2.1. `find(s ? ?)`

find(s p o): http://dblp.uni-trier.de/inproceeding/16655 ANY ANY
(run 500 times)
Jena: 2765 ms
D2RQ: 2081 ms
                    (6 results)

find(s ? ?) on a large (200,000 records) table, executed 500 times. D2RQ is slightly faster.

find(s p o): http://dblp.uni-trier.de/proceeding/160 ANY ANY
(run 500 times)
Jena: 3574 ms
D2RQ: 1892 ms
                    (8 results)

find(s ? ?) on a smaller table with 3,000 records. Now the difference is significant. D2RQ is faster because it matches URIs against a set of patterns and searches only in the matching table, while Jena has to go through the whole of its large triple table.

find(s p o): http://nope.example.net/ ANY ANY
(run 500 times)
Jena: 1938 ms
D2RQ: 61 ms
                    (0 results)

Here the subject doesn't exist in the dataset. D2RQ is extremely fast because the URI doesn't match any of the patterns and can be rejected without having to do any database queries.

2.2. `find(? p o)`

find(s p o): ANY http://dblp.uni-trier.de/xml/dplp.dtd/personName M. Rieskamp:http://www.w3.org/2001/XMLSchema#string
(run 500 times)
Jena: 1496 ms
D2RQ: 243487 ms
                    (1 results)

This happens when D2RQ has to look through a database column that has no index. It has to go through all 175,000 records. This is much too slow.

find(s p o): ANY http://dblp.uni-trier.de/xml/dplp.dtd/personName M. Rieskamp:http://www.w3.org/2001/XMLSchema#string
(run 500 times)
Jena: 1424 ms
D2RQ: 1534 ms
                    (1 results)

Now we have added an index to the column and repeated the same query. Speed improvement: two orders of magnitude. Note that D2RQ is still slower than Jena. D2RQ has only 175,000 records to search, while Jena has 1,600,000 records in its triple table. There seems to be overhead on the Java side of D2RQ.

find(s p o): ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingsAuthor http://dblp.uni-trier.de/person/1
(run 500 times)
Jena: 1665 ms
D2RQ: 2192 ms
                    (14 results)

find(s p o): ANY http://dblp.uni-trier.de/xml/dplp.dtd/inProceedingsAuthor http://dblp.uni-trier.de/person/5
(run 500 times)
Jena: 1475 ms
D2RQ: 1628 ms
                    (1 results)

These two queries are structurally identical, but one returns 14 triples, the other only one. Note that the number of result triples has a much stronger impact on D2RQ's speed.

find(s p o): ANY http://dblp.uni-trier.de/xml/dplp.dtd/seriesTitle Topics in Information Systems:http://www.w3.org/2001/XMLSchema#string
(run 500 times)
Jena: 1493 ms
D2RQ: 1435 ms
                    (1 results)

Another similar query, but on a table whose ClassMap contains no PropertyBridges with joins. Apparently, the SQL joins slow D2RQ down.

2.3. `find(? p ?)`

find(s p o): ANY http://dblp.uni-trier.de/xml/dplp.dtd/seriesTitle ANY
(run 1 times)
Jena: 62863 ms
D2RQ: 160 ms
                    (24 results)

This is the worst case for Jena. When searching by predicate only, Jena cannot use an index and has to comb through the hughe triple table. The query was executed only once, not 500 times like the other examples.

2.4. `find(? ? o)`

find(s p o): ANY ANY Cat Photo Collection:http://www.w3.org/2001/XMLSchema#string
(run 500 times)
Jena: 1836 ms
D2RQ: 372012 ms
                    (0 results)

This is the worst case for D2RQ: find(? ? o) with a non-existing literal. It has to go through all columns that are marked as xsd:strings. Most of these don't have indices.

find(s p o): ANY ANY http://dblp.uni-trier.de/proceeding/436
(run 500 times)
Jena: 5503 ms
D2RQ: 252308 ms
                    (110 results)

Here, the object is an URI, not a literal. This is not quite as bad for D2RQ. It has to go through all columns that contain URIs. In some databases, there might be no such columns. Our database contains several columns with web links, none of which have an index.

find(s p o): ANY ANY http://dblp.uni-trier.de/proceeding/436
(run 500 times)
Jena: 5338 ms
D2RQ: 17034 ms
                    (110 results)

Same query after adding indices to all URL columns. An improvement of an order of magnitude, but still slower than Jena because D2RQ has to execute several SQL queries on different tables, and because of the many result triples.

find(s p o): ANY ANY http://dblp.uni-trier.de/proceeding/12
(run 500 times)
Jena: 2794 ms
D2RQ: 9948 ms
                    (5 results)

Same kind of query with fewer result triples. The remaining difference between Jena and D2RQ is D2RQ's overhead from doing multiple SQL queries.

3. Results and Recommendations

In general, D2RQ performs quite well on large databases. In most cases, D2RQ performance is on par with Jena. D2RQ is very fast when it can reject URIs based on UriPatterns.

The tests exposed weaknesses in Jena and D2RQ for certain queries.

In Jena, find(? p ?) queries are very slow. Several Jena features, like the RDQL query handler and property tables, have been designed to limit the practical impact of this weakness.

In D2RQ, the weakness is queries for an object that is mapped to database columns without index. Performance in this case is still much better than Jena's worst case, but not acceptable for large databases. The performance hit is about two orders of magnitude for tables with 100,000s of records.

D2RQ exhibited further (albeit much smaller) weaknesses with queries that

return many result triples,
involve PropertyBridges with joins, or
require several SQL queries.

The possible speed boost from improvements in these areas is less than an order of magnitude.

The obvious recommendation is to do everything to avoid triggering the worst case of searching through a non-indexed column. There's some things that users can do when designing their database schemas, D2RQ maps and queries, and there's some modifications that could be done to D2RQ to reduce the risk of triggering the worst case.

3.1. Optimizing database schema and queries

Here is what a user of D2RQ can do to keep queries fast. The theme is to avoid queries on database columns without an index.

Avoid queries with known object and unknown subject: If you can avoid find(? p o) and find(? ? o) queries, D2RQ performance will be excellent.
Add indices to frequently-queried columns: If you expect a column to be frequently used in find(? p o) or or find(? ? o) queries, consider adding an index. Additionally, consider adding indices to all columns containing URLs because they often are implicitly included in various kinds of queries.
Provide datatype information for your PropertyBridges: Use the d2rq:datatype property to provide datatype information for your DatatypePropertyBridges. This helps D2RQ exclude columns from queries.

3.2. Approaches for optimization in future D2RQ versions

This section identifies some design and coding approaches that could improve performance of future D2RQ versions.

State that URL columns and UriPatterns are disjoint

It would be helpful if it were explicitly stated that the joint domain of all URL columns is disjoint from any UriPattern. This would allow D2RQ to avoid searching through any URL columns at all if the URI already matches one of the UriPatterns.

Currently, in find(? ? o) patterns with an URI as the object, all URL columns in the database have to be searched, even if the object matches the UriPattern of a ClassMap.

In practice, we don't expect much overlap between these cases: Either you search for triples involving a resource defined by some ClassMap, or (less frequently) you search for triples involving some URL, probably a web link, from an URL column. Thus, stating that there must be no such overlap would not be harmful and would allow a small change to D2RQ with possibly big gains in practical performance.

Provide properties to define what values can occur in a column

It's always good when D2RQ can exclude a column from the SQL queries. There could be some properties that allow a user to describe the kind of values that can occur in a certain PropertyBridge. Providing values for these properties would be purely a performance optimization. For example, one could state that values in this column are never longer than ten characters or adhere to a given regular expression. This would allow D2RQ to remove the column from consideration if the requested value is too long or doesn't fit the expression. Each excluded column reduces the risk of hitting a non-indexed column.

Optimize delivery of result triples

Queries with many result triples are slow in D2RQ. The reasons are probably the complicated TripleResultSet and ExtendedIterator classes, and the expensive checks for duplicate triples.

Don't join tables if not necessary

Currently, D2RQ joins tables whenever a PropertyBridge with d2rq:join statements is included. Often, that's not necessary because the joined value from the foreign table is identical by definition, and we know that it's there because of referential integrity. In this case, the join doesn't have to be included in the SQL query.

Optimize UriPattern matching

Some profiling not described here indicated that D2RQ spends much time going through the ClassMap and PropertyBridge hash tables and munging strings to find matching UriPatterns. There seems to be room for improvement.

4. Methodology

The dataset used for the benchmarks is the XML dump of the DBLP computer science library. The XML file can be downloaded from the UW XML Repository. We used only a part of the data: all inproceedings records (about 200,000) and all proceedings records (about 3,000), with all their persons (175,000), publishers (100), series (25) and relations between persons and inproceedings (500,000).

To set up the benchmarking environment, we

developed an SQL schema and an RDF vocabulary for this dataset,
devised a D2RQ mapping connecting the two,
wrote a converter that reads the XML file and outputs SQL INSERT statements and RDF N-Triples (source available on request),
imported the SQL dump into a MySQL database,
imported the N-Triples file into a database-backed Jena model,
and wrote a small benchmarking app that runs a query against both the RDB Jena model and the D2RQ model.

The actual tests were performed on an Apple PowerBook G4 with 1GHz processor and 768MB of RAM.

Stuff used in the tests:

Benchmark code: This includes the D2RQ map file. Needs the Jena JARs and the MySQL JDBC connector JAR on the classpath. Contains a class for importing the N-Triples file into Jena (NT2Jena.java), a class for running queries (Compare.java) and a helper class. You have to modify several file locations, DB connection info etc. as constants at the top of each file.
Database schema: Create a MySQL database called dblp. Import with mysql -u user -p dblp < dblp-schema.sql).
SQL dump (zipped, 14MB): Import with mysql -u user -p dblp < dblp.sql.
N-Triples dump (zipped, 19MB): Import with NT2Jena.java