This document defines the performance metrics and benchmark rules of the Berlin SPARQL Benchmark (BSBM), describes how to report results, and gives an overview of the data generator and the test driver.
The BSBM defines three groups of fundamental performance metrics:
1.1 Metrics for Single Queries
Average Query Execution Time (aQET): The average time for executing a query of type x, measured over multiple executions with different parameters against the SUT.
Queries per Second (QpS): The average number of queries of type x executed per second.
Min/Max Query Execution Time (minQET, maxQET): The lowest and highest execution time observed for queries of type x.
1.2 Metrics for Query Mixes
Query Mixes per Hour (QMpH): The number of query mixes with different parameters executed per hour against the SUT.
Overall Runtime (oaRT): The overall time it took the test driver to execute a certain number of query mixes against the SUT.
Composite Query Execution Time (cQET): The average time for executing the query mix, measured over multiple executions with different parameters.
Average Query Execution Time over all Queries (aQEToA): The overall time to run 50 query mixes divided by the number of executed queries (25 * 50 = 1250).
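For illustration (hypothetical figures, not benchmark results): if the test driver needs an overall runtime of oaRT = 625 seconds for 50 query mixes of 25 queries each, then cQET = 625 s / 50 = 12.5 s per query mix, aQEToA = 625 s / 1250 = 0.5 s per query, and QMpH = 50 / (625 s / 3600 s) = 288 query mixes per hour.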
1.3 Price/Performance Metric for the Complete System under Test (SUT)
The Price/Performance Metric is defined as $ / QMpH, where $ is the total system cost over 5 years in the specified currency. The total system cost over 5 years is calculated according to the TPC Pricing Specification. If compute-on-demand infrastructure is used, the cost metric is $ / QMpH / day.
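For illustration (hypothetical figures, not benchmark results): a SUT with a total 5-year system cost of 60,000 USD that achieves 12,000 QMpH would be reported as 60,000 USD / 12,000 QMpH = 5 USD per QMpH.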
When running the BSBM benchmark and reporting BSBM benchmark results, the following rules should be obeyed:
This section defines formats for reporting benchmark results.
Benchmark results are named according to the scenario, the
scale factor of the dataset and the number of concurrent clients.
For example: NTS(1000,5) denotes the Native Triple Store scenario run against a dataset with 1,000 products using 5 concurrent clients.
23.7 QPS(2)-NTS(10000,1) means that on average 23.7 queries of type 2 were executed per second by a single client stream against a Native Triple Store containing data about 10,000 products.
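Similarly (hypothetical figures), a result such as 4,500 QMpH-NTS(1000,5) would state that 4,500 query mixes per hour were achieved against a Native Triple Store containing data about 1,000 products, using 5 concurrent clients.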
To guarantee an intelligible interpretation of benchmark reports and results, and to allow for efficient and even automated handling and comparison, all necessary information shall be represented in XML. Furthermore, we opt for a full disclosure policy covering the SUT, configuration, pricing, etc., so that all information needed to replicate any detail of the system is given, enabling anyone to achieve similar benchmark results.
Full Disclosure Report Contents
Todo: Define an XML format for the Full Disclosure Report.
Todo: Implement a tool that generates HTML reports, including graphics, from the XML benchmark results, in order to motivate people to use the reporting format.
TODO: add new options when finished with the new implementation
There is a Java implementation (requiring at least JVM 1.5) of a data generator and a test driver for the BSBM benchmark. The source code of the data generator and the test driver can be downloaded from the Sourceforge project BSBM tools. The code is licensed under the terms of the GNU General Public License.
The BSBM data generator can be used to create benchmark datasets of different sizes. Data generation is deterministic.
The data generator supports the following output formats:
Format | Option |
---|---|
N-Triples | -s nt |
Turtle | -s ttl |
XML | -s xml |
(My-)SQL dump | -s sql |
Next on the todo list: Implement TriG output format for benchmarking Named Graph stores.
Configuration options:
Option | Description |
---|---|
-s <output format> | The serialization format of the generated dataset. See the table above for the supported formats. Default: nt |
-pc <number of products> | Scale factor: the dataset is scaled via the number of products. For example, 91 products result in about 50K triples. Default: 100 |
-fc | By default, the data generator adds one rdf:type statement for the most specific type of a product to the dataset. This only works for SUTs that support RDFS reasoning and can infer the remaining type relations. If the SUT does not support RDFS reasoning, the option -fc can be used to also include the statements for the more general classes (forward chaining). Default: disabled |
-dir | The output directory for all the data the Test Driver uses for its runs. Default: "td_data" |
-fn | The file name for the generated dataset (a suffix is added according to the output format). Default: "dataset" |
The following example command creates a Turtle benchmark dataset with the scale factor 1000 and forward chaining enabled:
$ java -cp bin:lib/ssj.jar benchmark.generator.Generator -fc -pc 1000 -s ttl
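As a further illustration, here is a minimal sketch of an invocation that generates a small N-Triples dataset of roughly 50K triples (91 products, see the scale factor note above) and spells out the default output directory and file name explicitly; it assumes the same directory layout as the example above:
$ java -cp bin:lib/ssj.jar benchmark.generator.Generator -pc 91 -s nt -dir td_data -fn dataset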
The test driver works against a SPARQL endpoint over the SPARQL protocol.
Configuration options:
Option | Description |
---|---|
-runs <number of runs> | The number of query mix runs. Default: 50 |
-idir <directory> | The input parameter directory which was created by the Data Generator. Default: "td_data" |
-w <number of warm up runs> | Number of runs executed before the actual test to warm up the store. Default: 10 |
-o <result XML file> | The output file containing the aggregated result overview. Default: "benchmark_result.xml" |
-dg <default graph URI> | Specify a default graph for the queries. Default: null |
-mt <number of clients> | Benchmark with multiple concurrent clients. |
-seed <Long value> | Set the seed for the random number generator used for the parameter generation. |
-t <timeout in ms> | If the complete result of a query is not read within the specified timeout, the client disconnects and reports a timeout to the Test Driver. This is also the maximum runtime a query can contribute to the metrics. |
-q | Turn on qualification mode. For more information, see the qualification chapter of the use case. |
-qf <qualification file name> | Change the qualification file name; also see the qualification chapter of the use case. |
In addition to these options, a SPARQL endpoint must be given.
At log level 'ALL', a detailed run log containing information about every executed query is generated.
The following example command runs 128 query mixes (plus 32 for warm-up) against a SUT which provides a SPARQL endpoint at http://localhost/sparql:
$ java -cp bin:lib/* benchmark.testdriver.TestDriver -runs 128 -w 32 http://localhost/sparql
If your Java version does not support the asterisk in the classpath definition, you can list the jars explicitly:
$ java -cp bin:lib/ssj.jar:lib/log4j-1.2.15.jar benchmark.testdriver.TestDriver -runs 128 -w 32 http://localhost/sparql
The following example runs 1024 query mixes plus 128 warm-up mixes with 4 clients against a SUT which provides a SPARQL endpoint. The timeout per query is set to 30 seconds:
$ java -cp bin:lib/ssj.jar:lib/log4j-1.2.15.jar benchmark.testdriver.TestDriver -runs 1024 -w 128 -mt 4 -t 30000 http://localhost/sparql
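As a further sketch (the run counts, seed value, output file name and endpoint URL are arbitrary placeholders), a reproducible single-client run that fixes the parameter seed and writes the aggregated results to a custom file could look like this:
$ java -cp bin:lib/* benchmark.testdriver.TestDriver -runs 500 -w 50 -seed 9834533 -o benchmark_result_run1.xml http://localhost/sparql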
For more information about RDF and SPARQL Benchmarks please refer to:
The work on the BSBM Benchmark Version 3 is funded through the LOD2 project.