This document defines the Update use case of the Berlin SPARQL Benchmark (BSBM) for measuring the performance of storage systems that expose SPARQL endpoints. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The query mix of the Update use case simulates update activity on the dataset by adding or deleting product, review, and offer data.
The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems and are used within enterprise and open web settings. As SPARQL is taken up by the community, there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. Such systems include native RDF stores, Named Graph stores, systems that map relational databases into RDF, and SPARQL wrappers around other kinds of data sources.
The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of these systems across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The query mix of this use case represents update activity on the dataset. All queries conform to the SPARQL 1.1 Update draft.
The Berlin SPARQL Benchmark was designed along three goals: First, the benchmark should allow the comparison of different storage systems that expose SPARQL endpoints across architectures. Second, testing storage systems with realistic workloads of use case motivated queries is a well-established benchmarking technique in the database field, implemented for instance by the TPC benchmarks; the Berlin SPARQL Benchmark should apply this technique to systems that expose SPARQL endpoints. Third, as an increasing number of Semantic Web applications do not rely on heavyweight reasoning but focus on the integration and visualization of large amounts of data from autonomous data sources on the Web, the Berlin SPARQL Benchmark is not designed to require complex reasoning but to measure the performance of queries against large amounts of RDF data.
The rest of this document is structured as follows: Section 2 defines the schema of the benchmark dataset and describes the rules used by the data generator to populate the dataset according to the chosen scale factor. Section 3 defines the benchmark queries. Section 4 defines how a system under test is verified against the qualification dataset.
All three scenarios use the same Benchmark Dataset. The dataset is built around an e-commerce use case, where a set of products is offered by different vendors and different consumers have posted reviews about products. The content and the production rules for the dataset are described in the BSBM Dataset Specification.
This section defines a suite of benchmark queries and a query mix. The query mix is not meant to be run on its own; instead, it should be combined with the query mix of the Explore Use Case to measure the impact of updates on performance.
The benchmark queries are designed to emulate the update behaviour of the e-commerce portal operator. An update operation is either the insertion of new data into the dataset or the deletion of existing data from it.
The complete query mix consists of 5 queries that simulate the update behaviour of the e-commerce portal. The query sequence is given below:
Use Case Motivation: New, previously unknown product data with related reviews and offers is inserted into the dataset.
SPARQL Query:
INSERT DATA {
%updateData% # product data with associated reviews and offers
# A product has 10 reviews on average
# A product has 20 offers on average
# Altogether one product with all its reviews and offers consists of about 300 triples
}
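For illustration, a minimal sketch of an instantiated Query 1 is shown below. The ex: instance URIs and the label are placeholders invented for this sketch; in an actual benchmark run, %updateData% is filled with the roughly 300 generator-produced triples describing one product together with its reviews and offers.

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX ex:   <http://example.com/instances/>

INSERT DATA {
  # the new product (placeholder URI and label)
  ex:Product12345 rdf:type bsbm:Product ;
                  rdfs:label "Example product" .
  # one of its reviews (a full update carries about 10)
  ex:Review67890 bsbm:reviewFor ex:Product12345 .
  # one of its offers (a full update carries about 20)
  ex:Offer24680 bsbm:product ex:Product12345 .
}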
Use Case Motivation: Outdated or erroneous offers are deleted, so they won't show up on the customer side.
SPARQL Query:
DELETE WHERE
{ %Offer% ?p ?o }
Parameters:
Parameter | Description |
---|---|
%Offer% | An offer URI (randomly selected) |
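As a sketch, with %Offer% instantiated to a randomly selected offer URI (the URI below is a placeholder), the executed update reads:

DELETE WHERE
{ <http://example.com/instances/Offer13579> ?p ?o }

With DELETE WHERE, the graph pattern serves both to match and to delete, so all triples that have the given offer as subject are removed in a single operation.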
The queries for the Named Graphs data model have the same semantics as the queries for the triple data model. The queries do not specify the IRIs of the named graphs in the RDF Dataset using the FROM NAMED clause, but assume that each query is executed against the complete RDF Dataset.
This is still work in progress. Todo: Rewrite all queries for Named Graphs.
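As a placeholder until the official rewrites are available, one possible Named Graphs formulation of Query 2 is sketched below. The use of a graph variable is an assumption of this sketch, not part of the specification; it deletes the offer's triples from whichever named graph contains them, in line with executing against the complete RDF Dataset:

DELETE WHERE
{
  GRAPH ?g { %Offer% ?p ?o }
}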
This section will contain a SQL representation of the benchmark queries, in order to compare the performance of stores that expose SPARQL endpoints with the performance of classic SQL-based RDBMS.
TODO: Write equivalent SQL queries once the SPARQL queries are confirmed.
Use Case Motivation: A consumer is looking for a product and has a general idea about what he wants.
SQL Query:
Parameters:
Parameter | Description |
---|---|
Use Case Motivation: The
SQL Query:
Parameters:
Parameter | Description |
---|---|
The work on the BSBM Benchmark Version 3 is funded through the LOD2 project.