Berlin SPARQL Benchmark (BSBM) - Dataset Specification

Authors: Chris Bizer (Web-based Systems Group, Freie Universität Berlin, Germany); Andreas Schultz (Institut für Informatik, Freie Universität Berlin, Germany)
This version: http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/20110217/Dataset/
Latest version: http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/Dataset/

Publication Date: 11/29/2010

Abstract

This document defines the dataset for the Berlin SPARQL Benchmark (BSBM). The benchmark is built around an e-commerce use case, where a set of products is offered by different vendors and different consumers have posted reviews about products.

1. Introduction
2. Benchmark Dataset
Appendix A: Changes
Appendix B: Acknowledgements

1. Introduction

The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems and are used within enterprise and open web settings. As SPARQL is taken up by the community there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. Such systems include native RDF stores, Named Graph stores, systems that map relational databases into RDF, and SPARQL wrappers around other kinds of data sources.

This document defines the data model and the generation rules for the dataset that is used for each use case of the Berlin SPARQL Benchmark suite. The dataset is scalable to different sizes based on a scale factor. There are three representations of the benchmark dataset: the first represents the scenario data using the RDF triple data model, the second uses the Named Graphs data model, and the third uses the relational data model. All three representations have the same semantics.

2 Benchmark Dataset

This section defines the logical schema of the BSBM benchmark dataset and the RDF triple, Named Graphs and relational representations of this schema. Section 2.3 defines the data generation rules that are used by the data generator to populate the dataset according to a given scale factor.

2.1 Logical Schema

This section defines the logical schema for the benchmark dataset. The dataset is based on an e-commerce use case, where a set of products is offered by different vendors and different consumers have posted reviews about these products on various review sites.

2.1.1 Namespaces

Prefix	Namespace
rdf:	http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:	http://www.w3.org/2000/01/rdf-schema#
foaf:	http://xmlns.com/foaf/0.1/
dc:	http://purl.org/dc/elements/1.1/
xsd:	http://www.w3.org/2001/XMLSchema#
rev:	http://purl.org/stuff/rev#
bsbm:	http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/
bsbm-inst:	http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/
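
The example instances in this document are written with prefixed names. As a small illustration of how the prefixes above resolve, the following sketch expands a prefixed name into a full URI. The names `PREFIXES` and `expand` are assumptions made for this example and are not part of the benchmark tooling.

```python
# Prefix table copied from the namespace list above.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rev": "http://purl.org/stuff/rev#",
    "bsbm": "http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/",
    "bsbm-inst": "http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'bsbm:Product' into a full URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local
```

For instance, `expand("bsbm:Product")` yields the vocabulary namespace followed by `Product`.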

2.1.2 Classes and Properties

Meta data Properties

dc:publisher (Resource: Vendor, Producer, ...)
dc:date (literal: xsd:date)

The meta data properties are used to capture the information source and the publication date of each instance.

Class Product

rdfs:label (literal: String)
rdfs:comment (literal: String)
rdf:type (resource: ProductType)
bsbm:producer (resource: Producer)
bsbm:productPropertyTextualX (literal: String, there are different productPropertyTextual properties, some are optional)
bsbm:productPropertyNumericX (literal: Number there are different productPropertyNumeric properties, some are optional)
bsbm:productFeature (Resource: ProductFeature)

Comment: Products are described by different sets of product properties and product features.

Example RDF Instance:

dataFromProducer001411:Product00001435443 
 rdf:type bsbm:Product;
 rdf:type bsbm-inst:ProductType001342;
 rdfs:label "Canon Ixus 20010" ;
 rdfs:comment "Mit ihrer hochwertigen Verarbeitung, innovativen Technologie und faszinierenden Erscheinung verkörpern Digital IXUS Modelle die hohe Kunst des Canon Design." ;
 bsbm:producer bsbm-inst:Producer001411 ;
 bsbm:productFeature bsbm-inst:ProductFeature003432 ;
 bsbm:productFeature bsbm-inst:ProductFeature103433 ;
 bsbm:productFeature bsbm-inst:ProductFeature990433 ;
 bsbm:productPropertyTextual1 "New this year." ;
 bsbm:productPropertyTextual2 "Special Lens with special focus." ;
 bsbm:productPropertyNumeric1 "1820"^^xsd:integer ;
 bsbm:productPropertyNumeric2 "140"^^xsd:integer ;
 bsbm:productPropertyNumeric3 "17"^^xsd:integer ;
 dc:publisher dataFromProducer001411:Producer001411 ;
 dc:date "2008-02-13"^^xsd:date .

Class ProductType

rdfs:label (literal: String)
rdfs:comment (literal: String)
rdfs:subClassOf (resource: ProductType)

Comment: Product types form an irregular subsumption hierarchy (depth 3-5).

Example RDF Instance:

bsbm-inst:ProductType011432
 rdf:type bsbm:ProductType ;
 rdfs:label "Digital Camera" ;
 rdfs:comment "A camera that records pictures electronically rather than on film." ;
 rdfs:subClassOf bsbm-inst:ProductType011000 ;
 dc:publisher bsbm-inst:StandardizationInstitution01 ;
 dc:date "2008-02-13"^^xsd:date .

Class ProductFeature

rdfs:label (literal: String)
rdfs:comment (literal: String)

Comment: The set of possible product features for a specific product depends on the product type. Each product type in the hierarchy has a set of associated product features, which leads to some features being very generic and others being more specific.

Example RDF Instance:

bsbm-inst:ProductFeature103433
 rdf:type bsbm:ProductFeature ;
 rdfs:label "Wide Screen TFT-Display" ;
 rdfs:comment "Wide Screen TFT-Display." ;
 dc:publisher bsbm-inst:StandardizationInstitution01 ;
 dc:date "2008-02-13"^^xsd:date .

Class Producer

rdfs:label (literal: String)
rdfs:comment (literal: String)
foaf:homepage (URL)
bsbm:country (ISO3166 country URI)

Example RDF Instance:

dataFromProducer001411:Producer001411
 rdf:type bsbm:Producer ;
 rdfs:label "Canon" ;
 rdfs:comment "Canon is a world leader in imaging products and solutions for the digital home and office." ;
 foaf:homepage <http://www.canon.com/> ;
 bsbm:country <http://downlode.org/rdf/iso-3166/countries#US> ; 
 dc:publisher dataFromProducer001411:Producer001411 ;
 dc:date "2008-02-13"^^xsd:date .

Class Vendor

rdfs:label (literal: String)
rdfs:comment (literal: String)
foaf:homepage (URL)
bsbm:country (ISO3166 country URI)

Example RDF Instance:

dataFromVendor001400:Vendor001400
 rdf:type bsbm:Vendor ;
 rdfs:label "Cheap Camera Place" ;
 rdfs:comment "We sell the cheapest cameras." ;
 foaf:homepage <http://www.cameraplace.com/> ;
 bsbm:country <http://downlode.org/rdf/iso-3166/countries#GB> ; 
 dc:publisher dataFromVendor001400:Vendor001400 ;
 dc:date "2008-02-03"^^xsd:date .

Class Offer

bsbm:product (resource: Product)
bsbm:vendor (resource: Vendor)
bsbm:price (literal: price with currency data type)
bsbm:validFrom (literal: Date)
bsbm:validTo (literal: Date)
bsbm:deliveryDays (Literal: business days)
bsbm:offerWebpage (URL of vendor's HTML page containing the offer)

Example RDF Instance:

dataFromVendor001400:Offer2413
 rdf:type bsbm:Offer ;
 bsbm:product dataFromProducer001411:Product00001435443 ; 
 bsbm:vendor dataFromVendor001400:Vendor001400 ;
 bsbm:price "31.99"^^bsbm:USD ;
 bsbm:validFrom "2008-02-12"^^xsd:date ;
 bsbm:validTo "2008-02-20"^^xsd:date ; 
 bsbm:deliveryDays "7"^^xsd:integer ;
 bsbm:offerWebpage <http://vendor001400.com/offers/Offer2413> ;
 dc:publisher dataFromVendor001400:Vendor001400 ;
 dc:date "2008-02-13"^^xsd:date .

Class Person

foaf:name (literal: String)
foaf:mbox_sha1sum (literal: SHA1 hash of the person's mailbox URI)
bsbm:country (ISO3166 country URI)

Example RDF Instance:

dataFromRatingSite0014:Reviewer1213
 rdf:type foaf:Person ;
 foaf:name "Jenny324" ; 
 foaf:mbox_sha1sum "4749d7c44dc4c0adf66c1319d42b89e18df6df76" ;
 bsbm:country <http://downlode.org/rdf/iso-3166/countries#DE> ; 
 dc:publisher dataFromRatingSite0014:RatingSite0014 ;
 dc:date "2007-10-13"^^xsd:date .

Class Review

bsbm:reviewFor (resource: Product)
rev:reviewer (resource: foaf:Person)
bsbm:reviewDate (literal: Date datatype)
dc:title (literal: String)
rev:text (literal: String)
bsbm:rating1 (literal: Number ranging from 1 to 10, optional property)
bsbm:rating2 (literal: Number ranging from 1 to 10, optional property)
bsbm:rating3 (literal: Number ranging from 1 to 10, optional property)
bsbm:rating4 (literal: Number ranging from 1 to 10, optional property)

Example RDF Instance:

dataFromRatingSite0014:Review022343
 rdf:type rev:Review ;
 bsbm:reviewFor dataFromProducer001411:Product00001435443 ; 
 rev:reviewer dataFromRatingSite0014:Reviewer1213 ;
 bsbm:reviewDate "2007-10-10"^^xsd:date ; 
 dc:title "This is a nice small camera"@en ;
 rev:text "Open your wallet, take out a credit card. No, I'm not going to ask you to order one just yet ..."@en ;
 bsbm:rating1 "5"^^xsd:integer ;
 bsbm:rating2 "4"^^xsd:integer ;
 bsbm:rating3 "3"^^xsd:integer ;
 bsbm:rating4 "4"^^xsd:integer ;
 dc:publisher dataFromRatingSite0014:RatingSite0014 ;
 dc:date "2007-10-13"^^xsd:date .

2.2. Triple, Named Graphs and Relational Representation

In order to compare the performance of systems that expose SPARQL endpoints, but use different internal data models, there are three different representations of the benchmark dataset as well as different versions of the benchmark queries.

pure RDF triple representation
Named Graphs representation
relational representation

2.2.1 Triple Representation

Within the triple representation of the dataset, the publisher and the publication date are captured for each instance by a dc:publisher and a dc:date triple.

Examples:

dataFromVendor001400:Offer2413
 rdf:type bsbm:Offer ;
 bsbm:product dataFromProducer001411:Product00001435443 ; 
 bsbm:vendor dataFromVendor001400:Vendor001400 ;
 bsbm:price "31.99"^^bsbm:USD ;
 bsbm:validFrom "2008-02-12"^^xsd:date ;
 bsbm:validTo "2008-02-20"^^xsd:date ; 
 bsbm:deliveryDays "7"^^xsd:integer ;
 bsbm:offerWebpage <http://vendor001400.com/offers/Offer2413> ;
 dc:publisher dataFromVendor001400:Vendor001400 ;
 dc:date "2008-02-13"^^xsd:date .
dataFromVendor001400:Offer2414
 rdf:type bsbm:Offer ;
 bsbm:product dataFromProducer001411:Product00001435444 ; 
 bsbm:vendor dataFromVendor001400:Vendor001400 ;
 bsbm:price "23.99"^^bsbm:USD ;
 bsbm:validFrom "2008-02-10"^^xsd:date ;
 bsbm:validTo "2008-02-22"^^xsd:date ; 
 bsbm:deliveryDays "7"^^xsd:integer ;
 bsbm:offerWebpage <http://vendor001400.com/offers/Offer2414> ;
 dc:publisher dataFromVendor001400:Vendor001400 ;
 dc:date "2008-02-13"^^xsd:date .

2.2.2. Named Graphs Representation

Within the Named Graph version of the dataset, all information that originates from a specific producer, vendor or rating site is put into a distinct named graph. There is one additional graph that contains provenance information (dc:publisher, dc:date) for all other graphs.

Example (using the TriG syntax):

dataFromVendor001400:Graph-2008-02-13 {

 dataFromVendor001400:Offer2413
 rdf:type bsbm:Offer ;
 bsbm:product dataFromProducer001411:Product00001435443 ; 
 bsbm:vendor dataFromVendor001400:Vendor001400 ;
 bsbm:price "31.99"^^bsbm:USD ;
 bsbm:validFrom "2008-02-12"^^xsd:date ;
 bsbm:validTo "2008-02-20"^^xsd:date ; 
 bsbm:deliveryDays "7"^^xsd:integer ;
 bsbm:offerWebpage <http://vendor001400.com/offers/Offer2413> .

 dataFromVendor001400:Offer2414
 rdf:type bsbm:Offer ;
 bsbm:product dataFromProducer001411:Product00001435444 ; 
 bsbm:vendor dataFromVendor001400:Vendor001400 ;
 bsbm:price "23.99"^^bsbm:USD ;
 bsbm:validFrom "2008-02-10"^^xsd:date ;
 bsbm:validTo "2008-02-22"^^xsd:date ; 
 bsbm:deliveryDays "7"^^xsd:integer ;
 bsbm:offerWebpage <http://vendor001400.com/offers/Offer2414> .
}

localhost:provenanceData {
 dataFromVendor001400:Graph-2008-02-13 dc:publisher dataFromVendor001400:Vendor001400 .
 dataFromVendor001400:Graph-2008-02-13 dc:date "2008-02-13"^^xsd:date .
}

2.2.3 Relational Representation

In order to benchmark systems that map relational databases to RDF and rewrite SPARQL queries into SQL queries against an application-specific relational data model, the BSBM data generator is also able to output the benchmark dataset as a MySQL dump.

This dump uses the following relational schema:


ProductFeature(nr, label, comment, publisher, publishDate)
ProductType(nr, label, comment, parent, publisher, publishDate)
Producer(nr, label, comment, homepage, country, publisher, publishDate)
Product(nr, label, comment, producer, propertyNum1, propertyNum2, propertyNum3, propertyNum4, propertyNum5, 
 propertyNum6, propertyTex1, propertyTex2, propertyTex3, propertyTex4, propertyTex5, propertyTex6, 
 publisher, publishDate)
ProductTypeProduct(product, productType)
ProductFeatureProduct(product, productFeature)
Vendor(nr, label, comment, homepage, country, publisher, publishDate)
Offer(nr, product, producer, vendor, price, validFrom, validTo, deliveryDays, offerWebpage, publisher, publishDate)
Person(nr, name, mbox_sha1sum, country, publisher, publishDate)
Review(nr, product, producer, person, reviewDate, title, text, language, rating1, rating2, rating3, rating4, 
 publisher, publishDate)

2.3. Scaling and Dataset Population

This section defines the rules for generating benchmark data for a given scale factor.

The benchmark is scaled by the number of products.

The table below gives an overview of the characteristics of BSBM datasets with different scale factors.

Scale Factor	666	2,785	70,812	284,826
Number of RDF Triples	250K	1M	25M	100M
Number of Producers	14	60	1,422	5,618
Number of Product Features	2,860	4,745	23,833	47,884
Number of Product Types	55	151	731	2,011
Number of Vendors	8	34	722	2,854
Number of Offers	13,320	55,700	1,416,240	5,696,520
Number of Reviewers	339	1,432	36,249	146,054
Number of Reviews	6,660	27,850	708,120	2,848,260
Total Number of Instances	23,922	92,757	2,258,129	9,034,027
Exact Total Number of Triples	250,030	1,000,313	25,000,244	100,000,112
File Size Turtle (unzipped)	22 MB	86 MB	2.1 GB	8.5 GB

The BSBM data generator is described in Section 8.
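
The offer and review rows of the table follow from the fixed ratios given in the generation rules for offers and reviews (20000 offers and 10000 reviews per 1000 products). The sketch below reproduces those two rows from the scale factor. The function name `fixed_counts` is hypothetical and only serves this illustration.

```python
def fixed_counts(products: int) -> dict:
    """Offers and reviews implied by the fixed per-product ratios."""
    return {
        "offers": 20 * products,   # 20000 offers per 1000 products
        "reviews": 10 * products,  # 10000 reviews per 1000 products
    }


# Scale factors taken from the table above.
for scale in (666, 2785, 70812, 284826):
    print(scale, fixed_counts(scale))
```

The other rows (producers, vendors, reviewers, product types and features) depend on random draws or on the type hierarchy and cannot be derived by a simple multiplication.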

2.3.1 Class: Product

Products have product types and are described with various properties. Products come in several different property combinations, some with many properties and some with fewer.

Rules for data generation:

Label: String of 1-3 words, dictionary 1
Comment: String of 50-150 words, dictionary 2
productPropertyTextualX: Literal of 3-15 words, dictionary 2
productPropertyNumericX: Integer, range 1-2000. Values of the normal distribution (mean value: 0, standard deviation: 1) are mapped from the range 0-2 to the range 1-2000.
productFeature: Every product has about 10-20 features.
publishDate: randomly chosen from 2000-09-20 to 2006-12-23.
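
The rule for productPropertyNumericX can be read in more than one way. The sketch below shows one possible reading: draw from a standard normal distribution, keep only draws that fall into the range 0-2, and map that range linearly onto 1-2000. Both the rejection step and the linear mapping are assumptions, and the function name `numeric_property` is hypothetical. The actual data generator may handle out-of-range draws differently.

```python
import random


def numeric_property(rng: random.Random, low: int = 1, high: int = 2000) -> int:
    """Draw one numeric property value (one reading of the rule above)."""
    # Assumption: draws outside [0, 2] are discarded and redrawn.
    while True:
        x = rng.gauss(0.0, 1.0)  # mean 0, standard deviation 1
        if 0.0 <= x <= 2.0:
            break
    # Linear mapping of [0, 2] onto [low, high]; x = 0 gives low, x = 2 gives high.
    return low + int(x / 2.0 * (high - low))


rng = random.Random(42)  # fixed seed so that runs are repeatable
sample = [numeric_property(rng) for _ in range(5)]
```

Under this reading, small values are more frequent than large ones, because the density of the normal distribution falls off between 0 and 2.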

There are three types of product descriptions. The table below lists the textual and numeric properties for each type.

	Textual Properties	Numeric Properties	Share of Products
Description Type 1	PropertyTextual1 to PropertyTextual5	PropertyNumeric1 to PropertyNumeric5	40%
Description Type 2	PropertyTextual1 to PropertyTextual3 + optional PropertyTextual4 (50%) + optional PropertyTextual5 (25%)	PropertyNumeric1 to PropertyNumeric3 + optional PropertyNumeric4 (50%) + optional PropertyNumeric5 (25%)	20%
Description Type 3	PropertyTextual1 to PropertyTextual3 + optional PropertyTextual5 (25%) + optional PropertyTextual6 (50%)	PropertyNumeric1 to PropertyNumeric3 + optional PropertyNumeric5 (25%) + optional PropertyNumeric6 (50%)	40%
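
As an illustration of the table above, the sketch below first picks a description type with the listed shares (40%, 20%, 40%) and then decides which property indices are present. It assumes that each optional property is drawn independently and that the same routine is applied separately to textual and numeric properties. The function names are hypothetical.

```python
import random


def choose_description_type(rng: random.Random) -> int:
    """Pick description type 1, 2 or 3 with shares 40%, 20%, 40%."""
    return rng.choices([1, 2, 3], weights=[40, 20, 40], k=1)[0]


def property_indices(kind: int, rng: random.Random) -> list:
    """Return the indices X of the properties present for one product."""
    if kind == 1:
        return [1, 2, 3, 4, 5]          # all five properties, none optional
    indices = [1, 2, 3]                  # mandatory for types 2 and 3
    if kind == 2:
        optional = [(4, 0.50), (5, 0.25)]
    else:
        optional = [(5, 0.25), (6, 0.50)]
    # Assumption: every optional property is decided by an independent draw.
    indices += [i for i, p in optional if rng.random() < p]
    return indices


rng = random.Random(7)
kind = choose_description_type(rng)
textual = property_indices(kind, rng)   # indices for productPropertyTextualX
numeric = property_indices(kind, rng)   # indices for productPropertyNumericX
```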

Relation: Product-Producer

Every Product has one producer.
One producer is generated for 50 products on average.
The number of products per producer is taken randomly from the normal distribution (mean value: 50, standard deviation: 16.6), restricted to values of 1 or more with no upper limit.
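
The same pattern, a normally distributed count that must be at least 1, also appears in the rules for offers per vendor, reviews per person and reviews per rating site. A minimal sketch, assuming that draws below 1 are simply redrawn and that values are rounded to the nearest integer (neither detail is stated in this document). The function name `count_from_normal` is hypothetical.

```python
import random


def count_from_normal(rng: random.Random, mean: float, std: float) -> int:
    """Draw a count from a normal distribution, keeping only values >= 1."""
    while True:
        n = round(rng.gauss(mean, std))
        if n >= 1:  # assumption: smaller values are rejected and redrawn
            return n


rng = random.Random(1)
products_per_producer = count_from_normal(rng, mean=50, std=16.6)
offers_per_vendor = count_from_normal(rng, mean=2000, std=666)
reviews_per_person = count_from_normal(rng, mean=20, std=6.6)
```

Because the lower bound lies about three standard deviations below the mean, the restriction hardly changes the average.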

Relation: Product-ProductType

Every Product is in one leaf of the product-type hierarchy.
Products are randomly assigned to the product types (leaf level), where the range 0-2 of the normal distribution (mean value: 1, std. deviation: 1) is mapped to the range 1 - number of (leaf) product types.

Relation: Product-ProductFeature

The set of possible product features for a product results from the product type and its superclasses.
Every feature from this set is chosen with a probability of 25%.
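
The two rules above can be sketched as follows: collect the features of the product's type and of all its superclasses, then keep each candidate with a probability of 25%. The data structures (`parent` as a child-to-parent mapping, `features_of` as a type-to-features mapping) and the function names are assumptions made for this example.

```python
import random


def candidate_features(product_type, parent, features_of):
    """Collect the features of a type and of all its superclasses."""
    candidates = []
    node = product_type
    while node is not None:
        candidates.extend(features_of.get(node, []))
        node = parent.get(node)  # None once the root has been processed
    return candidates


def pick_features(rng, candidates, probability=0.25):
    """Keep each candidate feature independently with the given probability."""
    return [f for f in candidates if rng.random() < probability]


# Tiny hypothetical hierarchy: root > cameras > digital_cameras
parent = {"digital_cameras": "cameras", "cameras": "root", "root": None}
features_of = {
    "root": ["f1", "f2"],
    "cameras": ["f3"],
    "digital_cameras": ["f4", "f5"],
}
rng = random.Random(3)
chosen = pick_features(rng, candidate_features("digital_cameras", parent, features_of))
```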

Relation: Product-Offer

Products are offered by multiple vendors.
Offers are randomly assigned to products, where the range 0-4 of the normal distribution (mean value: 2, std. deviation: 1) is mapped to the range 1 - number of products.

Relation: Product-Review

One Product has 10 reviews on average.
Reviews are randomly assigned to products, where the range 0-4 of the normal distribution (mean value: 2, std. deviation: 1) is mapped to the range 1 - number of products.

2.3.2 Class ProductType

Irregular subsumption hierarchy (depth 2-6). Number of classes increases with the number of products (around 4^{log₁₀#Products}).

The branching factor for every node on the same level is equal and gets calculated for arbitrary scale factors. The table below illustrates the relationship between number of products and branching factors for every level:

	root level	level 1	level 2	level 3	level 4
100 products	4	4
1 000 products	6	8	2
10 000 products	8	8	4
100 000 products	10	8	8	2
1 000 000 products	12	8	8	4
10 000 000 products	14	8	8	8	2
100 000 000 products	16	8	8	8	4

As can be seen, the depth increases by one every time the product count grows by a factor of 100.
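
The branching factors determine the total number of product types. The sketch below assumes a single root node and reads the "root level" column as the number of children of that root. Under this reading, the row for 1 000 products gives 151 types and the row for 100 000 products gives 2011 types, which are the same values that the scale factor table lists for 2,785 and 284,826 products. The function name `type_count` is hypothetical.

```python
def type_count(branching_factors):
    """Total number of nodes in a hierarchy with one root node."""
    total = 1        # the root itself
    level_size = 1   # number of nodes on the current level
    for factor in branching_factors:
        level_size *= factor  # every node on this level has `factor` children
        total += level_size
    return total


print(type_count([6, 8, 2]))       # row for 1 000 products
print(type_count([10, 8, 8, 2]))   # row for 100 000 products
```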

Rules for data generation:

Label: String of 1-3 words, dictionary 1
Comment: String of 20-50 words, dictionary 2
publishDate: randomly chosen from 2000-05-20 to 2000-06-23.

2.3.4 Class Product Feature

Each feature is assigned to a product type in the type hierarchy, which leads to some features being very generic and others being more specific.

Rules for data generation:

Label: String of 1-3 words, dictionary 1
Comment: String of 20-50 words, dictionary 2
The distribution of product features among the ProductType hierarchy works as follows: The root product type always has 5 product features. For the remaining hierarchy, bounds for the number of product features are calculated for all depth levels, so that nodes nearer to the root have more product features than nodes that lie deeper in the hierarchy. After that, for every node a random value between the lower and upper bound is chosen as the number of product features for that node. Product features for that node are then generated accordingly.
publishDate: randomly chosen from 2000-05-20 to 2000-06-23.

2.3.5 Class Producer

Rules for data generation:

Label: String of 1-3 words, dictionary 1
Comment: String of 20-50 words, dictionary 2
foaf:homepage: URI within the namespace of the producer
country: ISO3166 (US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5%)
publishDate: randomly chosen from 2000-07-20 to 2005-06-23.

Per 1000 products, there are 20 producers generated on average.
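
The country rule is a weighted random choice. A minimal sketch, using the codes and percentages exactly as they are listed in the rule above. The names `COUNTRY_WEIGHTS` and `pick_country` are hypothetical.

```python
import random

# Codes and shares as listed in the generation rule (shares sum to 100).
COUNTRY_WEIGHTS = {
    "US": 40, "UK": 10, "JP": 10, "CN": 10,
    "DE": 5, "FR": 5, "ES": 5, "RU": 5, "KR": 5, "AT": 5,
}


def pick_country(rng: random.Random) -> str:
    """Draw one country code according to the listed shares."""
    codes = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    return rng.choices(codes, weights=weights, k=1)[0]


rng = random.Random(11)
countries = [pick_country(rng) for _ in range(10)]
```

The same distribution is used for vendors and persons, so one helper of this kind would cover all three classes.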

2.3.6 Class Vendor

Rules for data generation:

Label: String of 1-3 words, dictionary 1
Comment: String of 20-50 words, dictionary 2
foaf:homepage: URI within the namespace of the vendor
country: ISO3166 (US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5%)
publishDate: randomly chosen from 2000-09-20 to 2006-12-23.

Per 1000 products, there are about 10 vendors generated on average (20000 offers per 1000 products at a mean of 2000 offers per vendor).

Relation: Vendor-Offer

Every offer belongs to a vendor.
The number of offers per vendor is taken randomly from the normal distribution (mean value: 2000, standard deviation: 666), restricted to values of 1 or more with no upper limit.

2.3.7 Class Offer

Rules for data generation:

price: random US-$ value between 5 and 10000
validFrom, validTo: date range between 7 and 180 days overlapping with the publication date of the offer.

validFrom ranges from 0-180 days before the publication date.
validTo ranges from 7-180 days after the publication date.
This means that about half of the offers are no longer valid.

deliveryDays: Integer between 1-21
offerWebpage: URI within the namespace of the vendor
publishDate: randomly chosen from (today - 97 days) to today.

Per 1000 products, there are 20000 offers generated.
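
The date rules for offers can be sketched as follows. The sketch assumes that all offsets are drawn uniformly in whole days, which this document does not state. The function name `offer_dates` is hypothetical, and the value used for `today` is only an example.

```python
import random
from datetime import date, timedelta


def offer_dates(rng: random.Random, today: date):
    """Return (publishDate, validFrom, validTo) for one offer."""
    # publishDate: between (today - 97 days) and today
    publish = today - timedelta(days=rng.randint(0, 97))
    # validFrom: 0-180 days before the publication date
    valid_from = publish - timedelta(days=rng.randint(0, 180))
    # validTo: 7-180 days after the publication date
    valid_to = publish + timedelta(days=rng.randint(7, 180))
    return publish, valid_from, valid_to


rng = random.Random(5)
publish, valid_from, valid_to = offer_dates(rng, today=date(2008, 6, 20))
still_valid = valid_to >= date(2008, 6, 20)
```

Every offer is valid on its own publication date. Whether it is still valid today depends on how far back the publication date lies compared with the drawn validTo offset.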

2.3.8 Class Person

Rules for data generation:

Name: String of 2-4 words, dictionary 3
mbox_sha1sum: random sha1 value
country: ISO3166 (US 40%, UK 10%, JP 10%, CN 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5%)
publishDate: randomly chosen from 2008-05-20 to 2008-08-23.

Per 1000 products, there are on average 500 persons generated.

2.3.9 Class Review

Rules for data generation:

Title: String of 4-15 words, dictionary 2
Text: String of 50-200 words, dictionary 2, lang: (EN 50%, JA 10%, ZH 10%, DE 5%, FR 5%, ES 5%, RU 5%, KR 5%, AT 5%)
Review Date: Random date within the last year
RatingX: Reviews might include up to 4 types of ratings. The likelihood that a review has a rating of type X is 70%. The values of the ratings range from 1 to 10.
publishDate: randomly chosen from Review Date to today (which is set to 2008-06-20).

Per 1000 products, there are 10000 reviews generated.
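
The RatingX rule can be sketched as four independent draws, each with a 70% chance of producing a rating between 1 and 10. The independence of the four draws and the uniform distribution of the rating values are assumptions. The function name `review_ratings` is hypothetical.

```python
import random


def review_ratings(rng: random.Random) -> dict:
    """Return the ratings present on one review, keyed by property name."""
    ratings = {}
    for i in range(1, 5):                 # rating1 .. rating4
        if rng.random() < 0.7:            # each rating is present with 70%
            ratings[f"rating{i}"] = rng.randint(1, 10)
    return ratings


rng = random.Random(9)
example = review_ratings(rng)
```

A review may therefore carry anywhere from zero to four ratings, which is why the rating properties are marked as optional in the class definition.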

Relation: Review-Person

Every Review has one author.
The number of reviews per person is taken randomly from the normal distribution (mean value: 20, standard deviation: 6.6), restricted to values of 1 or more with no upper limit.
On average there is a new person generated every 20 reviews.

Relation: Review-Ratingsite

Every Review belongs to one rating site.
The number of reviews per rating site is taken randomly from the normal distribution (mean value: 10 000, standard deviation: 333), restricted to values of 1 or more with no upper limit.
On average there is a new rating site generated every 10000 reviews.

Dictionaries

Dictionary 1: Words from a set of product names (around 90,000 words)

Dictionary 2: Words from English text (todo: look for a corpus with English sentences, currently dictionary 1 is used)

Dictionary 3: Names of persons (around 90,000 names)


Appendix A: Changes

2010-11-29: Initial version of this document

Appendix B: Acknowledgements

The work on the BSBM Benchmark Version 3 is funded through the LOD2 project.
