This document defines the Business Intelligence use case of the Berlin SPARQL Benchmark (BSBM) - Version 3.1 for measuring the performance of storage systems that expose SPARQL endpoints. The benchmark is built around an e-commerce use case, where a set of products is offered by different vendors and different consumers have posted reviews about products. This benchmark query mix is composed of queries that represent analytical questions by different stakeholders like vendors, customers or the owners of the e-commerce portal.
The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems and are used within enterprise and open web settings. As SPARQL is taken up by the community there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. Such systems include native RDF stores, Named Graph stores, systems that map relational databases into RDF, and SPARQL wrappers around other kinds of data sources.
The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of these systems across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. In this variation of the BSBM we focus on the SPARQL 1.1 Working Draft whose features are already implemented in various RDF stores. That is we consider grouping, aggregates and sub-queries.
The rest of this document is structured as follows: Section 2 defines the schema of benchmark dataset and describes the rules that are used by the data generator for populating the dataset according to the chosen scale factor. Section 3 defines the benchmark queries. Sections 4 defines how a system under test is verified against the qualification dataset.All three scenarios use the same Benchmark Dataset .The bataset is built around an e-commerce use case, where a set of products is offered by different vendors and different consumers have posted reviews about products. The content and the production rules for the dataset are described in the BSBM Dataset Specification.
This section defines a suite of benchmark queries and a query mix.
The benchmark queries are designed to emulate independent analytical queries over the dataset. These are:
The query mix consists of the 8 queries of section 3.2 and 3.3 respectively. Besides that it also has following characteristics:
For queries that have a product type as parameter, the test driver will invoke the query multiple times with each time a random subtype of the product type of the previous invocation. The starting point of the drill-down is an a random type from a settable level in the hierarchy (typically the top level, but in case of query 7, the second level). The end point of a drill down is a product type on the lowest level of the hierarchy. The rationale for the drill-down mode is that depending on the parameter choice, there can be 1000x differences in query run time. Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy.
The queries are run in a random order on each thread in multiuser mode. Doing exactly the same thing on many threads is not realistic for large queries. The data access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities. For queries with a drill-down, the individual executions that make up the drill-down are still consecutive.
Each query is defined by the following components:
Use Case Motivation: A vendor wants to find out
which product categories get the most attention by people from a
certain country
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix rev: <http://purl.org/stuff/rev#>
Select ?productType ?reviewCount
{
{ Select ?productType (count(?review) As ?reviewCount)
{
?productType a bsbm:ProductType .
?product a ?productType .
?product bsbm:producer ?producer .
?producer bsbm:country %Country1% .
?review bsbm:reviewFor ?product .
?review rev:reviewer ?reviewer .
?reviewer bsbm:country %Country2% .
}
Group By ?productType
}
}
Order By desc(?reviewCount) ?productType
Limit 10
Parameters:
Parameter | Description |
---|---|
%Country1% | A randomly selected Country URI. |
%Country2% | A randomly selected Country URI. |
Use Case Motivation: A consumer wants to list
similar products to the product they are viewing right now.
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
SELECT ?otherProduct ?sameFeatures
{
?otherProduct a bsbm:Product .
FILTER(?otherProduct != %Product%)
{
SELECT ?otherProduct (count(?otherFeature) As ?sameFeatures)
{
%Product% bsbm:productFeature ?feature .
?otherProduct bsbm:productFeature ?otherFeature .
FILTER(?feature=?otherFeature)
}
Group By ?otherProduct
}
}
Order By desc(?sameFeatures) ?otherProduct
Limit 10
Parameters:
Parameter | Description |
---|---|
%Product% | A randomly selected Product URI. |
Use Case Motivation: Some stakeholder wants to get
a list of products with the highest increase in popularity in a certain
period for further investigation.
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
prefix rev: <http://purl.org/stuff/rev#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
Select ?product (xsd:float(?monthCount)/?monthBeforeCount As ?ratio)
{
{ Select ?product (count(?review) As ?monthCount)
{
?review bsbm:reviewFor ?product .
?review dc:date ?date .
Filter(?date >= "%ConsecutiveMonth_1%"^^<http://www.w3.org/2001/XMLSchema#date> && ?date < "%ConsecutiveMonth_2%"^^<http://www.w3.org/2001/XMLSchema#date>)
}
Group By ?product
} {
Select ?product (count(?review) As ?monthBeforeCount)
{
?review bsbm:reviewFor ?product .
?review dc:date ?date .
Filter(?date >= "%ConsecutiveMonth_0%"^^<http://www.w3.org/2001/XMLSchema#date> && ?date < "%ConsecutiveMonth_1%"^^<http://www.w3.org/2001/XMLSchema#date>) #
}
Group By ?product
Having (count(?review)>0)
}
}
Order By desc(xsd:float(?monthCount) / ?monthBeforeCount) ?product
Limit 10
Parameters:
Parameter | Description |
---|---|
%ConsecutiveMonth_X% | The date of the first day of a randomly selected month (Index 0). With index X the date will be set X months after the picked dat |
Use Case Motivation: A customer wants to inform herself which features have the most impact on the product price to get hints on how to restrict the following product search.
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
Select ?feature (?sumF*(?countTotal-?countF))/(?countF*(?sumTotal-?sumF)) As ?priceRatio
{
{ Select count(?price) As ?countTotal sum(xsd:float(xsd:string(?price))) As ?sumTotal
{
?product a %ProductType% .
?offer bsbm:product ?product ;
bsbm:price ?price .
}
}
{ Select ?feature count(?price2) As ?countF sum(xsd:float(xsd:string(?price2))) As ?sumF
{
?product2 a %ProductType% ;
bsbm:productFeature ?feature .
?offer2 bsbm:product ?product2 ;
bsbm:price ?price2 .
}
Group By ?feature
}
}
Order By desc(?priceRatio) ?feature
Limit 100
Parameters:
Parameter | Description |
---|---|
%ProductType% | A randomly selected Class
URI from the class hierarchy (except root category). |
Use Case Motivation: For advertisement reasons the
owners of the e-commerce platform want to generate profiles for the two
dimensions product type and the country of a customer.
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
prefix rev: <http://purl.org/stuff/rev#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
Select ?country ?product ?nrOfReviews ?avgPrice
{
{ Select ?country (max(?nrOfReviews) As ?maxReviews)
{
{ Select ?country ?product (count(?review) As ?nrOfReviews)
{
?product a %ProductType% .
?review bsbm:reviewFor ?product ;
rev:reviewer ?reviewer .
?reviewer bsbm:country ?country .
}
Group By ?country ?product
}
}
Group By ?country
}
{ Select ?product (avg(xsd:float(xsd:string(?price))) As ?avgPrice)
{
?product a %ProductType% .
?offer bsbm:product ?product .
?offer bsbm:price ?price .
}
Group By ?product
}
{ Select ?country ?product (count(?review) As ?nrOfReviews)
{
?product a %ProductType% .
?review bsbm:reviewFor ?product .
?review rev:reviewer ?reviewer .
?reviewer bsbm:country ?country .
}
Group By ?country ?product
}
FILTER(?nrOfReviews=?maxReviews)
}
Order By desc(?nrOfReviews) ?country ?product
Parameters:
Parameter | Description |
---|---|
%ProductType% | A randomly selected Class URI from the class hierarchy (except root category). |
Use Case Motivation: The stakeholders representing
the
e-commerce platform want to find potential spam reviewer, who rate
products by a specific producer much higher than the average.
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
prefix rev: <http://purl.org/stuff/rev#>
Select ?reviewer (avg(xsd:float(?score)) As ?reviewerAvgScore)
{
{ Select (avg(xsd:float(?score)) As ?avgScore)
{
?product bsbm:producer %Producer% .
?review bsbm:reviewFor ?product .
{ ?review bsbm:rating1 ?score . } UNION
{ ?review bsbm:rating2 ?score . } UNION
{ ?review bsbm:rating3 ?score . } UNION
{ ?review bsbm:rating4 ?score . }
}
}
?product bsbm:producer %Producer% .
?review bsbm:reviewFor ?product .
?review rev:reviewer ?reviewer .
{ ?review bsbm:rating1 ?score . } UNION
{ ?review bsbm:rating2 ?score . } UNION
{ ?review bsbm:rating3 ?score . } UNION
{ ?review bsbm:rating4 ?score . }
}
Group By ?reviewer
Having (avg(xsd:float(?score)) > min(?avgScore) * 1.5)
Parameters:
Parameter | Description |
---|---|
%Producer% | A producer URI (randomly selected) |
Use Case Motivation: A vendor wants information
about potential market niches to offer new products in the vendor's
country of origin.
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
Select ?product
{
{ Select ?product
{
{ Select ?product (count(?offer) As ?offerCount)
{
?product a %ProductType% .
?offer bsbm:product ?product .
}
Group By ?product
}
}
Order By desc(?offerCount)
Limit 1000
}
FILTER NOT EXISTS
{
?offer bsbm:product ?product .
?offer bsbm:vendor ?vendor .
?vendor bsbm:country ?country .
FILTER(?country=%Country%)
}
}
Parameters:
Parameter | Description |
---|---|
%Country% | A country URI (randomly selected) |
%ProductType% | A random product type (all levels) |
Use Case Motivation: A vendor or customer wants to
find "discounter" vendors for competitor analyses and procurement
respectively.
SPARQL Query:
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
Select ?vendor (xsd:float(?belowAvg)/?offerCount As ?cheapExpensiveRatio)
{
{ Select ?vendor (count(?offer) As ?belowAvg)
{
{ ?product a %ProductType% .
?offer bsbm:product ?product .
?offer bsbm:vendor ?vendor .
?offer bsbm:price ?price .
{ Select ?product (avg(xsd:float(xsd:string(?price))) As ?avgPrice)
{
?product a %ProductType% .
?offer bsbm:product ?product .
?offer bsbm:vendor ?vendor .
?offer bsbm:price ?price .
}
Group By ?product
}
} .
FILTER (xsd:float(xsd:string(?price)) < ?avgPrice)
}
Group By ?vendor
}
{ Select ?vendor (count(?offer) As ?offerCount)
{
?product a %ProductType% .
?offer bsbm:product ?product .
?offer bsbm:vendor ?vendor .
}
Group By ?vendor
}
}
Order by desc(xsd:float(?belowAvg)/?offerCount) ?vendor
limit 10
Parameters:
Parameter | Description |
---|---|
%ProductType% | A randomly selected Class URI from the class hierarchy (all levels). |
Each query is defined by the following components:
Use Case Motivation: A vendor wants to find out
which product categories get the most attention by people from a
certain country
SQL Query:
SELECT productType, COUNT(*) AS productCount
FROM Product pd, Producer pr, ProductTypeProduct ptp, Review rv, Person pn
WHERE pd.producer = pr.nr AND rv.product = pd.nr AND pn.nr = rv.person AND ptp.product = pd.nr
AND pr.country = %Country1% AND pn.country = %Country2%
GROUP BY productType
ORDER BY COUNT(*) DESC, ProductType
LIMIT 10
Parameters:
Parameter | Description |
---|---|
%Country1% | A randomly selected Country URI. |
%Country2% | A randomly selected Country URI. |
Use Case Motivation: A consumer wants to list
similar products to the product they are viewing right now.
SQL Query:
SELECT pfp2.product, COUNT(*) AS samefeatureS
FROM Product pd, ProductFeatureProduct pfp1, ProductFeatureProduct pfp2
WHERE pd.nr = %Product% AND pd.nr = pfp1.product AND pfp1.productFeature = pfp2.productFeature
GROUP BY pfp2.product
ORDER BY COUNT(*) DESC, pfp2.product
LIMIT 10
Parameters:
Parameter | Description |
---|---|
%Product% | A randomly selected Product URI. |
Use Case Motivation: Some stakeholder wants to get
a list of products with the highest increase in popularity in a certain
period for further investigation.
SQL Query:
SELECT month0.product, month1.monthCount/month0.monthCount AS ratio
FROM (SELECT pd1.nr AS product, COUNT(*) AS monthCount
FROM Product pd1, Review rv1
WHERE pd1.nr = rv1.product AND rv1.reviewDate >= "%ConsecutiveMonth_1%" AND rv1.reviewDate < "%ConsecutiveMonth_2%"
GROUP BY pd1.nr) AS month1,
(SELECT pd0.nr AS product, COUNT(*) AS monthCount
FROM Product pd0, Review rv0
WHERE pd0.nr = rv0.product AND rv0.reviewDate >= "%ConsecutiveMonth_0%" AND rv0.reviewDate < "%ConsecutiveMonth_1%"
GROUP BY pd0.nr) AS month0
WHERE month0.product = month1.product
ORDER BY month1.monthCount/month0.monthCount DESC, month1.product
LIMIT 10
Parameters:
Parameter | Description |
---|---|
%ConsecutiveMonth_X% | The date of the first day of a randomly selected month (Index 0). With index X the date will be set X months after the picked dat |
Use Case Motivation: A customer wants to inform herself which features have the most impact on the product price to get hints on how to restrict the following product search.
SQL Query:
SELECT feature.productFeature, feature.sumTotal*(global.cntTotal-feature.cntTotal)/
(feature.cntTotal*(global.sumTotal-feature.sumTotal) AS priceRatio
FROM (SELECT COUNT(*) AS cntTotal, SUM(ofr.price) AS sumTotal
FROM ProductTypeProduct ptp, Offer ofr, ProductFeatureProduct pfp
WHERE ptp.product = ofr.product AND ptp.productType = %ProductType% AND ptp.product = pfp.product
GROUP BY pfp.productFeature) as feature,
(SELECT COUNT(*) AS cntTotal, SUM(ofr.price) AS sumTotal
FROM ProductTypeProduct ptp1, Offer ofr1
WHERE ptp1.product = ofr1.product AND ptp1.productType = %ProductType%) AS global
ORDER BY priceRatio DESC
LIMIT 10
Parameters:
Parameter | Description |
---|---|
%ProductType% | A randomly selected Class
URI from the class hierarchy (except root category). |
Use Case Motivation: For advertisement reasons the
owners of the e-commerce platform want to generate profiles for the two
dimensions product type and the country of a customer.
SQL Query:
SELECT pdrv1.product, pdrv1.country, pdrv1.nrOfReviews, price.avgPrice
FROM (SELECT pdrv0.country, MAX(pdrv0.nrOfReviews) AS maxCount
FROM (SELECT pn0.country, rv0.product, COUNT(*) AS nrOfReviews
FROM Review rv0, Person pn0, ProductTypeProduct ptp0
WHERE rv0.person = pn0.nr AND rv0.product = ptp0.product AND ptp0.productType = %ProductType%
GROUP BY pn0.country, rv0.product) AS pdrv0
GROUP BY pdrv0.country) AS maxrv,
(SELECT pn1.country, rv1.product, COUNT(*) AS nrOfReviews
FROM Review rv1, Person pn1, ProductTypeProduct ptp1
WHERE rv1.person = pn1.nr AND rv1.product = ptp1.product AND ptp1.productType = %ProductType%
GROUP BY pn1.country, rv1.product) AS pdrv1,
(SELECT ofr.product, AVG(ofr.price) AS avgPrice
FROM Offer ofr
GROUP BY ofr.product) AS price
WHERE pdrv1.country = maxrv.country AND pdrv1.nrOfReviews = maxrv.maxCount AND pdrv1.product = price.product
ORDER BY pdrv1.nrOfReviews DESC, pdrv1.country, pdrv1.product
Parameters:
Parameter | Description |
---|---|
%ProductType% | A randomly selected Class URI from the class hierarchy (except root category). |
Use Case Motivation: The stakeholders representing
the
e-commerce platform want to find potential spam reviewer, who rate
products by a specific producer much higher than the average.
SQL Query:
SELECT rv.person, AVG((rv.rating1 + rv.rating2 + rv.rating3 + rv.rating4) / 4) AS score
FROM Review rv, Product pd
WHERE rv.product = pd.nr AND pd.producer = %Producer%
GROUP BY rv.person
HAVING AVG((rv.rating1 + rv.rating2 + rv.rating3 + rv.rating4) / 4) >
1.5*(SELECT AVG((rv1.rating1 + rv1.rating2 + rv1.rating3 + rv1.rating4) / 4) AS avgScore
FROM Review rv1, Product pd1
WHERE rv1.product = pd1.nr AND pd1.producer = %Producer%)
Parameters:
Parameter | Description |
---|---|
%Producer% | A producer URI (randomly selected) |
Use Case Motivation: A vendor wants information
about potential market niches to offer new products in the vendor's
country of origin.
SQL Query:
CREATE VIEW v6(product) AS
SELECT ptp.product
FROM ProductTypeProduct ptp, Offer ofr
WHERE ptp.product = ofr.product AND ptp.productType = %ProductType%
GROUP BY ptp.product
ORDER BY COUNT(*) DESC
LIMIT 1000;
SELECT product
FROM v6
WHERE NOT EXISTS (SELECT 42
FROM Offer ofr, Vendor vd
WHERE v6.product = ofr.product AND ofr.vendor = vd.nr AND vd.country = %Country%);
DROP VIEW v6
Parameters:
Parameter | Description |
---|---|
%Country% | A country URI (randomly selected) |
%ProductType% | A random product type (all levels) |
Use Case Motivation: A vendor or customer wants to
find "discounter" vendors for competitor analyses and procurement
respectively.
SQL Query:
SELECT ofr.vendor, COUNT(pavg.avgPrice)/COUNT(*) AS belowRatio
FROM Offer ofr
LEFT JOIN ProductTypeProduct ptp
ON ofr.product = ptp.product AND ptp.productType = %ProductType%
LEFT JOIN (SELECT ptp1.product, AVG(of1.price) AS avgPrice
FROM Offer of1, ProductTypeProduct ptp1
WHERE of1.product = ptp1.product AND ptp1.productType = %ProductType%
GROUP BY ptp1.product) AS pavg
ON ptp.product = pavg.product AND ofr.price < pavg.avgPrice
GROUP BY vendor
ORDER BY COUNT(pavg.avgPrice)/COUNT(*), ofr.vendor
LIMIT 10
Parameters:
Parameter | Description |
---|---|
%ProductType% | A randomly selected Class URI from the class hierarchy (all levels). |
TODO: Adapt to Use Case
Before the performance of a SUT is measured, it has to be verified
that the SUT returns correct results for the benchmark queries.
For testing whether a SUT returns correct results, the BSBM
benchmark provides a qualification dataset and a qualification tool
which compares the query results of a SUT with the correct query
results. At the moment, the qualification tool verifies only the
results of SELECT queries. The results of DESCRIBE and CONSTRUCT
queries (queries 9 and 12) are not checked.
A BSBM qualification test is conducted in the two-step procedure
described below:
$ java -cp bin:lib/* benchmark.testdriver.TestDriver -q http://SUT/sparqlThis will create a qualification file named "run.qual" (different file name can be specified with the "-qf" parameter) which is used in step 2. Also the run.log (if logging is set to "ALL" in the log4j.xml file) contains all queries with full result text, so single queries can be examined later on.
where http://SUT/sparql specifies the SPARQL endpoint
Option | Description |
-rc | Only check the amount of results returned and not the result content. |
-ql <qualification log file name> | Specify the file name to write the qualification test results into. |
$ java -cp bin:lib/* benchmark.qualification.Qualification correct.qual run.qualThis generates by default a log file called "qual.log" with the following content:
where run.qual is the qualification file generated by the Test Driver in qualification mode
For more information about RDF and SPARQL Benchmarks please refer to:
The work on the BSBM Benchmark Version 3 is funded through the LOD2 project.