BIBM is a set of test tools to benchmark and validate relational database management systems (tables and/or property graph variants). It can run BSBM and TPC-H test suites.[a]
The source data should be generated with the standard DBGEN program found at the http://www.tpc.org/tpch/ page. The database under the test can be loaded with any suitable means. BIBM offers program csvload.sh to load csv files into an SQL database (see paragraph 7 of this user manual). This script must be supplied with the schema file which define rules how to convert csv files into database tables. An example of such a schema can be found in the directory <bibm-root>/tpch/virtuoso/tpch_schema.json. Also in that directory can be found TPC-H SQL schema (schema.sql) and customized script tblLoad.sh.
To run TPC-H SQL test suite, use <bibm-root>/tpchdriver script. Pass -sql and -scale <scale> options. The scale must be the same as was passed to the TPC-H data generator. However, if the scale is wrong, it affects only performance figures in the test driver report.
The test driver must be supplied with the following information:
a) Which JDBC driver to use. The default JDBC driver name is com.mysql.jdbc.Driver, but it is not included in the BIBM distribution. However, the distribution contains driver virtuoso.jdbc4.Driver.
If desired driver is not in the distribution, add jar file with that driver in the directory <BIBM-root>/lib. If the desired driver has the name different from the default, pass option -dbdriver <DB-Driver Class Name>.
b) Which test suite to run. The BIBM distribution contains directory tpch/sql with the reference TPC-H test suite. Pass it with -uc tpch/sql option. To run test suite with other location, see 4.2 b).
c) How to connect to the system under test. Usually SQL endpoint has the form jdbc:dataSourceName://host:port/[?options]. For example, following URI works with OpenLink Virtuoso DBMS: jdbc:virtuoso://<host>:<port>/UID=dba/PWD=dba. Pass the endpoint as argument, without any option tag. For multiple and update endpoints, the rules described in 4.2 c) apply.
The above options are sufficient to run smoke test to see if everything works.
To run TPC-H power test, declare refresh functions RF1 and RF2 (see TPC-H specification) with the options -rf1 and -rf2. In the BIBM, a refresh function is represented as an external program which accepts one parameter - the stream number. For the Power test, stream number is 0. For the Throughput test, stream number is from 1 to <number of clients>.
The number of clients for the throughput test can be set with the option -mt <number of clients>. It’s default value is the minimum number of clients for the given scale factor, as defined in the TPC-H spec 5.3.4. For example, for scale factor=1, number of clients=2.
The test driver always prints execution summary in the file report.txt.
To run TPC-H validation test, the same tpchdriver script as in 1.12. is used, with any combination of options -q, -qf, and -qcf. In this mode, query mix is executed once, as for power test. Option -mt has no effect. Option -defaultparams must be set to turn random query parameter generation off (the file tpch/valid.qual is for default parameters only).
Options -q and -qf <outpit qualification file> order to produce qualification file with answers to queries. This file can be lately compared with other qualification file with Comparator (see chapter 5 ).
Option -qcf <valid qualification file> orders to perform comparison on the fly.
To run TPC-H ACID test, the <bibm-root>/tpchaciddriver is used. In this mode, all queries are embedded in the test driver. However, usecase directory must be specified, with optional “querymix.txt” and “ignoreQueries.txt” files, and with mandatory “Q1_1.sql” and “Q1_2.sql” files. Sql files contain Q1 query with different substituted parameter [delta], ad defined in the TPC-H spec 3.4.2.6 “Isolation Test 6”. “querymix.txt” and “ignoreQueries.txt” may contain only following qery names:
a1, a2, c1, i1, i2, i3, i4, i5, i6.
The leading letter denotes type of the test (Atomicity, Consistency, or Isolation), and the digit denotes the number of test as it is defined in the spec. Durability tests are not implemented.
To run BSBM test suite, use <bibm-root>/bsbmdriver script. The user must supply following information:
a) The name of the directory containting test driver data. This data were generated by the BSBM data generator alongside with the triple data (which already must be loaded to the triple store) and has default name td_data. It could be changed by the option -dir <directory name> passed to the data generator.
If it’s real name is not “td_data” or it is not located in the current working directory, pass its location to the testdriver by the option -idir <directory name>.
b) Which test suite(s) to run. The BIBM distribution contain directory bsbm with several flavours of the BSBM test suite. Choose one of them and pass its name (relative to the BIBM root) with the test driver option -uc <relative test suite name>. As an alternative, so called usecase file may be passed with option -ucf (again, name relative to the BIBM root).
The reason why the tesdriver accepts relative test suite names is because the script bsbmdriver passes option -qrd <BIBM root> to the tesdriver java program. If the test suite to run is located outside the BIBM root, set your own -qrd option.
In principle, there can be several -uc options (or usecase file may contain several lines with test suite names), but be aware that in case of query name clash, one of conflicting query names would be changed by test driver (internally, not on the disk), and the reports produced by the test driver would contain such artifical names.
c) How to connect to the system under test. The URL(s) of the SPARQL endpoint(s) are passed as arguments, without an option tag. If several URLs are passed, they are used in a round-robbin manner. If the test suite contains updating queries and that queries require separate endpoint(s), that endpoint(s) should be passed with the -u option.
A test suite in BIBM test format is represented by a directory which contains:
- query files. Query file name is “query<query name>.txt”. The <query name> is usually a decimal number, but also may be any string without spaces and dots.
- query description files. Description file name is “query<query name>.desc” (BIBM syle) or “query<query name>desc.txt” (BSBM style). If both files are present, BIBM -style is taken.
- exclude list. This is a file named “ignoreQueries.txt”.
- include list. This is a file named “querymix.txt”.
- file or directory named “querydescriptions”.
Files and directories with other names are ignored by the BIBM test driver.
If not all query files are supposed to be run, there are two ways to designate the exact set of queries to run: to list wanted queries in the include list (file “querymix.txt”) and to list unwanted queries in the exclude list (file named “ignoreQueries.txt”). The list files are text files and may contain only pure query names (not query file names) separated with spaces or carriage returns. The effective query set is wanted query set minus unwanted query set. If the file “querymix.txt” is absent, then all queries are considered wanted. If the file “ignoreQueries.txt” is absent, then unwanted query set is considered empty.
For each query to be run (that is, excluding ignored queries) there must exist a description with corresponding name. Description can be placed in the test suite directory or in a separate descriptions directory. Using separate descriptions directory is convenient when several flavours of the same test suite exist with similar descriptions.
The descriptions directory name is determined as follows:
If directory named “querydescriptions” exists in the testsuite directory (it can be a soft link), it is the descriptions directory.
If file named “querydescriptions” exists in the testsuite directory , then it’s content is the name of the descriptions directory.
If directory named “../querydescriptions” exists, it is the descriptions directory.
First, the test description is searched in the test suite directory, and then in the descriptions directory, if the latter exists.
Query file contains text in SQL or SPARQL with formal parameters. Formal parameter is a substring surrounded with parameterchar. Parameterchar is ‘@’ for SQL and ‘%’ for Sparql. Tests driver substitutes formal parameters with actual values before the query execution. If option -defaultparams is set in the tesdriver command, then parameter values are taken from the parameter declaration in the query description, otherwise values are randomly selected according to the class of parameter declared in the parameter declaration, so if the query is executed more than once, actual values can be different. The rules of substitution are described in the query description file. If a formal parameter is not described in the description file, it is left intact (this allows to use parameterchar for other purposes). The formal parameter STREAM_ID is considered declared globally and takes actual value to be equal to the number of the thread which executes the query.
This is identical to the format used by the original BSBM test driver.
Query description is written in JSON-like syntax. Field names are not included in quotes and are case-insensitive. Numerical values are not included in quotes. String values can be surrounded with quotes or not, depending on their position.
Description is a JSON-object with attributes:
{queryType:<query type>,
sequence:[<query type>...],
params:[<parameter description>...],
results:[ <result description>...],
resultKeys:[<integer>...]
}
No one attribute is mandatory.
3.2.1. Query Types and type Sequences
The scalar attribute <query type> describes the type of the query. Default value is select.
Query type values are case-insensitive.
<query type>: one of select describe construct update try
select: query is expected to return set or results (rows).
update: query is expected to return a number.
try: ignore possible errors and results. Useful to delete probably existing table.
The list attribute <sequence> is only used for complex queries (e.g. TPC-H query 15), which consist of several statements delimited with semicolon (‘;’). Each element of the sequence describes the type of corresponding statement. The sequence must contain single select element, which is considered the main element.
3.2.2. Query Parameters
The list attribute Params describes the rules to substitute formal parameters with actual values. The order of parameter descriptions is insignificant (exceptions exist).
Parameter description is a JSON-object with attributes:
{ name:<parameter name>,
class: <parameter class>,
range:[<value>...],
default:<value>
}
Name and Class fields are mandatory. Range and Default must correspond to the Class.
<parameter class> defines the method to randomly generate parameter values. Each test suite (BSBM or TPC-H) has its own set of parameter classes. BSBM classes are not described in this document. THC-H classes are described in the Appendix A.
3.2.3. Query Results Description
The list field Results describes the types and validation rules for the expected query results.
The result descriptions must be listed strictly in the order the columns appear in the result set,
{column:<column name>,
type: <column type>,
check: <checks>
}
No one field is mandatory.
The field <column name> is only for reference and is not used by the test driver and validator.
The field <column type> is one of: str, real, int, decimal. The default is str.
The field <checks> is one of: <delta check> <percent check>
<delta check> := $<delta>
<percent check>:=<percent>%
<delta>:= <integer>
<percent>:= <integer>
The field <checks> can also be a list with both types of checks.
Delta check prescribes that result is valid when it is within <delta> of the query validation output data. In the TPC-H spec, only delta check equal to $100 is used.
Percent check prescribes that result is valid when it’s rounded value is within <percent> of the query validation output data. In the TPC-H spec, only percent check equal to 1% is used.
If no checks declared, then the result is valid when it is strictly equals the the validation output data.
3.2.4. Query Results Description Keys
The attribute resultKeys in the query description influences the ways results are compared.
If it is absent, results from the valid result set and the checked result set are compared side by side, in linear order. However, a query may lack of order by clause and results may be valid but delivered in different order. To correctly match rows of valid and checked row sets, a subset of columns which unambiguously identify the row, must be declared. A good candidate for this identification set is the columns declared in the group by clause.
Thef resultKeys attribute lists the numbers of such identification columns, counted from 1.
The root directory of the distribution is referenced below as BIBM-root.
Test drivers are invoked with the help of testdriver scripts located in the BIBM-root.
They are invoked as follows:
testdriver <options> endpoints...
where endpoint is either URL of the HTTP SPARQL Endpoint, or URI of JDBC SQL endpoint.
There are 3 testdriver scripts:
bsbmdriver - to run bsbm-styled test suite
tpchdriver - to run tcph-styled performance test suite
tpchaciddriver - to run tcph-styled ACID test suite
4.1.1 Options applicable for all modes.
-version
prints the version of the Test Driver
-dry-run
makes parameter substition in queries but does not actually runs them.
Implies -printres option
-err-log <file name>
sets log file to write error messages
default: print errors only to stderr
-w <number of warm up runs before actual measuring>
default: 0
Warm up runs are not reflected in the reports.
-wud
allow to run queries of update type during warm ups
-runs <number of query mix runs>
default: 1
-qrd <query root directory>
Where to look for the directories listed in the use case file.
default: BIBM-root (current working directory if tesdriver invoked as java class)
-uc <use case query mix directory>
Specifies a query mix directory.
Can be used several times, then the queries from all directories are executed.
If case of query name clash, conflicting name is changed for the reporting.
The name is relative to the <query root directory>.
-ucf <use case file name>
Specifies where the use case description file can be found.
Usecase file contain lines of the form:
querymix=<directory name>
Then each <directory name> is handled as with -uc <directory name> option.
default: none
-o <benchmark results output file>
default: benchmark_result.xml
-mt <Number of clients>
Run multiple clients concurrently.
default: 1
-seed <Long Integer>
Init the Test Driver with another seed than the default.
default: 808080
-t <timeout in ms>
Timeouts will be logged for the result report.
default: 0ms
-qf <output qualification file name>
Generate output qualification file with the given name
-q
Generate output qualification file with the default name “run.qual“
-qcf <input qualification file name>
To turn on on-the-fly comparison of result sets.
default: none.
-sut <sutcommand>
<sutcommand> - the command to run external program, quoted if necessary
The sutcommand is expected to return the server's CPU time formatted as hh:mm:ss.
It is called twice, at the start of the whole test driver run, and at the end.
The difference is printed in the report.
default: none
-defaultparams
use default parameters for parameter substitution
default: false
4.1.2 Options applicable only for running SPARQL queries
-url <common prefix for all endpoints>
default: <empty string>
-u <Sparql Update Service Endpoint URL>
Use this if you have SPARQL Update queries in your query mix.
-dg <Default Graph>
default: null
-uqp <update query parameter>
The parameter name for update operation in the HTTP query string.
default: update
-retry-msg <message from server indicating deadlock>
If HTTP server returned error message which starts with the option parameter,
then retry this query.
default: <null, that does not match any string>
-retry-max <number of attemts to replay query if deadlock error message received>
default: 3
-retry-int <time interval between attempts to replay query (milliseconds)>
Increases by 1.5 times for each subsequent attempt.
default: 200 ms
-retry-intmax <upper bound of time interval between attempts to replay query (milliseconds)>
If set, actual retry-int is picked up randomly between set retry-int and retry-intmax
default: equals to retry-int
4.1.3 Options applicable only for running SQL queries
-sql the queries are written in SQL.
use JDBC connection to a RDBMS. Instead of a SPARQL-Endpoint, a JDBC URL has to be supplied.
default: not set
-dbdriver <DB-Driver Class Name>
default: com.mysql.jdbc.Driver
-idir <data input directory>
The input directory for the Test Driver data
default: td_data
-udataset <update dataset file name>
Specified an update file generated by the BSBM dataset generator.
-rampup
Run ramp-up to reach steady state.
-drill
Queries with parameter ProductType are executed several times in one mix,
once for each allowed level of product type.
Queries are shuffled before each mix run.
default: false
-scale <scale factor>
<scale factor> is the scale factor of the database being tested.
This value is used only to calculate performance figures.
-rf1 <refresh command>
<refresh command> - the command to run the refresh function 1 (insert, see TPC-H spec)
-rf2 <refresh command>
<refresh command> - the command to run the refresh function 2 (delete, see TPC-H spec)
Comparator compares qualification data files produced by previous runs of the test driver (with -q or -qf options).
Usage: BIBM-root/compareresults.sh [?-full] <base.qual> <checked.qual>
where <base.qual> is the qualification file considered valid, and <checked.qual> is the qualification file to be checked.
If option -full is passed, then full comparison is performed, otherwise, at most one error message is printed for a query.
The BIBM distribution contains file tpch/valid.qual with valid query answers for scale factor 1.
This file can be used in to check query results with the comparator, as well as with the -qcf option of the test driver.
Converter reads files formatted as comma-sparated-values and generates files formatted in Terse Triple RDF Language (Turtle). It also can generate SPARQL/Update queries and load data into SPARQL store on the fly. Note that usually SPARQL stores usually have more efficient facilities to load data.
Usage: cvs2ttl.sh [options]... [sourcefiles]...
Sourcefiles can be directories, then they are scanned for .csv files (not recursively).
Options applicable for all modes:
-ext <input file extention>
Used while searching in souce directories.
default: csv
-schema <conversion schema>
<conversion schema> is a json file, describing conversion rules.
-mt <number of client threads>
default: number of processors (cores) on the host machine
For TPC-H database, <conversion schema> can be found in BIBM-root/tpch/rdfh_schema.json.
Options applicable for convertion mode:
-d <destination directory>
default: current working directory
Options applicable for loading mode:
-u <Sparql Update Service Endpoint URL>
Indicates loading mode. Just like a testdriver, csv converter can be assigned with several endpoints.
-uqp <update query parameter>
The parameter name for update operation in the HTTP query string.
default: update
-dg <default graph>
add &default-graph-uri=<default graph> to the http request
-t <timeout in ms>
Timeouts will be logged for the result report.
default: 0
-tpq <triples per query>
default: 500
-uqp <update query parameter>
The forms parameter name for the query string.
default: update
-url <common prefix for all endpoints>
url <common prefix for all endpoints>
default: empty string
Loader loads csv files into SQL database via JDBC connection. Database tables should be already created before loading.
Usage:csvload.sh [options]... [sourcefiles]...
Sourcefiles can be directories, then they are scanned for .csv files (not recursively).
Options:
-d <destination directory>
default: current working directory
-dbdriver <DB-Driver Class Name>
default: com.mysql.jdbc.Driver
-dburl <URL of database endpoint>
-ext <input file extention> (to search in souce directories)
default: 'csv''
-loadsql <sql file>
sql file to exec before loading.
must contail single sql statement.
Option can be used more than once
-mt <integer>
number of client threads
default: 10
-schema <conversion schema> (json file)
-t <integer>
timeout in milliseconds
default: 10000
For TPC-H database, <conversion schema> can be found in BIBM-root/tpch/tpch_schema.json.
Definitions of classes refer to the TPC-H spec. The phrase “previously declared parameter of class N” means that the declaration of the parameter of class N precedes the declaration of the current parameter in the query description, not in the query text.
Brand - returns string of type "BRAND#MN” where M and N are two single character strings representing two numbers randomly and independently selected within [1 .. 5];
Date - requires range list containing start date and end date formatted as YYYY-MM-DD. Returns the date of randomly distributed day within the start and end dates, inclusive.
Color - returns randomly selected color from the list specified in the TPC-H spec 4.2.3 Test Database Data Generation.
Container - returns 2-syllable string randomly selected within the list of 2-syllable strings defined for Containers in TPC-H spec Clause 4.2.2.13.
CountryCodeSet - returns string '[I1]','[I2]','[I3]','[I4]','[I5]','[I6]','[I7]' where I1...I7 are randomly selected without repetition from the possible values for Country code as defined in Clause 4.2.2.9.
Note that unlike other parameter classes, this class returns the whole list of parameters, not a single one.
Fraction - returns floating point value equal to 0.0001 / SF, where SF is the scale factor.
OneOf - returns randomly chosen value from the range list.
Month - requires range list containing start month and end month formatted as YYYY-MM. Returns randomly distributed month within the start and end months, inclusive.
Nation - returns randomly selected string within the list of values defined for N_NAME in Clause 4.2.3;
Nation2 - returns randomly selected string within the list of values defined for N_NAME in Clause 4.2.3 which is different from the value of previously declared parameter of class Nation.
Random requires range of two decimals with equal precision or two integers. Returns value with the same precision randomly selected within the range boundaries.
Region returns string randomly selected within the list of values defined for R_NAME in Clause 4.2.3.
RegionForNation returns Region string for previously declared parameter of class Nation.
Shipmode returns string randomly selected within the list of values defined for Modes in Clause 4.2.2.13.
Shipmode2 returns string randomly selected within the list of values defined for Modes in Clause 4.2.2.13 which is different from the value of previously declared parameter of class Shipmode.
Type requires range of 1,2, or 3 integers, each of which denote syllable column of which parameter string should be concatenated. Variants for each syllable are defined for Types in Clause 4.2.2.13.
Test query - a text in SQL or SPARQL query languages, used to test a database management system or RDF storage in order to measure performance and/or correctness.
SUT - system under test. A process running a database management system or RDF storage and able to execute test queries.
Test suite - a set of test queries and accompanying information. It is characterized with:
- the programming language is SQL or SPARQL
- the test format.
Test format - the rules on how to represent test queries and accompanying information: test descriptions, lists of excluded lists etc. Relevant test formats are BSBM and BIBM .
Query mix - depending on the context, may mean:
- test suite
- subset of the test suite to be executed (probably, full test suite)
- execution of the subset of the test suite
Use case - one or more test suites (query mixes) in a file system.
BSBM - Berlin SPARQL Benchmark. Consists of a test suite, represented both in SPARQL and SQL languages, and a set of tools: data generator, test driver, and qualification. See http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark.
TPC-H - Transaction Processing Performance Council. Organisation which publishes a benchmark of the same name. The benchmark consists of a test suite in SQL language and a data generator. See www.tpc.org.
RGDH - Relational Graph Data H is a test suite similar to the TPC-H but driven by the SPARQL query language and aimed at relational property graph engines with semantically interlinked datasets in 3-tuple (triples) form.
O-BIBM. A set of benchmark tools forked from the BSBM tools version 3 in January, 2011. BSBM data generator is left intact. BSBM test driver is refactored but is compatible with the original test driver in the fields of accepted parameters and the BSBM test format.
The main extension is the ability to work with the TPC-H and RDFH test suites. For this purpose, BIBM test format was developed.
[b]
[a]alexei.kaigorodov:
todo: list all tools
[b]kidehen:
No screenshots? Especially showing CSV results and import of said results into spreadsheets e.g. Google spreadsheet