The WIQA Engine: Implementation Notes

Author:: Chris Bizer (chris@bizer.de); Richard Cyganiak (richard@cyganiak.de)
Version:: $Id: index.html,v 1.11 2006/07/04 22:10:47 bizer Exp $

Abstract

This document describes the implementation and internal architecture of the WIQA Engine and outlines the multi-step process that is required to check a set of RDF graphs against a WIQA policy and to generate WIQA explanations.

1. Overview
2. Package list
3. Policy objects
4. The parser
- 4.1. Handling of extension functions
5. Query execution
6. The graph pattern tree
7. Evaluating count constraints
8. Generating explanations
9. Change log

1. Overview

The WIQA engine is a library for filtering Semantic Web information. The input is a set of RDF graphs, each named with a URI. The output is an RDF graph that contains only those triples from the source graphs that match a certain policy.

In addition to the filtering, the engine can generate textual explanation objects that explain to a user why a certain triple was accepted into the output graph. The power of the policy language can be extended by defining custom Functions and Extensions.

Policies are defined using the WIQA-PL policy language. Its syntax is inspired by the SPARQL query language. The WIQA engine builds on ARQ, the SPARQL implementation that is shipped with the Jena Semantic Web Framework.

This document assumes familiarity with the WIQA engine's public API.

2. Package list

Package	Purpose
de.fuberlin.wiwiss.wiqa	The engine's public API.
de.fuberlin.wiwiss.wiqa.bindings	Code for dealing with streams of solutions and for grouping solutions by shared variable values. Solutions are called bindings inside the codebase because the ARQ class `com.hp.hpl.jena.query.core.Binding` is used as an implementation of the concept.
de.fuberlin.wiwiss.wiqa.count	Code for evaluating count constraints. Implements the algorithm described in section 7.
de.fuberlin.wiwiss.wiqa.engine	The query engine's core, especially those parts that hook into Jena's ARQ query processor.
de.fuberlin.wiwiss.wiqa.example	Some usage examples.
de.fuberlin.wiwiss.wiqa.explanation	Code for generating explanations. Implements the algorithm described in section 8.
de.fuberlin.wiwiss.wiqa.gp	Code for arranging graph patterns into a tree. Implements the algorithm described in section 6.
de.fuberlin.wiwiss.wiqa.extensions	Contains various functions and extensions that add capabilities to the engine, such as a complex trust filtering metric.
de.fuberlin.wiwiss.wiqa.parser	Hand-authored support code for the JavaCC-generated WIQA-PL parser.
de.fuberlin.wiwiss.wiqa.parser.javacc	JavaCC-generated parser for WIQA policies.
de.fuberlin.wiwiss.wiqa.vocab	RDF vocabularies used by WIQA.

3. Policy objects

A WIQA policy is a set of patterns and constraints. Policies operate on an input dataset (described in the API overview). Input triples that match the patterns and constraints are accepted into an output RDF graph. Policies are written using the WIQA-PL language, whose features are described in the WIQA-PL Language Specification.

The following example policy has five graph patterns, several explanation templates, and a count constraint. Namespace prefix declarations have been omitted.

NAME "Asserted by analysts with at least 3 positive ratings"
DESCRIPTION "Only accept information that has been asserted
    by analysts who have received at least 3 positive ratings."
PATTERNS { 
    GRAPH fd:GraphFromAggregator {
        EXPL "it was asserted by " ?authority " and " .
        ?GRAPH swp:assertedBy ?warrant .
        ?warrant swp:authority ?authority . 
    }
    GRAPH ?graph2 {
        ?authority rdf:type fin:Analyst .
    }
    GRAPH fd:GraphFromAggregator {
        EXPL ?authority2 " claims that " ?authority " is an analyst." .
        ?graph2 swp:assertedBy ?warrant2 .
        ?warrant2 swp:authority ?authority2 . 
    }
    GRAPH ANY {
        EXPL ?authority "has received positive ratings from" .
        ?rater fin:positiveRating ?authority . 
        FILTER (wiqa:count(?rater) >= 3) . 
    }
    GRAPH fd:BackgroundInformation {
        EXPL ?rater "who works for" ?company .
        ?rater fin:affiliation ?company . 
    } 
}

These policies are parsed into a cluster of Java objects centered around the Policy class.

UML diagram showing the classes that make up a WIQA policy

Policies consist of several parts:

Name and description: These are informatory text.
Graph patterns: A graph pattern is a set of triple patterns that is matched against certain named graphs from the input dataset. It may also contain additional constraint expressions (FILTER clauses). In policy objects, each graph patterns is represented by an ARQ ElementNamedGraph instance.
Graph pattern tree: In WIQA-PL, graph patterns appear as a simple list. But for some steps of query processing, it is necessary to order them into a tree of graph patterns. This is described in more detail in section 6.
Root constraint expressions: Constraint expressions (FILTER clauses) can occur not just as part of graph patterns, but also on the root level of the policy.
Extensions: In WIQA-PL, extension calls look just like normal FILTER expressions, but the engine treats them differently because they are evaluated at a different stage of query execution. In policy objects, they are represented as instances of ARQ's Extension interface.
Count constraints: Just as extension calls, Count constraints are evaluated at a stage different from usual FILTER conditions. In policy objects, they are represented as instances of the CountConstraint class. Their evaluation is described in section 7.
Explanation templates: Explanations are generated from instances of the ExplanationTemplate class. The templates are attached to nodes of the graph pattern tree. Section 8 describes explanations in more detail.

4. The parser

The parser's job is to turn a policy, written down in WIQA-PL, into a Policy object. Application programmers access it through the PolicyParser facade, the full implementation is in the de.fuberlin.wiwiss.wiqa.parser package.

The parser is derived from ARQ's SPARQL parser and built using the JavaCC parser generator. Most of the parser code (the javacc subpackage) is automatically generated from a grammar definition file by JavaCC. The grammar file can be found in the distribution as /grammar/wiqa.jj. It was created by adapting ARQ's sparql.jj. Most low-level grammar structures, such as triple patterns or FILTER expressions, have been re-used without change. But the grammar's higher levels have been replaced with grammar rules for WIQA-PL.

While the parser reads a WIQA-PL file, it adds the various parts described in the previous section to a Policy object using its setter methods.

Handling of extension functions

One syntactic construct in WIQA-PL is overloaded with three different features: the extension function syntax.

FILTER ... ext:someQName(?foo, ?bar) ...

When this “QName or URI followed by arguments in brackets” syntax is encountered in a policy, it could be

an extension function, like in SPARQL, or
an Extension, which is another mechanism for adding new functionality to WIQA-PL, and is evaluated in a different phase of query execution, or
a count constraint.

These features can be distinguished only by the URI. Count constraints use the well-known URI wiqa:count, and all other URIs must be checked against the lists of registered functions and extensions to see which kind it represents.

Distinguishing between these three types is somewhat difficult in the token/rule framework provided by JavaCC. Like most parsers, a JavaCC-generated parser operates in two stages: First, the input string is split into tokens (e.g. keywords, QNames, URIs, literals, punctuation). Second, the stream of tokens is matched against the grammar rules. To cope with the three function types, the WIQA parser has another phase between these two: The token stream is analysed, and URI or QName tokens whose URI is wiqa:count or a registered extension are replaced by special “count” or “extension” tokens. This happens in CustomTokenManager. The grammar rules for count constraints and extensions are defined using these tokens.

5. Query execution

This section describes what happens behind the scenes when the output graph is queried.

AcceptedGraph is an implementation of Jena's Graph interface. Like any RDF graph, it is a set of triples. AcceptedGraph is read-only, it doesn't implement methods for adding and removing triples. All the read methods (find, contains, size, query) are implemented in terms of a simple match of a triple pattern against the graph's triples. This match is implemented in the method AcceptedGraph.graphBaseFind(tripleMatch). The basic procedure is as follows:

Create an ARQ Query object out of the policy's graph patterns and extension calls. ARQ query objects are abstract representations of RDF queries. This happens in the Policy.toARQ() method.
Pre-bind the ?SUBJ, ?PRED and ?OBJ query variables if the subject, predicate or object of the triple pattern were concrete RDF nodes and not wildcards. This happens in AcceptedGraph.graphBaseFind.
Let ARQ execute the query against the input dataset. This automatically executes any calls to Functions (because they are evaluated when ARQ evaluates FILTER expressions in the graph patterns) and to Extensions (because they were explicitly added to the Query object in step 1). The result is an iterator of possible solutions to the query. WIQA has to hook into some internals of ARQ's query engine in order to return explanations from functions and extensions.
Evaluate all count constraints by going through the solutions and counting the number of different values of the constrained variables. Groups of solutions that don't fulfil the constraint are discarded. This is described in more detail in section 7.
Generate explanations by running the filtered solution stream through a QueryIterExplain, the most important class of the explanation package. Explanations are stored in an ExplanationCache, which is held by the AcceptedGraph. More details in section 8.
Deliver result triples by taking only the ?SUBJ, ?PRED and ?OBJ variables from the remaining solutions. An iterator over triples created from these variables is returned as the result of the initial AcceptedGraph.graphBaseFind call. This happens in TripleIterator, a private class inside AcceptedGraph.

The rest of this document provides more detail about steps 3, 4 and 5.

6. The graph pattern tree

Almost all policies have at least one graph pattern. In WIQA-PL, graph patterns appear as a simple list. But for some steps of query processing, it is necessary to order this list into a tree.

Each node in the tree is a tuple (gp, parent, v) where gp is a graph pattern, parent is the parent node, and v is the node's shared variable — a variable that is shared with the parent node's graph pattern.

The graph pattern of the tree's root is rootpattern, the implicit graph pattern

?GRAPH { ?SUBJ ?PRED ?OBJ . }

that is part of every policy. The full tree is constructed recursively by the function buildTree(GP), which takes the set of all graph patterns in the policy and produces a set of nodes interconnected through references to their parents.

buildTree(GP)
	input: GP, the set of graph patterns to be arranged as a tree
	return buildNodes(rootpattern, undef, undef, GP)

buildNodes(gp, parent, v, GP)
	input: gp, a graph pattern
	input: parent, a node in the graph parent tree
	input: v, a variable shared with the graph pattern of parent
	input: GP, the set of graph patterns to be arranged as a tree
	n = (gp, parent, v)
	nodes = {n}
    for each graph pattern gp_i in GP
        for each variable v_j contained in gp_i
            if v_j is contained in gp and
                v_j is not the shared variable of any node in ancestors(n) U {n}
            then
                nodes = nodes U buildNodes(gp_i, n, v_j, GP - {gp_i})
    return nodes
    
ancestors(n)
    input: n = (parent, gp, v), a node in the graph pattern tree
    return {} if parent is undef
    return ancestors(parent) U {parent}

The algorithm is implemented in the TreeBuilder class.

This tree is used in two parts of WIQA query evaluation. In count constraints, it is used to separate solutions into groups, and the counting happens within each of the groups. And when explanations are generated, the tree structure of the explanation's parts mirrors the graph pattern tree.

7. Evaluating count constraints

This section describes the algorithm used for evaluating count constraints.

If a policy P is in use, and B is the solution set produced by a request, and the filter condition

FILTER (wiqa:count(?x) > 3)

is encountered in graph pattern gp, then the filter result is determined by the function count(B, ?x, > 3, gp, graph pattern tree of P). gp is undef if the count constraint is located on the root.

A solution is a partial function from variable names to RDF terms.

count(B, v_c, condition, gp, t)
    input: B, a list of solutions
    input: v_c, the variable whose values are to be counted
    input: condition, a function from integers to {true, false}
    input: gp, the graph pattern containing the count constraint
    input: t, the policy's graph pattern tree
    result = B
    if gp = undef then
        result = groupAndCount(result, v_c, condition, {?SUBJ, ?PRED, ?OBJ})
    else
        do a depth-first walk of t; for each node n
            if gp = the graph pattern of n then
                V_p = the set of shared variables of n and all its ancestors
                result = groupAndCount(result, v_c, condition, {?SUBJ, ?PRED, ?OBJ} U V_p)

The function groupAndCount sorts solutions into buckets, where each bucket contains only solutions with identical values for the some set of group variables V_group. This is similar to SQL's GROUP BY feature. Then, in each bucket, the distinct values of another variable v_c are counted, and all buckets where the count doesn't match condition are discarded.

groupAndCount(B, v_c, condition, V_group)
    input: B, a list of solutions
    input: v_c, the variable whose values are to be counted
    input: condition, a function from integers to {true, false}
    input: V_group, a set of variables by which solutions are grouped prior to counting
    result = {}
    for each groupdef in { groupdef = projection(b, V_group) and b in B }
        group = { b in B : groupdef is a subset of b }
        if condition( |group| ) is true
            result = result U group

The projection of a solution on a set of variables V is a new solution that retains only values for variables in V.

projection(b, V)
    input: b, a solution
    input: V, a set of variables
    result = the sub-solution p of b that has
        p(v) = b(v) for all v in V, and
        p(v) = undef for all other variables

These functions are implemented in the count package, the most important class is QueryIterCount.

8. Generating explanations

This section describes how explanations for an accepted RDF triple are generated from the set of solutions.

An explanation is a set of explanation parts. Each of these is a tuple (text, parts), where text is a list of RDF nodes (usually literals), and parts is a set of explanation parts.

Explanations are generated from explanation templates. An explanation template is a list of RDF nodes and variables. Variables in the template are replaced with actual RDF nodes from a solution when the template is instantiated.

The explanation for an RDF triple rt = (s, p, o) is generated by the function explain(rt, B, root, tree), where B is the solution set, tree is a graph pattern tree as returned by buildTree, and root is its root node.

explain(rt, B, n, tree)
    input: rt = (subj, pred, obj), an RDF triple
    input: S, a solution set
    input: n, a node in the graph pattern tree
    input: tree, a graph pattern tree
    S' = { s in S where s(?SUBJ) = subj & s(?PRED) = pred & s(?OBJ) = obj }
    return makeParts(S', root(tree), tree)

makeParts(S, n, tree)
    input: S, a solution set
    input: n = (gp, parent, v), a node in the graph pattern tree
    input: tree, a graph pattern tree
    parts = {}
    if gp has an explanation template tpl
        V_tpl = set of all variables in tpl
        valuesets = {}
        for each s in S
            valuesets = valuesets U { projection(s, V_tpl) }
        for each valueset in valuesets
            text = instantiate(tpl, valueset)
            solutiongroup = { s in S where valueset subset of s }
            parts = {}
            for each c in children(n, tree)
                childparts = childparts U makeParts(solutiongroup, c, tree)
            parts = parts U { (text, childparts) }
    else
        for each c in children(n, tree)
            parts = parts U makeParts(S, c, tree)
    return parts
    
instantiate(tpl, s)
    input: tpl = <r₁, ..., r_n>, a list of RDF nodes or variables
    input: s, a solution
    return <r₁', ..., r_n'>, where
        r_i' = s(r_i), if r_i is a variable
        r_i' = r_i, otherwise

9. Change log

$Log: index.html,v $
Revision 1.10  2006/06/26 21:58:17  cyganiak
Changed iqa: prefix to wiqa: throughout

Revision 1.9  2006/06/26 21:42:06  cyganiak
Terminology: binding -> solution

Revision 1.8  2006/06/26 20:47:49  cyganiak
Small improvement to tree building algorithm

Revision 1.7  2006/06/21 14:16:11  cyganiak
Improved formal algorithms

Revision 1.6  2006/06/14 11:59:00  cyganiak
Terminology: parent link variable -> shared variable

Revision 1.5  2006/06/14 10:19:17  cyganiak
Added abstract, small editorial changes

Revision 1.4  2006/06/13 15:24:59  cyganiak
First complete version.

Revision 1.3  2006/06/12 17:56:42  cyganiak
- Added UML diagram of Policy and friends
- Various revisions to the text

Revision 1.2  2006/05/03 11:58:20  cyganiak
Added section on graph pattern trees to engine implementation doc

Revision 1.1  2006/05/02 13:14:55  cyganiak
Half-finished draft of engine architecture/implementation docs.