Radoslaw Oldakowski <radol@gmx.de>
October 2003
This document is meant for developers who are interested in further work on RDF API for PHP. I will explain here some general concepts behind the RDQL implementation. A thorough description of functions and variables used within the RDQL package can be found in the detailed method reference. For a more profound understanding of every single part of the query execution consult the commented source code.
1. | Introduction |
2. | Query Parser |
3. | RDQL Memory Engine |
4. | RDQL Database Engine |
The execution of an RDQL query in RDF API for PHP V0.8 consists of four main
steps. At first, the query string is parsed into an in-memory structure represented
by a PHP array using the class RdqlParser
. Subsequently, depending
on the storage mechanism of the model the query is performed on, this array
is passed to the corresponding engine (RdqlMemEngine for MemModels and RdqlDbEngine
for DbModels). In the second step, the query engine searches for triples matching
all patterns from the WHERE
clause of the RDQL query and returns
a set of variable bindings. Next, this result set is filtered by evaluating
Boolean expressions specified in the AND
clause. The last step
is the processing of the final result, i.e. creating of RAP objects (Resource
,
BlankNode
, Literal
) or their plain text serialization
as values of the query variables, according to the desired format. In case
of
the former, the returned objects can be used for model update or other API
calls, or simply printed out using the method writeQueryResultsAsHtmlTable()
of class RdqlEngine
.
Although there had already been an RDQL parser written in PHP [PHPxmlClasses] available, it did not meet all requirements of RAP's RDQL grammar. Hence, I have decided to develop a completely new parser from scratch.
The output of the RDQL parser used in RDF API for PHP V0.8 is a multi-dimensional array (described in Figure 1) representing every clause of the parsed query. In this section I will briefly explain how this structure is generated.
|
Figure 1: |
An abstract description of the PHP array representing
an RDQL query produced by RAP's RDQL parser ( [ ] stands for an integer
index 0..N, {} are only used within the parser class and are not passed
to the query engine).
|
After the query string has been passed to the method parseQuery()
called on an object RdqlParser
, the function removeComments()
is used by the parser to clear all comments. Subsequently, the string is divided
into small tokens (tokenize()
). Each token is either a special
character (e.g. '(
' , ',
' , '
') or a
character string (e.g. ?variable
) saved in the class internal variable
$tokens
of type array. Parsing is nothing but analyzing these tokens
one by one, in the order of their appearance in the query string.
The process is initiated by the method parseSelect()
which recognizes
tokens representing query variables from the SELECT
clause and
stores them in the array $parsedQuery['selectVars']
(according
to Figure1). When this function encounters a token holding
the keyword introducing a subsequent clause of the RDQL query (i.e. FROM
/ SOURCE
or WHERE
) the next method parseFrom()
or parseWhere()
is called respectively.
As explained in the RDQL tutorial,
there is no need to specify the source while calling the method rdqlQuery()
on an active object MemModel
or DbModel
. However,
RAP's RdqlParser is able to parse the FROM
clause and can be used,
for instance, together with the RDQL engine to carry out a query on multiple
documents specified by URLs, as shown in example 1.
|
Example 1: |
RDQL query on multiple documents.
|
When the parsing of the SELECT
and FROM
clauses is
completed, the method parseWhere()
is called as next and all patterns
are transformed into the array $parsedQuery['patterns']
(see Figure
1). If the object of a pattern is a literal, additional information about
its language and datatype is recorded. A token indicating the beginning of the
AND
or the USING
clause causes the corresponding method
parseAnd()
or parseUsing()
to be called.
The processing of query filters
is one of the most complex tasks the RdqlParser has to accomplish. In the AND
clause the tokens of each filter are combined to form a complete filter string
in order to examine the filter as a whole. At first, the parser searches inside
the string for regular
expression patterns (e.g. ?x ~~ "/\.html$/i"
) and
serializes their corresponding parts into the array described in Figure
1 (in this example: ['var'] = ?x
, ['operator'] =
'eq'
,
['regex'] = '/\.html$/i'
). Additionally, it replaces each pattern
inside the original filter string (['string']
) with a placeholder
(i.e. ##RegEx_$i##
where $i
is a numerical index)
and stores this transformed string in the array (['evalFilterStr']
).
Subsequently, the parser looks for three different kinds of string equality
expressions in the form of ?variable (eq | ne) VALUE
, where VALUE
can be: ?variable
, <uri>
or "literal"@lang^^<datatypeURI
>.
Similarly, in this case, all parts of an expression are serialized into the
array (see Figure 1) and each expression inside the
already once transformed filter (['evalFilterStr']
) is replaced
with a placeholder (i.e. ##strEqExpr_$i##
). The processed filter
string now only contains numerical expressions and placeholders (e.g. ?x >
5 && ##strEqExpr_0##
). Finally the parser finds all variables
of the numerical expressions and stores them in the array as well.
Given this structure saved in an array, the RDL query engine will be able to
determine the Boolean value of each single regex and string equality
expression for a certain set of variable bindings, replace the corresponding
placeholders, and insert the values of numerical variables. An example filter
string representation for a query result to be filtered might be: (7 >
5 && TRUE
). After having evaluated this condition, the query
engine will accept this particular set of variable bindings.
If all filters have been parsed and the query contains a USING
clause, the method parseUsing()
is called. It writes all prefixes
with their corresponding namespaces into the array described in Figure
1. However, this part of the array will not be returned by the parser but
will be only used internally to replace prefixes used in the WHERE
and AND
clause. All URIs contained in the in-memory representation
of the parsed query passed to the query engine will be therefore stored in full.
An advantageous and helpful feature of RAP's RdqlParser worth mentioning is its extensive error checking. The parser is not only able to find a syntax error, but moreover to return a message pointing to the spot in the query where the error occurred. This is even more important since there exist several syntactically different RDQL implementations. Example 2 presents some RDQL clauses containing syntax errors and the corresponding error message returned by the parser.
RDQL error in the SELECT clause: '?givenN@me' variable name contains illegal characters |
|
RDQL error in the WHERE clause: '<http://www.example.org,' - illegal input: ',' - '>' missing |
|
RDQL error in the AND clause: '?x ~~ /\.thml$/ ' - regular expressions must be quoted |
|
RDQL error in the USING clause: '?dc' - illegal input, namespace prefix expected |
Example 2: |
Error messages returned by the RDQL parser.
|
If an RDQL query is performed on an in-memory model, an instance of the object
RdqlMemEngine
has to be created. The query execution is initiated
by calling the method queryModel()
with two parameters being passed
by reference: the object MemModel
itself and the array representation
of the query, generated by the RdqlParser.
The RdqlMemEngine begins the execution of a query with searching for tuples
matching all patterns specified in the WHERE
clause, by calling
the corresponding method findTuplesMatchingAllPatterns()
. This
process is divided into a sequence of two steps carried out in a loop. Initially
the result for one single pattern is found (findTuplesMatchingOnePattern()
).
Subsequently, it is joined with the set of tuples matching the previously examined
patterns (joinTuples()
).
The execution of both steps is rather complex. In the case of the first one,
the RdqlMemEngine analyzes all parts of the statement pattern (subject, predicate,
object) and additionally determines internal bindings (e.g. ?x, <uri>,
?x
). Given all the information, the method findTriplesMatchingPattern()
searches the queried model for statements fitting into the pattern also considering
the nuances of the RDQL specification. The result of this function is then converted
into an array holding variable bindings that represents a set of tuples matching
one pattern.
The join operation performed on two result sets is nothing else than a PHP implementation of the SQL-like Inner Join. Furthermore, applying it within the loop (compared to the approach where the results for each pattern are first determined and then joined altogether) enables the engine to significantly reduce memory consumption.
In this way the query engine has found all tuples matching the patterns from
the WHERE
clause. If the RDQL query being processed additionally
contains filter expressions (AND
clause), the result set determined
so far is passed to the method filterTuples()
. This function, taken
one by one for each tuple, substitutes the placeholders inside the filter string
with the Boolean values of the evaluated filter regex and string equality
expressions; subsequently it inserts the values of the numerical variables and
evaluates the entire filter string as a PHP expression .
For a better understanding I will explain the filtering process with an example.
Consider the tuple consisting of two variable bindings: ?age = 26
,
?name = "Radoslaw"
and the AND
clause of
an RDQL query: (?age < 25) && (?name eq "Radoslaw")
.
The in-memory representation of this particular filter returned by the RdqlParser
is shown in Example 3 (compare to Figure
1).
|
Example 3: |
The in-memory representation of the AND clause:
|
Given the structure described in Example 3, the query
engine at first evaluates the string equality expression by checking if the
value of the variable ?name
(i.e. "Radoslaw"
)
is a literal with the label equal "Radoslaw" and no specified language
or datatype. In this case it is one, so the query engine replaces the placeholder
(i.e. ##strEqExpr_0##
) with TRUE
, what results in
the filter string: (?age < 25) && (TRUE)
. Next, the
query engine inserts the value of the numerical variable ?age
and
determines the Boolean value of the whole filter (26 < 25) &&
(TRUE)
. In this example the output is FALSE
( (26
< 25) && (TRUE) => (FALSE) && (TRUE) = FALSE
);
consequently this particular tuple will be removed from the final result.
This was just an example of a very simple filter. RAP RDQL query engine is
able to handle much more complex AND
clauses consisting of several
filters where each of them can be built of different kinds of expressions combined
together by using logical operators. Even the process of evaluating one single
string equality expression becomes more complicated if the literal has specified
datatype and language.
In the last step, after having filtered all tuples, the query engine removes
the filter variables (which have not been specified in the SELECT
clause) from the result set. Since the engine has been working with RAP objects
all the time, it can now return the final query result with no further processing,
unless the parameter $returnNodes
passed to the object RdqlMemEngine
has been set to FALSE
. In this case the values of all query variables
are still to be serialized to string.
Similarly to the case of a MemModel, an RDQL query performed on a persistent
model (DbModel) is initiated by calling the method queryModel()
on an object RdqlDbEngine
with two parameters being passed by reference:
the DbModel itself and the in-memory representation of the parsed query. Though,
from this point on the internal execution of the query takes a different course.
The searching for tuples matching all patterns from the WHERE
clause
is this time delegated to the database management system. Moreover instead of
calling the database for the result set matching each single pattern and then
joining them in memory, the RdqlDbEngine translates almost the entire RDQL query
into an SQL statement and sends it to the database for execution. This approach
enables the machine to use the database indexing and optimization capabilities.
The algorithm used for the query rewriting is relatively simple. Every single
statement pattern is treated as a separate virtual database table (statements).
If the subject, predicate, or object of each pattern is a URI or a literal,
it is bound to its value. Otherwise, in the case that it is a variable, an identity
relationship between all occurrences of this variable is created. The variables
to be returned are indicated in the SELECT
clause of the SQL statement
being generated.
Example 4 provides a sample RDQL query and its corresponding
translation into SQL produced by the RdqlDbEngine. The first two lines in the
SELECT
clause of the SQL statement describe the variable ?someone
.
Due to its mapping to the subject (of the first triple), it can either be a
resource or a Blank Node. To find this out we need to know the value of the
corresponding table column (subject_is). In the case of the next variable
?name
mapped to the object of the second triple, it can additionally
be a literal as well, so that the information about its language and datatype
must also be returned. The FROM
clause indicates three virtual
tables (statements) each representing a pattern of the RDQL query. In
the WHERE
clause of the SQL statement there are constraints specified
for every part of a triple. Respectively, the subject in the first table (s1.subject
)
representing the variable ?someone
is bound to the subject in the
second table (s2.subject
) holding the same variable; the predicate
(s1.predicate
) and object (s1.object
) are equated
to the string they are supposed to match. Since RAP supports storing of multiple
models in a database, each pattern has to be bound to the modelID
of the model the query is performed on. Note that variables ?name
and ?y
occur only once so that they are not constrained by any
binding and can therefore match anything. Furthermore, the value of the variable
?name
must be returned by the database even if it is not specified
in the SELECT
clause of the RDQL query. It will be needed to evaluate
the filter expressions (?name EQ "Brian" || ?name EQ "Carl"
),
which will be carried out in main memory. However, this variable will only be
used internally within the RdqlQueryEngine and will not be included in the returned
final result set.
|
|
Example 4: |
An example query and its corresponding SQL translation.
|
The technique used for the query rewriting is very similar to Matt Biddulp's
algorithm for translating rdfDB-style queries into SQL ([Biddulph
2002], [Miller 2002]).
It has been extended in order to
additionally retrieve variable bindings for filter variables used in the AND
clause of an RDQL query and adapted to RAP's database layout.
RdqlDbEngine uses ADOdb abstraction
layer for SQL query execution. Thus, the result of every successful database
query is returned in the form of an object ADORecordSet
holding
the found database records in the form of an array, where the values of each
single record field are associated with integer keys. However, because of the
fact that SQL statements are generated dynamically, the mapping of these indexes
to the query variables is different for every RDQL query. Fortunately, RdqlDbEngine
provides a solution to this problem: while generating an SQL statement it associates
each RDQL query variable with the corresponding index of the ADORecordSet->fields
array. This information is stored in the class private variable $rsIndexes
.
This approach allows the query engine to emulate indexing by variable name while
retrieving SQL query results.
The filtering of the tuples matching all patterns specified in the WHERE
clause of the RDQL query is carried out in memory. The process is very similar
to the one performed by the RdqlMemEngine with the major difference being that
the RdqlDbEngine does not operate on RAP objects (Resource
, Literal
,
BlankNode
), but on the object ADORecordSet
returned
by the database. Finally in the last step the values of the variable bindings
have to be converted either to RAP objects or to their string serialization
according to the desired format of the RDQL query result set.