Chris Bizer
Tobias Gauß
Richard Cyganiak
Olaf Hartig

The Semantic Web Client Library represents the complete Semantic Web as a single RDF graph. The library enables applications to query this global graph using SPARQL and find(SPO) queries. To answer queries, the library dynamically retrieves information from the Semantic Web by dereferencing HTTP URIs, by following rdfs:seeAlso links, and by querying the Sindice search engine. The library is written in Java and is based on the Jena framework.

Contents

  1. Introduction
  2. How does the Library work?
  3. Example Queries
  4. Using the Library on the Command Line
  5. Using the Library in your Applications
  6. GRDDL Support
  7. Sindice Search Engine Support
  8. Download
  9. Support and Feedback

 

1. Introduction

There is a recent tendency in the Semantic Web community to stress the Web aspect of the Semantic Web, meaning that the Semantic Web is increasingly understood as a single, global information space consisting of interlinked RDF data.

This tendency is carried by a revival of ideas around interlinking data on the Semantic Web: current W3C efforts stress the fact that URI references should be dereference-able (W3C Working Draft on Best Practice Recipes for Publishing RDF Vocabularies). There is a nice memo by Tim Berners-Lee about the role of links on the Semantic Web (Berners-Lee: Linked Data). There are efforts around Swoogle to measure the size and characterize the content of the Semantic Web (Ding, Finin: Characterizing the Semantic Web on the Web). The Tabulator browser lets surfers browse interlinked data on the Semantic Web (Berners-Lee et al.: Tabulator: Exploring and Analyzing Linked Data on the Semantic Web). Tools like D2R Server make it easy to publish existing data on the Semantic Web (Bizer, Cyganiak: D2R Server - Publishing Relational Databases on the Semantic Web).

All these efforts are based on the assumption that, to be part of the Semantic Web, data should ideally fulfill the following requirements:

  1. All entities of interest, such as information resources, real-world objects, and vocabulary terms, should be identified by URI references.
  2. URI references should be dereference-able, meaning that an application can look up a URI over the HTTP protocol and retrieve RDF data about the identified resource.
  3. Data should be provided using the RDF/XML or Turtle syntax. If data is embedded inside HTML documents it is highly recommended to use RDFa.
  4. Data should be interlinked with other data. Thus resource descriptions should contain links to related information in the form of dereference-able URIs within RDF statements and rdfs:seeAlso links.

The Semantic Web Client Library regards all data that is published according to these rules as a single, global set of named RDF graphs. The library allows applications to query the merge of this graph set using the SPARQL query language or using find(SPO) queries. The triples that are returned by a find(SPO) query are linked to the graphs in which they occur, meaning that applications can keep track of information provenance.

Technically, the library represents the Semantic Web as a single Jena RDF graph or Jena Model. As both interfaces are commonly used within RDF applications, the library can be used to replace local RDF stores, turning an RDF application into a Semantic Web application. For instance, it might be fun to plug the library below existing RDF browsers.

2. How does the Library work?

For answering queries, the library dynamically retrieves information from the Semantic Web. In order to prevent unnecessary information from being retrieved, the library uses a directed-browsing algorithm that is similar to the algorithm used by the Tabulator browser (See section 4 of the Tabulator paper).

The library caches retrieved information locally as a set of named graphs. The library splits SPARQL queries into a set of triple patterns which are consecutively matched against the Semantic Web.

For each triple pattern, the library executes the following algorithm:

  1. look up URIs that appear in the triple pattern. Add retrieved graphs to the local graph set.
  2. look up any URI y where the graph set includes the triple { x rdfs:seeAlso y } and x is a URI from the triple pattern. Add retrieved graphs to the local graph set.
  3. match the triple pattern against all graphs in the local graph set.
  4. for each triple that matches the triple pattern
    1. look up all new URIs that appear in the triple. Add retrieved graphs to the local graph set.
    2. look up any new URI y where the graph set includes the triple { x rdfs:seeAlso y } and x is a URI from a matching triple. Add retrieved graphs to the local graph set.
  5. match the triple pattern against all newly retrieved graphs.
  6. repeat steps 4 and 5 until the maximum number of retrieval steps or the timeout is reached.
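As an illustration, the loop above can be sketched in self-contained Java. This is a toy, not the library's implementation: the in-memory WEB map and the lookup() helper are hypothetical stand-ins for HTTP dereferencing, the rdfs:seeAlso steps (2 and 4.2) and the multithreaded retrieval are omitted, and the real library operates on Jena graphs rather than plain strings.

```java
import java.util.*;

public class RetrievalSketch {
    // A triple of plain strings; "ANY" acts as a wildcard.
    record Triple(String s, String p, String o) {}

    // Hypothetical stand-in for the Semantic Web: URI -> retrieved graph.
    static final Map<String, Set<Triple>> WEB = Map.of(
        "ex:alice", Set.of(new Triple("ex:alice", "foaf:knows", "ex:bob")),
        "ex:bob",   Set.of(new Triple("ex:bob", "foaf:name", "\"Bob\"")));

    static List<Triple> find(String s, String p, String o, int maxSteps) {
        Map<String, Set<Triple>> cache = new HashMap<>(); // local graph set
        Set<String> seen = new HashSet<>();               // URIs already looked up
        // Step 1: look up the URIs that appear in the triple pattern.
        for (String uri : List.of(s, p, o)) lookup(uri, cache, seen);
        List<Triple> matches = new ArrayList<>();
        for (int step = 0; step < maxSteps; step++) {
            // Steps 3/5: match the pattern against all graphs in the local graph set.
            List<Triple> fresh = new ArrayList<>();
            for (Set<Triple> graph : cache.values())
                for (Triple t : graph)
                    if (matches(t, s, p, o) && !matches.contains(t)) fresh.add(t);
            if (fresh.isEmpty()) break;                   // nothing new was found
            matches.addAll(fresh);
            // Step 4: look up all new URIs that appear in matching triples.
            for (Triple t : fresh)
                for (String uri : List.of(t.s(), t.p(), t.o()))
                    lookup(uri, cache, seen);
        }
        return matches;
    }

    static void lookup(String uri, Map<String, Set<Triple>> cache, Set<String> seen) {
        if (uri.equals("ANY") || !seen.add(uri)) return;  // skip wildcards, known URIs
        Set<Triple> graph = WEB.get(uri);                 // "dereference" the URI
        if (graph != null) cache.put(uri, graph);
    }

    static boolean matches(Triple t, String s, String p, String o) {
        return (s.equals("ANY") || s.equals(t.s()))
            && (p.equals("ANY") || p.equals(t.p()))
            && (o.equals("ANY") || o.equals(t.o()));
    }

    public static void main(String[] args) {
        System.out.println(find("ex:alice", "ANY", "ANY", 3));
    }
}
```

Here find("ex:alice", "ANY", "ANY", 3) discovers the ex:bob graph via the matching triple, while a pattern without any URI, such as ("ANY", "foaf:name", "ANY"), finds nothing because the algorithm has no starting point; this is why manually loading graphs into the cache (see below) can be helpful.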

The paper

Olaf Hartig, Christian Bizer, and Johann-Christoph Freytag: Executing SPARQL Queries over the Web of Linked Data. In Proceedings of the 8th International Semantic Web Conference (ISWC'09), Washington, DC, USA, Oct. 2009
provides a detailed description of the query execution approach implemented in the library.

The library is multithreaded to allow faster retrieval. For URIs containing the hash character ("#"), the part before the hash will be used for retrieval.

Retrieved graphs are kept in the local cache after the query execution is finished. Thus, the local cache is cumulatively filled by executing queries, and the same query might return different results if it is asked again after some other queries have been executed. To provide the algorithm with proper starting points, it can be helpful to manually load some graphs into the local cache before querying.

 

3. Example Queries

This section shows the results of some example queries against the Semantic Web.

3.1 Query 1: Developers of the Tabulator Project

Query

"Find all developers of the Tabulator Project, their email addresses and other projects they are involved in."

SPARQL Query

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT DISTINCT ?name ?mbox ?projectName
WHERE { 
  <http://dig.csail.mit.edu/2005/ajar/ajaw/data#Tabulator> doap:developer ?dev .
  ?dev foaf:name ?name .
  OPTIONAL { ?dev foaf:mbox ?mbox }
  OPTIONAL { ?dev doap:project ?proj . 
             ?proj foaf:name ?projectName }
}

Query Results

?name ?mbox ?projectName
"Ruth Dhanaraj"    
"Timothy Berners-Lee" <mailto:timbl@w3.org>  
"David Sheets"    
"James Hollenbach" <mailto:jambo@mit.edu>  
"Dan Connolly" <mailto:connolly@w3.org>  
"Adam Lerer" "alerer@mit.edu"  
"Yuhsin Joyce Chen" "yuhsin@mit.edu"  
"Yu-hsin Chen" "yuhsin@mit.edu"  
"Lydia Chilton" "hmslydia@gmail.com" "The illustrious project II"
"Lydia Chilton" "hmslydia@gmail.com" "The inimitable project III"

Retrieved Graphs

The following graphs have been retrieved by the library's directed browsing algorithm for answering the query:

First, the algorithm dereferences the URI http://dig.csail.mit.edu/2005/ajar/ajaw/data#Tabulator which is contained in the triple pattern. By looking up the URI, it gets the DOAP file of the Tabulator project. The DOAP file contains the URIs of all Tabulator developers, which are dereferenced to retrieve their FOAF files. A developer's FOAF file might contain links to her other projects. The retrieval algorithm follows these links afterwards.

3.2 Query 2: Richard's Friends and their Friends

Query

"Find the names and homepages of Richard's friends and the names and homepages of their friends."

SPARQL Query

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?friendsname ?friendshomepage ?foafsname ?foafshomepage  
WHERE {
  { <http://richard.cyganiak.de/foaf.rdf#cygri> foaf:knows ?friend .
    ?friend foaf:mbox_sha1sum ?mbox . 
    ?friendsURI foaf:mbox_sha1sum ?mbox .
    ?friendsURI foaf:name ?friendsname .
    ?friendsURI foaf:homepage ?friendshomepage . }
  OPTIONAL { ?friendsURI foaf:knows ?foaf .
             ?foaf foaf:name ?foafsname .
             ?foaf foaf:homepage ?foafshomepage .
           }
}

Query Result

The query returns 60 results. Some of the results are:

?friendsname ?friendshomepage ?foafsname ?foafshomepage
"David Best" <http://www.david-best.de/>    
"David Best" <http://www.david-best.de/> "Sebastian Frank" <http://userpage.fu-berlin.de/~baltru/>
"David Best" <http://www.david-best.de/> "Richard Cyganiak" <http://richard.cyganiak.de>
"David Best" <http://www.david-best.de/> "Richard Cyganiak" <http://richard.cyganiak.de>
"David Best" <http://www.david-best.de/> "Anja Jentzsch" <http://www.anjeve.de>
... ... ... ...
"Sven Schwarz" <http://www.dfki.uni-kl.de/~schwarz/>    
"Anja Jentzsch" <http://www.anjeve.de> "Lyndon Nixon" <http://page.mi.fu-berlin.de/~nixon/>
"Anja Jentzsch" <http://www.anjeve.de> "Arne Handt" <http://www.handtwerk.de>
"Anja Jentzsch" <http://www.anjeve.de> "Marco Rademacher" <http://page.mi.fu-berlin.de/~mrademac/>
"Anja Jentzsch" <http://www.anjeve.de> "Christian Bizer" <http://www.bizer.de>
"Anja Jentzsch" <http://www.anjeve.de> "Robert Tolksdorf" <http://www.robert-tolksdorf.de>
... ... ... ...


Retrieved Graphs

The following graphs have been retrieved by the library's directed browsing algorithm for answering the query:

http://richard.cyganiak.de/foaf.rdf
http://www.aifb.uni-karlsruhe.de/WBS/mvo/foaf.rdf.xml
http://www.handtwerk.de/foaf.rdf
http://g1o.net/g1ofoaf.xml
http://www.leobard.net/foaf.xml
http://www.anjeve.de/foaf.rdf
http://thefigtrees.net/lee/ldf-card
http://torrez.us/who
http://page.mi.fu-berlin.de/~best/foaf.rdf
http://bblfish.net/people/henry/card
http://www.dfki.uni-kl.de/~schwarz/foaf.xml
http://www.wiwiss.fu-berlin.de/suhl/bizer/foaf.rdf
http://www.livejournal.com/users/littleve/data/foaf
http://xmlns.com/foaf/0.1/mbox_sha1sum
http://xmlns.com/foaf/0.1/knows
http://xmlns.com/foaf/0.1/name
http://xmlns.com/foaf/0.1/homepage
http://www.gnowsis.com/leo/foaf.xml
http://www.leobard.net/rdf/foaf.xml
http://www.dfki.uni-kl.de/~maus/foaf.xml
http://www.dfki.de/~kiesel/foaf.xml
http://zine.niij.org/data/about-me
http://www.dfki.uni-kl.de/~grimnes/foaf.rdf
http://www.heimwege.de/foaf/foaf.xml
http://anjeve.de/foaf.rdf
http://www.unix-ag.uni-kl.de/~guenther/guenther.xml
http://www.dfki.uni-kl.de/~sintek/foaf.xml
http://purl.org/net/inkel/inkel.foaf.rdf
http://www.kwark.org/XML/foaf.rdf
http://www.livejournal.com/users/doctorow/data/foaf
http://www.csd.abdn.ac.uk/~ggrimnes/codepict.rdf
http://swordfish.rdfweb.org/people/libby/rdfweb/webwho.xrdf
http://www.webmink.net/foaf.rdf
http://dannyayers.com/me.rdf
http://torrez.us/feed/rdf
http://clark.dallas.tx.us/kendall/foaf.rdf
http://www.w3.org/People/Connolly/home-smart.rdf
http://www.dajobe.org/foaf.rdf
http://www.sirpheon.com/foaf.rdf
http://thefigtrees.net/lee/eliast-school.ics.rdf
http://page.mi.fu-berlin.de/~nixon/foaf.rdf
http://page.mi.fu-berlin.de/~jentzsch/foaf.rdf
http://www.w3.org/People/Connolly/travel-sched
http://captsolo.net/semweb/foaf-captsolo.rdf
http://icite.net/about/foaf.rdf
http://xml.mfd-consult.dk/morten/foaf.rdf
http://xml.mfd-consult.dk/foaf/morten.rdf
http://rdfweb.org/people/danbri/rdfweb/danbri-foaf.rdf
http://www.david-best.de/foaf.rdf
http://userpage.fu-berlin.de/%7Epaslaru/foaf.rdf
http://page.mi.fu-berlin.de/~mschulze/foaf-extended.rdf
http://page.mi.fu-berlin.de/~barnicke/semanticweb/foaf-a-matic.rdf
http://www.userpage.fu-berlin.de/~brok/foaf.rdf

4. Using the Library on the Command Line

The library provides a command line tool that allows you to execute SPARQL and find(SPO) queries against the Semantic Web. It is invoked using the semwebquery command, which is found in the bin folder of the distribution.

Example queries:

semwebquery -sparqlfile tabulator-devs.sparql -retrieveduris
semwebquery -load http://richard.cyganiak.de/foaf.rdf#cygri
    -find "ANY <http://xmlns.com/foaf/0.1/knows> ANY"

This table shows the parameters of the semwebquery command:

-sparql <Query>
Executes a SPARQL query against the Semantic Web.

-sparqlfile <Filename>
Loads a SPARQL query from a file and executes it against the Semantic Web.

-find <TriplePattern>
Executes a find(SPO) query against the Semantic Web. Example:
-find "<http://www.w3.org/People/Berners-Lee/card#i> ANY ANY"

-maxsteps <Integer>
Sets the maximal number of iterations of the retrieval algorithm. The default value is 3.

-timeout <Integer>
Sets the timeout of the query in seconds. The default is 60 seconds.

-maxthreads <Integer>
Sets the maximal number of parallel threads for retrieving URIs. The default value is 10.

-load <URI>
Loads a graph from the Web into the local cache before the query is executed.

-grddl
Enables GRDDL support; see below. Note that GRDDL support is deprecated!

-NoRDFa
Disables RDFa support.

-sindice
Enables Sindice-based URI search during query execution; see below.

-loadtrig <Filename>
Loads a set of named graphs from a TriG file before the query is executed.

-savetrig <Filename>
Saves all graphs that have been retrieved during the query execution into a file.

-retrieveduris
Outputs a list of all successfully retrieved URIs.

-redirecteduris
Outputs a mapping of URIs that have been redirected.

-faileduris
Outputs a list of all URIs that could not be retrieved.

-resultfmt
Specifies the output format for the result of a SPARQL query:
  • for SELECT and ASK queries use TXT, XML, or JSON (default: TXT)
  • for CONSTRUCT or DESCRIBE queries use RDF/XML, N-TRIPLE, TURTLE, or N3 (default: RDF/XML)

-verbose
Shows additional information about the retrieval process.

The semwebquery command line tool prints the query result in the format selected with the -resultfmt parameter.

Tip: Using -loadtrig and -savetrig in the same command allows you to cumulate cache data from several queries.
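For example, the cache built while answering one query can be carried over into and extended by a second run (the query file names here are hypothetical):

```
# First run: answer a query and persist every retrieved graph.
semwebquery -sparqlfile query1.sparql -savetrig cache.trig

# Second run: seed the cache from the file, extend it, and save it again.
semwebquery -loadtrig cache.trig -savetrig cache.trig -sparqlfile query2.sparql
```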

5. Using the Library in your Applications

The Semantic Web Client Library is based on the NG4J - Named Graphs API for Jena. The main interface of the Library is the SemanticWebClient interface. The table shows its most important methods. See JavaDoc for details about all methods.

SemanticWebClient implements NamedGraphSet
The Semantic Web client represents all data that is published on the Semantic Web as a global set of named RDF graphs. Applications can query the merge of this graph set using the SPARQL query language or using find(SPO) queries.

find(pattern)
Returns an iterator over all matching triples.

find(pattern, listener)
Returns void. A triple listener can be passed which notifies the application when a triple is found or the retrieval process is finished.

read(url, lang)
Reads named graphs from a URL into the NamedGraphSet. Supported RDF serialization languages are TriX, TriG, RDF/XML, N-Triples and N3.

write(url, lang, baseURI)
Writes a serialized representation of the NamedGraphSet to a writer. Supported RDF serialization languages are TriX, TriG, RDF/XML, N-Triples and N3. If the specified serialization language doesn't support named graphs, the union graph is serialized and knowledge about graph names is lost. Only TriX and TriG support graph naming.

asJenaGraph(defaultGraphForAdding)
Returns a Jena Graph view on the NamedGraphSet, equivalent to the union graph of all graphs in the graph set.

asJenaModel(defaultGraphForAdding)
Returns a Jena Model view on the NamedGraphSet, equivalent to the union graph of all graphs in the graph set. All Statements returned by this model can be cast to NamedGraphStatements in order to access provenance information about the graphs they are contained in.

More...

5.1 Executing a SPARQL Query

The following example demonstrates how a new Semantic Web client is created and how the client is used to execute a SPARQL query against the Semantic Web.

import com.hp.hpl.jena.query.*;
import de.fuberlin.wiwiss.ng4j.semwebclient.SemanticWebClient;


// Create a new Semantic Web client.
SemanticWebClient semweb = new SemanticWebClient(); 
// Specify the query.
String queryString =
"PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
"SELECT DISTINCT ?i WHERE {" +
"<http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?p . " +
"?p foaf:interest ?i ." +
"}";
// Execute the query and obtain results.
Query query = QueryFactory.create(queryString);
QueryExecution qe = QueryExecutionFactory.create(query, semweb.asJenaModel("default"));
ResultSet results = qe.execSelect();
// Output query results.
ResultSetFormatter.out(System.out, results, query);

5.2 Executing a find(SPO) Query

The Semantic Web client provides two different find methods. The first one, find(TripleMatch), returns an iterator over all matching triples. First, the iterator returns matching triples from the cached graphs. Afterwards, the iterator's hasNext() method waits until more matching triples are found by the retrieval process. When the retrieval process is finished, the hasNext() method returns false.

// Create a new Semantic Web client.
SemanticWebClient semweb = new SemanticWebClient();
// Specify the triple pattern.
Triple t = new Triple(Node.ANY,
        Node.createURI("http://xmlns.com/foaf/0.1/knows"), Node.ANY);
// Search for the triple.
Iterator iter = semweb.find(t);
// Loop over all matching triples.
while (iter.hasNext()) {
    SemWebTriple triple = (SemWebTriple) iter.next();
    System.out.println(triple.toString());
}

The find() method can also be called with a TripleListener as second parameter. A TripleListener is an application-specific class that handles triple-found events. A TripleListener has to implement the methods tripleFound() and findFinished(). The tripleFound() method is called every time a new triple is found. The findFinished() method is called once when the retrieval process is finished.

The following example shows a simple listener which implements the TripleListener interface. Whenever a triple is found, the listener prints the triple to System.out.

class ExampleListener implements TripleListener {
	public int count = 0;
		
	public void tripleFound(Triple t){
		System.out.println("Found Triple: " + t.toString());
		this.count++;
	}
			
	public void findFinished(){
		System.out.println("Find finished. " + this.count + " triples found." );
	}			
}
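The event flow, n tripleFound() calls followed by exactly one findFinished() call, can be demonstrated without the library. In the self-contained sketch below, the TripleListener interface mirrors the contract described above, while the run() driver and the use of plain strings as triples are hypothetical stand-ins for the library's retrieval threads and Triple objects:

```java
import java.util.List;

public class ListenerDemo {
    // Mirrors the library's listener contract as described above.
    interface TripleListener {
        void tripleFound(String t);
        void findFinished();
    }

    static class CountingListener implements TripleListener {
        int count = 0;

        public void tripleFound(String t) {
            System.out.println("Found Triple: " + t);
            count++;
        }

        public void findFinished() {
            System.out.println("Find finished. " + count + " triples found.");
        }
    }

    // Hypothetical driver playing the role of the asynchronous retrieval process.
    static int run(List<String> retrieved, TripleListener listener) {
        retrieved.forEach(listener::tripleFound); // one event per discovered triple
        listener.findFinished();                  // exactly one completion event
        return retrieved.size();
    }

    public static void main(String[] args) {
        run(List.of("ex:a foaf:knows ex:b", "ex:b foaf:name \"Bob\""),
            new CountingListener());
    }
}
```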

5.3 Configuration

Several aspects of the retrieval algorithm can be configured in the SemanticWebClientConfig object that is accessible via the SemanticWebClient.getConfig method.

SemanticWebClient semWeb = new SemanticWebClient();
semWeb.getConfig().setValue( SemanticWebClientConfig.MAXSTEPS, "3" );
SemanticWebClientConfig.MAXSTEPS
The maximum depth of links followed during a single query. 0 means no links are followed. Default: 3

SemanticWebClientConfig.MAXTHREADS
The number of threads to use for URI dereferencing. Default: 10

SemanticWebClientConfig.TIMEOUT
The maximal execution time for a query in milliseconds. A query finishes with the results found up to that point if the timeout is reached. Default: 10000

SemanticWebClientConfig.ENABLEGRDDL
Setting this to true enables GRDDL support; see below. Note that GRDDL support is deprecated! Default: false

SemanticWebClientConfig.ENABLE_RDFA
Setting this to true enables RDFa support. Default: true

SemanticWebClientConfig.ENABLE_SINDICE
Setting this to true enables Sindice-based URI search during query execution; see below. Default: false

5.4 Information Provenance

The Semantic Web Client Library provides applications with several methods to access provenance information about retrieved information.

SemWebTriples

Both find() methods return SemWebTriples. A SemWebTriple can be asked for its origin by calling the SemWebTriple.getSource method. The method returns the URL from which the triple was retrieved.

Provenance Graph

The local cache contains a named graph called http://localhost/provenanceInformation. This graph consists of a swp:sourceURL and a swp:retrievalTimestamp triple for each retrieved graph. The content of the graph can be used within SPARQL queries to restrict the origin of information. This allows applications to apply different trust policies for deciding whether to accept or reject information from the Semantic Web.
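A query along the following lines sketches how this can be used (the swp: namespace URI and the exact shape of the provenance triples are assumptions based on the description above; the subjects of swp:sourceURL are taken to be the graph names):

```
PREFIX swp:  <http://www.w3.org/2004/03/trix/swp-2/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?source
WHERE {
  # Only accept names that occur in a graph retrieved from a trusted host.
  GRAPH ?g { ?person foaf:name ?name }
  GRAPH <http://localhost/provenanceInformation> { ?g swp:sourceURL ?source }
  FILTER regex(str(?source), "^http://richard\\.cyganiak\\.de/")
}
```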

Dereferenced URI Lists

After the retrieval process has finished, iterators over all successfully and all unsuccessfully retrieved URIs can be obtained by calling the SemanticWebClient.successfullyDereferencedURIs and SemanticWebClient.unsuccessfullyDereferencedURIs methods.

6. GRDDL Support

GRDDL support is deprecated and will be dropped with the next release (unless someone steps up to maintain it in the future).

GRDDL is a technique for obtaining RDF data from XML documents and in particular XHTML pages. This works by adding a link to an XSLT stylesheet to the XML document. The stylesheet transforms the document to RDF/XML. Here is an example taken from Dan Connolly's home page:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view
                 http://purl.org/NET/erdf/profile 
                 http://dig.csail.mit.edu/2007/id/doc
                 http://www.w3.org/2006/03/hcard">
    <title>Dan Connolly, W3C</title>
    ...
    <link rel="transformation"
          href="http://www.w3.org/2002/12/cal/glean-hcal.xsl"/>
    <link rel="transformation"
          href="http://www.w3.org/2002/12/cal/myevents"/>
    ...

The important part is the GRDDL metadata profile (http://www.w3.org/2003/g/data-view) in the profile attribute of the head element. It tells us that GRDDL can be used to extract RDF from this page. The link elements with rel="transformation" attributes point to the XSLT stylesheets that can be applied to get the RDF data.

Another way to link to the XSLT stylesheets is by putting a profile document at the profile URL, and placing the links inside this document. More details on this can be found in the GRDDL Primer.

The following query extracts information about the events that Dan Connolly will attend:

PREFIX v: <http://www.w3.org/2002/12/cal/icaltzd#>
SELECT ?event ?prop ?value
WHERE {
   ?event v:attendee <http://www.w3.org/People/Connolly/#me> .
   ?event ?prop ?value
}
To retrieve the results, issue this command, assuming the SPARQL query is saved in a file dc.sql:
semwebquery -grddl -sparqlfile dc.sql

6.1 Enabling GRDDL in the Java API

To enable GRDDL support in your own applications that use the SemanticWebClient API, set the corresponding configuration value (see section 5.3):

SemanticWebClient swc = new SemanticWebClient();
swc.getConfig().setValue( SemanticWebClientConfig.ENABLEGRDDL, "true" );

6.2 Performance Considerations

The GRDDL mechanism imposes heavy CPU utilization for applying the XSLT transformations and makes extensive use of network bandwidth for retrieving transformation documents. These resource requirements are multiplied in some cases by the recursive nature of the GRDDL algorithm. Therefore, the GRDDL feature is disabled by default. It is also recommended to use an HTTP proxy in order to speed up the retrieval of popular transformations or profile documents.

7. Sindice Search Engine Support

The Sindice search engine supports a URI-based search for RDF documents: it provides a list of documents that mention a given URI. The Semantic Web Client Library can optionally utilize Sindice's URI-based search. In addition to dereferencing a URI, the library then queries Sindice for documents that mention the URI and stores these additional documents in the local cache. Hence, enabling Sindice-based URI search during query execution may yield more complete results. For instance, while the query for properties with range foaf:Person (i.e. the find(SPO) query "ANY @rdfs:range <http://xmlns.com/foaf/0.1/Person>") finds one result without URI search, the same query discovers six distinct results with URI search (tested on 2008/08/13).

However, use this option with care. Since enabling URI search may result in the addition of a large number of discovered graphs, the consumed memory may grow drastically. For instance, the library cached 42 graphs during the URI-search-enabled execution of the aforementioned query; for the same query execution without URI search, the library added only two graphs to the cache. Furthermore, searching for URIs increases the query execution time significantly.
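For reference, the range query mentioned above can be issued from the command line by adding the -sindice switch (written here with the full rdfs:range URI, matching the -find syntax shown in section 4):

```
semwebquery -sindice \
    -find "ANY <http://www.w3.org/2000/01/rdf-schema#range> <http://xmlns.com/foaf/0.1/Person>"
```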

8. Download

The Semantic Web Client Library is part of the NG4J - Named Graphs API for Jena project and can be downloaded from the NG4J project page on SourceForge.

The Semantic Web Client Library is licensed under the terms of the Berkeley Software Distribution (BSD) license.

9. Support and Feedback

We are interested in hearing about your opinions and experiences with the library. Please send comments and bug reports to the ng4j-namedgraphs mailing list:

ng4j-namedgraphs@lists.sourceforge.net

The archives of the list are found at http://sourceforge.net/mailarchive/forum.php?forum=ng4j-namedgraphs
You can subscribe to the list at http://lists.sourceforge.net/lists/listinfo/ng4j-namedgraphs

Good starting points for playing around with the library are the interlinked data published by the Tabulator project (http://dig.csail.mit.edu/2005/ajar/ajaw/data) and the D2R demo server running at http://www3.wiwiss.fu-berlin.de:2020/. If you find other interesting chunks of interlinked data on the Semantic Web that can be queried using the library, please tell us.
