Dbpedia Benchmark Revisited

Blogger : Virtual Database Technology Blog
Category : Other ERP
Blogged date : 2008 May 09

We ran the Dbpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details previously, but this time I took a closer look at the queries.
Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full Dbpedia 3, all languages, with about 198M triples, on Virtuoso 5 and on a Virtuoso 6 cluster, all on the same 4-core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The cluster configuration was 4 processes in the same box.
We ran the queries in two variants: The first was with graph specified in the SPARQL from clause, using the default indices. The second variant was with no graph specified anywhere, using an alternate indexing scheme.
The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times.
First, about Virtuoso 5: because one query in the set specifies only P, with no condition on S or O, this simply cannot be done with the default indices. With 6 it sort of can, because 6 is more space efficient.
So we added the index:
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);

5 with gspo, ogps, pogs
cold  210 s
warm  0.600 s

6 cluster with gspo, ogps
cold  136 s
warm  4.01 s

6 cluster with gspo, ogps, pogs
cold  33.4 s
warm  0.628 s
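To make the configurations easier to compare, the totals can be turned into ratios. The Python sketch below does only that arithmetic. The numbers are copied from the figures above; the dictionary keys and the `ratio` helper are names made up here for illustration, not Virtuoso identifiers.

```python
# Total times in seconds for the five-query sequence, copied from the figures above.
# The configuration labels are illustrative names, not Virtuoso identifiers.
times = {
    "v5_with_pogs": {"cold": 210.0, "warm": 0.600},
    "v6_cluster_default": {"cold": 136.0, "warm": 4.01},
    "v6_cluster_with_pogs": {"cold": 33.4, "warm": 0.628},
}

def ratio(slower, faster, cache):
    """How many times longer `slower` takes than `faster` for one cache state."""
    return times[slower][cache] / times[faster][cache]

# Same index set, 5 versus 6 cluster, cold cache.
cold_5_vs_6 = ratio("v5_with_pogs", "v6_cluster_with_pogs", "cold")
# Effect of adding pogs on the 6 cluster, cold and warm.
cold_pogs_gain = ratio("v6_cluster_default", "v6_cluster_with_pogs", "cold")
warm_pogs_gain = ratio("v6_cluster_default", "v6_cluster_with_pogs", "warm")

print(f"cold, 5 vs 6 cluster with pogs: {cold_5_vs_6:.1f}x")
print(f"cold, 6 cluster without vs with pogs: {cold_pogs_gain:.1f}x")
print(f"warm, 6 cluster without vs with pogs: {warm_pogs_gain:.1f}x")
```

These ratios describe only this one machine and this one run of the query sequence, so they should be read as rough orientation rather than as general speedup factors.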
OK, so now let us do it without a graph being specified. For all platforms, we do:
- Drop any existing indices.
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g));
alter index R2 on R2 partition (s int (0hexffff00));

log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;

drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
The code is identical for 5 and 6, except that with 5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with 5 at around a few billion triples, so with 6 we have double the ID length and still manage to be vastly more space efficient.
With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This is about 60% of the space of a non-bitmap index for data such as Dbpedia.
If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.
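The coverage claim can be checked with a small model. The Python sketch below is a simplification and not Virtuoso's optimizer: it assumes an index is usable for a lookup only if its leading key part is bound, and it ignores selectivity entirely (a bound graph that holds all the data narrows nothing in practice). The function names are made up for this illustration.

```python
from itertools import combinations

COLUMNS = "spog"

def usable_prefix(index, bound):
    """Number of leading key parts of `index` that are all in the set `bound`."""
    n = 0
    for col in index:
        if col not in bound:
            break
        n += 1
    return n

def full_scan_cases(indices):
    """Non-empty sets of bound columns for which no index has a bound leading key part."""
    cases = []
    for size in range(1, len(COLUMNS) + 1):
        for bound in combinations(COLUMNS, size):
            if all(usable_prefix(ix, set(bound)) == 0 for ix in indices):
                cases.append("".join(bound))
    return cases

default_scheme = ["gspo", "ogps"]
no_graph_scheme = ["spog", "pogs", "opgs", "gpos"]

# Combinations the default indices cannot serve without a full scan.
print(full_scan_cases(default_scheme))
# Combinations the four-index scheme cannot serve without a full scan.
print(full_scan_cases(no_graph_scheme))
```

Under this model the default pair has no entry point when only S, only P, or only S and P are given, while the four-index scheme has one for every combination, since each of s, p, o and g leads some index.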
5 with spog, pogs, opgs, gpos
warm  0.595 s

6 cluster with spog, pogs, opgs, gpos
warm  0.617 s
The cold times were about the same as above, so not reproduced.
Graph or No Graph?
It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.
On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant a different setup of the database, simple as that.
The latter use case is not really within the SPARQL spec, so implementations may or may not support it. For example, Oracle or Vertica would not do this well, since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should be able to do it, maybe with some configuring, as shown above.
Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.
Once the indices are right, there is no difference between specifying a graph and no graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with graph missing. For example, bitmap intersections are possible only when all leading key parts are given.
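As a rough picture of why the leading key parts matter, a bitmap index can be thought of as a map from the leading key parts to a bitmap of the last key part. The Python sketch below is a toy model under that assumption, not Virtuoso's implementation: Python integers stand in for bitmaps, the index is keyed on (p, o, g) as in pogs, and all IDs are made up.

```python
from collections import defaultdict

def build_pogs(quads):
    """Toy pogs index: (p, o, g) -> integer whose set bits are subject IDs."""
    index = defaultdict(int)
    for g, s, p, o in quads:
        index[(p, o, g)] |= 1 << s
    return index

def subjects(bitmap):
    """Decode a bitmap back into a sorted list of subject IDs."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Made-up quads (g, s, p, o) with small integer IDs.
quads = [
    (1, 10, 100, 500),
    (1, 11, 100, 500),
    (1, 11, 101, 501),
    (1, 12, 101, 501),
    (2, 12, 100, 500),
]
pogs = build_pogs(quads)

# Both patterns give p, o and g, so each lookup returns a single bitmap
# and the join on s is one bitwise AND.
both = pogs[(100, 500, 1)] & pogs[(101, 501, 1)]
print(subjects(both))
```

In this model, leaving out g means a (p, o) pair can map to several (p, o, g) entries, each with its own bitmap, so there is no single bitmap per pattern to intersect.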

Conclusions

The best warm cache time is with 5: the five queries run in under 600 ms after the first go. This shows that running all in memory with a single thread of execution is hard to beat.
The 6 cluster performs the same in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, 6 cluster leaves 5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. 6 has substantially better data layout on disk, as well as fewer pages to load for the same content.
This makes it possible to run the queries without the pogs index on 6 even when 5 took prohibitively long.
The moral of the story is to have a lot of RAM and a space-efficient data representation.
The Dbpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache, neither of which is quite realistic.
Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.

Appendix: Query Text

For reference, the query text is below, with graph given. To run without specifying the graph, just drop the from <http://dbpedia.org>. The returned row counts are indicated below the query text.
sparql SELECT ?p ?o from <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };

-- 1337 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
from <http://dbpedia.org> WHERE {
  ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director from <http://dbpedia.org> 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage from <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
   <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?berlinLat + 0.03190235436 &&
     ?long       >=     ?berlinLong - 0.08679199218 &&
     ?lat        >=     ?berlinLat - 0.03190235436 && 
     ?long       <=     ?berlinLong + 0.08679199218) };

-- 56 rows


sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage from <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
   <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?nyLat + 0.3190235436 &&
     ?long       >=     ?nyLong - 0.8679199218 &&
     ?lat        >=     ?nyLat - 0.3190235436 && 
     ?long       <=     ?nyLong + 0.8679199218) };

-- 13 rows
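Dropping the graph from the queries can be scripted. The Python sketch below is one way to do it, assuming the clause always appears literally as from <http://dbpedia.org>, as it does in the queries above; the helper name `drop_graph` is made up for this example.

```python
import re

# Matches the "from <http://dbpedia.org>" clause used in the queries above,
# in any letter case, together with the whitespace in front of it.
FROM_CLAUSE = re.compile(r"\s*\bfrom\s+<http://dbpedia\.org>", re.IGNORECASE)

def drop_graph(query):
    """Return the query text with the dbpedia.org FROM clause removed."""
    return FROM_CLAUSE.sub("", query)

with_graph = """sparql SELECT ?p ?o from <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };"""

print(drop_graph(with_graph))
```

The pattern requires the closing > directly after dbpedia.org, so resource URIs such as the one in the first query are left untouched.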
