GCM-Bench: A General Benchmark for RDF Data Management Systems
Biological data, including data produced by modern scientific instruments and organized by scientists, is growing to an unprecedented scale. However, no tool or system has been developed specifically for the big data produced by biologists. General-purpose systems, e.g. Hadoop for storage and RDF for the semantic web, are the only choice for processing this data, which inevitably leads to compatibility issues. We therefore built a benchmark called GCM-Bench to evaluate the performance of general-purpose systems in the biological field.
The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for the conceptual description or modeling of information implemented in web resources, using a variety of syntax notations and data serialization formats. Biologists have extended it to organize bio-information because data varies widely across the sub-domains of biology. Such data needs a flexible format that can blend different source formats into a single, mixed one and can express custom schemas. With RDF, you can define a schema for your data structure using RDFS and OWL, so biological data such as enzyme, protein, and gene data can each define their own format, much like a relational schema.
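To make the idea of schema plus data concrete, here is a minimal sketch in plain Python that writes a tiny RDFS schema and two instance triples as N-Triples. The `rdf:` and `rdfs:` IRIs are the standard W3C ones; the `http://example.org/bio/` namespace and the class names are assumptions chosen for illustration and are not taken from the GCM data set.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
EX = "http://example.org/bio/"  # hypothetical namespace, for illustration only


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms as N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Schema: declare Enzyme as a subclass of Protein (assumed class names).
schema = [
    (iri(EX + "Protein"), iri(RDF + "type"), iri(RDFS + "Class")),
    (iri(EX + "Enzyme"), iri(RDF + "type"), iri(RDFS + "Class")),
    (iri(EX + "Enzyme"), iri(RDFS + "subClassOf"), iri(EX + "Protein")),
]

# Instance data that follows the schema.
data = [
    (iri(EX + "enzyme/1"), iri(RDF + "type"), iri(EX + "Enzyme")),
    (iri(EX + "enzyme/1"), iri(RDFS + "label"), literal("example enzyme")),
]

print(to_ntriples(schema + data))
```

Schema and data live in the same triple format, which is what lets each sub-domain add its own classes without changing the storage layout.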
Because of the features described above, RDF has become a leading choice for biological data. We need to benchmark the systems that handle this RDF data in order to evaluate their performance in the field of biology.
We have selected several systems that have been used to process RDF in recent years. These include RDF stores such as Sesame, Virtuoso, Jena, and gStore, as well as systems that keep the data in relational databases and use SPARQL-to-SQL rewriters, such as D2R Server and Virtuoso RDF Views. We have built GCM-Bench to evaluate these systems and will release it as open source on GitHub.
GCM-Bench is an integrated benchmark environment for RDF data management systems, composed of a data generation tool, workloads, and an automatic testing framework. It can evaluate the performance of RDF data management systems such as Jena and gStore and generate evaluation reports for users.
Currently, GCM-Bench contains more than six real data sets and a data generation tool that can generate up to terabytes of data to support different levels of testing. We have also built several system workloads covering different aspects and more than 10 query workloads. The automatic testing framework can run on almost all RDF data management systems and generates an evaluation report for users; because every system is tested in the same unified environment, the results are more precise and standardized.
- GCM, Global Catalogue of Microorganisms (gcm.wdcm.org), 286m
- UniProt, The Universal Protein Resource (www.uniprot.org), 1m
- BioGRID, the Biological General Repository for Interaction Datasets (thebiogrid.org), 21m
- DrugBank, DrugBank database (www.drugbank.ca), 14m
- LinkedCT, The Linked Clinical Trials (linkedct.org), 49m
- DBPedia, structured information from Wikipedia (wiki.dbpedia.org), 23b
The data generation tool can generate up to terabytes of synthetic GCM-like data, modeled on the microbiological data set GCM published by WDCM, to support different levels of benchmarking. The generated data set contains enzyme, pathway, taxonomy, protein, gene, and other data. The tool can be used either on its own or integrated into the automatic testing framework.
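The following is a minimal sketch of how a scalable synthetic generator of this kind could work; it is not the GCM-Bench tool itself. The namespace, the `relatedTo` predicate, and the function name `generate_triples` are all hypothetical. A seeded random generator keeps runs reproducible, and output size grows linearly with the requested number of entities.

```python
import random

EX = "http://example.org/gcm/"  # hypothetical namespace, not the real GCM vocabulary
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ENTITY_KINDS = ["Enzyme", "Pathway", "Taxonomy", "Protein", "Gene"]


def generate_triples(num_entities, seed=0):
    """Yield N-Triples lines for num_entities synthetic entities.

    Each entity gets one type triple and, except for the first entity,
    one link to a randomly chosen earlier entity.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    for i in range(num_entities):
        kind = rng.choice(ENTITY_KINDS)
        subject = f"<{EX}entity/{i}>"
        yield f"{subject} <{RDF_TYPE}> <{EX}{kind}> ."
        if i > 0:
            target = rng.randrange(i)
            yield f"{subject} <{EX}relatedTo> <{EX}entity/{target}> ."


if __name__ == "__main__":
    # Print a tiny sample; a real run would stream lines to a file instead.
    for line in generate_triples(3):
        print(line)
```

Because the function is a generator, data can be streamed to disk without holding the whole data set in memory, which matters at larger scales.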
Benchmark workloads are divided into two categories: data loading workloads and data query workloads. We focus on the performance of data query workloads.
Systems that store RDF data natively need to load the data first and then build an RDF data summary, indexes, and other metadata. Systems backed by relational databases, meanwhile, need to analyze the data, dump it, and perform other processing steps. This loading process usually takes a long time, so loading performance determines how soon the data can be used.
One of the major functions of an RDF data management system is answering queries written in the SPARQL query language, so most of our workloads evaluate query performance. GCM-Bench provides more than 10 SPARQL queries that test the query performance of a system while also validating its support for the wide range of SPARQL features. SPARQL provides four query forms: SELECT, ASK, DESCRIBE, and CONSTRUCT. SELECT is the most frequently used form, and most of the query workloads are SELECT queries. Keywords that can appear in the WHERE pattern of a SELECT query, such as UNION, FILTER, and OPTIONAL, and aggregate functions such as COUNT, MAX, and SUM are each tested by a specific workload.
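As an illustration of what such feature-specific workloads might look like, the sketch below defines a few SPARQL queries as Python strings, one per feature, and a small helper that reports the query form. The `ex:` vocabulary, the workload names, and the `query_form` helper are assumptions made for this example and are not the actual GCM-Bench queries; the helper is a simple heuristic, not a SPARQL parser.

```python
import re

PREFIX = "PREFIX ex: <http://example.org/gcm/>\n"  # hypothetical vocabulary

# One illustrative query per SPARQL feature under test.
WORKLOADS = {
    "select_filter": PREFIX + """
        SELECT ?gene ?length WHERE {
            ?gene a ex:Gene ; ex:length ?length .
            FILTER(?length > 1000)
        }""",
    "select_optional": PREFIX + """
        SELECT ?protein ?name WHERE {
            ?protein a ex:Protein .
            OPTIONAL { ?protein ex:name ?name }
        }""",
    "select_union": PREFIX + """
        SELECT ?x WHERE {
            { ?x a ex:Gene } UNION { ?x a ex:Protein }
        }""",
    "select_count": PREFIX + """
        SELECT (COUNT(?enzyme) AS ?total) WHERE { ?enzyme a ex:Enzyme }""",
    "ask_exists": PREFIX + "ASK { ?s a ex:Pathway }",
}

_FORM = re.compile(r"\b(SELECT|ASK|DESCRIBE|CONSTRUCT)\b", re.IGNORECASE)


def query_form(query):
    """Return the query form keyword, or None if none is found.

    Heuristic: IRIs in angle brackets are removed first so that a keyword
    inside an IRI is not mistaken for the query form.
    """
    without_iris = re.sub(r"<[^>]*>", "", query)
    match = _FORM.search(without_iris)
    return match.group(1).upper() if match else None


for name, sparql in WORKLOADS.items():
    print(name, query_form(sparql))
```

Keeping one feature per query makes it easier to tell which SPARQL construct a system handles slowly or does not support.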
We have built an automatic testing framework for GCM-Bench that integrates a series of testing tools, including a data generator, workloads, system monitoring tools, a report generator, and a test task scheduler. It is a general framework that supports almost all RDF data management systems. The only thing you need to do before using it is write a driver that connects the RDF system to the framework. Everything else runs automatically, and an evaluation report is generated as the result. You can also customize the data sets and workloads by modifying the configuration file.
- data generator
- system monitoring tools
- report generator
- test task scheduler
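The driver idea can be sketched as follows. This is only an illustration of the pattern, not the GCM-Bench driver API: the class `RdfSystemDriver`, its `load` and `query` methods, the toy `LineCountDriver`, and `run_benchmark` are all hypothetical names. The harness times the loading step and each query through the driver and collects the results in a dictionary that a report generator could consume.

```python
import time
from abc import ABC, abstractmethod


class RdfSystemDriver(ABC):
    """Hypothetical interface a system-specific driver would implement."""

    @abstractmethod
    def load(self, dataset_path):
        """Load the RDF file at dataset_path into the system."""

    @abstractmethod
    def query(self, sparql):
        """Run a SPARQL query and return the result rows as a list."""


class LineCountDriver(RdfSystemDriver):
    """Toy stand-in for a real system: ignores the query text and
    returns every non-empty line of the loaded file."""

    def __init__(self):
        self.lines = []

    def load(self, dataset_path):
        with open(dataset_path, encoding="utf-8") as handle:
            self.lines = [line for line in handle if line.strip()]

    def query(self, sparql):
        return list(self.lines)


def run_benchmark(driver, dataset_path, workloads):
    """Time the loading step and each query; return a report dictionary."""
    report = {"queries": {}}
    start = time.perf_counter()
    driver.load(dataset_path)
    report["load_seconds"] = time.perf_counter() - start
    for name, sparql in workloads.items():
        start = time.perf_counter()
        rows = driver.query(sparql)
        elapsed = time.perf_counter() - start
        report["queries"][name] = {"seconds": elapsed, "rows": len(rows)}
    return report
```

A driver for a real system would replace `LineCountDriver`, for example by sending the query to that system's SPARQL endpoint, while `run_benchmark` stays unchanged.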
GCM-Bench is a benchmark system for evaluating the performance of RDF data management systems in the biological field, but it can also be used in other fields. It is a general benchmark for RDF systems.