Multi-tenancy version


What is the Multi-tenancy version?

The Multi-tenancy version of BigDataBench is a benchmark suite that supports scenarios in which multiple tenants run heterogeneous applications in the same data center, for example latency-critical online services (e.g. a web search engine) alongside latency-insensitive offline analytics. The basic idea of the Multi-tenancy version is to understand the behavior of realistic big data workloads (both service and analytics workloads) and their users. The workload suite is designed and implemented based on workload traces from real-world applications, and it allows these workloads to be flexibly configured and replayed according to users’ varying requirements. At present, the Multi-tenancy version consists of three representative workloads: the Nutch search engine, Hadoop MapReduce workloads, and Shark query workloads, which correspond to three real-world workload traces: the Sogou trace, the Facebook trace, and the Google trace, respectively. The process of the Multi-tenancy version is as follows:


What is a data center?

A data center reflects the idea that the network is the computer: it links large amounts of computing, storage, and software resources together to form a huge shared pool of virtual IT resources that provides services over the Internet. Data centers focus on high concurrency, diverse application performance, low power consumption, automation, and high efficiency.

What are big data workloads?

Big data workloads can be characterized from three aspects:

  • Data characteristics

         –     Data type, source

         –     Input/output data volumes, distributions

  • Computation semantics

         –     Source code

         –     Big data software stacks

  • Job arrival patterns

         –     Arrival rate

         –     Arrival sequence
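
As an illustration only, the three aspects above could be captured in a single record. The following is a minimal sketch, not part of the benchmark itself: the class name `WorkloadProfile`, its field names, and the way the arrival rate is computed are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkloadProfile:
    """Hypothetical record covering the three aspects listed above."""

    # Data characteristics
    data_type: str        # e.g. "text" or "table"
    data_source: str      # e.g. "search log"
    input_bytes: int
    output_bytes: int
    # Computation semantics
    program: str          # name of the job's program
    software_stack: str   # e.g. "Hadoop", "Shark", "Nutch"
    # Job arrival patterns: submission times in seconds
    arrival_times: List[float] = field(default_factory=list)

    def arrival_sequence(self) -> List[float]:
        """Submission times in the order the jobs arrived."""
        return sorted(self.arrival_times)

    def inter_arrival_gaps(self) -> List[float]:
        """Seconds between consecutive job submissions."""
        seq = self.arrival_sequence()
        return [later - earlier for earlier, later in zip(seq, seq[1:])]

    def arrival_rate(self) -> float:
        """Average number of arrivals per second over the observed window."""
        seq = self.arrival_sequence()
        if len(seq) < 2:
            return 0.0
        window = seq[-1] - seq[0]
        return (len(seq) - 1) / window if window > 0 else 0.0
```

A profile built from four submissions spaced ten seconds apart would report an arrival rate of 0.1 jobs per second under this definition.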

What are mixed big data workloads?

  • Time: Different workloads are launched simultaneously.
  • Space: Different workloads are hosted on the same machines.
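
The two dimensions above can be checked mechanically for a pair of job runs. This is a hedged sketch under assumed inputs: the `JobRun` record and its fields are hypothetical, and "launched simultaneously" is interpreted here as overlapping execution intervals.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class JobRun:
    """Hypothetical record of one job execution."""
    workload: str              # e.g. "search" or "analytics"
    start: float               # start time in seconds
    end: float                 # end time in seconds
    machines: FrozenSet[str]   # hosts the job ran on


def mixing_dimensions(a: JobRun, b: JobRun) -> Dict[str, bool]:
    """Report in which dimensions two runs of different workloads are mixed."""
    different = a.workload != b.workload
    # Time: the two execution intervals overlap.
    time_mixed = different and a.start < b.end and b.start < a.end
    # Space: the two runs share at least one machine.
    space_mixed = different and bool(a.machines & b.machines)
    return {"time": time_mixed, "space": space_mixed}
```

For example, a search service running from 0 s to 100 s on hosts n1 and n2 and an analytics job running from 50 s to 80 s on hosts n2 and n3 are mixed in both time and space.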

Who will use the Multi-tenancy version?

Researchers intending to use the Multi-tenancy version should fully understand and abide by the licensing terms of the various components used in our benchmark.

Why the Multi-tenancy version?

Big data benchmarks are developed to evaluate and compare the performance of big data systems and architectures. Successful and efficient benchmarking provides realistic and accurate measurements of big data systems and thereby addresses two objectives:

(1) Promoting the development of big data technology, i.e. developing new architectures (processors, memory systems, and network systems), innovative theories, algorithms, techniques, and software stacks to manage big data and extract their value and hidden knowledge.

(2) Assisting system owners in making decisions about planning system features, tuning system configurations, validating deployment strategies, and conducting other efforts to improve systems. For example, benchmarking results can identify the performance bottlenecks in big data systems, which helps optimize system configuration and resource allocation.

To facilitate the deployment, configuration, and management of large-scale MapReduce and search engine clusters, we need to observe the performance of real, specific workloads. Existing benchmarks cannot meet this demand. Hence we developed the Multi-tenancy version to understand the behavior of big data workloads by analyzing real-world workload traces, thereby enabling realistic evaluations.

Key Features of the Multi-tenancy version

1. Repository of Shark traces and real-life search engine workloads from production systems.

2. Use of the K-means clustering algorithm to match the workload information.

3. Workload synthesis tools to generate representative test workloads by parsing historical MapReduce cluster traces and Sogou request traces.

4. Mixed workload replay tools to execute the matched workloads with low performance overhead.
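
The K-means matching step in feature 2 can be illustrated with a small, self-contained sketch. It is not the benchmark's own implementation: the two-dimensional job features, the template names, and the farthest-point initialization are assumptions chosen to keep the example deterministic.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid_of(points: Sequence[Point]) -> Point:
    """Component-wise mean of a non-empty group of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def initial_centroids(points: Sequence[Point], k: int) -> List[Point]:
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(distance(p, c) for c in centroids))
        )
    return centroids


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


def kmeans(points: Sequence[Point], k: int, max_iterations: int = 50):
    """Plain K-means; returns (centroids, cluster index for each point)."""
    centroids = initial_centroids(points, k)
    assignments: List[int] = []
    for _ in range(max_iterations):
        assignments = [nearest(p, centroids) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            updated.append(centroid_of(members) if members else centroids[i])
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
    return centroids, assignments


def match_clusters(centroids: Sequence[Point], templates: Dict[str, Point]) -> List[str]:
    """For each cluster centroid, pick the closest candidate workload."""
    return [
        min(templates, key=lambda name: distance(c, templates[name]))
        for c in centroids
    ]


# Hypothetical job features, e.g. (input size, run time) on a normalized scale.
trace_jobs = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (9.2, 8.8)]
# Hypothetical candidate workloads described in the same feature space.
templates = {"small-job": (1.0, 1.0), "large-job": (10.0, 10.0)}

centroids, assignments = kmeans(trace_jobs, k=2)
matched = match_clusters(centroids, templates)
```

Here the trace jobs fall into two clusters, and each cluster is matched to the candidate workload nearest to its centroid. In practice the feature set and the number of clusters would need to be chosen for the trace at hand.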

The Multi-tenancy version is currently integrated with Hadoop, Shark, and Nutch Search. We believe data center (DC) cluster operators can use the Multi-tenancy version to accomplish other previously challenging tasks, including but not limited to: resource provisioning and planning in multiple dimensions; configuration tuning for diverse job types within a workload; and anticipating workload consolidation behavior and quantifying workload superposition in multiple dimensions.
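
To make the replay idea in feature 4 concrete, the sketch below turns a list of trace arrivals into launch offsets for a replay run. It is an assumption-laden illustration rather than the actual replay tool: the function name `replay_schedule` and the `speedup` parameter are hypothetical.

```python
from typing import List, Tuple


def replay_schedule(
    arrivals: List[Tuple[float, str]], speedup: float = 1.0
) -> List[Tuple[float, str]]:
    """Convert (trace timestamp, job name) pairs into (launch offset, job name).

    Offsets are seconds from the start of the replay. A speedup of 2.0
    compresses the original inter-arrival gaps by half.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not arrivals:
        return []
    ordered = sorted(arrivals)   # replay in arrival order
    first = ordered[0][0]        # the first job launches at offset 0
    return [((t - first) / speedup, job) for t, job in ordered]
```

A replay driver could then sleep until each offset and submit the corresponding job, which preserves the arrival sequence of the trace while letting the user scale the arrival rate.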

Benchmark Programs

The current release of the Multi-tenancy version contains the following three kinds of workloads.

Workload | Software stack | Introduction
Nutch search | Client and web server (Apache Tomcat 6.0.26 and the Nutch search engine) | Nutch Search is a search engine model used to evaluate data center and cloud computing systems.
Offline analytic workload | Script matching, workload synthesis tools, replay script generator | Tool kit used to evaluate the performance of Hadoop and MapReduce in a data center workload cluster.
Shark | Script matching, workload synthesis tools, replay script generator | Tool kit used to evaluate the performance of Shark in a data center.


Multi-tenancy version user manual [User Manual]

Downloading software packages [Multi-tenancy version]




Rui Han

Wenqian Zhang

Shulin Zhan

Jiangtao Xu