Background and Introduction

This tool focuses on a mix of workloads whose arrivals follow patterns observed in real-world traces. Two types of representative data center workloads are considered:

–     Long-running service workloads. These workloads offer online services, such as web search engines and e-commerce sites, to end users, and the services usually keep running for months or years. The tenants of such workloads are service end users.
–     Short-term data analytic workloads. These workloads process input data of different scales (from KB to PB) within relatively short periods (from a few seconds to several hours). Example workloads are Hadoop, Spark and Shark jobs. The tenants of such workloads are job submitters.

Motivation: Realistic workloads

Both types of workload are pervasive and co-located in today’s data centers, where tenants and workloads in varying proportions share the same computing infrastructure. To produce trustworthy benchmarking results, benchmark tools that support such realistic mixed workloads are therefore needed. Three major challenges must be addressed:

–     Actual workloads. Data analytic workloads have a variety of data characteristics (data types and sources, input/output data volumes and distributions), computation semantics (source code or implementation logic), and software stacks (e.g. Hadoop, Spark and MPI). Hence it is difficult to emulate the behaviors of such highly diverse workloads using only synthetic workloads.
–     Real workload traces. Workload traces are the most realistic source of workloads’ arrival patterns (i.e. request/job submission times, arrival rates and sequences). Hence generating workloads based on real traces is an equally important aspect of realistic workloads.
–     Scalable workloads. The scale of workloads should be flexibly adjustable to meet users’ requirements in different benchmarking scenarios, while the workloads keep a realistic mix (i.e. the number and priorities of tenants, and their changes over time) that corresponds to real-world scenarios.
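The last two challenges can be illustrated with a small sketch. It is not taken from BigDataBench-MT; the record layout (submission time in seconds, tenant id) and the function names are assumptions made for illustration. It computes an arrival rate per time window and scales a trace down by sampling within each tenant, so that each tenant's share of the workload stays roughly the same.

```python
from collections import Counter, defaultdict


def arrival_rate(submit_times, window):
    """Count arrivals in consecutive windows of `window` seconds."""
    counts = Counter(int(t // window) for t in submit_times)
    if not counts:
        return []
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def scale_down(trace, keep_every):
    """Keep every `keep_every`-th record of each tenant, in arrival order.

    Sampling per tenant (rather than over the whole trace) keeps each
    tenant's share of the workload roughly unchanged.
    """
    seen = defaultdict(int)
    kept = []
    for record in sorted(trace, key=lambda r: r[0]):
        tenant = record[1]
        if seen[tenant] % keep_every == 0:
            kept.append(record)
        seen[tenant] += 1
    return kept


# Hypothetical trace of (submit_time_seconds, tenant_id) records:
# tenant "A" submits every 2 s, tenant "B" every 6 s, for two minutes.
trace = [(t, "A") for t in range(0, 120, 2)] + [(t, "B") for t in range(1, 120, 6)]

small = scale_down(trace, keep_every=2)
mix = Counter(tenant for _, tenant in small)
rate = arrival_rate([t for t, _ in small], window=60)
```

In this toy trace the 3:1 ratio between the two tenants is preserved after scaling, and the per-minute arrival rate is halved uniformly rather than distorted.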

Mixed workloads

Workloads are mixed along two dimensions:

–     Time: Different workloads are launched simultaneously by concurrent tenants.
–     Space: Different workloads are hosted on the same machines.

Existing benchmarks

Existing benchmarks either generate actual workloads based on probability models or replay workload traces using synthetic workloads. As Table 1 shows, neither group supports mixed workloads.

Table 1. Existing benchmarks

Benchmarks | Actual workloads | Real workload traces | Mixed workloads
AMPLab benchmark [1], Linkbench [2], Bigbench [3], YCSB [4], CloudSuite [5] | Yes | No | No
GridMix [6], SWIM [7] | No | Yes | No

[1] AMPLab benchmark.
[2] T. G. Armstrong, V. Ponnekanti, D. Borthakur, and M. Callaghan. Linkbench: a database benchmark based on the facebook social graph. In SIGMOD’13, pages 1185–1196. ACM, 2013.
[3] A. Ghazal, M. Hu, T. Rabl, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen. Bigbench: Towards an industry standard benchmark for big data analytics. In SIGMOD’13. ACM, 2013.
[4] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with ycsb. In SoCC’10, pages 143–154, New York, NY, USA, 2010. ACM.
[5] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In ACM SIGPLAN Notices, volume 47, pages 37–48. ACM, 2012.
[6] GridMix.
[7] Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads. Proceedings of the VLDB Endowment, 5(12):1802–1813, 2012.

System Introduction

The multi-tenancy version of BigDataBench, called BigDataBench-MT, is a benchmark tool that aims to support scenarios in which multiple tenants run mixed heterogeneous workloads in cloud data centers. The basic idea of BigDataBench-MT is to generate a mix of actual workloads on the basis of real workload traces within a multi-tenant framework.

The framework of BigDataBench-MT is shown in Figure 1. This workload suite has been designed and implemented based on workload traces from real-world applications, and it allows these workloads to be flexibly configured and replayed according to users’ varying requirements. BigDataBench-MT has three modules:

  • Benchmark User Portal: It allows users to specify their benchmarking requirements, including the machine type and number to be tested, and the types of workload to use.
  • Combiner of Workloads and Traces: It matches the actual workloads against the selected workload traces, and outputs workload replaying scripts to guide the workload generation.
  • Multi-tenant Workload Generator: It extracts the tenant information from the scripts and constructs a multi-tenant framework to generate a mix of service and data analytic workloads.
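The hand-off between the last two modules can be sketched as follows. This is only an illustration under stated assumptions: the `ReplayEntry` fields, the function names `combine` and `per_tenant_schedule`, and the workload names are hypothetical and do not describe the actual script format used by BigDataBench-MT.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayEntry:
    """One line of a hypothetical replaying script."""
    offset_s: float  # seconds after the start of the replay
    tenant: str      # tenant that submits the workload
    workload: str    # name of the actual workload to launch


def combine(trace, matches):
    """Turn trace records into replay entries ordered by launch time.

    trace:   list of (submit_time_s, tenant, trace_job_id)
    matches: dict mapping trace_job_id -> actual workload name
    """
    if not trace:
        return []
    start = min(t for t, _, _ in trace)
    entries = [ReplayEntry(t - start, tenant, matches[job]) for t, tenant, job in trace]
    return sorted(entries, key=lambda e: e.offset_s)


def per_tenant_schedule(entries):
    """Group replay entries by tenant, as a generator might before launching."""
    schedule = defaultdict(list)
    for e in entries:
        schedule[e.tenant].append((e.offset_s, e.workload))
    return dict(schedule)


# Hypothetical inputs: three trace jobs from two tenants, already matched
# to actual workloads.
trace = [(100.0, "t1", "j1"), (103.5, "t2", "j2"), (101.0, "t1", "j3")]
matches = {"j1": "hadoop-sort", "j2": "nutch-search", "j3": "spark-kmeans"}

schedule = per_tenant_schedule(combine(trace, matches))
```

Keeping launch times as offsets from the first submission means the same script can be replayed at any wall-clock time while preserving the inter-arrival gaps of the trace.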

Figure 1. The BigDataBench-MT framework

Key Features of BigDataBench-MT

1. Built on real-life search-engine and data analytic workload traces.

Table 2. Real-world workload traces in BigDataBench-MT

Workload trace | Duration | Tenant number | Request/job number
SoGou [8] | 50 days | 9 million | 43 million
Facebook [9] | 4 hours | - | 36971
Google [10] | 29 days | 5K | 1000K

[8] Sogou user query logs.
[9] Facebook workload repository.
[10] Google cluster workload trace.

2. Applies a robust machine learning algorithm that matches the workload characteristics of actual workloads with those of anonymous workloads in real traces, in order to generate workload replaying scripts.

3. Provides a convenient multi-tenant framework that supports the scalable generation of both service and data analytic workloads.
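The matching step in feature 2 can be pictured with a deliberately simple stand-in. The text above does not name the algorithm, so the sketch below uses plain nearest-neighbour matching on min-max normalised features, which is an assumption and not necessarily what BigDataBench-MT implements. The feature choice (input bytes, runtime in seconds) and all job and workload names are hypothetical.

```python
import math


def normalise(vectors):
    """Min-max scale each feature column to [0, 1]."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(vec, lo, hi))
        for vec in vectors
    ]


def match(trace_jobs, workloads):
    """Assign each anonymous trace job to the closest actual workload.

    Both arguments map a name to a feature vector of the same length.
    """
    names = list(workloads)
    ids = list(trace_jobs)
    # Scale workloads and trace jobs together so distances are comparable.
    scaled = normalise([workloads[n] for n in names] + [trace_jobs[i] for i in ids])
    w_scaled = scaled[:len(names)]
    j_scaled = scaled[len(names):]
    result = {}
    for job_id, jv in zip(ids, j_scaled):
        best = min(range(len(names)), key=lambda k: math.dist(jv, w_scaled[k]))
        result[job_id] = names[best]
    return result


# Hypothetical features: (input bytes, runtime in seconds).
workloads = {"hadoop-sort": (5e9, 600.0), "spark-kmeans": (2e8, 45.0)}
trace_jobs = {"j1": (4e9, 500.0), "j2": (1e8, 30.0)}

matches = match(trace_jobs, workloads)
```

Normalising before measuring distance keeps a feature with large raw values, such as bytes, from dominating one with small values, such as seconds.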

Supported Workloads

The current multi-tenancy version supports the following workloads.

Table 3. The supported workloads in BigDataBench-MT

Workload | Software stack | Introduction
Service workload (Nutch search engine) | Apache Tomcat 6.0.26, the Nutch Search Engine, and Apache-Storm 0.9.3 | Nutch Search is a web search engine and a typical time-critical service workload.
Data analytic workloads (Hadoop MapReduce jobs) | Hadoop 1.2.1 and Mahout 0.6 | MapReduce jobs are a major type of data center workload for data-intensive computing.
Data analytic workloads (Spark jobs) | Spark 0.8.0 | Spark jobs are a major type of in-memory computing data center workload.
Data analytic workloads (Shark queries) | (Version 0.5) Shark-0.8.0 | In-memory data analytic workloads built on Spark.


Download the software package (54.8 KB): [Multi-tenancy version]

Download the BigDataBench User Manual: [User Manual]