Background and Introduction
This tool focuses on a mix of workloads whose arrival patterns are derived from real-world traces. Two types of representative data center workloads are considered:
– Long-running service workloads. These workloads offer online services, such as web search engines and e-commerce sites, to end users, and the services usually keep running for months or years. The tenants of such workloads are service end users.
– Short-term data analytic workloads. These workloads process input data of different scales (from KB to PB) over relatively short periods (from a few seconds to several hours). Example workloads are Hadoop, Spark, and Shark jobs. The tenants of such workloads are job submitters.
Motivation: Realistic workloads
Both types of workload pervasively co-exist and co-locate in today's data centers, where varying mixes of tenants and workloads share the same computing infrastructure. Hence, to produce trustworthy benchmarking results, benchmark tools that support such realistic mixed-workload scenarios are urgently needed. Three major challenges must be addressed:
– Actual workloads. Data analytic workloads have a variety of data characteristics (data types and sources, and input/output data volumes and distributions), computation semantics (source code or implementation logic), and software stacks (e.g. Hadoop, Spark, and MPI). It is therefore difficult to emulate the behavior of such highly diverse workloads using only synthetic workloads.
– Real workload traces. Workload traces are the most realistic data sources describing workloads' arrival patterns (i.e. the submission times, arrival rates, and sequences of requests/jobs). Hence, generating workloads based on real traces is an equally important aspect of realistic workloads.
– Scalable workloads. The scale of the workloads must be flexibly adjustable to meet users' requirements in different benchmarking scenarios, while still keeping a realistic mix (i.e. the number and priorities of tenants, and their changes over time) that corresponds to real-world scenarios.
In such scenarios, the workloads are mixed in two dimensions:
– Time: different workloads are launched simultaneously by concurrent tenants.
– Space: different workloads are hosted on the same machines.
Existing benchmarks either generate actual workloads based on probability models, or replay workload traces using synthetic workloads, but none combines both with mixed workloads (Table 1).
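The two existing approaches can be contrasted with a minimal sketch. This is illustrative code, not part of any of the listed benchmarks; the function names and the Poisson arrival assumption for the model-based case are our own:

```python
import random

def model_based_arrivals(rate_per_sec, duration_sec, seed=0):
    """Synthesize job arrival times from a Poisson process, the
    probability-model approach: inter-arrival gaps are exponential."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)  # random gap to next arrival
        if t >= duration_sec:
            return arrivals
        arrivals.append(t)

def trace_based_arrivals(trace_rows):
    """Replay arrival times exactly as recorded in a real trace
    (each row assumed to carry a 'submit_time' field)."""
    return sorted(row["submit_time"] for row in trace_rows)
```

The model-based arrivals are statistically plausible but lose trace-specific structure (bursts, diurnal patterns); the trace-based arrivals preserve it exactly.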
Table 1. Existing benchmarks
|Benchmarks|Actual workloads|Real workload traces|Mixed workloads|
|---|---|---|---|
|AMPLab benchmark, LinkBench, BigBench, YCSB, CloudSuite|Yes|No|No|
|GridMix, SWIM|No|Yes|No|
Amplab benchmark. https://amplab.cs.berkeley.edu/benchmark/.
T. G. Armstrong, V. Ponnekanti, D. Borthakur, and M. Callaghan. LinkBench: a database benchmark based on the Facebook social graph. In SIGMOD'13, pages 1185–1196. ACM, 2013.
A. Ghazal, M. Hu, T. Rabl, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen. BigBench: towards an industry standard benchmark for big data analytics. In SIGMOD'13. ACM, 2013.
B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC'10, pages 143–154. ACM, 2010.
M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In ACM SIGPLAN Notices, volume 47, pages 37–48. ACM, 2012.
Gridmix. http://hadoop.apache.org/docs/stable1/gridmix.html.
Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads. Proceedings of the VLDB Endowment, 5(12):1802–1813, 2012.
The multi-tenancy version of BigDataBench, called BigDataBench-MT, is a benchmark tool that supports scenarios in which multiple tenants run mixed heterogeneous workloads in cloud data centers. The basic idea of BigDataBench-MT is to generate a mix of actual workloads on the basis of real workload traces within a multi-tenant framework.
The framework of BigDataBench-MT is shown in Figure 1. The workload suite has been designed and implemented based on workload traces from real-world applications, allowing these workloads to be flexibly configured and replayed according to users' varying requirements. BigDataBench-MT has three modules:
- Benchmark User Portal: allows users to specify their benchmarking requirements, including the type and number of machines to be tested and the types of workload to use.
- Combiner of Workloads and Traces: matches the real workloads against the selected workload traces and outputs workload replaying scripts that guide workload generation.
- Multi-tenant Workload Generator: extracts the tenant information from the scripts and constructs a multi-tenant framework to generate a mix of service and data analytic workloads.
Figure 1. The BigDataBench-MT framework
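The last stage of this pipeline, replaying a generated script, can be sketched as follows. The comma-separated script format and the field names are hypothetical stand-ins for whatever the Combiner actually emits, not the tool's real file format:

```python
from collections import namedtuple

# One replay entry: when to submit, which tenant, which workload, what to run.
ReplayEntry = namedtuple("ReplayEntry", "submit_time tenant workload command")

def parse_replay_script(lines):
    """Parse a hypothetical replay script whose lines read
    'submit_time,tenant,workload,command'; '#' lines are comments."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        t, tenant, workload, command = line.split(",", 3)
        entries.append(ReplayEntry(float(t), tenant, workload, command))
    return sorted(entries)  # namedtuples sort by submit_time first

def dispatch(entries, launch):
    """Hand each entry, in arrival order, to a launcher callback
    (e.g. one that sleeps until submit_time and forks the workload)."""
    for entry in entries:
        launch(entry)
```

Separating parsing from dispatch keeps the scheduler testable without actually launching Hadoop or Spark jobs.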
Key Features of BigDataBench-MT
1. Built on real-life search-engine and data analytic workload traces.
Table 2. Real-world workload traces in BigDataBench-MT
|Workload trace|Duration|Tenant number|Request/job number|
|---|---|---|---|
|Sogou|50 days|9 million|43 million|
|Google|29 days|5K|1000K|
 Sogou user query logs. http://www.sogou.com/labs/dl/q-e.htm
 Google cluster workload trace. http://code.google.com/p/googleclusterdata/.
2. Applies a robust machine learning algorithm to match the workload characteristics of real workloads against the anonymous workloads in real traces, generating workload replaying scripts.
3. Provides a convenient multi-tenant framework to support the scalable generation of both service and data analytic workloads.
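The matching step in feature 2 can be illustrated with a deliberately simplified sketch: here a nearest-neighbor rule over resource-usage feature vectors stands in for the tool's actual machine learning algorithm, and the feature tuples (e.g. normalized CPU and memory usage) are our own assumption:

```python
import math

def nearest_match(anon_features, real_workloads):
    """Match an anonymous trace job to the real workload whose
    resource-usage signature is closest in Euclidean distance.
    anon_features: tuple of numbers (e.g. (cpu, mem), both normalized).
    real_workloads: dicts with a 'features' tuple of the same length."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(real_workloads, key=lambda w: dist(w["features"], anon_features))
```

Given a matched real workload for each anonymous trace job, the Combiner can emit a replay script entry pairing the trace's submission time with the real workload's launch command.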
BigDataBench-MT is currently integrated with Hadoop, Shark, and the Nutch search engine. We believe data center cluster operators can use BigDataBench-MT to accomplish other previously challenging tasks, including but not limited to: resource provisioning and planning in multiple dimensions; tuning configurations for diverse job types within a workload; and anticipating workload consolidation behavior and quantifying workload superposition in multiple dimensions.
To support the successful and efficient benchmarking of data center systems, BigDataBench-MT addresses two objectives:
– Promoting the development of data center technology. Developing new architectures (processors, memory systems, and network systems), innovative theories, algorithms, techniques, and software stacks to manage big data and extract its value and hidden knowledge.
– System optimization. Assisting system owners in planning system features, tuning system configurations, validating deployment strategies, and conducting other efforts to improve their systems. For example, benchmarking results can identify the performance bottlenecks in big data systems, thus guiding the optimization of system configuration and resource allocation.
The current multi-tenancy version contains the following kinds of workloads.
Table 3. The supported workloads in BigDataBench-MT
|Workload|Software stack|Description|
|---|---|---|
|Service workload (Nutch search engine)|Apache Tomcat 6.0.26, the Nutch search engine, and Apache Storm 0.9.3|Nutch Search is a web search engine and a typical time-critical service workload.|
|Data analytic workloads (Hadoop MapReduce jobs)|Hadoop 1.2.1 and Mahout 0.6|MapReduce jobs are a major type of data center workloads for data-intensive computing.|
|Data analytic workloads (Spark jobs)|Spark 0.8.0|Spark jobs are a major type of in-memory computing data center workloads.|
|Data analytic workloads (Shark queries) (Version 0.5)|Shark 0.8.0|In-memory data analytic workloads built on Spark.|
Downloading the software package (312MB) [Multi-tenancy version]
Downloading the 24-hour workload Sogou/Google trace stored in MySQL database (1.8GB) [Workload-trace]
(Please contact us if you need the full version of workload trace stored in Impala (57GB))
Downloading the BigDataBench User Manual [User Manual]
If you need a citation for the multi-tenancy version of BigDataBench, please cite the following papers related to your work:
BigDataBench: a Big Data Benchmark Suite from Internet Services. [PDF]
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Cheng Zhen, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. The 20th IEEE International Symposium on High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, USA.
BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads. [PDF]
Rui Han, Shulin Zhan, Chenrong Shao, Junwei Wang, Lizy K. John, Gang Lu, Lei Wang. SoCC 2015 Poster paper.