Overview

News: LinkedIn user group. BPOE-7 submission deadline extended to Feb. 17 (CFP), and BigDataBench tutorial at ASPLOS'16. BigDataBench 3.2 released. BigData100 ranking. Two recent BigDataBench slide decks [BigDataBench-WBDB2015] [BigDataBench-HPBDC2015]. Technical report on eight dwarf workloads in big data analytics. China's first industry-standard big data benchmark suite. BigDataBench tutorial at MICRO 2014. BigDataBench on the MARSSx86, gem5, and Simics simulators.

Summary

BigDataBench is an open-source big data benchmark suite, the product of a multi-disciplinary research and engineering effort spanning systems, architecture, and data management, from both industry and academia. By nature, BigDataBench is a benchmark suite for scale-out workloads, in contrast to SPEC CPU (sequential workloads) and PARSEC (multithreaded workloads). The current version, BigDataBench 3.2, models five typical and important big data application domains: search engine, social networks, e-commerce, multimedia analytics, and bioinformatics. In total, it includes 14 real-world data sets and 34 big data workloads.

In specifying representative big data workloads, BigDataBench focuses on units of computation that appear frequently in OLTP, NoSQL, OLAP, interactive and offline analytics, graph computing, and streaming computing within each application domain. It identifies eight dwarf workloads in big data analytics (please refer to our technical report). It also considers a variety of data models with different types and semantics extracted from real-world data sets, including unstructured, semi-structured, and structured data. In addition, BigDataBench provides an end-to-end application benchmarking framework (please refer to our DASFAA paper) that allows the creation of flexible benchmarking scenarios by abstracting data operations and workload patterns; the framework can be extended to other application domains.

For the same big data benchmark specifications, different implementations are provided. For example, we and other developers have implemented the offline analytics workloads using MapReduce, MPI, Spark, and Flink, and the interactive analytics and OLAP workloads using Shark, Impala, and Hive. In addition to real-world data sets, BigDataBench provides a suite of parallel big data generation tools (BDGS) that generate scalable big data, e.g., at PB scale, from small- or medium-scale real-world data while preserving its original characteristics.
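BDGS ships its own generators and interfaces; the sketch below is not BDGS code, but a hypothetical, minimal illustration of the underlying idea: learn the word-frequency distribution of a small seed corpus, then sample an arbitrarily large synthetic corpus that preserves that distribution. All names here (`build_model`, `generate`) are illustrative.

```python
import random
from collections import Counter

def build_model(seed_docs):
    # Estimate the word-frequency distribution of the small seed corpus.
    counts = Counter(w for doc in seed_docs for w in doc.split())
    total = sum(counts.values())
    words = list(counts)
    weights = [counts[w] / total for w in words]
    return words, weights

def generate(seed_docs, n_docs, doc_len, seed=0):
    # Sample synthetic documents whose word frequencies follow the seed data.
    # A generator like this scales: n_docs can be made arbitrarily large.
    rng = random.Random(seed)
    words, weights = build_model(seed_docs)
    return [" ".join(rng.choices(words, weights=weights, k=doc_len))
            for _ in range(n_docs)]
```

The real BDGS text generator is considerably more sophisticated (it models topic structure, not just word frequencies), but the contract is the same: small real seed in, scalable synthetic data out.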

To model and reproduce multi-application or multi-user scenarios on clouds or in datacenters, we provide a multi-tenancy version of BigDataBench, which allows flexible configuration and replay of mixed workloads according to real workload traces: the Facebook, Google, and Sogou traces.
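The multi-tenancy version has its own configuration and trace formats; purely as a hypothetical sketch of the core mechanism, replaying a trace amounts to submitting each job at its recorded arrival time, optionally compressing the timeline. The `replay` and `submit` names below are illustrative, not the harness's actual API.

```python
import time

def replay(trace, submit, speedup=1.0):
    """Replay a workload trace of (arrival_time_seconds, job_spec) pairs.

    `submit` launches one job; `speedup` compresses the timeline, e.g.
    speedup=60.0 replays one hour of trace in one minute of wall time.
    """
    start = time.monotonic()
    for arrival, job in sorted(trace):
        # Wait until this job's (scaled) arrival time has elapsed.
        delay = arrival / speedup - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        submit(job)  # a real harness would launch this asynchronously
```

Mixing workloads then reduces to merging several traces (or several tenants' portions of one trace) into a single list before replaying.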

For systems and architecture research, e.g., architecture, OS, networking, and storage, the number of benchmarks is multiplied by the number of implementations and hence becomes massive. To reduce research and benchmarking cost, we select a small number of representative benchmarks, called the BigDataBench subset, according to workload characteristics from a specific perspective. For example, since simulation-based research is very time-consuming, for the architecture community we provide the BigDataBench architecture subset in MARSSx86, gem5, and Simics simulator versions, respectively.
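As a hedged illustration of how such a subset can be chosen (the published BigDataBench subsetting methodology is described in the project's papers; this is only a simplified stand-in), one can cluster workloads by their characteristic vectors, e.g., microarchitectural metrics, and keep from each cluster the workload closest to the cluster center:

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    # Plain k-means over tuples of floats; returns the final centers.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [tuple(sum(col) / len(col) for col in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

def pick_representatives(workloads, k):
    # workloads: {name: feature_vector}, e.g. normalized microarchitectural
    # metrics (cache miss ratios, IPC, branch MPKI, ...).
    names = list(workloads)
    centers = kmeans([workloads[n] for n in names], k)
    # Keep the workload nearest to each cluster center as its representative.
    return [min(names, key=lambda n: math.dist(workloads[n], c))
            for c in centers]
```

Simulating only the k representatives instead of every benchmark-implementation pair is what makes simulator-based studies tractable.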

Together with several industry partners, including Telecom Research Institute Technology, Huawei, Intel (China), Microsoft (China), IBM CDL, Baidu, Sina, INSPUR, ZTE, etc., we have also released China's first industry-standard big data benchmark suite, BigDataBench-DCA, which is a subset of BigDataBench.

Why BigDataBench?

As shown in Table 1, across nine desired properties, BigDataBench is more comprehensive than the other state-of-the-art big data benchmarks.

Table 1: The differences between BigDataBench and other benchmark suites.

|  | Spec[1] | App domains | Workload types | Workloads | Scalable data sets[2] | Diverse implem.[3] | Multi-tenancy[4] | Subset[5] | Simulator[6] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BigDataBench | Y | Five | Six[7] | Thirty-three[8] | Eight[9] | Y | Y | Y | Y |
| BigBench | Y | One | Three | Ten | Three | N | N | N | N |
| CloudSuite | N | N/A | Two | Eight | Three | N | N | N | Y |
| HiBench | N | N/A | Two | Ten | Three | N | N | N | N |
| CALDA | Y | N/A | One | Five | N/A | Y | N | N | N |
| YCSB | Y | N/A | One | Six | N/A | Y | N | N | N |
| LinkBench | Y | N/A | One | Ten | N/A | Y | N | N | N |
| AMP Benchmarks | Y | N/A | One | Four | N/A | Y | N | N | N |

[1] Spec is short for specification. Y indicates that a specification is provided; N indicates that it is not.

[2] Scalable data sets are extracted from real-world data sets. A number x indicates that x scalable data sets are extracted from real-world data sets.

[3] For diverse implementations, Y indicates that, for the same workload specification, diverse implementations using competing techniques are provided; N indicates that only a few implementations are provided.

[4] Y indicates that a multi-tenancy version is provided; N indicates that it is not.

[5] Y indicates that a subset of the benchmarks is provided; N indicates that there is none. For example, BigDataBench provides an architecture subset.

[6] Y indicates that simulator versions are provided; e.g., MARSSx86, gem5, and Simics versions are provided for BigDataBench. N indicates that they are not.

[7] The six workload types are Streaming, Offline Analytics, Cloud OLTP, Data Warehouse (DW), Graph, and Online Service.

[8] There are 42 workloads in the specification; we have implemented 34 of them.

[9] Eight of the real data sets are scalable, while the other seven are under development.

What’s New?

BigDataBench 3.2 adds graph and streaming frameworks and provides Flink implementations. Currently, BigDataBench includes 15 real-world data sets and 34 big data workloads. We also release a multi-tenancy version for multi-user or multi-application scenarios, and simulator versions (MARSSx86, gem5, and Simics) for the architecture community.

Methodology

Figure 1 summarizes the benchmarking methodology of BigDataBench. Overall, it comprises five steps: investigating and choosing important application domains; identifying typical workloads and data sets; proposing big data benchmark specifications; providing diverse implementations using competing techniques; and mixing different workloads to assemble multi-tenancy workloads or subsetting the big data benchmarks.

Figure 1: BigDataBench benchmarking methodology.


Benchmarks

BigDataBench is expanding and evolving rapidly. So far, we have proposed benchmark specifications modeling five typical application domains. The current version, BigDataBench 3.2, includes 14 real-world data sets and 33 big data workloads. Table 2 summarizes the real-world data sets and scalable data generation tools included in BigDataBench 3.2, covering the whole spectrum of data types (structured, semi-structured, and unstructured) and data sources (text, graph, image, audio, video, and table data). Table 3 presents BigDataBench from the perspectives of application domains, workloads, workload types, data sets, and software stacks. Some end users may care only about big data applications of a specific type. For example, to perform an apples-to-apples comparison of software stacks for offline analytics, they need only choose the benchmarks of the offline analytics type. For users who want to measure or compare big data systems and architectures, however, we suggest covering all of the benchmarks.

Table 2. Summary of data sets and data generation tools.

| # | Data set | Data size | Scalable data set |
| --- | --- | --- | --- |
| 1 | Wikipedia Entries | 4,300,000 English articles (unstructured text) | Text Generator of BDGS |
| 2 | Amazon Movie Reviews | 7,911,684 reviews (semi-structured text) | Text Generator of BDGS |
| 3 | Google Web Graph | 875,713 nodes, 5,105,039 edges (unstructured graph) | Graph Generator of BDGS |
| 4 | Facebook Social Network | 4,039 nodes, 88,234 edges (unstructured graph) | Graph Generator of BDGS |
| 5 | E-commerce Transaction Data | Table 1: 4 columns, 38,658 rows; Table 2: 6 columns, 242,735 rows (structured table) | Table Generator of BDGS |
| 6 | ProfSearch Person Resumes | 278,956 resumes (semi-structured table) | Table Generator of BDGS |
| 7 | ImageNet | ILSVRC2014 DET image data set (unstructured image) | Ongoing development |
| 8 | English broadcasting audio files | Sampled at 16 kHz, 16-bit linear sampling (unstructured audio) | Ongoing development |
| 9 | DVD Input Streams | 110 input streams, resolution 704x480 (unstructured video) | Ongoing development |
| 10 | Image scene | 39 image scene description files (unstructured text) | Ongoing development |
| 11 | Genome sequence data | cfa data format (unstructured text) | 4 volumes of data sets |
| 12 | Assembly of the human genome | fa data format (unstructured text) | 4 volumes of data sets |
| 13 | SoGou Data | Corpus and search query data from SoGou Labs (unstructured text) | Ongoing development |
| 14 | MNIST | Handwritten digits database with 60,000 training and 10,000 test examples (unstructured image) | Ongoing development |
| 15 | MovieLens Dataset | Users' movie ratings, with 9,518,231 training and 386,835 test examples (semi-structured text) | Ongoing development |


Table 3. Summary of the implemented workloads in BigDataBench 3.2.

| Domains | Operations or Algorithms | Types | Data Sets | Software Stacks | Spec ID |
| --- | --- | --- | --- | --- | --- |
| Search Engine | Grep | Offline analytics | Wikipedia Entries | Hadoop, Spark, Flink, MPI | W1-1 |
| | Grep | Streaming | Random Generate | Spark Streaming | W1-1 |
| | WordCount | Offline analytics | Wikipedia Entries | Hadoop, Spark, Flink, MPI | W1-2 |
| | Index | Offline analytics | Wikipedia Entries | Hadoop, Spark, MPI | W1-4 |
| | PageRank | Offline analytics | Google Web Graph | Hadoop, Spark, Flink, MPI | W1-5 |
| | Nutch Server | Online Service | SoGou Data | Nutch | W1-6-1 |
| | Search | Streaming | Search Data | JStorm | W1-6-2 |
| | Sort | Offline analytics | Wikipedia Entries | Hadoop, Spark, MPI | W1-7 |
| | Read | Cloud OLTP | ProfSearch Resumes | HBase, MySQL | W1-11-1 |
| | Write | Cloud OLTP | ProfSearch Resumes | HBase, MySQL | W1-11-2 |
| | Scan | Cloud OLTP | ProfSearch Resumes | HBase, MySQL | W1-11-3 |
| Social Networks | Rolling Top Words | Streaming | Random Generate | JStorm, Spark Streaming | W2-1 |
| | CC | Graph | Facebook Social Network | Hadoop, Spark, MPI, GraphX, GraphLab, Flink Gelly | W2-8-1 |
| | Kmeans | Streaming | Random Generate | Spark Streaming | W2-8-2 |
| | Kmeans | Offline analytics | Facebook Social Network | Hadoop, Spark, Flink, MPI | W2-8-2 |
| | Label Propagation | Graph | Facebook Social Network | GraphX, GraphLab, Flink Gelly | W2-8-3 |
| | Triangle Count | Graph | Facebook Social Network | GraphX, GraphLab, Flink Gelly | W2-8-4 |
| | BFS | Graph | Self-generated by the program (MPI); Facebook Social Network | GraphX, GraphLab, Flink Gelly, MPI | W2-9 |
| E-commerce | Select Query | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-1 |
| | Aggregation | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-2 |
| | Join Query | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-3 |
| | CF | Streaming | MovieLens Dataset | JStorm | W3-4-1 |
| | CF | Offline analytics | Amazon Movie Reviews | Hadoop, Spark, MPI | W3-4-2 |
| | Bayes | Offline analytics | Amazon Movie Reviews | Hadoop, Spark, MPI | W3-5 |
| | Project | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-1 |
| | Filter | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-2 |
| | Cross Product | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-3 |
| | Order By | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-4 |
| | Union | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-5 |
| | Difference | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-6 |
| | Aggregation | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-7 |
| Multimedia Analytics | BasicMPEG | Offline analytics | Stream Data | Libc | W4-1 |
| | SIFT | Offline analytics | ImageNet | MPI | W4-2-1 |
| | DBN | Offline analytics | MNIST | MPI | W4-2-2 |
| | Speech Recognition | Offline analytics | Audio files | MPI | W4-3 |
| | Ray Tracing | Offline analytics | Scene description files | MPI | W4-4 |
| | Image Segmentation | Offline analytics | ImageNet | MPI | W4-5 |
| | Face Detection | Offline analytics | ImageNet | MPI | W4-6 |
| Bioinformatics | SAND | Offline analytics | Genome sequence data | Work Queue | W5-1 |
| | BLAST | Offline analytics | Assembly of the human genome | MPI | W5-2 |
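As a concrete illustration of one of the simplest workloads above, WordCount (W1-2) is a single map-then-reduce pass over text. The released implementations run on Hadoop, Spark, Flink, and MPI; the pure-Python sketch below merely mirrors the shape of that computation on in-memory splits and is not part of the suite.

```python
from collections import Counter
from itertools import chain

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in one input split.
    return ((w.lower(), 1) for line in split for w in line.split())

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def wordcount(splits):
    # Run both phases over all splits, as a MapReduce framework would,
    # with the framework's shuffle replaced by a simple concatenation.
    return reduce_phase(chain.from_iterable(map_phase(s) for s in splits))
```

In the real benchmark the interesting part is exactly what this sketch elides: how each software stack distributes the map tasks, shuffles the intermediate pairs, and schedules the reduces over a Wikipedia-scale corpus.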

Evolution 

As shown in Figure 2, the evolution of BigDataBench has gone through three major stages. At the first stage, we released three benchmark suites: BigDataBench 1.0 (6 workloads from the search engine domain), DCBench 1.0 (11 data analytics workloads), and CloudRank 1.0 (mixed data analytics workloads).

At the second stage, we merged the three suites and released BigDataBench 2.0, after investigating the top three application domains among internet services in terms of page views and daily visitors. BigDataBench 2.0 includes 6 real-world data sets and 19 big data workloads with different implementations, covering six application scenarios: micro benchmarks, Cloud OLTP, relational query, search engine, social networks, and e-commerce. Moreover, BigDataBench 2.0 provides several big data generation tools (BDGS) to generate scalable big data, e.g., at PB scale, from small-scale real-world data while preserving its original characteristics.

BigDataBench 3.0 was a multidisciplinary effort. It includes 6 real-world data sets, 2 synthetic data sets, and 32 big data workloads, covering micro and application benchmarks from typical application domains, e.g., search engine, social networks, and e-commerce. To generate representative and diverse big data workloads, BigDataBench 3.0 focuses on units of computation that appear frequently in Cloud OLTP, OLAP, and interactive and offline analytics.

Figure 2: BigDataBench Evolution

Previous releases

BigDataBench 3.1 http://prof.ict.ac.cn/BigDataBench/old/3.1/

BigDataBench 3.0 http://prof.ict.ac.cn/BigDataBench/old/3.0/

BigDataBench 2.0 http://prof.ict.ac.cn/BigDataBench/old/2.0/

BigDataBench 1.0 http://prof.ict.ac.cn/BigDataBench/old/1.0/

DCBench 1.0 http://prof.ict.ac.cn/DCBench/

CloudRank 1.0 http://prof.ict.ac.cn/CloudRank/

Handbook

Handbook of BigDataBench [BigDataBench-handbook]

Q & A

More questions & answers are available from the handbook of BigDataBench.

Contacts (Email)

People

  • Prof. Jianfeng Zhan
  • Lei Wang
  • Jingwei Li
  • Chunjie Luo
  • Wanling Gao
  • Zhen Jia
  • Qiang Yang
  • Xinhui Tian
  • Rui Han
  • Xinlong Lin
  • Rui Ren
  • Yuanqing Guo
  • Yuqing Zhu

Alumni

  • Hainan Ye
  • Yingjie Shi
  • Zijian Ming

License

BigDataBench is available to researchers interested in big data. BigDataBench itself is open source under the Apache License, Version 2.0; please use all files in compliance with that license. The software components of BigDataBench are all available as open-source software governed by their own licensing terms, and researchers intending to use BigDataBench must fully understand and abide by the licensing terms of the various components, whether those components were developed externally (not by the BigDataBench group) or internally (by the BigDataBench group).

Software developed internally (by the BigDataBench group) is covered by the BigDataBench_3.2 license: BigDataBench_3.2 Suite, Copyright (c) 2013-2015, ICT, Chinese Academy of Sciences. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must comply with the license and notice disclaimers.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimers in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.