DOWNLOAD

BigDataBench 4.0 is released.

Download User Manual, Technical Report and Specification

BigDataBench 4.0 User Manual  [BigDataBench-UserManual]

BigDataBench JStorm User Manual [BigDataBench-JStorm-UserManual]

BigDataBench Spark Streaming User Manual [BigDataBench-SparkStreaming-UserManual]

BigDataBench 4.0 Technical Report  [BigDataBench-TechnicalReport]

Download data sets

Table 1: The Summary of Data Sets

| No. | Data set | Data size | Scalable data set |
|---|---|---|---|
| 1 | Wikipedia Entries | 4,300,000 English articles (unstructured text) | Text Generator of BDGS |
| 2 | Amazon Movie Reviews | 7,911,684 reviews (semi-structured text) | Text Generator of BDGS |
| 3 | Google Web Graph | 875,713 nodes, 5,105,039 edges (unstructured graph) | Graph Generator of BDGS |
| 4 | Facebook Social Network | 4,039 nodes, 88,234 edges (unstructured graph) | Graph Generator of BDGS |
| 5 | E-commerce Transaction Data | table 1: 4 columns, 38,658 rows; table 2: 6 columns, 242,735 rows (structured table) | Table Generator of BDGS |
| 6 | ProfSearch Person Resumes | 278,956 resumes (semi-structured table) | Table Generator of BDGS |
| 7 | CIFAR-10 | 60,000 color images of dimension 32*32 | Ongoing development |
| 8 | ImageNet (1GB, 10GB) | ILSVRC2014 DET image dataset (unstructured image) | Ongoing development |
| 9 | LSUN | One million labelled images, classified into 10 scene categories and 20 object categories | Ongoing development |
| 10 | TED Talks | Translated TED talks provided by the IWSLT evaluation campaign | Ongoing development |
| 11 | Sogou Data (search data processed from SogouT) | Corpus and search query data from Sogou Labs (unstructured text) | Ongoing development |
| 12 | MNIST | Handwritten digits database with 60,000 training examples and 10,000 test examples (unstructured image) | Ongoing development |
| 13 | MovieLens Dataset | User ratings of movies, with 9,518,231 training examples and 386,835 test examples (semi-structured text) | Ongoing development |
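
The node and edge counts in Table 1 give a quick reference point for the two graph data sets. The sketch below computes the edge-to-node ratio of each seed graph and the relative drift of a generated graph from its seed. Whether BDGS preserves this ratio when scaling is not stated on this page, so treat the drift check as an illustrative assumption; the function names are hypothetical.

```python
# Node and edge counts copied from Table 1.
SEED_GRAPHS = {
    "Google Web Graph": (875_713, 5_105_039),
    "Facebook Social Network": (4_039, 88_234),
}


def edges_per_node(nodes, edges):
    """Return the edge-to-node ratio of a graph."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return edges / nodes


def ratio_drift(seed, generated):
    """Relative difference between the edge-to-node ratios of two graphs.

    Each argument is a (nodes, edges) tuple. Assumes the seed has edges.
    """
    seed_ratio = edges_per_node(*seed)
    return abs(edges_per_node(*generated) - seed_ratio) / seed_ratio


if __name__ == "__main__":
    for name, (nodes, edges) in SEED_GRAPHS.items():
        print(f"{name}: {edges_per_node(nodes, edges):.2f} edges per node")
```

A generated graph with the same ratio as its seed yields a drift of 0; what counts as an acceptable drift is a choice left to the user.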

Download software

We provide two options: download the full software package at once, or download the components one by one. Before using BigDataBench, you need to download and deploy the prerequisite software packages listed below; please refer to the user manual for details. The running platform is Linux.

| Software | Version | Download |
|---|---|---|
| Hadoop | 1.0.2 | http://hadoop.apache.org/#Download+Hadoop |
| HBase | 0.94.5 | http://www.apache.org/dyn/closer.cgi/hbase/ |
| Cassandra | 1.2.3 | http://cassandra.apache.org/download/ |
| MongoDB | 2.4.1 | http://www.mongodb.org/downloads |
| Mahout | 0.8 | https://cwiki.apache.org/confluence/display/MAHOUT/Downloads |
| Hive | 0.9.0 | https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration |
| Spark | 0.8.0 | http://spark.incubator.apache.org/ |
| Shark | 0.8.0 | http://shark.cs.berkeley.edu/ |
| Impala | 1.1.1 | http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_install.html |
| JStorm | 0.9.6.3 | https://github.com/alibaba/jstorm/wiki/Downloads |
| Flink | 0.10.1 | https://flink.apache.org/downloads.html |
| MPICH | 2.0 | http://www.mpich.org/downloads/ |
| Boost | 1_43_0 | http://www.boost.org/doc/libs/1_43_0/more/getting_started/unix-variants.html |
| Scala | 2.9.3 | http://www.scala-lang.org/download/2.9.3.html |
| GCC | 4.8.2 | http://gcc.gnu.org/releases.html |
| GSL | 1.16 | http://www.gnu.org/software/gsl/ |
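
Before deploying, it can help to see which prerequisites are already reachable from the shell. The following is a minimal sketch, not part of BigDataBench: the versions are copied from the table above, while the command names in `COMMANDS` are assumptions about typical installations and may differ on your system. It only checks that a command is on `PATH`; it does not verify the installed version.

```python
import shutil

# Versions copied from the prerequisite table.
PREREQUISITES = {
    "Hadoop": "1.0.2",
    "HBase": "0.94.5",
    "Hive": "0.9.0",
    "MPICH": "2.0",
    "Scala": "2.9.3",
    "GCC": "4.8.2",
}

# Hypothetical mapping from package to a command expected on PATH.
# These command names are assumptions; adjust them to your installation.
COMMANDS = {
    "Hadoop": "hadoop",
    "HBase": "hbase",
    "Hive": "hive",
    "MPICH": "mpiexec",
    "Scala": "scala",
    "GCC": "gcc",
}


def missing_commands(commands, which=shutil.which):
    """Return the package names whose command is not found on PATH."""
    return sorted(name for name, cmd in commands.items() if which(cmd) is None)


if __name__ == "__main__":
    for name in missing_commands(COMMANDS):
        print(f"{name} {PREREQUISITES[name]}: command not found on PATH")
```

The `which` parameter is injectable so the lookup can be replaced in tests or pointed at a custom search routine.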

Full download

Full software packages of different implementations are available from the following links:

Separate download

You may download the individual components of BigDataBench from the tables below.

BDGS: Big Data Generator Suite in BigDataBench

| | Name | Description |
|---|---|---|
| BDGS generates big data on the basis of six raw data sets | Text | BigDataGeneratorSuite.tar.gz (Size: 40MB) |
| | Graph | |
| | Table | |

Micro benchmark workloads. Note that the shell scripts for generating data and running each workload are included in the distribution.

| Micro Benchmark | Involved Dwarf | Application Domain | Workload Type | Data Set | Software Stack |
|---|---|---|---|---|---|
| Sort | Sort | SE, SN, EC, MP, BI [1] | Offline analytics | Wikipedia entries | Hadoop, Spark, Flink, MPI |
| Grep | Set | SE, SN, EC, MP, BI | Offline analytics | Wikipedia entries | Hadoop, Spark, Flink, MPI |
| | | | Streaming | Randomly generated | Spark streaming |
| WordCount | Basic statistics | SE, SN, EC, MP, BI | Offline analytics | Wikipedia entries | Hadoop, Spark, Flink, MPI |
| MD5 | Logic | SE, SN, EC, MP, BI | Offline analytics | Wikipedia entries | Hadoop, Spark, MPI |
| Connected Component | Graph | SN | Graph analytics | Facebook social network | Hadoop, Spark, Flink, GraphLab, MPI |
| RandSample | Sampling | SE, MP, BI | Offline analytics | Wikipedia entries | Hadoop, Spark, MPI |
| FFT | Transform | MP | Offline analytics | Two-dimensional matrix | Hadoop, Spark, MPI |
| Matrix Multiply | Matrix | SE, SN, EC, MP, BI | Offline analytics | Two-dimensional matrix | Hadoop, Spark, MPI |
| Read | Set | SE, SN, EC | NoSQL | ProfSearch resumes | HBase, MongoDB |
| Write | Set | SE, SN, EC | NoSQL | ProfSearch resumes | HBase, MongoDB |
| Scan | Set | SE, SN, EC | NoSQL | ProfSearch resumes | HBase, MongoDB |
| Convolution | Transform | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Fully Connected | Matrix | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Relu | Logic | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Sigmoid | Matrix | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Tanh | Matrix | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| MaxPooling | Sampling | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| AvgPooling | Sampling | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| CosineNorm | Basic statistics | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| BatchNorm | Basic statistics | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Dropout | Sampling | SN, EC, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |

[1] SE (Search Engine), SN (Social Network), EC (E-commerce), BI (Bioinformatics), MP (Multimedia Processing).

Component benchmark workloads. Note that the shell scripts for generating data and running each workload are included in the distribution.

| Component Benchmark | Involved Dwarf | Application Domain | Workload Type | Data Set | Software Stack |
|---|---|---|---|---|---|
| Xapian Server | Get, Put, Post | SE | Online service | Wikipedia entries | Xapian |
| PageRank | Matrix, Sort, Basic statistics, Graph | SE | Graph analytics | Google web graph | Hadoop, Spark, Flink, GraphLab, MPI |
| Index | Logic, Sort, Basic statistics, Set | SE | Offline analytics | Wikipedia entries | Hadoop, Spark |
| Rolling top words | Sort, Basic statistics | SN | Streaming | Randomly generated | Spark streaming, JStorm |
| Kmeans | Matrix, Sort, Basic statistics | SE, SN, EC, MP, BI | Offline analytics | Facebook social network | Hadoop, Spark, Flink, MPI |
| | | | Streaming | Randomly generated | Spark streaming |
| Collaborative Filtering | Graph, Matrix | EC | Offline analytics | Amazon movie review | Hadoop, Spark |
| | | | Streaming | MovieLens dataset | JStorm |
| Naive Bayes | Basic statistics, Sort | SE, SN, EC | Offline analytics | Amazon movie review | Hadoop, Spark, Flink, MPI |
| SIFT | Matrix, Sampling, Transform, Sort | MP | Offline analytics | ImageNet | Hadoop, Spark, MPI |
| LDA | Matrix, Graph, Sampling | SE | Offline analytics | Wikipedia entries | Hadoop, Spark, MPI |
| OrderBy | Set, Sort | EC | Data warehouse | E-commerce transaction | Hive, Spark-SQL, Impala |
| Aggregation | Set, Basic statistics | EC | Data warehouse | E-commerce transaction | Hive, Spark-SQL, Impala |
| Project | Set | EC | Data warehouse | E-commerce transaction | Hive, Spark-SQL, Impala |
| Filter | Set | EC | Data warehouse | E-commerce transaction | Hive, Spark-SQL, Impala |
| Select | Set | EC | Data warehouse | E-commerce transaction | Hive, Spark-SQL, Impala |
| Union | Set | EC | Data warehouse | E-commerce transaction | Hive, Spark-SQL, Impala |
| Alexnet | Matrix, Transform, Sampling, Logic, Basic statistics | SN, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Googlenet | | SN, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Resnet | | SN, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| Inception Resnet V2 | | SN, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| VGG16 | | SN, MP, BI | AI | CIFAR, ImageNet | TensorFlow, Caffe |
| DCGAN | | SN, MP, BI | AI | LSUN | TensorFlow, Caffe |
| WGAN | | SN, MP, BI | AI | LSUN | TensorFlow, Caffe |
| GAN | Matrix, Sampling, Logic, Basic statistics | SN, MP, BI | AI | LSUN | TensorFlow, Caffe |
| Seq2Seq | | SN, EC, BI | AI | TED Talks | TensorFlow, Caffe |
| Word2vec | Matrix, Basic statistics, Logic | SE, SN, EC | AI | Wikipedia entries, Sogou data | TensorFlow, Caffe |