Performance Numbers

Processors

Motivations

Three trends in the architecture and systems communities shape current Big Data systems and architectures:

1) Commodity x86-based systems dominate data center markets, although such systems may be architecturally mismatched to scale-out workloads.

2) Internet service providers such as Facebook show great interest in low-power wimpy cores (e.g., ARM-based systems that reduce power consumption).

3) Wimpy many-core systems are a promising architecture for Big Data workloads.

Platforms

We evaluated four platforms with different brawny-core, wimpy multi-core, and wimpy many-core processors, in order to reflect state-of-the-practice Big Data architectures. To perform apples-to-apples comparisons of brawny-core and wimpy multi-core processors, we chose the Xeon E5310 and Atom D510, which have the same CPU frequency (1.6 GHz) and the same number of cache-hierarchy levels (two). We also selected the E5645 (Westmere micro-architecture) as the scale-up counterpart of the E5310 (Clovertown micro-architecture), and the TileGx36 (36 cores) as the scale-out point of comparison for the Atom D510 (two cores). Table 1 lists their configuration details.

                   Table 1: Configuration Details of the Four Platforms

Metrics

For the sake of reproducibility, we choose a variety of user-perceivable metrics that are easy to gather and compare.

Data Processed per Second (DPS) is the main performance metric for the data analytics workloads, defined as the input data size divided by the total processing time.

Operations per Second (OPS) is the main performance metric for the basic data store operations, defined as the number of operations divided by the total processing time.

Data Processed per Joule (DPJ) is the corresponding energy-efficiency metric for the data analytics workloads, defined as the input data size divided by the energy consumed (in joules).

Operations per Joule (OPJ) is the corresponding energy-efficiency metric for the data store operations, defined as the number of operations divided by the energy consumed (in joules).

Note: one joule is the energy expended to sustain one watt of power for one second. In our evaluations, we report performance and energy efficiency per processor.
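
To make the definitions concrete, here is a minimal sketch in Python (a hypothetical helper, not part of BigDataBench) that derives all four metrics from raw measurements; the example numbers are made up:

    def efficiency_metrics(input_bytes, num_ops, seconds, joules):
        """Derive the four user-perceivable metrics defined above."""
        return {
            "DPS (bytes/s)": input_bytes / seconds,  # data processed per second
            "OPS (ops/s)": num_ops / seconds,        # operations per second
            "DPJ (bytes/J)": input_bytes / joules,   # data processed per joule
            "OPJ (ops/J)": num_ops / joules,         # operations per joule
        }

    # Hypothetical run: 100 GB processed with 1,000,000 data store
    # operations in 600 s, consuming 90,000 J.
    print(efficiency_metrics(100 * 2**30, 1_000_000, 600, 90_000))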

Results

 Figure 1: The Average DPS/OPS and DPJ/OPJ per Processor of Each Workload on Four Platforms

Figure 1 normalizes the numbers on each platform to those of the Xeon E5310 and shows the resulting ratios. We have the following observations:

I/O-intensive workloads (Sort, Write, and Scan, which have higher I/O_wait ratios and lower CPU utilization):

  • The brawny-core processor (Xeon E5645) provides no performance advantage over its wimpy many-core counterpart (TileGx36).
  • The wimpy many-core TileGx36 achieves the best performance on Sort, Write, and Scan, and also delivers a clear energy-efficiency advantage.

        Lessons Learned: The wimpy many-core processor is more suitable for I/O-intensive workloads.

CPU-intensive workloads (Naive Bayes and K-means, which have higher CPU utilization and higher floating-point instruction ratios):

  • The brawny-core processors show clear advantages in both performance and energy efficiency over the wimpy multi-core and many-core processors.

        Lessons Learned: The brawny-core processor is more suitable for CPU-intensive, floating-point (FP) dominated workloads.

The other nine workloads:

  • Grep resembles the CPU-intensive workloads; however, its FP instruction ratio is much lower (10% of that of K-means and 5% of that of Naive Bayes). Consequently, the brawny-core processor's performance advantage over the wimpy-core processors is less pronounced for Grep than for Naive Bayes and K-means.
  • The I/O_wait percentage of Select is higher than the average across all workloads; however, its CPU utilization is also higher than that of I/O-intensive workloads such as Sort, Write, and Scan. Consequently, Select does not show the same performance advantage on the wimpy many-core processor as the I/O-intensive workloads do.
  • The behaviors of the other seven workloads are quite diverse. For example, the brawny-core processors show a performance advantage over the wimpy multi-core and many-core processors for the Join and Aggregation workloads, whereas the wimpy multi-core and many-core processors achieve better energy efficiency than the brawny cores for the Grep and Select workloads.

        Lessons Learned: No platform consistently wins in terms of both performance and energy efficiency across most workloads.

Conclusions

1)  None of the microprocessors consistently wins in terms of both performance and energy efficiency for all of our Big Data workloads.

2) There are different classes of Big Data workloads, and each class of workload realizes better performance and energy efficiency on different Big Data architectures. Thus, we recommend eschewing one-size-fits-all solutions, and instead tailoring system designs to specific workload requirements.

Big Data Generators

Motivations

Data generation is a key issue in big data benchmarking; it aims to generate application-specific data sets that meet the 4V requirements of big data (i.e., volume, velocity, variety, and veracity). To date, most existing techniques generate big data sets for only a few specific data types and sources, or cannot guarantee data veracity.

BDGS

In BigDataBench 3.0, we designed a general framework and a tool called the Big Data Generator Suite (BDGS) to generate synthetic big data while preserving the 4V properties. BDGS has the following three features:

1)  BDGS includes eight specific generators based on a variety of real-world data sets from different Internet service domains, covering three representative data types (structured, semi-structured, and unstructured) and three data sources (text, graph, and table data).

2)  BDGS proposes three novel data generation methods: (1) a text data generation method based on the LDA model; (2) a weighted-graph data generation method based on the Kronecker graph model and SNAP; (3) a semi-structured data generation method for Personal Resume data. (Sketches of the first two methods follow this list.)

3)  BDGS parallelizes the Text data generator and the Personal Resume generator, so that, as machines are added, TB-level data can be generated in a shorter time.
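
To illustrate the first method, here is a minimal Python sketch of the LDA generative process used for synthetic text generation. The vocabulary, topic count, and Dirichlet priors below are toy assumptions; BDGS learns the topic-word distributions from real-world seed data sets rather than drawing them at random as done here:

    import numpy as np

    rng = np.random.default_rng(42)

    # Toy setup for illustration only; BDGS derives the topic-word
    # distributions from a real seed corpus (e.g., wiki text).
    vocab = ["data", "system", "graph", "query", "node",
             "table", "sort", "scan", "index", "cluster"]
    num_topics, alpha, beta = 3, 0.1, 0.01  # topic count, Dirichlet priors

    # Randomly drawn topic-word distributions stand in for learned ones.
    topic_word = rng.dirichlet([beta] * len(vocab), size=num_topics)

    def generate_document(doc_len):
        """Sample one document via the LDA generative process."""
        theta = rng.dirichlet([alpha] * num_topics)  # doc-topic mixture
        words = []
        for _ in range(doc_len):
            z = rng.choice(num_topics, p=theta)  # pick a topic
            words.append(vocab[rng.choice(len(vocab), p=topic_word[z])])
        return " ".join(words)

    for _ in range(3):
        print(generate_document(12))

Likewise, the second method can be sketched with an R-MAT-style sampler, i.e., a stochastic Kronecker graph generator with a 2x2 initiator matrix. The initiator values here are illustrative; BDGS fits the initiator to a real seed graph via SNAP:

    import random

    a, b, c, d = 0.57, 0.19, 0.19, 0.05  # toy 2x2 initiator probabilities
    k = 10                               # the graph has 2**k = 1024 nodes
    num_edges = 5000

    def sample_edge():
        """Place one edge by descending k levels of the Kronecker product."""
        row = col = 0
        for _ in range(k):
            r = random.random()
            if r < a:           quad = (0, 0)
            elif r < a + b:     quad = (0, 1)
            elif r < a + b + c: quad = (1, 0)
            else:               quad = (1, 1)
            row, col = 2 * row + quad[0], 2 * col + quad[1]
        return row, col

    # Weighted edges: attach a random weight to each sampled edge.
    edges = {(u, v): round(random.uniform(0.1, 1.0), 3)
             for u, v in (sample_edge() for _ in range(num_edges))}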

Performance

Experiments are run on a cluster of eight Xeon E5645 nodes. Using these 8 nodes, we can generate 10 TB of text data in 300 minutes and 10 TB of personal resume data (NoSQL table) in 338 minutes.

The average generation rates of Text, Graph, and Table data on a single machine are 77.01 MB/s, 13.3 MB/s, and 23.85 MB/s (SQL table), respectively. Figure 2 shows that BDGS generates 10 TB of wiki text data in 5 hours.
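
As a quick sanity check on these numbers (assuming 1 TB = 2^20 MB), the aggregate text-generation throughput implied by the 8-node run is consistent with the single-machine rate reported above:

    total_mb = 10 * 2**20           # 10 TB expressed in MB
    seconds = 300 * 60              # 300 minutes
    aggregate = total_mb / seconds  # ~582.5 MB/s for the whole cluster
    per_node = aggregate / 8        # ~72.8 MB/s per node
    print(aggregate, per_node)
    # 72.8 MB/s per node is close to the 77.01 MB/s single-machine rate;
    # the gap reflects parallelization overhead.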

Figure 2: Generating wiki text data on the 8-node cluster
