BigData100 is an open-source project for benchmarking and ranking big data systems. Its benchmarks are selected from BigDataBench. The systems currently covered include Hadoop, Spark, Flink, Hive on Hadoop, and Impala, and the workloads span offline batch processing, iterative machine learning, and interactive query processing across different domains. We will include more systems and workloads in the near future.
The benchmarks come from BigDataBench. The batch processing and iterative machine learning part supports Hadoop, Spark, and Flink, while the interactive query processing benchmarks cover Impala, Spark SQL, Hive on Hadoop, Hive on Tez, and Hive on Spark. The first part of BigData100 contains 4 data sets and 6 workloads. Table 1 summarizes the real-world data sets and the scalable data generation tools included in BigData100; Table 2 presents the BigData100 workloads for batch processing and machine learning.
Table 1: The Summary of Data Sets
|Data sets||Raw data size||Scalable data set|
|Wikipedia Entries||1,600,000,000 English words (unstructured text)||Text Generator of BDGS of BigDataBench|
|Amazon Movie Reviews||5,700,000 reviews (semi-structured text)||Text Generator of BDGS of BigDataBench|
|Google Web Graph||16,777,216 nodes, 99,184,770 edges (unstructured graph)||Graph Generator of BDGS of BigDataBench|
|Facebook Social Network||460,000,000 vectors||Graph Generator of BDGS of BigDataBench|
Table 2: The summary of the workloads in BigData100
- Jingwei Li, BAFST, email@example.com
- Xinhui Tian, ICT, CAS, firstname.lastname@example.org
- Jianfeng Zhan, ICT, CAS, email@example.com
We currently use a 16-node cluster as the testbed. Table 3 summarizes the cluster configuration.
Table 3: The configuration of the cluster
|Computation nodes||16 nodes|
|CPU per node||2 x Intel Xeon E5645|
|Memory per node||32 GB|
|Disk per node||2 x 1 TB SATA disks|
|Network||Broadcom NetXtreme II Gigabit Ethernet|
Results | 2016.01
We have released some preliminary results for BigData100.
Part 1: Batch Processing and Iterative Machine Learning
The first BigData100 ranking report covers Spark, Hadoop, and Flink. Table 4 shows the performance numbers of different systems running different benchmarks.
Table 4: The performance metrics of different systems (seconds)
|Workload||Hadoop 2.7.1||Spark 1.5.1||Flink 0.9.1|
|PageRank||7753||780||598 (469 in delta PageRank)|
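The runtimes in Table 4 are easier to compare as relative speedups. The following is a minimal sketch that uses only the PageRank numbers shown above; the function names `speedups` and `ranking` are illustrative and are not part of BigData100. On this single workload, Spark comes out roughly 10 times and Flink roughly 13 times faster than Hadoop.

```python
# Runtimes in seconds, copied from the PageRank row of Table 4.
PAGERANK_SECONDS = {
    "Hadoop 2.7.1": 7753,
    "Spark 1.5.1": 780,
    "Flink 0.9.1": 598,
    "Flink 0.9.1 (delta PageRank)": 469,
}


def speedups(runtimes, baseline):
    """Return baseline_runtime / runtime for every system (higher is faster)."""
    base = runtimes[baseline]
    return {name: base / seconds for name, seconds in runtimes.items()}


def ranking(runtimes):
    """Return system names ordered from fastest to slowest."""
    return sorted(runtimes, key=runtimes.get)


if __name__ == "__main__":
    relative = speedups(PAGERANK_SECONDS, baseline="Hadoop 2.7.1")
    for name in ranking(PAGERANK_SECONDS):
        seconds = PAGERANK_SECONDS[name]
        print(f"{name}: {seconds} s, {relative[name]:.1f}x vs Hadoop")
```

A ranking based on one workload says little about the other benchmarks, so a real report would repeat this per workload.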
Part 2: Interactive Query Processing
In this part, we tested five systems with three queries included in BigDataBench. The queries use an e-commerce schema comprising three tables: items, customers, and orders. Together, the three queries cover common operations such as select, aggregation, and join. The data size is described in Table 5, and the results are shown below. Currently, we only consider the text format for these three tables.
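To make the three operation types concrete, here is a small sketch of a select, an aggregation, and a join over an e-commerce schema. These are not the actual BigDataBench queries: the source names only the three tables, so every column name and data row below is an assumption, and SQLite (from the Python standard library) stands in for the benchmarked engines purely so the example can run locally.

```python
import sqlite3

# Hypothetical schema: only the table names (items, customers, orders)
# come from the text; all columns are assumptions for illustration.
SCHEMA = """
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE items (item_id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    item_id INTEGER REFERENCES items(item_id),
    quantity INTEGER
);
"""

# Select: filter rows of a single table.
SELECT_QUERY = "SELECT title FROM items WHERE price > 20 ORDER BY title"

# Aggregation: total quantity ordered per customer.
AGGREGATION_QUERY = (
    "SELECT customer_id, SUM(quantity) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
)

# Join: combine all three tables through the orders table.
JOIN_QUERY = """
SELECT c.name, i.title, o.quantity
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
JOIN items AS i ON i.item_id = o.item_id
ORDER BY o.order_id
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "alice"), (2, "bob")])
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                     [(10, "book", 12.0), (11, "lamp", 30.0)])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                     [(100, 1, 10, 2), (101, 1, 11, 1), (102, 2, 10, 1)])
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for label, query in [("select", SELECT_QUERY),
                         ("aggregation", AGGREGATION_QUERY),
                         ("join", JOIN_QUERY)]:
        print(label, conn.execute(query).fetchall())
```

The benchmarked systems each use their own SQL dialect, so queries of this kind may need small adjustments per engine.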
Table 5: The data size used for interactive query processing
Runtime System Configurations
Table 6 presents the configuration details of the different big data runtime systems.
Table 6: The configurations of big data systems
|spark.storage.memoryFraction||0.2 (0.5 for iteration)|
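The source does not say how this setting was applied. One standard way is the `--conf key=value` option of `spark-submit`, using the Spark property name `spark.storage.memoryFraction`. The sketch below only builds the command line for the two values in Table 6; the helper `spark_submit_args` and the application file name are hypothetical.

```python
# Values taken from Table 6: 0.2 by default, 0.5 for iterative workloads.
STORAGE_MEMORY_FRACTION = {"default": 0.2, "iterative": 0.5}


def spark_submit_args(app, iterative=False):
    """Build a spark-submit argument list for a hypothetical application."""
    key = "iterative" if iterative else "default"
    fraction = STORAGE_MEMORY_FRACTION[key]
    return [
        "spark-submit",
        "--conf", f"spark.storage.memoryFraction={fraction}",
        app,
    ]


if __name__ == "__main__":
    # "pagerank.py" is a placeholder, not a file shipped with BigData100.
    print(" ".join(spark_submit_args("pagerank.py", iterative=True)))
```

Returning a list rather than a single string lets the arguments be passed directly to `subprocess.run` without shell quoting concerns.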