Dr. Jianfeng Zhan is a Full Professor and Director of the Software Systems Lab at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), and the University of Chinese Academy of Sciences. He enjoys building new systems and takes great interest in collaborating with researchers from different backgrounds. His recent work focuses on different aspects of datacenter computing. He founded the International Symposium on Benchmarking, Measuring, and Optimizing, dedicated to benchmarking, measuring, and optimizing complex systems.
We propose a new approach to characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being reasonably divorced from individual implementations, and hence we call it a data motif. For the first time, among a wide variety of big data and AI workloads, we identify eight data motifs that take up most of the run time: Matrix, Sampling, Logic, Transform, Set, Graph, Sort, and Statistic. Interestingly, many other scholars are advocating domain-specific hardware and software systems. We believe the data motif concept provides a new, unified approach to rebuilding software and hardware systems for big data and AI workloads: rather than building or optimizing systems case by case, we can focus on accelerating the eight data motifs.
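To make the idea concrete, the following is a minimal sketch in C of viewing a workload as a pipeline of data motifs applied to initial or intermediate data. Only the motif names come from the text above; the workload structure and the example pipeline are purely illustrative and not taken from our code base.

```c
/* Illustrative sketch only: a workload modeled as a pipeline of data motifs.
 * The motif names come from the text; everything else is hypothetical. */
#include <stdio.h>

typedef enum {
    MOTIF_MATRIX, MOTIF_SAMPLING, MOTIF_LOGIC, MOTIF_TRANSFORM,
    MOTIF_SET, MOTIF_GRAPH, MOTIF_SORT, MOTIF_STATISTIC
} data_motif_t;

static const char *motif_name[] = {
    "Matrix", "Sampling", "Logic", "Transform",
    "Set", "Graph", "Sort", "Statistic"
};

/* A workload is a named pipeline of motifs; each stage consumes the
 * initial input or the intermediate data produced by the previous stage. */
typedef struct {
    const char         *name;
    const data_motif_t *stages;
    int                 n_stages;
} workload_t;

int main(void) {
    /* Hypothetical example: an image-classification workload dominated by
     * transform, matrix, sampling, and statistic computations. */
    static const data_motif_t stages[] = {
        MOTIF_TRANSFORM, MOTIF_MATRIX, MOTIF_SAMPLING, MOTIF_STATISTIC
    };
    workload_t w = { "image-classification", stages, 4 };

    printf("%s:", w.name);
    for (int i = 0; i < w.n_stages; i++)
        printf(" %s", motif_name[w.stages[i]]);
    printf("\n");
    return 0;
}
```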
History witnesses that the FLOPS (floating-point operations per second) metric and the HPC benchmarks have defined the concrete R&D objectives and roadmaps for HPC (Gflops in the 1990s, Tflops in the 2000s, Pflops in the 2010s, and Eflops in the 2020s). Today, to provide Internet services or perform big data or AI analytics, more and more organizations around the world build internal datacenters or rent hosted ones. Datacenter computing appears to have outweighed HPC in terms of market share (HPC takes only 20% of the total). Unfortunately, in the context of datacenter computing, we are isolated. On one hand, the academic community has no real-world data and workloads, which are owned by different Internet service giants. On the other hand, each giant has only its own data and workloads without knowing the others'. It is time for us (both the academia and industry communities) to set up unified metrics and benchmarks for datacenter computing.
As a multi-disciplinary research and engineering effort spanning the architecture, system, data management, and machine learning communities from both industry and academia, we set up an open-source big data and AI benchmark suite: BigDataBench. The current version, BigDataBench 4.0, provides 13 representative real-world data sets and 47 benchmarks. Rather than creating a new benchmark or proxy for every possible workload, we propose using data motif-based benchmarks, i.e., combinations of the eight data motifs, to represent the diversity of big data and AI workloads. Our benchmark suite includes micro benchmarks, each of which is a single data motif; component benchmarks, which consist of data motif combinations; and end-to-end application benchmarks, which are combinations of component benchmarks. For the architecture community, whether early in the architecture design process or later in system evaluation, running a comprehensive benchmark suite is time-consuming, and the complex software stacks of big data and AI workloads aggravate this issue. To tackle this challenge, we propose data motif-based simulation benchmarks for the architecture community, which speed up runtime by 100 times while preserving system and micro-architectural characteristic accuracy.
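As an illustration of how the three tiers relate, here is a minimal sketch, again in C and again hypothetical rather than the actual BigDataBench implementation: a micro benchmark wraps a single motif, a component benchmark combines motifs, and an end-to-end application benchmark combines components.

```c
/* Illustrative sketch only: the three benchmark tiers built from data motifs.
 * Benchmark names and compositions below are made up for illustration. */
#include <stdio.h>

typedef enum { MATRIX, SAMPLING, LOGIC, TRANSFORM, SET, GRAPH, SORT, STATISTIC } motif_t;

/* Micro benchmark: exactly one data motif. */
typedef struct { const char *name; motif_t motif; } micro_t;

/* Component benchmark: a combination of data motifs. */
typedef struct { const char *name; const motif_t *motifs; int n_motifs; } component_t;

/* End-to-end application benchmark: a combination of component benchmarks. */
typedef struct { const char *name; const component_t *components; int n_components; } application_t;

int main(void) {
    micro_t sort_micro = { "sort-micro", SORT };

    static const motif_t recommend[] = { MATRIX, GRAPH, SORT };
    static const motif_t classify[]  = { MATRIX, STATISTIC };
    static const component_t parts[] = {
        { "recommendation", recommend, 3 },
        { "classification", classify,  2 },
    };
    application_t app = { "online-analytics", parts, 2 };

    printf("micro: %s\napplication %s has %d components\n",
           sort_micro.name, app.name, app.n_components);
    return 0;
}
```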
We are happy to work with many scientists and doctors on this amazing topic.
Although computation on data of an unprecedented scale is becoming much more complex, we argue that computers and smart devices should, and will, consistently provide information and knowledge to human beings within a few tens of milliseconds. We coin a new term, 10-millisecond computing, to call attention to this class of workloads. 10-millisecond computing raises many challenges for both the software and hardware stacks.
It seems that we are happy to benchmark and optimize many things. We struggle to find something interesting. Most of the time, we fail...
Traditionally, we refer to OS scalability in terms of average performance. In the context of latency-critical services, the worst-case performance (latency) is amplified by the system scale, so we must care about OS scalability in terms of both average and worst-case performance. We present the "isolate first, then share" OS model, in which the machine's processor cores, memory, and devices are divided up among disparate OS instances, and a new abstraction, the subOS, is proposed to encapsulate an OS instance that can be created, destroyed, and resized on the fly. The intuition is that this avoids shared kernel state between applications, which in turn reduces performance loss caused by contention. We decompose the OS into a supervisor and several subOSes running at the same privilege level: a subOS directly manages physical resources, while the supervisor can create, destroy, and resize a subOS on the fly. The supervisor and subOSes share little state, but fast inter-subOS communication mechanisms are provided on demand. We present the first implementation, RainForest, which supports unmodified Linux binaries. Our comprehensive evaluation shows that RainForest outperforms Linux with four different kernels, LXC, and Xen in terms of worst-case and average performance most of the time when running a large number of benchmarks. We submitted this systems paper to ASPLOS four times (from 2015 to 2018). Finally, we have no interest in submitting it again, but I will be happy if you are interested in reading it.
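The following is a minimal sketch in C of the supervisor/subOS interface implied by this model. RainForest's real interface is not reproduced here, so every name, type, and function in the sketch is a hypothetical illustration of creating, resizing, and destroying a subOS over dedicated physical resources.

```c
/* Hypothetical sketch of the "isolate first, then share" model; the real
 * RainForest interface is not public here, so all names are illustrative. */
#include <stdio.h>
#include <stdlib.h>

/* Physical resources a subOS directly manages: no kernel state is shared
 * with other subOSes, which avoids contention on shared kernel data. */
typedef struct {
    int cores;       /* number of dedicated CPU cores   */
    int memory_mb;   /* dedicated physical memory in MB */
    int devices;     /* number of assigned devices      */
} resources_t;

typedef struct {
    int         id;
    resources_t res;
} subos_t;

/* The supervisor creates, resizes, and destroys subOSes on the fly. */
static int next_id = 1;

subos_t *supervisor_create_subos(resources_t res) {
    subos_t *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->id = next_id++;
    s->res = res;
    printf("created subOS %d: %d cores, %d MB\n", s->id, res.cores, res.memory_mb);
    return s;
}

void supervisor_resize_subos(subos_t *s, resources_t res) {
    s->res = res;   /* resize on the fly: hand over or reclaim resources */
    printf("resized subOS %d: %d cores, %d MB\n", s->id, res.cores, res.memory_mb);
}

void supervisor_destroy_subos(subos_t *s) {
    printf("destroyed subOS %d\n", s->id);
    free(s);
}

int main(void) {
    /* Run a latency-critical service in its own subOS, then grow it. */
    subos_t *svc = supervisor_create_subos(
        (resources_t){ .cores = 4, .memory_mb = 8192, .devices = 1 });
    supervisor_resize_subos(svc,
        (resources_t){ .cores = 8, .memory_mb = 16384, .devices = 1 });
    supervisor_destroy_subos(svc);
    return 0;
}
```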
We built three innovative cluster and cloud systems software packages: Phoenix, DawningCloud, and PhoenixCloud. Among them, GridView (a component of the Phoenix cluster operating system) was transferred to Sugon, a premier supercomputing company in China, and became one of its popular software products. Not having open-sourced these projects is my deepest regret.
In collaboration with Tencent, we built a programming framework for building different data-parallel programming models.
We take great interest in, and are engaged in, proposing approaches and developing performance analysis tools for large-scale scientific computing, datacenter computing, and cloud computing.
We built several tools for understanding the reliability and availability of large-scale computing systems.
TPC member, IPDPS 2018
TPC member, ICDCS 2017
HPBDC Co-Chair, in conjunction with IPDPS 2016, 2017, and 2018
BPOE Chair, in conjunction with ASPLOS 2014, 2015, 2016, 2017, and VLDB 2014
TPC member, IISWC 2014
TPC member, CCGrid 2014
TPC member, CCF Big Data Conference 2013
TPC member, International Conference on Computer Communications and Networks (ICCCN 2014)
Founding Organizer, The First Workshop of Benchmarks, Performance Optimization, and Emerging Hardware of Big Data Systems and Applications (BPOE 2013), In conjunction with IEEE Big Data Conference 2013, October 8, 2013, Silicon Valley, CA, USA
TPC Member, The Second IEEE International Conference on Big Data Science and Engineering (BDSE), December 3-5, 2013, Sydney, Australia
TPC member, The ACM Cloud and Autonomic Computing Conference (CAC 2013), Miami, Florida, USA, August 5-9, 2013
Organizer, HPCA 2013 Tutorial, High Volume Computing: The Motivations, Metrics, and Benchmarks Suites for Data Center Computer Systems, Shenzhen, 2013.
Track Chair of Utility Computing, HPCC 2013
PC Member and Publicity Chair, ICAC 2013
PC Member, SOSE 2013
PC member, NPC 2012
Guest editor, Cloud computing special issue of Frontier of Computer Sciences, 2012
PC Member, IPDPS 2012, Ph.D. Forum
Publicity Chair for China, ICAC 2012 (The 9th International Conference on Autonomic Computing)
PC Member, ICDCS 2012 (The 32nd International Conference on Distributed Computing Systems)
PC Member, AINA 2012 (The 26th IEEE International Conference on Advanced Information Networking and Applications), Tokyo, Japan, March 26-29, 2012
PC Member, NPC 2011, CSE 2011, GCC 2011, Cloud 2011