Large-Scale In-Memory Data Processing

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Large-Scale In-Memory Data Processing"

By

Mr. Zhiqiang MA


Abstract

As cloud and big data computation grows to be an increasingly important 
paradigm, providing a general abstraction for datacenter-scale programming has 
become an imperative research agenda. Researchers have proposed, designed and 
implemented various computation models and systems on different abstraction 
levels, such as MapReduce, X10, Dryad, Storm and Spark. However, many 
abstractions expose the distributed detail of the platform to the application 
layer, and lead to increased complexity in programming, decreased performance, 
and, sometimes, loss of generality. At the data substrate layer, traditional 
cloud computing technologies, such as MapReduce, use disk-based file systems as 
the system-wide substrate for data storage and sharing. A distributed file 
system provides a global name space and stores data persistently, but it also 
introduces significant overhead. Several recent systems use DRAM to store data 
and tremendously improve the performance of cloud computing systems. However, 
both our own experience and related work indicate that a simple substitution of 
distributed DRAM for the file system does not provide a solid and viable 
foundation for data processing and storage in the datacenter environment, and 
the capacity of such systems is limited by the amount of physical memory in the 
cluster.

To support general, efficient, flexible, and concurrent application workloads 
with sophisticated data processing, we present programmers an illusion of a big 
virtual machine built on top of one, multiple or many compute nodes and unify 
the physical memory and disks of the nodes to form a globally addressable data 
substrate. We design a new instruction set architecture, i0, to unify myriads 
of compute nodes to form a big virtual machine called MAZE where thousands of 
tasks run concurrently in VOLUME, a large, unified, and snapshotted distributed 
virtual memory. i0, MAZE and VOLUME form the foundation of the Layer Zero 
systems which provide a general substrate for cloud computing. i0 provides a 
simple yet general and scalable programming model. VOLUME mitigates the 
scalability bottleneck of traditional distributed shared memory systems and 
unifies the physical memory and disks on many compute nodes to form a 
distributed transactional virtual memory. VOLUME provides a general 
memory-based abstraction, takes advantage of DRAM in the system to accelerate 
computation, and, transparently to programmers, scales the system to process 
and store large datasets by swapping data to disks and remote servers. Along 
with an efficient execution engine, the capacity of a MAZE can scale up to 
support large datasets and large clusters.

We have implemented the Layer Zero systems on several platforms, and designed 
and implemented various benchmarks, graph processing and machine learning 
programs and application frameworks. Our evaluation shows that Layer Zero has 
excellent performance and scalability. On one physical host, the system 
overhead is comparable to that of traditional VMMs. On 16 physical hosts, Layer 
Zero runs 10 times faster than Hadoop and X10. On 160 physical compute servers, 
Layer Zero scales linearly on a typical iterative workload.


Date:			Tuesday, 12 August 2014

Time:			2:30pm - 4:30pm

Venue:			Room 4480
 			Lifts 25/26

Chairman:		Prof. Man-Yu Wong (MATH)

Committee Members:	Prof. Lin Gu (Supervisor)
 			Prof. Gary Chan
 			Prof. Ke Yi
 			Prof. Pak-Wo Leung (PHYS)
 			Prof. Cho-Li Wang (Comp. Sci., HKU)


**** ALL are Welcome ****