PhD Thesis Proposal Defence

Title: "Large-Scale In-Memory Data Processing"


Mr. Zhiqiang MA


As cloud-based computation grows to be an increasingly important paradigm, 
providing a general computational interface and data substrate to support 
datacenter-scale programming has become an imperative research agenda. 
Traditional cloud computing technologies, such as MapReduce, use disk-based 
file systems as the system-wide substrate for data storage and sharing. A 
distributed file system provides a global name space and stores data 
persistently, but it also introduces significant overhead. Several recent 
systems use DRAM to store data and tremendously improve the performance of 
cloud computing systems. However, both our own experience and related work 
indicate that a simple substitution of distributed DRAM for the file system 
does not provide a solid and viable foundation for data storage and processing 
in the datacenter environment, and the capacity of such systems is limited by 
the amount of physical memory in the cluster.

We view the unified physical memory of many hosts as the solid data substrate 
for large-scale efficient data processing for cloud-based systems. We 
investigate the limitations of the traditional file system-based system,
MapReduce, using parallel project compilation as a probing case: a workload
with moderate-size data and dependences among numerous computational steps. We
propose organizing the in-memory data processing in many compute nodes by 
presenting programmers an illusion of a big virtual machine, and design a new
instruction set architecture, i0, to unify myriads of compute nodes to form a 
big virtual machine called MAchine ZEro (MAZE), and present programmers the 
view of a single computer where thousands of tasks run concurrently in a large, 
unified, and snapshotted memory space. i0 and MAZE form the foundation of the 
Layer Zero system, which provides a general substrate for cloud computing.
Layer Zero provides a simple yet scalable programming model and mitigates the
scalability bottleneck of traditional distributed shared memory systems. Along 
with an efficient execution engine, the capacity of Layer Zero can scale up
to support large clusters. We have implemented and tested Layer Zero on four 
platforms, and our evaluation shows that Layer Zero has excellent performance 
and scalability. On the other hand, the simple substitution of distributed DRAM 
for the file system does not fulfill the needs of many data storage and 
processing applications in the datacenter environment. The capacity of such
systems is limited by the amount of physical memory in the cluster, and such
systems do not provide data persistence mechanisms. We propose an improved data
substrate to
unify the physical memory and disk resources on many compute nodes, to form a 
system-wide data substrate for large-scale data processing. The substrate 
provides a general memory-based abstraction, takes advantage of DRAM in the 
system to accelerate computation, and, transparently to programmers, scales the
system to handle large datasets by swapping data to disks and remote servers. 
The memory-based data substrate can also provide a solid foundation for data 
storage systems such as key/value stores.
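
To make the idea concrete, the following is a minimal illustrative sketch (not
the actual Layer Zero or substrate API; the class and method names are
hypothetical) of a key/value substrate that keeps hot data in DRAM and
transparently spills cold entries to a disk tier when a memory budget is
exceeded:

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class MemorySubstrate:
    """Hypothetical sketch of a memory-based data substrate: hot entries
    stay in DRAM (an LRU-ordered dict); when the DRAM budget is exceeded,
    the least recently used entry is spilled to disk, and spilled entries
    are faulted back in transparently on access."""

    def __init__(self, max_in_memory=2):
        self.max_in_memory = max_in_memory   # DRAM "budget", in entries
        self.memory = OrderedDict()          # hot data, LRU order
        self.spill_dir = tempfile.mkdtemp()  # stands in for the disk/remote tier

    def _spill_path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)         # mark as most recently used
        while len(self.memory) > self.max_in_memory:
            # Evict the least recently used entry to the disk tier.
            cold_key, cold_val = self.memory.popitem(last=False)
            with open(self._spill_path(cold_key), "wb") as f:
                pickle.dump(cold_val, f)

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        # Transparent fault-in: load the spilled entry from disk and
        # promote it back into DRAM (possibly evicting another entry).
        with open(self._spill_path(key), "rb") as f:
            value = pickle.load(f)
        self.put(key, value)
        return value
```

In this sketch the caller sees only `put` and `get`; whether a value lives in
DRAM or on disk is invisible to the programmer, which is the property the
proposed substrate aims to provide at datacenter scale.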

Date:			Wednesday, 9 April 2014

Time:			2:00pm - 4:00pm

Venue:			Room 3501 (lifts 25/26)

Committee Members:	Dr. Lin Gu (Supervisor)
 			Dr. Kai Chen (Chairperson)
 			Dr. Ke Yi
 			Prof. Qian Zhang

**** ALL are Welcome ****