A Survey of Column-Oriented Storage Techniques in Read-Optimized Data Warehouse Systems

PhD Qualifying Examination


Title: "A Survey of Column-Oriented Storage Techniques in Read-Optimized
Data Warehouse Systems"

by

Mr. Jiangchuan Zheng


Abstract:

Most traditional DBMSs store records row by row. Historically, the
row-store layout was chosen not merely for technical simplicity, but
because typical transaction-processing workloads access data at the
granularity of individual entities. With the emergence of big data,
however, comes another kind of query that is more analytical in nature:
it does not care about the details of particular entities, but targets
high-level statistical information that helps with data mining tasks in a
warehouse environment. Analytical workloads are read-intensive,
attribute-focused and big data-oriented, in sharp contrast to
transactional queries. In view of these characteristics, the
write-optimized row-store layout is no longer the best choice, and a
redesign of the physical layer is needed.

In recent years, the column-oriented storage structure has gained
popularity in both the research and industrial communities. By organizing
tabular data column by column in the physical layer, a column-store
outperforms row-stores in processing analytical workloads, as it needs to
access only the relevant attributes. Advantages of column-stores over
row-stores include high I/O efficiency, greater opportunities for
compression and high flexibility in adapting to dynamic workloads.
Nevertheless, quite a few challenges remain, ranging from tuple
reconstruction to compression-based query execution.

In this survey, we review major research results towards building a
high-performance, analytics-oriented column-store warehouse system. We
start with a description of the storage layout and execution engine in
C-Store, an open-source column-store system. We then delve into several
key issues in column-store systems, such as compression, tuple
reconstruction and materialization strategies. We summarize key challenges
and typical solutions, and describe from a system perspective how they
help improve the performance of analytical workload processing. We also
review major issues in applying column-store techniques in distributed
environments such as MapReduce. Finally, we end this survey with some
conclusions and future directions.


Date:                   Friday, 17 February 2012

Time:                   2:00pm - 4:00pm

Venue:                  Room 3301A
                         lifts 17/18

Committee Members:      Prof. Lionel Ni (Supervisor)
                        Dr. Qiong Luo (Chairperson)
                        Dr. Lei Chen
                        Dr. Lin Gu


**** ALL are Welcome ****