DGIST Scholar: Fast and Memory-Efficient Query Processing SysteMaster for Complex OLAP Databases and Queries

Department of Electrical Engineering and Computer Science Theses Ph.D.

Cited time in webofscience

Cited time in scopus

Fast and Memory-Efficient Query Processing SysteMaster for Complex OLAP Databases and Queries

Title: Fast and Memory-Efficient Query Processing SysteMaster for Complex OLAP Databases and Queries

Alternative Title: 복잡한 OLAP 데이터베이스들과 쿼리들에 대해 빠르고 효율적으로 메모리를 활용하는 질의 처리 시스템들

Author(s): Yoon-Min Nam

DGIST Authors: Kim, Min-Soo ; Nam, Yoon-Min ; Nam, BeoMastereok

Advisor: 김민수

Co-Advisor(s): BeoMastereok Nam

Issued Date: 2019

Awarded Date: 2019-08

Type: Thesis

Description: Horizontal database partitioning, Graph-based partitioning, Parallel query processing, Multi-way join processing, OLAP query processing system, GPU-accelerated query processing

Abstract: 최근 OLAP 데이터베이스의 크기가 매우 증가함에 따라 빠르고 효율적인 OLAP 질의 처리 기술이 더욱 중요해지고 있다. 특히, 병렬 OLAP 질의 처리 시스템 관점에서 확장성 있고 효율적인 데이터의 수평분할 저장 방법을 잘 활용하는 것이 매우 중요하다. 하지만, 기존재하는 데이터베이스 수평 분할 방법의 경우 데이터의 중복저장정도가 매우 커질 뿐만 아니라 조인 처리를 위해 여전히 처리 비용이 비싼 셔플 연산을 필요로 하는 경우가 많다.

본 학위논문의 첫번째 공헌으로 우리는 복잡한 스키마와 질의 워크로드를 가진 큰 규모의 데이터베이스를 많은 수의 머신들에 효과적으로 저장하는 방법에 대해 연구하였 다. 구체적으로, 우리는 기 존재하는 방법의 치명적인 단점들이 트리 기반의 데이터베이스 분할 계획에서 유래함을 발견하였고, 기존 방법보다 데이터베이스 중복량을 줄이면서 질의 처리 성능은 더욱 향상시킨 그래프 기반의 새로운 데이터베이스 수평 분할 방법인 GPT 방법을 제안하였다. 우리는 제안한 GPT 방법을 세계에서 가장 널리 사용되는 오픈소스 병렬 질의 처리 시스템인 Spark SQL 구현하였다. 이를 위해, 우리는 새로운 시스템 Spark SQL/GPT 구현 내부에서 스캔 연산자와 질의 계획 생성기 등 관련된 모든 핵심 모듈들 수정하고 필요한 새로운 모듈들 또한 개발하였다. TPC-DS, IMDB 그리고 BioWarehouse 벤치마크들을 활용한 많은 실험을 통해 우리는 제안하는 GPT 방법과 이에 대한 시스템 구현인 Spark SQL/GPT 기술이 기존의 최신 데이터베이스 분할 기법에 대해 저장공간 오버헤드 및 질의 처리속도의 모든 부분에서 크게 향상된 성능을 얻었다는 것을 보였다.

또한, 관계형 OLAP 질의 처리 기술이 다양한 분야에서 사용됨에 따라 기존에 널리 사용되던 간단한 스타 스키마와 그에 따른 질의형태에서는 고려하지 못했던 복잡한 데이터베이스 스키마와 그에 따른 복잡 질의 처리 기술이 요구되고 있다. 위와 같은 복잡한 질의들의 주요한 특징은 두 테이블의 컬럼 사이의 고유키-참조키 관계에 따른 조인 뿐만 아니라 참조키 사이의 조인 혹은 비-고유키 사이의 조인 또한 포함하고 있는 것이다. 위와 같은 새로운 타입의 조인연산을 포함한 질의의 경우 중복된 조인 키 값들로 인해 매우 큰 중간 결과를 생성해야 할 수 있는데, 이 경우 일반적인 이항 조인 연산으로는 잘 처리하지 못한다.

본 학위논문의 두번째 공헌으로, 기존 OLAP 시스템들이 잘 처리하지 못하였던 조인 연산들을 포함하는 복잡한 질의를 효율적으로 처리할 수 있는 질의 처리 시스템을 연구하였다. 구체적으로, 우리는 기 존재하는 트리 기반의 질의 처리 계획 생성 방법이 큰 중간데이터를 생성하는 질의를 잘 처리하지 못한다는 단점을 극복하기 위한 방법으로 다항 조인 연산 개념을 도입하는 것을 연구하였다. 문제를 해결하기 위해 연구 및 개발한 모든 기술애 대해 우리는 오픈소스 데이터베이스 엔진인 OmniSci 내부에 필요한 모든 부 분을 구형하였고 이를 SPRINTER라 부른다. SPRINTER를 통해서 우리는 가장 최신의 시스템들도 잘 처리하지 못하는 복잡한 질의에 대해 큰 성능 향상이 있음을 실험을 통해 보였다.|Nowadays, a fast and eﬃcient OLAP query processing techniques have become more important than ever because of signiﬁcantly increased size of OLAP database. Speciﬁcally, it is important to utilize a scalable and eﬃcient horizontal database partitioning method as parallel database systeMaster have large amounts of data to process. The existing partitioning methods have major drawbacks that not only cause large amounts of data redundancy but also still require expensive shuﬄe operations for join queries in many cases—despite their high data redundancy.

As the ﬁrst contribution of this dissertation, we study how to eﬃciently store a huge volume of database having a complex schema and its workload across a number of machines. Speciﬁcally, we elucidate upon the drawbacks of the existing horizontal database partitioning method originating from the tree-based partitioning schemes and propose a novel graph-based database partitioning method called GPT that both improves the query performance and reduces data redundancy. We integrate the proposed GPT method into a parallel query processing system, Spark SQL, across all the relevant layers and modules, including the query plan generator and the scan operator. Through extensive experiments using three benchmarks, TPC-DS, IMDB and BioWarehouse, we show that GPT signiﬁcantly outperforMaster the state-of-the-art method in terMaster of both storage overhead and query performance.

As a concept of relational OLAP query evaluation has widely adopted in various areas, the database schemas and their query workload have become complex which have not been addressed by simple star schema and its query workload. The characteristics of such queries contain join operations not only a conventional primary-key-foreign-key joins but also among foreign-keys and/or non-unique keys. Such type of queries may generate large join results due to duplicated join key values which a typical binary joins are not performed well.

As the second contribution of this dissertation, we study the integration of n-ary join operator into the existing tree-based query planning method, to solve the performance issue incurred from large intermediate results. We integrate all proposed techniques into an open-source modern database engine OmniSci (which we called SPRINTER) across all relevant system modules including query plan generator and physical join operator. Under the system we integrated, we prove its signiﬁcant beneﬁt regarding query performance over a complex join query that most of query evaluation systeMaster including state-of-the-art system cannot evaluate such query workload eﬃciently.

Table Of Contents: Abstract
1 Introduction
1.1 Motivation and objectives
1.2 Main contributions
1.3 Structure of thesis
Chapter 2.Spark SQL/GPT: A parallel database system based on a graph-based database partitioning
2.1.Motivation
2.1.1.PREF method
2.1.2.Tuple-level Duplicates
2.1.3.Table-level Duplicates
2.1.4.Join Operations with repartitioning
2.2.GPT Method
2.2.1.Join Graph and Partitioned Graph
2.2.2.Determining Vertices
2.2.3.Triangles and Hubs
2.2.4.Determining Edges
2.2.5.A Case Study: TPC-DS Benchmark
2.2.6.Comparison Analysis with PREF
2.3.HMC Partitioning
2.4.Query Processing
2.4.1.Scanning a Table Irrelevant to a Join (Single Scan)
2.4.2.Scanning a Table Relevant to a Join (Join Scan)
2.4.3.Cyclic Query Processing
2.5.Integration of GPT into Spark SQL
2.5.1.System Overview
2.5.2.GPT Partitioner of Spark SQL/GPT
2.5.3.Spark SQL/GPT Query Processor
2.6.Experimental Evaluation
2.6.1.Experimental Setup
2.6.2.Data Redundancy and Loading for TPC-DS
2.6.3.Query Performance for TPC-DS
2.6.4.Results for IMDB and BioWarehouse
2.6.5.Characteristics of GPT
2.6.6.Evaluation of Spark SQL/GPT
2.6.7.Evaluation of Cyclic Query
2.7.Related Work
Chapter 3.SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries
3.1.Motivation
3.2.Preliminaries
3.2.1.Sorting algorithMaster
3.2.2.Worst-case optimal join algorithm
3.3.Query planning method
3.3.1.Query of a single n-ary join operator
3.3.2.Query of multiple n-ary join operators
3.3.3.Search space
3.4.n-ary join processing method
3.4.1.Determination of Global Order
3.4.2.Strategy for Sorting
3.4.3.Merge Join of Sorted Relations
3.5.Cost model for query optimization
3.5.1.Cost model of the base system
3.5.2.Cost model of SPRINTER
3.6.Experimental Evaluation
3.6.1.Experimental Setup
3.6.2.Relational OLAP Query Performance
3.6.3.Characteristics of SPRINTER
3.7.Related Work
Chapter 4.Conclusion
Bibliography