DGIST Scholar: Resource-aware multi-way join processing using MapReduce



Title: Resource-aware multi-way join processing using MapReduce

Alternative Title: 자원 활용을 고려한 맵리듀스 기반 멀티웨이 조인 처리

Author(s): Nam, Yoon Min

DGIST Authors: Nam, Yoon Min ; Kim, Min Soo ; Choi, Jihwan P.

Advisor: Kim, Min Soo

Co-Advisor(s): Choi, Jihwan P.

Issued Date: 2015

Awarded Date: February 2015

Type: Thesis

Subject: multi-way join ; resource-aware ; MapReduce ; streaming ; balanced workload ; 멀티웨이 조인 ; 자원고려 ; 맵리듀스 ; 스트리밍 ; 균등한 작업량

Abstract: With growing demand for extracting hidden insights from large-scale data, multi-way join operations have become key to many OLAP-style data analytic tasks, not only for relational data analysis but also for various scientific applications. To process such tasks cost-efficiently, shared-nothing distributed systems such as MapReduce have gained popularity in both academia and industry. However, because MapReduce lacks support for features such as processing data from multiple sources and handling data skew, the join is an inefficient operation in MapReduce. Specifically, query execution plan generation in MapReduce does not consider the dominantly utilized resources, which significantly affect query processing performance. Our work is based on observations that run counter to the conventional wisdom of query processing in MapReduce: reducing the number of MapReduce jobs does not guarantee a performance benefit, and growing intermediate data does not always cause performance degradation.
In this work, we propose an efficient resource-aware multi-way join processing method that takes both an algorithmic and a systemic approach. As the algorithmic approach, we propose an in-memory streaming hash join method that carefully considers the memory constraints of a computing machine and balances the workload of each join task. As the systemic approach, we propose a method for generating efficient multi-way join query execution plans. In the experimental results, our method improves the performance of multi-way join query processing, in particular relative to the latest version of Apache Hive [5] and AQUA [11]. In addition, our method shows better performance even when the aggregated intermediate data is larger than that of other methods, by exploiting the major resources very efficiently. ⓒ 2015 DGIST
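The general idea behind an in-memory streaming hash join under a memory constraint can be illustrated with a short sketch. This is not the thesis's algorithm, which is described only at a high level in the abstract. It is a minimal single-process illustration that assumes a star-shaped join in which one large relation is streamed once and every smaller relation fits in memory. The names `build_hash_table`, `streaming_hash_join`, and `max_rows` are hypothetical, and the row-count budget is a stand-in for a real memory limit.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def build_hash_table(rows: Iterable[Row], key: str, max_rows: int) -> Dict[object, List[Row]]:
    """Build an in-memory hash table on `key`; fail if the row budget is exceeded."""
    table: Dict[object, List[Row]] = defaultdict(list)
    count = 0
    for row in rows:
        count += 1
        if count > max_rows:
            # Hypothetical budget check standing in for a real memory constraint.
            raise MemoryError(f"build side exceeds budget of {max_rows} rows")
        table[row[key]].append(row)
    return table


def streaming_hash_join(
    stream: Iterable[Row],
    builds: List[Tuple[str, Iterable[Row]]],
    max_rows: int,
) -> Iterator[Row]:
    """Stream `stream` once, probing one hash table per (key, rows) pair in `builds`."""
    tables = [(key, build_hash_table(rows, key, max_rows)) for key, rows in builds]
    for row in stream:
        partial = [row]
        for key, table in tables:
            # Probe with the join key taken from the streamed row (inner join).
            matches = table.get(row.get(key), [])
            partial = [{**left, **right} for left in partial for right in matches]
            if not partial:
                break  # no match in this relation, so the row is dropped
        yield from partial


orders = [
    {"cust_id": 1, "prod_id": 10, "qty": 2},
    {"cust_id": 2, "prod_id": 11, "qty": 1},
    {"cust_id": 3, "prod_id": 10, "qty": 5},
]
customers = [{"cust_id": 1, "name": "A"}, {"cust_id": 2, "name": "B"}]
products = [{"prod_id": 10, "title": "X"}]

joined = list(
    streaming_hash_join(orders, [("cust_id", customers), ("prod_id", products)], max_rows=1000)
)
```

In this sketch all joins are evaluated in a single pass over the streamed relation, so no intermediate join result is materialized between joins. A distributed variant would additionally need to partition the streamed relation across tasks so that each task receives a comparable amount of work, which this example does not attempt.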

Table Of Contents: Ⅰ. INTRODUCTION --
Ⅱ. BACKGROUND --
2.1 MapReduce --
2.2 SQL-on-Hadoop --
Ⅲ. JOIN PROCESSING USING MAPREDUCE --
3.1 Basic join algorithms in MapReduce --
3.2 Replicated join --
3.3 1-Bucket-Theta --
Ⅳ. RESOURCE-AWARE MULTI-WAY JOIN PROCESSING --
4.1 Problem description --
4.2 Cost model of operator pipeline in MapReduce --
4.3 In-memory streaming hash join --
4.4 Finding multi-way join group --
Ⅴ. EXPERIMENTS --
Ⅵ. RELATED WORK --
Ⅶ. CONCLUSIONS --
Ⅷ. REFERENCE

URI: http://dgist.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000001922578

http://hdl.handle.net/20.500.11750/1381

DOI: 10.22677/thesis.1922578

Degree: Master

Department: Information and Communication Engineering

Publisher: DGIST

Files in This Item: 000001922578.pdf
Other data / 12.29 MB / Adobe PDF

Appears in Collections: Department of Electrical Engineering and Computer Science Theses Master
