DGIST Scholar: Resource-aware multi-way join processing using MapReduce

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Kim, Min Soo	-
dc.contributor.author	Nam, Yoon Min	-
dc.date.accessioned	2017-05-10T08:51:04Z	-
dc.date.available	2015-01-12T00:00:00Z	-
dc.date.issued	2015	-
dc.identifier.uri	http://dgist.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000001922578	en_US
dc.identifier.uri	http://hdl.handle.net/20.500.11750/1381	-
dc.description.abstract	With growing demand for extracting hidden insights from large-scale data, multi-way join operations have become central to many OLAP-style data analytic tasks, not only in relational data analysis but also in various scientific applications. To process such tasks cost-efficiently, shared-nothing distributed systems such as MapReduce have gained popularity in both academia and industry. However, because MapReduce lacks support for features such as processing data from multiple sources and handling data skew, join is an inefficient operation in MapReduce. Specifically, query execution plan generation in MapReduce does not consider the dominantly utilized resources that significantly affect query processing performance. Our work is based on observations that run counter to the conventional wisdom of query processing in MapReduce: reducing the number of MapReduce jobs does not guarantee a performance benefit, and growing intermediate data does not always trigger performance degradation. In this work, we propose an efficient resource-aware multi-way join processing method that takes both an algorithmic and a systemic approach. As the algorithmic approach, we propose an in-memory streaming hash join method that carefully considers the memory constraints of a computing machine and balances the workload of each join task. As the systemic approach, we propose a method for generating efficient multi-way join query execution plans. In experimental results, our method improves multi-way join query processing performance, in particular compared with the latest version of Apache Hive [5] and AQUA [11]. In addition, our method shows better performance even when its aggregated intermediate data is larger than that of other methods, by exploiting major resources very efficiently. ⓒ 2015 DGIST	-
dc.description.tableofcontents	Ⅰ. INTRODUCTION -- Ⅱ. BACKGROUND -- 2.1 MapReduce -- 2.2 SQL-on-Hadoop -- Ⅲ. JOIN PROCESSING USING MAPREDUCE -- 3.1 Basic join algorithms in MapReduce -- 3.2 Replicated join -- 3.3 1-Bucket-Theta -- Ⅳ. RESOURCE-AWARE MULTI-WAY JOIN PROCESSING -- 4.1 Problem description -- 4.2 Cost model of operator pipeline in MapReduce -- 4.3 In-memory streaming hash join -- 4.4 Finding multi-way join group -- Ⅴ. EXPERIMENTS -- Ⅵ. RELATED WORK -- Ⅶ. CONCLUSIONS -- Ⅷ. REFERENCE	-
dc.format.extent	64	-
dc.language	eng	-
dc.publisher	DGIST	-
dc.subject	multi-way join	-
dc.subject	resource-aware	-
dc.subject	MapReduce	-
dc.subject	streaming	-
dc.subject	balanced workload	-
dc.subject	multi-way join (멀티웨이 조인)	-
dc.subject	resource-aware (자원고려)	-
dc.subject	MapReduce (맵리듀스)	-
dc.subject	streaming (스트리밍)	-
dc.subject	balanced workload (균등한 작업량)	-
dc.title	Resource-aware multi-way join processing using MapReduce	-
dc.title.alternative	Resource-Aware MapReduce-Based Multi-Way Join Processing (자원 활용을 고려한 맵리듀스 기반 멀티웨이 조인 처리)	-
dc.type	Thesis	-
dc.identifier.doi	10.22677/thesis.1922578	-
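dc.description.alternativeAbstract	MapReduce, a distributed and parallel data processing system based on a shared-nothing architecture, is used for large-scale data analysis in a very wide range of fields, from databases to the basic and applied sciences. However, because MapReduce was designed to process very simple kinds of queries, it handles poorly the highly complex queries used in large-scale data analysis. In particular, for join operations, which are used very frequently in data analysis queries, its data flow is very inefficient compared with that of conventional RDBMSs. Despite these drawbacks, as the volume of data to be processed grows explosively and the limits of conventional RDBMS-based scale-up systems have become clear, techniques for processing various forms of queries with MapReduce, and especially efficient methods for processing joins, are greatly needed. Among the various kinds of join operations, the multi-way join is used very frequently in complex queries that derive meaningful results from data, and it allows such queries to be expressed efficiently. An efficient method for processing multi-way joins in a MapReduce-based environment is therefore a technique of very high practical value. This thesis deals with an efficient multi-way join processing method using MapReduce. In particular, to address the memory overflow problem that occurs very frequently in join operations and the slow processing speed of multi-way joins, it proposes efficient methods from two perspectives, algorithmic and systemic. First, from the algorithmic perspective, this thesis proposes an in-memory streaming join method. In particular, it covers a data partitioning method for solving the memory overflow problem that easily occurs during join processing. It also covers a method for generating an optimized join order that minimizes data movement between the series of MapReduce jobs needed to process a complex query. In addition, from the systemic perspective, it derives a cost model of the performance differences among system resources during MapReduce processing and uses it to propose an efficient method for generating complex query execution plans that make maximal use of the fastest resource. In experiments using various TPC-H queries and generated data, the proposed multi-way join processing method shows better performance in most cases than the query processing plans proposed by the latest Apache Hive [5] and AQUA [11], and, thanks to efficient use of major resources, it performs well even when the aggregated intermediate data grows larger. ⓒ 2015 DGIST	-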
dc.description.degree	Master	-
dc.contributor.department	Information and Communication Engineering	-
dc.contributor.coadvisor	Choi, Jihwan P.	-
dc.date.awarded	2015. 2	-
dc.publisher.location	Daegu	-
dc.description.database	dCollection	-
dc.date.accepted	2015-01-12	-
dc.contributor.alternativeDepartment	Graduate School, Information and Communication Engineering Major (대학원 정보통신융합공학전공)	-
dc.contributor.affiliatedAuthor	Nam, Yoon Min	-
dc.contributor.affiliatedAuthor	Kim, Min Soo	-
dc.contributor.affiliatedAuthor	Choi, Jihwan P.	-
dc.contributor.alternativeName	남윤민	-
dc.contributor.alternativeName	김민수	-
dc.contributor.alternativeName	최지환	-
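
The abstract names an in-memory streaming hash join that respects a machine's memory constraint but does not give its details. The following is only a generic, minimal sketch of that general idea and not the thesis's algorithm: the build relation is split into blocks of at most `max_build_rows` rows (a hypothetical stand-in for a memory budget), each block is hashed in memory, and the probe relation is streamed against it. All function names, parameters, and sample data here are illustrative assumptions.

```python
from collections import defaultdict
from itertools import islice


def chunks(rows, max_rows):
    """Yield consecutive lists of at most max_rows rows from an iterable."""
    it = iter(rows)
    while True:
        block = list(islice(it, max_rows))
        if not block:
            return
        yield block


def streaming_hash_join(build_rows, probe_factory, build_key, probe_key,
                        max_build_rows):
    """Equi-join two relations while holding at most max_build_rows
    build-side rows in memory at a time.

    probe_factory is a zero-argument callable returning a fresh iterator
    over the probe relation, because the probe side is re-streamed once
    per build block (an assumption of this sketch).
    """
    for block in chunks(build_rows, max_build_rows):
        # Build phase: hash only the current block of the build relation.
        table = defaultdict(list)
        for row in block:
            table[build_key(row)].append(row)
        # Probe phase: stream the other relation and emit matching pairs.
        for probe_row in probe_factory():
            for build_row in table.get(probe_key(probe_row), ()):
                yield build_row + probe_row


# Hypothetical sample data: (customer_id, name) and (order_id, customer_id).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(101, 1), (102, 3), (103, 1)]

result = sorted(
    streaming_hash_join(
        customers,
        lambda: iter(orders),
        build_key=lambda c: c[0],
        probe_key=lambda o: o[1],
        max_build_rows=2,  # pretend only two build rows fit in memory
    )
)
print(result)
```

Re-streaming the probe side once per build block trades extra reads for a bounded memory footprint. Balancing the workload across join tasks, which the abstract also mentions, is not modeled in this sketch.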

Files in This Item: 000001922578.pdf
Other data / 12.29 MB / Adobe PDF

Appears in Collections: Department of Electrical Engineering and Computer Science Theses Master
