DGIST Scholar: A Distributed In-situ Analysis Method for Large-scale Scientific Data

Title: A Distributed In-situ Analysis Method for Large-scale Scientific Data

Alternative Title: An In-situ Analysis Method for Scientific Big Data in Distributed-Environment Systems

Author(s): Han, Dong Hyoung

DGIST Authors: Han, Dong Hyoung ; Kim, Min Soo ; Kang, Won Seok ; Choi, Jihwan P.

Advisor: Kim, Min Soo

Co-Advisor(s): Kang, Won Seok ; Choi, Jihwan P.

Issued Date: 2016

Awarded Date: 2016. 2

Type: Thesis

Subject: In-situ processing ; data loading ; array DBMS ; scientific data format ; scientific data ; distributed environment system ; in-situ analysis method ; array database

Abstract: The size of scientific data has been increasing rapidly in a variety of domains. Scientific data is represented as array data and is managed in diverse scientific data formats such as HDF, NetCDF, and MDSplus. Although existing array DBMSs such as SciDB and RasDaMan manage array data, loading data into an array DBMS remains a challenge. The data loading process of a distributed array DBMS incurs significant overhead because its four inefficient file-format transformation steps require expensive disk I/O.
In this paper, we propose DISCAN, a distributed in-situ analysis method that processes scientific queries efficiently and directly over raw scientific array data in distributed array DBMSs. Our approach eliminates unnecessary write operations during data loading and processes only the data a query requires. Our in-situ processing consists of two phases, HDF merger and DISCAN. HDF merger manages the raw scientific data so that it can be distributed to the nodes. DISCAN is composed of Local Map, which transforms the raw scientific data into the internal data representation of the DBMS, and Global Map, which relocates the transformed data according to the partitioning policy of the DBMS. DISCAN reads only the data required during query processing, using the well-defined scientific data format libraries. We evaluate the performance of DISCAN on a real-world scientific dataset. Experimental results show that DISCAN outperforms query processing after data loading in the distributed array DBMS by a factor of more than 60 in the best case. ⓒ 2016 DGIST
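
The two-step idea of Local Map and Global Map can be illustrated with a minimal Python sketch. The abstract does not give the actual algorithm, so everything here is an assumption made for illustration: the names `local_map` and `global_map`, the chunk size, the node count, the placement rule, and the use of a nested list in place of a real HDF file. It only shows the general pattern of reading the cells a query needs, grouping them into chunks, and then assigning chunks to nodes under a partitioning policy.

```python
from collections import defaultdict

CHUNK = 2       # hypothetical chunk edge length (not from the thesis)
NUM_NODES = 2   # hypothetical cluster size (not from the thesis)


def local_map(raw, row_range, col_range):
    """Read only the requested region and group its cells by chunk id.

    Stands in for transforming raw data into a DBMS-internal chunk layout.
    """
    chunks = defaultdict(list)
    for r in range(*row_range):
        for c in range(*col_range):
            chunk_id = (r // CHUNK, c // CHUNK)
            chunks[chunk_id].append(((r, c), raw[r][c]))
    return dict(chunks)


def global_map(chunks, num_nodes=NUM_NODES):
    """Assign each chunk to a node with an assumed modulo placement rule."""
    placement = defaultdict(dict)
    for chunk_id, cells in chunks.items():
        node = (chunk_id[0] + chunk_id[1]) % num_nodes
        placement[node][chunk_id] = cells
    return dict(placement)


# A 4x4 in-memory array standing in for a raw scientific data file.
raw = [[r * 4 + c for c in range(4)] for r in range(4)]

# The query touches rows 0-1 only, so rows 2-3 are never read.
chunks = local_map(raw, (0, 2), (0, 4))
placement = global_map(chunks)
```

In this sketch the query region covers two chunks, which end up on different nodes; a real system would read the region through a format library rather than from a Python list.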

Table Of Contents: 1. INTRODUCTION
2. PRELIMINARIES
2.1 Array DBMS
2.2 Data loading
3. RELATED WORK
4. DISCAN
4.1 In-situ processing
4.2 Modification of a query plan
4.3 Distributed in-situ scan operator
5. PERFORMANCE EVALUATION
6. CONCLUSIONS
7. REFERENCES

URI: http://dgist.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000002229871

http://hdl.handle.net/20.500.11750/1474

DOI: 10.22677/thesis.2229871

Degree: Master

Department: Information and Communication Engineering

Publisher: DGIST

Related Researcher

Kang, Won-Seok
Research Interests: Digital Phenotyping; Data Mining & Machine Learning for Text & Multimedia; Brain-Sense-ICT Convergence Computing; Computational Olfaction Measurement; Simulation & Modeling

Files in This Item: 000002229871.pdf
Other data / 1.58 MB / Adobe PDF

Appears in Collections: Department of Electrical Engineering and Computer Science Theses Master
