A Distributed In-situ Analysis Method for Large-scale Scientific Data

Title
A Distributed In-situ Analysis Method for Large-scale Scientific Data
Alternative Title
An In-situ Analysis Method for Scientific Big Data in Distributed Systems (분산 환경 기반 시스템에서 과학 기술 빅데이터 in-situ 분석 방법)
Author(s)
Han, Dong Hyoung
DGIST Authors
Han, Dong Hyoung; Kim, Min Soo; Kang, Won Seok; Choi, Jihwan P.
Advisor
Kim, Min Soo
Co-Advisor(s)
Kang, Won Seok; Choi, Jihwan P.
Issued Date
2016
Awarded Date
2016. 2
Type
Thesis
Subject
In-situ processing; data loading; array DBMS; scientific data format; scientific data; distributed systems; in-situ analysis method; array database
Abstract
The size of scientific data has been increasing rapidly in a variety of domains. Scientific data is typically represented as arrays and stored in diverse scientific data formats such as HDF, NetCDF, and MDSplus. Although existing array DBMSs such as SciDB and RasDaMan manage array data, loading data into them remains a challenge. The data loading process of a distributed array DBMS incurs significant overhead because its four inefficient file-format transformation steps cause expensive disk I/O.
In this thesis, we propose DISCAN, a distributed in-situ analysis method that processes scientific queries efficiently and directly over raw scientific array data in distributed array DBMSs. Our approach eliminates unnecessary write operations during data loading and processes only the data a query requires. Our in-situ processing consists of two phases, HDF merger and DISCAN. HDF merger manages the raw scientific data so that it can be distributed to the nodes. DISCAN is composed of Local Map, which transforms the raw scientific data into the internal data representation of the DBMS, and Global Map, which relocates the transformed data according to the partitioning policy of the DBMS. DISCAN reads only the data required during query processing, using the well-defined scientific data format libraries. We evaluate the performance of DISCAN on real-world scientific data. Experimental results show that DISCAN outperforms query processing after data loading in the distributed array DBMS by more than 60 times in the best case. ⓒ 2016 DGIST
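The two-step idea in the abstract (transform only the queried cells into chunks, then place those chunks on nodes according to a partitioning policy) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `local_map` and `global_map`, the in-memory list standing in for a raw HDF dataset, the chunk layout, and the round-robin policy are hypothetical and are not taken from the thesis or from any array DBMS.

```python
from collections import defaultdict


def local_map(raw, region, chunk_shape):
    """Group the cells of `raw` that fall inside `region` by chunk coordinate.

    raw: list of rows (hypothetical stand-in for a dataset read through
         a scientific data format library).
    region: ((row_start, row_stop), (col_start, col_stop)), stops exclusive.
    chunk_shape: (rows_per_chunk, cols_per_chunk), an assumed layout.
    """
    (r0, r1), (c0, c1) = region
    ch, cw = chunk_shape
    chunks = defaultdict(list)
    # Only cells inside the queried region are visited.
    for r in range(r0, r1):
        for c in range(c0, c1):
            chunks[(r // ch, c // cw)].append(((r, c), raw[r][c]))
    return dict(chunks)


def global_map(chunks, num_nodes, chunks_per_row):
    """Assign each chunk to a node, round-robin over its linearized chunk id.

    Round-robin is an assumed policy chosen for simplicity.
    """
    placement = defaultdict(dict)
    for (ci, cj), cells in chunks.items():
        node = (ci * chunks_per_row + cj) % num_nodes
        placement[node][(ci, cj)] = cells
    return dict(placement)


# A 4x4 array; the query touches only the first two rows.
raw = [[r * 4 + c for c in range(4)] for r in range(4)]
chunks = local_map(raw, region=((0, 2), (0, 4)), chunk_shape=(2, 2))
placement = global_map(chunks, num_nodes=2, chunks_per_row=2)
```

In this toy run the last two rows of the array are never touched, which mirrors the claim that only the data required by the query is read.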
Table Of Contents
1. INTRODUCTION
2. PRELIMINARIES
2.1 Array DBMS
2.2 Data loading
3. RELATED WORK
4. DISCAN
4.1 In-situ processing
4.2 Modification of a query plan
4.3 Distributed in-situ scan operator
5. PERFORMANCE EVALUATION
6. CONCLUSIONS
7. REFERENCES
URI
http://dgist.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000002229871

http://hdl.handle.net/20.500.11750/1474
DOI
10.22677/thesis.2229871
Degree
Master
Department
Information and Communication Engineering
Publisher
DGIST
Related Researcher
  • 강원석 Kang, Won-Seok
  • Research Interests: Digital Phenotyping; Data Mining & Machine Learning for Text & Multimedia; Brain-Sense-ICT Convergence Computing; Computational Olfaction Measurement; Simulation & Modeling
Files in This Item:
000002229871.pdf

Other data / 1.58 MB / Adobe PDF
Appears in Collections:
Department of Electrical Engineering and Computer Science Theses Master


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
