
A Distributed In-situ Analysis Method for Large-scale Scientific Data

Translated Title
분산 환경 기반 시스템에서 과학 기술 빅데이터 in-situ 분석 방법
Han, Dong Hyoung
DGIST Authors
Han, Dong Hyoung; Kim, Min Soo; Kang, Won Seok; Choi, Jihwan P.
Kim, Min Soo
Kang, Won Seok; Choi, Jihwan P.
Issue Date
Available Date
Degree Date
2016. 2
In-situ processing; data loading; array DBMS; scientific data format; scientific and technical data; distributed systems; in-situ analysis method; array database
The size of scientific data has been increasing rapidly in a variety of domains. Scientific data is typically represented as arrays and managed in diverse scientific data formats such as HDF, NetCDF, and MDSplus. Although existing array DBMSs such as SciDB and RasDaMan manage array data, loading data into an array DBMS remains a challenge. The data loading process of a distributed array DBMS incurs significant overhead because its four inefficient file-format transformation steps require expensive disk I/O. In this paper, we propose DISCAN, a distributed in-situ analysis method that processes scientific queries efficiently and directly over raw scientific array data in distributed array DBMSs. Our approach eliminates unnecessary write operations during data loading and processes only the data required by a query. Our in-situ processing consists of two phases, HDF merger and DISCAN. HDF merger manages the raw scientific data so that it can be distributed to the nodes. DISCAN is composed of Local Map, which transforms the raw scientific data into the internal data representation of the DBMS, and Global Map, which relocates the transformed data according to the partitioning policy of the DBMS. DISCAN reads only the data required during query processing, using the well-defined scientific data format libraries. We evaluate the performance of DISCAN on a real-world scientific dataset. Experimental results show that DISCAN outperforms query processing after data loading in the distributed array DBMS by more than 60 times in the best case. ⓒ 2016 DGIST
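The flow described in the abstract (read only the chunks a query touches, convert them locally, then place them according to a partitioning policy) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `chunks_for_query`, `local_map`, `global_map`, and `in_situ_scan` are hypothetical, a nested Python list stands in for a raw HDF dataset, and round-robin placement over row-major chunk numbers is an assumed partitioning policy.

```python
from collections import defaultdict
from itertools import product


def chunks_for_query(lo, hi, chunk_shape):
    """Return ids of the chunks that overlap the inclusive query box [lo, hi]."""
    ranges = [range(l // c, h // c + 1) for l, h, c in zip(lo, hi, chunk_shape)]
    return list(product(*ranges))


def local_map(raw, chunk_id, chunk_shape):
    """Hypothetical Local Map: convert one chunk-sized region of a raw 2-D
    array into (coordinates, value) cells, a stand-in for a DBMS-internal chunk."""
    r0 = chunk_id[0] * chunk_shape[0]
    c0 = chunk_id[1] * chunk_shape[1]
    cells = []
    for r in range(r0, min(r0 + chunk_shape[0], len(raw))):
        for c in range(c0, min(c0 + chunk_shape[1], len(raw[0]))):
            cells.append(((r, c), raw[r][c]))
    return cells


def global_map(chunk_id, chunks_per_row, num_nodes):
    """Hypothetical Global Map: assign a chunk to a node by round-robin
    over its row-major chunk number (an assumed partitioning policy)."""
    linear = chunk_id[0] * chunks_per_row + chunk_id[1]
    return linear % num_nodes


def in_situ_scan(raw, lo, hi, chunk_shape, num_nodes):
    """Touch only the chunks the query needs and group their cells by node."""
    chunks_per_row = -(-len(raw[0]) // chunk_shape[1])  # ceiling division
    placement = defaultdict(list)
    for cid in chunks_for_query(lo, hi, chunk_shape):
        cells = [
            (xy, v)
            for xy, v in local_map(raw, cid, chunk_shape)
            if all(l <= x <= h for x, l, h in zip(xy, lo, hi))
        ]
        placement[global_map(cid, chunks_per_row, num_nodes)].extend(cells)
    return dict(placement)


# A 4x4 array with values 0..15, queried for rows 0..1 and columns 0..2.
raw = [[r * 4 + c for c in range(4)] for r in range(4)]
result = in_situ_scan(raw, lo=(0, 0), hi=(1, 2), chunk_shape=(2, 2), num_nodes=2)
```

In this toy run only the two chunks in the top row are visited; the bottom half of the array is never read, which is the saving the in-situ approach aims for. A real system would read the selected regions through a scientific data format library rather than from an in-memory list.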
Table Of Contents
1. INTRODUCTION
2. PRELIMINARIES
2.1 Array DBMS
2.2 Data loading
3. RELATED WORK
4. DISCAN
4.1 In-situ processing
4.2 Modification of a query plan
4.3 Distributed in-situ scan operator
5. PERFORMANCE EVALUATION
6. CONCLUSIONS
7. REFERENCES
Information and Communication Engineering
Related Researcher
  • Author Kim, Min-Soo InfoLab
  • Research Interests Big Data Systems; Big Data Mining & Machine Learning; Big Data Bioinformatics; Data Mining and Big Data Analytics; Bioinformatics and Neuroinformatics; Brain-Machine Interface (BMI)
Department of Information and Communication Engineering > Theses > Master

