LLM-Driven Human-Robot Interaction with Tri-Modal Tactile Skin for Context-Aware Speech and Gesture Generation in Social Settings

DC Field Value Language
dc.contributor.advisor 박경서 -
dc.contributor.author Fawole Emmanuel Ademola Ayobami -
dc.date.accessioned 2026-01-23T10:55:10Z -
dc.date.available 2026-01-23T10:55:10Z -
dc.date.issued 2026 -
dc.identifier.uri https://scholar.dgist.ac.kr/handle/20.500.11750/59666 -
dc.identifier.uri http://dgist.dcollection.net/common/orgView/200000949725 -
dc.description Human–Robot Interaction (HRI), Large Language Models (LLMs), Multimodal Interaction, Multimodal Sensing, Tri-Modal Tactile Skin, Gesture Chaining, Embodied Conversational Robots, Emotional and Expressive Understanding, Tactile Gesture Classification, Non-Parametric User Study Analysis, LangChain, LangGraph -
dc.description.abstract 본 논문은 음성, 얼굴 정서, 촉각 상호작용을 통합하여 사회적 단서를 인식하고 반응할 수 있는 대규모 언어 모델(LLM) 기반 다중모달 인간–로봇 상호작용(HRI) 시스템의 설계, 구현 및 평가를 제시한다. 본 연구는 실시간 상호작용 환경에서 이질적인 감각 입력을 표현적인 신체 행동과 함께 통합적으로 조율할 수 있는 통합 추론 프레임워크의 부재라는 기존 HRI 시스템의 핵심적인 한계를 해결하고자 한다.
제안하는 시스템은 공압, 전도성, 음향 센서를 결합한 삼중 모달 로봇 촉각 스킨과 LLM 기반 상호작용 아키텍처를 통합한다. 촉각 신호는 안전한 데이터 증강을 적용하여 학습된 강건성 향상 1차원 합성곱 신경망(1D CNN)을 통해 실시간으로 보정 및 분류된다. 시각 인지는 얼굴 인식 및 정서 분석을 통해 수행되며, 청각 인지는 음성-텍스트 변환을 통해 처리된다. 모든 감각 입력은 실시간 입출력을 담당하는 NVIDIA Jetson Orin NX와 고성능 GPU 추론 서버를 연결하는 분산 아키텍처 내에서 처리된다. LangChain 및 LangGraph 기반 프레임워크를 통해 이들 입력은 대규모 언어 모델과 연결되며, LLM은 언어적·비언어적 로봇 행동을 조율하는 중심 추론 구성요소로 기능한다. 표현 출력은 자연스러운 음성 합성과 함께, 발화 길이와 대화적 강조에 맞추어 여러 사전 정의된 동작 프리미티브를 동적으로 선택·연결하는 제스처 체이닝(Gesture Chaining) 기법을 통해 생성된다.
시스템의 효과를 검증하기 위해 2단계 반복측정(Within-subject) 사용자 실험을 수행하였다. 총 30명이 실험에 참여하였으며, 이 중 28명의 데이터가 최종 분석에 포함되었다. 1단계에서는 음성만을 이용한 상호작용을 수행하였고, 2단계에서는 음성, 얼굴 표정, 촉각을 포함한 다중모달 상호작용을 제공하였다. 정량적 평가는 7점 리커트 척도를 사용하여 의사소통 및 이해도, 자연스러움 및 몰입도, 정서 및 표현적 이해도, 촉각 및 다중모달 상호작용(2단계만 해당)을 측정하였으며, 비모수 통계 기법을 통해 분석하였다. 또한 개방형 질문과 연구자 관찰을 통해 정성적 피드백을 수집하였다.
분석 결과, 다중모달 상호작용은 의사소통의 명확성을 저해하지 않으면서 정서적 및 표현적 이해도에 있어 일관된 향상을 보였다. 자연스러움과 몰입도에 대한 집단 평균 변화는 참가자 간 차이를 보였으나, 정성적 결과와 모달리티별 평가에서는 다중모달 조건에 대한 전반적인 선호가 확인되었다. 촉각 및 신체 제스처는 대체로 의사소통을 방해하기보다는 보조하는 요소로 인식되었으며, 다만 개인별 편안함 수준과 반응 방식에는 차이가 존재하였다. 이러한 개인차 분석은 주관적 상호작용 품질을 해석하는 데 있어 혼합 방법론 평가의 중요성을 강조한다.
종합적으로 본 연구는 대규모 언어 모델이 다중모달 인지와 표현 행동을 통합적으로 조율하는 체화된 사회적 로봇의 중심 추론 에이전트로 효과적으로 기능할 수 있음을 보여준다. 보정된 촉각 인지, 다중모달 지각, 그리고 LLM 기반 제스처 체이닝을 통합함으로써, 본 시스템은 대화 명확성을 유지하면서 사용자 인지된 정서 이해와 선호도를 향상시킬 수 있음을 실증적으로 입증한다. 본 연구의 결과는 사회적 표현성과 상황 인지 능력을 갖춘 인간–로봇 상호작용을 위한 재현 가능한 아키텍처와 방법론적 통찰을 제공한다.|This thesis presents the design, implementation, and evaluation of an LLM-driven multimodal human–robot interaction (HRI) system that enables a humanoid robot to perceive and respond to social cues through the integration of speech, facial affect, and tactile interaction. The work addresses a central limitation in existing HRI systems: the lack of a unified reasoning framework capable of coordinating heterogeneous sensory inputs with expressive embodied output in real time.
The proposed system integrates a tri-modal robotic skin, combining pneumatic, conductive, and acoustic sensing, with an LLM-based interaction architecture. Tactile signals are calibrated and classified in real time using a robustness-enhanced one-dimensional convolutional neural network (1D CNN) trained with safe data augmentation. Visual perception is achieved through facial recognition and affect analysis, while auditory perception relies on speech-to-text transcription. All sensory inputs are processed within a distributed architecture linking an NVIDIA Jetson Orin NX for real-time input/output with a GPU-enabled inference server. A LangChain/LangGraph-based framework connects these inputs to a large language model, which serves as the central reasoning component for coordinating verbal and non-verbal robot behaviour. Expressive output is generated through natural-sounding speech synthesis and LLM-planned gesture sequences using a method termed Gesture Chaining, in which multiple predefined motion primitives are dynamically selected and sequenced to align embodied motion with speech duration and conversational emphasis.
The system was evaluated through a two-stage within-subject user study involving 30 participants, with 28 valid datasets included in the final analysis. In Stage 1, participants interacted with the robot using voice-only communication. In Stage 2, participants engaged in multimodal interaction incorporating speech, facial expression, and touch, while the robot responded using coordinated speech and embodied gestures. Quantitative evaluation employed seven-point Likert-scale measures assessing Communication and Understanding, Naturalness and Engagement, Emotional and Expressive Understanding, and Touch and Multimodal Interaction (Stage 2 only), analysed using non-parametric statistical methods. Qualitative feedback was collected through open-ended questions and researcher observation.
Results indicate that multimodal interaction preserved communication clarity while producing a consistent improvement in perceived emotional and expressive understanding. While aggregate shifts in naturalness and engagement varied across participants, qualitative findings and modality-specific ratings revealed a strong overall preference for the multimodal condition. Touch and embodied gestures were generally perceived as supportive of communication rather than disruptive, though individual differences in comfort and response style were observed. Analysis of participant-level variability further highlighted the value of mixed-method evaluation for interpreting subjective interaction quality.
Overall, this work demonstrates that large language models can function effectively as central reasoning agents for embodied social robots, coordinating multimodal perception and expressive action within a unified interaction loop. By integrating calibrated tactile sensing, multimodal perception, and LLM-planned gesture sequencing, the system provides empirical evidence that carefully designed multimodal frameworks can enhance perceived emotional understanding and user preference without compromising conversational clarity. The findings contribute a reproducible architecture and methodological insights for future research in socially expressive, context-aware human–robot interaction.
-
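The abstract describes real-time tactile gesture classification with a robustness-enhanced 1D CNN over calibrated tri-modal (pneumatic, conductive, acoustic) signals. Purely as a hedged illustration of that kind of model, the sketch below shows a minimal 1D CNN; the channel count, window length, number of gesture classes, and layer sizes are assumptions for illustration, not the architecture reported in the thesis.

```python
# Minimal 1D CNN sketch for tactile gesture classification (PyTorch).
# Assumptions (not from the thesis): 3 input channels (pneumatic, conductive,
# acoustic), 256-sample windows, and 6 gesture classes.
import torch
import torch.nn as nn

class TactileGestureCNN(nn.Module):
    def __init__(self, in_channels: int = 3, num_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, samples) calibrated tactile window
        return self.classifier(self.features(x).squeeze(-1))

# Example: classify one 256-sample window from the three sensing modalities.
model = TactileGestureCNN()
window = torch.randn(1, 3, 256)
probs = model(window).softmax(dim=-1)
```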
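Gesture Chaining is described as dynamically selecting and sequencing predefined motion primitives so that embodied motion aligns with speech duration and conversational emphasis. The following is a minimal sketch of that idea under invented assumptions: the primitive names, durations, and greedy random selection below are placeholders, not the thesis's LLM- and tool-driven planner.

```python
# Hypothetical sketch of one gesture-chaining step: chain motion primitives
# until their total duration covers the estimated speech duration.
from dataclasses import dataclass
import random

@dataclass
class MotionPrimitive:
    name: str
    duration_s: float  # nominal playback time of the primitive

# Invented primitive library for illustration only.
LIBRARY = [
    MotionPrimitive("nod", 1.2),
    MotionPrimitive("open_palms", 2.0),
    MotionPrimitive("point_forward", 1.5),
    MotionPrimitive("tilt_head", 1.0),
]

def chain_gestures(speech_duration_s: float) -> list[MotionPrimitive]:
    """Select primitives whose summed duration roughly matches the utterance."""
    chain, total = [], 0.0
    while total < speech_duration_s:
        primitive = random.choice(LIBRARY)   # an LLM tool call could choose instead
        chain.append(primitive)
        total += primitive.duration_s
    return chain

# Example: plan gestures for a roughly 6-second synthesized utterance.
plan = chain_gestures(6.0)
print([p.name for p in plan], sum(p.duration_s for p in plan))
```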
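Stage 1 and Stage 2 Likert ratings are compared with non-parametric methods, and the table of contents names Wilcoxon signed-rank tests for the paired stage comparison. A minimal sketch of such a paired comparison with SciPy, using invented numbers rather than the study data:

```python
# Paired non-parametric comparison of per-participant Likert averages
# (Stage 1 voice-only vs Stage 2 multimodal). Values below are invented.
from scipy.stats import wilcoxon

stage1 = [5.2, 4.8, 5.5, 6.0, 4.9, 5.1, 5.7, 5.0]
stage2 = [5.8, 5.1, 5.9, 6.2, 5.4, 5.0, 6.1, 5.6]

statistic, p_value = wilcoxon(stage1, stage2)
print(f"W = {statistic:.1f}, p = {p_value:.3f}")
```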
dc.description.tableofcontents Abstract i
List of Contents iii
List of Tables vii
List of Figures viii
1. Introduction 1
2. Problem Statement & Motivation 3
2.1. Background and Context 3
2.2. Limitations of Existing Systems 3
2.3. Justification for an LLM-Based Reasoning Framework 4
2.4. Research Gap 6
2.5. Objectives of this Thesis 6
3. Literature Review 8
3.1 Social Tactile Interaction in HRI 8
3.2 Artificial Skin and Multimodal Tactile Sensing 8
3.3 Face Perception in Human–Robot Interaction 8
3.4 LLMs for Multimodal Interaction and Social Robotics 9
3.5 Summary of Research Gap 9
3.5.1 Summary of Key Insights and Limitations 9
4. Research Objectives And Questions 11
4.1 Research Objectives 11
4.2 Original Contributions of This Thesis 12
4.3 Research Questions 14
5. Methodology 16
5.1 System Design Overview 16
5.2 Multimodal Tactile Sensing 17
5.3 Tactile Gesture Classification Framework 18
5.3.1 Overview 18
5.3.2 Sensor Calibration and Drift Compensation 19
5.3.3 Baseline 1D CNN Architecture 19
5.3.4 Robust Model Development via Safe Data Augmentation 20
5.3.5 Denoising Autoencoder Pretraining 20
5.3.6 Final Model Selection and Training 21
5.3.7 Real-Time Deployment in the HRI System 21
5.4 LLM-Based Reasoning and Control Framework 22
5.4.1 Role of the LLM in the Interaction Pipeline 23
5.4.2 Prompt Engineering and Behavioural Constraints 23
5.4.3 Multimodal Context Injection 24
5.4.4 Gesture Generation via Tool-Augmented Reasoning 24
5.4.5 Gesture Chaining (Proposed Method) 25
5.4.6 Real-Time Streaming and Synchronization 26
5.4.7 Summary 26
5.5 Experimental Design 26
5.6 Evaluation Metrics 27
6. Experiment Design 28
6.1 Experimental Goals 28
6.2 Experimental Framework 28
6.3 Participants 29
6.3.1 Participant Sample Size Justification 29
6.4 Experimental Setup 31
6.5 Procedure 31
6.6 Safety and Ethical Considerations 32
6.7 Data Collection and Management 32
6.8 Evaluation Focus 33
7. Experiment Plan 34
7.1 Overview 34
7.2 System Overview 34
7.3 Experimental Setup 35
7.4 Interaction Protocol 36
7.5 Participants 38
7.6 Data Collection 38
7.7 Evaluation Strategy 39
7.7.1 Rationale for Statistical Approach 39
8. Experiment Results 42
8.1 Descriptive Statistics Overview 42
8.1.1 Stage Comparison Using Wilcoxon Signed-Rank Tests 42
8.2 Distributional Analysis and Variability 43
8.3 Touch and Multimodal Interaction Results (Stage 2 Only) 45
8.3.1 Overview 45
8.3.2 Question-Level Results 45
8.3.3 Summary of Section 1-D Quantitative Results 46
8.4 Outlier Identification (Participant P7) 47
8.4.1 Sensitivity Analysis (With vs Without P7) 47
8.5 Summary of Quantitative Findings 48
8.6 Qualitative Results 48
8.6.1 Stage Preference (Stage 1 vs Stage 2) 48
8.6.2 Activation Modality Preference (Touch vs Voice) 48
8.6.3 Perceived Naturalness and Engagement 49
8.6.4 Perceived Emotional and Expressive Understanding 49
8.6.5 Effect of Facial Awareness on Interaction Ease 49
8.6.6 Willingness to Re-Engage and Suggested Improvements 50
9. Discussion, Limitations, And Future Work 51
9.1 Discussion 51
9.1.1 Overview and Link to Study Expectations 51
9.1.2 Communication and Understanding 51
9.1.3 Naturalness and Engagement: Why the Quantitative Signal Was Mixed 52
9.1.4 Emotional and Expressive Understanding: Converging Quantitative and Qualitative Evidence 52
9.1.5 Interpretation of Touch and Multimodal Interaction (Section 1-D) 53
9.1.6 Qualitative Results in Context: Majority Themes and Meaningful Minority Perspectives 54
9.1.7 Outlier Participant P7: Interpreting the Quantitative–Qualitative Divergence 54
9.1.8 Implications for System Design 55
9.2 Limitations 56
9.3 Future Work 56
10. Conclusion 59
References 61
Appendices 63
Appendix A – Participant Experiment Guidelines 63
Appendix B – Experiment Protocol Document 63
Appendix C – HRI Robot Interaction Survey 64
Section 1 – Quantitative Evaluation 64
Section 2 – Qualitative Feedback 65
Section 3 – Stage Comparison (Complete after Stage 2) 65
Appendix D – Raw Data Summary Tables 66
Stage 1 Likert Scale Participant Averages 66
Stage 2 Likert Scale Participant Averages 67
Touch and Multimodal Interaction (Section 1-D) Participant Scores 68
Full per-class test performance comparison across CNN Variants 70
Appendix E – Additional Plots 71
Appendix F – Ethical Approval and Consent 74
요 약 문 75
-
dc.format.extent 75 -
dc.language eng -
dc.publisher DGIST -
dc.title LLM-Driven Human-Robot Interaction with Tri-Modal Tactile Skin for Context-Aware Speech and Gesture Generation in Social Settings -
dc.title.alternative 사회적 상호작용 환경에서 맥락 인지 기반 음성 및 제스처 생성을 위한 삼중 모달 촉각 스킨을 활용한 LLM 기반 인간–로봇 상호작용 -
dc.type Thesis -
dc.identifier.doi 10.22677/THESIS.200000949725 -
dc.description.degree Master -
dc.contributor.department Department of Robotics and Mechatronics Engineering -
dc.date.awarded 2026-02-01 -
dc.publisher.location Daegu -
dc.description.database dCollection -
dc.citation XT.RM F281 202602 -
dc.date.accepted 2026-01-19 -
dc.contributor.alternativeDepartment 로봇및기계전자공학과 -
dc.subject.keyword Human–Robot Interaction (HRI), Large Language Models (LLMs), Multimodal Interaction, Multimodal Sensing, Tri-Modal Tactile Skin, Gesture Chaining, Embodied Conversational Robots, Emotional and Expressive Understanding, Tactile Gesture Classification, Non-Parametric User Study Analysis, LangChain, LangGraph -
dc.contributor.affiliatedAuthor Fawole Emmanuel Ademola Ayobami -
dc.contributor.affiliatedAuthor Kyungseo Park -
dc.contributor.alternativeName 파월레 이마누엘 아데멀라 아여바미 -
dc.contributor.alternativeName Kyungseo Park -
dc.rights.embargoReleaseDate 2028-02-29 -

File Downloads

  • There are no files associated with this item.
