KR102110478B1

KR102110478B1 - Headset type user interface system and user feature classify method

Info

Publication number: KR102110478B1
Application number: KR1020190150558A
Authority: KR
Inventors: 김시호; 차재광
Original assignee: 연세대학교 산학협력단
Priority date: 2019-11-21
Filing date: 2019-11-21
Publication date: 2020-06-08

Abstract

The present invention provides a headset-type user interface system including: a near-infrared ray emission unit for emitting near-infrared rays to skin at a predetermined distance; a signal frame reception unit for receiving at least one signal frame including a signal of a near-infrared ray, which is scattered within the skin and emitted out of the skin among near-infrared rays incident on the skin by the emitted near-infrared rays; a signal frame preprocessing unit for preprocessing the at least one signal frame in an order in which the at least one signal frame is received; a feature information extraction unit for extracting feature information of the skin deformed by the emitted near-infrared rays from the preprocessed at least one signal frame; and a feature information classification unit for classifying the extracted feature information according to a preset user intention. Accordingly, hands-free natural user interface (NUI) technology for augmented reality (AR)/virtual reality (VR) is implemented.

Description

Headset type user interface system and user feature information classification method therefor {Headset type user interface system and user feature classify method}

The present invention relates to a user interface, and more specifically to a headset-type user interface system that implements an interface technology by identifying the user's intention through spatially resolved diffuse reflectance (SRDR) arising from the scattering of near-infrared rays, and to a method for classifying user feature information therefor.

Augmented reality (AR) has become one of the hottest issues in the information and communications technology (ICT) industry. In the AR field, one of the main challenges is developing a user interface (UI) device optimized for AR headsets, because the headset environment differs so much from that of a personal computer or mobile phone that conventional UIs (keyboard, mouse, touch screen, etc.) are difficult to use. Pressing a button on the side of the headset or holding a remote control is the most representative solution currently adopted in headset UIs.

However, such UIs are limited in supporting complex and dangerous tasks such as surgery or manual work, and are not convenient enough even for general use. Speech recognition and physiological sensors such as EOG (electrooculography), EEG (electroencephalography), and EMG (electromyography) are promising technologies for hands-free AR UIs, but an optimal way of implementing a headset UI with them has not yet been developed.

Academia and industry have already developed technologies that enable user interaction in headset form. Common interface technologies use a variety of approaches, including gaze recognition, hand gesture recognition, EMG sensors, touch pads, and remote controls; among these, hand gesture detection and gaze tracking are the most widely used.

However, because the technology needed for an effective UI is still not sufficiently developed, the interaction technology of most commercial AR headsets relies on a hand-held controller or on buttons, which are peripheral devices of the headset.

Such technology is of limited use in special situations where the user must work with his or her hands, such as surgery or a factory floor.

Accordingly, there is a need to develop UI technology that can read the user's intention in a hands-free manner and carry out the corresponding command.

Korean Laid-open Patent Publication No. 2019-0056833

Accordingly, the present invention has been proposed in consideration of the above, and aims to implement hands-free natural user interface (NUI) technology for AR/VR by detecting the skin deformation caused by facial gestures and using it as a UI input command.

In addition, the present invention aims to support hands-free interaction between a headset user and an AR system by easily and non-invasively detecting movement of the facial skin and mapping the detected facial motions to input commands.

In addition, the present invention aims to enable application even in special situations or environments by implementing the user interface technology in headset form.

In addition, the present invention aims to receive data that is improved in both quantity and quality by receiving the data on skin deformation in two-dimensional form.

In addition, the present invention aims to facilitate data classification by generating data on changes in the user's motion from the amount of change calculated between continuously received data.

In addition, the present invention aims to allow a training dataset to be classified without tagging information by implementing feature extraction and classification with an unsupervised learning method.

In addition, the present invention aims to mitigate the accuracy problem, which is a limitation of unsupervised learning, by applying the DEC (Deep Embedded Clustering) technique.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

In order to achieve the above objects, a headset-type user interface system according to the technical idea of the present invention may include: a near-infrared emission unit that emits near-infrared rays to the skin from a certain distance; a signal frame receiving unit that receives at least one signal frame containing the signal of near-infrared rays which, among the near-infrared rays incident on the skin as a result of the emission, are scattered within the skin and emitted out of the skin; a signal frame preprocessing unit that preprocesses the at least one signal frame in the order received; a feature information extraction unit that extracts, from the preprocessed at least one signal frame, feature information of the skin deformed by the emitted near-infrared rays; and a feature information classification unit that classifies the extracted feature information according to a preset user intention.

The near-infrared emission unit may emit collimated near-infrared rays so that reception of the signal frame produced by scattering of the near-infrared rays emitted to the skin is affected as little as possible by reflection of those rays.

The signal frame received by the signal frame receiving unit may be a near-infrared image containing the spatially resolved diffuse reflectance (SRDR) measured from the scattering of the near-infrared rays emitted to the skin.

The signal frame receiving unit may include a near-infrared filter so that the near-infrared image contains no noise signals from bands other than the near-infrared signal of the target band.

The signal frame preprocessing unit may preprocess the near-infrared images in the order received and reshape the data according to the data input format of the feature information extraction unit.

The feature information extraction unit may extract, from the preprocessed near-infrared images, at least one feature for classification into the user intention, and the feature information classification unit may classify the extracted at least one feature and map it to the user intention.

The feature information extraction unit may extract the feature information using the encoder of a spatiotemporal autoencoder (STAE) pre-trained on the basis of a deep neural network (DNN), and the feature information classification unit may classify the extracted at least one feature using a classifier pre-trained on the basis of the DEC (Deep Embedded Clustering) technique and map it to a preset user intention.

In order to achieve the above objects, a method for classifying user feature information for a user interface according to the technical idea of the present invention may include: a near-infrared emission step in which a near-infrared emission unit emits near-infrared rays to the skin from a certain distance; a signal frame receiving step in which a signal frame receiving unit receives at least one signal frame containing the signal of near-infrared rays which, among the near-infrared rays incident on the skin as a result of the emission, are scattered within the skin and emitted out of the skin; a signal frame preprocessing step in which a signal frame preprocessing unit preprocesses the at least one signal frame in the order received; a feature information extraction step in which a feature information extraction unit extracts, from the preprocessed at least one signal frame, feature information of the skin deformed by the emitted near-infrared rays; and a feature information classification step in which a feature information classification unit classifies the extracted feature information according to a preset user intention.

In the near-infrared emission step, collimated near-infrared rays may be emitted so that reception of the signal frame produced by scattering of the near-infrared rays emitted to the skin is affected as little as possible by reflection of those rays.

The signal frame received in the signal frame receiving step may be a near-infrared image containing the spatially resolved diffuse reflectance (SRDR) measured from the scattering of the near-infrared rays emitted to the skin.

In the signal frame preprocessing step, the near-infrared images may be preprocessed in the order received, and the data may be reshaped according to the data input format of the feature information extraction step.

In the feature information extraction step, at least one feature for classification into the user intention may be extracted from the preprocessed near-infrared images, and in the feature information classification step, the extracted at least one feature may be classified and mapped to the user intention.

In the feature information extraction step, the feature information may be extracted using the encoder of a spatiotemporal autoencoder (STAE) pre-trained on the basis of a deep neural network (DNN), and in the feature information classification step, the extracted at least one feature may be classified using a classifier pre-trained on the basis of the DEC (Deep Embedded Clustering) technique and mapped to a preset user intention.

The headset-type user interface system and the method for classifying user feature information therefor, as described above, have the following effects.

First, by detecting the skin deformation caused by facial gestures and using it as a UI input command, hands-free natural user interface (NUI) technology for AR/VR can be implemented.

Second, movement of the facial skin can be detected easily and non-invasively, and the detected facial motions can be mapped to input commands to support hands-free interaction between the headset user and the AR system.

Third, by implementing the user interface technology in headset form, it can be applied even in special situations or environments.

Fourth, by receiving the data on skin deformation in two-dimensional form, data improved in both quantity and quality can be received.

Fifth, by generating data on changes in the user's motion from the amount of change calculated between continuously received data, data classification becomes easier.

Sixth, by implementing feature extraction and classification with an unsupervised learning method, a training dataset can be classified without tagging information.

Seventh, by applying the DEC (Deep Embedded Clustering) technique, the accuracy problem that limits unsupervised learning can be mitigated.

FIG. 1 is a block diagram showing a headset-type user interface system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a method for classifying user feature information for a user interface according to an embodiment of the present invention.
FIG. 3 is a view showing a hands-free UI system according to an embodiment of the present invention.
FIG. 4 is a view showing the imaging principle of the infrared camera according to FIG. 3.
FIG. 5 is a view showing a sensor module applied to an AR headset, implemented according to an embodiment of the present invention.
FIG. 6 is a view showing some of the IR diffusion patterns recorded by the sensor module while a user is wearing the AR headset, as an embodiment of the present invention.
FIG. 7 is a view showing the preprocessing procedure for the clustering network, as an embodiment of the present invention.
FIG. 8 is a view showing a network structure for extracting features from sensor data, as an embodiment of the present invention.
FIG. 9 is a view showing details of the network structure of the STAE used for feature extraction, as an embodiment of the present invention.
FIG. 10 is a view showing a classifier network composed of an STAE-based feature extractor and a DEC-based feature classifier, as an embodiment of the present invention.
FIG. 11 is a view showing the result of clustering after applying the fine-tuning technique (DEC), as an embodiment of the present invention.
FIG. 12 is a view showing screenshots of a demonstration using a custom application, as an embodiment of the present invention.

In order to fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference must be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in them. The features and advantages of the present invention will become more apparent from the following detailed description based on the accompanying drawings. Before that, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way. It should also be noted that detailed descriptions of known functions and configurations related to the present invention have been omitted where they were judged likely to obscure the gist of the invention unnecessarily.

FIG. 1 is a block diagram showing a headset-type user interface system according to an embodiment of the present invention.

Referring to FIG. 1, the headset-type user interface system according to an embodiment of the present invention may broadly include a near-infrared emission unit (100), a signal frame receiving unit (200), a signal frame preprocessing unit (300), a feature information extraction unit (400), and a feature information classification unit (500).

The near-infrared emission unit (100) may emit near-infrared rays to the skin from a certain distance. It is the component that allows the skin to be sensed without contact. Here, the near-infrared emission unit (100) may be a laser diode rather than an LED. This is because an LED permits signal recognition only when used in contact with the skin, whereas a laser diode permits signal recognition without touching the skin.

The near-infrared emission unit (100) may emit collimated near-infrared rays so that reception of the signal frame produced by scattering of the near-infrared rays emitted to the skin is affected as little as possible by reflection of those rays. Otherwise, the infrared scattering caused by the emitted near-infrared rays may be masked by the infrared reflection caused by the same rays, making signal recognition or measurement impossible. For this reason, the near-infrared emission unit (100) may emit sufficiently collimated near-infrared rays to the skin.

The signal frame receiving unit (200) may receive at least one signal frame containing the signal of near-infrared rays which, among the near-infrared rays incident on the skin as a result of emission from the near-infrared emission unit (100), are scattered within the skin and emitted out of the skin. It is the component that receives the signal relating to skin deformation.

Here, the signal frame receiving unit (200) may be an infrared camera rather than an infrared photodiode (PD). This is because an infrared photodiode acquires one-dimensional point data, whereas an infrared camera can acquire two-dimensional data in the form of camera image data. Such data is improved in both quantity and quality.

The signal frame received by the signal frame receiving unit (200) may be a near-infrared image containing the spatially resolved diffuse reflectance (SRDR) measured from the scattering of the near-infrared rays emitted to the skin. This is the feature that allows the user's intention to be identified through the SRDR arising from near-infrared scattering.
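As an illustration of what "spatially resolved" means here, the following minimal sketch averages the brightness of a near-infrared frame in concentric rings around the point where the beam enters the skin, giving diffuse reflectance as a function of distance. It is not taken from the specification, which feeds the image itself to the later stages; the function name `radial_profile`, its parameters, and the synthetic 9x9 frame are all assumptions made for the example.

```python
import math

def radial_profile(image, center, num_bins, max_radius):
    """Average pixel intensity in concentric rings around the beam entry point.

    image      -- 2D list of intensities (rows of pixels), hypothetical input
    center     -- (row, col) of the assumed beam entry point
    num_bins   -- number of rings
    max_radius -- pixels at or beyond this distance are ignored
    """
    cy, cx = center
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            r = math.hypot(y - cy, x - cx)
            if r >= max_radius:
                continue
            ring = int(r / max_radius * num_bins)
            sums[ring] += value
            counts[ring] += 1
    # Rings that received no pixels report 0.0 rather than dividing by zero.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Synthetic frame whose brightness decays with distance from the center pixel.
frame = [[1.0 / (1.0 + math.hypot(y - 4, x - 4)) for x in range(9)]
         for y in range(9)]
profile = radial_profile(frame, center=(4, 4), num_bins=4, max_radius=4.0)
```

A change in how the skin is stretched or compressed would be expected to change the shape of such a profile, which is why a two-dimensional image carries more usable information than a single photodiode reading.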

Meanwhile, the signal frame receiving unit (200) may include a near-infrared filter so that the near-infrared image contains no noise signals from bands other than the near-infrared signal of the target band.

The signal frame preprocessing unit (300) may preprocess at least one signal frame in the order received. It is the component that improves the performance of feature information extraction and classification from the received signal frames. The frames are preprocessed in the order received because this makes it possible to observe the changes in diffusion within the skin caused by the emitted near-infrared rays.

The signal frame preprocessing unit (300) may reshape the data according to the data input format of the feature information extraction unit (400) described below.
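The preprocessing described above can be sketched roughly as follows. The specification states only that frames are handled in arrival order, that the amount of change between consecutive data is calculated, and that the result is reshaped for the extractor; the per-frame peak normalization, the sliding-window grouping, and the names `preprocess` and `window` are assumptions for this example.

```python
def preprocess(frames, window):
    """Normalize frames, difference them in arrival order, and group into clips.

    frames -- list of 2D lists, in the order they were received
    window -- number of difference frames per clip (assumed extractor input)
    """
    # Scale each frame to [0, 1] by its own peak (an assumed normalization).
    norm = []
    for f in frames:
        peak = max(max(row) for row in f) or 1.0
        norm.append([[v / peak for v in row] for row in f])

    # Frame-to-frame differences; zip keeps the arrival order intact.
    diffs = [
        [[b - a for a, b in zip(row_prev, row_cur)]
         for row_prev, row_cur in zip(prev, cur)]
        for prev, cur in zip(norm, norm[1:])
    ]

    # Overlapping clips, each shaped (window, height, width).
    return [diffs[i:i + window] for i in range(len(diffs) - window + 1)]

# Five 2x2 frames in which only the top-left pixel brightens over time.
frames = [[[float(t), 1.0], [1.0, 4.0]] for t in range(5)]
clips = preprocess(frames, window=2)
```

Five input frames yield four difference frames and therefore three overlapping clips of two frames each; a real extractor would dictate the actual window length and tensor layout.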

The feature information extraction unit (400) may extract feature information of the skin deformed by the emitted near-infrared rays from at least one signal frame preprocessed by the signal frame preprocessing unit (300).

More specifically, the feature information extraction unit (400) may extract, from the near-infrared images preprocessed by the signal frame preprocessing unit (300), at least one feature for classification into the user intention.

Here, the feature information extraction unit (400) may extract the feature information using the encoder of a spatiotemporal autoencoder (STAE) pre-trained on the basis of a deep neural network (DNN).

The feature information classification unit (500) may classify the feature information extracted by the feature information extraction unit (400) according to a preset user intention.

More specifically, the feature information classification unit (500) may classify at least one feature extracted by the feature information extraction unit (400) and map it to a user intention.
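The mapping from a classification result to a user intention can be pictured as a simple lookup. The sketch below is hypothetical: the command names, the `INTENT_BY_CLUSTER` table, and the confidence threshold are not in the specification and only illustrate how a cluster index might be turned into a UI command.

```python
# Hypothetical table: which UI command each cluster index stands for.
INTENT_BY_CLUSTER = {
    0: "idle",
    1: "select",
    2: "scroll_up",
    3: "scroll_down",
}

def to_command(cluster_id, confidence, threshold=0.6):
    """Return the UI command for a cluster, or "idle" when unsure."""
    # Low-confidence assignments are treated as no input (assumed policy).
    if confidence < threshold:
        return "idle"
    # Unknown cluster indices also fall back to "idle".
    return INTENT_BY_CLUSTER.get(cluster_id, "idle")
```

Falling back to a neutral command on low confidence is one way to keep an ambiguous facial movement from triggering an action.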

Here, the feature information classification unit (500) may classify the extracted at least one feature using a classifier pre-trained on the basis of the DEC (Deep Embedded Clustering) technique and map it to a preset user intention.

The feature information extraction unit (400) and the feature information classification unit (500) are based on unsupervised learning, which has the characteristic that a training dataset can be classified without tagging information. The DEC (Deep Embedded Clustering) technique, in particular, serves to mitigate the accuracy problem that limits unsupervised learning.
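For readers unfamiliar with DEC, the sketch below shows its core computation as commonly formulated: a Student's t soft assignment of each embedded feature to each cluster centroid, a sharpened target distribution derived from those assignments, and the KL divergence between the two that fine-tuning minimizes. The specification does not give its hyperparameters or network details, so the two-dimensional embeddings, the centroids, and `alpha=1.0` are assumptions chosen for illustration.

```python
import math

def soft_assign(z, centroids, alpha=1.0):
    """q[i][j]: Student's t similarity of embedding i to centroid j, row-normalized."""
    q = []
    for zi in z:
        weights = []
        for mu in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(zi, mu))
            weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        q.append([w / total for w in weights])
    return q

def target_distribution(q):
    """p[i][j] proportional to q[i][j]^2 / f[j], where f[j] is the soft cluster size."""
    f = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / f[j] for j in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)

# Toy embeddings: two points near the first centroid, one on the second.
z = [[0.0, 0.0], [0.2, 0.0], [3.0, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
q = soft_assign(z, centroids)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]
loss = kl_divergence(p, q)
```

The target distribution pushes each already-confident assignment closer to 1, so minimizing the KL term gradually tightens the clusters without any labels, which is how the technique addresses the accuracy limitation mentioned above.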

FIG. 2 is a flowchart showing a method for classifying user feature information for a user interface according to an embodiment of the present invention.

Referring to FIG. 2, the method for classifying user feature information for a user interface according to an embodiment of the present invention may broadly include a near-infrared emission step (S100), a signal frame receiving step (S200), a signal frame preprocessing step (S300), a feature information extraction step (S400), and a feature information classification step (S500).
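The five steps form a simple chain in which each step consumes the output of the one before it. The following sketch shows only that ordering; `run_pipeline`, the `step` logging helper, and the stand-in lambdas are hypothetical placeholders, not the processing described in the specification.

```python
def run_pipeline(emit, receive, preprocess, extract, classify):
    """Run the five steps in order and return the classified intention."""
    emit()                        # S100: emit near-infrared rays
    frames = receive()            # S200: receive signal frames
    clips = preprocess(frames)    # S300: preprocess in arrival order
    features = extract(clips)     # S400: extract feature information
    return classify(features)     # S500: classify into a user intention

calls = []

def step(name, fn):
    """Wrap a stage so that the order of execution can be inspected."""
    def wrapped(*args):
        calls.append(name)
        return fn(*args)
    return wrapped

# Stand-in stages with toy data, used only to exercise the ordering.
result = run_pipeline(
    emit=step("S100", lambda: None),
    receive=step("S200", lambda: [[1, 2], [3, 4]]),
    preprocess=step("S300", lambda frames: [sum(f) for f in frames]),
    extract=step("S400", lambda clips: max(clips)),
    classify=step("S500", lambda feature: "select" if feature > 5 else "idle"),
)
```

Passing the stages in as callables keeps the ordering separate from the hardware and the trained models, so each stage can be replaced independently.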

In the near-infrared emission step, the near-infrared emission unit (100) may emit near-infrared rays to the skin from a certain distance (S100). This step allows the skin to be sensed without contact. In the near-infrared emission step (S100), the near-infrared emission unit (100) may be a laser diode rather than an LED. This is because an LED permits signal recognition only when used in contact with the skin, whereas a laser diode permits signal recognition without touching the skin.

In the near-infrared emission step (S100), collimated near-infrared rays may be emitted so that reception of the signal frame produced by scattering of the near-infrared rays emitted to the skin is affected as little as possible by reflection of those rays. Otherwise, the infrared scattering caused by the emitted near-infrared rays may be masked by the infrared reflection caused by the same rays, making signal recognition or measurement impossible. For this reason, sufficiently collimated near-infrared rays may be emitted to the skin in the near-infrared emission step (S100).

In the signal frame receiving step, at least one signal frame may be received that contains the signal of near-infrared rays which, among the near-infrared rays incident on the skin as a result of the near-infrared emission step (S100), are scattered within the skin and emitted out of the skin (S200). This step receives the signal relating to skin deformation.

In the signal frame receiving step (S200), the signal frame receiving unit (200) may be an infrared camera rather than an infrared photodiode (PD). This is because an infrared photodiode acquires one-dimensional point data, whereas an infrared camera can acquire two-dimensional data in the form of camera image data. Such data is improved in both quantity and quality.

The signal frame received in the signal frame receiving step (S200) may be a near-infrared image containing the spatially resolved diffuse reflectance (SRDR) measured from the scattering of the near-infrared rays emitted to the skin. This is the feature that allows the user's intention to be identified through the SRDR arising from near-infrared scattering.

The signal frame preprocessing step may preprocess the at least one signal frame in the order in which it was received (S300). This step improves the performance of feature information extraction and classification from the received signal frames. Preprocessing follows the order of reception so that changes in how the emitted near-infrared rays diffuse within the skin can be observed.

The signal frame preprocessing step (S300) may also reshape the data to match the data input format of the feature information extraction step (S400) described below.

The feature information extraction step may extract, from the at least one signal frame preprocessed in the signal frame preprocessing step (S300), feature information of the skin deformed by the emitted near-infrared rays (S400).

More specifically, the feature information extraction step (S400) may extract, from the near-infrared images preprocessed in the signal frame preprocessing step (S300), at least one feature for classification according to the user intention.

The feature information extraction step (S400) may extract the feature information using the encoder of a spatiotemporal autoencoder (STAE) pre-trained on the basis of a deep neural network (DNN).

The feature information classification step may classify the feature information extracted in the feature information extraction step (S400) according to a preset user intention (S500).

More specifically, the feature information classification step (S500) may classify the at least one feature extracted in the feature information extraction step (S400) and map it to a user intention.

The feature information classification step (S500) may classify the extracted at least one feature using a classifier pre-trained with the deep embedded clustering (DEC) technique and map it to a preset user intention.
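The mapping from an extracted feature to a preset user intention can be illustrated with a nearest-centroid rule. This is only a minimal sketch: the centroid coordinates, the intention labels, and the function name `classify_feature` are hypothetical and do not come from the patent.

```python
import math

# Hypothetical cluster centroids in a 2-D feature space and the intention
# each cluster is mapped to; illustrative values only.
CENTROIDS = [(0.0, 0.0), (3.0, 3.0)]
INTENTS = ["neutral", "wink"]

def classify_feature(feature):
    """Return the preset intention of the centroid nearest to `feature`."""
    distances = [math.dist(feature, c) for c in CENTROIDS]
    nearest = distances.index(min(distances))
    return INTENTS[nearest]

print(classify_feature((0.4, -0.2)))
print(classify_feature((2.5, 3.1)))
```

In a trained system the centroids would come from the clustering stage rather than being fixed constants.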

The feature information extraction step (S400) and the feature information classification step (S500) are based on unsupervised learning, which can classify a training dataset without tagging information. Among such methods, deep embedded clustering (DEC) is a technique for improving accuracy, which is a limitation of unsupervised learning.

The present invention is described in more detail with reference to FIGS. 3 to 12.

FIG. 3 shows a hands-free UI system according to an embodiment of the present invention, and FIG. 4 shows the imaging principle of the infrared camera of FIG. 3.

FIG. 5 shows a sensor module, implemented according to an embodiment of the present invention, applied to an AR headset.

Referring first to FIGS. 3 and 4, the sensor module of the hands-free UI system according to an embodiment of the present invention may include an IR laser diode (LD) and an IR camera. The IR LD emits IR light to the skin, and the IR camera captures images of the resulting IR diffusion pattern.

In FIG. 5, (a) shows the sensor module, which may include a USB camera and an NIR laser diode and is installed on the left side of Epson BT-350 AR glasses. (b) shows a user wearing the headset; the laser diode may target the skin near the left cheek, which may deform when the user makes a wink gesture.

Deformation of human skin may be detected by measuring changes in the IR SRDR transmitted beneath (within) the skin. Here, "spatially resolved" (SR) means that the diffuse reflection is measured as a function of radial distance from the point of light incidence, and "diffuse reflectance" (DR) is the scattered reflection expressed as the ratio of the flux re-emitted by diffuse reflection. The intensity of the DR may be affected by the microstructure of the collagen fibers distributed beneath the skin.
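The idea of measuring diffuse reflection as a function of radial distance from the incidence point can be sketched as a radial intensity profile of a camera frame. The function name `radial_profile`, the ring binning, and the synthetic frame are illustrative assumptions; the patent does not specify how the SRDR is computed from the image.

```python
import math

def radial_profile(image, center, n_bins):
    """Mean intensity per one-pixel-wide ring around `center` (row, col)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            ring = int(math.hypot(r - center[0], c - center[1]))
            if ring < n_bins:
                sums[ring] += value
                counts[ring] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]

# Synthetic 9x9 frame whose intensity falls off with distance from the centre,
# standing in for a real NIR image (assumption for illustration).
frame = [[1.0 / (1.0 + math.hypot(r - 4, c - 4)) for c in range(9)]
         for r in range(9)]
profile = radial_profile(frame, center=(4, 4), n_bins=4)
```

A change in the shape of such a profile between frames is one simple way to picture how skin deformation could alter the measured SRDR.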

In an embodiment of the present invention, IR diffuse reflectance may be used to sense the user's facial expression, based on the fact that Langer's lines, the cleavage lines that indicate the distribution of collagen fibers in human skin, follow the SRDR measurement direction of the skin.

On this basis, the embodiments of FIGS. 3 to 5 are described in more detail below.

Referring to FIG. 3, the hands-free UI system according to an embodiment of the present invention may broadly include a sensor module that detects the IR diffusion pattern from the skin surface of the wearer, and a sensor data classifier (Sensor Data Clustering Module) based on a DNN structure.

The sensor module may be mounted on a wearable AR headset and may consist of an LD paired with a USB camera, serving as the IR emitter and the IR receiver, respectively. While the user wears the AR glasses, IR light emitted from the LD may diffuse through the user's skin. The camera may capture sequential images of the diffused IR light and send them to the classifier, which converts the user's skin deformation state into an interface command.

The AR headset built in one embodiment uses a commercial AR headset (Epson BT-350) and a facial gesture tracking sensor module. The sensor module was mounted on the left side of the AR glasses as shown in FIG. 5, and the sensor's region of interest (ROI) was a point between the user's left eye and ear, where the skin is locally stretched or compressed when the user makes a wink gesture.

The NIR LD (that is, the IR emitter) had a center wavelength of 850 nm and an average output power of 0.4 mW, which may be classified as a Class 1 laser. This NIR illumination did not obstruct the user's view, and the NIR dose was limited to a safe level to prevent potentially harmful effects on the user's eyes.

The NIR band penetrates the skin most deeply among the harmless parts of the optical spectrum. Together with low power consumption, this is an important property when selecting a light source for the sensor. An LD was chosen instead of a light-emitting diode (LED) because, although an LED is more durable, IR light from an LED, unlike LD light, is not collimated well enough to be transmitted through the user's skin when the source is not in direct contact with the skin.

The IR emitter may include the LD, a protection circuit, and a collimation lens. The protection circuit may extend the life of the LD, and the collimation lens collimates the IR laser beam to reduce the divergence of the emission. The IR receiver is a sensor module containing an OV2710 chipset camera, and may have a built-in IR band-pass filter so that IR light can be sensed without interference from ambient visible light. The camera has a field of view (FOV) of 100 degrees and, according to the manufacturer's specification, can record at 120 fps at 320 x 240 resolution. In one embodiment of the present invention, the camera ran at 20 fps (on average) because of data processing latency. The actual input data were 320 x 240 grayscale JPEG images.

FIG. 6 shows, as an embodiment of the present invention, some of the IR diffusion patterns recorded by the sensor module while the user wears the AR headset.

More specifically, in FIG. 6, (a) is the IR SRDR captured with no facial gesture, and (b) is the IR SRDR captured while the user makes a wink gesture.

Referring to FIG. 6, the NIR band-pass filter installed in the camera blocks all wavelengths of light other than 850 nm (NIR). The shape of the pattern may change with the user's facial gesture.

The input data to the sensor data classifier (Sensor Data Clustering Module) may take the form of NIR SRDR images of the user's skin. Even for the same facial gesture, the SRDR shape captured by the camera may vary from person to person because of various factors that are difficult to model. The sensor data classifier in an embodiment of the present invention must therefore detect deformation of the captured SRDR shape in order to recognize the user's gesture.

Accordingly, in an embodiment of the present invention, unsupervised learning may be used for the neural-network sensor data classifier. Unsupervised learning is a method that can easily be recalibrated when person-to-person variation has to be accommodated, because it does not require data annotation to create a training dataset.

Training the sensor data classifier may consist of the following two steps.

(Step 1) Features are extracted from the dataset through unsupervised learning using an autoencoder.

(Step 2) The extracted features are clustered and fine-tuned to improve the performance of the classifier.

As shown in FIG. 5, in an embodiment of the present invention, training data for the classifier neural network may be collected using the built-in IR sensor module. More specifically, 5,000 or more sequential SRDR images per person may be collected without annotating the training dataset for the classifier network. Here, 81,758 SRDR images collected from 10 different users may be used to train the classifier network.

The data input to the classifier network is a series of four 320 x 240 pixel IR diffusion snapshot images, captured sequentially by the camera and stored in an input image queue. The classifier network may preprocess the images before passing them to the feature extraction part of the network.
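An input image queue of this kind can be sketched with a fixed-length double-ended queue. The class name `FrameQueue` and its methods are hypothetical; integer frame ids stand in for 320 x 240 images so that the example stays small.

```python
from collections import deque

WINDOW = 4  # number of sequential frames per classifier input, per the text

class FrameQueue:
    """Fixed-length queue that keeps only the most recent WINDOW frames."""

    def __init__(self, window=WINDOW):
        self._frames = deque(maxlen=window)

    def push(self, frame):
        self._frames.append(frame)  # the oldest frame is dropped automatically

    def ready(self):
        return len(self._frames) == self._frames.maxlen

    def window(self):
        return list(self._frames)  # oldest first, newest last

queue = FrameQueue()
for frame_id in range(6):  # frame ids stand in for captured images
    queue.push(frame_id)
```

Using `deque(maxlen=...)` keeps the newest frame at the end of the window without any manual eviction logic.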

FIG. 7 shows, as an embodiment of the present invention, the preprocessing performed for the clustering network.

More specifically, the preprocessing unit in FIG. 7 computes the difference between two images. It may threshold the input images, perform pixel-wise subtraction between the two images, resize the result to 28 x 28 pixels to match the clustering network's input size, and group as many images as the input length required by the clustering network.

As shown in FIG. 7, so that the classifier network focuses on variation in the shape of the SRDR contour, the preprocessing unit emphasizes deformation of the SRDR contour shape and then groups four sequential images to match the RNN window size of the STAE.

The preprocessing unit may first convert each input image into a black-and-white binary image using a preset threshold, to emphasize the contour of each SRDR. It may then compute the pixel-wise difference between SRDR binary images separated by a fixed time interval, to emphasize deformation of the SRDR contour. The fixed interval is two sampling intervals, between the current sampling time [n] and the earlier sampling time [n-2]. To reduce subsequent computation, the SRDR binary image may then be reduced to 28 x 28 pixels. The images before reduction may be stored in the image queue. Finally, for input to the feature extraction part, the preprocessing unit may combine four sequential images into one set forming the input image sequence.

The image at the end of the input sequence (the [n]th) is the most recent image captured by the camera.
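The preprocessing described above (binarize, difference against the frame two samples earlier, shrink to 28 x 28, group four images) can be sketched in plain Python. Several details are assumptions made only for illustration: the threshold value of 128, the use of an absolute difference, nearest-neighbour resizing, and the small synthetic 56 x 56 frames.

```python
def binarize(image, threshold):
    """Convert a grayscale image (rows of ints) to a 0/1 binary image."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]

def pixel_diff(current, previous):
    """Absolute per-pixel difference of two binary images (assumed form)."""
    return [[abs(a - b) for a, b in zip(ra, rb)]
            for ra, rb in zip(current, previous)]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize; a stand-in for the actual scaler."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def preprocess(frames, threshold=128, size=28, lag=2, window=4):
    """Build `window` difference images, each comparing frame n with n-lag."""
    binary = [binarize(f, threshold) for f in frames]
    diffs = [resize_nearest(pixel_diff(binary[n], binary[n - lag]), size, size)
             for n in range(lag, len(binary))]
    return diffs[-window:]  # the last entry corresponds to the newest frame

def make_frame(radius, h=56, w=56):
    """Synthetic frame: a bright disc whose radius mimics a changing pattern."""
    return [[255 if (r - h // 2) ** 2 + (c - w // 2) ** 2 <= radius ** 2 else 0
             for c in range(w)] for r in range(h)]

frames = [make_frame(radius) for radius in (10, 10, 12, 12, 14, 14)]
sequence = preprocess(frames)
```

With a lag of two samples, six captured frames are needed in this sketch to produce four difference images; whether the embodiment buffers frames this way is not stated in the text.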

FIG. 8 shows, as an embodiment of the present invention, the network structure that extracts sensor data features.

More specifically, the STAE in FIG. 8 consists of an encoder and a decoder, and after the STAE has been trained only the encoder part is used for feature extraction.

As shown in FIG. 8, the sensor camera of an embodiment of the present invention may convert the user's facial gesture into sequential NIR SRDR images. The converted images are the data input to the classifier network, which determines whether they indicate a wink gesture. Image clustering may be regarded as the most suitable approach to image classification under unsupervised learning. However, it still has limitations for the following two reasons.

The first reason is the dimensionality of the input image data. Images often have very high dimensionality, which must be reduced to avoid the curse of dimensionality, which strongly affects the performance of the K-means algorithm.

The second reason is that features in an image mostly have 2D or 3D local structure, which must not be ignored during clustering. Such features may require kernels optimized for the shape of the local structure. Furthermore, because the data coming from the sensor are sequential data with temporal characteristics, the clustering network of the present invention must be able to extract both temporal features and spatial information from the input.

In an embodiment of the present invention, an STAE may be adopted, which reduces the dimensionality of the input while preserving its spatiotemporal characteristics, without user supervision. The encoder part of the autoencoder may serve as the feature extractor for the clustering method, because while the autoencoder is trained, the encoder learns to compress the input data with minimal compression loss.

The STAE is a type of autoencoder composed of a recurrent neural network (RNN) and a convolutional neural network (CNN), which extract temporal and spatial features from the input data, respectively. When $n$ image sequences $\{x_i \in X\}_{i=1}^{n}$ are clustered into $k$ clusters, each represented by a centroid $\mu_j$, $j = 1, \ldots, k$, the STAE may reduce the dimensionality of each $x_i$ through a nonlinear mapping $f_\theta: X \rightarrow Z$, where $\theta$ denotes the kernel weights and $Z$ is the latent feature space. The kernels not only support dimensionality reduction but also have the capacity to detect features in the image sequence. When STAE training ends, learning of $\theta$ is complete (which may correspond to initialization of the classifier parameters), and the decoder part of the STAE is no longer needed, so it may be discarded in the subsequent boosting step.

An autoencoder is a kind of artificial neural network that learns data representations by unsupervised learning and generally consists of an encoder-decoder pair. The encoder is trained to compress the input data, and the decoder is trained to restore the compressed data to the original input data with minimal loss. The encoder is thereby led to learn a data representation that reduces unnecessary redundancy in the input data.

As shown in FIG. 8, the STAE in one embodiment of the present invention may receive four consecutive images from the preprocessing unit as input data.

FIG. 9 shows, as an embodiment of the present invention, details of the network structure of the STAE used for feature extraction.

As shown in FIG. 9, two 3D CNN layers may be placed first in the STAE network to find spatial features and reduce the input image size. To reduce the dimensionality of the input image, each convolutional layer of the encoder may operate without padding. Variance scaling may be used to initialize the kernel weights. Each CNN layer may be followed by a max pooling layer that halves the image size. Because the input data are 3D (width, height, sampling time), the network needs to process them with 3D CNN layers rather than ordinary 2D CNN layers.
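The effect of unpadded convolution followed by size-halving pooling on the spatial dimensions can be traced with simple arithmetic. The 3 x 3 spatial kernel size used below is a hypothetical value chosen for illustration; the text does not give the kernel sizes.

```python
def conv_valid(size, kernel):
    """Output length of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def max_pool(size, pool=2):
    """Output length of non-overlapping pooling; pool=2 halves the size."""
    return size // pool

def encoder_spatial_sizes(size=28, kernel=3, n_blocks=2):
    """Trace the spatial size through n_blocks of (valid conv -> pooling)."""
    trace = [size]
    for _ in range(n_blocks):
        size = conv_valid(size, kernel)
        trace.append(size)
        size = max_pool(size)
        trace.append(size)
    return trace

print(encoder_spatial_sizes())  # [28, 26, 13, 11, 5]
```

Under these assumed kernel sizes, a 28 x 28 input shrinks to 5 x 5 after the two convolution and pooling blocks.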

Next, two consecutive convolutional long short-term memory (ConvLSTM) layers may extract temporal features from the compressed input data.

A conventional RNN can learn temporal features through its hidden state parameters, but it suffers from the vanishing gradient problem. The LSTM mitigates this problem by introducing a recurrent gate called the forget gate. ConvLSTM is a type of LSTM designed to find temporal information in a sequence of image frames, such as video data, and may have 2D-shaped weights. Instead of multiplying inputs by weights, ConvLSTM may apply a convolution between the inputs and the weights. These LSTM layers can propagate spatiotemporal features through each LSTM state.
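For reference, a commonly written form of the ConvLSTM update is shown below, with peephole terms omitted. The notation is not taken from the patent and is given here only as an assumption: $X_t$ is the input frame, $H_t$ the hidden state, $C_t$ the cell state, $*$ denotes convolution, and $\circ$ denotes element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
```

The forget gate $f_t$ scales how much of the previous cell state is carried forward, which is the mechanism the text credits with mitigating vanishing gradients.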

Finally, in an embodiment of the present invention, another 3D CNN layer, designed with a kernel of the same size as the compressed input image, is placed after the ConvLSTM layers to convert the four 2D images into four 2D vectors representing the extracted features.

This 3D CNN layer may map the output of the previous layer into a 2D space for clustering. In the present invention, the dimension of the feature space may be preset to match the intended number of clusters. Because the intended clusters comprise only two states (skin deformed and not deformed), the feature space may be set to 2D. Of the four extracted vectors, only the last, which contains the features of the most recently captured data, may be used as the final result of feature extraction in the sensor data clustering process.

For ease of implementation, the decoder may be configured fully symmetrically to the encoder, except that it adopts average unpooling layers instead of max unpooling layers. After the autoencoder has been trained, the decoder is no longer needed and may be discarded from the classifier network. Only the encoder part is used in the feature extractor.

FIG. 10 shows, as an embodiment of the present invention, a classifier network composed of an STAE-based feature extractor and a DEC-based feature classifier.

Clustering, as used in FIG. 10, is an unsupervised learning method that may be used to group data with similar characteristics. Compared with a classifier based on supervised learning, a classifier based on unsupervised learning has the advantage that no ground-truth answer needs to be provided for ambiguous data to be classified.

Meanwhile, in an embodiment of the present invention, the k-means algorithm may be adopted to implement clustering, because it is simple and relatively efficient compared with other clustering algorithms. However, the k-means algorithm, being based on unsupervised clustering, is generally somewhat less accurate than supervised classification methods.

The sensor in an embodiment of the present invention may adopt the deep embedded clustering (DEC) method to enhance the performance of the clustering network. To improve clustering quality, the clustering method of the present invention may include the following two steps.

(Step 1) The initial cluster centroids are found by applying the K-means clustering algorithm to the sensor data feature space.

(Step 2) Accuracy is increased by repeatedly applying the fine-tuning technique (DEC) to the entire classifier network until it no longer yields further improvement.
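Step 1 above can be pictured with a plain K-means (Lloyd's algorithm) run on a handful of 2-D feature vectors. This is a minimal sketch: the function name `kmeans`, the iteration count, the random seed, and the sample features are illustrative assumptions rather than values from the patent.

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, hard assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels

# Hypothetical encoder outputs: three near the origin, three near (3, 3).
features = [(0.1, 0.0), (0.0, 0.2), (-0.1, 0.1),
            (3.0, 3.1), (2.9, 3.0), (3.2, 2.8)]
centroids, labels = kmeans(features, k=2)
```

The resulting centroids would then serve as the starting point for the fine-tuning in Step 2.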

The fine-tuning technique (DEC) may cluster the data by simultaneously adjusting the set of $k$ cluster centroids $\{\mu_j \in Z\}_{j=1}^{k}$ pre-trained in the feature space $Z$ and the STAE weights $\theta$ that encode the input data into $Z$.

The centroids $\mu_j$ and the encoder weights $\theta$ may be computed iteratively during the fine-tuning (DEC) process until the parameters yield an optimal clustering result. As the first step of deep embedded clustering (DEC), the original distribution $Q$ may be obtained using the K-means clustering algorithm, and the soft assignment of a target distribution $P$, which makes the clustering result of $Q$ more clearly separated, may be calculated. Here, Student's t-distribution may be used as the kernel function of the soft assignment so that the inter-cluster variance of the target distribution $P$ is larger than that of the existing distribution $Q$. The difference between the two distributions may then be obtained using the Kullback-Leibler (KL) divergence, which is commonly used to measure the difference between two distributions. The calculated difference may be used as the DEC loss function for network training: the loss is backpropagated and $\mu_j$ and $\theta$ are updated to improve the performance of the classifier.
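The quantities involved in this step can be sketched numerically. The formulas below follow the commonly published DEC formulation (Student's t kernel for the soft assignment, a squared and frequency-normalized target distribution, and a KL divergence loss); they are assumed here for illustration and are not quoted from the patent. The embeddings `z` and centroids `mu` are made-up values, and no gradient update is shown.

```python
import math

def soft_assign(z, centroids, alpha=1.0):
    """Student's t kernel similarity q_ij between embedding z_i and mu_j."""
    q = []
    for zi in z:
        row = [(1.0 + math.dist(zi, mu) ** 2 / alpha) ** (-(alpha + 1.0) / 2.0)
               for mu in centroids]
        total = sum(row)
        q.append([v / total for v in row])
    return q

def target_distribution(q):
    """Sharpened target: p_ij proportional to q_ij^2 / f_j, rows sum to 1."""
    freq = [sum(col) for col in zip(*q)]  # soft cluster frequencies f_j
    p = []
    for row in q:
        weights = [v * v / f for v, f in zip(row, freq)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)

z = [(0.2, 0.1), (0.9, 1.2), (2.8, 3.1)]   # hypothetical embeddings
mu = [(0.0, 0.0), (3.0, 3.0)]              # hypothetical centroids
q = soft_assign(z, mu)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

In a full implementation this loss would be differentiated with respect to both the centroids and the encoder weights, which requires an automatic differentiation framework and is outside the scope of this sketch.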

This process may be repeated until the hard assignment of each input $x_i$ no longer changes by more than a preset threshold. After the fine-tuning process, the encoder weights and the cluster centroid coordinates may be optimized for classifying the input image stream.
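One way to express such a stopping rule is to compare the hard cluster assignments of two consecutive iterations. The tolerance value and the function names below are hypothetical; the patent only states that a preset threshold is used.

```python
def hard_assign(q):
    """Index of the most probable cluster for each row of soft assignments."""
    return [row.index(max(row)) for row in q]

def has_converged(prev_labels, new_labels, tol=0.001):
    """True when the fraction of samples that changed cluster is below tol."""
    changed = sum(a != b for a, b in zip(prev_labels, new_labels))
    return changed / len(new_labels) < tol

labels_before = hard_assign([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
labels_after = hard_assign([[0.95, 0.05], [0.1, 0.9], [0.3, 0.7]])
print(has_converged(labels_before, labels_after))
```

Here the soft assignments became sharper but no sample switched cluster, so the check reports convergence.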

FIG. 11 is a diagram showing clustering results obtained by applying the fine-tuning technique (DEC), as an embodiment of the present invention.

More specifically, FIG. 11(a) shows the clustering results for the 81,758 training data set images, and FIG. 11(b) shows the clustering results detected in real time from a user (on-line validation).

As shown in FIG. 11, the classifier network clustered a sequence of 81,758 training-set input images to verify the performance of the clustering process. The clustering results can be represented in a 2D feature space using the t-SNE method. The color and shape of each point indicate its cluster, in the neutral or wink state, and the cluster names are given in the legend of the scatter plots. Compared with the results obtained before applying the fine-tuning technique (DEC), the results after applying it show that the resolution of the cluster boundaries improved greatly as training progressed.

However, because the present invention used unannotated data for the DEC-based classifier and for feature extraction, the accuracy of the clustering network cannot be determined from that data alone. Because an unsupervised learning method was applied to the classifier network, there was no information indicating whether the user state was wink or neutral.

In an embodiment of the present invention, to evaluate the accuracy of the classifier network, an additional experiment (an experiment on a second experimental data set) can be performed on users wearing an AR headset.

Because no separate data sets for validation and testing were collected, each individual's online sensor data can be used as the validation data set. In the experimental validation step, the classification accuracy can be measured after applying to the network the weights trained for a given number of training iterations.

FIG. 11(b) shows the real-time clustering results for the online sensor data obtained in the additional experiment. Each participant made 30 wink gestures during each measurement, and the number of these intentional wink gestures that were successfully detected can be recorded. As shown in FIG. 11(b), the performance of the clustering network improves as the number of training steps increases. After 500 iterations of DEC training, the classifier network can detect 30 out of 30 wink gestures.

The gesture detection system of the present invention can operate in real-time mode at 30 fps. To test whether the classifier network can also be applied to people who did not volunteer for the data collection, a third additional experiment can be performed, as shown in the screenshots from a demonstration using a custom application.

The user can pop balloons (a, b) or select a button to change the background (c, d). The user selects an object by aiming the red center point at it and then performs a wink gesture to execute the action.

In the third experiment, participants could check the gesture detection results in real time in response to their facial movements, and received 1 point each time a wink gesture was detected by the sensor. Each participant had to score 100 points to complete the experiment. The number of wink gestures each participant intended to make before completing the experiment can be counted. All of these experiments were conducted with 10 new participants (i.e., people who did not take part in collecting the training data set). An average accuracy of 95.4% can be achieved, as shown in Table 1 below. Table 1 shows real-time measurements of the gesture detection accuracy for users wearing the AR headset of the present invention.
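The scoring scheme above suggests a simple per-participant accuracy measure. The sketch below assumes that accuracy is the number of points scored divided by the number of wink gestures the participant intended to make, which the text implies but does not state as a formula. The attempt counts are hypothetical placeholders, not the values of Table 1.

```python
TARGET_POINTS = 100  # points each participant must score to finish


def detection_accuracy(points_scored, winks_attempted):
    """Share of intended wink gestures that the sensor detected."""
    if winks_attempted <= 0:
        raise ValueError("winks_attempted must be positive")
    return points_scored / winks_attempted


# Hypothetical attempt counts for two participants (not measured data).
hypothetical_attempts = [100, 125]
accuracies = [detection_accuracy(TARGET_POINTS, n) for n in hypothetical_attempts]
mean_accuracy = sum(accuracies) / len(accuracies)
```

A participant who needs 125 intended winks to score 100 points would have an accuracy of 0.8 under this assumption.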

<Table 1>

Although the interface technology of the present invention uses an IR camera as the input device, it is entirely different from general image-based facial expression classification. The sensor of the present invention can detect the movement of muscles beneath the skin surface by tracking the distorted alignment of collagen fibers in the skin, based on IR diffuse reflection.

In particular, because the present invention senses skin deformation in the ROI without contacting the skin, an image of the entire face is not required. Because humans generally use different facial muscles to blink and to wink intentionally, the SRDR sensor of the present invention has the advantage of being able to distinguish the two actions. A further advantage of this sensing method is that the sensor can generate classification features from a plain facial skin surface on which a camera can find no features.

Through the custom facial gesture sensor and the classifier network, the present invention can cluster the IR diffusion images generated by the winks of a user wearing an AR headset. That is, the sensor of the present invention can provide an efficient way to implement a UI for AR headsets.

Many commands can be implemented by combining facial gestures with head rotation data, the latter of which can be captured by the gyro sensor built into most AR headsets. For example, the user can select a button with a wink, or move an object in the virtual space by turning the head while making a wink gesture, similar to the drag-and-drop command commonly used in desktop environments.
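One possible way to combine the two inputs is sketched below: the wink acts like a mouse button and the head rotation moves the center point. Everything in this sketch is a hypothetical illustration, including the class names, the `gain` that maps degrees of head rotation to scene units, and the circular hit test. The text does not specify how the combination is implemented.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    x: float
    y: float
    radius: float = 0.1  # hypothetical selection radius in scene units


class WinkDragController:
    """Hypothetical controller: wink = press, head rotation = pointer motion."""

    def __init__(self, objects, gain=0.01):
        self.objects = objects
        self.gain = gain   # assumed scene units per degree of head rotation
        self.held = None   # object currently being dragged, if any

    def _object_under(self, cx, cy):
        for obj in self.objects:
            if (obj.x - cx) ** 2 + (obj.y - cy) ** 2 <= obj.radius ** 2:
                return obj
        return None

    def update(self, wink, yaw_deg, pitch_deg):
        """Process one frame; return the name of the held object or None."""
        # The center point follows the head orientation reported by the gyro.
        cx, cy = self.gain * yaw_deg, self.gain * pitch_deg
        if wink:
            if self.held is None:
                self.held = self._object_under(cx, cy)  # grab on wink
            if self.held is not None:
                self.held.x, self.held.y = cx, cy  # drag with the head
        else:
            self.held = None  # drop when the wink ends
        return self.held.name if self.held else None


balloon = SceneObject("balloon", 0.0, 0.0)
controller = WinkDragController([balloon])
# Frames of (wink detected, yaw in degrees, pitch in degrees).
frames = [(False, 0.0, 0.0), (True, 0.0, 0.0), (True, 10.0, 5.0), (False, 10.0, 5.0)]
held_per_frame = [controller.update(*frame) for frame in frames]
```

In this sketch the object is grabbed on the first frame in which a wink is detected over it, follows the center point while the wink continues, and is dropped when the wink ends.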

FIG. 12 shows screenshots from a demonstration using a custom application, as an embodiment of the present invention. More specifically, in FIG. 12 the user can pop balloons (a, b) or select a button to change the background (c, d); the user selects an object by aiming the red center point at it and then executes the action with a wink.

Referring to FIG. 12, the user can pop a balloon or select a button in the AR environment with a wink gesture. This can provide a user-friendly UI with the advantages of simple hands-free control and low-cost implementation.

The sensor of the present invention requires appropriate placement, and may be limited to specific locations on the body for sensing facial skin deformation. This is because the IR SRDR changes sufficiently only when the muscles beneath the skin surface move enough to affect the internal structure of the skin.

To obtain the best results, it can be important to set an appropriate ROI for the sensor, because improper placement of the sensor may cause errors. Meanwhile, because the wink gesture can be tiresome or even difficult for some people, in another embodiment the present invention can sense various facial gestures by choosing a different sensing location. Alternatively, the number of detectable facial gestures can be increased by adding more sensors. These can easily be changed and added through experiments tailored to the user environment or suitability.

Although the present invention has been described and illustrated in connection with preferred embodiments for illustrating its technical idea, the present invention is not limited to the configuration and operation as illustrated and described, and those skilled in the art will appreciate that many changes and modifications can be made to the present invention without departing from the scope of the technical idea. Accordingly, all such appropriate changes and modifications should be regarded as falling within the scope of the present invention.

100: near-infrared emission unit 200: signal frame reception unit
300: signal frame preprocessing unit 400: feature information extraction unit
500: feature information classification unit

Claims

A headset-type user interface system, comprising:
a near-infrared emission unit that emits near-infrared rays toward skin at a predetermined distance;
a signal frame reception unit that receives at least one signal frame including a signal of near-infrared rays that, among the near-infrared rays incident on the skin from the emitted near-infrared rays, are scattered within the skin and emitted out of the skin;
a signal frame preprocessing unit that preprocesses the at least one signal frame in the order received;
a feature information extraction unit that extracts, from the preprocessed at least one signal frame, feature information of the skin deformed by the emitted near-infrared rays; and
a feature information classification unit that classifies the extracted feature information according to a preset user intention,
wherein the signal frame received by the signal frame reception unit is in the form of a near-infrared image including spatially resolved diffuse reflectance (SRDR) measured from scattering caused by the near-infrared rays emitted to the skin,
wherein the signal frame preprocessing unit preprocesses the near-infrared images in the order received and reconstructs their data format according to the data input format of the feature information extraction unit,
wherein the feature information extraction unit extracts, from the preprocessed near-infrared image, at least one feature for classifying the user's intention,
wherein the feature information classification unit classifies the extracted at least one feature and maps it to the user's intention,
wherein the feature information extraction unit extracts the feature information using the encoder of a spatiotemporal autoencoder (STAE) pre-trained on the basis of a deep neural network (DNN), and
wherein the feature information classification unit maps the extracted at least one feature to a preset user intention using a pre-trained classifier based on a deep embedded clustering (DEC) technique.

The headset-type user interface system of claim 1, wherein the near-infrared emission unit
emits collimated near-infrared rays so that, in receiving the signal frame resulting from scattering of the near-infrared rays emitted to the skin, the effect of reflection of those rays is minimized.

delete

The headset-type user interface system of claim 1, wherein the signal frame reception unit
includes a near-infrared filter so that the near-infrared image contains no noise signals in bands other than the near-infrared signal of the target band.

delete

A method of classifying user feature information for a user interface, comprising:
a near-infrared emission step in which a near-infrared emission unit emits near-infrared rays toward skin at a predetermined distance;
a signal frame reception step in which a signal frame reception unit receives at least one signal frame including a signal of near-infrared rays that, among the near-infrared rays incident on the skin from the rays emitted in the near-infrared emission step, are scattered within the skin and emitted out of the skin;
a signal frame preprocessing step in which a signal frame preprocessing unit preprocesses the at least one signal frame received in the signal frame reception step in the order received;
a feature information extraction step in which a feature information extraction unit extracts, from the at least one signal frame preprocessed in the signal frame preprocessing step, feature information of the skin deformed by the emitted near-infrared rays; and
a feature information classification step in which a feature information classification unit classifies the feature information extracted in the feature information extraction step according to a preset user intention,
wherein, in the signal frame reception step, the received signal frame is in the form of a near-infrared image including spatially resolved diffuse reflectance (SRDR) measured from scattering caused by the near-infrared rays emitted to the skin,
wherein, in the signal frame preprocessing step, the near-infrared images are preprocessed in the order received and their data format is reconstructed according to the data input format of the feature information extraction step,
wherein the feature information extraction step extracts, from the near-infrared image preprocessed in the signal frame preprocessing step, at least one feature for classifying the user's intention,
wherein the feature information classification step classifies the at least one feature extracted in the feature information extraction step and maps it to the user's intention,
wherein the feature information extraction step extracts the feature information using the encoder of a spatiotemporal autoencoder (STAE) pre-trained on the basis of a deep neural network (DNN), and
wherein the feature information classification step classifies the extracted at least one feature using a pre-trained classifier based on a deep embedded clustering (DEC) technique and maps it to a preset user intention.

The method of claim 8, wherein the near-infrared emission step
emits collimated near-infrared rays so that, in receiving the signal frame resulting from scattering of the near-infrared rays emitted to the skin, the effect of reflection of those rays is minimized.

delete

The method of claim 8, wherein the signal frame reception unit
includes a near-infrared filter so that the near-infrared image contains no noise signals in bands other than the near-infrared signal of the target band.

delete