KR102420924B1 - 3D gaze estimation method and apparatus using multi-stream CNNs

Info

Publication number: KR102420924B1
Application number: KR1020200132465A
Authority: KR
Inventors: 김용국
Original assignee: 세종대학교산학협력단
Priority date: 2020-06-15
Filing date: 2020-10-14
Publication date: 2022-07-14
Also published as: KR20210155317A

Abstract

A deep learning-based 3D gaze prediction method and an apparatus therefor are disclosed. The method includes: detecting each eye region in a visible-light input image and normalizing it to extract a right-eye patch and a left-eye patch; applying the right-eye patch and the left-eye patch to a right-eye feature extraction module and a left-eye feature extraction module, respectively, to extract a right-eye feature vector and a left-eye feature vector; and estimating a gaze angle by applying the right-eye feature vector, the left-eye feature vector, and a head pose to a trained gaze angle estimation model.

Description

Deep learning-based 3D gaze estimation method and apparatus {3D gaze estimation method and apparatus using multi-stream CNNs}

The present invention relates to a deep learning-based 3D gaze prediction method and an apparatus therefor.

Eye movement and gaze estimation are very important for visual and cognitive processing. In particular, eye movement has been widely studied in research on human visual attention, emotion analysis, and the identification of behavioral disorders.

Gaze estimation has been studied in the field of computer vision because it has a wide range of applications, such as human-computer interaction, psychology, disability research, navigation, driver behavior detection, robotic surgery, and marketing research.

Earlier model-based and feature-based methods for gaze prediction are limited by lighting conditions, camera calibration methods, and variation in individual head pose. With the recent availability of large-scale gaze data sets, computer vision researchers have explored appearance-based methods, which typically use convolutional neural networks (CNNs) to estimate a person's gaze in uncontrolled environments.

Although deep learning approaches have been remarkably successful at estimating human gaze in natural environments, current approaches reach an accuracy of only about 3.6 degrees, which makes them difficult to use in real-time applications.

An object of the present invention is to provide a deep learning-based 3D gaze prediction method and apparatus capable of accurately predicting an observer's gaze from webcam images alone.

A further object of the present invention is to provide a deep learning-based 3D gaze prediction method and apparatus that trains its model using only a variety of natural face images, runs quickly, and performs well.

According to one aspect of the present invention, there is provided a deep learning-based 3D gaze prediction method capable of accurately predicting an observer's gaze from webcam images alone.

According to an embodiment of the present invention, a 3D gaze prediction method may be provided that includes: detecting each eye region in a visible-light input image and normalizing it to extract a right-eye patch and a left-eye patch; applying the right-eye patch and the left-eye patch to a right-eye feature extraction module and a left-eye feature extraction module, respectively, to extract a right-eye feature vector and a left-eye feature vector; and estimating a gaze angle by applying the right-eye feature vector, the left-eye feature vector, and a head pose to a trained gaze angle estimation model.

The gaze angle estimation model includes three fully connected layers and a linear regression module. The right-eye feature vector, the left-eye feature vector, and the head pose are combined through the three fully connected layers and then applied to the trained linear regression module to estimate the gaze angle.

The linear regression module may be trained to minimize a loss function computed from the predicted gaze angle and the actual gaze angle.

The gaze angle consists of yaw and pitch.

The right-eye patch and the left-eye patch may be normalized patch images cropped to a predetermined size around a center p, with head roll removed with respect to a reference point x.

According to another aspect of the present invention, there is provided an apparatus capable of accurately predicting an observer's gaze from webcam images alone.

According to an embodiment of the present invention, a 3D gaze prediction apparatus may be provided that includes: a preprocessing unit that detects each eye region in a visible-light input image and normalizes it to extract a right-eye patch and a left-eye patch; a feature extraction unit including a right-eye feature extraction module that analyzes the right-eye patch to extract a right-eye feature vector and a left-eye feature extraction module that analyzes the left-eye patch to extract a left-eye feature vector; and a gaze angle estimation unit that estimates a gaze angle through linear regression by applying the right-eye feature vector, the left-eye feature vector, and a head pose to a model.

The model includes three fully connected layers and a linear regression module. The right-eye feature vector, the left-eye feature vector, and the head pose are combined through the three fully connected layers and then applied to the trained linear regression module to estimate the gaze angle.

The preprocessing unit may detect a face region in the input image, detect both eye regions in the detected face region, remove head roll with respect to a reference point x, and crop a patch of predetermined size around a center p, thereby extracting the normalized right-eye patch and left-eye patch.

The deep learning-based 3D gaze prediction method and apparatus according to an embodiment of the present invention have the advantage of accurately predicting an observer's gaze from a visible-light image.

In addition, because the model is trained using only a variety of visible-light face images, without infrared images, the present invention has the advantage of enabling high-performing gaze prediction without a long execution time.

FIG. 1 is a flowchart illustrating a deep learning-based 3D gaze prediction method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating extraction of a left-eye patch and a right-eye patch according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the detailed structure of a feature extraction module according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the detailed configuration of a gaze angle estimation unit according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating example images from the MPIIGaze data set and the EYEDIAP data set according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a method of augmenting a data set according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the configuration of a 3D gaze prediction apparatus according to an embodiment of the present invention.
FIG. 8 is a diagram comparing an estimated gaze angle with an actual gaze angle according to an embodiment of the present invention.

As used herein, a singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "consisting of" or "comprising" should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be omitted, or additional components or steps may be included. In addition, terms such as "... unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a deep learning-based 3D gaze prediction method according to an embodiment of the present invention; FIG. 2 is a diagram illustrating extraction of a left-eye patch and a right-eye patch; FIG. 3 is a diagram illustrating the detailed structure of a feature extraction module; FIG. 4 is a diagram illustrating the detailed configuration of a gaze angle estimation unit; FIG. 5 is a diagram illustrating example images from the MPIIGaze data set and the EYEDIAP data set; and FIG. 6 is a diagram illustrating a method of augmenting a data set, each according to an embodiment of the present invention.

In step 110, the gaze prediction apparatus 100 analyzes the input image and extracts a normalized left-eye patch and a normalized right-eye patch.

The visible-light input image varies in resolution with lighting and camera conditions. Accordingly, in an embodiment of the present invention, the left-eye region and the right-eye region are each detected in the input image and normalized, and the left-eye patch and the right-eye patch are then extracted.

The gaze prediction apparatus 100 may detect a face region in the visible-light input image and detect both eye regions in the detected face region. The gaze prediction apparatus 100 then normalizes the detected eye regions and extracts a patch for each eye.

For example, the gaze prediction apparatus 100 may extract, from the input image, normalized patches of both eyes from which head roll has been removed.

According to an embodiment of the present invention, the input image may be a visible-light image captured by a webcam.

To predict gaze accurately by overcoming appearance variation in the input image regardless of camera parameters, the gaze prediction apparatus 100 may extract normalized patches of both eyes from which head roll has been removed. This is described in more detail below.

Let the input image be denoted I. Given the input image I and a reference point x, the goal is to compute a conversion matrix M, which can be calculated using Equation 1.

When the rotation matrix R is applied, the x-axis of the head coordinate system and the x-axis of the camera are parallel. The scale matrix S is then defined so that the virtual camera views the reference point x from a fixed distance [symbol missing]. The scale matrix S can be expressed as Equation 2.

Here, diag() denotes a diagonal matrix, and [symbol missing] is a matrix representing the translation from the reference point.

The gaze prediction apparatus 100 may normalize the input image I through perspective warping using a transformation matrix W. This can be expressed as Equation 3.

Here, [symbol missing] denotes the projection matrix of the normalized camera, and [symbol missing] denotes the actual camera matrix.

As a result, the gaze prediction apparatus 100 detects both eye regions in the visible-light input image and, with head roll removed with respect to the reference point x, extracts each eye patch as a patch of predetermined size (W x H) cropped around the center p.
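
The normalization described above can be sketched in a few lines of NumPy. Equations 1 to 3 are not reproduced in this text, so the forms used below, M = S R, S = diag(1, 1, d_s / ||x||), and W = C_n M C_r^{-1}, are assumptions borrowed from the data-normalization scheme commonly used with gaze data sets, not a confirmed reading of the patent. The names `head_R`, `C_r`, `C_n`, and `d_s` are likewise hypothetical.

```python
import numpy as np

def normalization_matrices(x, head_R, C_r, C_n, d_s):
    """Return (M, W) for a reference point x given in camera coordinates.

    Assumed formulation (not confirmed by the source):
    M = S @ R and W = C_n @ M @ inv(C_r).
    """
    x = np.asarray(x, dtype=float)
    dist = np.linalg.norm(x)
    # The z-axis of the virtual camera points at the reference point.
    z = x / dist
    # Choosing y orthogonal to both z and the head's x-axis removes head roll:
    # the head's x-axis then has no vertical component in the new frame.
    y = np.cross(z, head_R[:, 0])
    y /= np.linalg.norm(y)
    x_axis = np.cross(y, z)
    x_axis /= np.linalg.norm(x_axis)
    R = np.stack([x_axis, y, z])           # rows are the virtual camera axes
    S = np.diag([1.0, 1.0, d_s / dist])    # place x at the fixed distance d_s
    M = S @ R
    W = C_n @ M @ np.linalg.inv(C_r)       # pixel-to-pixel perspective warp
    return M, W

# Hypothetical values for illustration only.
x_ref = np.array([0.10, -0.05, 0.60])
head_R = np.eye(3)
C_r = np.array([[960.0, 0.0, 640.0], [0.0, 960.0, 360.0], [0.0, 0.0, 1.0]])
C_n = np.array([[960.0, 0.0, 30.0], [0.0, 960.0, 18.0], [0.0, 0.0, 1.0]])
M, W = normalization_matrices(x_ref, head_R, C_r, C_n, d_s=0.6)
```

Under these assumptions the reference point projects to the principal point of the normalized camera, which is why a fixed-size crop around the center p is sufficient. In practice W would be passed to a perspective-warp routine to resample the image.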

FIG. 2 shows an example of extracting both eye patches. The gaze prediction apparatus 100 may detect a face region in the webcam image, detect both eye regions in the detected face region, and then extract normalized eye patches (I_L, I_R) from which head roll has been removed with respect to the reference point x.

In step 115, the gaze prediction apparatus 100 extracts a feature vector from each of the normalized left-eye patch (I_L) and right-eye patch (I_R).

The normalized left-eye patch and right-eye patch are each input to their own feature extractor, which extracts a feature vector from each. For convenience, these are referred to below as the left-eye feature vector and the right-eye feature vector.

The left-eye patch and the right-eye patch are input to the left-eye feature extraction module and the right-eye feature extraction module, respectively. The two modules do not share weights. That is, the left-eye and right-eye feature extraction modules may be trained to have different weights and may be trained and operated independently.

The left-eye feature extraction module and the right-eye feature extraction module have the same detailed configuration. Their structure is shown in FIG. 3.

Each feature extraction module includes five convolutional layers for feature extraction. The spatial weight matrix output by the spatial attention module is combined, through a sigmoid operation, with the output of the five convolutional layers (the activation map) to produce a final weighted activation map, which serves as the feature vector.

The spatial attention module consists of three convolutional layers with a 1 x 1 filter size and rectified linear units (ReLU). The feature extraction module can therefore generate the final weighted activation map from the activation map produced by the five convolutional layers and the spatial weight matrix produced by the spatial attention module. This can be expressed as Equation 4.

Here, W denotes the spatial weight matrix output by the spatial attention module, and U denotes the activation map output by the five convolutional layers.

The weighted activation map passes through a final max-pooling layer, which reduces its dimensionality. In other words, after the element-wise multiplication with the spatial weight matrix, the result passes through the max-pooling layer, reducing the dimension (size) of the feature vector.
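
The spatial weighting step can be illustrated with a small NumPy sketch. Equation 4 is not reproduced in this text; the element-wise product V = W * U used here is an assumption based on the description of W and U above. The channel counts, the ReLU after every 1 x 1 convolution, the 2 x 2 pooling window, and the random placeholder weights are all hypothetical, and the random array `U` merely stands in for the output of the five convolutional layers.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C, H, W) array; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def spatial_weighting(U, params):
    """Return V = W * U, where W comes from three 1x1 conv layers with ReLU."""
    h = U
    for weight, bias in params:
        h = np.maximum(conv1x1(h, weight, bias), 0.0)  # ReLU after each layer
    W = h                                              # shape (1, H, W)
    return W * U                                       # broadcast over channels

def max_pool2x2(V):
    """2x2 max pooling with stride 2 (H and W are assumed to be even)."""
    c, h, w = V.shape
    return V.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

rng = np.random.default_rng(0)
U = rng.standard_normal((64, 4, 6))   # stand-in activation map
params = [
    (rng.standard_normal((32, 64)) * 0.1, np.zeros(32)),
    (rng.standard_normal((16, 32)) * 0.1, np.zeros(16)),
    (rng.standard_normal((1, 16)) * 0.1, np.zeros(1)),
]
V = spatial_weighting(U, params)
feature = max_pool2x2(V).ravel()      # flattened per-eye feature vector
```

Because the left-eye and right-eye modules do not share weights, each stream would hold its own `params` in this sketch.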

In step 120, the gaze prediction apparatus 100 estimates the gaze angle by applying the left-eye feature vector, the right-eye feature vector, and the head pose to the gaze angle estimation model.

Here, the estimated gaze angle may be a yaw and a pitch, namely the yaw and pitch resulting from pupil movement.

As a result, according to an embodiment of the present invention, the 3D gaze prediction apparatus 100 can estimate a 3D gaze angle by detecting pupil movement in a 2D visible-light input image.
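
For illustration, a yaw and pitch pair can be turned into a 3D gaze direction, and two directions can be compared by the angle between them. The sign and axis convention below (camera looking along negative z) is one common choice and is an assumption; the source does not state which convention it uses. Both function names are hypothetical.

```python
import math

def gaze_vector(yaw, pitch):
    """Unit 3D gaze direction from yaw and pitch in radians (assumed convention)."""
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )

def angular_error_deg(a, b):
    """Angle in degrees between two unit vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))
```

An angular error of this kind is the usual way to report gaze accuracy in degrees.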

The gaze angle estimation model is described in more detail with reference to FIG. 4.

The gaze angle estimation model includes three fully connected layers and a linear regression module.

A batch normalization (BN) layer and a ReLU layer may each be placed after a fully connected layer.

According to an embodiment of the present invention, the three fully connected layers in the gaze angle estimation model may differ in size. Specifically, their sizes may be 512, 256, and 2, respectively. The sizes are not limited to these values, and the number and size of the fully connected layers may vary with the implementation.

Before the left-eye and right-eye feature vectors are combined, a dropout layer (p = 0.5) is connected to the fully connected layer of size 512, followed by a BN layer and a ReLU layer.

Next, a fully connected layer of size 256 is added and connected to a BN layer and a ReLU layer, and a fully connected layer of size 2 may then be added. Using BN layers together with the fully connected layers, as in this embodiment, was found to improve performance.

A head pose vector may be input at the final (rearmost) layer.

The linear regression module can thus compute the gaze angle, yaw and pitch, from the left-eye feature vector, the right-eye feature vector, and the head pose.
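
One possible reading of this regression head is sketched below as an inference-time forward pass in NumPy. The layer sizes 512, 256, and 2 come from the description; everything else is an assumption: the two eye feature vectors are concatenated before the first layer, the head pose is a 2-element vector concatenated just before the final size-2 layer, and the parameters are random placeholders rather than trained values. Dropout is omitted because it acts as the identity at inference.

```python
import numpy as np

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization using stored running statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def init_head(feat_dim, pose_dim=2, seed=0):
    """Random placeholder parameters for FC(512) -> FC(256) -> FC(2)."""
    rng = np.random.default_rng(seed)

    def fc(n_in, n_out):
        return rng.standard_normal((n_out, n_in)) * 0.01, np.zeros(n_out)

    def bn(n):
        # (gamma, beta, running mean, running variance)
        return np.ones(n), np.zeros(n), np.zeros(n), np.ones(n)

    return {
        "fc1": fc(2 * feat_dim, 512), "bn1": bn(512),
        "fc2": fc(512, 256), "bn2": bn(256),
        "fc3": fc(256 + pose_dim, 2),
    }

def estimate_gaze(f_left, f_right, head_pose, p):
    """Return an array (yaw, pitch) for a single sample."""
    h = np.concatenate([f_left, f_right])
    for fc_key, bn_key in (("fc1", "bn1"), ("fc2", "bn2")):
        weight, bias = p[fc_key]
        h = np.maximum(batch_norm(weight @ h + bias, *p[bn_key]), 0.0)
    weight, bias = p["fc3"]
    # The head pose joins the features at the final layer (assumed placement).
    return weight @ np.concatenate([h, head_pose]) + bias

head = init_head(feat_dim=384)
angles = estimate_gaze(np.ones(384), np.ones(384), np.array([0.1, -0.2]), head)
```

The final size-2 layer plays the role of the linear regression output here, since it applies no nonlinearity.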

The linear regression model is assumed to have been trained in advance on a training data set. The model computes the loss function as the Euclidean distance between the predicted gaze angle and the actual gaze angle, and is trained to minimize that loss.

For example, the loss function may be defined as in Equation 5.

Here, N is the total number of images, [symbol missing] is the predicted gaze angle of the i-th image, and [symbol missing] is the actual gaze angle of the i-th image.
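
A loss of this kind might be computed as follows. Equation 5 is not reproduced in this text, so the exact form is an assumption: the sketch averages the unsquared Euclidean distance between predicted and actual (yaw, pitch) pairs over N images. The function name `gaze_loss` is hypothetical.

```python
import numpy as np

def gaze_loss(predicted, actual):
    """Mean Euclidean distance between predicted and actual (yaw, pitch) pairs.

    Both inputs have shape (N, 2). Whether the source squares the distance
    or averages over N is not recoverable from the text; this is one choice.
    """
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.linalg.norm(predicted - actual, axis=1)))
```

For two images with errors of 0 and 5 in angle space, this returns their mean, 2.5.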

According to an embodiment of the present invention, the gaze angle estimation model may be trained in advance on the MPIIGaze and EYEDIAP data sets, which are widely used as standard benchmarks for gaze tracking. FIG. 5 shows examples of images from the MPIIGaze and EYEDIAP data sets. Although both data sets contain images, training the model on these alone may yield relatively low accuracy. Therefore, in an embodiment of the present invention, the training set is enlarged by applying Gaussian blur, gamma transformation, and added noise to images from these data sets (see FIG. 6).

Adding gamma-corrected versions of the data set images trains the model to adapt robustly to different lighting conditions (for example, dim or bright lighting).

In addition, to make the model more robust to camera blur, OpenCV's Gaussian blur was used with kernel sizes such as 7 x 7 and 3 x 3 to generate blurred images for additional training. The model can also be trained on images in which various kinds of noise, such as Gaussian noise and salt-and-pepper noise, are added to the eye patches.

In an embodiment of the present invention, augmenting the training data set in this way through Gaussian blur, gamma correction, added noise, and the like allows the model to learn a variety of lighting and camera conditions, which improves its accuracy.
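
The augmentation steps can be sketched with NumPy alone. The source uses OpenCV's Gaussian blur; the separable filter below is a stand-in written out so that the example has no dependency beyond NumPy. The gamma values, blur sigmas, noise levels, and patch size are hypothetical, and only the 3 x 3 and 7 x 7 kernel sizes come from the description.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Gamma transform of a uint8 image; gamma > 1 darkens, gamma < 1 brightens."""
    out = 255.0 * (img.astype(np.float64) / 255.0) ** gamma
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def gaussian_blur(img, ksize, sigma):
    """Separable Gaussian blur of a 2D grayscale image with edge replication."""
    half = ksize // 2
    taps = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma) ** 2)
    taps /= taps.sum()
    padded = np.pad(img.astype(np.float64), half, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, taps, mode="valid")
    both = np.apply_along_axis(np.convolve, 0, rows, taps, mode="valid")
    return np.clip(np.rint(both), 0, 255).astype(np.uint8)

def gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the uint8 range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

def salt_and_pepper(img, amount, rng):
    """Set roughly a fraction `amount` of pixels to 0 or 255 at random."""
    out = img.copy()
    r = rng.random(img.shape)
    out[r < amount / 2] = 0
    out[r > 1 - amount / 2] = 255
    return out

def augment(patch, rng):
    """Return augmented copies of one eye patch."""
    return [
        gamma_correct(patch, 0.5), gamma_correct(patch, 2.0),
        gaussian_blur(patch, 3, 1.0), gaussian_blur(patch, 7, 2.0),
        gaussian_noise(patch, 10.0, rng), salt_and_pepper(patch, 0.05, rng),
    ]

patch = np.full((36, 60), 128, dtype=np.uint8)   # flat gray stand-in patch
augmented = augment(patch, np.random.default_rng(0))
```

Each augmented copy keeps the gaze label of the original patch, since none of these operations moves the eye within the image.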

FIG. 7 is a diagram illustrating the configuration of a 3D gaze prediction apparatus according to an embodiment of the present invention, and FIG. 8 is a diagram comparing an estimated gaze angle with an actual gaze angle according to an embodiment of the present invention.

Referring to FIG. 7, the 3D gaze prediction apparatus 100 according to an embodiment of the present invention includes a preprocessing unit 710, a feature extraction unit 715, a gaze angle estimation unit 720, a learning unit 725, a memory 730, and a processor 735.

The preprocessing unit 710 is a means for detecting a face region in the visible-light input image, detecting both eye regions (the left-eye region and the right-eye region) in the detected face region, and normalizing them to extract the left-eye patch and the right-eye patch.

This is the same as described with reference to FIG. 1, so the description is not repeated.

The feature extraction unit 715 includes a left-eye feature extraction module and a right-eye feature extraction module. The left-eye feature extraction module is a means for extracting the left-eye feature vector from the left-eye patch, and the right-eye feature extraction module is a means for extracting the right-eye feature vector from the right-eye patch.

The detailed configuration of the left-eye and right-eye feature extraction modules in the feature extraction unit 715 is shown in FIG. 3. This is the same as described above, so the description is not repeated.

The gaze angle estimation unit 720 includes a linear regression module. Three fully connected layers precede the linear regression module, and the left-eye feature vector and the right-eye feature vector may be combined through them. The head pose may be input at the final layer.

The linear regression module therefore receives a vector that includes the left-eye feature vector, the right-eye feature vector, and the head pose, and uses them to estimate the 3D gaze angle. The 3D gaze angle may be a yaw and a pitch.

The learning unit 725 is a means for training the gaze angle estimation unit 720 on the training data set.

The learning unit 725 may train the linear regression module to minimize the loss function. The loss function is the same as described above, so the description is not repeated.

As described above, the training data set is based on the MPIIGaze and EYEDIAP data sets, extended with additional images generated from them through Gaussian blur, gamma transformation, added noise, and the like.

The memory 730 is a means for storing the various instructions (program code) needed to perform the deep learning-based 3D gaze prediction method according to an embodiment of the present invention.

The processor 735 is a means for controlling the internal components of the 3D gaze prediction apparatus 100 according to an embodiment of the present invention (for example, the preprocessing unit 710, the feature extraction unit 715, the gaze angle estimation unit 720, and the memory 730).

FIG. 8 illustrates results of predicting the 3D gaze angle. In FIG. 8, green indicates the predicted gaze angle and red indicates the actual gaze angle. As FIG. 8 shows, the 3D gaze angle is predicted accurately from visible-light images.

The apparatus and method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The present invention has been described above with a focus on its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

100: gaze prediction apparatus
710: preprocessor
715: feature extractor
720: gaze angle estimator
725: learning unit
730: memory
735: processor

Claims

1. A gaze prediction method comprising:
detecting each eye region in a visible-light input image and normalizing it to extract a right-eye patch and a left-eye patch, respectively;
extracting a right-eye feature vector and a left-eye feature vector by applying the right-eye patch and the left-eye patch to a right-eye feature extraction module and a left-eye feature extraction module, respectively; and
estimating a gaze angle by applying the right-eye feature vector, the left-eye feature vector, and a head pose to a trained gaze angle estimation model,
wherein the right-eye feature extraction module and the left-eye feature extraction module each generate, as the feature vector, a weighted activation map obtained through a sigmoid operation on the result of passing the right-eye patch or the left-eye patch through five convolution layers and the spatial weight matrix that results from applying a spatial attention module,
wherein the right-eye patch and the left-eye patch are normalized patch images cropped to a predetermined size around the center (p), with the head roll corresponding to the reference point (x) removed, and
wherein the gaze angle is yaw and pitch.

2. The method of claim 1,
wherein the gaze angle estimation model includes three fully connected layers and a linear regression module, and
wherein the right-eye feature vector, the left-eye feature vector, and the head pose are combined through the three fully connected layers and then applied to the trained linear regression module to estimate the gaze angle.

3. The method of claim 2,
wherein the linear regression module is trained to minimize a loss function calculated using the predicted gaze angle and the actual gaze angle.

(deleted)

A computer-readable recording medium product on which program code for performing the method according to any one of claims 1 to 3 is recorded.

7. A gaze prediction apparatus comprising:
a preprocessor that detects each eye region in a visible-light input image and normalizes it to extract a right-eye patch and a left-eye patch, respectively;
a feature extractor including a right-eye feature extraction module that analyzes the right-eye patch to extract a right-eye feature vector, and a left-eye feature extraction module that analyzes the left-eye patch to extract a left-eye feature vector; and
a gaze angle estimator that estimates a gaze angle through linear regression analysis by applying the right-eye feature vector, the left-eye feature vector, and a head pose to a model,
wherein the right-eye feature extraction module and the left-eye feature extraction module each generate, as the feature vector, a weighted activation map obtained through a sigmoid operation on the result of passing the right-eye patch or the left-eye patch through five convolution layers and the spatial weight matrix that results from applying a spatial attention module,
wherein the preprocessor detects a face region in the input image, detects both eye regions in the detected face region, removes the head roll corresponding to the reference point (x), and crops to a predetermined size around the center (p) and normalizes to extract the right-eye patch and the left-eye patch, respectively, and
wherein the gaze angle is yaw and pitch.

8. The apparatus of claim 7,
wherein the model includes three fully connected layers and a linear regression module, and
wherein the right-eye feature vector, the left-eye feature vector, and the head pose are combined through the three fully connected layers and then applied to the trained linear regression module to estimate the gaze angle.

(deleted)