KR20240019030A

KR20240019030A - Learning method and learning device, and testing method and testing device for gaze detection model based on deep learning

Info

Publication number: KR20240019030A
Application number: KR1020230097379A
Authority: KR
Inventors: 이수민; 블라디미르 블라디미로비치 예게이; 황윤정
Original assignee: 주식회사 딥핑소스
Priority date: 2022-08-03
Filing date: 2023-07-26
Publication date: 2024-02-14

Abstract

The present invention relates to a method for training a deep learning-based gaze direction detection model that detects a person's gaze direction, the method comprising: (a) a step in which, when at least one first training image is acquired, a learning device inputs the first training image into a body convolutional layer so that the body convolutional layer applies at least one convolution operation to the first training image and generates at least one first body feature map in which body features of a first person included in the first training image are extracted, inputs the first body feature map into a body FC (Fully Connected) layer so that the body FC layer applies at least one FC operation to the first body feature map and outputs at least one piece of predicted body direction information predicting the direction in which the front of the first person's body faces, generates at least one body direction loss using the predicted body direction information and labeled body direction information included in first ground truth information corresponding to the first training image, and trains the body FC layer and the body convolutional layer using the body direction loss; and (b) a step in which, when at least one second training image is acquired, the learning device inputs the second training image into the body convolutional layer so that the body convolutional layer applies at least one convolution operation to the second training image and generates at least one second body feature map in which body features of a second person included in the second training image are extracted, inputs the second training image into a head convolutional layer so that the head convolutional layer applies at least one convolution operation to the second training image and generates at least one first head feature map in which head features of the second person are extracted, concatenates the second body feature map and the first head feature map to generate a first integrated feature map, inputs the first integrated feature map into a head FC layer so that the head FC layer applies at least one FC operation to the first integrated feature map and outputs at least one piece of first predicted head direction information predicting the direction in which the front of the second person's head faces, generates at least one head direction loss using the first predicted head direction information and labeled head direction information included in a second ground truth corresponding to the second training image, and trains the head FC layer and the head convolutional layer using the head direction loss.

Description

Learning method and learning device for training a deep learning-based gaze direction detection model that detects gaze direction, and testing method and testing device using the same {LEARNING METHOD AND LEARNING DEVICE, AND TESTING METHOD AND TESTING DEVICE FOR GAZE DETECTION MODEL BASED ON DEEP LEARNING}

The present invention relates to a method for training a deep learning-based gaze direction detection model that detects gaze direction, and more specifically to a learning method and learning device for training a deep learning-based gaze direction detection model that detects a person's gaze direction using the person's body direction information and head direction information, and to a testing method and testing device using the same.

Gaze direction information can be used in various fields. In marketing, for example, it can be used to analyze whether an advertisement is effective.

Conventionally, gaze detection went no further than capturing a user's facial image through a camera mounted on a user terminal and obtaining the user's gaze direction information from that facial image.

However, such a gaze direction detection method can be used only under extremely limited conditions (for example, when a user is watching specific content on a mobile phone equipped with a camera).

Another conventional gaze direction detection method detects the pupil in a facial image, obtained either by detecting the face in an image of a person or by photographing the person's face directly, then detects an illumination reflection point within the pupil and uses the detected reflection point to determine the gaze direction. It is therefore difficult to apply to images in which the pupil is not captured.

Accordingly, an improvement that solves the above problems is required.

An object of the present invention is to solve all of the problems described above.

Another object of the present invention is to detect gaze direction accurately.

Still another object of the present invention is to detect gaze direction accurately using head direction information and body direction information.

Still another object of the present invention is to support effective advertising to consumers by detecting gaze direction accurately.

The characteristic configuration of the present invention for achieving the objects described above and for realizing the characteristic effects described later is as follows.

According to one embodiment of the present invention, there is provided a method for training a deep learning-based gaze direction detection model that detects a person's gaze direction, the method comprising:

(a) a step in which, when at least one first training image is acquired, a learning device inputs the first training image into a body convolutional layer so that the body convolutional layer applies at least one convolution operation to the first training image and generates at least one first body feature map in which body features of a first person included in the first training image are extracted, inputs the first body feature map into a body FC (Fully Connected) layer so that the body FC layer applies at least one FC operation to the first body feature map and outputs at least one piece of predicted body direction information predicting the direction in which the front of the first person's body faces, generates at least one body direction loss using the predicted body direction information and labeled body direction information included in first ground truth information corresponding to the first training image, and trains the body FC layer and the body convolutional layer using the body direction loss; and

(b) a step in which, when at least one second training image is acquired, the learning device inputs the second training image into the body convolutional layer so that the body convolutional layer applies at least one convolution operation to the second training image and generates at least one second body feature map in which body features of a second person included in the second training image are extracted, inputs the second training image into a head convolutional layer so that the head convolutional layer applies at least one convolution operation to the second training image and generates at least one first head feature map in which head features of the second person are extracted, concatenates the second body feature map and the first head feature map to generate a first integrated feature map, inputs the first integrated feature map into a head FC layer so that the head FC layer applies at least one FC operation to the first integrated feature map and outputs at least one piece of first predicted head direction information predicting the direction in which the front of the second person's head faces, generates at least one head direction loss using the first predicted head direction information and labeled head direction information included in a second ground truth corresponding to the second training image, and trains the head FC layer and the head convolutional layer using the head direction loss.
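The data flow of steps (a) and (b) can be sketched as follows. This is a minimal illustration, not the patented implementation: each convolutional layer is reduced to a linear map over a flattened image, and the class name `GazeModel`, the dimensions, and the `TRAINABLE` table are hypothetical names introduced here. The table only records which layers the text says each step trains.

```python
import random


def linear(weights, x):
    """Fully connected operation: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_weights(rows, cols, rng):
    """Small random weight matrix (illustrative initialisation only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GazeModel:
    """Toy stand-in for the model; 'conv' layers are linear maps (assumption)."""

    def __init__(self, n_pixels, body_dim, head_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.params = {
            "body_conv": make_weights(body_dim, n_pixels, rng),
            "body_fc": make_weights(n_classes, body_dim, rng),
            "head_conv": make_weights(head_dim, n_pixels, rng),
            # The head FC layer consumes the concatenated (integrated) features.
            "head_fc": make_weights(n_classes, body_dim + head_dim, rng),
        }

    def body_direction(self, image):
        """Step (a) forward pass: image -> body features -> body direction."""
        body_feat = linear(self.params["body_conv"], image)
        return linear(self.params["body_fc"], body_feat)

    def head_direction(self, image):
        """Step (b) forward pass: concatenate body and head features."""
        body_feat = linear(self.params["body_conv"], image)
        head_feat = linear(self.params["head_conv"], image)
        integrated = body_feat + head_feat  # list concatenation
        return linear(self.params["head_fc"], integrated)


# Layers that the text names as trained by each step's loss.
TRAINABLE = {
    "a": ("body_conv", "body_fc"),  # body direction loss
    "b": ("head_conv", "head_fc"),  # head direction loss
}
```

Note that step (b) reuses the body convolutional layer to produce the second body feature map, while the text lists only the head FC layer and the head convolutional layer as being trained by the head direction loss.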

In the above embodiment, in step (b), the learning device may train the head FC layer and the head convolutional layer by further applying a loss weight to the head direction loss, applying "0" as the loss weight when the head direction loss is less than a preset threshold, and applying a preset real value greater than "0" as the loss weight when the head direction loss is greater than or equal to the threshold.
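The loss-weight rule above can be sketched as below. The text does not say how the weight is combined with the loss; this sketch assumes it multiplies the loss, so that losses under the threshold contribute nothing to training. The function names and the default `threshold` and `weight` values are hypothetical.

```python
def head_loss_weight(head_loss, threshold, weight):
    """Return 0 below the threshold, else the preset positive weight."""
    if weight <= 0:
        raise ValueError("the preset weight must be greater than 0")
    return 0.0 if head_loss < threshold else weight


def weighted_head_loss(head_loss, threshold=0.1, weight=1.0):
    """Head direction loss after weighting (multiplication is an assumption)."""
    return head_loss_weight(head_loss, threshold, weight) * head_loss
```

For example, with these illustrative defaults `weighted_head_loss(0.05)` is zero, while a loss equal to the threshold keeps its full weight because the rule uses "greater than or equal to".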

In the above embodiment, in step (b), the learning device may cause the head FC layer to output, as the predicted direction information, either (i) classification information that classifies which of preset head direction classes the direction faced by the front of the second person's head corresponds to, or (ii) regression information that regresses which direction, among a continuous set of candidate directions, the direction faced by the front of the second person's head corresponds to.
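The two output forms can be illustrated with a small decoding sketch. It assumes that the classification output is a list of per-class scores and that the regression output is a raw 2D or 3D vector to be normalised to unit length; neither detail is stated in the text, and both function names are hypothetical.

```python
import math


def decode_classification(scores):
    """Index of the highest-scoring preset head direction class."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_regression(raw):
    """Normalise a raw 2D or 3D output to a unit direction vector."""
    norm = math.sqrt(sum(v * v for v in raw))
    if norm == 0:
        raise ValueError("a zero vector has no direction")
    return [v / norm for v in raw]
```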

In the above embodiment, the predicted direction information may predict the direction in which the front of the second person's head faces in either a two-dimensional plane or a three-dimensional space corresponding to the second training image.

In the above embodiment, the first training image or the second training image may be generated by, in a captured or cropped image of a person, (i) labeling the person's body direction and gaze direction with a corresponding specific body direction class and a corresponding specific gaze direction class, respectively, selected from preset body direction classes and preset gaze direction classes in a two-dimensional plane or a three-dimensional space, or (ii) labeling the person's body direction and gaze direction as a body direction vector and a gaze direction vector, respectively, in a two-dimensional plane or a three-dimensional space.
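The relationship between the two labeling options can be sketched for the two-dimensional case. The example assumes eight equally wide direction classes, with class 0 centred on the positive x axis and indices increasing counter-clockwise; the number of classes and their layout are not given in the text, and `direction_class_2d` is a hypothetical helper.

```python
import math


def direction_class_2d(dx, dy, n_classes=8):
    """Map a 2D direction vector to one of n_classes angular bins."""
    if dx == 0 and dy == 0:
        raise ValueError("a zero vector has no direction")
    angle = math.atan2(dy, dx) % (2 * math.pi)  # angle in [0, 2*pi)
    width = 2 * math.pi / n_classes
    # Shift by half a bin so that each class is centred on its nominal angle.
    return int((angle + width / 2) // width) % n_classes
```

With this layout a vector pointing along the positive y axis falls in class 2, and a vector just below the positive x axis wraps around to class 0.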

In the above embodiment, the first training image or the second training image may be generated from an image of a person wearing a gyroscope sensor by using the person's sensed body direction information and sensed gaze direction information, obtained from the sensing information of the gyroscope sensor at the time of capture, to (i) label the image with a specific body direction class and a specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, from among preset body direction classes and preset gaze direction classes in a two-dimensional plane or a three-dimensional space, or (ii) label the sensed body direction information and the sensed gaze direction information in a two-dimensional plane or a three-dimensional space as the person's body direction vector and gaze direction vector, respectively.
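As an illustration of option (ii), the sketch below turns a sensed orientation into a three-dimensional direction vector label. It assumes the sensing information has already been reduced to yaw and pitch angles in degrees, which the text does not specify; the axis convention and the function name `yaw_pitch_to_vector` are likewise assumptions.

```python
import math


def yaw_pitch_to_vector(yaw_deg, pitch_deg):
    """Unit vector for a yaw (about the vertical axis) and pitch (elevation).

    Assumed convention: x forward, y left, z up; yaw 0 and pitch 0 point
    along +x.
    """
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    x = math.cos(pitch) * math.cos(yaw)
    y = math.cos(pitch) * math.sin(yaw)
    z = math.sin(pitch)
    return (x, y, z)
```

Dropping the `z` component and renormalising would give a two-dimensional label under the same assumed convention.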

In the above embodiment, the method may further comprise: (c) a step in which the learning device inputs at least one evaluation image into the body convolutional layer so that the body convolutional layer applies at least one convolution operation to the evaluation image and generates at least one third body feature map in which body features of a third person included in the evaluation image are extracted, inputs the evaluation image into the head convolutional layer so that the head convolutional layer applies at least one convolution operation to the evaluation image and generates at least one second head feature map in which head features of the third person are extracted, concatenates the third body feature map and the second head feature map to generate a second integrated feature map, inputs the second integrated feature map into the head FC layer so that the head FC layer applies at least one FC operation to the second integrated feature map and outputs at least one piece of second predicted head direction information predicting the direction in which the front of the third person's head faces, and evaluates the gaze direction detection model, which includes the body convolutional layer, the head convolutional layer, and the head FC layer, by referring to the second predicted head direction information and a third ground truth corresponding to the evaluation image.

In the above embodiment, the learning device may compute an accuracy from the second predicted head direction information and the third ground truth using an equation [equation not reproduced in the source text], and may evaluate the gaze direction detection model using the computed accuracy. In that equation, N is the total number of pieces of second predicted head direction information used for the evaluation, "# of predicted soft corrects" is the number of pieces of second predicted head direction information that did not exactly predict the labeled answer, and "# of predicted corrects" is the number that exactly predicted the labeled answer.

According to another embodiment of the present invention, there is provided a method for training a deep learning-based gaze direction detection model that detects a person's gaze direction, the method comprising:

(a) a step in which, when at least one training image is acquired, a learning device inputs the training image into a body convolutional layer so that the body convolutional layer applies at least one convolution operation to the training image and generates at least one body feature map in which body features of a person included in the training image are extracted, and inputs the training image into a head convolutional layer so that the head convolutional layer applies at least one convolution operation to the training image and generates at least one head feature map in which head features of the person are extracted;

(b) a step in which the learning device inputs the body feature map into a body FC layer so that the body FC layer applies at least one FC operation to the body feature map and outputs at least one piece of predicted body direction information predicting the direction in which the front of the person's body faces, and inputs an integrated feature map, obtained by concatenating the body feature map and the head feature map, into a head FC layer so that the head FC layer applies at least one FC operation to the integrated feature map and outputs at least one piece of predicted head direction information predicting the direction in which the front of the person's head faces; and

(c) a step in which the learning device generates at least one body direction loss using the predicted body direction information and labeled body direction information included in ground truth information corresponding to the training image, generates at least one head direction loss using the predicted head direction information and labeled head direction information included in the ground truth, trains the body FC layer and the body convolutional layer using the body direction loss, and trains the head FC layer and the head convolutional layer using the head direction loss.
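Step (c) of this embodiment computes both losses from a single training image. The sketch below assumes the classification form of the outputs and a softmax cross-entropy loss; the text does not name a loss function, so this choice, the `gt` dictionary layout, and the `UPDATED_BY` table are illustrative assumptions.

```python
import math


def cross_entropy(scores, label):
    """Softmax cross-entropy of per-class scores against a class index."""
    top = max(scores)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(v - top) for v in scores))
    return log_sum - scores[label]


def joint_losses(body_scores, head_scores, gt):
    """Both direction losses for one image; gt holds the labeled classes."""
    return {
        "body_direction_loss": cross_entropy(body_scores, gt["body"]),
        "head_direction_loss": cross_entropy(head_scores, gt["head"]),
    }


# Layers that the text says each loss is used to train.
UPDATED_BY = {
    "body_direction_loss": ("body_conv", "body_fc"),
    "head_direction_loss": ("head_conv", "head_fc"),
}
```

Unlike the first embodiment, which trains on first and second training images in separate steps, here one labeled image supplies both the body and the head ground truth.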

In the other embodiment, the training image may be generated by, in a captured or cropped image of a person, (i) labeling the person's body direction and gaze direction with a corresponding specific body direction class and a corresponding specific gaze direction class, respectively, selected from preset body direction classes and preset gaze direction classes in a two-dimensional plane or a three-dimensional space, or (ii) labeling the person's body direction and gaze direction as a body direction vector and a gaze direction vector, respectively, in a two-dimensional plane or a three-dimensional space.

In the other embodiment, the training image may be generated from an image of a person wearing a gyroscope sensor by using the person's sensed body direction information and sensed gaze direction information, obtained from the sensing information of the gyroscope sensor at the time of capture, to (i) label the image with a specific body direction class and a specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, from among preset body direction classes and preset gaze direction classes in a two-dimensional plane or a three-dimensional space, or (ii) label the sensed body direction information and the sensed gaze direction information in a two-dimensional plane or a three-dimensional space as the person's body direction vector and gaze direction vector, respectively.

According to still another embodiment of the present invention, there is provided a learning device for training a deep learning-based gaze direction detection model that detects a person's gaze direction, the learning device comprising: a memory storing instructions for training the deep learning-based gaze direction detection model; and a processor that performs operations for training the gaze direction detection model according to the instructions stored in the memory, wherein the processor performs:

(I) a process of, when at least one first training image is acquired, inputting the first training image into a body convolutional layer so that the body convolutional layer applies at least one convolution operation to the first training image and generates at least one first body feature map in which body features of a first person included in the first training image are extracted, inputting the first body feature map into a body FC (Fully Connected) layer so that the body FC layer applies at least one FC operation to the first body feature map and outputs at least one piece of predicted body direction information predicting the direction in which the front of the first person's body faces, generating at least one body direction loss using the predicted body direction information and labeled body direction information included in first ground truth information corresponding to the first training image, and training the body FC layer and the body convolutional layer using the body direction loss; and

(II) a process of, when at least one second training image is acquired, inputting the second training image into the body convolutional layer so that the body convolutional layer applies at least one convolution operation to the second training image and generates at least one second body feature map in which body features of a second person included in the second training image are extracted, inputting the second training image into a head convolutional layer so that the head convolutional layer applies at least one convolution operation to the second training image and generates at least one first head feature map in which head features of the second person are extracted, concatenating the second body feature map and the first head feature map to generate a first integrated feature map, inputting the first integrated feature map into a head FC layer so that the head FC layer applies at least one FC operation to the first integrated feature map and outputs at least one piece of first predicted head direction information predicting the direction in which the front of the second person's head faces, generating at least one head direction loss using the first predicted head direction information and labeled head direction information included in a second ground truth corresponding to the second training image, and training the head FC layer and the head convolutional layer using the head direction loss.

In the still other embodiment, in process (II), the processor may train the head FC layer and the head convolutional layer by further applying a loss weight to the head direction loss, applying "0" as the loss weight when the head direction loss is less than a preset threshold, and applying a preset real value greater than "0" as the loss weight when the head direction loss is greater than or equal to the threshold.

In the still other embodiment, in process (II), the processor may cause the head FC layer to output, as the predicted direction information, either (i) classification information that classifies which of preset head direction classes the direction faced by the front of the second person's head corresponds to, or (ii) regression information that regresses which direction, among a continuous set of candidate directions, the direction faced by the front of the second person's head corresponds to.

In the still other embodiment, the predicted direction information may predict the direction in which the front of the second person's head faces in either a two-dimensional plane or a three-dimensional space corresponding to the second training image.

In the still other embodiment, the first training image or the second training image may be generated by, in a captured or cropped image of a person, (i) labeling the person's body direction and gaze direction with a corresponding specific body direction class and a corresponding specific gaze direction class, respectively, selected from preset body direction classes and preset gaze direction classes in a two-dimensional plane or a three-dimensional space, or (ii) labeling the person's body direction and gaze direction as a body direction vector and a gaze direction vector, respectively, in a two-dimensional plane or a three-dimensional space.

In the still other embodiment, the first training image or the second training image may be generated from an image of a person wearing a gyroscope sensor by using the person's sensed body direction information and sensed gaze direction information, obtained from the sensing information of the gyroscope sensor at the time of capture, to (i) label the image with a specific body direction class and a specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, from among preset body direction classes and preset gaze direction classes in a two-dimensional plane or a three-dimensional space, or (ii) label the sensed body direction information and the sensed gaze direction information in a two-dimensional plane or a three-dimensional space as the person's body direction vector and gaze direction vector, respectively.

In another embodiment, the processor may further perform (III) a process of inputting at least one evaluation image to the body convolutional layer to cause the body convolutional layer to perform a convolutional operation on the evaluation image at least once, thereby generating at least one third body feature map in which body features of a third person included in the evaluation image are extracted; inputting the evaluation image to the head convolutional layer to cause the head convolutional layer to perform a convolutional operation on the evaluation image at least once, thereby generating at least one second head feature map in which head features of the third person are extracted; concatenating the third body feature map and the second head feature map to generate a second integrated feature map; inputting the second integrated feature map to the head FC layer to cause the head FC layer to perform an FC operation on the second integrated feature map at least once, thereby outputting at least one piece of second predicted head direction information predicting the direction in which the front of the head of the third person faces; and evaluating the gaze direction detection model, which includes the body convolutional layer, the head convolutional layer, and the head FC layer, with reference to the second predicted head direction information and a third ground truth corresponding to the evaluation image.

In another embodiment, the processor may calculate an accuracy from the second predicted head direction information and the third ground truth through an equation, and may evaluate the gaze direction detection model using the calculated accuracy. In the equation, N may be the total number of pieces of the second predicted head direction information used for the evaluation, "# of predicted soft corrects" may be the number of pieces of the second predicted head direction information that did not exactly predict the labeled correct answer, and "# of predicted corrects" may be the number that exactly predicted the labeled correct answer.

According to another embodiment of the present invention, there is provided a learning device for learning a deep learning-based gaze direction detection model that detects a person's gaze direction, the learning device including: a memory in which instructions for learning the gaze direction detection model are stored; and a processor that performs operations for learning the gaze direction detection model according to the instructions stored in the memory. The processor performs: (I) a process of, when at least one learning image is acquired, inputting the learning image to a body convolutional layer to cause the body convolutional layer to perform a convolutional operation on the learning image at least once, thereby generating at least one body feature map in which body features of a person included in the learning image are extracted, and inputting the learning image to a head convolutional layer to cause the head convolutional layer to perform a convolutional operation on the learning image at least once, thereby generating at least one head feature map in which head features of the person are extracted; (II) a process of inputting the body feature map to a body FC layer to cause the body FC layer to perform an FC operation on the body feature map at least once, thereby outputting at least one piece of predicted body direction information predicting the direction in which the front of the person's body faces, and inputting an integrated feature map, obtained by concatenating the body feature map and the head feature map, to a head FC layer to cause the head FC layer to perform an FC operation on the integrated feature map at least once, thereby outputting at least one piece of predicted head direction information predicting the direction in which the front of the person's head faces; and (III) a process of generating at least one body direction loss using the predicted body direction information and labeled body direction information included in ground truth information corresponding to the learning image, generating at least one head direction loss using the predicted head direction information and labeled head direction information included in the ground truth, training the body FC layer and the body convolutional layer using the body direction loss, and training the head FC layer and the head convolutional layer using the head direction loss.

In another embodiment, the learning image may be generated from a photographed or cropped person image by (i) labeling the body direction and the gaze direction of the person with a specific body direction class and a specific gaze direction class, respectively, selected from among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling the body direction and the gaze direction of the person with a body direction vector and a gaze direction vector, respectively, in a two-dimensional plane or three-dimensional space.

According to another embodiment, the learning image may be generated from a person image of a person wearing a gyroscope sensor, using sensed body direction information and sensed gaze direction information of the person obtained from the sensing information of the gyroscope sensor at the time of capture, by (i) labeling the image with a specific body direction class and a specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, from among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling the sensed body direction information and the sensed gaze direction information in a two-dimensional plane or three-dimensional space as the body direction vector and the gaze direction vector of the person, respectively.

In addition, a computer-readable recording medium on which a computer program for executing the method of the present invention is recorded is further provided.

The present invention has the effect of accurately detecting the gaze direction.

In addition, the present invention has the effect of accurately detecting the gaze direction using head direction information and body direction information.

In addition, the present invention has the effect of supporting effective advertising to consumers by accurately detecting the gaze direction.

The following drawings, attached for use in describing embodiments of the present invention, show only some of the embodiments of the present invention, and a person having ordinary skill in the art to which the present invention pertains (hereinafter, "a person skilled in the art") can obtain other drawings based on these drawings without inventive work.
Figure 1 schematically shows a learning device for learning a deep learning-based gaze direction detection model that detects the gaze direction according to the present invention;
Figure 2 is a diagram for explaining a learning method for learning a deep learning-based gaze direction detection model that detects the gaze direction according to the present invention;
Figure 3 is a diagram for explaining a learning method for learning a deep learning-based gaze direction detection model that detects the gaze direction according to the present invention;
Figure 4 schematically shows a test device for testing a deep learning-based gaze direction detection model that detects the gaze direction according to the present invention;
Figure 5 is a diagram for explaining a test method for testing a deep learning-based gaze direction detection model that detects the gaze direction according to the present invention; and
Figure 6 schematically shows a learning image used for learning a deep learning-based gaze direction detection model that detects the gaze direction according to the present invention.

The detailed description of the present invention given below refers to the accompanying drawings, which show, by way of example, specific embodiments in which the present invention may be practiced, in order to make clear the objects, technical solutions, and advantages of the present invention. These embodiments are described in sufficient detail to enable a person skilled in the art to practice the present invention.

In addition, throughout the detailed description and claims of the present invention, the word "comprise" and its variations are not intended to exclude other technical features, additions, components, or steps. Other objects, advantages, and features of the present invention will become apparent to a person skilled in the art, partly from this description and partly from practice of the present invention. The examples and drawings below are provided by way of illustration and are not intended to limit the present invention.

Moreover, the present invention encompasses all possible combinations of the embodiments presented in this specification. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, a specific shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the detailed description below is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions across the various aspects.

The title and abstract of the present invention provided herein are for convenience only and neither limit nor interpret the scope or meaning of the embodiments.

The following description takes the detection of a pedestrian's gaze as an example, but the present invention is not limited thereto and may also be applied to persons other than pedestrians.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person having ordinary skill in the art to which the present invention pertains can readily practice the present invention.

Figure 1 schematically shows a learning device 1000 for learning a deep learning-based gaze direction detection model that detects the gaze direction according to an embodiment of the present invention. The learning device 1000 may include a memory 1001 in which instructions for learning the gaze direction detection model are stored, and a processor 1002 that performs operations for learning the gaze direction detection model in accordance with the instructions stored in the memory 1001. Here, the gaze direction detection model may include a body convolutional layer, a head convolutional layer, and a head FC (fully connected) layer, which are described in detail in the learning method below.

Specifically, the learning device 1000 may typically achieve the desired system performance by using a combination of a computing device (e.g., a device that may include a computer processor, memory, storage, an input device, an output device, and other components of conventional computing devices; an electronic communication device such as a router or a switch; an electronic information storage system such as network-attached storage (NAS) or a storage area network (SAN)) and computer software (i.e., instructions that cause the computing device to function in a particular way).

In addition, the processor of the computing device may include hardware components such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit), a cache memory, and a data bus. The computing device may further include software components such as an operating system and applications that serve specific purposes.

However, this does not exclude the case where the computing device includes an integrated processor in which a medium, a processor, and a memory for implementing the present invention are integrated.

A learning method for learning the gaze direction detection model using the learning device 1000 configured as described above is described below with reference to Figures 2 and 3.

For reference, even where a component is described below in the singular, this does not exclude the possibility of a plurality.

For reference, it is generally easier to predict the direction in which a pedestrian's body is heading than to predict the direction in which the pedestrian's face is pointing. This is because the heading of the body can be predicted from abundant information such as the pedestrian's arms, legs, hands, and feet.

Accordingly, to increase learning efficiency, the learning device 1000 according to an embodiment of the present invention may first learn the parameters of the body convolutional layer 1100, which extracts features related to the direction in which the pedestrian's body is heading, and then, with the parameters of the body convolutional layer 1100 fixed, learn the parameters of the head convolutional layer and the head FC layer, which extract features related to the direction in which the pedestrian's face is pointing.

However, the present invention is not limited thereto, and the learning device 1000 according to the present invention may also learn the parameters of the body convolutional layer 1100, the head convolutional layer, and the head FC layer simultaneously.

For convenience of explanation, the description below first covers the case in which the parameters of the body convolutional layer 1100, which extracts features related to the direction in which the pedestrian's body is heading, are learned first, and the parameters of the head convolutional layer and the head FC layer, which extract features related to the direction in which the pedestrian's face is pointing, are then learned with the parameters of the body convolutional layer 1100 fixed. The case in which the parameters of the body convolutional layer 1100, the head convolutional layer, and the head FC layer are learned simultaneously is described afterwards.

Referring to Figure 2, when at least one first learning image (e.g., a pedestrian image) is acquired, the learning device 1000 may input the first learning image to the body convolutional layer 1100 and cause the body convolutional layer 1100 to apply a convolutional operation to the first learning image, thereby generating at least one first body feature map. That is, the learning device 1000 may cause the body convolutional layer 1100 to perform a convolutional operation on the first learning image at least once to generate at least one first body feature map in which body features of a first person included in the first learning image are extracted.

For reference, the first learning image and the second learning image described later may be separate learning images, but are not limited thereto, and may be the same learning image.

In addition, the first learning image, the second learning image described later, and the test image described later may be images acquired through CCTV, but are not limited thereto.

Next, the learning device 1000 may input the first body feature map to the body FC (fully connected) layer 1200 and cause the body FC layer 1200 to apply an FC operation to the first body feature map, thereby outputting at least one piece of predicted body direction information. That is, the learning device 1000 may cause the body FC layer 1200 to perform an FC operation on the first body feature map at least once to output at least one piece of predicted body direction information predicting the direction in which the front of the first person's body faces.

Here, the learning device 1000 may cause the body FC layer 1200 to output, as the predicted body direction information, classification information, that is, one of a set of discrete values, but is not limited thereto, and may instead cause it to output regression information, that is, one of a range of continuous values.

That is, the learning device 1000 may cause the body FC layer 1200 to output, as the predicted body direction information, either classification information that classifies which of the preset body direction classes the direction in which the front of the first person's body faces corresponds to, or regression information that regresses which direction among a continuous set of candidate directions the direction in which the front of the first person's body faces corresponds to.

As one example, the predicted body direction information may be classification information that, centered on the body of the first person, that is, the pedestrian, and taking the case where the front of the pedestrian's body faces the camera as the reference (e.g., South), classifies which of eight preset body direction classes on a two-dimensional plane (i.e., S, SE, E, NE, N, NW, W, SW) the direction in which the front of the pedestrian's body faces corresponds to.
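The eight-class scheme above can be sketched as a simple angle-binning function. This is only a minimal illustration: the class order is taken from the list in the text, but the 45-degree bin width centered on each class, the rotation sense (S, then SE, then E as the angle increases), and the name `angle_to_class` are assumptions that the specification does not state.

```python
# Class order follows the list in the text; the bin layout is an assumption.
BODY_DIRECTION_CLASSES = ["S", "SE", "E", "NE", "N", "NW", "W", "SW"]

def angle_to_class(angle_deg: float) -> str:
    """Map an angle in degrees (0 = facing the camera, i.e. S) to the nearest class."""
    num_classes = len(BODY_DIRECTION_CLASSES)
    bin_width = 360.0 / num_classes  # 45 degrees per class
    # Shift by half a bin so that each class is centered on its nominal angle.
    index = int(((angle_deg % 360.0) + bin_width / 2) // bin_width) % num_classes
    return BODY_DIRECTION_CLASSES[index]
```

Under these assumptions, an angle of 0 degrees falls in class S and an angle of 90 degrees falls in class E.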

As another example, the predicted body direction information may be regression information that, centered on the body of the first person, that is, the pedestrian, and taking the case where the front of the pedestrian's body faces the camera as the reference (i.e., 0 degrees), regresses which direction (e.g., 150.1482 degrees) among the 360 degrees of directions on a two-dimensional plane the front of the pedestrian's body faces.
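When directions are expressed as angles, comparing a predicted angle with a labeled angle has to account for wrap-around at 360 degrees. The sketch below shows one common way to do this; the specification does not state how the regression error is measured, so the function `angular_error_deg` is a hypothetical illustration rather than the method of the invention.

```python
def angular_error_deg(predicted_deg: float, labeled_deg: float) -> float:
    """Smallest absolute difference between two directions, in the range [0, 180]."""
    diff = (predicted_deg - labeled_deg) % 360.0
    # Going the other way around the circle may be shorter.
    return min(diff, 360.0 - diff)
```

For instance, a prediction of 350 degrees against a label of 10 degrees is treated as an error of 20 degrees, not 340 degrees.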

Meanwhile, although the predicted body direction information has been described above as classification information or regression information on a two-dimensional plane, it may also be classification information or regression information in a three-dimensional space.

That is, the predicted body direction information may be, centered on the body of the first person, that is, the pedestrian, and taking the case where the front of the pedestrian's body faces the camera as the reference, either classification information that classifies which of 26 body direction classes preset in radial directions in a three-dimensional space the front of the pedestrian's body corresponds to (i.e., in addition to the eight preset body direction classes on the two-dimensional plane, eight upward body direction classes, eight downward body direction classes, an up direction class, and a down direction class), or regression information that regresses which direction, among directions having continuous values in a three-dimensional spherical coordinate system, the front of the pedestrian's body faces.
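The count of 26 classes matches the number of nonzero vectors whose components are each -1, 0, or 1 (3 × 3 × 3 − 1 = 26). The following sketch enumerates such vectors and assigns an arbitrary direction to the closest one. It is an illustrative assumption that the 26 classes are laid out this way; the axis convention (y as the vertical axis) and the function names are likewise hypothetical.

```python
import itertools
import math

def build_direction_classes():
    """Return the 26 unit vectors with components in {-1, 0, 1}, excluding the zero vector."""
    classes = []
    for x, y, z in itertools.product((-1, 0, 1), repeat=3):
        if (x, y, z) == (0, 0, 0):
            continue  # the zero vector is not a direction
        norm = math.sqrt(x * x + y * y + z * z)
        classes.append((x / norm, y / norm, z / norm))
    return classes

def nearest_class(direction, classes):
    """Index of the class vector with the largest dot product with `direction`."""
    dots = [sum(d * c for d, c in zip(direction, cls)) for cls in classes]
    return max(range(len(classes)), key=dots.__getitem__)
```

With y as the vertical axis, the eight vectors with y = 0 correspond to the horizontal classes, the vectors with y = 1 to the eight upward classes plus the up class, and the vectors with y = -1 to the eight downward classes plus the down class.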

As a result, depending on the configuration of the body FC layer, the predicted body direction information may be any one of classification information on a two-dimensional plane, regression information on a two-dimensional plane, classification information in a three-dimensional space, and regression information in a three-dimensional space.

Referring again to Figure 2, the learning device 1000 may generate at least one body direction loss using the at least one piece of predicted body direction information and labeled body direction information included in a first ground truth corresponding to the first learning image, and may learn at least some of the parameters of the body convolutional layer 1100 and the body FC layer 1200 by performing backpropagation using the body direction loss. That is, the learning device 1000 may train the body convolutional layer 1100 and the body FC layer 1200 using the body direction loss.
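The specification does not state which loss function is used for the body direction loss. For the classification case, one common choice is softmax cross-entropy between the scores output by the FC layer and the labeled class; the sketch below shows that choice purely as an assumption, with the hypothetical name `cross_entropy`.

```python
import math

def cross_entropy(logits, labeled_class):
    """Softmax cross-entropy between FC-layer scores and a labeled class index.

    `logits` holds one raw score per direction class (e.g., 8 values for the
    2D case); `labeled_class` is the index of the labeled direction class.
    """
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[labeled_class]
```

When all eight scores are equal, the loss equals the natural logarithm of 8, and it decreases as the score of the labeled class grows relative to the others.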

After the parameters of the body convolutional layer 1100 and the body FC layer 1200 have been learned as described above, the learning device 1000 may learn at least some of the parameters of the head convolutional layer and the head FC layer while keeping the parameters of the body convolutional layer 1100 fixed.
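The two-stage schedule described above, in which the body layers are trained first and the body convolutional layer is then held fixed, can be sketched with a toy parameter container. The `Layer` class, the `frozen` flag, and the plain gradient-descent update in `sgd_step` are hypothetical stand-ins; the computation of gradients by backpropagation is omitted, and the specification does not prescribe a particular optimizer.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Toy stand-in for a network layer: a name and a flat list of parameters."""
    name: str
    params: list
    frozen: bool = False

def sgd_step(layers, grads, lr=0.01):
    """Apply one gradient-descent step, skipping layers whose parameters are fixed."""
    for layer in layers:
        if layer.frozen:
            continue  # fixed parameters are not updated
        layer.params = [p - lr * g for p, g in zip(layer.params, grads[layer.name])]

# Stage 1: body layers are trainable (gradients would come from the body direction loss).
body_conv = Layer("body_conv", [0.5, -0.2])
body_fc = Layer("body_fc", [0.1])
sgd_step([body_conv, body_fc], {"body_conv": [1.0, 1.0], "body_fc": [1.0]})

# Stage 2: fix the body convolutional layer, then train only the head layers.
body_conv.frozen = True
head_conv = Layer("head_conv", [0.3])
head_fc = Layer("head_fc", [0.7])
stage1_body_params = list(body_conv.params)
sgd_step(
    [body_conv, head_conv, head_fc],
    {"body_conv": [1.0, 1.0], "head_conv": [1.0], "head_fc": [1.0]},
)
```

After the second stage, `body_conv.params` is unchanged from its stage-1 value, while the head layers have been updated.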

For reference, as described above, a pedestrian's eyes (pupils) are often not captured in the image. According to an embodiment of the present invention, the direction in which the pedestrian's face is pointing is regarded as the direction of the pedestrian's gaze, so the pedestrian's gaze direction can be accurately predicted regardless of the various environments in which images are captured.

Referring to Figure 3, when at least one second learning image (e.g., a pedestrian image) is acquired, the learning device 1000 may input the second learning image to the head convolutional layer 1300 and the body convolutional layer 1100 and cause each of the head convolutional layer 1300 and the body convolutional layer 1100 to apply a convolutional operation to the second learning image, thereby generating at least one first head feature map and at least one second body feature map, respectively.

That is, the learning device 1000 may input the second learning image to the body convolutional layer 1100 to cause the body convolutional layer 1100 to perform a convolutional operation on the second learning image at least once, thereby generating at least one second body feature map in which body features of a second person included in the second learning image are extracted, and may input the second learning image to the head convolutional layer 1300 to cause the head convolutional layer 1300 to perform a convolutional operation on the second learning image at least once, thereby generating at least one first head feature map in which head features of the second person are extracted.

For reference, the body FC layer 1200 described with reference to Figure 2 is used only in the process of learning the parameters of the body convolutional layer 1100 and, as shown in Figure 3, may not be used in the process of training the head convolutional layer 1300 and the head FC layer 1400. That is, the gaze direction detection model according to the present invention may consist of the body convolutional layer 1100, the head convolutional layer 1300, and the head FC layer 1400, and the body FC layer 1200 may be used only during the learning of the gaze direction detection model.

Next, the learning device 1000 may concatenate the first head feature map and the second body feature map to generate a first integrated feature map, and may input the first integrated feature map into the head FC (fully connected) layer 1400 so that the head FC layer 1400 applies the FC operation to the first integrated feature map at least once and outputs at least one piece of first predicted head direction information predicting the direction in which the front of the second person's head faces.
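The concatenation and FC step can be illustrated with a minimal, framework-free sketch. It is not the implementation of the invention: flattening the feature maps before joining them, the softmax over class scores, the tiny 1 x 2 x 2 feature maps, and all weight values are assumptions chosen only to make the data flow visible.

```python
import math

def flatten(feature_map):
    """Flatten a channels x height x width nested list into a flat list."""
    return [v for channel in feature_map for row in channel for v in row]

def concatenate(head_map, body_map):
    """Join the two flattened feature maps into one integrated feature vector."""
    return flatten(head_map) + flatten(body_map)

def fully_connected(x, weights, bias):
    """One FC operation: y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn class scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 1 x 2 x 2 feature maps standing in for the conv-layer outputs.
head_map = [[[0.1, 0.4], [0.3, 0.2]]]
body_map = [[[0.5, 0.0], [0.2, 0.7]]]
integrated = concatenate(head_map, body_map)  # 4 head values followed by 4 body values

# Hypothetical FC parameters for 8 head direction classes.
weights = [[0.1 * ((i + j) % 3) for j in range(len(integrated))] for i in range(8)]
bias = [0.0] * 8
probs = softmax(fully_connected(integrated, weights, bias))
predicted_class = probs.index(max(probs))
```

Because the body features enter the head FC layer alongside the head features, the head direction prediction can draw on body orientation cues as well as head cues.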

Here, the learning device 1000 may cause the head FC layer 1400 to output classification information, i.e., one of a set of discrete values, as the predicted head direction information; however, the invention is not limited thereto, and the head FC layer 1400 may instead output regression information, i.e., a value from a continuous range.

That is, the learning device 1000 may cause the head FC layer 1400 to output, as the predicted head direction information, either (i) classification information indicating which of the preset head direction classes corresponds to the direction in which the front of the second person's head faces, or (ii) regression information indicating which direction among a continuous set of candidate directions corresponds to that facing direction.

As one example, the predicted head direction information may be classification information that, taking the head of the second person (i.e., the pedestrian) as the center and the state in which the front of the pedestrian's head faces the camera as the reference (e.g., South), classifies the direction in which the front of the pedestrian's head faces into one of eight preset head direction classes on a two-dimensional plane (i.e., S, SE, E, NE, N, NW, W, SW). The classification may further include a default direction class for a facing direction that corresponds to none of the eight head direction classes, for example the case where the pedestrian is looking at himself or herself.
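As a hedged illustration of the eight-class scheme, the sketch below maps a planar angle to one of the classes. The text fixes only the reference (facing the camera is South) and the class names; measuring the angle in degrees from that reference, letting it increase in the listed order S, SE, E, ..., centering each 45-degree sector on its class, and using `None` to signal the default case are all assumptions of this example.

```python
HEAD_CLASSES = ["S", "SE", "E", "NE", "N", "NW", "W", "SW"]
DEFAULT_CLASS = "DEFAULT"  # hypothetical name for the default direction class

def angle_to_head_class(angle_deg):
    """Map an angle (0 = facing the camera = South) to one of 8 classes.

    Each class covers a 45-degree sector centered on its nominal direction.
    None stands for a direction that fits none of the eight classes.
    """
    if angle_deg is None:
        return DEFAULT_CLASS
    sector = int((angle_deg % 360.0 + 22.5) // 45.0) % 8
    return HEAD_CLASSES[sector]

labels = [angle_to_head_class(a) for a in (0.0, 44.0, 180.0, 350.0, None)]
```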

As another example, the predicted head direction information may be regression information that, taking the head of the second person (i.e., the pedestrian) as the center and the state in which the front of the head, i.e., the pedestrian's face, faces the camera as the reference (i.e., 0 degrees), regresses which direction within the 360 degrees of a two-dimensional plane (e.g., 150.1482 degrees) the front of the pedestrian's head faces. The regression may further include a default direction for a facing direction that corresponds to none of the 360-degree directions, for example the case where the pedestrian is looking at himself or herself.

Meanwhile, although the predicted head direction information has been described above as classification information or regression information on a two-dimensional plane, it may also be classification information or regression information in three-dimensional space.

That is, taking the head of the second person (i.e., the pedestrian) as the center and the state in which the front of the pedestrian's head faces the camera as the reference, the predicted head direction information may be either classification information indicating which of 26 head direction classes preset radially in three-dimensional space corresponds to the direction the front of the pedestrian's head faces (the 26 classes comprising the eight preset head direction classes on the two-dimensional plane, eight head direction classes in the upward direction, eight head direction classes in the downward direction, an upward class, and a downward class), or regression information indicating which direction, among directions taking continuous values in a three-dimensional spherical coordinate system, the front of the head faces.
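The count of 26 can be checked with a short sketch. Representing each class as a triple of signs (x, y, z) drawn from {-1, 0, 1}, with z as the vertical axis, is an assumption of this example and not a representation given in the text; it simply reproduces the 8 + 8 + 8 + 1 + 1 breakdown.

```python
from itertools import product

def direction_classes_3d():
    """All sign triples (x, y, z) in {-1, 0, 1}^3 except the zero vector."""
    return [v for v in product((-1, 0, 1), repeat=3) if v != (0, 0, 0)]

classes = direction_classes_3d()

# Group the classes the way the description enumerates them (z is vertical).
horizontal = [v for v in classes if v[2] == 0]
upward_tilted = [v for v in classes if v[2] == 1 and (v[0], v[1]) != (0, 0)]
downward_tilted = [v for v in classes if v[2] == -1 and (v[0], v[1]) != (0, 0)]
straight_up = [v for v in classes if v == (0, 0, 1)]
straight_down = [v for v in classes if v == (0, 0, -1)]
```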

Consequently, depending on the configuration of the head FC layer, the predicted head direction information may be any one of classification information on a two-dimensional plane, regression information on a two-dimensional plane, classification information in three-dimensional space, and regression information in three-dimensional space.

Referring again to FIG. 3, the learning device 1000 may generate at least one head direction loss by referring to the at least one piece of first predicted head direction information and the corresponding at least one piece of GT head direction information, and may learn at least some of the parameters of the head convolutional layer 1300 and the head FC layer 1400 by performing backpropagation using the head direction loss. That is, the learning device 1000 may generate at least one head direction loss using the first predicted head direction information and the labeled head direction information included in the second ground truth corresponding to the second training image, and may train the head FC layer 1400 and the head convolutional layer 1300 using the head direction loss.

Here, the learning device 1000 may use cross entropy as the loss function for generating the head direction loss; however, the present invention is not limited thereto, and the head direction loss may be generated using various other loss functions.
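For the classification case, a cross-entropy head direction loss for a single sample could be computed as in the sketch below. It assumes the head FC layer emits one raw score (logit) per direction class and that the ground truth is a class index; both are assumptions of the example.

```python
import math

def cross_entropy(logits, target_index):
    """Cross entropy of softmax(logits) against a one-hot target.

    Computed as log-sum-exp(logits) - logits[target] for numerical stability.
    """
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]

uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)   # equals log(4)
confident_right = cross_entropy([10.0, 0.0], 0)
confident_wrong = cross_entropy([10.0, 0.0], 1)
```

The loss is small when the score of the labeled class dominates and grows as the model becomes confidently wrong.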

Meanwhile, although the learning device 1000 has been described above as using the head direction loss as-is to learn at least some of the parameters of the head convolutional layer 1300 and the head FC layer 1400, it may alternatively apply a loss weight to the head direction loss when learning at least some of the parameters of the head FC layer 1400 and the head convolutional layer 1300.

As one example, a loss weight of "0" may be applied when the head direction loss is below a preset threshold, and a preset real value greater than "0" may be applied as the loss weight when the head direction loss is equal to or greater than the threshold.

As another example, a loss weight of "0" may be applied when the predicted head direction information is a false negative (FN), i.e., an answer that is actually True was predicted as False, and a preset real value greater than "0" may be applied as the loss weight when the predicted head direction information is a false positive (FP), i.e., an answer that is actually False was predicted as True.
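The two weighting examples can be sketched as follows. The text does not state how the weight is combined with the loss; multiplying the loss by the weight is an assumption here, as are the function names and the numeric values.

```python
def threshold_loss_weight(loss, threshold, weight):
    """First example: weight 0 below the threshold, a preset positive weight otherwise."""
    return 0.0 if loss < threshold else weight

def error_type_loss_weight(error_type, weight):
    """Second example: weight 0 for a false negative, a preset positive weight for a false positive."""
    if error_type == "FN":
        return 0.0
    if error_type == "FP":
        return weight
    raise ValueError("only 'FN' and 'FP' are covered by this example")

def weighted_loss(loss, loss_weight):
    """Assumed combination rule: scale the loss by its weight."""
    return loss_weight * loss

small = weighted_loss(0.05, threshold_loss_weight(0.05, threshold=0.1, weight=2.0))
large = weighted_loss(0.5, threshold_loss_weight(0.5, threshold=0.1, weight=2.0))
```

Under this combination rule, a sample whose weight is 0 contributes no gradient, so training concentrates on the samples that receive a positive weight.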

Next, the learning device 1000 may evaluate the performance of the gaze direction detection model, which includes the body convolutional layer 1100, the head convolutional layer 1300, and the head FC layer 1400, either during training or after training is complete.

As one example, the learning device 1000 may input at least one evaluation image into the body convolutional layer 1100 so that the body convolutional layer 1100 applies the convolutional operation to the evaluation image at least once and generates at least one third body feature map in which the body features of a third person included in the evaluation image are extracted, and may input the evaluation image into the head convolutional layer 1300 so that the head convolutional layer 1300 applies the convolutional operation to the evaluation image at least once and generates at least one second head feature map in which the head features of the third person are extracted. The learning device 1000 may then concatenate the third body feature map and the second head feature map to generate a second integrated feature map, and may input the second integrated feature map into the head FC layer 1400 so that the head FC layer 1400 applies the FC operation to the second integrated feature map at least once and outputs at least one piece of second predicted head direction information predicting the direction in which the front of the third person's head faces. Thereafter, the learning device 1000 may evaluate the gaze direction detection model, which includes the body convolutional layer 1100, the head convolutional layer 1300, and the head FC layer 1400, by referring to the second predicted head direction information and the third ground truth corresponding to the evaluation image.

Here, the learning device 1000 may compute an accuracy from the second predicted head direction information and the third ground truth using the following equation, and may evaluate the gaze direction detection model using the computed accuracy.

In the above equation, N is the total number of pieces of second predicted head direction information used for the evaluation, # of predicted soft corrects is the number of pieces of second predicted head direction information that did not exactly predict the labeled answer, and # of predicted corrects is the number that exactly predicted the labeled answer.

As one example, suppose the pieces of second predicted head direction information used for the evaluation consist of true positives (TP), in which an answer that is actually True is predicted as True (correct); false positives (FP), in which an answer that is actually False is predicted as True (incorrect); false negatives (FN), in which an answer that is actually True is predicted as False (incorrect); and true negatives (TN), in which an answer that is actually False is predicted as False (correct). Then N may be TP+FP+FN+TN, # of predicted corrects may be TP+TN, and # of predicted soft corrects may be FN.
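The three quantities in this example can be tallied as in the sketch below. It deliberately stops at the counts: how they are combined into the accuracy is defined by the equation itself and is not assumed here. The function name, dictionary keys, and input values are hypothetical.

```python
def evaluation_counts(tp, fp, fn, tn):
    """Tally the quantities named in the example from confusion-matrix counts."""
    return {
        "N": tp + fp + fn + tn,
        "predicted_corrects": tp + tn,
        "predicted_soft_corrects": fn,
    }

# Hypothetical confusion-matrix counts, for illustration only.
counts = evaluation_counts(tp=50, fp=5, fn=10, tn=35)
```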

Meanwhile, the above describes a process in which the body convolutional layer is trained first and the head convolutional layer and the head FC layer are trained afterwards. An alternative process, in which the body convolutional layer, the head convolutional layer, and the head FC layer are trained on the same training image, is described below. In the following description, detailed explanation is omitted for parts that can be readily understood from the description given with reference to FIGS. 2 and 3.

When at least one training image is acquired, the learning device may input the training image into the body convolutional layer so that the body convolutional layer applies the convolutional operation to the training image at least once and generates at least one body feature map in which the body features of a person included in the training image are extracted, and may input the training image into the head convolutional layer so that the head convolutional layer applies the convolutional operation to the training image at least once and generates at least one head feature map in which the head features of the person are extracted.

The learning device may then input the body feature map into the body FC layer so that the body FC layer applies the FC operation to the body feature map at least once and outputs at least one piece of predicted body direction information predicting the direction in which the front of the person's body faces.

In addition, the learning device may input an integrated feature map, obtained by concatenating the body feature map and the head feature map, into the head FC layer so that the head FC layer applies the FC operation to the integrated feature map at least once and outputs at least one piece of predicted head direction information predicting the direction in which the front of the person's head faces.

The learning device may then generate at least one body direction loss using the predicted body direction information and the labeled body direction information included in the ground truth corresponding to the training image, and may generate at least one head direction loss using the predicted head direction information and the labeled head direction information included in the ground truth.

Thereafter, the learning device may train the body FC layer and the body convolutional layer using the body direction loss, and may train the head FC layer and the head convolutional layer using the head direction loss.

With at least some of the parameters of the body convolutional layer 1100, the head convolutional layer 1300, and the head FC layer 1400 learned by the learning device 1000 in this way, the operation of the test device when a test image (e.g., a pedestrian image) is acquired will now be described with reference to FIGS. 4 and 5.

First, a test device 2000 for testing the gaze direction detection model that detects the gaze direction will be described with reference to FIG. 4.

The test device 2000 may include a memory 2001 that stores instructions for testing the gaze direction detection model that detects the gaze direction, and a processor 2002 that performs operations for testing the gaze direction detection model according to the instructions stored in the memory 2001.

Specifically, the test device 2000 may typically achieve the desired system performance using a combination of a computing device (e.g., a device that may include a computer processor, memory, storage, input and output devices, and other components of conventional computing devices; an electronic communication device such as a router or a switch; an electronic information storage system such as network-attached storage (NAS) or a storage area network (SAN)) and computer software (i.e., instructions that cause the computing device to function in a particular manner).

Here, the test device 2000 may be the same device as the learning device 1000 shown in FIG. 1, or the two may be separate devices.

The process by which the test device 2000 tests the gaze direction detection model is described in detail below with reference to FIG. 5.

For reference, description that duplicates or closely resembles what has already been described for the learning device 1000 is omitted.

First, with at least some of the parameters of the body convolutional layer 1100, the head convolutional layer 1300, and the head FC layer 1400 learned by the learning device 1000, the test device 2000 may acquire a test image.

The test device 2000 may then input the test image into the head convolutional layer 1300 and the body convolutional layer 1100, so that each layer applies its convolutional operation to the test image at least once, thereby generating at least one head feature map for testing and at least one body feature map for testing, respectively.

The test device 2000 may then concatenate the head feature map for testing and the body feature map for testing to generate an integrated feature map for testing, and may input the integrated feature map for testing into the head FC (fully connected) layer 1400 so that the head FC layer 1400 applies the FC operation to the integrated feature map for testing and outputs at least one piece of predicted head direction information for testing.

Meanwhile, the training image, the first training image, and the second training image used in training the gaze direction detection model described above, and the evaluation image used in evaluating the gaze direction detection model, are each labeled with a respective ground truth. A method of labeling a person image with its ground truth is described below.

First, to generate training images, at least one photographed or cropped person image may be collected.

Here, a person image may be obtained by photographing a single person, or, for an image containing multiple people, individual person images may be obtained from that single image using the bounding boxes produced by detecting the people through object detection. Each person image may then be labeled with the direction in which the front of the person's body faces and the direction in which the front of the person's head faces.
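Cropping per-person images from detector output might look like the following sketch. The image is modeled as a nested list of pixel rows and each bounding box as `(x_min, y_min, x_max, y_max)` with exclusive maxima; this box format and the function name are assumptions, and no particular object detector is implied.

```python
def crop_person_images(image, boxes):
    """Cut one sub-image per bounding box out of a row-major image.

    image: list of pixel rows; boxes: iterable of (x_min, y_min, x_max, y_max),
    where the max coordinates are exclusive.
    """
    return [
        [row[x_min:x_max] for row in image[y_min:y_max]]
        for (x_min, y_min, x_max, y_max) in boxes
    ]

# A hypothetical 4 x 6 image whose pixel values encode their own position.
image = [[10 * y + x for x in range(6)] for y in range(4)]
person_images = crop_person_images(image, [(0, 0, 2, 3), (3, 1, 6, 4)])
```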

However, the present invention is not limited thereto, and the direction in which the front of the person's body faces and the direction in which the front of the person's head faces may also be labeled on a video of the person.

As one example, the body direction and the gaze direction of the person in a person image may be labeled, respectively, with the specific body direction class and the specific gaze direction class that correspond to them among preset body direction classes and preset gaze direction classes on a two-dimensional plane or in three-dimensional space. If the specific body direction class and the specific gaze direction class are the same, only one direction class may be labeled.

Taking the two-dimensional plane as an example, with the state in which the front of the person's body or head faces the camera as the reference (e.g., South), the person image may be labeled with the specific direction class, among eight preset direction classes on the two-dimensional plane (i.e., S, SE, E, NE, N, NW, W, SW), that corresponds to the direction the front of the person's body or head faces. Alternatively, the person image may be labeled with the corresponding specific direction class among direction classes obtained by dividing the 360 degrees around the person's body or head by a unit angle. That is, dividing 360 degrees in 10-degree units yields 36 discrete direction classes, and dividing in 1-degree units yields 360 discrete direction classes; the person image may be labeled with the corresponding specific direction class among the classes so defined. In addition, the person image may be labeled using a further default direction class corresponding to cases such as the front of the person's head facing the person himself or herself.
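The unit-angle discretization can be sketched as below. The class counts (36 for 10-degree units, 360 for 1-degree units) follow the text; numbering the classes from 0 at the reference direction, letting each class start at a multiple of the unit angle, and restricting the unit to integer divisors of 360 are assumptions of the example.

```python
def num_direction_classes(unit_deg):
    """Number of discrete classes when 360 degrees is split into unit_deg steps."""
    if unit_deg <= 0 or 360 % unit_deg != 0:
        raise ValueError("unit_deg must be a positive integer divisor of 360")
    return 360 // unit_deg

def direction_class_index(angle_deg, unit_deg):
    """Index of the class whose interval [k * unit, (k + 1) * unit) contains the angle."""
    count = num_direction_classes(unit_deg)
    return int((angle_deg % 360.0) // unit_deg) % count

coarse = direction_class_index(150.1482, 10)  # class 15 of 36
fine = direction_class_index(150.1482, 1)     # class 150 of 360
```

A smaller unit angle gives finer labels at the cost of more classes and fewer training samples per class.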

Meanwhile, the above describes positive samples, in which the direction the front of the person's body or head faces is labeled correctly; however, person images may also be labeled as negative samples, with a direction different from the one the front of the person's body or head actually faces.

Taking three-dimensional space as an example, the person image may be labeled with the corresponding specific direction class among 26 head direction classes preset radially in three-dimensional space, namely: eight direction classes for eight directions in a reference plane corresponding to the person's height, eight upward direction classes for eight directions inclined upward, eight downward direction classes for eight directions inclined downward, an upward class pointing straight up, and a downward class pointing straight down. Alternatively, the person image may be labeled with the corresponding specific direction class among direction classes obtained by dividing a three-dimensional spherical coordinate system by unit coordinates. In addition, the person image may be labeled using a further default direction class corresponding to cases such as the front of the person's head facing the person himself or herself.

As another example, the body direction and the gaze direction of the person in a person image may be labeled as a body direction vector and a gaze direction vector, respectively, on a two-dimensional plane or in three-dimensional space.

Taking the two-dimensional plane as an example, the person image may be labeled with a specific direction vector corresponding to one direction within the continuous 360 degrees around the person's body or head. In addition, the person image may be labeled with a default direction vector corresponding to cases such as the front of the person's head facing the person himself or herself.

Taking three-dimensional space as an example, the person image may be labeled with a specific direction vector for the specific coordinate, among the continuous coordinates of a spherical coordinate system corresponding to the three-dimensional space, that corresponds to the direction the front of the person's body or head faces. In addition, the person image may be labeled using a further default direction vector corresponding to cases such as the front of the person's head facing the person himself or herself.

The above describes labeling the ground truth directly by viewing the person image; alternatively, when continuous direction information is available for the direction the front of the person's body or head faces, the person image may be labeled using that continuous direction information.

As one example, for a person image capturing a person wearing a gyroscope sensor, sensed body direction information and sensed gaze direction information of the person may be obtained from the gyroscope sensor's readings at the time of capture. Using this information, the image may be labeled with the specific body direction class and the specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, among preset body direction classes and preset gaze direction classes on a two-dimensional plane or in three-dimensional space; or the sensed body direction information and the sensed gaze direction information on a two-dimensional plane or in three-dimensional space may be labeled as the person's body direction vector and gaze direction vector, respectively.

A method of labeling the ground truth using a gyroscope sensor is described in more detail below with reference to FIG. 6. The description covers the process of labeling head direction information; the process of labeling body direction information is similar, so its detailed description is omitted.

FIG. 6 schematically shows a training image capturing a pedestrian who, of a first advertisement posted on a first pillar 610 to the pedestrian's left and a second advertisement posted on a second pillar 620, is looking at the second advertisement.

Here, (i) the direction 630 in which the pedestrian's body actually advances, i.e., the GT body direction information, may be a vector with components (x1, y1, z1), and (ii) the direction 640 in which the pedestrian's face actually points, i.e., the GT head direction information, may be a vector with components (x2, y2, z2).

For reference, as shown in FIG. 6, when the pedestrian walks on level ground, z1 of the GT body direction information will either be 0 (e.g., (x1, y1, z1) is (-1, 1, 0)) or be very small compared with x1 and/or y1 (e.g., (x1, y1, z1) is (-1, 1, 0.01)). On the other hand, although not shown in FIG. 6, when the pedestrian walks on an inclined section (e.g., stairs), z1 of the GT body direction information will have a non-zero value (e.g., (x1, y1, z1) is (1, 2, 1)).
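The relation between z1 and the slope of the walking surface may be illustrated by computing the elevation angle of the body direction vector. The following is a minimal sketch; the function names and the 2-degree tolerance used to treat a direction as level are assumptions for illustration only.

```python
import math

def elevation_deg(v):
    """Angle in degrees between a 3D direction and the horizontal (x, y) plane."""
    x, y, z = v
    return math.degrees(math.atan2(z, math.hypot(x, y)))

def is_level(v, tol_deg=2.0):
    """Treat a direction as level when its elevation is within tol_deg (assumed tolerance)."""
    return abs(elevation_deg(v)) <= tol_deg

flat = is_level((-1.0, 1.0, 0.0))        # z1 equal to 0
near_flat = is_level((-1.0, 1.0, 0.01))  # z1 very small compared with x1 and y1
stairs = is_level((1.0, 2.0, 1.0))       # z1 clearly non-zero
```

With the example vectors above, the first two directions are within the tolerance, while (1, 2, 1) has an elevation of roughly 24 degrees and is not.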

Meanwhile, as shown in FIG. 6, the pedestrian can not only turn the head left and right but also tilt it up and down, so z2 of the GT head direction information may take various values (e.g., (x2, y2, z2) is (-1, -1, 1)) regardless of the section in which the pedestrian is walking.

For reference, as described above, the ground truth may correspond to information obtained from a gyroscope sensor mounted on the pedestrian's head.

However, as shown in FIG. 6, when the first pillar 610 and the second pillar 620 are located close to each other (or when the first pillar and the second pillar both lie within a preset near viewing angle or near viewing frustum of the pedestrian), the information obtained from the gyroscope sensor alone may not be enough to label accurately whether the pedestrian is looking at the first advertisement posted on the first pillar 610 or at the second advertisement posted on the second pillar 620.

Therefore, for more accurate labeling, pedestrian auxiliary information acquired within a predetermined time period from the time the pedestrian was captured may be used together with the information obtained from the gyroscope sensor. Examples of such auxiliary information include search information indicating that content corresponding to the second advertisement was searched for on the pedestrian's terminal within 60 seconds of the capture, and history information indicating that the pedestrian visited a store carrying a product related to the second advertisement or purchased that product.
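One way of combining the two sources may be sketched as follows. This is only an illustration under stated assumptions: the function names, the representation of auxiliary information as (timestamp, target name) pairs, the per-target direction vectors, and the 10-degree ambiguity margin are all hypothetical; only the 60-second window comes from the example above.

```python
import math

def angle_between_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))

def label_viewed_target(head_dir, targets, aux_events, capture_time,
                        window_s=60.0, margin_deg=10.0):
    """Pick the target the pedestrian is labeled as looking at.

    head_dir:   head direction sensed by the gyroscope sensor
    targets:    dict mapping a target name to the direction from the pedestrian to it
    aux_events: list of (timestamp, target name) pairs of auxiliary information
    """
    angles = {name: angle_between_deg(head_dir, d) for name, d in targets.items()}
    best = min(angles, key=angles.get)
    # Targets whose angle is within margin_deg of the best one are ambiguous.
    ambiguous = {n for n, a in angles.items() if a - angles[best] < margin_deg}
    if len(ambiguous) > 1:
        # Use the earliest auxiliary event inside the time window to break the tie.
        for timestamp, name in sorted(aux_events):
            if 0.0 <= timestamp - capture_time <= window_s and name in ambiguous:
                return name
    return best

targets = {
    "first_advertisement": (-1.0, -0.9, 1.0),
    "second_advertisement": (-1.0, -1.1, 1.0),
}
label = label_viewed_target((-1.0, -1.0, 1.0), targets,
                            [(30.0, "second_advertisement")], capture_time=0.0)
```

When no auxiliary event falls inside the window, the sketch falls back to the target closest to the sensed head direction.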

Meanwhile, instead of training the gaze direction detection model with learning images whose ground truth has been labeled as described above, as explained with reference to FIGS. 2 and 3, the gaze direction detection model may also be trained in real time, without generating learning images, by using the sensing information from the gyroscope sensor as the ground truth.

As an example, the learning device may acquire a video capturing a pedestrian who walks while wearing a gyroscope sensor. Here, it may be assumed that the pedestrian's body direction information obtained from the gyroscope sensor is (-1, 1, 0) and the head direction information is (-1, -1, 1).

Then, the learning device may input a cropped image, obtained by cropping the region in which the pedestrian is captured, into the gaze direction detection model trained as described above, so that the gaze direction detection model outputs predicted head direction information (e.g., (-1.1, -1, 0.8)) predicting the direction in which the front of the pedestrian's head faces.

Then, the learning device may generate a head direction loss by referring to the predicted head direction information (e.g., (-1.1, -1, 0.8)) and the pedestrian's actual head direction information (e.g., (-1, -1, 1)) obtained from the information sensed by the gyroscope sensor, and may train the gaze direction detection model by performing backpropagation using the head direction loss.
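The head direction loss for the example values may be worked through as follows. The disclosure does not fix a particular loss function, so the mean squared error used here is an assumption for illustration, as are the function names; the gradient shown is only the first quantity that backpropagation would pass back into the model.

```python
def mse_loss(predicted, target):
    """Mean squared error between two direction vectors (assumed loss function)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def mse_grad(predicted, target):
    """Gradient of the mean squared error with respect to the prediction."""
    n = len(predicted)
    return [2.0 * (p - t) / n for p, t in zip(predicted, target)]

predicted_head = (-1.1, -1.0, 0.8)   # output of the gaze direction detection model
sensed_head = (-1.0, -1.0, 1.0)      # ground truth from the gyroscope sensor

head_direction_loss = mse_loss(predicted_head, sensed_head)
grad_wrt_prediction = mse_grad(predicted_head, sensed_head)
```

For these values the squared differences are 0.01, 0 and 0.04, giving a loss of about 0.0167; the gradient is negative in the x and z components, so a gradient descent update would raise those components of the prediction toward the sensed values.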

In addition, the embodiments according to the present invention described above may be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the computer software field. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as particular components, limited embodiments, and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person having ordinary skill in the art to which the present invention pertains may make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also everything modified equally or equivalently to those claims falls within the scope of the spirit of the present invention.

1000: learning device,
1001: memory,
1002: processor,
2000: test device,
2001: memory,
2002: processor

Claims

A method of training a deep learning-based gaze direction detection model for detecting a person's gaze direction, the method comprising:
(a) when at least one first learning image is acquired, a learning device inputting the first learning image into a body convolutional layer so that the body convolutional layer applies a convolution operation to the first learning image at least once to generate at least one first body feature map in which body features of a first person included in the first learning image are extracted, inputting the first body feature map into a body FC (Fully Connected) layer so that the body FC layer applies an FC operation to the first body feature map at least once to output at least one piece of predicted body direction information predicting the direction in which the front of the first person's body faces, generating at least one body direction loss using the predicted body direction information and labeled body direction information included in first ground truth information corresponding to the first learning image, and training the body FC layer and the body convolutional layer using the body direction loss; and
(b) when at least one second learning image is acquired, the learning device inputting the second learning image into the body convolutional layer so that the body convolutional layer applies a convolution operation to the second learning image at least once to generate at least one second body feature map in which body features of a second person included in the second learning image are extracted, inputting the second learning image into a head convolutional layer so that the head convolutional layer applies a convolution operation to the second learning image at least once to generate at least one first head feature map in which head features of the second person are extracted, concatenating the second body feature map and the first head feature map to generate a first integrated feature map, inputting the first integrated feature map into a head FC layer so that the head FC layer applies an FC operation to the first integrated feature map at least once to output at least one piece of first predicted head direction information predicting the direction in which the front of the second person's head faces, generating at least one head direction loss using the first predicted head direction information and labeled head direction information included in a second ground truth corresponding to the second learning image, and training the head FC layer and the head convolutional layer using the head direction loss.
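By way of illustration only, the concatenation and FC operation of step (b) may be sketched with toy values as follows. The function names, the flattening of each feature map into a short list, and the weight and bias values are assumptions for this sketch; an actual implementation would use learned convolutional and FC layers.

```python
def fc(x, weights, bias):
    """One fully connected operation: each output is a weighted sum of x plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def predict_head_direction(body_features, head_features, weights, bias):
    """Concatenate flattened body and head features, then apply the head FC operation."""
    integrated = list(body_features) + list(head_features)
    return fc(integrated, weights, bias)

# Toy flattened feature maps (hypothetical values).
body_features = [0.5, -1.0]   # from the body convolutional layer
head_features = [2.0]         # from the head convolutional layer

# Toy head FC parameters mapping the 3 integrated features to a 3D direction.
weights = [[1.0, 0.0, 0.5],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.1]

predicted = predict_head_direction(body_features, head_features, weights, bias)
```

Because the integrated feature map contains the body features, the head direction prediction can depend on the body orientation as well as on the head features.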

The method of claim 1, wherein,
in step (b),
the learning device trains the head FC layer and the head convolutional layer by additionally applying a loss weight to the head direction loss,
wherein “0” is applied as the loss weight when the head direction loss is less than a preset threshold, and
a preset real number greater than “0” is applied as the loss weight when the head direction loss is greater than the threshold.
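The thresholded loss weight may be sketched as follows. The function name, the example threshold and weight values, and the choice to multiply the loss by the weight are assumptions for illustration; the case in which the loss exactly equals the threshold is not specified above and is treated here like the below-threshold case, which is also an assumption.

```python
def weighted_head_direction_loss(loss, threshold, weight):
    """Apply a loss weight of 0 below the threshold and a preset positive weight above it."""
    if weight <= 0:
        raise ValueError("the preset loss weight must be greater than 0")
    if loss > threshold:
        return weight * loss
    # Below the threshold (and, by assumption, at it) the weight is 0,
    # so the sample contributes nothing to training.
    return 0.0

small = weighted_head_direction_loss(0.01, threshold=0.05, weight=2.0)
large = weighted_head_direction_loss(0.2, threshold=0.05, weight=2.0)
```

The effect is that predictions already close to the label produce no update, while larger errors are emphasized by the preset weight.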

The method of claim 1, wherein,
in step (b),
the learning device causes the head FC layer to output, as the predicted direction information, one of (i) classification information classifying which of preset head direction classes the direction in which the front of the second person's head faces corresponds to, and (ii) regression information regressing which of continuous direction candidates the direction in which the front of the second person's head faces corresponds to.

The method of claim 2, wherein
the predicted direction information predicts the direction in which the front of the second person's head faces in either a two-dimensional plane or a three-dimensional space corresponding to the second learning image.

The method of claim 1, wherein
the first learning image or the second learning image is generated by, in a photographed or cropped person image, (i) labeling each of the person's body direction and gaze direction with the corresponding specific body direction class and specific gaze direction class among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the person's body direction and gaze direction with a body direction vector and a gaze direction vector, respectively, in the two-dimensional plane or three-dimensional space.

The method of claim 1, wherein
the first learning image or the second learning image is generated by, in a person image capturing a person wearing a gyroscope sensor, using sensed body direction information and sensed gaze direction information of the person obtained from sensing information of the gyroscope sensor at the time of capture, (i) labeling with the specific body direction class and the specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the sensed body direction information and the sensed gaze direction information in the two-dimensional plane or three-dimensional space as a body direction vector and a gaze direction vector of the person, respectively.

The method of claim 1, further comprising:
(c) the learning device inputting at least one evaluation image into the body convolutional layer so that the body convolutional layer applies a convolution operation to the evaluation image at least once to generate at least one third body feature map in which body features of a third person included in the evaluation image are extracted, inputting the evaluation image into the head convolutional layer so that the head convolutional layer applies a convolution operation to the evaluation image at least once to generate at least one second head feature map in which head features of the third person are extracted, concatenating the third body feature map and the second head feature map to generate a second integrated feature map, inputting the second integrated feature map into the head FC layer so that the head FC layer applies an FC operation to the second integrated feature map at least once to output second predicted head direction information predicting the direction in which the front of the third person's head faces, and evaluating the gaze direction detection model, which includes the body convolutional layer, the head convolutional layer, and the head FC layer, by referring to the second predicted head direction information and a third ground truth corresponding to the evaluation image.

The method of claim 7, wherein
the learning device calculates accuracy through the following equation using the second predicted head direction information and the third ground truth, and evaluates the gaze direction detection model using the calculated accuracy.

In the above equation, N is the total number of pieces of second predicted head direction information used for evaluation, "# of predicted soft corrections" is the number of pieces of second predicted head direction information that did not correctly predict the labeled answer, and "# of predicted corrects" is the number of pieces that correctly predicted the labeled answer.

A method of training a deep learning-based gaze direction detection model for detecting a person's gaze direction, the method comprising:
(a) when at least one learning image is acquired, a learning device inputting the learning image into a body convolutional layer so that the body convolutional layer applies a convolution operation to the learning image at least once to generate at least one body feature map in which body features of a person included in the learning image are extracted, and inputting the learning image into a head convolutional layer so that the head convolutional layer applies a convolution operation to the learning image at least once to generate at least one head feature map in which head features of the person are extracted;
(b) the learning device inputting the body feature map into a body FC layer so that the body FC layer applies an FC operation to the body feature map at least once to output at least one piece of predicted body direction information predicting the direction in which the front of the person's body faces, and inputting an integrated feature map, obtained by concatenating the body feature map and the head feature map, into a head FC layer so that the head FC layer applies an FC operation to the integrated feature map at least once to output at least one piece of predicted head direction information predicting the direction in which the front of the person's head faces; and
(c) the learning device generating at least one body direction loss using the predicted body direction information and labeled body direction information included in ground truth information corresponding to the learning image, generating at least one head direction loss using the predicted head direction information and labeled head direction information included in the ground truth, training the body FC layer and the body convolutional layer using the body direction loss, and training the head FC layer and the head convolutional layer using the head direction loss.

The method of claim 9, wherein
the learning image is generated by, in a photographed or cropped person image, (i) labeling each of the person's body direction and gaze direction with the corresponding specific body direction class and specific gaze direction class among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the person's body direction and gaze direction with a body direction vector and a gaze direction vector, respectively, in the two-dimensional plane or three-dimensional space.

The method of claim 9, wherein
the learning image is generated by, in a person image capturing a person wearing a gyroscope sensor, using sensed body direction information and sensed gaze direction information of the person obtained from sensing information of the gyroscope sensor at the time of capture, (i) labeling with the specific body direction class and the specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the sensed body direction information and the sensed gaze direction information in the two-dimensional plane or three-dimensional space as a body direction vector and a gaze direction vector of the person, respectively.

A learning device for training a deep learning-based gaze direction detection model for detecting a person's gaze direction, the learning device comprising:
a memory storing instructions for training the deep learning-based gaze direction detection model for detecting a person's gaze direction; and
a processor that performs operations for training the gaze direction detection model according to the instructions stored in the memory,
wherein the processor performs (I) a process of, when at least one first learning image is acquired, inputting the first learning image into a body convolutional layer so that the body convolutional layer applies a convolution operation to the first learning image at least once to generate at least one first body feature map in which body features of a first person included in the first learning image are extracted, inputting the first body feature map into a body FC (Fully Connected) layer so that the body FC layer applies an FC operation to the first body feature map at least once to output at least one piece of predicted body direction information predicting the direction in which the front of the first person's body faces, generating at least one body direction loss using the predicted body direction information and labeled body direction information included in first ground truth information corresponding to the first learning image, and training the body FC layer and the body convolutional layer using the body direction loss, and (II) a process of, when at least one second learning image is acquired, inputting the second learning image into the body convolutional layer so that the body convolutional layer applies a convolution operation to the second learning image at least once to generate at least one second body feature map in which body features of a second person included in the second learning image are extracted, inputting the second learning image into a head convolutional layer so that the head convolutional layer applies a convolution operation to the second learning image at least once to generate at least one first head feature map in which head features of the second person are extracted, concatenating the second body feature map and the first head feature map to generate a first integrated feature map, inputting the first integrated feature map into a head FC layer so that the head FC layer applies an FC operation to the first integrated feature map at least once to output at least one piece of first predicted head direction information predicting the direction in which the front of the second person's head faces, generating at least one head direction loss using the first predicted head direction information and labeled head direction information included in a second ground truth corresponding to the second learning image, and training the head FC layer and the head convolutional layer using the head direction loss.

The learning device of claim 12, wherein
the processor,
in the process (II), trains the head FC layer and the head convolutional layer by additionally applying a loss weight to the head direction loss,
wherein “0” is applied as the loss weight when the head direction loss is less than a preset threshold, and
a preset real number greater than “0” is applied as the loss weight when the head direction loss is greater than the threshold.

The learning device of claim 12, wherein
the processor,
in the process (II), causes the head FC layer to output, as the predicted direction information, one of (i) classification information classifying which of preset head direction classes the direction in which the front of the second person's head faces corresponds to, and (ii) regression information regressing which of continuous direction candidates the direction in which the front of the second person's head faces corresponds to.

The learning device of claim 13, wherein
the predicted direction information predicts the direction in which the front of the second person's head faces in either a two-dimensional plane or a three-dimensional space corresponding to the second learning image.

The learning device of claim 12, wherein
the first learning image or the second learning image is generated by, in a photographed or cropped person image, (i) labeling each of the person's body direction and gaze direction with the corresponding specific body direction class and specific gaze direction class among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the person's body direction and gaze direction with a body direction vector and a gaze direction vector, respectively, in the two-dimensional plane or three-dimensional space.

The learning device of claim 12, wherein
the first learning image or the second learning image is generated by, in a person image capturing a person wearing a gyroscope sensor, using sensed body direction information and sensed gaze direction information of the person obtained from sensing information of the gyroscope sensor at the time of capture, (i) labeling with the specific body direction class and the specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the sensed body direction information and the sensed gaze direction information in the two-dimensional plane or three-dimensional space as a body direction vector and a gaze direction vector of the person, respectively.

The learning device of claim 12, wherein
the processor further performs (III) a process of inputting at least one evaluation image into the body convolutional layer so that the body convolutional layer applies a convolution operation to the evaluation image at least once to generate at least one third body feature map in which body features of a third person included in the evaluation image are extracted, inputting the evaluation image into the head convolutional layer so that the head convolutional layer applies a convolution operation to the evaluation image at least once to generate at least one second head feature map in which head features of the third person are extracted, concatenating the third body feature map and the second head feature map to generate a second integrated feature map, inputting the second integrated feature map into the head FC layer so that the head FC layer applies an FC operation to the second integrated feature map at least once to output second predicted head direction information predicting the direction in which the front of the third person's head faces, and evaluating the gaze direction detection model, which includes the body convolutional layer, the head convolutional layer, and the head FC layer, by referring to the second predicted head direction information and a third ground truth corresponding to the evaluation image.

The learning device of claim 18, wherein
the processor calculates accuracy through the following equation using the second predicted head direction information and the third ground truth, and evaluates the gaze direction detection model using the calculated accuracy.

In the above equation, N is the total number of pieces of second predicted head direction information used for evaluation, "# of predicted soft corrections" is the number of pieces of second predicted head direction information that did not correctly predict the labeled answer, and "# of predicted corrects" is the number of pieces that correctly predicted the labeled answer.

A learning device for training a deep learning-based gaze direction detection model for detecting a person's gaze direction, the learning device comprising:
a memory storing instructions for training the deep learning-based gaze direction detection model for detecting a person's gaze direction; and
a processor that performs operations for training the gaze direction detection model according to the instructions stored in the memory,
wherein
the processor performs (I) a process of, when at least one learning image is acquired, inputting the learning image into a body convolutional layer so that the body convolutional layer applies a convolution operation to the learning image at least once to generate at least one body feature map in which body features of a person included in the learning image are extracted, and inputting the learning image into a head convolutional layer so that the head convolutional layer applies a convolution operation to the learning image at least once to generate at least one head feature map in which head features of the person are extracted, (II) a process of inputting the body feature map into a body FC layer so that the body FC layer applies an FC operation to the body feature map at least once to output at least one piece of predicted body direction information predicting the direction in which the front of the person's body faces, and inputting an integrated feature map, obtained by concatenating the body feature map and the head feature map, into a head FC layer so that the head FC layer applies an FC operation to the integrated feature map at least once to output at least one piece of predicted head direction information predicting the direction in which the front of the person's head faces, and (III) a process of generating at least one body direction loss using the predicted body direction information and labeled body direction information included in ground truth information corresponding to the learning image, generating at least one head direction loss using the predicted head direction information and labeled head direction information included in the ground truth, training the body FC layer and the body convolutional layer using the body direction loss, and training the head FC layer and the head convolutional layer using the head direction loss.

The learning device of claim 20, wherein
the learning image is generated by, in a photographed or cropped person image, (i) labeling each of the person's body direction and gaze direction with the corresponding specific body direction class and specific gaze direction class among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the person's body direction and gaze direction with a body direction vector and a gaze direction vector, respectively, in the two-dimensional plane or three-dimensional space.

The learning device of claim 20, wherein
the learning image is generated by, in a person image capturing a person wearing a gyroscope sensor, using sensed body direction information and sensed gaze direction information of the person obtained from sensing information of the gyroscope sensor at the time of capture, (i) labeling with the specific body direction class and the specific gaze direction class that correspond to the sensed body direction information and the sensed gaze direction information, respectively, among preset body direction classes and preset gaze direction classes in a two-dimensional plane or three-dimensional space, or (ii) labeling each of the sensed body direction information and the sensed gaze direction information in the two-dimensional plane or three-dimensional space as a body direction vector and a gaze direction vector of the person, respectively.