KR20220090940A

KR20220090940A - System and method for analyzing speaker-listener recognition and gaze interaction through tracking the eyes of the characters in story-based visual media

Info

Publication number: KR20220090940A
Application number: KR1020200181955A
Authority: KR
Inventors: 최용석; 정지은
Original assignee: 한양대학교 산학협력단
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2022-06-30

Abstract

Disclosed are a system and method for speaker-listener recognition and gaze interaction analysis through gaze tracking of characters in story-based video media. According to an embodiment, a gaze interaction analysis method performed by a gaze interaction analysis system may include: receiving image information from a story-based video medium; tracking gaze information of an object by applying a gaze tracking model to a depth map restored from the received image information and to object information recognized from the received image information; tracking gaze information of a counterpart object from the tracked gaze information of the object as that gaze information is input to the gaze tracking model; and classifying gaze interaction information based on the tracked gaze information of the object and the tracked gaze information of the counterpart object.

Description

SYSTEM AND METHOD FOR ANALYZING SPEAKER-LISTENER RECOGNITION AND GAZE INTERACTION THROUGH TRACKING THE EYES OF THE CHARACTERS IN STORY-BASED VISUAL MEDIA

The description below relates to a technique for analyzing the gaze interaction between a speaker and a listener who are recognized by tracking the gaze of characters.

Conventional gaze recognition technology required hardware devices and sensors such as eye trackers. More recently, techniques have been reported that use deep learning to estimate the position and angle of the pupils from camera footage alone, or to estimate the point being looked at on the screen of a device such as a smartphone.

However, such techniques cannot determine which object a person is gazing at. Although recent gaze following techniques have been reported as an alternative, video media such as dramas and films, which so far cannot be shot with RGB-D cameras, lack depth information. As a result, accuracy drops markedly when existing state-of-the-art techniques are used to track a character's gaze.

A method and system may be provided that use a CNN-based learning model to automatically detect a speaker in a story-based video medium, recognize a listener by tracking the speaker's gaze information, extract the recognized listener's gaze information, and collect and analyze the gaze interaction relationship between the speaker and the listener.

A method and system may be provided that track the gaze information of a speaker recognized from image information in a story-based video medium and from a depth map generated from that image information, track a listener's gaze information from the tracked gaze of the speaker, and collect and analyze the gaze interaction relationship between the speaker and the listener.

A gaze interaction analysis method performed by a gaze interaction analysis system may include: receiving image information from a story-based video medium; tracking gaze information of an object by applying a gaze tracking model to a depth map restored from the received image information and to object information recognized from the received image information; tracking gaze information of a counterpart object from the tracked gaze information of the object as that gaze information is input to the gaze tracking model; and classifying gaze interaction information based on the tracked gaze information of the object and the tracked gaze information of the counterpart object.

The object refers to a speaker, that is, the character who is speaking among the characters in the story-based video medium, and the tracking of the gaze information of the object may include inputting the image information to a depth map generation model and generating a depth map from the image information using the depth map generation model.

The object refers to a speaker, that is, the character who is speaking among the characters in the story-based video medium, and the tracking of the gaze information of the object may include inputting the image information to a speaker recognition model and obtaining speaker recognition information from the image information using the speaker recognition model.

The tracking of the gaze information of the object may include storing the obtained speaker recognition information in a database.

The tracking of the gaze information of the object may include storing a tracking result of the tracked gaze information of the object in a database.

The counterpart object refers to a listener, that is, a character in the story-based video medium who is gazing at the speaker or with whom the speaker is talking, and the tracking of the gaze information of the counterpart object may include storing a tracking result of the tracked listener's gaze information in a database.

The classifying of the gaze interaction information may include classifying the gaze interaction, using the tracked gaze information of the object and the tracked gaze information of the counterpart object, based on gaze interaction criteria that include no interaction, mutual gaze, gaze avoidance, gaze tracking, and mutual interest.

The classifying of the gaze interaction information may include storing, in a database, gaze interaction classification data corresponding to mutual gaze among the classified gaze interactions.

The classifying of the gaze interaction information may include providing search information related to gaze interaction by searching a database that contains the information on the recognized object, the tracking result of the tracked gaze information of the object, the tracking result of the tracked gaze information of the counterpart object, and the gaze interaction classification data corresponding to mutual gaze.

The classifying of the gaze interaction information may include analyzing, based on the interaction relationship between the speaker and the listener derived from the classified gaze interaction information, a character's disposition, conflict, dialogue identification, whether a dialogue is positive or negative, the main dialogue characters in the story, and main objects.

A gaze interaction analysis system may include: an input unit that receives image information from a story-based video medium; a speaker gaze tracking unit that tracks gaze information of an object by applying a gaze tracking model to a depth map restored from the received image information and to object information recognized from the received image information; a listener recognition unit that tracks gaze information of a counterpart object from the tracked gaze information of the object as that gaze information is input to the gaze tracking model; and a gaze interaction classification unit that classifies gaze interaction information based on the tracked gaze information of the object and the tracked gaze information of the counterpart object.

The speaker gaze tracking unit may input the image information to a depth map generation model and generate a depth map from the image information using the depth map generation model, and may input the image information to a speaker recognition model and obtain speaker recognition information from the image information using the speaker recognition model.

The gaze interaction classification unit may classify the gaze interaction, using the tracked gaze information of the object and the tracked gaze information of the counterpart object, based on gaze interaction criteria that include no interaction, mutual gaze, gaze avoidance, gaze tracking, and mutual interest.

The gaze interaction analysis system may further include a storage unit that stores, in a database, the information on the recognized object, the tracking result of the tracked gaze information of the object, the tracking result of the tracked gaze information of the counterpart object, and the gaze interaction classification data corresponding to mutual gaze.

The gaze interaction classification unit may analyze, based on the interaction relationship between the speaker and the listener derived from the classified gaze interaction information, a character's disposition, conflict, dialogue identification, whether a dialogue is positive or negative, the main dialogue characters in the story, and main objects.

Dialogue scenes in a story-based video medium can be detected automatically, and the gaze information of the speaker and listener and their gaze interaction information can be collected.

By using a deep learning model that automatically generates a depth map from the RGB image frames of video media such as dramas and films, gaze can be recognized with high accuracy in such media without additional hardware, sensors, or camera equipment. In other words, meaningful information, such as who is looking at what and who is talking with whom, can be collected automatically from plain RGB video.

A search service for gaze-related scenes can be devised on the basis of the collected speaker-listener recognition, gaze information, and gaze interaction information, and once the speakers and listeners in drama dialogue scenes have been recognized automatically and turned into data, a desired dialogue scene can be searched for. Analyzing gaze interaction relationships together with dialogue lines, facial expressions, and actions can support analysis of a drama's story and of the social relationships between its characters, which can be applied to academic research and related commercial services.

FIG. 1 is an example for comparing gaze tracking, according to an embodiment.
FIG. 2 is a diagram for explaining the operation of a gaze interaction analysis system, according to an embodiment.
FIG. 3 is a block diagram for explaining the configuration of a gaze interaction analysis system, according to an embodiment.
FIG. 4 is an example for explaining a process in which a listener is recognized from the speaker's gaze and the gaze interaction is classified, according to an embodiment.
FIG. 5 is a flowchart for explaining a gaze interaction analysis method, according to an embodiment.
FIG. 6 is an example for explaining a gaze interaction classification scheme, according to an embodiment.
FIG. 7 is an example for explaining a learning model for recognizing a speaker and a listener, according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.

The embodiments describe operations of automatically detecting a speaker in a story-based video medium such as a film, recognizing a listener by tracking the speaker's gaze, extracting the recognized listener's gaze information, and collecting and analyzing the gaze interaction relationship. In this way, the person gazed at, the object gazed at, the gaze direction, and the like can be extracted for each conversation participant, and the gaze interaction relationships among the participants can be collected and analyzed automatically.

In the embodiments, a 'shot' is the portion filmed from when the camera starts running once until it stops, and a group of multiple shots is defined as a scene. A character (object) who is speaking in a dialogue scene is defined as the 'speaker', and the target gazing at the speaker or conversing with the speaker (counterpart object), that is, the character listening to the speaker, is defined as the 'listener'.

FIG. 1 is an example for comparing gaze tracking, according to an embodiment.

FIG. 1 compares the prior art with the technique proposed in the embodiments. With existing techniques, because drama and film footage carries no depth information, a character is often inferred to be gazing at an object (for example, another character or a thing) located farther back in the image (for example, behind the listener). The technique proposed in the embodiments, by contrast, generates a depth map automatically and can therefore effectively prevent this problem of the prior art.

FIG. 4 is an example for explaining a process in which a listener is recognized from the speaker's gaze and the gaze interaction is classified, according to an embodiment.

A speaker recognized in the image information may be located at (x1, y1). A listener gazed at by the recognized speaker may be located at (x2, y2). The listener may, in turn, gaze at the speaker located at (x1, y1). When the speaker and the listener gaze at each other at the same time, they may be determined to be in mutual gaze.
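The mutual-gaze determination described for FIG. 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment: the `Gaze` record, the normalized image coordinates, the `tolerance` radius, and the function names are all hypothetical. The source states only that the speaker is at (x1, y1), the listener at (x2, y2), and that gazing at each other at the same time is judged as mutual gaze.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass
class Gaze:
    """One character's state in a single frame (hypothetical record)."""
    position: Point  # where the character is, e.g. (x1, y1)
    target: Point    # estimated gaze point for the same frame


def looks_at(gaze: Gaze, other_position: Point, tolerance: float = 0.05) -> bool:
    """True if the estimated gaze point falls within `tolerance` of the other character."""
    return math.dist(gaze.target, other_position) <= tolerance


def is_mutual_gaze(speaker: Gaze, listener: Gaze, tolerance: float = 0.05) -> bool:
    """Both characters gaze at each other in the same frame."""
    return (looks_at(speaker, listener.position, tolerance)
            and looks_at(listener, speaker.position, tolerance))
```

A distance threshold is only one possible test; a bounding-box test around each character would serve the same purpose.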

FIG. 2 is a diagram for explaining the operation of a gaze interaction analysis system, according to an embodiment.

The gaze interaction analysis system may receive image information from a story-based video medium. In other words, the gaze interaction analysis system may receive an input video (image) 201.

The gaze interaction analysis system may input the image information 201 to a depth map generation model 210. The gaze interaction analysis system may generate a depth map from the image information 201 using the depth map generation model 210. Here, the depth map generation model 210 may be a deep learning-based model configured to generate a depth map.

The gaze interaction analysis system may also input the image information 201 to a speaker recognition model 220. The gaze interaction analysis system may obtain speaker recognition information from the image information 201 using the speaker recognition model 220. Here, the speaker recognition model 220 may be a deep learning-based model configured to recognize a speaker.

The gaze interaction analysis system may perform the operation of generating the depth map and the operation of recognizing the speaker either simultaneously or non-simultaneously.
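As a rough sketch of running the two preprocessing operations either together or one after the other, the snippet below uses Python's standard thread pool. The functions `generate_depth_map` and `recognize_speaker` are hypothetical placeholders standing in for the depth map generation model 210 and the speaker recognition model 220; their return values are dummy data, not real model output.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_depth_map(frame):
    # Placeholder for model 210: returns a zero depth value per pixel.
    return [[0.0 for _ in row] for row in frame]


def recognize_speaker(frame):
    # Placeholder for model 220: returns an empty recognition result.
    return {"speaker_box": None}


def preprocess(frame, concurrent=True):
    """Return (depth_map, speaker_info) for one frame."""
    if concurrent:
        # Submit both operations so they can run at the same time.
        with ThreadPoolExecutor(max_workers=2) as pool:
            depth_future = pool.submit(generate_depth_map, frame)
            speaker_future = pool.submit(recognize_speaker, frame)
            return depth_future.result(), speaker_future.result()
    # Sequential fallback: one operation after the other.
    return generate_depth_map(frame), recognize_speaker(frame)
```

Both paths produce the same pair of results, which is what allows the gaze tracking step that follows to be agnostic about how they were scheduled.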

The gaze interaction analysis system may store the obtained speaker recognition information, that is, the result of speaker recognition, in a database 260.

The gaze interaction analysis system may track the gaze information of the object by applying a first gaze tracking model 230 to the depth map restored from the received image information 201 and to the speaker information recognized from the received image information 201. Here, the first gaze tracking model 230 may be a deep learning-based model configured to track the speaker's gaze.

The gaze interaction analysis system may store the tracking result of the tracked speaker's gaze information in the database 260. For example, when user A and user B are present, A may gaze at B, who is on the right. The speaker's gaze tracking result may then be determined as gazing to the right, and the determined information (gazing right) may be stored in the database. Various information related to the determined information (for example, the position of the object and the name of the object) may also be stored in the database. The user on the right who is gazed at by the speaker may be recognized as the listener.

As the tracked speaker's gaze information is input to a second gaze tracking model 240, the gaze interaction analysis system may track the listener's gaze information from the tracked speaker's gaze information. Here, the second gaze tracking model 240 may be a deep learning-based model configured to track the listener's gaze. The first gaze tracking model 230 and the second gaze tracking model 240 may be configured separately to track the gaze of the speaker and of the listener, respectively, or may be a single model configured to track the gaze of any object, including the speaker and the listener. In other words, the first gaze tracking model 230 and the second gaze tracking model 240 may be the same model, through which the gaze of both the speaker and the listener can be tracked.

The gaze interaction analysis system may store the tracking result of the tracked listener's gaze information in the database 260. For example, when user A and user B are present, B may gaze at A, who is on the left. The listener's gaze tracking result may then be determined as gazing to the left, and the determined information (gazing left) may be stored in the database. Various information related to the determined information (for example, the position of the object and the name of the object) may also be stored in the database. The user on the left who is gazed at by the listener may be recognized as the speaker.

The gaze interaction analysis system may classify the gaze interaction information (250) based on the tracked speaker's gaze information and the tracked listener's gaze information. Using these two pieces of gaze information, the gaze interaction analysis system may classify the gaze interaction based on gaze interaction criteria that include no interaction, mutual gaze, gaze avoidance, gaze tracking, and mutual interest. The gaze interaction analysis system may store, in the database, the gaze interaction information corresponding to mutual gaze among the classified gaze interactions.

The gaze interaction analysis system may provide information related to gaze interaction by searching a database that contains the speaker recognition result, the tracking result of the tracked speaker's gaze information, the tracking result of the tracked listener's gaze information, and the gaze interaction data corresponding to mutual gaze. Based on the interaction relationship between the speaker and the listener derived from the classified gaze interaction information, the gaze interaction analysis system may analyze a character's disposition, conflict, dialogue identification, whether a dialogue is positive or negative, the main dialogue characters in the story, and main objects.

In this way, search information related to gaze interaction can be provided by searching the database described above. For example, search information related to gaze interaction may be provided in response to a database search by a user. In story-based video media such as dramas and films, dialogue scenes can be searched for through automatic dialogue scene recognition, speaker-listener recognition, and collection of gaze interactions. The gaze tracking results can also be applied to a service that analyzes the social relationships between characters.
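The storage and search steps above might look roughly like the following when backed by SQLite. The table name, column names, and helper functions are assumptions made for illustration; the source specifies what kinds of data the database 260 holds, not its schema. The sketch also stores a label for every shot, whereas the source mentions storing the data corresponding to mutual gaze.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open a database with one hypothetical table of per-shot records."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE gaze_interaction (
               shot_id INTEGER,
               speaker TEXT,
               listener TEXT,
               speaker_gaze TEXT,
               listener_gaze TEXT,
               interaction TEXT)"""
    )
    return conn


def add_record(conn, shot_id, speaker, listener,
               speaker_gaze, listener_gaze, interaction):
    """Store the recognition, tracking, and classification results for one shot."""
    conn.execute(
        "INSERT INTO gaze_interaction VALUES (?, ?, ?, ?, ?, ?)",
        (shot_id, speaker, listener, speaker_gaze, listener_gaze, interaction),
    )
    conn.commit()


def find_shots(conn, interaction, speaker=None):
    """Return the shot ids matching an interaction label and, optionally, a speaker."""
    query = "SELECT shot_id FROM gaze_interaction WHERE interaction = ?"
    params = [interaction]
    if speaker is not None:
        query += " AND speaker = ?"
        params.append(speaker)
    query += " ORDER BY shot_id"
    return [row[0] for row in conn.execute(query, params)]
```

For instance, after adding a record for shot 1 with speaker "A" and the label "mutual_gaze", `find_shots(conn, "mutual_gaze", speaker="A")` returns a list containing that shot id.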

FIG. 3 is a block diagram for explaining the configuration of a gaze interaction analysis system, according to an embodiment.

The gaze interaction analysis system may include an input unit 310, a speaker recognition unit 320, a speaker gaze tracking unit 330, a listener recognition unit 340, a gaze interaction classification unit 350, and a gaze interaction analysis unit 360.

The input unit 310 may receive image information from a story-based video medium. For example, a user may set time information for the video medium, and the image information corresponding to the set time information may be input.

The speaker recognition unit 320 may detect the speaker in a dialogue scene. The speaker recognition unit 320 may detect the act of speaking by a person in the video medium and recognize the person who is speaking as the speaker.

The speaker gaze tracking unit 330 may automatically generate a depth map to determine where the speaker is gazing. The speaker gaze tracking unit 330 may track the gaze using a CNN-based deep learning model that takes the recognized speaker's face region and the entire scene as input and outputs the coordinates (x, y) of the point the speaker is gazing at. In the course of inferring the gaze point, the speaker gaze tracking unit 330 may generate a depth map from the image information through the depth map generation model in order to restore the depth information lost in the video medium. The gaze information that the speaker gaze tracking unit 330 can collect includes gazing at a person, gazing at an object, no gaze, and gaze direction.

For example, the speaker gaze tracking unit 330 may track the gaze of a particular character in a drama or other video medium so that gaze behavior states, such as which character they mostly look at while talking, which object they mostly gaze at, or whether they gaze at nothing and focus inward, are collected automatically per episode of the drama.
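The per-episode collection of gaze behavior could be tallied along the following lines. The record layout `(episode, character, target)` and the `"no gaze"` label are assumptions introduced for this sketch; the source does not describe how the collected states are aggregated.

```python
from collections import Counter, defaultdict


def summarize_gaze(records):
    """Count, per (episode, character), how often each gaze target was observed.

    `records` is an iterable of (episode, character, target) tuples, where
    `target` is a character name, an object name, or None for no gaze.
    """
    summary = defaultdict(Counter)
    for episode, character, target in records:
        label = target if target is not None else "no gaze"
        summary[(episode, character)][label] += 1
    return summary
```

Calling `most_common(1)` on one of the resulting counters gives the person or object that a character gazed at most often in that episode.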

The listener recognition unit 340 may detect the listener based on the tracked gaze point. The listener recognition unit 340 may identify the object located at the gaze point based on the coordinates of the inferred gaze point. If that object is a person, the listener recognition unit 340 may recognize the person as the listener, that is, the target to whom the recognized speaker's utterance is addressed. Even when nothing is present at the estimated gaze point coordinates, or when an ordinary object rather than a person is recognized there, the listener recognition unit 340 may still collect and store the result as data.
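A minimal sketch of the listener recognition logic just described, assuming that recognized objects come with axis-aligned bounding boxes and that the speaker has already been excluded from the candidate list. The `DetectedObject` type, the box format, and the returned dictionary are hypothetical; the source does not say how the object at the gaze point is looked up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DetectedObject:
    name: str
    is_person: bool
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def object_at(point: Tuple[float, float],
              objects: List[DetectedObject]) -> Optional[DetectedObject]:
    """Return the first object whose box contains the gaze point, if any."""
    x, y = point
    for obj in objects:
        x_min, y_min, x_max, y_max = obj.box
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return obj
    return None


def recognize_listener(gaze_point, objects):
    """Record what is at the gaze point; only a person counts as the listener."""
    target = object_at(gaze_point, objects)
    if target is None:
        return {"listener": None, "target": None}         # nothing at the gaze point
    if target.is_person:
        return {"listener": target.name, "target": target.name}
    return {"listener": None, "target": target.name}      # an ordinary object
```

All three outcomes return a record, mirroring the statement that empty gaze points and non-person objects are also stored as data.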

시선 상호작용 분류부(350)는 화자와 청자의 시선 방향, 상호작용 정도를 통해 시선의 상호작용 관계를 분류할 수 있다. 시선 상호작용 분류부(350)는 추출된 화자와 청자의 시선 방향과 응시하는 대상에 따라 샷 단위로 복수 개(예를 들면, 5개)의 시선 상호작용 중 하나로 분류할 수 있다. 예를 들면, 시선 상호작용은 상호작용 없음, 상호 응시, 시선 회피, 시선 추적 및 상호 관심으로 구성될 수 있다. The gaze interaction classification unit 350 may classify the interaction relationship of the gaze through the gaze direction and the degree of interaction between the speaker and the listener. The gaze interaction classifying unit 350 may classify one of a plurality of (eg, five) gaze interactions in units of shots according to the extracted gaze directions of the speaker and listener and a gaze target. For example, gaze interaction may consist of no interaction, mutual gaze, gaze avoidance, gaze tracking, and mutual interest.

구체적으로, 시선 상호작용은 A와 B가 서로 대화중 일 때, A와 B의 시선 처리에 따라 분류될 수 있다. 도 6을 참고하면, 시선 상호작용 분류 체계(시선 상호작용 기준)를 설명하기 위한 예이다. Specifically, the gaze interaction may be classified according to the gaze processing of A and B when A and B are talking to each other. Referring to FIG. 6 , it is an example for describing a gaze interaction classification system (a gaze interaction criterion).

'상호작용 없음'은 A와 B 모두 서로 다른 곳을 응시하고 있는 경우를 의미한다.'No interaction' means that both A and B are looking at different places.

'Mutual gaze' refers to the case where A and B are gazing at each other. It is the most typical interaction in an ordinary conversation scene.

'Gaze avoidance' refers to the case where A and B are gazing at each other and then at least one of them averts their gaze.

'Gaze tracking' refers to the case where A and B are gazing at each other, at least one of them then gazes at some object, and the other party (B for A, A for B) follows that gaze and gazes at the same object.

'Mutual interest' refers to the case where A and B are both gazing at the same object (for example, a thing).
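The five categories above can be sketched as a per-shot rule set. This is a hedged illustration only: the per-frame input format (a pair of gaze targets for A and B, with `None` when nothing is gazed at), the enum, and the order in which the rules are tested are assumptions, not details given in the source.

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class GazeInteraction(Enum):
    NO_INTERACTION = "no interaction"
    MUTUAL_GAZE = "mutual gaze"
    GAZE_AVOIDANCE = "gaze avoidance"
    GAZE_TRACKING = "gaze tracking"
    MUTUAL_INTEREST = "mutual interest"


# One frame: (what A gazes at, what B gazes at); "A"/"B" denote the partner.
Frame = Tuple[Optional[str], Optional[str]]


def _mutual(frame: Frame) -> bool:
    return frame[0] == "B" and frame[1] == "A"


def _shared_object(frame: Frame) -> bool:
    # Equal, non-empty targets can only be a third object, never the partner.
    return frame[0] is not None and frame[0] == frame[1]


def classify_shot(frames: Sequence[Frame]) -> GazeInteraction:
    """Assign one shot (a sequence of frames) to one of the five categories."""
    first_mutual = next((i for i, f in enumerate(frames) if _mutual(f)), None)
    if first_mutual is not None:
        later = frames[first_mutual + 1:]
        if any(_shared_object(f) for f in later):
            return GazeInteraction.GAZE_TRACKING      # partner followed the gaze
        if any(not _mutual(f) for f in later):
            return GazeInteraction.GAZE_AVOIDANCE     # someone looked away
        return GazeInteraction.MUTUAL_GAZE
    if any(_shared_object(f) for f in frames):
        return GazeInteraction.MUTUAL_INTEREST
    return GazeInteraction.NO_INTERACTION
```

For instance, `classify_shot([("B", "A"), ("letter", "A"), ("letter", "letter")])` falls under gaze tracking, because mutual gaze is followed by both parties gazing at the same object.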

The gaze interaction-related analysis unit 360 may perform various analyses based on the collected gaze interaction relationships. From the extracted gaze interaction relationships, the gaze interaction-related analysis unit 360 may analyze, for example, whether a conversation is positive or negative. Based on collected information such as the characters' gaze directions, gazed objects, gazed persons, and gaze durations, the gaze interaction-related analysis unit 360 may analyze a character's disposition, identify conflict dialogue, determine whether a conversation is positive or negative, and identify the main conversational characters and main objects in the story.

In story-based video, a character's deliberate gaze behavior in a dialogue scene plays a supporting role in conveying that character's disposition. For example, a director and an actor may deliberately direct the gaze downward to express nonverbally that a character is passive and unassertive.

For example, when gaze data has been collected continuously for a specific character in every dialogue scene over one episode, the gaze interaction analysis system may infer the character's disposition from the previously collected gaze-related data.

As another example, a conflict dialogue, in which conflict between characters is apparent, may differ in character and structure from a dialogue in an ordinary conversational situation. For example, in a scene where A and B are talking, they may not gaze at each other, or only one of them may face the other while the other averts their gaze. In such cases, when 'no interaction' (no communication) and 'gaze avoidance' are detected multiple times under the gaze interaction classification scheme, it can be inferred that a conflict dialogue is taking place between the speaker and the listener.
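The conflict inference described above can be sketched as counting how often the two negative categories occur across the shots of a conversation. The source only says "multiple times"; the thresholds `min_count` and `min_ratio` below are hypothetical values chosen for illustration.

```python
from collections import Counter
from typing import Sequence


def looks_like_conflict(shot_labels: Sequence[str],
                        min_count: int = 3,
                        min_ratio: float = 0.5) -> bool:
    """Flag a conversation whose shots are dominated by disengaged gaze.

    shot_labels holds one category name per shot, e.g. "mutual gaze".
    min_count and min_ratio are assumed thresholds, not values from the source.
    """
    counts = Counter(shot_labels)
    negative = counts["no interaction"] + counts["gaze avoidance"]
    total = len(shot_labels)
    return total > 0 and negative >= min_count and negative / total >= min_ratio
```

A conversation labeled `["no interaction", "gaze avoidance", "gaze avoidance", "mutual gaze"]` would be flagged under these thresholds, while one dominated by mutual gaze would not.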

As yet another example, by tracking the gaze of the main characters in dramas and movies, information can be obtained about whom they talk to frequently, which objects they pay attention to, and how long they gaze at them.
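As a rough sketch of how such statistics might be derived from stored gaze records, the following aggregates gaze duration per character and target. The record layout `(character, target, n_frames)` and the default frame rate are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record: (gazing character, gaze target, number of frames).
GazeRecord = Tuple[str, str, int]


def gaze_time_by_target(records: Iterable[GazeRecord],
                        fps: float = 24.0) -> Dict[str, Dict[str, float]]:
    """Total gaze duration in seconds, grouped by character and then target."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for character, target, n_frames in records:
        totals[character][target] += n_frames / fps
    return {c: dict(t) for c, t in totals.items()}


def most_gazed_target(records: Iterable[GazeRecord],
                      character: str,
                      fps: float = 24.0) -> Optional[str]:
    """Target the given character gazed at longest, or None if no data."""
    per_target = gaze_time_by_target(records, fps).get(character, {})
    return max(per_target, key=per_target.get) if per_target else None
```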

FIG. 5 is a flowchart for describing a gaze interaction analysis method according to an embodiment.

The gaze interaction analysis system may receive image information from a story-based video medium. The gaze interaction analysis system may detect speaking behavior in the received image information and recognize the character who is speaking as the speaker (510). The gaze interaction analysis system may restore the depth information of an RGB image from the received image information using a depth map generation model (520). In detail, the gaze interaction analysis system may track the speaker's gaze information by feeding the depth map restored from the received image information and the speaker information recognized from the received image information into a gaze tracking model.

The gaze interaction analysis system may recognize the listener by tracking the speaker's gaze information (530). As the tracked gaze information of the speaker is input to the gaze tracking model, the gaze interaction analysis system may recognize the listener from that gaze information.

A learning model for recognizing the speaker and the listener is described with reference to FIG. 7. The learning model for recognizing the speaker and the listener may include a depth map generation model and a gaze tracking model.

The network of the learning model may have three paths: a gaze (direction) path, a depth path, and a heatmap path. The learning model may operate in the order of the gaze path first, then the depth path, and then the heatmap path. The gaze path may consist of a gaze direction module for estimating the gaze angle and a gaze attention mask module for generating a gaze focus mask. The gaze path determines where the person is gazing and generates a gaze attention mask that indicates the part of the whole image that requires attention. For example, two fully connected layers may be used to embed the position of the person's head. In the gaze path, the gaze direction is determined and the gaze attention mask is generated by considering only the head image and the head position. In the depth path, a depth map may be extracted from the original image. The original image, the gaze attention mask, and the depth map produced by the respective modules are concatenated, and the result is provided as input to the heatmap path, which generates the final gaze point as a heatmap.

Three inputs may be fed to the learning model: the original image, the head image, and the head position. The third-person original image and the head image may be resized to 224x224, and the head center coordinates (h_x, h_y) may be normalized to between 0 and 1. The output of the learning model may consist of the gaze direction and the heatmap (p_x, p_y). As shown in FIG. 7, features of the head image may be extracted using a ResNet-50 pre-trained on ImageNet as the backbone. At the same time, the head position coordinates may be fed to fully connected layers and embedded as a 256-dimensional feature. Next, for fusion, the head feature map may be concatenated with the embedded head position through a fully connected layer. The final gaze coordinates (p_x, p_y) may then be predicted by the last fully connected layer. A ReLU activation function may be used after each layer. Once the gaze direction module predicts the gaze coordinates (p_x, p_y), three gaze attention masks may be generated that radiate from the head center coordinates (h_x, h_y) toward the expected gaze point in a funnel-like cone shape.
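One way to picture the cone-shaped gaze attention mask is as an angular window around the predicted gaze direction. The sketch below is an assumption-laden illustration in plain Python: the source does not say how the three masks differ, so the choice of three half-angles (15, 30, 45 degrees), the cosine weighting, and the small grid size are all hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # normalized (x, y) in [0, 1]


def gaze_cone_mask(head: Point, gaze: Point,
                   size: int = 8, half_angle_deg: float = 30.0) -> List[List[float]]:
    """Mask that is positive inside a cone opening from `head` toward `gaze`."""
    hx, hy = head
    dx, dy = gaze[0] - hx, gaze[1] - hy
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return [[0.0] * size for _ in range(size)]  # no direction to follow
    dx, dy = dx / norm, dy / norm
    cos_limit = math.cos(math.radians(half_angle_deg))

    mask = []
    for row in range(size):
        y = (row + 0.5) / size          # pixel-center coordinates
        line = []
        for col in range(size):
            x = (col + 0.5) / size
            vx, vy = x - hx, y - hy
            vnorm = math.hypot(vx, vy)
            if vnorm == 0.0:
                line.append(1.0)        # the head position itself
                continue
            cos_angle = (vx * dx + vy * dy) / vnorm
            # Keep the cosine as a soft weight inside the cone, zero outside.
            line.append(cos_angle if cos_angle >= cos_limit else 0.0)
        mask.append(line)
    return mask


def three_cone_masks(head: Point, gaze: Point, size: int = 8) -> List[List[List[float]]]:
    """Assumed variant: three masks with progressively wider apertures."""
    return [gaze_cone_mask(head, gaze, size, a) for a in (15.0, 30.0, 45.0)]
```

In a real model the masks would match the input resolution (for example 224x224) and would typically be computed with array operations; the nested loops here only keep the example dependency-free.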

In the heatmap path, a heatmap may be generated to predict the fixation location in the image. First, the gaze attention mask is used to determine the direction in the whole image that requires particular attention. Then the final gaze point may be estimated using the depth map to take the three-dimensional positional relationship between objects into account. The convolutional layers of the heatmap path may follow the structure of a Feature Pyramid Network (FPN), which detects objects robustly by integrating multi-scale features. The gaze attention mask, the depth map, and the original RGB image are concatenated and fed to the FPN, which may output a 56x56 heatmap. When training the heatmap path, gradients may be updated to minimize the cross-entropy loss between the predicted heatmap and the ground-truth heatmap. Here, the depth map is concatenated ahead of the heatmap path, so the FPN may adjust the final gaze point in light of the depth information.
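To connect the 56x56 heatmap to a single gaze point, one common decoding is to take the location of the maximum response. The source does not state how the gaze point is read from the heatmap, so the argmax decoding and the pixel-center normalization below are assumptions.

```python
from typing import Sequence, Tuple


def heatmap_to_gaze_point(heatmap: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return the normalized (x, y) of the strongest heatmap cell."""
    rows, cols = len(heatmap), len(heatmap[0])
    best_row, best_col = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    # Use the cell center so the result lies strictly inside (0, 1).
    return ((best_col + 0.5) / cols, (best_row + 0.5) / rows)
```

The resulting normalized coordinates could then be handed to the listener recognition step to look up the object at that location.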

To learn the gaze direction and the heatmap, the learning model may use different loss functions depending on the backbone network of the model. The cosine similarity between the ground-truth gaze direction and the predicted gaze direction may be used in the loss function for training the gaze angle. The line from the head position H=(h_x, h_y) to the ground-truth gaze point G=(g_x, g_y) is normalized to unit length and defined as L_HG, and the line from the head position to the predicted point P=(p_x, p_y) is normalized and defined as L_HP. The cosine similarity of the two lines is computed and then subtracted from 1, so the loss is 0 when the two directions coincide.
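The gaze direction loss described above, 1 minus the cosine similarity of the unit vectors L_HG and L_HP, can be written out as follows. The function names are introduced here for illustration, and rejecting a zero-length vector is an assumed way of handling a case the source does not discuss.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def unit_vector(start: Point, end: Point) -> Point:
    """Direction from start to end, normalized to length 1."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("start and end coincide; direction is undefined")
    return (dx / norm, dy / norm)


def gaze_direction_loss(head: Point, true_point: Point, pred_point: Point) -> float:
    """1 - cos(L_HG, L_HP): 0 for identical directions, 2 for opposite ones."""
    l_hg = unit_vector(head, true_point)   # head -> ground-truth gaze point G
    l_hp = unit_vector(head, pred_point)   # head -> predicted point P
    cosine = l_hg[0] * l_hp[0] + l_hg[1] * l_hp[1]
    return 1.0 - cosine
```

Because both vectors are normalized, the loss depends only on the angle between them, not on how far the predicted point lies from the head.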

Binary cross-entropy loss (BCE loss) may be used as the heatmap loss between the ground-truth heatmap and the predicted heatmap.
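For reference, a per-cell binary cross-entropy over two heatmaps can be sketched in plain Python as below. Averaging over all cells and clamping predictions with `eps` are assumptions; the source names the loss but not its reduction or numerical details.

```python
import math
from typing import Sequence


def bce_heatmap_loss(pred: Sequence[Sequence[float]],
                     target: Sequence[Sequence[float]],
                     eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a predicted and a ground-truth heatmap."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count
```

A uniform prediction of 0.5 yields a loss of ln 2 regardless of the target, which is a convenient sanity check when wiring up training.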

The gaze interaction analysis system may classify the gaze interaction based on the gaze information of the recognized speaker and listener (540). Using the gaze information of the tracked object and the gaze information of the tracked counterpart object, the gaze interaction analysis system may classify the gaze interaction according to gaze interaction criteria that include no interaction, mutual gaze, gaze avoidance, gaze tracking, and mutual interest.

The gaze interaction analysis system may store the gaze interaction data in a database (550). The gaze interaction analysis system may store in the database the information on the recognized speaker, the tracking result of the speaker's gaze information, the tracking result of the listener's gaze information, and the gaze interaction data corresponding to mutual gaze.

The gaze interaction analysis system may determine, based on a preset criterion, whether sufficient data has been collected in the database. For example, it may determine whether the data in the database meets a preset capacity criterion. Alternatively, data may be collected for a preset period. When it determines that sufficient data has been collected in the database, the gaze interaction analysis system may end the process. When it determines that sufficient data has not been collected, the gaze interaction analysis system may perform step 510 again.
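The control flow of FIG. 5, repeating steps 510 to 550 until enough data has been stored, can be sketched as a simple loop. Here `analyze_shot` is a hypothetical callable standing in for steps 510 to 540, the in-memory list stands in for the database, and a record-count limit is used as one possible form of the preset capacity criterion.

```python
from typing import Callable, Iterable, List, TypeVar

Shot = TypeVar("Shot")
Record = TypeVar("Record")


def collect_gaze_interactions(shots: Iterable[Shot],
                              analyze_shot: Callable[[Shot], Record],
                              max_records: int) -> List[Record]:
    """Process shots until the capacity criterion is met or input runs out."""
    database: List[Record] = []            # stand-in for the database of step 550
    for shot in shots:
        if len(database) >= max_records:   # sufficient data collected: stop
            break
        database.append(analyze_shot(shot))  # steps 510-540, then store (550)
    return database
```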

Recognizing the gaze of the main characters in story-based video plays an important role in character profiling, social relationship analysis, and analysis of the implied intent of conversations. Despite this importance, gaze information has so far had to be collected manually, one item at a time, which made collection costly. In contrast, the present invention can automatically detect dialogue scenes in a story-based video medium and collect the gaze information of the speaker and the listener as well as their gaze interaction information.

In addition, existing gaze tracking techniques track gaze in two- or three-dimensional spatial coordinates using RGB-D images obtained from an infrared camera or a stereo camera. However, for video media already shot with a mono camera, such as dramas and movies, applying the existing techniques as they are has the limitation that gaze tracking accuracy drops markedly, because depth information is lost when three-dimensional reality is projected onto two-dimensional video. In contrast, the technique of the present invention uses a deep learning model that automatically generates a depth map from the RGB image frames of video media such as dramas and movies, so gaze can be recognized with high accuracy without additional hardware, sensors, or camera equipment. That is, meaningful information, such as who is looking at what and who is talking with whom, can be collected automatically from plain RGB video.

By applying the technique of the present invention, a search service for gaze-related scenes can be devised based on the collected speaker-listener recognition, gaze information, and gaze interaction information. When the speakers and listeners in drama dialogue scenes are automatically recognized and turned into data, a desired dialogue scene can be searched for. Analyzing gaze interaction relationships together with lines of dialogue, facial expressions, and actions can be used to analyze drama stories and the social relationships between characters, which can be applied to academic research and related commercial services.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored in one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

1. A gaze interaction analysis method performed by a gaze interaction analysis system, the method comprising:
receiving image information from a story-based video medium;
tracking gaze information of an object by applying a gaze tracking model to a depth map restored from the received image information and object information recognized from the received image information;
tracking gaze information of a counterpart object from the gaze information of the tracked object as the gaze information of the tracked object is input to the gaze tracking model; and
classifying gaze interaction information based on the gaze information of the tracked object and the gaze information of the tracked counterpart object.

2. The gaze interaction analysis method of claim 1, wherein
the object is a speaker among the characters in the story-based video medium, and
the tracking of the gaze information of the object comprises
inputting the image information to a depth map generation model and generating a depth map from the image information using the depth map generation model.

3. The gaze interaction analysis method of claim 1, wherein
the object is a speaker among the characters in the story-based video medium, and
the tracking of the gaze information of the object comprises
inputting the image information to a speaker recognition model and obtaining speaker recognition information from the image information using the speaker recognition model.

4. The gaze interaction analysis method of claim 1, wherein
the tracking of the gaze information of the object comprises
storing the acquired speaker recognition information in a database.

5. The gaze interaction analysis method of claim 1, wherein
the tracking of the gaze information of the object comprises
storing the tracking result of the gaze information of the tracked object in a database.

6. The gaze interaction analysis method of claim 1, wherein
the counterpart object is a listener, among the characters in the story-based video medium, who is either gazing at the speaker or in conversation with the speaker, and
the tracking of the gaze information of the counterpart object comprises
storing the tracking result of the tracked listener's gaze information in a database.

7. The gaze interaction analysis method of claim 1, wherein
the classifying of the gaze interaction information comprises
classifying the gaze interaction, using the gaze information of the tracked object and the gaze information of the tracked counterpart object, according to gaze interaction criteria that include no interaction, mutual gaze, gaze avoidance, gaze tracking, and mutual interest.

8. The gaze interaction analysis method of claim 7, wherein
the classifying of the gaze interaction information comprises
storing, in a database, gaze interaction data corresponding to mutual gaze among the classified gaze interactions.

9. The gaze interaction analysis method of claim 1, wherein
the classifying of the gaze interaction information comprises
providing search information related to gaze interaction by searching a database that includes the result of the recognized object, the tracking result of the gaze information of the tracked object, the tracking result of the gaze information of the tracked counterpart object, and gaze interaction data corresponding to mutual gaze.

10. The gaze interaction analysis method of claim 9, wherein
the classifying of the gaze interaction information comprises
analyzing, based on the interaction relationship between the speaker and the listener indicated by the classified gaze interaction information, a character's disposition, whether a dialogue is a conflict dialogue, whether a dialogue is positive or negative, the main conversational characters in the story, and the main objects.

11. A gaze interaction analysis system comprising:
an input unit that receives image information from a story-based video medium;
a speaker gaze tracking unit that tracks gaze information of an object by applying a gaze tracking model to a depth map restored from the received image information and object information recognized from the received image information;
a listener recognition unit that tracks gaze information of a counterpart object from the gaze information of the tracked object as the gaze information of the tracked object is input to the gaze tracking model; and
a gaze interaction classification unit that classifies gaze interaction information based on the gaze information of the tracked object and the gaze information of the tracked counterpart object.

12. The gaze interaction analysis system of claim 11, wherein
the speaker gaze tracking unit
inputs the image information to a depth map generation model and generates a depth map from the image information using the depth map generation model, and
inputs the image information to a speaker recognition model and obtains speaker recognition information from the image information using the speaker recognition model.

13. The gaze interaction analysis system of claim 11, wherein
the gaze interaction classification unit
classifies the gaze interaction, using the gaze information of the tracked object and the gaze information of the tracked counterpart object, according to gaze interaction criteria that include no interaction, mutual gaze, gaze avoidance, gaze tracking, and mutual interest.

14. The gaze interaction analysis system of claim 11, further comprising
a storage unit that stores, in a database, the result of the recognized object, the tracking result of the gaze information of the tracked object, the tracking result of the gaze information of the tracked counterpart object, and gaze interaction data corresponding to mutual gaze.

15. The gaze interaction analysis system of claim 13, wherein
the gaze interaction classification unit
analyzes, based on the interaction relationship between the speaker and the listener indicated by the classified gaze interaction information, a character's disposition, whether a dialogue is a conflict dialogue, whether a dialogue is positive or negative, the main conversational characters in the story, and the main objects.