KR20130117624A - Method and apparatus for detecting talking segments in a video sequence using visual cues

Info

Publication number: KR20130117624A
Application number: KR1020120086189A
Authority: KR
Inventors: 수드하 벨루사미; 비스와나스 고팔라크리쉬난; 빌바 브하라찬드라 나바테; 안술 샤르마
Original assignee: 삼성전자주식회사
Priority date: 2012-04-17
Filing date: 2012-08-07
Publication date: 2013-10-28
Also published as: KR101956166B1

Abstract

PURPOSE: A method and apparatus for detecting talking segments in a video sequence using visual cues are provided, which accurately classify human emotion by temporally segmenting lip movement into talking and non-talking states. CONSTITUTION: A histogram of structure descriptive features of a face is obtained for each frame of a visual cue (205). An integrated gradient histogram (IGH) is derived from the structure descriptive features for the frame (206). The entropy of the IGH is computed for the frame in the visual cue (207). Segmentation of the IGH is performed to detect talking segments of the face in the visual cue (208). The talking segments are analyzed for the frame in the visual cue to infer emotion (210, 211). [Reference numerals] (201) Acquire a video frame; (202) Detect the face, pupils, and nose; (203) Normalize the face using the pupils; (204) Estimate the mouth using the nose; (205) Obtain an LBP histogram; (206) Encode the IGH; (207) Obtain the IGH entropy; (208) Perform online temporal segmentation of the IGH entropy; (209) Is the character talking?; (210) Analyze action units of the upper and lower face; (211) Analyze action units of the upper face; (212) Infer emotion; (AA) No; (BB) Yes

Description

METHOD AND APPARATUS FOR DETECTING TALKING SEGMENTS IN A VIDEO SEQUENCE USING VISUAL CUES

The description below relates to image processing, computer vision, and machine learning, and more particularly to emotion recognition in video sequences.

With recent advances in technology, there is sustained and considerable interest in enhancing human-computer interaction (HCI). In particular, engineers and scientists are attempting to exploit basic human attributes such as voice, gaze, gesture, and emotional state to improve HCI. The ability of a device to detect human emotions and respond to them is known as 'affective computing'.

Automatic facial expression recognition is an important element of HCI research. It also plays an important role in human behavior modeling, which has great potential in applications such as video conferencing, video games, and video surveillance. Most research on automatic facial expression recognition aims to identify the six basic emotions (sadness, fear, anger, happiness, disgust, surprise) on datasets of posed facial expressions captured in controlled environments. Engineers and scientists have used both static and dynamic methods to infer different emotions from facial expression datasets. Static methods analyze the frames of a video sequence independently, whereas dynamic methods consider a group of consecutive frames to infer a particular emotion.

The mouth region of the face contains highly discriminative information about human emotion and plays an important role in facial expression recognition. However, in common situations such as video conferencing, there may be significant temporal segments in which the person is talking, and a facial expression recognition system that relies on the mouth region to infer emotion can misinterpret that region because of the arbitrary and complex shapes around the mouth. Temporal segment information about talking segments in a video sequence is therefore very important, because it can be used to improve emotion recognition systems.

Few emotion recognition methods have addressed the 'talking face' condition, in which the action units (AUs) inferred for the mouth region can cause a person's emotion to be recognized incorrectly. Currently known methods aim to determine which person is speaking in a multi-person environment, but they do not attempt to temporally segment lip movement into talking and non-talking states (the latter including neutral states as well as various emotion segments). As a result, current emotion recognition systems fail to capture emotions accurately.

For these reasons, a method is needed that accurately classifies a person's emotion by temporally segmenting lip movement into talking and non-talking states.

According to an embodiment, a method of detecting and classifying talking segments of a face in a visual cue includes tracking and normalizing a face region for each frame of the visual cue, and obtaining a histogram of structure descriptive features of the face region for each frame in the visual cue.

According to an embodiment, the method derives an Integrated Gradient Histogram (IGH) from the structure descriptive features for the frame in the visual cue, and then computes the entropy of the IGH for the frame. The method further performs segmentation of the IGH to detect talking segments for the face region in the visual cue, and analyzes the talking segments for the frame in the visual cue to infer emotion.

According to an embodiment, a computer program product for detecting and classifying talking segments of a face in a visual cue includes an integrated circuit.

According to an embodiment, the integrated circuit of the computer program product may include at least one processor and at least one memory having computer program code within the integrated circuit, the at least one memory and the computer program code being configured, with the at least one processor, to cause the computer program product to track and normalize the face region for each frame of the visual cue.

According to an embodiment, the computer program product obtains a histogram of structure descriptive features for the frame in the visual cue, derives an IGH from the structure descriptive features for the frame, and computes the entropy of the IGH for the frame. The computer program product further performs segmentation of the IGH to detect talking segments for the face in the visual cue, and analyzes the talking segments for the frame in the visual cue to infer emotion.

FIG. 1 is a flowchart illustrating an example of a method of recognizing a character's emotion in a video sequence, according to an embodiment.
FIG. 2 is a detailed flowchart illustrating an example of a method of detecting talking segments in a video sequence using visual cues, according to an embodiment.
FIG. 3 illustrates a computing environment for executing an application, according to an embodiment.

A primary object of the embodiments below is to provide a system and method for detecting talking segments in a visual cue. Another object of the embodiments is to provide unsupervised temporal segmentation for detecting talking faces.

The embodiments described herein can be better appreciated and understood from the accompanying drawings and the detailed description below. However, the preferred embodiments and details described below are given by way of example, and the invention is not limited by them. Many changes and modifications of the embodiments can be made without departing from the spirit of the invention, and the embodiments described below include such changes and modifications.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements. The embodiments described herein may be better understood from the accompanying drawings and the detailed description.

Various features of the embodiments and their details are explained in the detailed description below and are further illustrated by the embodiments shown in the accompanying drawings, but the embodiments are not limited to what is described. Descriptions of well-known components and processing techniques are omitted so as not to obscure the description of the embodiments unnecessarily. The embodiments below are described only to make the ways of practicing them easy to understand and to enable those skilled in the art to practice them. Accordingly, the description below should not be construed as limiting the scope of the embodiments.

The embodiments described below relate to a system and method for detecting talking segments and non-talking segments in a sequence of image frames using visual cues.

The method detects talking segments using visual cues rather than audio cues, because an audio cue may originate from someone other than the speaker being observed and may cause talking and non-talking segments to be detected incorrectly. In addition, the method aims to separate talking segments from non-talking segments that may themselves carry audio, such as laughter or exclamations, along with other expressions. For these reasons, visual cues rather than audio cues should be used to distinguish talking segments from non-talking segments.

The method identifies temporal segments of talking faces in a video sequence by estimating the uncertainties involved in the representation of mouth or lip movement. According to an embodiment, after the mouth region is tracked, mouth movement is encoded as an Integrated Gradient Histogram (IGH) of Local Binary Pattern (LBP) values. The uncertainty in mouth movement is quantified by estimating the entropy of the IGH. The time series of per-frame entropy values is then clustered using an online K-means algorithm to distinguish talking mouth patterns from other mouth movements.

Here, the visual cue may be a photograph or a video including a sequence of frames.

Referring to the drawings, preferred embodiments are shown, and like reference numerals denote like elements throughout.

FIG. 1 is a flowchart illustrating an example of a method of recognizing a character's emotion in a video sequence, according to an embodiment.

In step 101, the method of recognizing a character's emotion acquires a video frame from the video, and in step 102 it detects the face by anchoring on the positions of the character's pupils. Here, the character is the subject in the video frame whose emotion is to be inferred. In step 103, the method checks whether the character is talking. If the character is not talking, the method obtains features for the entire face in step 104, predicts action units (AUs) in step 105, and infers the character's emotion based on the action units in step 106. Action units are defined in the Facial Action Coding System (FACS) and represent the muscle movements that produce changes in facial appearance. If the character is identified as talking, the method obtains features for the upper face only in step 107, predicts action units in step 108, and infers the character's emotion in step 109.

According to an embodiment, a talking face is a face that is speaking, with or without emotion. A non-talking face is a face that is not speaking but displays emotion. The steps of the method 100 may be performed in the order presented, in a different order, or simultaneously. In some embodiments, some of the steps shown in FIG. 1 may be omitted.

FIG. 2 is a detailed flowchart illustrating an example of a method of detecting talking segments in a video sequence using visual cues, according to an embodiment.

As shown in FIG. 2, the method of detecting talking segments may use an algorithm to carry out its steps.

In step 201, the algorithm acquires a sequence of video frames, and in step 202 it detects the primary face and tracks the positions of the pupils and nose. According to an embodiment, a standard face detector and a method based on a version of the Active Appearance Model (AAM) may be used to identify the face, pupil, and nose positions in every frame of the video. The AAM is a generalization of the widely used Active Shape Model (ASM) approach, but it uses all the information in the image region covered by the target object rather than only the information near the modeled contours. In step 203, the method normalizes the face using the pupils: the pupil positions are used to normalize all face images to a size of M x N. In step 204, the method tracks the position of the nose, which is used to crop the mouth region in each frame for further processing.

According to an embodiment, the distance between the pupils is fixed at 48 pixels to normalize the face, and the mouth region is cropped to a size of 56x46 pixels.
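The normalization arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the 56x46 crop is read as width x height, and the mouth centre is taken as a given input because the description does not say how it is derived from the nose position.

```python
import math

TARGET_PUPIL_DISTANCE = 48.0  # pixels, as stated in the description


def normalization_scale(left_pupil, right_pupil):
    """Scale factor that brings the inter-pupil distance to 48 pixels."""
    dist = math.hypot(right_pupil[0] - left_pupil[0],
                      right_pupil[1] - left_pupil[1])
    if dist == 0:
        raise ValueError("pupil positions coincide")
    return TARGET_PUPIL_DISTANCE / dist


def mouth_crop_box(mouth_center, width=56, height=46):
    """(left, top, right, bottom) of a crop centred on an assumed mouth centre."""
    cx, cy = mouth_center
    left = cx - width // 2
    top = cy - height // 2
    return (left, top, left + width, top + height)
```

For example, pupils detected 96 pixels apart give a scale factor of 0.5, after which the mouth crop is taken from the rescaled face image.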

Since the sequence of cropped mouth images may exhibit variations in illumination and alignment across frames, the method selects a feature descriptor that can handle such conditions. According to an embodiment, in step 205, the method derives a histogram of Local Binary Pattern (LBP) values to encode the appearance of the mouth region. LBP is a powerful feature used for texture classification that was later shown to be very effective for face recognition and related applications. According to an embodiment, the LBP pattern is computed for every pixel in the cropped mouth image. Uniform LBP patterns (patterns with at most two bitwise transitions) are also used and labeled in the same manner. The histogram of LBP values estimated for the cropped mouth image is used to describe the appearance of the mouth region in the corresponding frame.
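A minimal sketch of an LBP histogram with uniform-pattern labeling follows. It assumes the basic 8-neighbour, radius-1 LBP operator, a grayscale image given as a list of rows, and a single shared bin for all non-uniform patterns; none of these choices is stated in the description. Border pixels are skipped here for simplicity, whereas the description computes the pattern for every pixel.

```python
# Clockwise 8-neighbourhood offsets, starting at the top-left neighbour.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code of pixel (r, c): one bit per neighbour >= centre."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def is_uniform(code):
    """True if the circular bit string has at most two 0/1 transitions."""
    rotated = ((code << 1) & 0xFF) | (code >> 7)
    return bin(code ^ rotated).count("1") <= 2


UNIFORM_CODES = [c for c in range(256) if is_uniform(c)]
BIN_OF = {code: i for i, code in enumerate(UNIFORM_CODES)}
NON_UNIFORM_BIN = len(UNIFORM_CODES)  # shared bin for all other patterns


def lbp_histogram(img):
    """Histogram of uniform LBP labels over the interior pixels of img."""
    hist = [0] * (NON_UNIFORM_BIN + 1)
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[BIN_OF.get(lbp_code(img, r, c), NON_UNIFORM_BIN)] += 1
    return hist
```

With 8 neighbours there are 58 uniform patterns, so the histogram in this sketch has 59 bins per frame.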

The system and method for detecting talking segments distinguish the complex appearance changes of the mouth during talking from the smooth appearance changes of mouth movement at the onset and offset of emotions such as smiling, surprise, and disgust. For a neutral face that is not talking, there will be little change in the appearance of the mouth. According to an embodiment, to discriminate the complex appearance changes of the mouth, gradient histograms are computed around a particular frame, denoted frame i, so as to capture the appearance changes of the mouth over a time period of 2τ.

Gradient LBP histograms are calculated as follows:

$$H_i^{n} = H_i - H_{i+n}, \qquad H_i^{-n} = H_i - H_{i-n}$$

where $H_i$ is the LBP histogram of the i-th frame, $H_i^{n}$ is the gradient histogram computed from the difference between the histograms of the i-th frame and the (i+n)-th frame, and $H_i^{-n}$ is the gradient histogram computed from the difference between the histograms of the i-th frame and the (i-n)-th frame.

The gradient histograms encode the appearance changes in the mouth pattern along the time dimension. In step 206, the method takes the complete information about the appearance changes of the mouth over a time segment of length (2τ+1) and encodes it into a single IGH as follows:

$$IGH_i = \sum_{n=1}^{\tau} \left( H_i^{n} + H_i^{-n} \right)$$
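The gradient histograms and their integration over the window around frame i can be sketched as below. This is a sketch under assumptions rather than the patented computation: per-frame histograms are plain lists, function names are hypothetical, and absolute differences are used so that the IGH stays non-negative and can later be treated as a distribution (the description only says "difference").

```python
def gradient_histogram(h_a, h_b):
    """Bin-wise absolute difference between two LBP histograms (assumed form)."""
    return [abs(a - b) for a, b in zip(h_a, h_b)]


def integrated_gradient_histogram(hists, i, tau):
    """Sum the gradient histograms between frame i and frames i-tau .. i+tau."""
    if i - tau < 0 or i + tau >= len(hists):
        raise IndexError("window of 2*tau + 1 frames does not fit around frame i")
    igh = [0] * len(hists[i])
    for n in range(1, tau + 1):
        for other in (hists[i + n], hists[i - n]):
            for k, g in enumerate(gradient_histogram(hists[i], other)):
                igh[k] += g
    return igh
```

For three frames with histograms `[4, 0]`, `[2, 2]`, `[0, 4]` and `tau = 1`, the IGH of the middle frame is `[4, 4]`, whereas three identical frames give an all-zero IGH.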

A series of talking frames will have IGH values that are more evenly distributed than those of frames showing a particular emotion. In other words, the uncertainty involved in the IGH representation is greater for talking segments than for emotion segments. In step 207, the method derives the entropy of the IGH. For this reason, the entropy of the IGH is used to quantify the uncertainty in a video segment, and in step 208 the method performs online temporal segmentation on the IGH entropy.

The entropy of the IGH of the i-th frame is calculated as follows:

$$Ep_i = -\sum_{k} p_k \log p_k$$

where $Ep_i$ is the entropy value of the IGH of the i-th frame, and $p_k$ is the histogram value for the k-th bin.
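As an illustration of this step, the Shannon entropy of a histogram can be computed as below. The base-2 logarithm and the convention that empty bins contribute zero are assumptions; the description does not fix either.

```python
import math


def histogram_entropy(hist):
    """Shannon entropy (in bits) of a non-negative histogram."""
    total = sum(hist)
    if total <= 0:
        return 0.0  # an all-zero histogram carries no uncertainty
    entropy = 0.0
    for value in hist:
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return entropy
```

An evenly spread histogram such as `[1, 1, 1, 1]` gives 2 bits, while a histogram concentrated in one bin gives 0, which matches the intuition that talking frames (more evenly spread IGH) score higher.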

The IGH is normalized before the entropy of the IGH of the i-th frame is estimated. This is necessary so that entropy values can be compared across different temporal segments, because the energy of the IGH over different temporal segments may vary as a result of the gradient process. The IGH is normalized by adding the common energy between the original LBP histograms as a separate bin in the IGH. In static segments, the common energy is very large relative to the rest of the IGH, which results in a very small entropy value. In emotion segments, the common energy may correspond to that of a slow talking process. However, the gradient energy part of the IGH is spread more widely in talking segments, so talking segments can have a higher entropy value than emotion segments. The time series of entropy values estimated from the IGH of every frame is used for unsupervised online segmentation of talking and non-talking faces.
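One way to picture the normalization is sketched below. The description does not define "common energy" precisely; this sketch assumes it is the histogram intersection (bin-wise minimum, summed) of the original LBP histograms in the window, appended to the IGH as an extra bin before scaling to unit sum. The function name and that definition are assumptions for illustration only.

```python
def normalize_with_common_energy(igh, window_hists):
    """Append the assumed common energy as an extra bin, then scale to sum 1."""
    # Bin-wise minimum across the window's LBP histograms, summed over bins.
    common = sum(min(column) for column in zip(*window_hists))
    extended = list(igh) + [common]
    total = sum(extended)
    if total == 0:
        return [0.0] * len(extended)
    return [value / total for value in extended]
```

Under this assumption a static window puts all of its mass in the extra bin (entropy near zero), while a window with large changes spreads its mass over the gradient bins.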

According to an embodiment, the entropy values obtained for every frame in the video sequence form time series data. The time series data is then segmented in an unsupervised online fashion to provide the emotion recognition system with the required input about the presence of talking faces in the video sequence. According to an embodiment, the system uses an online K-means algorithm with K = 2 to segment the time series data. No additional assumptions are made about the initial values or the range of the data.
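A minimal sketch of online K-means with K = 2 on a stream of scalar entropy values follows. The seeding rule (first two distinct values), the running-mean centroid update, and the class and method names are assumptions; the description only names the algorithm and the value of K.

```python
class OnlineTwoMeans:
    """Sequential 2-means clustering of a scalar stream."""

    def __init__(self):
        self.centroids = []
        self.counts = []

    def update(self, x):
        """Assign x to a cluster, update that centroid, return the cluster index."""
        # Seed the two centroids with the first two distinct values seen.
        if len(self.centroids) < 2 and x not in self.centroids:
            self.centroids.append(float(x))
            self.counts.append(1)
            return len(self.centroids) - 1
        nearest = min(range(len(self.centroids)),
                      key=lambda j: abs(x - self.centroids[j]))
        self.counts[nearest] += 1
        # Running mean: move the centroid toward x by 1/count.
        self.centroids[nearest] += (x - self.centroids[nearest]) / self.counts[nearest]
        return nearest

    def is_talking(self, cluster):
        """Treat the higher-entropy cluster as 'talking' (per the description)."""
        return len(self.centroids) == 2 and self.centroids[cluster] == max(self.centroids)
```

Feeding the per-frame entropy values one at a time yields a talking or non-talking label per frame without a training phase, which is the sense in which the segmentation is online and unsupervised.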

The problem of inferring emotion in the presence of occlusions around the mouth region has been addressed previously in order to improve the accuracy of emotion detection. In step 209, the method checks whether the character is talking. According to an embodiment, whenever the character is detected to be talking, the method treats the mouth region as occluded. In step 210, if the character is identified as not talking, the method analyzes the action units of both the upper face and the lower face. According to an embodiment, once the method can identify whether the character is talking, a simple approach is to discard the visual cues from the mouth region in the corresponding temporal segments. According to an embodiment, in step 211, the method analyzes action units from the upper face only. In step 212, the method infers emotion based on the talking or non-talking visual cue. In general, analyzing action units from the upper face alone would be inferior to using all action units; when the character is talking, however, it may be superior, because the full set of action units then carries a great deal of information that is likely to be misinterpreted.
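The branching in steps 209 to 211 can be sketched as a simple filter over predicted action units. The data layout (a dictionary mapping an AU name to a region and a score), the function name, and the example scores are hypothetical and used only for illustration.

```python
def select_action_units(predicted_aus, is_talking):
    """Keep upper-face AUs when talking; keep upper- and lower-face AUs otherwise."""
    regions = ("upper",) if is_talking else ("upper", "lower")
    return {name: score
            for name, (region, score) in predicted_aus.items()
            if region in regions}


# Hypothetical predictions: AU1 (inner brow raiser) is an upper-face unit,
# AU12 (lip corner puller) is a lower-face unit.
predicted = {"AU1": ("upper", 0.8), "AU12": ("lower", 0.6)}
talking_aus = select_action_units(predicted, is_talking=True)
non_talking_aus = select_action_units(predicted, is_talking=False)
```

The selected action units would then be passed to whatever emotion inference stage follows (step 212).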

According to another embodiment, emotion recognition performance is improved by continuing to use the mouth region for emotion recognition but changing the recognition strategy once the character is detected to be talking. Although image features from a talking face cannot be interpreted easily, the mouth region still carries cues about the current emotion. For example, a talking face in a happy state can be distinguished from a talking face in a sad state. The approach used to infer emotion from the mouth region of a talking face may differ from that of an ordinary emotion recognition system. Those skilled in the art will recognize that movement of the mouth corners can help distinguish certain emotions even while the character is talking. The steps of the method 200 of detecting talking segments in a video sequence may be performed in the order presented, in a different order, or simultaneously. In some embodiments, some of the steps shown in FIG. 2 may be omitted.

According to an embodiment, the method of detecting talking segments may be used in video conferencing, video meeting, or interview scenarios in which the camera is focused on a person, detecting that person's talking and non-talking faces in order to determine the person's emotion. The method may also be employed in an emotion recognition system for better emotion classification.

도 3은 일실시예에 따른 어플리케이션을 수행하는 컴퓨팅 환경을 도시한 도면이다.3 illustrates a computing environment for executing an application according to an embodiment.

도3에 도시된 것처럼, 컴퓨팅 환경(computing environment)은 제어 유닛, 산술 논리 유닛(ALU, Arithmetic Logic Unit), 메모리, 스토리지(storage), 복수 개의 네트워킹 디바이스, 및 복수 개의 입/출력(I/O, input output)디바이스를 갖춘 적어도 하나의 프로세싱 유닛(processing unit)을 포함한다. 프로세싱 유닛은 알고리즘의 명령어를 처리한다. 프로세싱 유닛은 명령어를 처리하기 위해 제어 유닛으로부터 명령어를 수신한다. 또한, 명령어의 실행과 관련된 논리적, 산술적 작업은 산술 논리 유닛의 도움으로 처리된다.As shown in Figure 3, the computing environment includes a control unit, an Arithmetic Logic Unit (ALU), memory, storage, a plurality of networking devices, and a plurality of input / output (I / O). at least one processing unit having an input output device. The processing unit processes the instructions of the algorithm. The processing unit receives the instructions from the control unit to process the instructions. In addition, logical and arithmetic tasks related to the execution of instructions are handled with the aid of arithmetic logic units.

The overall computing environment may be composed of a plurality of homogeneous and/or heterogeneous cores, a plurality of CPUs of different kinds, special media, and other accelerators. In addition, the plurality of processing units may be located on a single chip or across a plurality of chips.

The algorithm, consisting of the code and instructions required for execution, is stored in the memory unit, in the storage, or in both. At the time of execution, the instructions may be loaded from the corresponding memory unit and/or storage and executed by the processing unit.

In the case of a hardware implementation, various networking devices or external input/output devices may be connected to the computing environment through the networking unit and the input/output device unit to support the implementation.

The embodiments described herein may be implemented through at least one software program that runs on at least one hardware device and performs network management functions to control the components. The components shown in FIG. 3 include blocks, each of which may be at least one of a hardware device or a combination of a hardware device and a software module.

The foregoing description of the specific embodiments sufficiently reveals the general nature of the embodiments, so that others can, by applying current knowledge, readily modify and/or adapt these specific embodiments for various applications without departing from the spirit of the invention. Such adaptations and modifications should therefore be understood to fall within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology and terminology used herein are for the purpose of description and not of limitation. Therefore, although the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that they can be practiced with modification within the spirit and scope of the embodiments as described.

Claims

A method of detecting and classifying talking segments of a face in a visual cue, the method comprising:
tracking and normalizing the region of the face for each frame of the visual cue;
obtaining a histogram of structure descriptive features of the face for the frame in the visual cue;
deriving an integrated gradient histogram (IGH) from the structure descriptive features for the frame in the visual cue;
calculating an entropy of the integrated gradient histogram for the frame in the visual cue;
segmenting the integrated gradient histogram to detect the talking segment of the face in the visual cue; and
analyzing the talking segment for the frame in the visual cue to infer emotion.

The method of claim 1,
wherein tracking and normalizing the region of the face comprises
using the position of the pupils to normalize the image of the face for the frame of the visual cue.

The method of claim 1,
wherein tracking and normalizing the region of the face comprises
using the position of the nose to accurately crop the mouth region for the frame of the visual cue.

The method of claim 1,
wherein deriving the integrated gradient histogram comprises
obtaining the uncertainty associated with the representation of the integrated gradient histogram for the talking segment in comparison with non-talking segments.

The method of claim 1,
wherein the entropy of the integrated gradient histogram is
calculated to determine the amount of uncertainty associated with the talking segment of the visual cue.

The method of claim 1,
wherein analyzing the talking segment comprises
using action units of the upper face to infer emotion for a talking face.

The method of claim 1,
wherein analyzing the talking segment comprises
using action units of the entire face to infer emotion for a non-talking face.

The method of claim 1,
wherein the visual cue is
at least one of an image, a frame, and a video.

A system for detecting and classifying talking segments of a face in a visual cue, wherein the system performs at least one of the steps of the method of claim 1.

A computer program product for detecting and classifying talking segments of a face in a visual cue, the computer program product comprising:
an integrated circuit comprising at least one processor; and
at least one memory having computer program code in the integrated circuit,
wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the computer program product to:
track and normalize the region of the face for each frame of the visual cue;
obtain a histogram of structure descriptive features of the face for the frame in the visual cue;
derive an integrated gradient histogram from the structure descriptive features for the frame in the visual cue;
calculate an entropy of the integrated gradient histogram for the frame in the visual cue;
segment the integrated gradient histogram to detect the talking segment of the face in the visual cue; and
analyze the talking segment for the frame in the visual cue to infer emotion.

The computer program product of claim 10,
wherein, in tracking and normalizing the region of the face, the computer program product is further configured to
use the position of the pupils to normalize the image of the face for the frame of the visual cue.

The computer program product of claim 10,
wherein, in tracking the region of the face, the computer program product is further configured to
use the position of the nose to accurately crop the mouth region for the frame of the visual cue.

The computer program product of claim 10,
wherein, in deriving the integrated gradient histogram, the computer program product is further configured to
obtain the uncertainty associated with the representation of the integrated gradient histogram for the talking segment in comparison with the non-talking segment.

The computer program product of claim 10,
wherein the entropy of the integrated gradient histogram is
calculated to determine the amount of uncertainty associated with the talking segment of the visual cue.

The computer program product of claim 10,
wherein, in analyzing the talking segment, the computer program product is further configured to
use action units of the upper face to infer emotion for a talking face.

The computer program product of claim 10,
wherein, in analyzing the talking segment, the computer program product is further configured to
use action units of the entire face to infer emotion for a non-talking face.