KR102192342B1

KR102192342B1 - Method and device for multimodal character identification on multiparty dialogues

Info

Publication number: KR102192342B1
Application number: KR1020190054140A
Authority: KR
Inventors: 최기선; 한기종; 신기연; 김동환
Original assignee: 한국과학기술원
Priority date: 2018-11-30
Filing date: 2019-05-09
Publication date: 2020-12-17
Also published as: KR20200066134A

Abstract

A method for identifying characters in a multimodal multiparty dialogue according to an embodiment may include: obtaining respective representation information by analyzing, in a multiparty dialogue, a mention in script information and the video information at the time when the sentence containing the mention is uttered in the script information; and identifying the character to which the mention in the multiparty dialogue refers, based on the obtained representation information.

Description

METHOD AND DEVICE FOR MULTIMODAL CHARACTER IDENTIFICATION ON MULTIPARTY DIALOGUES

The following description relates to a method and apparatus for identifying characters in a multimodal multiparty dialogue.

Character identification in multiparty dialogue is the task of linking character mentions to the actual characters in a script with multiple characters, such as a TV drama script. It can be viewed as the problem of determining which character a noun or noun phrase that refers to a person, such as 'she' or 'father', actually denotes. Determining the actual character behind a mention in a dialogue sentence is important for higher-level natural language processing tasks such as question answering and summarization. For example, a dataset for character identification was built from 47 episodes across two seasons of the American TV series Friends and released as SemEval 2018 Task 4, in which a number of research teams participated, and various related studies have been carried out.

The problem of determining the actual character behind a mention in a dialogue sentence resembles entity linking and coreference resolution in natural language processing. It differs in two respects, however. Because the target entities are characters in the dialogue rather than knowledge-base entities, knowledge-base information cannot be used. And, unlike conventional coreference resolution, it is not enough to group mentions that refer to the same entity: each mention must also be linked to an actual character. In addition, multiparty dialogues and scripts are written in a colloquial style rich in metaphor and rhetorical expression, and the context shifts quickly, which makes the problem harder.

FIG. 1 is an example for explaining the identification of a character in a dialogue from script information alone. As shown in FIG. 1, without video information showing that a woman is asleep and that people are talking about her, it is difficult to determine what 'She' refers to from the script of FIG. 1 alone.

Thus, although previous studies have proposed various techniques and improved performance to some extent, they used only the natural-language script as input, so their accuracy in identifying whom a mention refers to from the script sentences alone is limited.

A method and system may be provided for generating input data in which script text and video information are properly aligned, in order to improve performance on the character identification problem in multiparty dialogue.

A method and system may be provided for performing character identification using data that combines script text and video information, in solving the character identification problem in multiparty dialogue.

A method for identifying characters in a multimodal multiparty dialogue may include: obtaining respective representation information by analyzing, in a multiparty dialogue, a mention in script information and the video information at the time when the sentence containing the mention is uttered in the script information; and identifying the character to which the mention in the multiparty dialogue refers, based on the obtained representation information.

The character identification method may preprocess data for identifying a character by aligning the text data of the script information with the video information corresponding to the time when the mention is uttered in the script information, thereby combining the script information and the video information.

The preprocessing may include synchronizing the text data of the script information with subtitle data in order to attach time information to the text data.

The preprocessing may include: taking one line of the subtitle data and one line of the text data as the basic units; extracting, for each line of the script information, a set of candidates whose token similarity is equal to or greater than a preset threshold; and attaching time information to the text data by matching zero or one subtitle entry to each line of the script information from the extracted candidates using a longest increasing subsequence algorithm.

The preprocessing may include dividing the segment of the video information corresponding to the time when a sentence containing a mention is uttered into frame images at a preset number of images per second, and passing each frame image through a pre-trained network to extract a frame vector of a preset size.

The obtaining of the respective representation information may include generating a vector representation of the features of each mention in the script information that reflects the contextual flow of the script information and the speaker information, and generating a vector representation of the features of the video information according to the time when the mention is uttered.

The obtaining of the respective representation information may include classifying the character indicated by a mention among predefined characters in the multiparty dialogue.

The obtaining of the respective representation information may include: representing the word of a mention as a first vector and the speaker of the mention as a randomly generated second vector; passing the first vector and the second vector through a Bi-LSTM to obtain a third vector (h_i); and passing a first video information vector, obtained by passing the video information through ResNet-152, through a linear layer with a ReLU activation function to obtain a second video information vector (v_i).

The obtaining of the respective representation information may include: combining the obtained second video information vector (v_i) with the obtained third vector (h_i); computing the cosine similarity between the vector (e_i) representing the mention whose character is to be identified and the vectors of an entity library matrix that represents the features of each character; and passing the computed values through a softmax layer to classify the character indicated by the mention.

The obtaining of the respective representation information may include classifying the character indicated by a mention based on coreference resolution, without predefined character information, in the multiparty dialogue.

The obtaining of the respective representation information may include: representing the words of a mention through an embedding and a Bi-LSTM layer; when a mention consists of multiple words, learning over them with attention and generating a mention representation vector (g_i) that has passed through the attention layer; passing the video information vector, obtained through ResNet at the time when the mention is uttered, through a first fully connected layer to generate a vector (v_i); and inputting the generated vector (v_i) to a second fully connected layer that computes an antecedent score.

The obtaining of the respective representation information may include computing an antecedent score that reflects the video information at the time when each mention is uttered, in order to determine whether mention representation vectors corefer, and selecting mention candidates based on the computed antecedent score.

The obtaining of the respective representation information may include forming groups of mentions that refer to the same target based on the selected mention candidates, and matching each mention in a formed group to an actual character based on a preset priority.

The identifying of the character to which the mention in the multiparty dialogue refers, based on the obtained representation information, may include outputting an identification result that identifies the character referred to by each mention in the multiparty dialogue.

A system for identifying characters in a multimodal multiparty dialogue may include: an acquisition unit that obtains respective representation information by analyzing, in a multiparty dialogue, a mention in script information and the video information at the time when the sentence containing the mention is uttered in the script information; an identification unit that identifies the character to which the mention in the multiparty dialogue refers, based on the obtained representation information; and a preprocessing unit that preprocesses data for identifying a character by aligning the text data of the script information with the video information corresponding to the time when the mention is uttered in the script information, thereby combining the script information and the video information.

By additionally using video information, characters that are difficult to determine from the script text alone can also be identified.

In a multiparty dialogue with multiple characters, such as a TV drama script, the character to which each mention actually refers can be identified more accurately.

FIG. 1 is an example for explaining the identification of a character in a dialogue from script information alone.
FIG. 2 is a block diagram illustrating the configuration of a character identification system according to an embodiment.
FIG. 3 is a flowchart illustrating a method for identifying characters in a multimodal multiparty dialogue in a character identification system according to an embodiment.
FIGS. 4 and 5 are examples for explaining the classification of the character indicated by a mention in a character identification system according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

The embodiments describe a technique for identifying which character each mention (a noun or noun phrase denoting a person) actually refers to in a multiparty dialogue with multiple characters, such as a TV drama script. A mention is a noun or noun phrase denoting a person, for example she, mom, or Ross Geller. Accordingly, a method and system are described that also use the video information at the time when a sentence (text data) is uttered, so that even characters that are difficult to determine from the script text data alone can be identified.

FIG. 2 is a block diagram illustrating the configuration of a character identification system according to an embodiment.

The character identification system 100 identifies a character by combining the script with the video information at the time when a sentence is uttered, and may include a preprocessing unit 210, an acquisition unit 220, and an identification unit 230.

The preprocessing unit 210 may preprocess data for identifying a character by aligning the text data of the script information with the video information corresponding to the time when a sentence containing a mention is uttered in the script information, thereby combining the script information and the video information. The preprocessing unit 210 may synchronize the text data of the script information with subtitle data in order to attach time information to it, and may preprocess the video. Taking one line of the subtitle data and one line of the text data as the basic units, the preprocessing unit 210 may extract, for each line of the script information, a set of candidates whose token similarity is equal to or greater than a preset threshold, and may attach time information to the text data by matching zero or one subtitle entry to each line of the script information from the extracted candidates using a longest increasing subsequence algorithm. In addition, the preprocessing unit 210 may divide the segment of the video information corresponding to the time when a sentence containing a mention is uttered into frame images at a preset number of images per second, and may pass each frame image through a pre-trained network to extract a frame vector of a preset size.
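As an illustration of the subtitle synchronization described above, the following Python sketch matches script lines to subtitle entries. It is not the implementation of the embodiment: the Jaccard token overlap used as the token similarity, the threshold of 0.5, and the names token_similarity and align_script are assumptions made for this example. Candidate pairs are ordered so that a longest strictly increasing subsequence over subtitle indices assigns zero or one subtitle to each script line while preserving dialogue order.

```python
import re
from bisect import bisect_left


def token_similarity(a, b):
    """Jaccard overlap of lowercase word tokens (an assumed similarity measure)."""
    ta = set(re.findall(r"[a-z0-9']+", a.lower()))
    tb = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_script(script_lines, subtitles, threshold=0.5):
    """Return, per script line, the (start, end) time of its matched subtitle or None.

    subtitles is a list of (start_seconds, end_seconds, text) tuples in playback order.
    """
    # Candidate pairs (script index, subtitle index) at or above the threshold.
    # Within one script line the subtitle indices are listed in descending
    # order, so a strictly increasing subsequence can use at most one of them.
    pairs = []
    for i, line in enumerate(script_lines):
        for j in range(len(subtitles) - 1, -1, -1):
            if token_similarity(line, subtitles[j][2]) >= threshold:
                pairs.append((i, j))

    # Longest strictly increasing subsequence over subtitle indices
    # (patience method with back-pointers).
    tail_pos, tail_val = [], []
    prev = [-1] * len(pairs)
    for p, (_, j) in enumerate(pairs):
        k = bisect_left(tail_val, j)
        prev[p] = tail_pos[k - 1] if k > 0 else -1
        if k == len(tail_pos):
            tail_pos.append(p)
            tail_val.append(j)
        else:
            tail_pos[k] = p
            tail_val[k] = j

    # Walk the back-pointers to recover the chosen matches.
    times = [None] * len(script_lines)
    p = tail_pos[-1] if tail_pos else -1
    while p != -1:
        i, j = pairs[p]
        times[i] = subtitles[j][:2]
        p = prev[p]
    return times


# Toy data: the stage direction has no subtitle counterpart.
script = ["Hi Ross, how are you?", "[Scene: Central Perk]", "I am fine, thanks."]
subs = [(1.0, 2.5, "Hi Ross how are you"), (3.0, 4.0, "I am fine thanks a lot")]
print(align_script(script, subs))
```

In this toy input the stage direction has no candidate at or above the threshold, so it receives no time information, which corresponds to the "zero subtitles" case.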

In a multiparty dialogue, the acquisition unit 220 may obtain respective representation information by analyzing a mention in the script information and the video information at the time when the sentence containing the mention is uttered in the script information. Here, a mention is a noun or noun phrase denoting a person and may include an anaphor. An anaphor is an expression, such as a pronoun, that refers back to a preceding word or phrase; for example, in the sentence 'Anna said she was leaving.', 'she' is an anaphor. The acquisition unit 220 may generate a vector representation of the features of each mention in the script information that reflects the contextual flow of the script information and the speaker information, and may generate a vector representation of the features of the video information according to the time when the mention is uttered.

As one example, the acquisition unit 220 may classify the character indicated by a mention among predefined characters in the multiparty dialogue. The acquisition unit 220 may represent the word of a mention as a first vector and the speaker of the mention as a randomly generated second vector, pass the first vector and the second vector through a Bi-LSTM to obtain a third vector (h_i), and pass a first video information vector, obtained by passing the video information through ResNet-152, through a linear layer with a ReLU activation function to obtain a second video information vector (v_i). The acquisition unit 220 may combine the obtained second video information vector (v_i) with the obtained third vector (h_i), compute the cosine similarity between the vector (e_i) representing the mention whose character is to be identified and the vectors of an entity library matrix that represents the features of each character, and pass the computed values through a softmax layer to classify the character indicated by the mention.
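The video branch of this example can be sketched in a few lines of Python. The sketch is illustrative only: the passage does not state how v_i and h_i are combined, so concatenation is used here as an assumption, and the tiny dimensions, weights, and the names linear_relu and combine are hypothetical stand-ins for the 2048-dimensional ResNet-152 feature and the trained layer.

```python
def linear_relu(x, weight, bias):
    """Compute ReLU(W x + b), with W given as a list of rows."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weight, bias)
    ]


def combine(h_i, v_i):
    """Concatenate the text vector and the video vector (assumed combination)."""
    return list(h_i) + list(v_i)


# Toy stand-in for the first video information vector (ResNet-152 output).
frame_feature = [1.0, -2.0, 0.5]
weight = [[0.5, 0.0, 1.0], [1.0, 1.0, 0.0]]  # hypothetical trained weights
bias = [0.0, 0.5]

v_i = linear_relu(frame_feature, weight, bias)  # second video information vector
h_i = [0.2, -0.1]                               # stand-in for the Bi-LSTM output
print(combine(h_i, v_i))
```

The ReLU zeroes the second component of the projected video vector in this toy case, so only non-negative video features reach the combined representation.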

As another example, the acquisition unit 220 may classify the character indicated by a mention based on coreference resolution, without predefined character information, in the multiparty dialogue. The acquisition unit 220 may represent the words of a mention through an embedding and a Bi-LSTM layer and, when a mention consists of multiple words, learn over them with attention and generate a mention representation vector (g_i) that has passed through the attention layer. It may pass the video information vector, obtained through ResNet at the time when the mention is uttered, through a first fully connected layer to generate a vector (v_i), and input the generated vector (v_i) to a second fully connected layer that computes an antecedent score. To determine whether mention representation vectors corefer, the acquisition unit 220 may compute an antecedent score that reflects the video information at the time when each mention is uttered, and select mention candidates based on the computed antecedent score. The acquisition unit 220 may form groups of mentions that refer to the same target based on the selected mention candidates, and match each mention in a formed group to an actual character based on a preset priority.
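The grouping step of this coreference-based example can be illustrated with a short Python sketch. It assumes that antecedent scores have already been computed, that a mention is linked to its highest-scoring earlier mention only when that score is positive (a convention borrowed from common neural coreference models, not stated in this description), and that links are merged with a union-find structure. The name cluster_mentions and the scores are hypothetical, and the preset priority used to match a group to an actual character is not modeled.

```python
def cluster_mentions(antecedent_scores):
    """Group mention indices that corefer.

    antecedent_scores[i][j] is the score that mention j (j < i) is an
    antecedent of mention i, so row i has exactly i entries.
    """
    n = len(antecedent_scores)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(1, n):
        row = antecedent_scores[i]
        best_j = max(range(i), key=lambda j: row[j])
        # Link only when the best antecedent beats the assumed zero baseline.
        if row[best_j] > 0.0:
            parent[find(i)] = find(best_j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


mentions = ["Ross", "he", "Monica", "she", "him"]
scores = [
    [],
    [2.1],
    [-1.0, -0.5],
    [-2.0, -1.5, 1.7],
    [0.9, 1.2, -0.3, -0.8],
]
for group in cluster_mentions(scores):
    print([mentions[i] for i in group])
```

With these toy scores the five mentions fall into two groups, one containing "Ross", "he", and "him" and one containing "Monica" and "she".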

The identification unit 230 may identify the character to which a mention in the multiparty dialogue refers, based on the obtained representation information. The identification unit 230 may output an identification result that identifies the character referred to by each mention in the multiparty dialogue.

FIG. 3 is a flowchart illustrating a method for identifying characters in a multimodal multiparty dialogue in a character identification system according to an embodiment.

The character identification method may be performed by the character identification system described with reference to FIG. 2. The character identification system may perform preprocessing and alignment 303 on the text data 301 of the script information and the video information 302. Here, the text data 301 of the script information may be the script document that is the subject of character identification, and the video information 302 may be the video associated with that script document. The character identification system may store the text data 301 of the script information and the video information 302 in a database. For example, the character identification system may extract the text data of the script information and the video information for a time point or segment in which characters are to be identified. In this case, the text data and video information for a time point or segment requested by a user may be extracted, or the text data and video information for each time point or segment may be extracted.

The character identification system may process and align the data so that the script document and the video can be combined and used for training (303). Subtitle data carries time information, but the text data of the script information does not, so the text data must be synchronized with the subtitle data in order to attach time information. Because the sentences of the script information and those of the subtitle data do not match exactly, the character identification system may perform this synchronization as follows. Taking one line of the subtitle data and one line of the script text data as the basic units, the system may extract, for each script line, a set of candidates whose token similarity is equal to or greater than a certain value. The system may then attach time information by matching zero or one subtitle to each script line using a longest increasing subsequence algorithm, so that as many script lines as possible are matched with subtitles. In addition, to represent the visual features of the segment in which an utterance occurs, the character identification system may first divide the video segment of the utterance into frames at a preset rate (for example, 24 images per second). The system may pass each frame image through a ResNet-152 network pre-trained on ImageNet and extract, as the frame vector, the 2048-dimensional output of the hidden layer immediately before the final classification layer. The average of the vectors of all frame images included in the video segment may be used as the input feature vector.
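The frame sampling and averaging at the end of this step can be sketched as follows. This is a simplified illustration under stated assumptions: the pre-trained network is not run here, so two-dimensional toy vectors stand in for the 2048-dimensional frame vectors, and the names frame_timestamps and mean_pool are hypothetical.

```python
def frame_timestamps(start, end, fps=24):
    """Timestamps (in seconds) of frames sampled at fps within [start, end)."""
    count = int(round((end - start) * fps))
    return [start + k / fps for k in range(count)]


def mean_pool(frame_vectors):
    """Average the per-frame vectors into one feature vector for the segment."""
    if not frame_vectors:
        raise ValueError("the segment contains no frames")
    n = len(frame_vectors)
    dim = len(frame_vectors[0])
    return [sum(vec[d] for vec in frame_vectors) / n for d in range(dim)]


# An utterance lasting from 10.0 s to 12.5 s, sampled at 24 frames per second.
stamps = frame_timestamps(10.0, 12.5)
print(len(stamps))

# Toy frame vectors standing in for the network outputs.
print(mean_pool([[1.0, 2.0], [3.0, 6.0]]))
```

Averaging yields a fixed-size vector regardless of how long the utterance lasts, which is what allows one video feature to be attached to each utterance.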

By performing data preprocessing and alignment, the character identification system may obtain representation information 304 for the mention and representation information 305 for the video information at the utterance time. In other words, the character identification system may obtain respective representation information for the script information and the video information. The system may obtain a vector representation of the features of each mention in the script information. Through a script-based mention representation extractor, the system may generate a vector representation of the features of each mention in the script that reflects the contextual flow of the script information and the speaker information. For example, based on the approach proposed in AMORE-UPF (2018), the system may take the utterances and their speakers as input, generate mention representations that take the script context into account, and compute the similarity between a mention representation and the character representations to determine the character indicated by each mention. In addition, the character identification system may obtain a vector representation of the features of the video information corresponding to each utterance time. Through a video representation extractor, the system may generate a vector representation of the features of the video for each utterance time. For example, the system may use ResNet-152 to generate a vector representation of the features of the video for each utterance time.

The character identification system may identify the character (306). The system may identify the character to which a mention in the multiparty dialogue refers, based on the obtained representation information. As one example, the system may classify the character indicated by a mention among predefined characters. For classification among predefined characters, the system may identify characters by adding to the AMORE-UPF architecture a structure that can make use of video information. The AMORE-UPF model is described first. A mention ('He' in FIG. 4) is represented as a vector using word2vec, the speaker of the mention ('Monica' in FIG. 4) is represented as a random vector, and the two vectors are combined to form the input data. After the input data is passed through a Bi-LSTM, the Bi-LSTM output vector of the i-th mention (i being a natural number) whose character is to be identified may be passed through a linear layer. The vector obtained by passing through the linear layer is denoted e_i.

In FIG. 4, the Entity Library matrix is a matrix in which each of the N character entities to be classified (N is a natural number) is represented as a k-dimensional vector (k is a natural number). The Entity Library matrix may be updated during training. The cosine similarity between the mention vector e_i and each vector of the Entity Library matrix, which represents the features of each person, may then be calculated. The calculated values are passed through a softmax layer to classify which person the mention refers to.
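As a rough illustration of the classification step described above, the following Python sketch computes the cosine similarity between a mention vector e_i and each row of an entity library and then applies a softmax. The function name classify_mention, the one-row-per-entity layout, and the toy 3-dimensional vectors are assumptions made for illustration and are not taken from the embodiment.

```python
import numpy as np

def classify_mention(e_i, entity_library):
    """Return one probability per entity for the mention vector e_i.

    e_i has shape (k,); entity_library has shape (N, k), one row per entity
    (the row layout is an assumption of this sketch).
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors
    norms = np.linalg.norm(entity_library, axis=1) * np.linalg.norm(e_i)
    cos = (entity_library @ e_i) / np.maximum(norms, eps)
    # Softmax over the N cosine similarities, shifted for numerical stability.
    exp = np.exp(cos - cos.max())
    return exp / exp.sum()

# Toy entity library with N = 3 entities and k = 3 dimensions (hypothetical values).
entity_library = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
e_i = np.array([0.2, 0.1, 0.9])  # mention vector closest to the third entity
probs = classify_mention(e_i, entity_library)
predicted = int(np.argmax(probs))
```

In a trained model the entity library rows would be learned parameters rather than fixed unit vectors.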

In the embodiment, as shown in FIG. 4, the network may be configured so that the mention vector e_i also reflects image information. The person identification system passes the image information vector produced by ResNet-152 through a linear layer with a ReLU activation function to obtain a vector v_i, and combines v_i with the vector h_i that serves as the input for calculating e_i. Here, h_i is the value obtained by representing the mention word as a vector, representing the speaker of the mention as a random vector, combining the two vectors, and passing the result through the Bi-LSTM. The final output vector o_i, which holds the probability that the mention refers to each person, may be expressed as Equation 1 below.

Equation 1: [formula image not reproduced]

In Equation 1, the vector h_i is the script-based representation of the mention, and the vector v_i is the image representation at the time the mention is uttered. In this way, the person identification system combines the script-based mention representation with the image representation at the time the sentence containing the mention is uttered, and identifies which person the mention actually refers to.
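The combination of h_i and v_i can be sketched as follows. The text states that v_i comes from a linear layer with ReLU activation and that it is combined with h_i before e_i is computed. It does not state how the two are combined, so this sketch assumes concatenation followed by a plain linear layer. All names (fuse_mention_and_image, W_v, W_e, r_i) and the toy dimensions are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_mention_and_image(h_i, r_i, W_v, b_v, W_e, b_e):
    """Combine the text-side vector h_i with the image feature r_i.

    r_i stands for the image vector produced by the image network.
    """
    # v_i: image vector after a linear layer with ReLU activation (per the text).
    v_i = relu(W_v @ r_i + b_v)
    # ASSUMPTION: h_i and v_i are combined by concatenation.
    combined = np.concatenate([h_i, v_i])
    # ASSUMPTION: a plain linear layer maps the combined vector to e_i.
    e_i = W_e @ combined + b_e
    return v_i, e_i

rng = np.random.default_rng(0)
d_h, d_r, d_v, k = 6, 10, 4, 5  # hypothetical toy sizes
h_i = rng.normal(size=d_h)
r_i = rng.normal(size=d_r)
W_v, b_v = rng.normal(size=(d_v, d_r)), np.zeros(d_v)
W_e, b_e = rng.normal(size=(k, d_h + d_v)), np.zeros(k)
v_i, e_i = fuse_mention_and_image(h_i, r_i, W_v, b_v, W_e, b_e)
```

The resulting e_i would then be scored against the Entity Library by cosine similarity and softmax, as described for the AMORE-UPF model.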

As another example, referring to FIG. 5, the person identification system may identify the person indicated by a mention based on coreference resolution. Without predefined person information, the person identification system can identify persons using coreference resolution, which can be applied to an arbitrary script after a single round of training. The coreference resolution model is described first. The words input to the person identification system are represented through GloVe and ELMo embeddings and a Bi-LSTM layer. When a mention consists of several words, attention is learned so that important words such as the head word receive more weight, and the i-th mention representation vector g_i (i is a natural number) is produced by the attention layer. The representations of the two mentions whose coreference is to be tested are then passed through a fully connected layer to calculate the antecedent score s(i,j) for mention i and mention j. For the current mention, the final coreference score, computed from a formula that includes s(i,j), is compared across all preceding mentions; the preceding mention with the highest score is regarded as coreferent with the current mention, and the two are grouped as referring to the same entity. By default, the coreference resolution model considers all word combinations and all mention combinations. Because this requires a large amount of computation, word combinations that are unlikely to be mentions may be scored and removed.
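The grouping step described above can be sketched in a few lines of Python. Each mention is linked to the preceding mention with the highest coreference score and inherits that mention's group. The threshold parameter is an assumption of this sketch and is not stated in the text: without some way to decline every antecedent, all mentions would end up in one group. The function name and the score values are hypothetical.

```python
def group_mentions(scores, threshold=0.0):
    """Assign a group id to each mention from pairwise coreference scores.

    scores[i][j] is the coreference score between mention i and an
    earlier mention j (j < i). A mention whose best score does not exceed
    `threshold` starts a new group (ASSUMPTION of this sketch).
    """
    group_of = []
    n_groups = 0
    for i, row in enumerate(scores):
        best_j, best_score = None, threshold
        for j in range(i):
            if row[j] > best_score:
                best_j, best_score = j, row[j]
        if best_j is None:
            group_of.append(n_groups)  # no acceptable antecedent: new group
            n_groups += 1
        else:
            group_of.append(group_of[best_j])  # join the antecedent's group
    return group_of

# Hypothetical scores for four mentions; row i lists scores against mentions 0..i-1.
scores = [
    [],
    [-1.0],
    [2.5, -0.3],
    [-0.7, 1.2, 0.4],
]
groups = group_mentions(scores)
```

Here mentions 0 and 2 fall into one group and mentions 1 and 3 into another.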

In the embodiment, because the mention spans are given, the step of selecting mentions by score may be removed and replaced by a step that selects only the mention candidates given in the input. In addition, the model may be modified so that, when determining whether two mentions are coreferent, the image information at the time each mention is uttered is taken into account. First, the image information vector produced by ResNet for the time at which mention i is uttered is passed through a fully connected layer (FFNNv in FIG. 5) to produce a vector v_i. The vector v_i is then input to the fully connected layer that calculates the antecedent score s(i,j) (FFNNco in FIG. 5). The modified calculation of the antecedent score s(i,j) may be expressed as Equation 2 below.

Equation 2: [formula image not reproduced]

The vectors v_i and v_j are the image information vector representations at the times the i-th and j-th mentions are uttered, and g_i and g_j are the vector representations of the i-th and j-th mentions. The product operator in Equation 2 denotes element-wise vector multiplication, and the comma (,) denotes vector concatenation. The remaining term denotes the feature values other than the i-th and j-th vectors: a feature vector indicating whether the speakers of the two mentions match and the distance between the two mentions in the document.
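A minimal sketch of how the input to the antecedent-scoring layer might be assembled is given below. The text names the ingredients (g_i, g_j, v_i, v_j, an element-wise product, concatenation, and a feature vector for speaker match and distance). The exact arrangement of terms belongs to Equation 2, so the order of terms and the choice of which product to include are assumptions here. The one-hidden-layer network, the function names, and all sizes are likewise hypothetical.

```python
import numpy as np

def pair_input(g_i, g_j, v_i, v_j, phi_ij):
    # ASSUMPTION: term order and the single product g_i * g_j are illustrative.
    return np.concatenate([g_i, g_j, g_i * g_j, v_i, v_j, phi_ij])

def ffnn_score(x, W1, b1, w2, b2):
    # One hidden layer with ReLU, then a scalar output (assumed architecture).
    hidden = np.maximum(W1 @ x + b1, 0.0)
    return float(w2 @ hidden + b2)

rng = np.random.default_rng(1)
d_g, d_v, d_hid = 5, 3, 8  # hypothetical toy sizes
g_i, g_j = rng.normal(size=d_g), rng.normal(size=d_g)
v_i, v_j = rng.normal(size=d_v), rng.normal(size=d_v)
# Hypothetical pair features: same-speaker flag and a scaled mention distance.
phi_ij = np.array([1.0, 0.25])

x = pair_input(g_i, g_j, v_i, v_j, phi_ij)
W1, b1 = rng.normal(size=(d_hid, x.size)), np.zeros(d_hid)
w2, b2 = rng.normal(size=d_hid), 0.0
s_ij = ffnn_score(x, W1, b1, w2, b2)
```

With these sizes the concatenated input has 3 × 5 + 2 × 3 + 2 = 23 elements.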

The person identification system may not only group mentions that refer to the same entity but also link each group to an actual person. Each mention in a group may be matched to an actual person according to a priority order. For example, the priorities may be set as follows: first, if the mention is a first-person pronoun, it is matched directly to the person who is the speaker; second, if the mention matches the string of a full name, it is matched directly to the person who is the speaker; third, if the mention matches only the given name (not the family name), it is matched directly to the person who is the speaker. Once every mention in the group has been matched to an actual person, the person matched most often in the group may be selected as the person that the group represents.
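The priority matching and majority selection can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the pronoun list is illustrative, and the two name rules are interpreted as resolving to the character whose name matched, whereas the text words all three rules as matching "the person who is the speaker". Comparison is case-insensitive by choice. The character names are example values.

```python
from collections import Counter

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}  # hypothetical pronoun list

def match_mention(text, speaker, characters):
    """Return the character matched by one mention, or None if no rule applies."""
    lowered = text.lower()
    if lowered in FIRST_PERSON:          # priority 1: first-person pronoun
        return speaker
    for name in characters:              # priority 2: full-name string match
        if lowered == name.lower():
            return name                  # ASSUMPTION: resolves to the named character
    for name in characters:              # priority 3: given-name-only match
        if lowered == name.split()[0].lower():
            return name                  # ASSUMPTION: resolves to the named character
    return None

def label_group(group, characters):
    """group is a list of (mention_text, speaker) pairs grouped together."""
    votes = Counter()
    for text, speaker in group:
        person = match_mention(text, speaker, characters)
        if person is not None:
            votes[person] += 1
    # The most frequently matched person represents the group.
    return votes.most_common(1)[0][0] if votes else None

characters = ["Monica Geller", "Ross Geller"]
group = [("I", "Ross Geller"), ("Ross", "Monica Geller"),
         ("he", "Monica Geller"), ("my", "Ross Geller")]
label = label_group(group, characters)
```

In this example three of the four mentions resolve to "Ross Geller", and the mention "he" matches no rule and casts no vote.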

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if the components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method of identifying a person in a multimodal multi-speaker conversation, the method comprising:
obtaining respective representation information by analyzing, in a multi-speaker conversation, a mention in script information and image information at the time at which a sentence including the mention is uttered in the script information; and
identifying, based on the obtained respective representation information, the person to whom the mention in the multi-speaker conversation refers,
the method further comprising:
preprocessing data for person identification by aligning text data of the script information with image information corresponding to the time at which the sentence including the mention is uttered in the script information, thereby combining the script information and the image information,
wherein the preprocessing comprises:
synchronizing with subtitle data in order to attach time information to the text data of the script information; using one line of dialogue of the subtitle data and one line of dialogue of the text data as basic units; extracting, for each line of dialogue of the script information, a candidate group whose token similarity is equal to or greater than a preset reference value; attaching time information to the text data by matching zero or one item of subtitle data to each line of dialogue of the script information, based on a longest increasing subsequence algorithm applied to the extracted candidate group; dividing the section of image information corresponding to the time at which the sentence including the mention is uttered into scene images at a preset number of images per second; and extracting a scene vector value of a preset size by passing each of the divided scene images through a pre-trained network,
and wherein the obtaining of the respective representation information comprises:
generating a vector representation of the features of each mention in the script information by reflecting the context flow and speaker information of the script information; generating a vector representation of the features of the image information for each time at which a mention is uttered; classifying the person indicated by a mention in the multi-speaker conversation among predefined characters; representing the word of the mention as a first vector and the speaker of the mention as a random second vector, and then obtaining a third vector (h_i) by passing the first vector and the second vector through a Bi-LSTM; obtaining a second image information vector (v_i) by passing a first image information vector, produced by ResNet-152 from the image information, through a linear layer with a ReLU activation function; combining the obtained second image information vector (v_i) with the third vector (h_i) in order to identify the person; calculating the cosine similarity between the vector (e_i) representing the mention and the vectors of an Entity Library matrix representing the features of each person; and classifying the person indicated by the mention by passing the calculated values through a softmax layer.

(Deleted)

The method of claim 1,
wherein the obtaining of the respective representation information comprises:
classifying the person indicated by a mention in the multi-speaker conversation based on coreference resolution, without information on predefined persons.

The method of claim 10,
wherein the obtaining of the respective representation information comprises:
representing the words of the mention through embeddings and a Bi-LSTM layer; when a mention consists of a plurality of words, learning attention over the words and generating a mention representation vector (g_i) through the attention layer; generating a second vector (v_i) by passing the image information vector, produced by ResNet for the time at which the mention is uttered, through a first fully connected layer; and inputting the generated vector (v_i) into a fully connected layer that calculates an antecedent score.

The method of claim 11,
wherein the obtaining of the respective representation information comprises:
calculating, in order to determine whether a plurality of mention representation vectors are coreferent, an antecedent score that reflects the image information at the time each mention is uttered, and selecting mention candidates based on the calculated antecedent score.

The method of claim 12,
wherein the obtaining of the respective representation information comprises:
creating groups of mentions that refer to the same entity based on the selected mention candidates, and matching each mention in a created group to an actual person based on a preset priority.

The method of claim 1,
wherein the identifying of the person to whom the mention in the multi-speaker conversation refers, based on the obtained respective representation information, comprises:
outputting an identification result that identifies the person referred to by each mention in the multi-speaker conversation.

A system for identifying a person in a multimodal multi-speaker conversation, the system comprising:
an acquisition unit configured to obtain respective representation information by analyzing, in a multi-speaker conversation, a mention in script information and image information at the time at which a sentence including the mention is uttered in the script information;
an identification unit configured to identify, based on the obtained respective representation information, the person to whom the mention in the multi-speaker conversation refers; and
a preprocessing unit configured to preprocess data for person identification by aligning text data of the script information with image information corresponding to the time at which the mention is uttered in the script information, thereby combining the script information and the image information,
wherein the preprocessing unit is configured to:
synchronize with subtitle data in order to attach time information to the text data of the script information; use one line of dialogue of the subtitle data and one line of dialogue of the text data as basic units; extract, for each line of dialogue of the script information, a candidate group whose token similarity is equal to or greater than a preset reference value; attach time information to the text data by matching zero or one item of subtitle data to each line of dialogue of the script information, based on a longest increasing subsequence algorithm applied to the extracted candidate group; divide the section of image information corresponding to the time at which the sentence including the mention is uttered into scene images at a preset number of images per second; and extract a scene vector value of a preset size by passing each of the divided scene images through a pre-trained network,
and wherein the acquisition unit is configured to:
generate a vector representation of the features of each mention in the script information by reflecting the context flow and speaker information of the script information; generate a vector representation of the features of the image information for each time at which a mention is uttered; classify the person indicated by a mention in the multi-speaker conversation among predefined characters; represent the word of the mention as a first vector and the speaker of the mention as a random second vector, and then obtain a third vector (h_i) by passing the first vector and the second vector through a Bi-LSTM; obtain a second image information vector (v_i) by passing a first image information vector, produced by ResNet-152 from the image information, through a linear layer with a ReLU activation function; combine the obtained second image information vector (v_i) with the third vector (h_i) in order to identify the person; calculate the cosine similarity between the vector (e_i) representing the mention and the vectors of an Entity Library matrix representing the features of each person; and classify the person indicated by the mention by passing the calculated values through a softmax layer.