KR20210051473A - Apparatus and method for recognizing video contents

Info

Publication number: KR20210051473A
Application number: KR1020190136784A
Authority: KR
Inventors: 서용석; 김정현; 김혜미; 박지현; 임동혁; 유원영
Original assignee: 한국전자통신연구원
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2021-05-10

Abstract

Disclosed is a video content identification method comprising the steps of: extracting a frame from an input video; detecting a face region in the extracted frame; detecting a clothing region in the frame in which the face region is detected; comparing the detected clothing region with one or more candidate images stored in a database, using information on the face recognized from the detected face region; and identifying the input video according to the comparison result. According to the present invention, video content can be identified with improved face recognition accuracy.

Description

Video content identification device and method {APPARATUS AND METHOD FOR RECOGNIZING VIDEO CONTENTS}

The present invention relates to a video content identification apparatus and method, and more particularly, to a video content identification apparatus and method using face recognition and clothing region detection.

A conventional way to identify a video is to store in a DB either a hash value of the original video or feature information (so-called feature points) of the audio or video extracted from the original signal, and then to extract a hash value or feature points from the input video to be identified and compare them for similarity with the values stored in the DB. This approach is widely used in the fields of video identification and video filtering. However, a video created by editing different copyrighted works into a single file, or by adding or deleting audio signals, is difficult to identify in this way.

Recently, a common face recognition approach is to pre-train a deep neural network on a variety of face images and then use the trained network to find which of the learned faces is most similar to the face in an input image. There are also attempts to use this technology to automatically detect the faces of criminals or missing children in surveillance camera footage.

Face recognition technology that uses a deep learning network to identify video content can identify a single work of sufficient length by using information on the multiple actors who appear in it. In contrast, for a video that has been edited down to a short length so that only one actor appears, it is not easy to pick out one of the many works in which that actor has appeared. In particular, when multiple works are edited and recombined into a single video, an accurate identification result is very difficult to expect.

An object of the present invention for solving the above problems is to provide a video content identification method.

Another object of the present invention for solving the above problems is to provide a video content identification apparatus that uses the video content identification method.

A video content identification method according to an embodiment of the present invention for achieving the above object may include: extracting a frame from an input video; detecting a face region in the extracted frame; detecting a clothing region in the frame in which the face region is detected; comparing the detected clothing region with one or more candidate images stored in a database, using information on the face recognized from the detected face region; and identifying the input video according to the comparison result.

The one or more candidate images stored in the database may be generated by: detecting a face region in an image constituting a data set; recognizing a person using the detected face region; detecting a clothing region in the image in which the face region is detected; and storing the detected clothing region together with meta information on the recognized person.

The step of detecting the face region in the extracted frame may be performed using a first neural network model trained through supervised learning on face regions of persons.

The step of detecting the clothing region in the image in which the face region is detected may be performed using a single-stage second neural network model trained through supervised learning on clothing regions of persons.

The step of comparing the detected clothing region with the one or more candidate images stored in the database may be performed using a third neural network model trained on clothing regions of persons.

The step of comparing the detected clothing region with the one or more candidate images stored in the database may include: calculating a similarity between at least one first image, which includes a clothing region stored in the database, and a second image, which includes the clothing region detected from the input video; and identifying the content corresponding to the first image having the maximum value among the similarity values between the at least one first image and the second image as the content to which the input video belongs.
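The maximum-similarity selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `identify_by_max_similarity`, the `(content_id, image)` pair layout, and the injected `similarity` callable (standing in for the third neural network model) are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def identify_by_max_similarity(
    query: Image,
    candidates: Sequence[Tuple[str, Image]],
    similarity: Callable[[Image, Image], float],
) -> Tuple[str, float]:
    """Return (content_id, score) of the stored first image most similar to the query.

    `candidates` holds hypothetical (content_id, stored_image) pairs;
    `similarity` is any callable that scores a stored image against the query.
    """
    if not candidates:
        raise ValueError("no candidate images to compare against")
    # Score every stored first image against the second (query) image.
    scored = [(content_id, similarity(stored, query)) for content_id, stored in candidates]
    # The content whose image has the maximum similarity value is the result.
    return max(scored, key=lambda pair: pair[1])
```

Passing the similarity function in as a parameter keeps the selection logic independent of how the similarity is actually computed.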

The third neural network model calculates CNN feature maps for the first image and the second image, reconstructs the calculated CNN feature maps, and thereby calculates and outputs the similarity between the first image and the second image as a ratio value.

The one or more candidate images stored in the database are managed in association with the name of the person related to the candidate image, information on the clothing region in the candidate image, and a content identifier for the candidate image.

A video content identification apparatus according to another embodiment of the present invention for achieving the above object may include: a processor; and a memory storing at least one instruction executed by the processor.

The at least one instruction may include: an instruction to extract a frame from an input video; an instruction to detect a face region in the extracted frame; an instruction to detect a clothing region in the frame in which the face region is detected; an instruction to compare the detected clothing region with one or more candidate images stored in a database, using information on the face recognized from the detected face region; and an instruction to identify the input video according to the comparison result.

The instruction to detect the face region in the extracted frame may be performed using a first neural network model trained through supervised learning on face regions of persons.

The instruction to detect the clothing region in the image in which the face region is detected may be performed using a single-stage second neural network model trained through supervised learning on clothing regions of persons.

The instruction to compare the detected clothing region with the one or more candidate images stored in the database may be performed using a third neural network model trained on clothing regions of persons.

The instruction to compare the detected clothing region with the one or more candidate images stored in the database may include: an instruction to calculate a similarity between at least one first image, which includes a clothing region stored in the database, and a second image, which includes the clothing region detected from the input video; and an instruction to identify the content corresponding to the first image having the maximum value among the similarity values between the at least one first image and the second image as the content to which the input video belongs.

The third neural network model calculates CNN feature maps for the first image and the second image, reconstructs the calculated CNN feature maps, and thereby calculates and outputs the similarity between the first image and the second image as a ratio value.

According to the embodiments of the present invention described above, video content can be identified with improved face recognition accuracy.

In particular, compared with conventional methods, more accurate identification can be expected for works edited into short clips and for video content in which multiple works have been edited into a single file.

As a result, the invention can be used efficiently for the automatic identification and monitoring of short content and of videos compiled from multiple works, in fields such as movies and TV dramas where copyright infringement occurs frequently.

FIG. 1 is a conceptual diagram of a video content identification apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a video content identification method according to an embodiment of the present invention.
FIG. 3 is a flowchart of a method of generating a face recognition model according to an embodiment of the present invention.
FIG. 4 is a flowchart of a process of identifying a face using a CNN model according to an embodiment of the present invention.
FIG. 5 is a flowchart of a process of detecting and storing clothing images according to an embodiment of the present invention.
FIG. 6 is a conceptual diagram of the object detection used for clothing region detection according to the present invention.
FIG. 7 shows examples of training images containing clothing regions, used to train clothing region detection according to the present invention.
FIG. 8 is a conceptual diagram of a management scheme for person-related image information according to an embodiment of the present invention.
FIG. 9 is a block diagram of a deep learning system for determining the similarity of two images based on clothing regions according to an embodiment of the present invention.
FIGS. 10 and 11 show examples of feature maps reconstructed according to the present invention.
FIG. 12 is a table showing the sizes of the images input and output at each step of the deep learning system for similarity determination according to the present invention.
FIG. 13 illustrates the concept of final identification of an input video using clothing region images registered in a database according to an embodiment of the present invention.
FIG. 14 is a block diagram of a video content identification apparatus according to another embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, similar reference numerals are used for similar elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it should be understood that it may be directly connected or coupled to the other element, or that intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

To solve the problems of the prior art described above, the present invention proposes a method that, for arbitrary video content in which one or more works have been edited into short lengths, recognizes actors' faces and detects clothing regions using deep learning based object detection technology, and uses the results to identify which work the content belongs to.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram of a video content identification apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the video content identification apparatus 100 according to an embodiment of the present invention recognizes actors' faces in input movie or TV video content. Based on the result of recognizing an actor by face, the apparatus 100 detects the clothing region of the recognized actor. The apparatus 100 can then identify the unknown input video through a re-identification process, by comparing the images of the detected clothing regions with the clothing images of the same actor registered in the DB.

FIG. 2 is a flowchart of a video content identification method according to an embodiment of the present invention.

The overall flow of the video content identification method according to the present invention is as shown in FIG. 2.

According to the video content identification method, frames are extracted from an input video (S210, S220). When a face region is detected in a frame (S230), the actor's face is recognized (S240) and the clothing region is detected (S250). The actor face recognition and the clothing region detection may be performed using deep learning networks trained in advance on actor face images and clothing region images. The clothing region image detected from the input video is compared for similarity with the clothing images of the corresponding actor registered in the DB (S260), from which the final identification result is obtained (S270).
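The control flow of steps S220 to S260 can be sketched as below. This is only an outline under stated assumptions: the function name `match_frames` and all five callables (`detect_face`, `recognize_actor`, `detect_clothing`, `candidates_for`, `similarity`) are hypothetical placeholders for the trained networks and the DB lookup, and the text does not specify how per-frame matches are combined into the final result of S270, so the sketch simply returns them.

```python
from typing import Any, Callable, Iterable, List, Optional, Sequence, Tuple

Match = Tuple[int, str, float]  # (frame index, content_id, similarity score)


def match_frames(
    frames: Iterable[Any],
    detect_face: Callable[[Any], Optional[Any]],
    recognize_actor: Callable[[Any], str],
    detect_clothing: Callable[[Any], Optional[Any]],
    candidates_for: Callable[[str], Sequence[Tuple[str, Any]]],
    similarity: Callable[[Any, Any], float],
) -> List[Match]:
    """Return the best-matching content for each frame that shows a face and clothing."""
    matches: List[Match] = []
    for index, frame in enumerate(frames):             # S220: extracted frames
        face = detect_face(frame)                      # S230: face region detection
        if face is None:
            continue                                   # no face, skip this frame
        actor = recognize_actor(face)                  # S240: actor face recognition
        clothing = detect_clothing(frame)              # S250: clothing region detection
        if clothing is None:
            continue
        scored = [(content_id, similarity(stored, clothing))   # S260: compare with the
                  for content_id, stored in candidates_for(actor)]  # same actor's DB images
        if scored:
            best_id, best_score = max(scored, key=lambda pair: pair[1])
            matches.append((index, best_id, best_score))
    return matches                                     # input to the final decision (S270)
```

Restricting the comparison to the recognized actor's stored images is what narrows the candidate set before the clothing comparison.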

FIG. 3 is a flowchart of a method of generating a face recognition model according to an embodiment of the present invention.

The generation of the face recognition model shown in FIG. 3 is a procedure performed in preparation for the video identification method shown in FIG. 2. That is, the video identification method of FIG. 2 may be performed using the face recognition model generated according to the method of FIG. 3.

A large number of actor face images are used to generate the face recognition model. More specifically, a face region is detected in each input actor image (S310), and image alignment is performed on the detected face region (S320). Facial landmark points are used for the alignment; landmarks such as the eyes, nose, and mouth can serve as these points. By aligning the images with respect to these landmarks and feeding them into a CNN model to train it (S330), a face recognition model with improved face recognition accuracy can be generated (S340).
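The text does not say how the landmarks are used for alignment. One common approach, assumed here purely for illustration, is to rotate the face so that the line through the two eyes becomes horizontal. The sketch below applies that rotation to the landmark coordinates only; the landmark names (`left_eye`, `right_eye`) and both function names are hypothetical, and a real system would apply the same rotation to the image pixels.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def eye_roll_angle(left_eye: Point, right_eye: Point) -> float:
    """Angle in degrees of the line through the eyes, relative to the x-axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def align_landmarks(landmarks: Dict[str, Point]) -> Dict[str, Point]:
    """Rotate all landmarks about the eye midpoint so the eyes lie on a horizontal line."""
    left, right = landmarks["left_eye"], landmarks["right_eye"]
    theta = -math.radians(eye_roll_angle(left, right))   # undo the measured roll
    cx, cy = (left[0] + right[0]) / 2, (left[1] + right[1]) / 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    aligned = {}
    for name, (x, y) in landmarks.items():
        dx, dy = x - cx, y - cy
        # Standard 2D rotation of the offset from the eye midpoint.
        aligned[name] = (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
    return aligned
```

Normalizing the pose in this way means the CNN sees faces in a consistent orientation during training and recognition.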

FIG. 4 is a flowchart of a process of identifying a face using a CNN model according to an embodiment of the present invention.

The face identification process of FIG. 4 shows in detail the supervised face learning (S330) described with reference to FIG. 3. The CNN (Convolutional Neural Network) model used in the embodiment of FIG. 4 is one example, and the neural network model used for face identification in the present invention is not limited to a CNN.

Here, a convolutional neural network (CNN) is a type of artificial neural network that uses the convolution of input data with filters to make the data easier to analyze.

An artificial neural network (ANN) is a statistical learning algorithm used in machine learning and cognitive science, inspired by biological neural networks (the central nervous system of animals, in particular the brain). The term refers to models in general in which artificial neurons (nodes), connected into a network by synapses, acquire problem-solving ability by changing the strength of their synaptic connections through learning.

Artificial neural networks can be trained by supervised learning, in which the network is optimized for a problem by being given a teaching signal (the correct answer), or by unsupervised learning, which requires no teaching signal. Supervised learning is usually used when there is a clear correct answer, and unsupervised learning is usually used for data clustering. Artificial neural networks are generally used to estimate and approximate unknown functions that depend on a large number of inputs. They are typically represented as interconnected systems of neurons that compute values from inputs, and because they are adaptive they can perform machine learning tasks such as pattern recognition.

Referring to FIG. 4, the CNN model passes its input through several rounds (for example, three) of convolution, an activation function (ReLU; Rectified Linear Unit), and max pooling, and then obtains the face identification result through a softmax function that normalizes the outputs of a plurality of fully connected layers to values between 0 and 1.

In the convolution step, feature maps are extracted by filters, and an activation function is applied to the feature maps. ReLU may be used as the activation function. The features extracted by the convolutional layer then go through a pooling (or subsampling) step that reduces their size as needed; max pooling is one such technique. Max pooling cuts the activation map into M x N windows and takes the largest value in each window. It is based on the idea that the largest feature value represents the other features in its window.
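ReLU and max pooling as described above can be illustrated on a small feature map in plain Python. This is a sketch for intuition only; it assumes non-overlapping windows (stride equal to the window size), which the text does not state explicitly, and the function names are illustrative.

```python
from typing import List

FeatureMap = List[List[float]]


def relu(feature_map: FeatureMap) -> FeatureMap:
    """Apply ReLU element-wise: negative activations become 0."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map: FeatureMap, m: int, n: int) -> FeatureMap:
    """Cut the map into non-overlapping m x n windows and keep each window's maximum."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - m + 1, m):
        pooled_row = []
        for c in range(0, cols - n + 1, n):
            window = [feature_map[i][j] for i in range(r, r + m) for j in range(c, c + n)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


activations = [
    [1.0, -2.0, 3.0, 0.0],
    [4.0, 5.0, -6.0, 7.0],
    [-1.0, 0.0, 2.0, 2.0],
    [3.0, -4.0, 1.0, 9.0],
]
pooled = max_pool(relu(activations), 2, 2)  # 4 x 4 map reduced to 2 x 2
```

Each 2 x 2 window of the 4 x 4 map is replaced by its largest value, so the output keeps the strongest response per region while shrinking to a quarter of the original size.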

Once features have been extracted through this process, the extracted feature values are fed into a fully connected layer (an ordinary neural network) to perform classification. A softmax function may be applied after the fully connected layer; softmax is a kind of activation function that can handle multiple classes. Accordingly, the result derived in FIG. 4 is classified over several identifiers.
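As a minimal sketch of the softmax step, the function below turns the raw outputs of a final fully connected layer into values between 0 and 1 that sum to 1. The helper `most_likely` and the example labels are illustrative assumptions, and subtracting the maximum before exponentiating is a standard numerical-stability measure rather than something the text prescribes.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalize raw scores to probabilities in (0, 1) that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def most_likely(labels: Sequence[str], logits: Sequence[float]) -> Tuple[str, float]:
    """Return the label with the highest softmax probability and that probability."""
    probabilities = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]
```

For example, `most_likely(["face_ID_001", "face_ID_002", "face_ID_003"], [1.0, 2.0, 5.0])` selects `face_ID_003`, since the largest raw score always receives the largest probability.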

When the CNN model has been sufficiently trained by repeatedly applying supervised learning to the network shown in FIG. 4 with a large number of face images, it can be used as the face recognition model according to the present invention. As described above, supervised learning trains the network while providing the correct answers: for example, the neural network is told that this photo is actor A and that photo is actor B.

The result of face identification with such a face recognition model can be obtained as probability values, as shown in FIG. 4. In the example of FIG. 4, the probability that the face in the input image is face_ID_003 is 90%, so the face in the input image can be estimated to be face_ID_003.

FIG. 5 is a flowchart of a process of detecting and storing clothing images according to an embodiment of the present invention.

According to an embodiment of the present invention, the per-actor clothing images used to identify works may be processed in the order shown in FIG. 5 and stored in a database.

More specifically, frames are first extracted from an input video (S510, S520). When a face region is detected in a frame (S530), the actor's face is recognized (S540) and the clothing region is detected (S550). The actor face recognition and the clothing region detection may be performed using deep learning networks trained in advance on actor face images and clothing region images. The detected clothing region is stored together with meta information related to that region to build the database (S560). The meta information may include the title of the video, the actor's name, and the like. For example, in the case of a movie, frames are extracted from the feature or its trailer, the clothing region is detected in the frames in which a face is detected, and the clothing image is stored in the database together with meta information such as the movie title and actor name.

FIG. 6 is a conceptual diagram of the object detection used for clothing region detection according to the present invention.

In the present invention, a single-stage deep learning network such as that shown in FIG. 6 may be used as the object detector for clothing region detection.

That is, object detection according to the present invention is performed through CNN-based feature extraction and detection layers, and outputs the object type and location as its result. Here, the object type is the clothing region, and the object location is the location of the clothing region.

Methods that classify and localize objects in a single process, such as SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once), may be used. These are deep learning networks designed to detect the type and location of objects in an input image without a separate search for candidate object regions.

When training a deep learning network for clothing region detection such as that of FIG. 6, the training data is prepared with the characteristics of the target works, such as movies and TV programs, in mind. Movies and TV programs are often shot in close-up, for example to capture the actor's facial expressions, so in many cases only the upper body is visible. To detect clothing regions under these conditions and to raise the accuracy of re-identification against the clothing images registered in the database, the ground-truth regions are marked on upper-body-oriented images from the very first stage of preparing the image data for the clothing region detector, and the deep learning network is trained on them by supervised learning.

도 7은 본 발명에 따라 의상 영역 검출 학습에 활용되는 의상 영역 포함 학습 이미지들의 예를 나타낸다.7 shows examples of learning images including a clothing region used for learning to detect a clothing region according to the present invention.

도 7의 각 이미지는 전체 이미지 중 의상 영역이 별도로 구별되어 표시되어 있음을 알 수 있다. 이들 이미지들은 본 발명에 따른 의상영역 검출기의 학습에 사용될 수 있다.It can be seen that in each image of FIG. 7, a clothing area is separately displayed among the entire images. These images can be used for learning of the garment region detector according to the present invention.

FIG. 8 is a conceptual diagram illustrating a management scheme for person-related image information according to an embodiment of the present invention.

Referring to FIG. 8, for a person, that is, an actor whose face has been recognized, a plurality of clothing region images (fashion images) of that actor are stored in association with the actor's name as the top-level node. In addition, a content_ID is stored and managed together with each clothing region image.

The clothing images from the works in which a face-recognized actor appears, stored through this scheme, make it possible to identify input content by comparison with content that is input later.

This image information management scheme may be built through the step (S560), described with reference to FIG. 5, of storing the clothing images of face-recognized actors in the DB. In the video identification process, the database built in this way is compared against the actor name and clothing images obtained through the same face recognition process, which helps identify unknown video content.
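The hierarchy of FIG. 8 (actor name, then fashion images, each with a content_ID) could be held in a simple in-memory index. This is a minimal sketch under assumptions: the class names `FashionImage` and `ActorFashionIndex`, the `image_path` field, and the example file paths are hypothetical, and a real system would presumably use a persistent database.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FashionImage:
    image_path: str  # location of the stored clothing-region image (hypothetical)
    content_id: str  # identifier of the work the image was taken from


class ActorFashionIndex:
    """Maps an actor name to the clothing-region images registered for that actor."""

    def __init__(self) -> None:
        self._by_actor: Dict[str, List[FashionImage]] = defaultdict(list)

    def register(self, actor_name: str, image_path: str, content_id: str) -> None:
        self._by_actor[actor_name].append(FashionImage(image_path, content_id))

    def candidates(self, actor_name: str) -> List[FashionImage]:
        # .get avoids creating an empty entry for an unknown actor.
        return list(self._by_actor.get(actor_name, []))


index = ActorFashionIndex()
index.register("Actor A", "fashion/a_0001.png", "movieID_0001")
index.register("Actor A", "fashion/a_0003.png", "movieID_0003")
actor_a_candidates = index.candidates("Actor A")
```

Looking up by actor name first narrows the later similarity comparison to the clothing images of the recognized actor only.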

FIG. 9 is a block diagram of a deep learning system for determining the similarity of two images based on the clothing region according to an embodiment of the present invention.

The two images compared in the embodiment of FIG. 9 may be a first image (input_1), which is stored in the database and includes an identified clothing region, and a second image (input_2), which includes a clothing region detected from the input video.

The input image may consist of only the clothing region (dotted box), or an image of the region including the face (the solid box surrounding the entire image in FIG. 9) may be used as the input to the network.
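The choice between the two input regions can be sketched as a simple crop. The box format `(x_min, y_min, x_max, y_max)` with exclusive upper bounds, the function names, and the toy image below are assumptions made for illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixel values (single channel for brevity)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max values exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def network_input(
    image: Image, person_box: Box, clothing_box: Box, clothing_only: bool = True
) -> Image:
    """Select either the clothing region or the larger region including the face."""
    return crop(image, clothing_box if clothing_only else person_box)


# Toy 6x4 image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(4)] for r in range(6)]
person_box = (0, 0, 4, 6)    # whole image, as with the solid box
clothing_box = (1, 3, 3, 6)  # lower part, as with the dotted box

clothing = network_input(image, person_box, clothing_box)
full = network_input(image, person_box, clothing_box, clothing_only=False)
```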

Referring to FIG. 9, to determine the similarity of the two images, convolution and max pooling (910; 920) are applied repeatedly to each image of the input pair, that is, to the first image and the second image.

The convolution and max pooling (910; 920) include Convolution_1, MaxPooling_1, and MaxPooling_2, and further include Convolution_2, MaxPooling_3, and MaxPooling_4 of the same structure.
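To make the convolution and max pooling operations concrete, the following is a minimal pure-Python sketch of one convolution, a ReLU activation (which the description mentions as an optional addition after convolution), and one 2×2 max pooling. The kernel values, the input, and the layer sizes are illustrative assumptions; the actual sizes are those given in FIG. 12, which is not reproduced here.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(x: Matrix, k: Matrix) -> Matrix:
    """2D cross-correlation without padding, as used by deep learning frameworks."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(x: Matrix) -> Matrix:
    """Replace negative values with zero."""
    return [[max(0.0, v) for v in row] for row in x]


def max_pool2x2(x: Matrix) -> Matrix:
    """Non-overlapping 2x2 max pooling; a trailing odd row or column is dropped."""
    return [
        [
            max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
            for j in range(0, len(x[0]) - 1, 2)
        ]
        for i in range(0, len(x) - 1, 2)
    ]


# Toy 5x5 input and an assumed 2x2 kernel.
x = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [2, 0, 1, 0, 2],
    [1, 3, 0, 2, 1],
    [0, 1, 2, 1, 0],
]
kernel = [[1, 0], [0, -1]]

pooled = max_pool2x2(relu(conv2d_valid(x, kernel)))
```

Applying the same functions with the same kernel to both images of the pair corresponds to the two branches having the same structure.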

Thereafter, the difference between the two extracted CNN feature maps is computed to newly construct feature maps representing the difference between the two images. These are then passed through a series of neural networks (920) consisting of convolution, pooling, and a fully connected layer, and finally the identity of the two images is determined through softmax.

Here, the series of neural networks (920) may include Convolution_3, Convolution_4, Convolution_5, Convolution_6, MaxPooling_5, and MaxPooling_6, and may include a fully connected layer at the end of the network.

Additionally, although not shown in FIG. 10, it may be more effective to construct the deep learning network by adding an activation function such as ReLU (Rectified Linear Unit) after each convolution.
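The final softmax step can be sketched as follows. This assumes the fully connected layer emits two scores ordered as (different, same); that ordering and the function names are assumptions for illustration, not details given in the description.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def same_clothing_probability(logits: Sequence[float]) -> float:
    """Probability of the 'same' class, assuming logits = (different, same)."""
    return softmax(logits)[1]


# Hypothetical output of the fully connected layer for one image pair.
p_same = same_clothing_probability([-1.0, 2.0])
```

The resulting probability can serve directly as the similarity value used when ranking candidate images.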

In FIG. 9, the process of computing the difference between the input image pair can be expressed by Equation 1. Equation 1 shows how the relationship between the two feature maps is computed, and derives the reconstructed feature maps for the input image pair.

In Equation 1, the first feature map is the one derived by passing the first input image through the convolutional layer (910), and the second feature map is the one derived by passing the second input image through the convolutional layer (910). In addition, the equation uses a 5×5 matrix consisting of ones and a 5×5 matrix centered on a given element of a feature-map matrix.
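One computation consistent with these definitions is a neighborhood difference: at each position, the value of one feature map is spread over a 5×5 matrix of ones and the 5×5 neighborhood of the other feature map centered at the same position is subtracted. The sketch below is an assumed form, not a reproduction of Equation 1; the names `f`, `g`, and `neighborhood_difference`, and the zero padding at the borders, are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def neighborhood_difference(f: Matrix, g: Matrix, size: int = 5) -> List[List[Matrix]]:
    """For each position (y, x), return f[y][x] * ones(size, size) minus the
    size x size neighborhood of g centered at (y, x).

    f and g must have the same shape. Positions outside g are treated as 0
    (zero padding, an assumption).
    """
    h, w = len(f), len(f[0])
    r = size // 2

    def g_at(y: int, x: int) -> float:
        return g[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    return [
        [
            [
                [f[y][x] - g_at(y + dy, x + dx) for dx in range(-r, r + 1)]
                for dy in range(-r, r + 1)
            ]
            for x in range(w)
        ]
        for y in range(h)
    ]


# Toy 3x3 feature maps for the first and second image.
f = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
g = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

k = neighborhood_difference(f, g)        # first map against neighborhoods of the second
k_prime = neighborhood_difference(g, f)  # roles swapped, giving the second reconstructed map
```

Computing the difference in both directions yields two reconstructed feature maps for the input pair, matching the statement that the equation derives a pair of maps.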

FIGS. 10 and 11 show examples of feature maps reconstructed according to the present invention.

That is, FIGS. 10 and 11 show example results of applying Equation 1 to actual feature maps. In the present invention, the feature maps for the input image pair are reconstructed in this manner.

FIG. 12 is a table showing the sizes of the images input to and output from each stage of the deep learning system for similarity determination according to the present invention.

The table shown in FIG. 12 gives the input image size and output image size for each layer of the deep learning system described in the embodiment of FIG. 9.

More specifically, the table of FIG. 12 gives the input and output image sizes of Convolution_1, MaxPooling_1, and MaxPooling_2 included in the neural network 910 configured to derive feature vectors, and of Convolution_2, MaxPooling_3, and MaxPooling_4 included in the neural network 920.

It also gives the input and output image sizes of Convolution_3, Convolution_4, Convolution_5, Convolution_6, MaxPooling_5, and MaxPooling_6 included in the series of neural networks 930 that process the reconstructed feature maps.

FIG. 13 illustrates the concept of finally identifying an input video using clothing region images registered in a database according to an embodiment of the present invention.

FIG. 13 is a conceptual diagram in which the trained neural network system is used to compare the similarity between an actor's clothing region detected from an unknown video and the clothing region images of that actor registered in the DB, thereby re-identifying the clothing region and identifying the unknown input video. That is, using a sufficiently trained neural network system, the clothing region of the actor identified in the input video is compared with the clothing regions in a plurality of candidate videos of that actor in the database (movieID_0001, movieID_0002, movieID_0003, movieID_0004). Through this comparison, a similarity between each candidate video and the input video can be derived. The content can then be identified by checking the meta information of the clothing image with the highest of the derived similarity values.

In the example of FIG. 13, the similarity between the input video and movieID_0001 is 0.05, with movieID_0002 it is 0.02, with movieID_0003 it is 0.9, and with movieID_0004 it is 0.03, so the image with the highest similarity belongs to the content with the identifier movieID_0003. That is, in the example of FIG. 13, the input video is confirmed to belong to the video content movieID_0003.
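The selection step in this example amounts to taking the candidate with the maximum similarity. A minimal sketch using the similarity values quoted above; the function name `identify` and the dictionary layout are assumptions for illustration.

```python
from typing import Dict, Tuple


def identify(similarities: Dict[str, float]) -> Tuple[str, float]:
    """Return the content identifier with the highest similarity and that similarity."""
    if not similarities:
        raise ValueError("no candidate images to compare against")
    content_id = max(similarities, key=similarities.get)
    return content_id, similarities[content_id]


# Similarity values from the example of FIG. 13.
scores = {
    "movieID_0001": 0.05,
    "movieID_0002": 0.02,
    "movieID_0003": 0.9,
    "movieID_0004": 0.03,
}
best_id, best_score = identify(scores)
```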

FIG. 14 is a block diagram of a video content identification device according to another embodiment of the present invention.

A video content identification device according to an embodiment of the present invention may include at least one processor (110), a memory (120) that stores at least one instruction executed by the processor, and a transceiver (130) connected to a network to perform communication.

The video content identification device (100) may further include an input interface device (140), an output interface device (150), a storage device (160), and the like. The components included in the video content identification device (100) may be connected by a bus (170) to communicate with one another.

The processor (110) may execute program commands stored in at least one of the memory (120) and the storage device (160). The processor (110) may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which methods according to embodiments of the present invention are performed.

Each of the memory (120) and the storage device (160) may consist of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory (120) may consist of at least one of read-only memory (ROM) and random access memory (RAM). In addition, the storage device (160) may include the database described in the foregoing embodiments, and this database may include one or more candidate images.

Here, the at least one instruction may include: an instruction causing the processor to extract a frame from an input video; an instruction to detect a face region from the extracted frame; an instruction to detect a clothing region from the frame in which the face region is detected; an instruction to compare the detected clothing region with one or more candidate images stored in a database, using information on a face recognized based on the detected face region; and an instruction to identify the input video according to the result of the comparison.

The instruction to detect the clothing region from the image in which the face region is detected may be performed using a single-stage second neural network model trained through supervised learning on clothing regions of persons.

The instruction to compare the detected clothing region with the one or more candidate images stored in the database may be performed using a third neural network model trained through learning on clothing regions of persons.
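The sequence of instructions above can be sketched as a single function in which each model is passed in as a callable. This is a minimal sketch under assumptions: every parameter name is hypothetical, the three neural network models are replaced by stand-in functions, and taking the maximum similarity across all frames is one possible aggregation that the description does not prescribe.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def identify_video(
    frames: Iterable[Any],
    detect_face: Callable[[Any], Optional[Any]],
    recognize_face: Callable[[Any, Any], Optional[str]],
    detect_clothing: Callable[[Any], Optional[Any]],
    candidates_for: Callable[[str], List[Tuple[str, Any]]],
    similarity: Callable[[Any, Any], float],
) -> Optional[str]:
    """Return the content identifier of the best-matching candidate, or None."""
    best_id: Optional[str] = None
    best_score = float("-inf")
    for frame in frames:
        face = detect_face(frame)            # face region detection
        if face is None:
            continue                         # only frames with a face are used
        actor = recognize_face(frame, face)  # face recognition
        clothing = detect_clothing(frame)    # clothing region detection
        if actor is None or clothing is None:
            continue
        # Compare only against candidates registered for the recognized actor.
        for content_id, candidate in candidates_for(actor):
            score = similarity(candidate, clothing)
            if score > best_score:
                best_id, best_score = content_id, score
    return best_id


# Stand-in functions so the sketch runs without any trained model.
frames = ["no_face", "actor_a_red_coat"]
result = identify_video(
    frames,
    detect_face=lambda frame: None if frame == "no_face" else "face",
    recognize_face=lambda frame, face: "Actor A",
    detect_clothing=lambda frame: "red_coat",
    candidates_for=lambda actor: [
        ("movieID_0001", "blue_suit"),
        ("movieID_0003", "red_coat"),
    ],
    similarity=lambda candidate, clothing: 1.0 if candidate == clothing else 0.0,
)
```

Passing the models in as callables keeps the control flow testable independently of any particular detector or similarity network.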

As can be seen from the above embodiments, the video content identification method according to the present invention considers the actor's face and the clothing image together, and therefore, compared with existing identification methods, enables accurate identification even for briefly edited videos and for single videos re-created by editing together a plurality of copyrighted works.

The operations of the method according to an embodiment of the present invention can be implemented as a computer-readable program or code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable program or code is stored and executed in a distributed manner.

In addition, the computer-readable recording medium may include a hardware device specially configured to store and execute program commands, such as ROM, RAM, or flash memory. The program commands may include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although some aspects of the present invention have been described in the context of a device, they may also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method may also be represented by a corresponding block or item, or by a feature of a corresponding device. Some or all of the method steps may be performed by (or using) a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such a device.

In embodiments, a programmable logic device (for example, a field-programmable gate array) may be used to perform some or all of the functions of the methods described herein. In embodiments, the field-programmable gate array may operate together with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by some hardware device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the following claims.

Claims

A video content identification method comprising:
extracting a frame from an input video;
detecting a face region from the extracted frame;
detecting a clothing region from the frame in which the face region is detected;
comparing the detected clothing region with one or more candidate images stored in a database, using information on a face recognized based on the detected face region; and
identifying the input video according to a result of the comparison.

The method according to claim 1,
wherein the one or more candidate images stored in the database are generated through:
detecting a face region from an image constituting a data set;
recognizing a person using the detected face region;
detecting a clothing region from the image in which the face region is detected; and
storing the detected clothing region together with meta information on the recognized person.

The method according to claim 1,
wherein the detecting of the face region from the extracted frame
is performed using a first neural network model trained through supervised learning on a person's face region.

The method according to claim 1,
wherein the detecting of the clothing region from the image in which the face region is detected
is performed using a single-stage second neural network model trained through supervised learning on a person's face region.

The method according to claim 1,
wherein the comparing of the detected clothing region with the one or more candidate images stored in the database
is performed using a third neural network model trained through learning on a person's clothing region.

The method according to claim 1,
wherein the comparing of the detected clothing region with the one or more candidate images stored in the database comprises:
calculating a similarity between at least one first image, which is stored in the database and includes a clothing region, and a second image including a clothing region detected from the input video; and
identifying the content corresponding to the first image having the maximum value among the similarity values between the at least one first image and the second image as the content to which the input video belongs.

The method according to claim 5,
wherein the third neural network model
calculates CNN feature maps for the first image and the second image and reconstructs the calculated CNN feature maps, thereby calculating and outputting the similarity between the first image and the second image as a ratio value.

The method according to claim 2,
wherein the one or more candidate images stored in the database
are managed in association with a name of a person related to the candidate image, information on a clothing region within the candidate image, and a content identifier for the candidate image.

The method according to claim 3,
wherein the first neural network model comprises at least one of convolution, an activation function, max pooling, a fully connected layer, and softmax.

The method according to claim 5,
wherein the third neural network model
comprises at least one of convolution, an activation function, max pooling, a fully connected layer, and softmax.

A video content identification device comprising:
a processor; and
a memory storing at least one instruction executed by the processor,
wherein the at least one instruction includes:
an instruction to extract a frame from an input video;
an instruction to detect a face region from the extracted frame;
an instruction to detect a clothing region from the frame in which the face region is detected;
an instruction to compare the detected clothing region with one or more candidate images stored in a database, using information on a face recognized based on the detected face region; and
an instruction to identify the input video according to a result of the comparison.

The device according to claim 11,
wherein the one or more candidate images stored in the database are generated through:
detecting a face region from an image constituting a data set;
recognizing a person using the detected face region;
detecting a clothing region from the image in which the face region is detected; and
storing the detected clothing region together with meta information on the recognized person.

The device according to claim 11,
wherein the instruction to detect the face region from the extracted frame
is performed using a first neural network model trained through supervised learning on a person's face region.

The device according to claim 11,
wherein the instruction to detect the clothing region from the frame in which the face region is detected
is performed using a single-stage second neural network model trained through supervised learning on a person's clothing region.

The device according to claim 11,
wherein the instruction to compare the detected clothing region with the one or more candidate images stored in the database
is performed using a third neural network model trained through learning on a person's clothing region.

The device according to claim 11,
wherein the instruction to compare the detected clothing region with the one or more candidate images stored in the database includes:
an instruction to calculate a similarity between at least one first image, which is stored in the database and includes a clothing region, and a second image including a clothing region detected from the input video; and
an instruction to identify the content corresponding to the first image having the maximum value among the similarity values between the at least one first image and the second image as the content to which the input video belongs.

The device according to claim 15,
wherein the third neural network model
calculates CNN feature maps for the first image and the second image and reconstructs the calculated CNN feature maps, thereby calculating and outputting the similarity between the first image and the second image as a ratio value.

The device according to claim 11,
wherein the one or more candidate images stored in the database
are managed in association with a name of a person related to the candidate image, information on a clothing region within the candidate image, and a content identifier for the candidate image.

The device according to claim 13,
wherein the first neural network model comprises at least one of convolution, an activation function, max pooling, a fully connected layer, and softmax.

The device according to claim 15,
wherein the third neural network model
comprises at least one of convolution, an activation function, max pooling, a fully connected layer, and softmax.