KR20210064597A

KR20210064597A - Apparatus and method for analyzing video

Info

Publication number: KR20210064597A
Application number: KR1020190152992A
Authority: KR
Inventors: 서봉원; 이준환; 이성우
Original assignee: 서울대학교산학협력단
Priority date: 2019-11-26
Filing date: 2019-11-26
Publication date: 2021-06-03
Also published as: KR102294817B1

Abstract

The present invention relates to an apparatus for analyzing a video. The apparatus separates the voice data of audio content in the video for each speaker, recognizes the face of a person appearing in the video content, matches the separated voice for each speaker with the recognized face, and identifies characters in every frame of the video to provide character information for each frame. It can create bookmarking information about the appearance of characters for each scene by using this information, and perform scene search and clip image generation for each character. The apparatus includes an audio analysis part, a video analysis part and a control part.

Description

Apparatus and method for video analysis {APPARATUS AND METHOD FOR ANALYZING VIDEO}

Disclosed is an apparatus for analyzing a video and, more particularly, a technology that analyzes the appearance of characters in a video and provides appearance information for each frame.

Video editing is an important task that strongly affects the quality of a video, but it is done manually and therefore takes considerable time and money. To edit a video, the editor must know its overall content and must separately bookmark the important parts of each scene so that they can be easily referenced during editing. In general, the editor performs this bookmarking by hand according to a subjective concept.

Characters make up an important part of a video, and their appearance is objective information. If only the scenes in which characters appear could be bookmarked automatically, manual editing would become more convenient. There is therefore a need for a device that, given a video as input, analyzes it, separately bookmarks the scenes in which each performer appears, and generates additional information from the bookmarking information.

An object of the present invention is to provide an apparatus and method that analyze a video and automatically bookmark the frames in which characters appear in the video.

Another object of the present invention is to provide an apparatus and method that analyze a video and can search for only the scenes in which a specific character appears in the video.

Yet another object of the present invention is to provide an apparatus and method that analyze a video and can generate a clip image containing only the scenes in which a specific character appears in the video.

A video analysis apparatus according to an aspect of the present invention includes an audio analysis unit, a video analysis unit, and a control unit.

The audio analysis unit separates the voice of each speaker from the audio content of the video and generates voice frame information for each speaker's voice.

The video analysis unit recognizes the faces of the people appearing in the video content of the video, distinguishes the characters, and generates face recognition frame information for each character.

The control unit combines the voice frame information and the face recognition frame information to match recognized faces with voices, and generates a character frame vector indicating, for every frame, whether a character appears.

A video analysis method according to an embodiment of the present invention is implemented at least in part by program instructions executed on a video analysis apparatus that receives and processes a video. The method includes: separating the voice of each speaker from the audio content of the video; generating voice frame information for each speaker's voice; recognizing the faces of the people appearing in the video content of the video to distinguish the characters; generating face recognition frame information for the characters; combining the voice frame information and the face recognition frame information to match recognized faces with voices; and generating a character frame vector indicating, for every frame, whether a character appears.

According to the present invention, a video can be analyzed so that the frames in which characters appear are bookmarked automatically.

In addition, according to the present invention, a video can be analyzed so that only the scenes in which a specific character appears can be searched.

In addition, according to the present invention, a video can be analyzed so that a clip image containing only the scenes in which a specific character appears is generated automatically.

FIG. 1 conceptually illustrates the process by which the video analysis apparatus of the present invention analyzes a video.
FIG. 2 is a block diagram of the video analysis apparatus of the present invention.
FIG. 3 schematically illustrates how the voice separation model separates the voice of each speaker from audio content.
FIG. 4 shows an example of matching each speaker's voice with a recognized face using a co-occurrence matrix.
FIG. 5 illustrates the concept of correcting a character frame vector using a sliding window.
FIG. 6 is a flowchart illustrating an exemplary procedure of the video analysis method of the present invention.

The foregoing and additional aspects are embodied through the embodiments described with reference to the accompanying drawings. The components of each embodiment may be combined in various ways within that embodiment unless otherwise stated or mutually contradictory. Each block in a block diagram may in some cases represent a physical component, and in other cases may be a logical representation of part of the function of one physical component or of a function spanning several physical components. Sometimes a block, or part of one, may be realized as a set of program instructions. These blocks may be implemented, in whole or in part, by hardware, software, or a combination of the two.

FIG. 1 conceptually illustrates the process by which the video analysis apparatus of the present invention analyzes a video. Referring to FIG. 1, a voice separation model, which is a deep learning model trained to separate voices by speaker, separates the voice of each speaker from the audio content of a video (e.g., a movie), and voice frame information is generated for each speaker's voice, that is, information about the frames in which that voice is heard. A face recognition model, which is a deep learning model trained to recognize faces in images, recognizes the faces of the people appearing in the video content, forms a face group for each person, and generates face recognition frame information, that is, information about the frames in which the character with the recognized face appears. The audio processing and the video processing may be performed sequentially in either order, or simultaneously.

Next, the voice frame information and the face recognition frame information are combined, and each speaker's voice is automatically matched to a face group using the frequency with which the voice and the face appear at the same time. Using the matching result, the voice frame information, and the face recognition frame information, a character frame vector is generated that indicates, for every frame, whether each character appears. Depending on the setting, a character may be counted as appearing only when both the voice and the face are present, or when either the voice or the face is present. The generated character frame vector contains appearances that last only a very short time. The character frame vectors are therefore corrected by deleting these short appearances, so that a character is counted as appearing only when it appears in a run of consecutive frames longer than a set threshold. Various meta information is then generated from the corrected character frame vectors, the voice frame information, and the face recognition frame information.

FIG. 2 is a block diagram of the video analysis apparatus of the present invention. The video analysis apparatus 10 may be implemented as a computing device including a microprocessor, a memory, a display, and the like, or as a plurality of computing devices connected through a network.

According to an aspect of the present invention, the video analysis apparatus 10 includes an audio analysis unit 100, a video analysis unit 110, and a control unit 120.

According to an additional aspect, the video analysis apparatus 10 may further include a meta information generation unit 130. According to another additional aspect, it may further include the meta information generation unit 130, a user interface 160, and a frame search unit 140. According to yet another additional aspect, it may further include the meta information generation unit 130, the user interface 160, and a clip image editing unit 150.

As shown in FIG. 2, the video analysis apparatus 10 may also be configured to include all of the audio analysis unit 100, the video analysis unit 110, the control unit 120, the meta information generation unit 130, the user interface 160, the frame search unit 140, and the clip image editing unit 150.

The audio analysis unit 100, the video analysis unit 110, the control unit 120, the meta information generation unit 130, the user interface 160, the frame search unit 140, and the clip image editing unit 150 may each be implemented at least in part as program instructions executed on the video analysis apparatus 10. They may all run on a single computing device, or some or all of them may run on different computing devices.

The audio analysis unit 100 includes a voice separation model, which is a trained neural network model (for example, a deep learning model), and uses it to separate the voice of each speaker from the audio content of the video. The voice separation model is trained to separate each speaker's voice even when several speakers talk at the same time. The specific algorithms of such models are known techniques published in various papers, so a detailed description is omitted.

FIG. 3 schematically illustrates how the voice separation model separates the voice of each speaker from audio content. The example in FIG. 3 conceptually shows a sound source in which simultaneous utterances are mixed being separated by speaker.

The audio analysis unit 100 also generates voice frame information for each separated voice. Voice frame information maps each speaker's voice to the frame information of the video. For example, each video section in which a separated speaker's voice is heard (start time and end time) is mapped to the frame information of the video as a time code, an elapsed playback time, or a frame number.
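The mapping from voice sections to frame numbers can be sketched as follows. This is only an illustration under assumptions that the patent does not state: the function names, the `(speaker, start, end)` tuple layout, and a constant frame rate `fps` are all hypothetical.

```python
import math


def interval_to_frames(start_sec, end_sec, fps):
    """Map a time interval [start_sec, end_sec) to inclusive frame indices.

    Assumes a constant frame rate; frame k covers [k / fps, (k + 1) / fps).
    """
    first = math.floor(start_sec * fps)
    last = max(first, math.ceil(end_sec * fps) - 1)
    return first, last


def voice_frame_info(segments, fps):
    """Group (speaker, start_sec, end_sec) segments into per-speaker frame ranges."""
    info = {}
    for speaker, start_sec, end_sec in segments:
        info.setdefault(speaker, []).append(interval_to_frames(start_sec, end_sec, fps))
    return info


if __name__ == "__main__":
    # Hypothetical output of a voice separation step.
    segments = [("audio1", 0.0, 1.0), ("audio2", 0.5, 1.5), ("audio1", 2.0, 2.5)]
    print(voice_frame_info(segments, fps=10))
```

The same structure would serve for face recognition frame information, with face groups in place of speakers.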

The video analysis unit 110 includes a face recognition model, which is a trained neural network model (for example, a deep learning model), and uses it to recognize the faces of the people appearing in the video content and to distinguish the characters. In general, a face recognition model is trained to distinguish people by vectorizing the positions of multiple key points on a face in an image, such as the positions of the eyes and nose. The specific algorithms of such models are known techniques published in various papers, so a detailed description is omitted.

A face recognized by the video analysis unit 110 is the face as it appears in a particular scene, so the same person may be recognized differently from scene to scene. Because several recognized faces may therefore exist for one person, the present invention manages the recognized faces of the same person as a face group.

The video analysis unit 110 also generates face recognition frame information for each face-recognized character. Face recognition frame information maps the scenes in which a face-recognized person appears to the frame information of the video. For example, each video section in which a face-recognized person appears (start time and end time) is mapped to the frame information of the video as a time code, an elapsed playback time, or a frame number.

The control unit 120 combines the voice frame information and the face recognition frame information to match recognized faces with voices. From the frames in which each speaker's voice is heard (voice frame information) and the frames in which each recognized face appears (face recognition frame information), it uses the frequency of simultaneous appearance to automatically match each voice to a recognized face, that is, to a face group. According to an aspect of the present invention, this matching may be performed with a co-occurrence matrix. A co-occurrence matrix is used in fields such as data mining and natural language processing to analyze how often two words appear together in the same sentence. In the present invention, it records how often a voice and a recognized face appear in the same frames, and this frequency is used to match each speaker's voice with a recognized face.

FIG. 4 shows an example of matching each speaker's voice with a recognized face using a co-occurrence matrix. In the matrix of FIG. 4, each row represents a speaker's voice, each column represents a recognized face (face group), and the value at the intersection of a row and a column is the frequency of consecutive frames in which the two appear at the same time. For each row, the column with the highest frequency is selected as the match. In the example of FIG. 4, audio1 is matched with face group5 and audio2 with face group1.
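A minimal sketch of the row-wise matching described above is given below. It assumes, for illustration only, that each voice and each face group is already available as a per-frame 0/1 list, and it counts individual co-occurring frames rather than runs of consecutive frames. The function names and the sample data are hypothetical.

```python
def cooccurrence_matrix(voices, faces):
    """Count, for every (voice, face group) pair, the frames where both are 1.

    voices and faces map a label to a per-frame list of 0/1 values.
    """
    return {
        voice: {
            face: sum(a & b for a, b in zip(voice_vec, face_vec))
            for face, face_vec in faces.items()
        }
        for voice, voice_vec in voices.items()
    }


def match_voices_to_faces(matrix):
    """For each row (voice), pick the column (face group) with the highest count.

    Ties resolve to the first face group in iteration order (a simplification).
    """
    return {voice: max(row, key=row.get) for voice, row in matrix.items()}


if __name__ == "__main__":
    voices = {"audio1": [1, 1, 0, 0, 1, 1], "audio2": [0, 0, 1, 1, 0, 0]}
    faces = {"face_group1": [0, 0, 1, 1, 1, 0], "face_group5": [1, 1, 0, 0, 1, 1]}
    matrix = cooccurrence_matrix(voices, faces)
    print(matrix)
    print(match_voices_to_faces(matrix))
```

Picking the maximum per row allows two voices to map to the same face group. A one-to-one assignment would need an additional constraint that the text does not describe.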

The control unit 120 also generates a character frame vector indicating, for every frame, whether a character appears. Before generating the character frame vector, it first generates, for each character, vectors over all frames in which a frame where the voice or the face appears is marked 1 and a frame where it does not is marked 0. Examples are the 0s and 1s shown in the audio and face rows of each face group in the character frame vector generation part of FIG. 1. Depending on the setting, the control unit 120 then applies an AND or OR operation to the audio and face appearance values of each face group to generate a character frame vector for each character (face group).
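The AND/OR combination can be illustrated with a short sketch. The function name and the `mode` parameter are assumptions made for this example; the patent only states that the operation is chosen by a setting.

```python
def character_frame_vector(audio, face, mode="or"):
    """Combine per-frame audio and face 0/1 vectors for one face group.

    mode="and": the character appears only when both voice and face are present.
    mode="or":  the character appears when either voice or face is present.
    """
    if len(audio) != len(face):
        raise ValueError("audio and face vectors must cover the same frames")
    if mode == "and":
        return [a & f for a, f in zip(audio, face)]
    if mode == "or":
        return [a | f for a, f in zip(audio, face)]
    raise ValueError("mode must be 'and' or 'or'")


if __name__ == "__main__":
    audio = [1, 0, 1, 0]
    face = [1, 1, 0, 0]
    print(character_frame_vector(audio, face, mode="and"))  # [1, 0, 0, 0]
    print(character_frame_vector(audio, face, mode="or"))   # [1, 1, 1, 0]
```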

In addition, when a character appears only in a run of consecutive frames no longer than a threshold, the control unit 120 may correct the character frame vector so that the character is treated as not appearing in those frames. The voice frame information and the face recognition frame information include frames in which a character's voice or face appears only briefly and then disappears, so a character frame vector generated from them also marks those brief moments as appearances. The control unit 120 therefore corrects the character frame vector so that a voice and/or face that appears in a run of consecutive frames no longer than a preset threshold is treated as not appearing. The correction may be performed by moving a sliding window along the character frame vector. The window size is set to a specific value.
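One simple way to realize the threshold rule is to scan the vector for runs of 1s and clear the runs that are too short. The sketch below is a run-length formulation chosen for clarity; it is not necessarily how the apparatus implements the correction, and the function name is hypothetical.

```python
def remove_short_runs(vector, threshold):
    """Set to 0 every run of consecutive 1s whose length is <= threshold."""
    result = list(vector)
    n = len(result)
    i = 0
    while i < n:
        if result[i] == 1:
            j = i
            while j < n and result[j] == 1:
                j += 1  # j ends one past the last 1 of this run
            if j - i <= threshold:
                result[i:j] = [0] * (j - i)
            i = j
        else:
            i += 1
    return result


if __name__ == "__main__":
    vector = [0, 1, 0, 1, 1, 1, 0, 1, 1]
    print(remove_short_runs(vector, threshold=2))  # [0, 0, 0, 1, 1, 1, 0, 0, 0]
```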

FIG. 5 illustrates the concept of correcting a character frame vector using a sliding window. In general, the sliding window technique is used to correct time series data: noise or highly variable parts that appear for a short moment are smoothed by averaging them with the surrounding values. A character frame vector, however, does not hold continuous values. It holds the discrete values appearance (1) and non-appearance (0), so the correction considers factors such as the majority of the surrounding values within the window and the number of consecutive 1s. In the example of FIG. 5, the window (size 5) contains only one appearance (1), so it is corrected to non-appearance (0). Although not illustrated, the opposite correction is also possible: if the window contains a single non-appearance (0) and the rest are appearances (1), the non-appearance (0) may be corrected to appearance (1).
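As a rough illustration of a window-based correction, the sketch below applies a majority vote over a centered window of odd size. The majority rule is only one of the criteria the text mentions, and the handling of the vector edges (truncated windows) is an assumption made here.

```python
def majority_smooth(vector, window=5):
    """Replace each value with the majority value of a centered window.

    Windows are truncated at both ends of the vector; a frame becomes 1 only
    when strictly more than half of the values in its window are 1.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    n = len(vector)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        ones = sum(vector[lo:hi])
        smoothed.append(1 if 2 * ones > hi - lo else 0)
    return smoothed


if __name__ == "__main__":
    print(majority_smooth([0, 0, 1, 0, 0]))  # isolated appearance removed
    print(majority_smooth([1, 1, 0, 1, 1]))  # isolated gap filled
```

Unlike deleting short runs, this variant also fills short gaps, which corresponds to the opposite correction mentioned in the text.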

Through the user interface 160, the video analysis apparatus accepts user input and displays processed results on a screen. The user interface 160 can also play back and display a video.

For frame search, the user interface 160 may present thumbnail images of the face-recognized characters as a list so that the user can select a character, and the user may select more than one character. The frame search results for the user's input may be presented to the user through the user interface 160, and only the frames in the results may be played back and displayed through the user interface 160.

For clip image generation, the user interface 160 may likewise present thumbnail images of the face-recognized characters as a list so that the user can select a character, and the user may select more than one character. The clip image generated according to the user's input may be played back and displayed through the user interface 160.

The meta information generation unit 130 uses the character frame vector to generate bookmarking information for the frames in which a character appears. With this bookmarking information, the user can quickly navigate between the scenes in which a specific character appears through the user interface 160 provided by the video analysis apparatus 10.
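Bookmarking information could, for instance, be represented as a list of frame ranges derived from the character frame vector. The representation as inclusive `(start_frame, end_frame)` pairs and the function name are assumptions for this sketch; the text does not specify a format.

```python
def bookmarks_from_vector(vector):
    """Return inclusive (start_frame, end_frame) pairs for each run of 1s."""
    spans = []
    start = None
    for i, value in enumerate(vector):
        if value == 1 and start is None:
            start = i  # a new appearance begins
        elif value == 0 and start is not None:
            spans.append((start, i - 1))  # the appearance ended on the previous frame
            start = None
    if start is not None:
        spans.append((start, len(vector) - 1))  # appearance runs to the last frame
    return spans


if __name__ == "__main__":
    print(bookmarks_from_vector([0, 1, 1, 0, 0, 1, 1, 1]))  # [(1, 2), (5, 7)]
```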

The meta information generation unit 130 may also use the character frame vector to calculate the appearance time and the number of appearances of each character, and may generate per-character appearance information that includes each character's total appearance time and number of appearances. Using this information, a graph or similar display that compares the appearance information of the characters can be output through the user interface 160 provided by the video analysis apparatus 10.
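The two statistics can be computed directly from a character frame vector, as the following sketch suggests. It assumes a constant frame rate `fps` and counts one appearance per run of consecutive 1s; both choices, like the function name, are illustrative assumptions.

```python
def appearance_stats(vector, fps):
    """Return total appearance time in seconds and the number of appearances.

    An appearance is counted at every frame where the value changes from 0 to 1
    (or where the vector starts with 1).
    """
    frames = sum(vector)
    count = sum(
        1
        for i, value in enumerate(vector)
        if value == 1 and (i == 0 or vector[i - 1] == 0)
    )
    return {"seconds": frames / fps, "count": count}


if __name__ == "__main__":
    print(appearance_stats([1, 1, 0, 0, 1, 1, 1, 0], fps=2))
```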

The meta information generation unit 130 may also define a graph in which the characters are nodes, characters appearing in the same scene are connected by edges, and the frequency of co-appearance is the edge weight. Through graph analysis it may generate character relationship information, including the main characters and a relationship diagram between the characters. That is, the meta information generation unit 130 represents the relationships between characters as a graph data structure, uses graph analysis techniques to select the main characters, and can draw up a relationship diagram between the characters. The analysis may use measures such as Degree Centrality, Closeness Centrality, Betweenness Centrality, and Eigenvector Centrality. With Degree Centrality, for example, a character connected to more nodes can be regarded as more likely to be a main character.
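The graph construction and the Degree Centrality measure can be sketched with the standard library alone. In this illustration, co-appearance is counted per frame rather than per scene, and degree centrality is normalized by the number of other nodes; both are assumptions, as are the function names and sample vectors.

```python
from itertools import combinations


def build_cooccurrence_graph(vectors):
    """Return {(a, b): weight} for character pairs that share at least one frame.

    vectors maps a character label to its per-frame 0/1 character frame vector.
    """
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        weight = sum(x & y for x, y in zip(vectors[a], vectors[b]))
        if weight > 0:
            edges[(a, b)] = weight
    return edges


def degree_centrality(vectors, edges):
    """Number of neighbours of each character divided by (number of characters - 1)."""
    n = len(vectors)
    degree = {name: 0 for name in vectors}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {name: (d / (n - 1) if n > 1 else 0.0) for name, d in degree.items()}


if __name__ == "__main__":
    vectors = {
        "A": [1, 1, 1, 0],
        "B": [1, 0, 0, 0],
        "C": [0, 1, 1, 0],
        "D": [0, 0, 0, 1],
    }
    edges = build_cooccurrence_graph(vectors)
    centrality = degree_centrality(vectors, edges)
    print(edges)
    print(max(centrality, key=centrality.get))  # candidate main character
```

The other centrality measures named in the text would need shortest-path or eigenvector computations, which a graph library would normally provide.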

The frame search unit 140 may use the bookmarking information to search for the frames in which the character selected through the user interface 160 appears. The search results may be provided as a list.
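A frame search over one or more selected characters might look like the sketch below. The text does not say how a multi-character selection is combined, so the `require_all` switch (all selected characters present versus any of them) is purely an assumption, as is the use of per-frame vectors as the search index.

```python
def search_frames(vectors, selected, require_all=True):
    """Return the frame indices in which the selected characters appear.

    vectors maps a character label to its per-frame 0/1 vector.
    require_all=True keeps frames where every selected character appears;
    require_all=False keeps frames where at least one of them appears.
    """
    chosen = [vectors[name] for name in selected]
    if not chosen:
        return []
    combine = all if require_all else any
    return [i for i, column in enumerate(zip(*chosen)) if combine(column)]


if __name__ == "__main__":
    vectors = {"A": [1, 1, 0, 0], "B": [0, 1, 1, 0]}
    print(search_frames(vectors, ["A", "B"]))                     # [1]
    print(search_frames(vectors, ["A", "B"], require_all=False))  # [0, 1, 2]
```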

The clip image editing unit 150 may use the bookmarking information to edit the frames in which the character selected through the user interface 160 appears and automatically generate a clip image. It first searches for the frames in which the selected character appears and then combines the retrieved frames into a single clip image. The generated clip image can be played back through the user interface 160.

The clip image editing unit 150 may also use the bookmarking information to automatically generate an edited video from which the frames in which the character selected through the user interface 160 appears have been deleted.
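Both editing modes, keeping only a character's frames and deleting them, reduce to filtering the frame sequence by the character frame vector. The sketch below uses placeholder strings for frames. A real editor would more likely cut by time ranges with a video library, which is outside what the text specifies.

```python
def clip_frames(frames, vector):
    """Keep only the frames in which the selected character appears (value 1)."""
    return [frame for frame, value in zip(frames, vector) if value == 1]


def remove_character_frames(frames, vector):
    """Drop the frames in which the selected character appears (value 1)."""
    return [frame for frame, value in zip(frames, vector) if value == 0]


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3"]  # placeholders for decoded frames
    vector = [0, 1, 1, 0]              # character frame vector of the selected character
    print(clip_frames(frames, vector))              # ['f1', 'f2']
    print(remove_character_frames(frames, vector))  # ['f0', 'f3']
```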

The bookmarking information and other meta information generated by the present invention can be used to edit out a specific performer's scenes from a broadcast program, to find reference footage in other broadcast programs more easily, to automatically edit and provide footage for each performer, or to calculate and provide the amount of screen time of each actor in a movie. Summary information about a video, including its main characters, playback time, and file size, can also be provided.

According to an embodiment of the present invention, the video analysis method is implemented at least in part by program instructions executed on the video analysis apparatus 10, which receives and processes a video. The method includes a per-speaker voice separation step, a voice frame information generation step, a character distinction step, a face recognition frame information generation step, a face and voice matching step, and a character frame vector generation step.

The per-speaker voice separation step separates the voice of each speaker from the audio content of the video. The video analysis apparatus 10 performs this separation with the voice separation model, which is the trained neural network model (for example, a deep learning model) described above.

The voice frame information generation step generates voice frame information for each separated voice. Voice frame information maps each speaker's voice to the frame information of the video.

The character distinction step recognizes the faces of the people appearing in the video content of the video and distinguishes the characters. The video analysis apparatus 10 does this with the face recognition model, which is the trained neural network model (for example, a deep learning model) described above. A face recognized by the video analysis apparatus 10 is the face as it appears in a particular scene, so the same person may be recognized differently from scene to scene. Because several recognized faces may therefore exist for one person, the present invention manages the recognized faces of the same person as a face group.

The face recognition frame information generation step generates face recognition frame information for the characters. The video analysis device 10 generates face recognition frame information for each face-recognized character. The face recognition frame information is information in which the scenes in which a face-recognized person appears are mapped to the frame information of the video.

The face and voice matching step matches the recognized faces with the voices by combining the voice frame information and the face recognition frame information. The video analysis device 10 automatically matches each voice to a recognized face, that is, to a face group, using how frequently the frames in which each speaker's voice appears (from the voice frame information) coincide with the frames in which each recognized face appears (from the face recognition frame information).
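
A minimal sketch of matching by co-occurrence frequency is given below. It assumes that both kinds of frame information are available as lists of frame indices keyed by speaker and by face group, and it simply assigns each speaker to the face group with the largest number of shared frames. The function name and data layout are hypothetical, and the specification does not state how ties or one-to-one constraints are handled.

```python
from typing import Dict, List, Optional


def match_voices_to_faces(
    voice_frames: Dict[str, List[int]],
    face_frames: Dict[str, List[int]],
) -> Dict[str, Optional[str]]:
    """Assign each speaker to the face group it co-occurs with most often.

    Assumption: speakers are matched independently, so two speakers may be
    assigned to the same face group. A speaker with no overlap gets None.
    """
    face_sets = {group: set(frames) for group, frames in face_frames.items()}
    matches: Dict[str, Optional[str]] = {}
    for speaker, frames in voice_frames.items():
        voiced = set(frames)
        best_group, best_count = None, 0
        for group, shown in face_sets.items():
            count = len(voiced & shown)  # frames where voice and face coincide
            if count > best_count:
                best_group, best_count = group, count
        matches[speaker] = best_group
    return matches
```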

The character frame vector generation step generates, over all frames, a character frame vector indicating whether each character appears. Before generating the character frame vector, the video analysis device 10 first generates, for each character, vectors over all frames in which frames where the character's voice and face appear are marked 1 and frames where they do not appear are marked 0. Depending on the configuration, it then applies an AND or OR operation to the audio appearance and the face appearance of each face group to generate a character frame vector for each character (face group).
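
The AND/OR combination described above could be sketched as follows. This is an illustrative example only: the names indicator and character_frame_vector, the mode parameter, and the use of plain Python lists are assumptions, not part of the specification.

```python
from typing import Iterable, List


def indicator(frames: Iterable[int], total_frames: int) -> List[int]:
    """Build a 0/1 vector over all frames, with 1 at each listed frame index."""
    vec = [0] * total_frames
    for i in frames:
        if 0 <= i < total_frames:
            vec[i] = 1
    return vec


def character_frame_vector(
    voice_frames: Iterable[int],
    face_frames: Iterable[int],
    total_frames: int,
    mode: str = "or",
) -> List[int]:
    """Combine the voice and face appearance of one face group per frame."""
    voice = indicator(voice_frames, total_frames)
    face = indicator(face_frames, total_frames)
    if mode == "and":  # character counted only when both seen and heard
        return [v & f for v, f in zip(voice, face)]
    if mode == "or":  # character counted when seen or heard
        return [v | f for v, f in zip(voice, face)]
    raise ValueError("mode must be 'and' or 'or'")
```

With the OR setting a character counts as present when either the face or the voice is detected, while the AND setting requires both in the same frame.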

According to another embodiment of the present invention, the video analysis method may further include a character frame vector correction step. In this step, when a character appears only in a run of consecutive frames whose length is at or below a threshold, the character frame vector is corrected so that the character is treated as not appearing in those frames.
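
One possible reading of this correction is sketched below: every run of consecutive 1s whose length is at or below the threshold is reset to 0. The function name suppress_short_runs and the representation of the vector as a Python list are assumptions for illustration.

```python
from typing import List


def suppress_short_runs(vec: List[int], threshold: int) -> List[int]:
    """Zero out runs of consecutive 1s whose length is <= threshold."""
    out = list(vec)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 1:
            j = i
            while j < n and out[j] == 1:  # find the end of this run
                j += 1
            if j - i <= threshold:
                for k in range(i, j):
                    out[k] = 0
            i = j
        else:
            i += 1
    return out
```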

According to another embodiment of the present invention, the video analysis method may further include a bookmarking information generation step, which uses the character frame vector to generate bookmarking information for the frames in which a character appears.
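
The specification does not fix a format for the bookmarking information. As a sketch under that caveat, the character frame vector could be condensed into inclusive (first frame, last frame) intervals; the function name and the interval format are hypothetical.

```python
from typing import List, Tuple


def bookmarks_from_vector(vec: List[int]) -> List[Tuple[int, int]]:
    """Convert a 0/1 frame vector into inclusive (first, last) intervals."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, v in enumerate(vec):
        if v and start is None:
            start = i  # a new appearance begins
        elif not v and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:  # the vector ends while the character is present
        intervals.append((start, len(vec) - 1))
    return intervals
```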

According to another embodiment of the present invention, the video analysis method may further include an appearance information generation step, which uses the character frame vector to calculate the appearance time and the number of appearances of each character and generates per-character appearance information including the appearance time and the number of appearances.
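
A minimal sketch of this calculation follows. It assumes that the appearance time is the number of frames marked 1 divided by the frame rate, and that the number of appearances is the number of separate runs of consecutive 1s; the specification does not define either quantity precisely, and the function name and the fps parameter are hypothetical.

```python
from typing import Dict, List, Union


def appearance_info(vec: List[int], fps: float) -> Dict[str, Union[float, int]]:
    """Compute appearance time (seconds) and number of appearances."""
    frames_on = sum(vec)
    # An appearance starts wherever a 1 is not preceded by another 1.
    count = sum(
        1 for i, v in enumerate(vec) if v == 1 and (i == 0 or vec[i - 1] == 0)
    )
    return {"appearance_seconds": frames_on / fps, "appearance_count": count}
```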

According to another embodiment of the present invention, the video analysis method may further include a character relationship diagram information generation step, which generates, through graph analysis, character relationship diagram information including the main characters and a relationship diagram between the characters. The graph is defined with the characters as nodes, with edges connecting characters that appear in the same scene, and with the frequency of simultaneous appearance as the edge weight. The video analysis device 10 can express the relationships among the characters as a graph data structure, select the main characters using graph analysis techniques, and create a relationship diagram between the characters. The graph analysis may use indicators such as Degree Centrality, Closeness Centrality, Betweenness Centrality, and Eigenvector Centrality.
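
The graph definition above could be sketched as follows, using only the standard library. The sketch assumes the video has already been divided into scenes, each given as the set of characters appearing in it; the scene segmentation itself, the function names and the data layout are assumptions. Of the indicators mentioned, only degree centrality is shown, computed as the number of neighbours divided by the number of other nodes.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def build_cooccurrence_graph(scenes: Iterable[Set[str]]) -> Dict[Tuple[str, str], int]:
    """Return edge weights: how many scenes each pair of characters shares."""
    weights: Counter = Counter()
    for cast in scenes:
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def degree_centrality(weights: Dict[Tuple[str, str], int]) -> Dict[str, float]:
    """Degree centrality of every character that has at least one edge."""
    neighbors: Dict[str, Set[str]] = {}
    for a, b in weights:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    denom = max(len(neighbors) - 1, 1)
    return {node: len(adj) / denom for node, adj in neighbors.items()}
```

Characters with the highest centrality values would be candidates for the main characters. A character who never shares a scene with anyone has no edge and is therefore absent from the result of this sketch.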

According to another embodiment of the present invention, the video analysis method may further include a frame search step, which uses the bookmarking information to search for the frames in which a character selected through the user interface 160 appears. At least one character to be searched may be selected through the user interface 160, which presents the characters face-recognized by the video analysis device 10 and receives the selection.
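
As a hedged illustration of the frame search, the sketch below assumes that the bookmarking information is a dictionary of inclusive (first frame, last frame) intervals per character and that, when several characters are selected, the search returns the frames in which all of them appear together. The specification does not say whether multiple selections are combined by intersection or union, so this choice, like the function name, is an assumption.

```python
from typing import Dict, List, Sequence, Tuple


def search_frames(
    bookmarks: Dict[str, List[Tuple[int, int]]],
    selected: Sequence[str],
) -> List[int]:
    """Return sorted frame indices where every selected character appears."""
    if not selected:
        return []
    frame_sets = []
    for name in selected:
        frames = set()
        for first, last in bookmarks.get(name, []):
            frames.update(range(first, last + 1))  # intervals are inclusive
        frame_sets.append(frames)
    return sorted(set.intersection(*frame_sets))
```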

According to another embodiment of the present invention, the video analysis method may further include a clip video generation step, which uses the bookmarking information to edit the frames in which a character selected through the user interface 160 appears and generates a clip video. At least one character may be selected through the user interface 160, which presents the characters face-recognized by the video analysis device 10 and receives the selection.
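
The specification does not describe how the frames are edited into a clip. One way to prepare the edit, shown purely as a sketch, is to turn the matching frame indices into a cut list of (start, end) times in seconds that an editing tool could then extract and concatenate. The function name cut_list and the parameters fps and max_gap are hypothetical.

```python
from typing import Iterable, List, Tuple


def cut_list(
    frames: Iterable[int], fps: float, max_gap: int = 0
) -> List[Tuple[float, float]]:
    """Group frame indices into (start_sec, end_sec) ranges for a clip.

    Frames separated by at most max_gap missing frames are merged into
    one range so the clip does not jump on very short absences.
    """
    ranges: List[List[int]] = []
    for f in sorted(set(frames)):
        if ranges and f - ranges[-1][1] <= max_gap + 1:
            ranges[-1][1] = f  # extend the current range
        else:
            ranges.append([f, f])  # start a new range
    # The end time is the end of the last frame in each range.
    return [(first / fps, (last + 1) / fps) for first, last in ranges]
```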

FIG. 6 is a flowchart illustrating an exemplary procedure of the video analysis method of the present invention. Referring to FIG. 6, the video analysis device 10 separates the voice of each speaker from the audio content of the video using the voice separation model, which is a trained neural network model (S1000). The video analysis device 10 then generates voice frame information, in which each separated speaker's voice is mapped to the frame information of the video (S1010).

The video analysis device 10 recognizes the faces of the people appearing in the video content of the video using the face recognition model, which is a trained neural network model, and distinguishes the characters (S1030). The faces recognized by the video analysis device 10 are managed as face groups. The video analysis device 10 generates face recognition frame information, in which the scenes in which each face-recognized character appears are mapped to the frame information of the video (S1050).

The video analysis device 10 matches the recognized faces with the voices by combining the voice frame information and the face recognition frame information (S1070). In doing so, the video analysis device 10 automatically matches each voice to a recognized face, that is, to a face group, using how frequently the frames in which each speaker's voice appears coincide with the frames in which each recognized face appears.

The video analysis device 10 generates, over all frames, a character frame vector indicating whether each character appears (S1090). When a character appears only in a run of consecutive frames whose length is at or below a threshold, the device corrects the character frame vector so that the character is treated as not appearing in those frames.

Using the character frame vector, the video analysis device 10 generates bookmarking information for the frames in which each character appears (S1110), calculates the appearance time and the number of appearances of each character and generates per-character appearance information including them (S1130), and generates, through graph analysis, character relationship diagram information including the main characters and a relationship diagram between the characters (S1150). These steps are not essential: they may be performed in an order different from the example, and all or some of them may be omitted according to the user's selection.

When the user wants to search for scenes in which a specific character or characters appear, the video analysis device 10 uses the bookmarking information to search for the frames in which the character selected through the user interface 160 appears (S1170). At least one character to be searched may be selected through the user interface 160, which presents the characters face-recognized by the video analysis device 10 and receives the selection.

When the user wants to generate a clip video composed of scenes in which a specific character or characters appear, the video analysis device 10 uses the bookmarking information to edit the frames in which the character selected through the user interface 160 appears and generates a clip video (S1190). At least one character may be selected through the user interface 160, which presents the characters face-recognized by the video analysis device 10 and receives the selection.

Although the present invention has been described above through embodiments with reference to the accompanying drawings, it is not limited thereto and should be construed to encompass the various modifications that a person skilled in the art could obviously derive from them. The claims are intended to cover such modifications.

10: video analysis device
100: audio analysis unit
110: video analysis unit
120: control unit
130: meta information generating unit
140: frame search unit
150: clip video editing unit
160: user interface

Claims

1. A video analysis device comprising:
an audio analysis unit that separates the voice of each speaker from the audio content of a video and generates voice frame information for each speaker's voice;
a video analysis unit that recognizes the faces of people appearing in the video content of the video, distinguishes the characters, and generates face recognition frame information for the characters; and
a control unit that matches the recognized faces with the voices by combining the voice frame information and the face recognition frame information, and generates, over all frames, a character frame vector indicating whether each character appears.

2. The video analysis device of claim 1,
wherein, when a character appears only in a run of consecutive frames at or below a threshold, the control unit corrects the character frame vector so that the character is treated as not appearing in those frames.

3. The video analysis device of claim 2, further comprising:
a meta information generating unit that generates bookmarking information for the frames in which the characters appear.

4. The video analysis device of claim 3,
wherein the meta information generating unit calculates the appearance time and the number of appearances of each character, and generates per-character appearance information including the appearance time and the number of appearances.

5. The video analysis device of claim 3,
wherein the meta information generating unit defines a graph with the characters as nodes, with edges connecting characters that appear in the same scene, and with the frequency of simultaneous appearance as the edge weight, and generates, through graph analysis, character relationship diagram information including the main characters and a relationship diagram between the characters.

6. The video analysis device of claim 3, further comprising:
a user interface that presents the face-recognized characters, receives a selection of at least one character to be searched, and plays and displays the retrieved frames; and
a frame search unit that searches for the frames in which the selected character appears by using the bookmarking information.

7. The video analysis device of claim 3, further comprising:
a user interface that presents the face-recognized characters, receives a selection of at least one character to be searched, and plays and displays the retrieved frames; and
a clip video editing unit that generates a clip video by editing the frames in which the selected character appears by using the bookmarking information.

8. A video analysis method, at least part of which is implemented as program instructions executed in a video analysis device that receives and processes a video, the method comprising:
separating the voice of each speaker from the audio content of the video;
generating voice frame information for each speaker's voice;
distinguishing the characters by recognizing the faces of the people appearing in the video content of the video;
generating face recognition frame information for the characters;
matching the recognized faces with the voices by combining the voice frame information and the face recognition frame information; and
generating, over all frames, a character frame vector indicating whether each character appears.

9. The video analysis method of claim 8, further comprising:
when a character appears only in a run of consecutive frames at or below a threshold, correcting the character frame vector so that the character is treated as not appearing in those frames.

10. The video analysis method of claim 9, further comprising:
generating bookmarking information for the frames in which the characters appear.

11. The video analysis method of claim 10, further comprising:
calculating the appearance time and the number of appearances of each character, and generating per-character appearance information including the appearance time and the number of appearances.

12. The video analysis method of claim 10, further comprising:
generating, through graph analysis, character relationship diagram information including the main characters and a relationship diagram between the characters,
wherein the graph is defined with the characters as nodes, with edges connecting characters that appear in the same scene, and with the frequency of simultaneous appearance as the edge weight.

13. The video analysis method of claim 10, further comprising:
searching for the frames in which a selected character appears by using the bookmarking information,
wherein at least one character to be searched is selected through a user interface that presents the characters face-recognized by the video analysis device and receives the selection.

14. The video analysis method of claim 10, further comprising:
generating a clip video by editing the frames in which a selected character appears by using the bookmarking information,
wherein at least one character is selected through a user interface that presents the characters face-recognized by the video analysis device and receives the selection.