KR102267403B1

KR102267403B1 - Apparatus or Method for Detecting Meaningful Intervals using voice and video information

Info

Publication number: KR102267403B1
Application number: KR1020190166956A
Authority: KR
Inventors: 강은철
Original assignee: 주식회사 코난테크놀로지
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2021-06-22
Also published as: KR102343407B1; KR20210075924A; WO2021118072A1

Abstract

A system for extracting a meaningful section according to an embodiment of the present invention includes: a video segmentation engine comprising a video registration module and a video segmentation module for registering or dividing a video for which metadata is to be generated; a metadata auto-tagging engine for automatically generating metadata of the registered or divided video; and a section determination engine for extracting a meaningful section from the sections of the divided video. The section determination engine can store a set consisting of a predetermined number of consecutive pieces of caption information as one or more temporary meaningful sections by calculating similarity scores based on the caption information. Each stored temporary meaningful section can be composed of one or more shot sections. Based on pre-built deep learning models, a scene section estimated to belong to the same event can be quickly extracted from a video by using voice information.

Description

Method and apparatus for detecting meaningful intervals using voice and video information {Apparatus or Method for Detecting Meaningful Intervals using voice and video information}

The present invention relates to a method for detecting a meaningful section using audio and video information, and to an apparatus or system therefor, and in particular to detecting video sections more accurately.

With advances in deep learning, research related to artificial intelligence is growing day by day. In particular, the Turing test field, in which a machine understands human language, recognizes a situation, and goes on to explain the situation or its background, has become the most important research field in robotics engineering. Research in this field, however, involves organically intertwined topics: the visual domain, which recognizes human faces, objects, and backgrounds; the auditory domain, which recognizes speech and sounds; and the world-knowledge domain, which builds a knowledge base by linking content related to the real world. To carry out such research, a person must step in and divide a video into meaningful sections (or semantic sections; hereinafter "scene" or "scene section") according to the events that occur. Human division of scene sections is generally accurate, but it takes a long time because the entire video must be watched, and errors still occur because each person applies different criteria for judging what counts as the same event. Unlike visual or auditory metadata, for which many identical items are built, relatively few scene sections are built, and because of the time-series nature of the data a single error affects two or three sections, so the overall error rate rises sharply.

Accordingly, the present invention proposes a method that uses voice information, on the basis of pre-built deep learning models, to quickly extract scene sections estimated to belong to the same event from a video, and then uses image information to check and correct the extracted section information so that the final metadata output is produced automatically.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention belongs from the description below.

According to an embodiment of the present invention, a system for extracting a semantic section includes: a video segmentation engine including a video registration module and a video segmentation module for registering or dividing a video for which metadata is to be generated; a metadata auto-tagging engine for automatically generating metadata of the registered or divided video; and a section determination engine for extracting a semantic section from the sections of the divided video. The section determination engine stores a set consisting of a preset number of consecutive pieces of caption information as one or more temporary semantic sections by calculating similarity scores based on the caption information, and each stored temporary semantic section consists of one or more shot sections. To calculate the caption-based similarity scores, the section determination engine: sets start caption information n for the similarity score calculation in the set; for K sequential pieces of caption information consisting of caption information n to caption information n+K-1 in the set, sequentially calculates the similarity score between caption information n and each of caption information n+1 to n+K-1, incrementing n by 1, until the number of caption information pairs whose similarity score is less than or equal to a preset score reaches a predetermined number; determines the temporally later caption information of the last pair for which a similarity score was calculated as end caption information n' of the similarity score calculation; and, if the time length of the caption information section defined by the start caption information and the end caption information is greater than or equal to a predetermined time length, matches the caption information section to shot sections and stores it as the temporary semantic section. Here, K is a predetermined constant, and n may be an integer greater than 1 that has an upper bound.

Additionally or alternatively, when caption information that has not been stored as a temporary semantic section remains in the set, the section determination engine may perform the similarity score calculation on the remaining caption information and attempt to store the next temporary semantic section.

Additionally or alternatively, even if the number of similarity scores less than or equal to the preset score is smaller than the predetermined number, when no caption information remains for similarity score calculation, the temporally later caption information of the last pair for which a similarity score was calculated may be determined as the end point of the temporary semantic section. If the time length of the caption information section defined by the start point and the end point is greater than or equal to the predetermined time length, the caption information section may be matched to shot sections and stored as the temporary semantic section.

(Deleted)

Additionally or alternatively, a plurality of analysis schemes may be used for the similarity score calculation.

Additionally or alternatively, if the time length of the caption information section defined by the start point and the end point is less than the predetermined time length, the caption information section may not be stored as a temporary semantic section.
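The caption-based procedure in the preceding paragraphs can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `Caption` type, the Jaccard word-overlap score, the default parameter values, and the choice to accumulate the low-similarity count across successive values of n are all illustrative, and the step that matches a caption section to shot sections is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Caption:
    start: float      # utterance start time in seconds
    end: float        # utterance end time in seconds
    words: List[str]  # refined word list for the caption


def jaccard(a: Caption, b: Caption) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not the patent's score."""
    sa, sb = set(a.words), set(b.words)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def temporary_sections(
    captions: List[Caption],
    k: int = 3,                 # window size K (hypothetical default)
    low_score: float = 0.2,     # "preset score" (hypothetical default)
    max_low_pairs: int = 2,     # "predetermined number" (hypothetical default)
    min_duration: float = 5.0,  # "predetermined time length" in seconds
    similarity: Callable[[Caption, Caption], float] = jaccard,
) -> List[Tuple[int, int]]:
    """Return (start_index, end_index) caption ranges kept as temporary sections."""
    if k < 2:
        raise ValueError("k must be at least 2 so that each window holds a pair")
    sections = []
    start = 0
    # At least two captions must remain for a pair to be scored.
    while start < len(captions) - 1:
        low_pairs, end, n, done = 0, start, start, False
        while not done and n < len(captions) - 1:
            # Compare caption n with the next K-1 captions in its window.
            for m in range(n + 1, min(n + k, len(captions))):
                end = m  # later caption of the most recently scored pair
                if similarity(captions[n], captions[m]) <= low_score:
                    low_pairs += 1
                    if low_pairs >= max_low_pairs:
                        done = True
                        break
            n += 1
        # Keep the run only if it is long enough; shorter runs are not stored.
        if captions[end].end - captions[start].start >= min_duration:
            sections.append((start, end))
        # Retry on the captions that remain after this run.
        start = end + 1
    return sections
```

Passing a different `similarity` callable is one way to combine several analysis schemes, in line with the statement that more than one scheme may be used.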

Additionally or alternatively, for each stored temporary semantic section, the section determination engine may calculate similarity scores based on the image information of that temporary semantic section and either store the temporary semantic section as a final semantic section or further divide it and then store the result as final semantic sections.

Additionally or alternatively, for the similarity score calculation, the section determination engine may set the L temporally earliest shot sections of the temporary semantic section as reference sections and determine the similarity between the reference sections and the remaining shot sections.

Additionally or alternatively, the similarity score calculation may include determining, between each of the L shot sections and a remaining shot section, the similarity of person or object information or the similarity of the position and size information of a person or object.

Additionally or alternatively, if the similarity scores for person or object information between each of the L shot sections and a remaining shot section are all below a preset reference score, and the similarity scores for person or object position and size information between each of the L shot sections and that remaining shot section are all below a preset reference score, the stored temporary semantic section may be divided into two, and the point at which it is divided may be that remaining shot section.

According to another embodiment of the present invention, a method for extracting a semantic section is proposed. The method includes: registering or dividing a video for which metadata is to be generated; automatically generating caption information of the registered or divided video; and extracting a temporary semantic section from the sections of the divided video based on the caption information. The extracting includes storing one or more temporary semantic sections from a set consisting of a preset number of consecutive pieces of caption information by calculating similarity scores based on the caption information, each temporary semantic section consisting of one or more shot sections. Storing the temporary semantic section includes: setting start caption information n for the similarity score calculation in the set; for K sequential pieces of caption information consisting of caption information n to caption information n+K-1 in the set, sequentially calculating the similarity score between caption information n and each of caption information n+1 to n+K-1, incrementing n by 1, until the number of caption information pairs whose similarity score is less than or equal to a preset score reaches a predetermined number; determining the temporally later caption information of the last pair for which a similarity score was calculated as end caption information n' of the similarity score calculation; and, if the time length of the caption information section defined by the start caption information and the end caption information is greater than or equal to a predetermined time length, matching the caption information section to shot sections and storing it as the temporary semantic section. Here, K is a predetermined constant, and n may be an integer greater than 1 that has an upper bound.
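The image-based check described earlier in this section (the first L shot sections as reference sections, and a split when both similarity conditions fail against every reference) might look like the following Python sketch. The `Shot` type, the Jaccard score over recognized names, the mean intersection-over-union score for position and size, and the single shared threshold are assumptions made for illustration; the text does not specify how either similarity is computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Shot:
    index: int
    boxes: Dict[str, Box]  # recognized person/object name -> bounding box


def label_similarity(a: Shot, b: Shot) -> float:
    """Jaccard overlap of recognized names (assumed measure)."""
    union = a.boxes.keys() | b.boxes.keys()
    return len(a.boxes.keys() & b.boxes.keys()) / len(union) if union else 0.0


def iou(p: Box, q: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    w = min(p[2], q[2]) - max(p[0], q[0])
    h = min(p[3], q[3]) - max(p[1], q[1])
    inter = max(w, 0.0) * max(h, 0.0)
    area = (p[2] - p[0]) * (p[3] - p[1]) + (q[2] - q[0]) * (q[3] - q[1]) - inter
    return inter / area if area > 0 else 0.0


def box_similarity(a: Shot, b: Shot) -> float:
    """Mean IoU over names present in both shots (assumed measure)."""
    shared = a.boxes.keys() & b.boxes.keys()
    if not shared:
        return 0.0
    return sum(iou(a.boxes[name], b.boxes[name]) for name in shared) / len(shared)


def split_section(shots: List[Shot], l: int = 2, threshold: float = 0.3) -> List[List[Shot]]:
    """Split a temporary section in two at the first shot unlike all L references."""
    refs = shots[:l]
    for pos in range(l, len(shots)):
        shot = shots[pos]
        labels_differ = all(label_similarity(r, shot) < threshold for r in refs)
        boxes_differ = all(box_similarity(r, shot) < threshold for r in refs)
        if labels_differ and boxes_differ:
            return [shots[:pos], shots[pos:]]  # the differing shot opens the second part
    return [shots]  # no split: the section is kept whole
```

A section that comes back unsplit corresponds to storing the temporary semantic section as a final semantic section.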

The above solutions are only some of the embodiments of the present invention, and various embodiments reflecting the technical features of the present invention can be derived and understood by those of ordinary skill in the art on the basis of the detailed description below.

According to the present invention, meaningful sections can be obtained according to objective conditions, and the time required to do so can be shortened.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention belongs from the description below.

The accompanying drawings, which are included as part of the detailed description to aid understanding of the present invention, provide embodiments of the present invention and, together with the detailed description, explain its technical idea.
FIG. 1 is a block diagram of a meaningful section detection system using image information and voice information.
FIG. 2 shows the configuration of the video segmentation engine.
FIG. 3 shows the configuration of the data management engine and the metadata auto-tagging engine.
FIG. 4 is a flowchart of a method for determining a semantic section using voice information.
FIG. 5 is a flowchart of a method for determining a semantic section using image information.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. The present invention is not limited to the embodiments described herein, however, and may be implemented in various other forms. The terminology used in this specification is intended to aid understanding of the embodiments and is not intended to limit the scope of the present invention. In addition, singular forms used below include plural forms unless the wording clearly indicates otherwise.

Robotics engineering is widely used in fields ranging from very simple ones, such as robot arms that automatically assemble products in factories and robots that automatically sell soft-serve ice cream, to fields that require complex judgment, such as robots that support precision medicine and robots that guide tourists at airports. This progress in robotics has been driven chiefly by dramatic improvements in artificial intelligence technology and, looking deeper, stems from the development of various deep learning architectures. However good a deep learning architecture may be, it still needs a large volume of metadata of matching quality. By analogy, it is like good food served in a good bowl: the bowl is the deep learning architecture, and the food is the metadata. The Turing test field, the research field receiving the most attention in robotics engineering, aims to enable a robot to listen to a human's question, compute and give an answer, and go further to explain the situation or supply background knowledge. Turing test research applied to robotics combines the results of very diverse and complex studies: a vision engine that lets the robot distinguish objects and people; a speech engine and a natural language processing engine for understanding human language and outputting internally computed results in human language; a situational knowledge engine trained to infer the situation from input visual or audio data; and a world knowledge engine configured to map input/output data to the real world. The number of engines that can be configured for a given purpose is countless, and correspondingly diverse forms of metadata are required.

Fundamentally, the basis for making a robot understand and answer human questions is, naturally, learning human language. For this purpose, dividing a video into scene sections based on a single event is considered important, but such scene-level division currently has to be done entirely by a person manually watching the video. Building metadata manually with human involvement costs a great deal of time and money, and this is one of the main factors slowing research in the Turing test field. The following therefore describes a method of dividing meaningful sections using voice, and a method of detecting the final meaningful sections by using image information to reinforce the reliability of the voice-based sections.

FIG. 1 is a block diagram of a meaningful section detection system using image information and voice information. The meaningful section detection system (1) according to the present invention consists of a video segmentation engine (10), a metadata auto-tagging engine (20), a section determination engine (30), and a data management engine (40) that interacts with them.

The video segmentation engine (10) consists of a video registration module (11) and a video segmentation module (12). The video registration module (11) converts and registers a user-selected video for which metadata needs to be generated, and stores the registered video in the video DB (41). The function and operation of the video segmentation engine are described in detail with reference to FIG. 2. The video segmentation module (12) can divide a video stored in the video DB (41) into frames.

The video segmentation module (12) can extract the sections in which the scene of a video stored in the metadata inspection server changes, and extract images at 10 fps (frames per second) for each section. Here, a section in which the scene of the video changes is referred to as a "shot". The video segmentation module can extract and store section information for each section, and the section information can include the start time, end time, and shot index of the section. The extracted section information and images are stored in the image DB, and the stored images can be used when the metadata auto-tagging engine recognizes human faces and objects.
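The section information and the 10 fps image sampling can be illustrated with a short Python sketch. The `ShotSection` record and the `frame_times` helper are hypothetical names, and the sketch assumes that sampling starts at the shot's start time and stops before its end time; the text does not state the exact sampling boundaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShotSection:
    index: int    # shot index
    start: float  # start time in seconds
    end: float    # end time in seconds


def frame_times(shot: ShotSection, fps: int = 10) -> List[float]:
    """Timestamps (seconds) at which still images would be sampled in a shot."""
    count = int(round((shot.end - shot.start) * fps))
    # Rounding to milliseconds keeps the timestamps free of float noise.
    return [round(shot.start + i / fps, 3) for i in range(count)]
```

For a half-second shot, this yields five timestamps spaced 0.1 s apart.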

The metadata auto-tagging engine (20) consists of a face recognition module (21), an object recognition module (22), and a speech recognition module (23). The metadata auto-tagging engine automatically builds the metadata used for scene section detection. FIG. 3 shows the configuration and interrelationships of the metadata auto-tagging engine.

For a video whose segmentation by the video segmentation engine is complete, the metadata auto-tagging engine runs the face recognition module, the object recognition module, and the speech recognition module, and the results of recognizing faces, objects, and speech can be stored in the voice DB (42) or the image DB (43).

The face recognition module and the object recognition module use pre-trained deep learning models to recognize the faces of persons and the objects in each image extracted by the video segmentation engine.

The learner of a recognition module outputs more accurate results the higher the quality of the data it is trained on.

When metadata produced by the face recognition module, the object recognition module, and the speech recognition module needs editing, it can be edited in a metadata inspection engine (not shown) and then reused as a data set for the learner of the recognition module, producing a new model with higher accuracy.

The face recognition module extracts feature points of the eyes, nose, and mouth of a face in an image and vectorizes them using a CNN (Convolutional Neural Network)-based model to recognize the face. For a recognized face, a bounding box can be displayed at the corresponding position on the image, the coordinates of that position are extracted, and the recognized person's name and the confidence can be displayed together. That is, the coordinates, person name, and confidence constitute automatically generated video metadata, which can be stored in the image DB (43). The confidence threshold is 70%: a result with confidence below 70% can be stored as a failure.

The object recognition module uses GBD-Net (gated bi-directional CNN), a CNN-based model that classifies objects using a feature map extracted from the image, to recognize the position coordinates and names of all objects in the image. The position coordinates, object name, and confidence of each recognized object are stored in the image DB (43). As with the face recognition module, the threshold is 70%, and a result with confidence below 70% is stored as a failure.
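The 70% rule applied by both recognition modules can be shown with a minimal Python sketch. The `Recognition` record, its field names, and the status strings are assumptions for illustration; only the threshold value and the "below 70% is a failure" rule come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Recognition:
    name: str                        # recognized person or object name
    box: Tuple[int, int, int, int]   # bounding-box coordinates (x1, y1, x2, y2)
    confidence: float                # confidence in the range 0.0 to 1.0
    status: str = "unchecked"


def apply_threshold(results: List[Recognition], threshold: float = 0.70) -> List[Recognition]:
    """Mark each result as a success or a failure against the confidence threshold."""
    for r in results:
        # Only results strictly below the threshold count as failures.
        r.status = "success" if r.confidence >= threshold else "failure"
    return results
```

Failed results are kept rather than discarded, which matches the statement that they are stored as failures.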

The speech recognition module generates the utterance start time, end time, length, and caption for the speech in the video. In the speech recognition process, the video's audio is preprocessed by removing noise and scaling signals that are too loud or too quiet to an appropriate level; only the speech segments are detected from the continuous utterances; the best hypothesis is selected by a decoder using language model scoring; and, as post-processing, a correction algorithm built on a domain dictionary is applied, so that the speech is converted to text and captions are generated.

Speaker identification, one of the speaker recognition techniques, is used to extract the speaker. To apply speaker identification, the speakers' voices must be stored in advance. Using a technique such as GMM-UBM (Gaussian Mixture Model-Universal Background Model), which extracts and models features from the speech signal of the input video, the name of the person whose pre-registered voice in the DB is closest is extracted as the speaker. The speaker, the utterance start and end times, and the caption data of the generated speech can be stored in the voice DB (42), and all of this information falls within the scope of video metadata.

The metadata generated by the metadata auto-tagging engine can later be used by the section determination engine (30) to calculate reliability or similarity and detect the final semantic sections.

The data management engine (40) consists of a video DB (management module) (41), a voice DB (management module) (42), and an image DB (management module) (43). The data management engine stores the metadata contained in the video, the image metadata extracted from the video at 10 fps, the shot section information automatically extracted on the basis of screen transitions, the auto-tagging results for each image, the voice caption information, and the voice section information, and it plays a central role in handling the requests from each module and the responses to them. The section determination engine is called only after all preprocessing that precedes semantic section detection is complete.

"Scene or scene section" and "shot or shot section" were defined above; their relationship is that one scene or scene section contains several shots or shot sections. That is, several shots or shot sections make up one scene or scene section. As also mentioned above, "scene section" can equally be expressed as "semantic section".

The section determination engine (30) consists of a section reliability analysis module (31) and a caption reliability analysis module (32). The section determination engine (30) combines the image information and voice information extracted by the other engines to determine reliability (or similarity) and to determine meaningful sections. First, a semantic section (scene or scene section) is determined using voice information. This determination is performed by the section determination engine (30) or the caption reliability analysis module (32) and proceeds in the order described below.

As described above, the section determination engine (30) can carry out this method using the shots or shot sections extracted by the video segmentation engine (10), the metadata auto-tagging engine (20), and so on, together with their metadata, all obtained from the data management engine (40).

도 4는 음성 정보들을 활용한 의미 구간 결정 방법의 순서도를 나타낸다. 4 is a flowchart of a method for determining a semantic section using voice information.

The section determination engine 30 or the caption reliability analysis module 32 may refine the caption information, that is, the data (S411). The caption information may be obtained by separating the audio from the video and converting the dialogue into text with an STT (speech-to-text) engine. Refinement first normalizes the caption text so that only English letters and digits remain, then uses morphological analysis to delete stop words such as particles, conjunctions, and prepositions, converting each caption into an array of words. Captions left with three words or fewer, which can hardly be regarded as sentences, are excluded. The remaining words are reduced to their base form by stemming, and the caption text, speaker information, and utterance time are stored as a list of dictionaries.
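As an illustration only, the refinement of S411 might be sketched in Python as follows. The stop-word list, the `naive_stem` helper, the field names (`text`, `speaker`, `start`), and the sample captions are all assumptions made for this sketch; the specification relies on morphological analysis and a stemmer that it does not name.

```python
import re

# Hypothetical stop-word list. The specification removes particles,
# conjunctions and prepositions found by morphological analysis instead.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to", "with"}


def naive_stem(word: str) -> str:
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def refine_captions(raw_captions, min_words=4):
    """Return refined captions as a list of dicts, numbered in time order."""
    refined = []
    for cap in sorted(raw_captions, key=lambda c: c["start"]):
        # Keep only English letters and digits, lower-cased.
        text = re.sub(r"[^A-Za-z0-9 ]+", " ", cap["text"]).lower()
        words = [w for w in text.split() if w not in STOP_WORDS]
        if len(words) < min_words:  # three words or fewer: not treated as a sentence
            continue
        refined.append({
            "seq": len(refined) + 1,                    # sequence number in time order
            "words": [naive_stem(w) for w in words],    # stemmed word array
            "speaker": cap["speaker"],
            "start": cap["start"],
        })
    return refined


raw = [
    {"text": "Okay, thanks!", "speaker": "B", "start": 4.0},
    {"text": "The detectives searched the house in the morning.", "speaker": "A", "start": 1.5},
]
captions = refine_captions(raw)
```

The short caption is dropped before stemming, in the order the paragraph above describes, and the surviving caption receives sequence number 1.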

As a result, sets made up of a predetermined number of consecutive pieces of caption information can be formed, and the refined caption information is managed with sequence numbers assigned in chronological order according to the time information of the video.

Next, the section determination engine 30 or the caption reliability analysis module 32 may initialize the sequence number (n) of the first caption subject to the similarity analysis described below and the sequence number (i) of the first caption of the temporary scene section (S412). Both are initialized to 1.

Among the target captions, the K consecutive captions that come earliest in time are selected, and these K captions are set as the caption section subject to similarity analysis (S413).

The selection of K captions may be repeated until the number of caption pairs whose similarity score is at or below (or strictly below) a predefined score reaches a preset number c (c being an integer of 1 or more). That is, if the similarity analysis of the K captions selected in S413 yields fewer than c such pairs, a caption section of K captions is selected again, starting from the caption that immediately follows the first caption of the previously selected section. If the first selected section covers captions n to n+K-1, the next one covers captions n+1 to n+K.

In more detail, the section determination engine 30 or the caption reliability analysis module 32 may calculate similarity scores for K captions in total, beginning with the first caption (S414). That is, similarity scores are calculated for K-1 caption pairs, one pair after another.

The section determination engine 30 or the caption reliability analysis module 32 performs the procedure described below (hereinafter “caption similarity analysis”) to calculate an average similarity score, and may divide scenes or scene sections according to whether that score is above or below a reference value.

More specifically, K consecutive captions are selected, and the earliest caption is paired with each of the remaining K-1 captions for the caption similarity analysis. The section determination engine 30 or the caption reliability analysis module 32 may thus calculate an average similarity score for each of the K-1 caption pairs (S414). For example, if the first caption has sequence number n, the analysis is performed on the K-1 pairs (n, n+1), (n, n+2), (n, n+3), ..., (n, n+K-1).
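The pairing just described can be written down in a few lines of Python. The function name `caption_pairs` is hypothetical and the sequence numbers are 1-based, as in the description.

```python
def caption_pairs(n: int, k: int):
    """Pairs (n, n+1), ..., (n, n+K-1) for the window that starts at caption n."""
    return [(n, n + offset) for offset in range(1, k)]


pairs = caption_pairs(n=1, k=5)
```

A window of K captions therefore always yields K-1 pairs, each anchored on the first caption of the window.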

The caption similarity analysis is described in detail below.

For the caption similarity analysis, the K-1 caption pairs are first compared using TF-IDF (Term Frequency-Inverse Document Frequency) analysis. TF-IDF is the product of TF and IDF. TF is the number of times a given word occurs in a given document; applied to the present invention, it is the number of times word k occurs in the i-th caption. IDF is, as the name says, the inverse of DF, where DF is the number of documents (that is, captions) in which word k appears. IDF is usually expressed with a logarithm, as follows.
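The equation itself is not reproduced in this text. A widely used smoothed form that uses the same symbols as the explanation that follows (n, d, t, df(t)) is given here as an assumption, not as the exact expression of the specification:

```latex
\mathrm{idf}(d, t) = \log\left(\frac{n}{1 + \mathrm{df}(t)}\right),
\qquad
\text{tf-idf}(d, t) = \mathrm{tf}(d, t) \cdot \mathrm{idf}(d, t)
```

Other variants, such as \log(n / \mathrm{df}(t)) + 1, differ only in how they smooth the ratio.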

In the above equation, n is the total number of documents, which here is the total number of captions; because two captions are compared with each other, n=2. d denotes a document, that is, an individual caption, and t denotes a word. df(t) is the DF described above, the number of captions in which word t appears.

The TF-IDF value can be understood as a weight for a specific word, and is always less than 1.

The TF-IDF values of each caption can be expressed as a vector. For example, if three words occur in the caption with n=1, a TF-IDF value is obtained for each word, giving three values that form a three-dimensional vector. Because two captions are compared, two vectors are obtained, which may have the same or different dimensions. The similarity of these two vectors can be expressed as a single value using cosine similarity.
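The cosine similarity equation is not reproduced in this text. Its standard definition is shown below, with m (a symbol introduced here) denoting the common dimension of the two vectors:

```latex
\text{similarity} = \cos(\theta)
= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{m} A_i B_i}{\sqrt{\sum_{i=1}^{m} A_i^{2}} \, \sqrt{\sum_{i=1}^{m} B_i^{2}}}
```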

Here, A and B denote the two captions, and A_i and B_i are the TF-IDF values of each caption. To serve as a similarity score, the value may be expressed as a percentage; this score is referred to as A.
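A minimal Python sketch of score A follows, under two stated assumptions. First, both captions are mapped onto their joint vocabulary so that the two vectors have the same dimension, which cosine similarity requires. Second, the IDF variant log(n/df)+1 is used, because with n=2 it keeps a positive weight for words shared by both captions; the specification does not say which variant it uses. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(words_a, words_b):
    """TF-IDF vectors of two captions over their joint vocabulary (n = 2 documents)."""
    docs = [Counter(words_a), Counter(words_b)]
    vocab = sorted(set(words_a) | set(words_b))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        vec = []
        for term in vocab:
            df = sum(1 for d in docs if term in d)
            # Assumed variant: log(n / df) + 1, so that shared terms keep a
            # positive weight even though n is only 2.
            idf = math.log(n_docs / df) + 1.0
            vec.append(doc[term] * idf)
        vectors.append(vec)
    return vectors


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_a(words_a, words_b):
    """Cosine similarity of the two TF-IDF vectors, as a percentage."""
    vec_a, vec_b = tfidf_vectors(words_a, words_b)
    return 100.0 * cosine_similarity(vec_a, vec_b)
```

For instance, `score_a(["dog", "park", "ball"], ["dog", "park", "tree"])` returns a value strictly between 0 and 100, while two captions with no word in common score 0.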

Vector similarity analysis uses a model trained in advance with Word2Vec to calculate how similar the words occurring in each caption are. Word2Vec relies on a model that has learned numerical representations of words, called embedding vectors; here the model is trained beforehand on the full set of captions. Each word receives a weight between 0 and 1. For example, if six words occur in one caption with weights 0.043, 0.096, 0.034, 0.084, 0.012, and 0.056, their mean, 0.054, is taken. When the captions being weighted are caption n and caption n+1, only the value for caption n+1, the one being compared, is calculated. The result may be expressed as a percentage to serve as a similarity score and is referred to as B.
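The arithmetic of the example above can be checked with a short Python sketch. How the per-word weights are read out of the Word2Vec model is not stated in the specification, so the sketch takes them as given; `score_b` is a hypothetical name.

```python
from statistics import mean


def score_b(word_weights):
    """Mean of the per-word weights of the compared caption, as a percentage."""
    return 100.0 * mean(word_weights) if word_weights else 0.0


weights = [0.043, 0.096, 0.034, 0.084, 0.012, 0.056]
b = score_b(weights)
```

The six weights sum to 0.325, so their mean is about 0.0542, which rounds to the 0.054 quoted above, or roughly 5.4 when expressed as a percentage.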

Association analysis compares similarity by measuring PMI (Pointwise Mutual Information) from the occurrence frequencies of the key words obtained as in S411. PMI measures the mutual dependence between two variables and is calculated as follows.
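The PMI equation is not reproduced in this text. Its standard definition is:

```latex
\mathrm{PMI}(x; y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```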

Here, p(x, y) is the joint distribution of X and Y, and p(x) and p(y) are the marginal distributions of X and Y, respectively. In the present invention, X and Y are the i-th and j-th captions, and x and y are words contained in the i-th and j-th captions, respectively.

Because two captions are compared, each word of one caption is compared with each word of the other, and words belonging to the same caption are not compared with each other. For example, suppose caption n has an array of five words (n(1), n(2), n(3), n(4), n(5)) and caption n+1 has an array of nine words (n+1(1), n+1(2), ..., n+1(9)). Then n(1) is compared with n+1(1), n+1(2), ..., n+1(9); n(2) is compared with n+1(1), n+1(2), ..., n+1(9); and so on in order. Once the association results for all word pairs are available, their mean is calculated. The similarity score from this association analysis is referred to as C.
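The cross-caption comparison might be sketched as follows. The specification does not say how the probabilities are estimated or how the mean PMI is turned into a percentage, so this sketch makes assumptions: probabilities are caption-level frequencies over a hypothetical `corpus` of word sets, pairs that never co-occur are skipped because their PMI is undefined, and the raw mean is returned without scaling.

```python
import math
from itertools import product


def pmi(x, y, corpus):
    """PMI of words x and y, with probabilities estimated as caption-level frequencies."""
    total = len(corpus)
    p_x = sum(1 for words in corpus if x in words) / total
    p_y = sum(1 for words in corpus if y in words) / total
    p_xy = sum(1 for words in corpus if x in words and y in words) / total
    if p_xy == 0:
        return None  # PMI is undefined (log 0); the caller skips the pair
    return math.log(p_xy / (p_x * p_y))


def mean_cross_pmi(words_n, words_n1, corpus):
    """Average PMI over every (word of caption n, word of caption n+1) pair."""
    values = [pmi(x, y, corpus) for x, y in product(words_n, words_n1)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else 0.0


# Hypothetical corpus: each entry is the word set of one refined caption.
corpus = [
    {"dog", "ball"},
    {"dog", "ball"},
    {"park", "tree"},
    {"rain"},
]
cross_pairs = list(product(["dog", "park"], ["ball", "tree", "rain"]))
c_raw = mean_cross_pmi(["dog", "park"], ["ball", "tree", "rain"], corpus)
```

Only pairs that span the two captions are formed, so a five-word caption and a nine-word caption give 5 × 9 = 45 comparisons, as in the example above.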

The similarity scores from the three analyses above are then used to calculate the average similarity score, which may be defined as (average similarity score) = A+(B+C)/2.

The section determination engine 30 or the caption reliability analysis module 32 may determine whether the average similarity score is at or below (or strictly below) a predefined reference score (e.g., 50%), change a count value according to the result, and check whether the count value has reached a preset reference value (the value c mentioned above, e.g., c=5) (S415). For example, if the average similarity score is at or below the reference score, the count value is incremented by 1; otherwise it is left unchanged. The count value starts at 0 and may be reset when a temporary semantic section (scene section) is stored.

The count value may be checked after the similarity of each caption pair is calculated. Although not shown in the figure, S414 and S415 are performed individually and sequentially for each of the K-1 caption pairs. For example, the section determination engine 30 or the caption reliability analysis module 32 analyzes the similarity of caption n and caption n+1 and calculates the average similarity score (S414), determines whether that score is at or below the predefined score (e.g., 50%) and checks the count value (S415), and, if the count value has not reached c, repeats S414 to S415 for caption n+1 and caption n+2.

If the count value has not reached the preset reference value, S413 to S415 described above may be repeated for K captions starting from the caption with the second sequence number. This repetition may continue until the count value reaches the preset reference value, with the sequence number of the first caption subject to similarity analysis incremented by 1 each time (S417). That is, similarity analysis is then performed on the K captions from caption n+1 to caption n+K.

Even if the count value has not reached the preset reference value, the procedure may proceed to S418 if the K captions analyzed in S413 and S414 include the last of the target captions (S416). That is, it may be determined whether the sequence number of the first of the K captions exceeds P (the total number of captions being P+K). In other words, if the last of the K captions (caption n+K-1) is the final caption, the procedure proceeds to S418.

When the count value reaches the preset reference value, the section determination engine 30 or the caption reliability analysis module 32 may set the temporally later caption of the pair used in the most recent similarity score calculation as the last caption (n') of the temporary scene section (S418).
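One possible reading of steps S412 to S418 is sketched below in Python. It is an interpretation, not the implementation of the specification: `similarity(i, j)` is a caller-supplied function returning the average similarity score of captions i and j as a percentage, the pairs in each window are anchored on the first caption of the window, the count is carried across windows until a section is closed, and the later duration check is left out.

```python
def find_temporary_scenes(similarity, total, k, c, threshold=50.0):
    """Return (first, last) caption numbers of each temporary scene section."""
    scenes = []
    n = i = 1          # S412: analysis start and scene start, both 1-based
    count = 0
    while n <= total:
        boundary = None
        # S413/S414: pairs (n, n+1) ... (n, n+K-1), clipped at the last caption
        for j in range(n + 1, min(n + k, total + 1)):
            if similarity(n, j) <= threshold:
                count += 1                 # S415: one more low-similarity pair
                if count >= c:
                    boundary = j           # S418: later caption of the last pair
                    break
        if boundary is None and n + k - 1 >= total:
            boundary = total               # S416: window reached the final caption
        if boundary is None:
            n += 1                         # S417: slide the window by one caption
            continue
        scenes.append((i, boundary))
        count = 0                          # the count restarts for the next section
        n = i = boundary + 1               # continue after the last caption n'
    return scenes


def toy_similarity(a, b):
    """Hypothetical scores: captions 1-4 share one topic, captions 5-8 another."""
    return 90.0 if (a <= 4) == (b <= 4) else 10.0


scenes = find_temporary_scenes(toy_similarity, total=8, k=3, c=2)
```

With these toy scores the sketch returns `[(1, 5), (6, 8)]`: the first section closes one caption after the topic change, because the boundary is placed on the later caption of the pair that brings the count up to c.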

Then, the section determination engine 30 or the caption reliability analysis module 32 may determine, from the first and last captions of the temporary scene section, whether the corresponding duration is at least (or exceeds) a preset length (S419).

The section determination engine 30 or the caption reliability analysis module 32 may store the information on the temporary scene section if the duration is equal to or greater than the preset length (S420). When stored, the temporary scene section may be matched to the shot sections described above using the time information of its first and last captions. A caption usually begins at the first frame of a shot section or at some later frame, so the temporary scene section is unlikely to coincide with the start or end of the previously divided shot sections. Because the ultimate goal of the caption similarity analysis is to divide scene sections, and the shot sections they contain, accurately, the temporary scene section needs to be made up of whole shot sections, and so it must be matched to shot sections. For example, if the temporary scene section spans 00:03 to 02:30 within a video, the shot sections actually involved may span 00:01 to 02:55 (shot sections 2 to 40). Following the match, the temporary scene section may be stored as shot sections 2 to 40.
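The matching to shot sections can be illustrated with a small Python sketch that snaps a time range to every shot section it overlaps. The overlap rule and the shot boundaries below are assumptions chosen to mirror the 00:03 to 02:30 example with far fewer shots; times are in seconds.

```python
def match_to_shots(scene_start, scene_end, shots):
    """Return (first, last) 1-based shot numbers that overlap the scene's time range."""
    overlapping = [
        idx
        for idx, (shot_start, shot_end) in enumerate(shots, start=1)
        if scene_end > shot_start and shot_end > scene_start
    ]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]


# Hypothetical shot boundaries as (start, end) pairs.
shots = [(0, 1), (1, 40), (40, 95), (95, 175), (175, 200)]
span = match_to_shots(3, 150, shots)  # temporary scene from 00:03 to 02:30
```

Here the scene is widened to shots 2 to 4, that is, from 00:01 to 02:55, so that it is made up of whole shot sections.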

Then, to analyze the remaining captions, the section determination engine 30 or the caption reliability analysis module 32 may update the sequence number for the caption similarity analysis to n'+1 (the caption after the last caption) and also update the sequence number of the first caption of the temporary scene section to n'+1 (S421).

Alternatively, if the duration is shorter than the preset length, the section determination engine 30 or the caption reliability analysis module 32 does not store the information on the temporary scene section. Then, to analyze the remaining captions, it may update the sequence number of the first caption for the caption similarity analysis to n'+1 (the caption after the last caption) and also update the sequence number of the first caption of the temporary scene section to n'+1 (S421).

Then, the section determination engine 30 or the caption reliability analysis module 32 may check whether the K captions starting at the updated sequence number go beyond the range of target captions (S422). That is, it is determined whether further caption similarity analysis is needed (by checking whether n>Q, where Q=P+K is the total number of captions); if so, the procedure proceeds to S413, and otherwise it may end.

Under the temporary scene section determination scheme described above, the temporary scene section may be divided into one or more individual temporary scene sections.

Next, the method of inspecting the extracted temporary scene sections using image information and determining the final sections is described.

FIG. 5 is a flowchart of the method for determining a semantic section using image information.

The section determination engine 30 or the caption reliability analysis module 32 may refine the data for the semantic sections temporarily stored using the audio (caption) information described above, that is, the temporary scene sections (S511). Specifically, it may store the time information, metadata, position, size, label, reliability, and so on for every shot section of the temporary scene sections as a list of dictionaries.

The temporary scene sections may be managed with sequence numbers assigned in chronological order.

Then, the sequence number (n) of the first scene section subject to the image information similarity analysis described below may be initialized (S512). The initial value is 1. The number of shot sections in each temporary scene section may also be set.

The section determination engine 30 or the caption reliability analysis module 32 may select the first L shot sections of the scene section and set them as the reference shot sections (S513). That is, the first L shot sections of each temporary scene section serve as the reference for the image information similarity analysis described below.

The section determination engine 30 or the caption reliability analysis module 32 compares the reference shot sections with the shot section that follows them and goes through every shot section of the temporary scene section obtained from the audio (caption) information. It checks whether the types of persons or objects are similar to at least (or more than) a preset degree and whether the sizes and positions of those persons or objects are similar to at least (or more than) a preset degree, and thereby finally determines the scene section.

More specifically, the image information similarity analysis may use i) person or object information (persons or objects identifiable by name, designation, type, and so on) and ii) position and size information of the persons or objects.

The section determination engine 30 or the caption reliability analysis module 32 analyzes the similarity between the reference shot sections and the next shot section and calculates a similarity score, for which the person or object information may be used (S514).

To this end, the object or person information of each reference shot section may be stored as an array, and the object or person information of each reference shot section may be compared with that of the remaining shot sections. That is, if the reference consists of L shot sections with sequence numbers n, n+1, ..., n+L-1, the next remaining shot section has sequence number n+L, and the image information similarity analysis compares shot sections n and n+L, n+1 and n+L, ..., n+L-1 and n+L.
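A minimal Python sketch of the comparison in S514 follows. The specification does not define how person or object information is scored, so the Jaccard overlap of label sets used here is an assumption, as are the function names, the sample labels, and the 70% threshold taken from the example value of R1.

```python
def jaccard_percent(labels_a, labels_b):
    """Overlap of two label sets as a percentage (an assumed similarity measure)."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)


def reference_scores(shot_labels, n, num_ref):
    """Scores for the pairs (n, n+L), (n+1, n+L), ..., (n+L-1, n+L); shots are 1-based."""
    target = shot_labels[n + num_ref - 1]      # labels of shot n+L
    return [
        jaccard_percent(shot_labels[ref - 1], target)
        for ref in range(n, n + num_ref)
    ]


shot_labels = [
    ["anchor", "desk"],             # shot 1 (reference)
    ["anchor", "desk", "screen"],   # shot 2 (reference)
    ["anchor", "desk"],             # shot 3, compared against the references
]
scores = reference_scores(shot_labels, n=1, num_ref=2)
any_similar = any(score >= 70.0 for score in scores)
```

Each reference shot is compared with the same following shot, and the following shot is treated as similar if at least one reference shot reaches the threshold.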

The section determination engine 30 or the caption reliability analysis module 32 may determine whether there is a shot section whose similarity score is at least (or exceeds) a preset reference score R1 (e.g., 70%) (S515). If there is, the section determination engine 30 or the caption reliability analysis module 32 may analyze similarity and calculate a similarity score between each reference shot section and the remaining shot section based on the position and size information of the persons or objects (S516).

The section determination engine 30 or the caption reliability analysis module 32 may determine whether there is a shot section whose similarity score is at least (or exceeds) a preset reference score R2 (e.g., 70%) (S517).

If there is a shot section whose similarity score is at least (or exceeds) the preset reference score, the similarity analysis may be performed on the next shot section (M+1=L+2). Accordingly, M is set to M+1 (S518). If no shot section remains to be analyzed in temporary scene section n (S519), the section determination engine 30 or the caption reliability analysis module 32 may check whether any temporary scene section remains to be analyzed (S520). For example, if there are T temporary scene sections in total, it may be determined whether n is equal to or greater than T. If a temporary scene section remains, the section determination engine 30 or the caption reliability analysis module 32 may repeat S513 to S519, or S522 to S523, for the next temporary scene section (n+1). That is, n is set to n+1 and P to P_n (S521), and the procedure proceeds to S513.

If, in S516 and S517, no shot section has a similarity score that is at least (or exceeds) the preset reference score, the section determination engine 30 or the caption reliability analysis module 32 may split the scene section at the next shot section (S522). With the split, the number of temporary scene sections (T) may be incremented by 1 (S523). Because one temporary scene section has been split in two, the number of shot sections (P_n) of the temporary scene section must also be updated: P_n=L and P_(n+1)=P_n-L.

Then, the section determination engine 30 or the caption reliability analysis module 32 may check whether any temporary scene section remains to be analyzed (S520). If one remains, the section determination engine 30 or the caption reliability analysis module 32 may repeat S513 to S519, or S522 to S523, for the next temporary scene section (n+1). That is, n is set to n+1 and P to P_n (S521), and the procedure proceeds to S513.
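The bookkeeping of S522 and S523 can be illustrated as follows. This is a sketch under the assumption that the shot counts P_1, ..., P_T are kept in a plain list and that the value P_n on the right-hand side of P_(n+1)=P_n-L is the count before the split; `split_scene` is a hypothetical name.

```python
def split_scene(shot_counts, n, num_ref):
    """Split temporary scene n (1-based) so that P_n = L and P_(n+1) = old P_n - L."""
    old = shot_counts[n - 1]
    if num_ref not in range(1, old):
        raise ValueError("the split point must fall inside the scene")
    # The scene after the split point is inserted right behind scene n.
    return shot_counts[: n - 1] + [num_ref, old - num_ref] + shot_counts[n:]


counts = [12, 7]                                  # P_1 = 12, P_2 = 7, so T = 2
updated = split_scene(counts, n=1, num_ref=3)     # L = 3
```

After the split the list holds one more entry, so T grows by 1 while the total number of shot sections is unchanged.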

When no temporary scene section or shot section remains to be analyzed, the section determination engine 30 or the caption reliability analysis module 32 may store the information held as temporary scene sections as the final scene sections (S524). The final values of T, P_n, and so on then describe the composition of the final scene sections.

이상의 명세서에서, "시스템" 또는 그에 속한 "엔진" 또는 "모듈" 등이 해당 방법 또는 절차 등을 수행하는 것으로 설명하였으나, "시스템", "엔진" 또는 "모듈"은 명칭일 뿐 권리범위가 그에 종속되는 것은 아니다. 즉, 시스템, 엔진, 모듈 외에도 장치 또는 그에 속한 구성으로서도 해당 방법 또는 절차가 수행될 수 있으며, 그 뿐만 아니라 비디오 메타데이터 검수 또는 편집을 위한 소프트웨어 또는 컴퓨터 또는 그밖의 기계, 장치 등으로 판독가능한 코드에 의해 상기 방법 또는 방식이 수행될 수 있다. In the above specification, "system" or "engine" or "module" belonging thereto has been described as performing the method or procedure, etc., but "system", "engine" or "module" is only a name and the scope of rights is limited thereto. not dependent. That is, in addition to the system, engine, and module, the method or procedure may be performed as a device or a component belonging thereto, and as well as software for reviewing or editing video metadata, or code readable by a computer or other machine, device, etc. The method or manner may be performed by

In addition, as another aspect of the present invention, the proposals or operations described above may be provided as code that can be implemented, carried out, or executed by a "computer" (a comprehensive concept that includes a system on chip (SoC), a (micro)processor, and the like), or as a computer-readable storage medium or computer program product that stores or includes the code. The scope of the present invention extends to the code and to a computer-readable storage medium or computer program product that stores or includes the code.

Claims

A system for extracting a semantic section, comprising:
a video segmentation engine including a video registration module and a video segmentation module, for registering or dividing a video for which metadata is to be generated;
a metadata auto tagging engine for automatically generating metadata of the registered or divided video; and
a section determination engine for extracting a semantic section from the sections of the divided video,
wherein the section determination engine stores a set consisting of a preset number of continuous pieces of caption information as one or more temporary semantic sections by calculating a similarity score based on the caption information,
wherein each stored temporary semantic section consists of one or more shot sections,
wherein, to calculate the similarity score based on the caption information, the section determination engine:
sets start caption information n for the similarity score calculation in the set;
for K sequential pieces of caption information n to n+K-1 in the set, sequentially calculates the similarity score between caption information n and each of caption information n+1 to n+K-1, incrementing n by 1, until the number of pairs of caption information having a similarity score equal to or less than a preset score reaches a predetermined number;
determines the temporally trailing caption information of the last pair of caption information for which the similarity score is calculated as end caption information n' of the similarity score calculation; and
if the length of the caption information section defined by the start caption information and the end caption information is equal to or greater than a predetermined time length, matches the caption information section to the shot sections and stores it as the temporary semantic section,
where K is a predetermined constant and n is an integer greater than 1 that has an upper bound.

The semantic section extraction system according to claim 1, wherein, when caption information that is not stored as the temporary semantic section remains in the set, the section determination engine performs the similarity score calculation on the remaining caption information to attempt to store a next temporary semantic section.

The semantic section extraction system according to any one of claims 1 to 2, wherein, even if the number of pairs of caption information having a similarity score equal to or less than the preset score is smaller than the predetermined number, when no caption information remains for calculating a similarity score, the section determination engine:
determines the temporally trailing caption information of the last pair of caption information for which the similarity score is calculated as end caption information of a temporary semantic section; and
if the length of the caption information section defined by the start caption information and the end caption information is equal to or greater than the predetermined time length, matches the caption information section to the shot sections and stores it as the temporary semantic section.

The semantic section extraction system according to any one of claims 1 to 2, wherein a plurality of analysis methods are used to calculate the similarity score.

The semantic section extraction system according to claim 1, wherein, if the length of the caption information section defined by the start caption information and the end caption information is less than the predetermined time length, the caption information section is not stored as a temporary semantic section.

The semantic section extraction system according to any one of claims 1 to 2, wherein the section determination engine, for each of the stored temporary semantic sections, calculates a similarity score based on image information of the corresponding temporary semantic section and either stores the temporary semantic section as a final semantic section or further divides the temporary semantic section and stores the divided sections as final semantic sections.

The semantic section extraction system according to claim 6, wherein the section determination engine sets the temporally earliest L shot sections of the corresponding temporary semantic section as a reference section and determines the similarity between the reference section and the remaining shot sections.

The semantic section extraction system according to claim 7, wherein the calculation of the similarity score comprises determining the degree of similarity of person or object information, or of position and size information of a person or object, between each of the L shot sections and the remaining shot sections.

8. The method of claim 7,
wherein, if all of the similarity scores of person or object information between each of the L shot sections and the remaining shot sections are less than a preset reference score, and all of the similarity scores of the position and size information of the person or object between each of the L shot sections and the remaining shot sections are less than the preset reference score, the stored temporary semantic section is divided into two, and the starting point at which the stored temporary semantic section is divided is the remaining shot section.

A method for extracting a semantic section, comprising:
registering or dividing a video for which metadata is to be generated;
automatically generating caption information of the registered or divided video; and
extracting a temporary semantic section from the sections of the divided video based on the caption information,
wherein the extracting includes storing one or more temporary semantic sections by calculating a similarity score based on the caption information for a set consisting of a preset number of continuous pieces of caption information,
wherein the temporary semantic section consists of one or more shot sections,
wherein the storing of the temporary semantic section includes:
setting start caption information n for the similarity score calculation in the set;
for K sequential pieces of caption information n to n+K-1 in the set, sequentially calculating the similarity score between caption information n and each of caption information n+1 to n+K-1, incrementing n by 1, until the number of pairs of caption information having a similarity score equal to or less than a preset score reaches a predetermined number;
determining the temporally trailing caption information of the last pair of caption information for which the similarity score is calculated as end caption information n' of the similarity score calculation; and
if the length of the caption information section defined by the start caption information and the end caption information is equal to or greater than a predetermined time length, matching the caption information section to the shot sections and storing it as the temporary semantic section,
where K is a predetermined constant and n is an integer greater than 1 that has an upper bound.
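
For illustration only, the caption-based steps recited above might be sketched as below. This is one possible reading under several assumptions, and it is not part of the claim language. The function names, the `(start_sec, end_sec, text)` caption tuple, and the word-overlap `jaccard` scorer are all hypothetical; the text does not specify which analysis methods produce the similarity score. The sketch also assumes that the count of low-scoring pairs accumulates across successive values of n, and it omits the step of matching a caption section to shot sections.

```python
from typing import Callable, List, Tuple

Caption = Tuple[float, float, str]  # (start_sec, end_sec, text), hypothetical


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: word overlap between two caption texts."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def temporary_sections(
    captions: List[Caption],
    similarity: Callable[[str, str], float],
    k: int,
    preset_score: float,
    low_pairs_needed: int,
    min_length: float,
) -> List[Tuple[int, int]]:
    """Return (start_index, end_index) pairs of temporary semantic sections."""
    sections = []
    start = 0
    while start < len(captions):
        n, low, end = start, 0, start
        done = False
        # Slide n forward until enough low-scoring pairs are seen, or until
        # no caption is left to compare with (the fallback of claim 3).
        while not done and n + 1 < len(captions):
            for j in range(n + 1, min(n + k, len(captions))):
                end = j  # trailing caption of the last pair compared
                if similarity(captions[n][2], captions[j][2]) <= preset_score:
                    low += 1
                    if low >= low_pairs_needed:
                        done = True
                        break
            n += 1
        # Keep the section only if it lasts at least the minimum time length.
        if captions[end][1] - captions[start][0] >= min_length:
            sections.append((start, end))
        start = end + 1  # continue with the remaining captions (claim 2)
    return sections
```

With four two-second captions "goal kick team", "goal team score", "weather rain today", and "rain today cloud", and with `k=2`, `preset_score=0.1`, `low_pairs_needed=1`, and `min_length=3`, the sketch stores the single section `(0, 2)`: the first low-scoring pair is captions 1 and 2, its trailing caption ends the section, and the leftover caption 3 is too short to be stored.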