KR102198480B1 - Video summarization apparatus and method via recursive graph modeling

Info

Publication number: KR102198480B1
Application number: KR1020200024805A
Authority: KR
Inventors: 손광훈; 박정인
Original assignee: 연세대학교 산학협력단
Priority date: 2020-02-28
Filing date: 2020-02-28
Publication date: 2021-01-05
Also published as: WO2021172674A1

Abstract

The present invention provides a video summary generation device and a method thereof. The video summary generation device regards a plurality of frames of a video image as nodes of a relationship graph, and infers semantic similarity between multiple frames using a graph convolutional neural network having a recursive reasoning structure, so as to extract an accurate key frame considering the global and long-term interrelationships between multiple frames, thereby generating a video summary in various video platforms with a relatively small number of parameters and high accuracy.

Description

Video summary generation device and method through recursive graph modeling {VIDEO SUMMARIZATION APPARATUS AND METHOD VIA RECURSIVE GRAPH MODELING}

The present invention relates to an apparatus and method for generating a video summary, and more particularly to an apparatus and method for generating a video summary through recursive graph modeling.

Recent changes in video platforms, such as online streaming services, have made it difficult for users to access the video data they want. In addition, as videos grow steadily longer, it has become even harder for users to obtain useful information from them. Research is therefore being conducted to help users browse video data efficiently. As part of this research, video summarization, which selects useful scenes from a video so that its content can be described concisely, is drawing attention as a tool that can greatly improve user convenience.

However, video summarization is difficult because useful frames must be selected in consideration of the interrelationships among all frames of the video. Accordingly, various studies are being conducted on video summarization methods that extract key frames using artificial neural networks.

Existing video summarization algorithms using artificial neural networks have mainly been implemented based on convolutional neural networks (CNNs) or recurrent neural networks (RNNs). However, artificial neural networks such as CNNs and RNNs have a small receptive field and suffer from locality, in that they cannot represent descriptors globally, so they are not suitable for reflecting the long-term dependency of a video.

Korean Patent Publication No. 10-2019-0099027 (published on August 23, 2019)

An object of the present invention is to provide a video summary generation apparatus and method capable of accurately extracting key frames for a video summary by regarding a plurality of frames of a video as nodes of a relationship graph and applying them to a graph convolutional neural network.

Another object of the present invention is to provide a video summary generation apparatus and method capable of extracting key frames that take into account the global and long-term interrelationships among a plurality of frames by recursively inferring the semantic similarity among the frames.

A video summary generation apparatus according to an embodiment of the present invention for achieving the above object includes: an initial graph generation unit that receives an input video composed of a plurality of frames, generates an initial feature graph by regarding a plurality of feature maps, each extracted from one of the plurality of frames according to a pre-learned pattern estimation method, as node vectors, and generates an initial adjacency matrix by calculating the similarity between the node vectors; a graph correlation unit that uses a plurality of graph convolution modules in which a pattern estimation method has been learned in advance to extract a correction graph by correcting the pattern between the nodes of the initial feature graph or of the feature graph, based on the pattern between the initial feature graph and the initial adjacency matrix or between a feature graph and an adjacency matrix that are repeatedly fed back; a recursive graph acquisition unit that receives the correction graph, weights it with different predetermined weight functions, and obtains the adjacency matrix and the feature graph for extracting the next correction graph based on the previous adjacency matrix and a correction similarity indicating the similarity between the two weighted correction graphs; and a key frame extraction unit that uses a separate graph convolution network in which a pattern estimation method has been learned in advance to select a plurality of key frames by estimating, from the final correction graph and the final adjacency matrix obtained after a predetermined number of recursive iterations, the probability that each of the plurality of frames is a key frame according to the semantic similarity between the nodes of the final correction graph.

The initial graph generation unit may include: a feature map acquisition unit that generates the plurality of feature maps by extracting features of each of the plurality of frames of the input video according to a pre-learned pattern estimation method; an initial graph acquisition unit that obtains the initial feature graph by regarding each of the plurality of feature maps as a node of a graph composed of a plurality of nodes and a plurality of edges connecting the nodes to one another; and an initial adjacency matrix acquisition unit that obtains the initial adjacency matrix by calculating the similarity between the plurality of nodes of the initial feature graph.

The initial graph acquisition unit may obtain the initial feature graph by converting each of the plurality of feature maps into a one-dimensional node vector.

The initial adjacency matrix acquisition unit may obtain the initial adjacency matrix (A⁰) by calculating the similarity between the node vectors (x₁, …, x_T) of the initial feature graph (X⁰) according to an equation in which a_ij denotes an element of the initial adjacency matrix (A⁰), x_i and x_j denote the i-th and j-th node vectors among the plurality of node vectors (x₁, …, x_T) of the initial feature graph (X⁰), the superscript T denotes the transpose, and ∥·∥₂ denotes the L2 norm.

The initial graph generation unit may further include a frame selection unit that extracts frames from the plurality of frames of the input video at predetermined time intervals and passes them to the feature map acquisition unit.

In the graph correlation unit, the plurality of graph convolution modules (graph correlation modules) may be connected sequentially in series. The first-stage graph correlation module receives the initial feature graph or a fed-back feature graph together with the initial adjacency matrix or a fed-back adjacency matrix and estimates the pattern to extract a first intermediate feature graph. Each of the remaining graph correlation modules receives the intermediate feature graph extracted by the preceding stage together with the adjacency matrix applied to the first-stage module and estimates the pattern to extract an intermediate feature graph or the correction graph.

The recursive graph acquisition unit may include: a projection unit that receives the previously obtained correction graph and obtains two projected correction graphs by weighting the correction graph with different predetermined weight functions so that it is projected onto different linear embedding spaces; an adjacent correction value acquisition unit that obtains the correction similarity by calculating the similarity between the two projected correction graphs; and an adjacency matrix acquisition unit that obtains the adjacency matrix by adding the obtained correction similarity to the previous adjacency matrix.

The adjacent correction value acquisition unit may calculate the correction similarity (dA^k) according to an equation in which W_θ Z^k and W_φ Z^k denote the projected correction graphs obtained by weighting the correction graph (Z^k) with the weight functions (W_θ, W_φ), the superscript T denotes the transpose, and ∥·∥₂ denotes the L2 norm.

The key frame extraction unit may include: a correlation estimation unit that applies the final correction graph and the final adjacency matrix to a separate graph convolution network in which a pattern estimation method has been learned in advance, and extracts a key frame probability map indicating the probability that each of the plurality of frames of the input video is a key frame according to the semantic similarity pattern between the nodes of the final correction graph; and a key frame selection unit that selects a plurality of key frames by analyzing, from the key frame probability map, the probability that each of the plurality of frames of the input video is a key frame.

A video summary generation method according to another embodiment of the present invention for achieving the other object includes: receiving an input video composed of a plurality of frames, generating an initial feature graph by regarding a plurality of feature maps, each extracted from one of the plurality of frames according to a pre-learned pattern estimation method, as node vectors, and generating an initial adjacency matrix by calculating the similarity between the node vectors; extracting a correction graph, using a plurality of graph convolution modules in which a pattern estimation method has been learned in advance, by correcting the pattern between the nodes of the initial feature graph or of the feature graph, based on the pattern between the initial feature graph and the initial adjacency matrix or between a feature graph and an adjacency matrix that are repeatedly fed back; receiving the correction graph, weighting it with different predetermined weight functions, and obtaining the adjacency matrix and the feature graph for extracting the next correction graph based on the previous adjacency matrix and a correction similarity indicating the similarity between the two weighted correction graphs; and selecting a plurality of key frames, using a separate graph convolution network in which a pattern estimation method has been learned in advance, by estimating, from the final correction graph and the final adjacency matrix obtained after a predetermined number of recursive iterations, the probability that each of the plurality of frames is a key frame according to the semantic similarity between the nodes of the final correction graph.

Accordingly, the video summary generation apparatus and method according to embodiments of the present invention regard the plurality of frames of a video as nodes of a relationship graph and infer the semantic similarity among the frames using a graph convolutional neural network with a recursive inference structure, and can thereby extract accurate key frames that take into account the global and long-term interrelationships among the frames. Video summaries can therefore be readily provided to users on various video platforms with a relatively small number of parameters and high accuracy.

FIG. 1 shows a schematic structure of a video summary generation apparatus according to an embodiment of the present invention.
FIG. 2 shows a detailed configuration of the initial graph generation unit of the video summary generation apparatus of FIG. 1.
FIG. 3 shows detailed configurations of the graph correlation unit and the recursive graph acquisition unit of the video summary generation apparatus of FIG. 1.
FIG. 4 shows an example of key frames extracted according to the number of recursive iterations.
FIG. 5 shows a video summary generation method according to an embodiment of the present invention.
FIG. 6 shows a comparison between video summaries generated by the video summary generation method of the present embodiment and by existing video summary generation methods.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in them.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. To describe the present invention clearly, parts irrelevant to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "...er", "module", and "block" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 shows a schematic structure of a video summary generation apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the video summary generation apparatus according to the present embodiment may include an initial graph generation unit 100, a graph correlation unit 200, a recursive graph acquisition unit 300, and a key frame extraction unit 400.

The initial graph generation unit 100 receives an input video (I) composed of a plurality of frames, extracts features of each frame to generate a plurality of feature maps (X = {x₁, …, x_T}), regards each of the generated feature maps (x₁, …, x_T) as a node vector to generate an initial feature graph (X⁰), and calculates the affinity between the node vectors of the initial feature graph (X⁰) to generate an initial adjacency matrix (A⁰). Here, the initial adjacency matrix (A⁰) may be obtained with a size of T × T.

The graph correlation unit 200 receives either the initial feature graph (X⁰) and the initial adjacency matrix (A⁰) generated by the initial graph generation unit 100, or the previously obtained feature graph (X^(k-1)) and the adjacency matrix (A^(k-1)) obtained by the recursive graph acquisition unit 300. Following a pattern estimation method that uses a weight matrix (W) obtained in advance through learning, it corrects the pattern between the nodes of the initial feature graph (X⁰) or the feature graph (X^(k-1)) based on the pattern between the initial feature graph (X⁰) and the initial adjacency matrix (A⁰), or between the feature graph (X^(k-1)) and the adjacency matrix (A^(k-1)), and thereby extracts a correction graph (Z^k).

The recursive graph acquisition unit 300 receives the correction graph (Z^k) extracted by the graph correlation unit 200, weights it with predetermined weight functions (W_θ, W_φ) to project it onto different linear embedding spaces, and obtains the adjacency matrix (A^k) using the previous adjacency matrix (A^(k-1)) and the correction similarity (dA^k) between the two correction graphs (W_θ Z^k, W_φ Z^k) projected onto the different linear embedding spaces. It then passes the already obtained correction graph (Z^k) as the feature graph (X^k = Z^k), together with the adjacency matrix (A^k), to the graph correlation unit 200 so that the graph correlation unit 200 can extract the next correction graph (Z^(k+1)).
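The adjacency update described above can be sketched as follows. This is only an illustration under stated assumptions: each node is stored as a row of Z^k (so the projection W_θ Z^k is written here as a row-wise product `z @ w`), the correction similarity is taken to be a cosine similarity between projected node vectors (suggested by the transpose and L2-norm symbols the specification defines, but not confirmed by an equation in this text), and the helper names `matmul`, `cosine`, and `update_adjacency` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors (assumed form of the similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def update_adjacency(prev_adj, z, w_theta, w_phi):
    """Return (A^k, dA^k) from A^(k-1) and the correction graph Z^k.

    z has one row per node; w_theta and w_phi are the two projection
    matrices (assumed layout: input dimension x embedding dimension).
    """
    p = matmul(z, w_theta)  # projection into the first embedding space
    q = matmul(z, w_phi)    # projection into the second embedding space
    delta = [[cosine(p_i, q_j) for q_j in q] for p_i in p]  # dA^k
    new_adj = [[a + d for a, d in zip(a_row, d_row)]
               for a_row, d_row in zip(prev_adj, delta)]    # A^k = A^(k-1) + dA^k
    return new_adj, delta


# Two nodes with orthogonal features and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
adj, delta = update_adjacency([[0.0, 0.0], [0.0, 0.0]], identity, identity, identity)
print(adj)
```

Because the two projections differ in general, dA^k need not be symmetric, which is one reason a learned update can express relations that the initial symmetric similarity cannot.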

The key frame extraction unit 400 receives the final correction graph (Z^K) and the final adjacency matrix (A^K) obtained after the graph correlation unit 200 and the recursive graph acquisition unit 300 have recursed a predetermined number of times. Following a pre-learned pattern estimation method, it estimates the semantic similarity pattern between the nodes of the final correction graph (Z^K) to extract a key frame probability map (Y) indicating the likelihood that each of the plurality of frames of the input video (I) is a key frame (F_key), and selects key frames (F_key) according to the extracted key frame probability map (Y).
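The specification does not state here how key frames are chosen from the probability map (Y). As one possible illustration, the sketch below keeps the highest-scoring frames and returns them in temporal order. The top-k rule, the function name `select_key_frames`, and the parameter `max_key_frames` are assumptions made for this example only.

```python
def select_key_frames(probabilities, max_key_frames):
    """Pick the frames with the highest key frame probability.

    probabilities: one score per frame (the key frame probability map Y).
    max_key_frames: hypothetical budget for the summary length.
    Returns frame indices in temporal order.
    """
    if max_key_frames < 0:
        raise ValueError("max_key_frames must be non-negative")
    # Rank frame indices by descending probability (ties keep temporal order).
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    # Restore temporal order so the summary plays in sequence.
    return sorted(ranked[:max_key_frames])


print(select_key_frames([0.1, 0.9, 0.4, 0.8], max_key_frames=2))
```

A fixed probability threshold would be an equally plausible rule; the top-k form simply makes the summary length explicit.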

That is, the video summary generation apparatus according to the present embodiment regards the feature map extracted from each of the plurality of frames of the input video (I) as a node and generates an initial feature graph (X⁰) composed of a plurality of nodes and an initial adjacency matrix (A⁰) representing the similarity between the nodes. From the generated initial feature graph (X⁰) and initial adjacency matrix (A⁰), it recursively and repeatedly extracts, a predetermined number of times, a corrected feature graph (Z^k), in which the features of each node are compensated based on the similarities in the adjacency matrix, and an adjacency matrix (A^k). After obtaining the final corrected feature graph (Z^K) and the final adjacency matrix (A^K), it estimates from them, according to the semantic similarity pattern, the probability that each of the plurality of frames of the input video (I) is a key frame (F_key).

Key frames with high semantic similarity can thus be selected accurately based on the final corrected feature graph (Z^K) and the final adjacency matrix (A^K), in which the correlations between the plurality of frames of the input video (I) have been corrected.
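The recursion between the graph correlation unit 200 and the recursive graph acquisition unit 300 can be summarized by a small driver loop. In this minimal sketch, the two units are passed in as callables so that only the control flow is shown; the names `recursive_refinement`, `correlate`, and `update_adjacency` are hypothetical, and the toy scalar callables in the usage lines merely stand in for the learned networks.

```python
def recursive_refinement(x0, a0, correlate, update_adjacency, num_iterations):
    """Run K rounds of graph correction and adjacency update.

    correlate(x, a)        -> Z^k   (role of the graph correlation unit)
    update_adjacency(a, z) -> A^k   (role of the recursive graph acquisition unit)
    Returns the final correction graph Z^K and final adjacency matrix A^K.
    """
    x, a = x0, a0
    z = x0
    for _ in range(num_iterations):
        z = correlate(x, a)          # Z^k from X^(k-1) and A^(k-1)
        a = update_adjacency(a, z)   # A^k from A^(k-1) and Z^k
        x = z                        # feed back: X^k = Z^k
    return z, a


# Toy stand-ins (scalars instead of graphs) to show the data flow only.
final_z, final_a = recursive_refinement(
    x0=0, a0=1,
    correlate=lambda x, a: x + a,
    update_adjacency=lambda a, z: a + 1,
    num_iterations=3,
)
print(final_z, final_a)
```

The same weights are reused in every round in this sketch, which is consistent with the specification's remark about a relatively small number of parameters, although weight sharing across rounds is an inference rather than something stated in this passage.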

Each component of the video summary generation apparatus of FIG. 1 is described in detail below.

FIG. 2 shows a detailed configuration of the initial graph generation unit of the video summary generation apparatus of FIG. 1.

Referring to FIG. 2, the initial graph generation unit 100 may include a frame selection unit 110, a feature map acquisition unit 120, an initial graph acquisition unit 130, and an initial adjacency matrix acquisition unit 140.

The frame selection unit 110 receives the input video (I) containing a plurality of frames, selects frames at a predetermined time interval (for example, 1 second), and passes them to the feature map acquisition unit 120. Video generally consists of 30 to 60 frames per second, and within a short time span most frames are very similar to one another. Analyzing the semantic similarity among all frames of the input video (I) in order to select key frames for a video summary is therefore inefficient.

To improve the efficiency of the video summary generation apparatus, the frame selection unit 110 therefore selects frames from the input video (I) at the predetermined time interval and passes them to the feature map acquisition unit 120.

This is done to improve efficiency, given that video summarization generally does not require key frames to be selected with frame-level precision. When key frames must be extracted very precisely, the frames need not be subsampled; that is, the frame selection unit 110 may be omitted.
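The interval-based selection performed by the frame selection unit 110 can be illustrated with a short sketch. The function name `select_frames` and its parameters are hypothetical, and taking the first frame of every interval is an assumption; the specification only says that frames are selected at a predetermined time interval.

```python
def select_frames(num_frames, fps, interval_seconds=1.0):
    """Return the indices of frames sampled once per time interval.

    num_frames: total number of frames in the input video.
    fps: frame rate of the input video.
    interval_seconds: sampling interval (1 second in the specification's example).
    """
    if fps <= 0 or interval_seconds <= 0:
        raise ValueError("fps and interval_seconds must be positive")
    # Number of source frames between two consecutive selected frames.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, num_frames, step))


# A 10-second clip at 30 frames per second yields 10 selected frames.
indices = select_frames(num_frames=300, fps=30.0)
print(len(indices), indices[:3])
```

Setting `interval_seconds` so that the step is 1 keeps every frame, which corresponds to omitting the frame selection unit.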

The feature map acquisition unit 120 is implemented as an artificial neural network in which a pattern estimation method has been learned in advance. It receives the input video (I), or the input video whose frames have been selected by the frame selection unit 110, and extracts the features of each frame according to the pre-learned pattern estimation method to obtain a plurality of feature maps (X = {x₁, …, x_T}).

The feature map acquisition unit 120 may be implemented with various artificial neural networks. Since various artificial neural networks that extract feature maps (x₁, …, x_T) from frame images are already publicly known, it may be implemented using one of them, for example a pre-trained convolutional neural network (CNN).

The initial graph acquisition unit 130 obtains the initial feature graph (X⁰) by regarding each of the plurality of feature maps (X) as a node vector. For example, the initial graph acquisition unit 130 may obtain the initial feature graph (X⁰) by converting the plurality of feature maps (X), which the feature map acquisition unit 120 produces in a predetermined two-dimensional or three-dimensional size, into one-dimensional node vectors.

By regarding each of the plurality of feature maps (X) as a node of a graph composed of a plurality of nodes and a plurality of edges connecting them, the initial graph acquisition unit 130 obtains an initial feature graph (X⁰) suited to the graph correlation unit 200, which is built from graph convolution networks (GCNs). The initial graph acquisition unit 130 may also be configured to obtain the initial feature graph (X⁰) by converting each of the feature maps (X) into a node vector having a predetermined format. For example, when each of the feature maps (X) is obtained as a two-dimensional array of size T × D, the initial graph acquisition unit 130 may convert it into a one-dimensional vector of length T × D.
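Converting feature maps into node vectors amounts to flattening each map. The following sketch assumes feature maps are given as nested Python lists, and the names `flatten_feature_map` and `build_initial_feature_graph` are hypothetical; a real implementation would more likely reshape tensors produced by the feature extractor.

```python
def flatten_feature_map(feature_map):
    """Flatten a nested list (2-D or 3-D feature map) into a 1-D node vector."""
    if not isinstance(feature_map, (list, tuple)):
        return [float(feature_map)]
    flat = []
    for item in feature_map:
        flat.extend(flatten_feature_map(item))
    return flat


def build_initial_feature_graph(feature_maps):
    """Turn one feature map per frame into one node vector per frame."""
    nodes = [flatten_feature_map(fm) for fm in feature_maps]
    # All node vectors must share one length so they can form a graph signal.
    if len({len(node) for node in nodes}) > 1:
        raise ValueError("all feature maps must flatten to the same length")
    return nodes


# Two frames, each with a 2 x 2 feature map, become two node vectors of length 4.
graph = build_initial_feature_graph([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(graph)
```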

The initial adjacency matrix acquisition unit 140 obtains the initial adjacency matrix (A⁰) by calculating, according to Equation 1, the similarity between the node vectors (x₁, …, x_T) of the initial feature graph (X⁰) obtained by the initial graph acquisition unit 130, that is, between the plurality of feature maps (X).

Here, a_ij denotes an element of the initial adjacency matrix (A⁰), x_i and x_j denote the i-th and j-th node vectors among the plurality of node vectors (x₁, …, x_T) of the initial feature graph (X⁰), the superscript T denotes the transpose, and ∥·∥₂ denotes the L2 norm.

In other words, the initial adjacency matrix (A⁰) can be viewed as a weight matrix of the edges that connect the node vectors (x₁, …, x_T) of the initial feature graph (X⁰) according to their similarity.
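Assuming that Equation 1 is the cosine similarity suggested by the symbols it defines (an inner product x_iᵀx_j normalized by the L2 norms of x_i and x_j), the initial adjacency matrix could be computed as in the sketch below. The cosine form is an assumption, not a confirmed reading of the equation, and the names `cosine_affinity` and `initial_adjacency` are hypothetical.

```python
import math


def cosine_affinity(x_i, x_j, eps=1e-12):
    """Assumed form of a_ij: (x_i . x_j) / (||x_i||_2 * ||x_j||_2)."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = math.sqrt(sum(a * a for a in x_i))
    norm_j = math.sqrt(sum(b * b for b in x_j))
    # eps guards against division by zero for all-zero node vectors.
    return dot / max(norm_i * norm_j, eps)


def initial_adjacency(nodes):
    """Build the T x T matrix of pairwise affinities between node vectors."""
    return [[cosine_affinity(x_i, x_j) for x_j in nodes] for x_i in nodes]


# Three frames: the first two are orthogonal, the third lies between them.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in initial_adjacency(nodes):
    print([round(value, 3) for value in row])
```

Under this reading the matrix is symmetric, and each entry acts as the weight of the edge between two frames.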

초기 그래프 생성부(100)는 초기 인접 행렬 획득부(130)와 초기 인접 행렬 획득부(140)에서 생성된 초기 특징 그래프(X⁰)와 초기 인접 행렬(A⁰)을 그래프 상관부(200)로 전달한다.The initial graph generator 100 converts the initial feature graph X ⁰ and the initial adjacency matrix A ⁰ generated by the initial adjacency matrix acquisition unit 130 and the initial adjacency matrix acquisition unit 140 into a graph correlation unit 200. To pass.

도 3은 도 1의 비디오 요약 생성 장치의 그래프 상관부와 재귀 그래프 획득부의 상세 구성을 나타낸다.3 shows detailed configurations of a graph correlation unit and a recursive graph acquisition unit of the video summary generation apparatus of FIG. 1.

Referring to FIG. 3, the graph correlation unit 200 comprises a plurality of graph correlation modules 210 to 230 connected in series. Each module is implemented as a graph convolutional network (GCN) in which a pattern estimation method has been learned in advance. The modules correct the pattern estimated from the initial feature graph (X⁰) and the initial adjacency matrix (A⁰) supplied by the initial graph generator 100 and extract a correction graph (Z¹). The graph correlation unit 200 then passes the extracted correction graph (Z¹) to the recursive graph acquisition unit 300, receives the adjacency matrix (A¹) and feature graph (X¹) returned by the recursive graph acquisition unit 300, extracts the next correction graph (Z²), and repeats this process.

That is, the graph correlation unit 200 receives the previously obtained feature graph (X^(k-1)) and adjacency matrix (A^(k-1)) and extracts the correction graph (Z^k) according to the pattern estimation method of the graph convolutional network (GCN), which can be expressed as Equation 2.

Among the graph correlation modules 210 to 230, the first-stage module 210 receives the feature graph (X^(k-1)) and the adjacency matrix (A^(k-1)) and estimates the pattern to extract a first intermediate feature graph (X'). Each of the remaining modules 220 and 230 then receives the intermediate feature graph (X' or X'') extracted by the preceding stage together with the adjacency matrix (A^(k-1)) and estimates the pattern to extract a second intermediate feature graph (X'') or the correction graph (Z^k), respectively.

Here, W denotes the weights of the trained graph convolutional network (GCN), and σ denotes a nonlinear activation function such as the rectified linear unit (ReLU).
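Equation 2 is not reproduced in this text. A minimal Python sketch of three serially connected graph convolution modules follows, assuming the widely used GCN propagation rule Z = σ(A·X·W) with ReLU as σ. The helper names, the toy matrices, and the choice to give every module its own weight matrix are illustrative assumptions rather than the patented implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(matrix):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in matrix]


def gcn_layer(adjacency, features, weight):
    """One graph convolution: sigma(A X W), with sigma assumed to be ReLU."""
    return relu(matmul(matmul(adjacency, features), weight))


def graph_correlation(adjacency, features, weights):
    """Apply the modules in series; every module reuses the same adjacency matrix."""
    hidden = features
    for weight in weights:
        hidden = gcn_layer(adjacency, hidden, weight)
    return hidden


# Toy example with two nodes and a single module.
A = [[1.0, 0.5], [0.5, 1.0]]
X = [[1.0, -2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, -1.0]]
Z = graph_correlation(A, X, [W])
```

Passing a list of three weight matrices to graph_correlation would mirror the three-module arrangement described for modules 210 to 230.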

In this embodiment the graph correlation unit 200 is shown, by way of example, as including three graph correlation modules 210 to 230 connected in series. The number of graph correlation modules, however, may be adjusted based on an experimental analysis of how well the correction graph (Z^k) is extracted. In other words, the extraction performance depends on the number of modules, and the three-module configuration shown here reflects experimental results.

Meanwhile, the recursive graph acquisition unit 300 may include first and second projection units 310 and 320, an adjacent correction value acquisition unit 330, and an adjacency matrix acquisition unit 340.

The first and second projection units 310 and 320 each receive the correction graph (Z^k) obtained by the graph correlation unit 200. Each unit weights the received correction graph (Z^k) with its own predetermined weight function (W_θ or W_φ), so that the correction graph is projected onto two different linear embedding spaces, and thereby obtains a projection correction graph (W_θZ^k or W_φZ^k).

The adjacent correction value acquisition unit 330 receives the projection correction graphs (W_θZ^k, W_φZ^k) obtained by the first and second projection units 310 and 320 and obtains the correction similarity (dA^k) according to Equation 3.

Here too, the superscript T denotes the transpose and ∥·∥₂ denotes the L2 norm.
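The projection and similarity steps can be sketched in Python as follows. Equation 3 is not reproduced in this text, so the sketch assumes that the correction similarity is the cosine similarity between the two projections of each node pair, in line with the transpose and L2 norm named above. All function names and sample weights are hypothetical.

```python
import math


def project(nodes, weight):
    """Project every node vector (row) with a weight matrix: row times weight."""
    return [
        [sum(x[k] * weight[k][j] for k in range(len(x))) for j in range(len(weight[0]))]
        for x in nodes
    ]


def correction_similarity(correction_graph, w_theta, w_phi):
    """Cosine similarity between the two projections of every node pair (assumed form)."""
    p = project(correction_graph, w_theta)
    q = project(correction_graph, w_phi)
    result = []
    for p_i in p:
        row = []
        for q_j in q:
            dot = sum(a * b for a, b in zip(p_i, q_j))
            denom = math.sqrt(sum(a * a for a in p_i)) * math.sqrt(sum(b * b for b in q_j))
            # Zero-vector guard is an added safeguard, not from the source.
            row.append(dot / denom if denom > 0 else 0.0)
        result.append(row)
    return result


# Toy correction graph with two nodes; the second projection swaps the two coordinates.
Zk = [[1.0, 0.0], [0.0, 1.0]]
W_theta = [[1.0, 0.0], [0.0, 1.0]]
W_phi = [[0.0, 1.0], [1.0, 0.0]]
dA = correction_similarity(Zk, W_theta, W_phi)
```

Because the two projections use different weights, the resulting matrix is in general not symmetric, unlike a plain cosine similarity of the node vectors with themselves.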

The adjacency matrix acquisition unit 340 adds the correction similarity (dA^k) obtained by the adjacent correction value acquisition unit 330 to the previous adjacency matrix (A^(k-1)) to obtain the adjacency matrix (A^k) that the graph correlation unit 200 uses to estimate the next correction graph (Z^(k+1)).

That is, starting from the initial adjacency matrix (A⁰), the adjacency matrix acquisition unit 340 cumulatively adds each subsequently obtained correction similarity (dA^k) to obtain the adjacency matrix (A^k) for estimating the next correction graph (Z^(k+1)). Accordingly, when the recursive graph acquisition unit 300 recurses a predetermined number of times K to obtain the adjacency matrix, the final adjacency matrix (A^K) can be calculated as in Equation 4.
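The cumulative update can be written out explicitly. The derivation below only unrolls the recurrence stated above; the summation index m is introduced here for illustration, and the exact form of Equation 4 in the original is not reproduced in this text.

```latex
\begin{aligned}
A^{k} &= A^{k-1} + dA^{k} \\
      &= A^{k-2} + dA^{k-1} + dA^{k} \\
      &= A^{0} + \sum_{m=1}^{k} dA^{m}
\end{aligned}
```

Evaluating this expression at the last recursion gives the final adjacency matrix as the initial matrix plus every correction similarity accumulated up to that point.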

At this time, the adjacency matrix acquisition unit 340 may pass the already obtained correction graph (Z^k) to the graph correlation unit 200 as the next feature graph (X^k), together with the adjacency matrix (A^k).

Referring back to FIG. 1, the key frame extraction unit 400 may include a correlation estimation unit 410 and a key frame selection unit 420.

The correlation estimation unit 410 is implemented as a graph convolutional network (GCN) in which a pattern estimation method has been learned in advance. From the patterns of the final correction graph (Z^K) and the final adjacency matrix (A^K) applied to it, the unit estimates the correlation among the node vectors of the final correction graph (Z^K). According to the semantic similarity, it then extracts a key frame probability map (Y) composed of probabilities indicating how likely each frame of the input video (I), represented by each node vector, is to be a key frame (F_key).

The correlation estimation unit 410 may extract the key frame probability map (Y) according to the pattern estimation method of the graph convolutional network (GCN) expressed by Equation 5.

Here, W_c denotes the weights of the trained graph convolutional network (GCN), and σ denotes the activation function.

The key frame probability map (Y) may be obtained with a size of T × 2, where T is the number of frames, so that for the frame corresponding to each node vector it expresses both the probability that the frame is a key frame and the probability that it is not.

The key frame selection unit 420 then uses the key frame probability map (Y) to pick, in a predetermined manner, frames of the input video (I) that have a high probability of being a key frame and a low probability of not being one, and selects them as key frames (F_key).

For example, the key frame selection unit 420 may select a predetermined number of key frames (F_key), or may select as key frames (F_key) those frames whose probability of being a key frame is at or above a predetermined first threshold or whose probability of not being a key frame is at or below a predetermined second threshold.
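The two selection strategies can be sketched in Python as follows, assuming the T × 2 probability map stores, for each frame, the key-frame probability first and the non-key-frame probability second. The function names, column order, and sample values are illustrative assumptions.

```python
def select_top_n(prob_map, n):
    """Pick the n frames with the highest key-frame probability, in temporal order."""
    ranked = sorted(range(len(prob_map)), key=lambda t: prob_map[t][0], reverse=True)
    return sorted(ranked[:n])


def select_by_threshold(prob_map, first_threshold=None, second_threshold=None):
    """Pick frames whose key-frame probability is at or above the first threshold,
    or whose non-key-frame probability is at or below the second threshold."""
    selected = []
    for t, (p_key, p_not_key) in enumerate(prob_map):
        if first_threshold is not None and p_key >= first_threshold:
            selected.append(t)
        elif second_threshold is not None and p_not_key <= second_threshold:
            selected.append(t)
    return selected


# Hypothetical probability map for four frames: [P(key frame), P(not key frame)].
Y = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.75, 0.25]]
top_two = select_top_n(Y, 2)
above_first = select_by_threshold(Y, first_threshold=0.7)
below_second = select_by_threshold(Y, second_threshold=0.4)
```

A fixed count keeps the summary length predictable, while thresholds let the summary length vary with the content of the video.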

Meanwhile, the graph correlation unit 200 and the key frame extraction unit 400, both of which are implemented with graph convolutional networks (GCNs), need to be trained in advance. The video summary generation apparatus according to this embodiment may therefore further include a learning unit (not shown) for training the graph correlation unit 200 and the key frame extraction unit 400.

The learning unit may train the graph correlation unit 200 and the key frame extraction unit 400 so as to minimize the loss calculated from a plurality of predetermined loss functions.

In this embodiment, the learning unit may perform training by either supervised learning or unsupervised learning.

First, when training is supervised, the learning unit obtains, as training data, videos for which a ground-truth key frame probability map (Y*) has been prepared in advance from importance scores that were annotated by multiple users and then normalized and averaged. It feeds each such video in as the input video (I), compares the ground-truth key frame probability map (Y*) with the key frame probability map (Y) produced by the video summary generation apparatus to calculate the supervised learning loss (L_sup), and backpropagates that loss to perform training.

In supervised learning, the learning unit calculates a classification loss (L_c), a sparsity loss (L_s), a restoration loss (L_r), and a diversity loss (L_d).

The classification loss (L_c) is the binary cross-entropy loss between the key frame probability map (Y) generated by the video summary generation apparatus and the ground-truth key frame probability map (Y*), and may be calculated according to Equation 6.

Here, y*_t is the ground-truth label of the t-th frame and w_t is the weight of the t-th frame, which may be calculated as w_t = median_freq/freq(s). freq(s) is the number of key frames divided by the total number of frames, and median_freq is the median of the key frame occurrence frequencies.

The sparsity loss (L_s) derives from the assumption that key frames should be sparse among the frames of the input video (I). As in Equation 7, it may be obtained by applying the L1 norm to the elements (a_ij ∈ A^K) of the adjacency matrix.
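A minimal Python sketch of this loss follows, assuming the L1 norm is taken as the plain sum of the absolute values of all elements of the final adjacency matrix. Equation 7 is not reproduced in this text, so any normalization factor it may contain is unknown; the function name and sample matrix are illustrative.

```python
def sparsity_loss(adjacency):
    """Sum of absolute values of all adjacency elements (entry-wise L1 norm, assumed form)."""
    return sum(abs(a) for row in adjacency for a in row)


# Hypothetical final adjacency matrix for two frames.
A_final = [[1.0, -0.5], [0.0, 0.25]]
loss = sparsity_loss(A_final)
```

Minimizing this quantity pushes most edge weights toward zero, so that only a few strong connections between frames remain.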

The restoration loss (L_r) is a loss added on the assumption that key frames should be visually diverse. The key frame probability map (Y) is applied to a separate, pre-trained graph convolutional network containing a single graph convolution module to generate an additional correction feature graph of the same size as the initial feature graph (X⁰). The restoration loss may then be obtained by calculating the mean squared error (MSE) between the generated additional correction feature graph and the initial feature graph (X⁰), as in Equation 8.
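The mean squared error between the two graphs can be sketched in Python as follows. The sketch averages over every element of the graphs, which is the usual MSE definition; Equation 8 is not reproduced in this text, and the function name and sample values are illustrative assumptions.

```python
def restoration_loss(reconstructed, original):
    """Mean squared error over all elements of two equally sized feature graphs."""
    total = 0.0
    count = 0
    for r_row, o_row in zip(reconstructed, original):
        for r, o in zip(r_row, o_row):
            total += (r - o) ** 2
            count += 1
    return total / count


# Hypothetical reconstructed and initial feature graphs (two nodes, two features each).
X_reconstructed = [[1.0, 2.0], [3.0, 4.0]]
X_initial = [[1.0, 2.0], [3.0, 6.0]]
loss = restoration_loss(X_reconstructed, X_initial)
```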

Finally, the diversity loss (L_d) may be obtained according to Equation 9 by applying a repelling regularizer between the nodes of the additional correction feature graph that are selected as key frames.

The learning unit then calculates the total loss for supervised learning (L_sup), as in Equation 10, from the classification loss (L_c), sparsity loss (L_s), restoration loss (L_r), and diversity loss (L_d) calculated according to Equations 6 to 9, and backpropagates it.

Here, λ, α, and β are weights that control the relative importance of the individual losses.

When the learning unit performs unsupervised learning, on the other hand, no labeled training data exist. It therefore excludes the classification loss (L_c) from Equation 10, calculates the total loss for unsupervised learning (L_unsup) from the sparsity loss (L_s), restoration loss (L_r), and diversity loss (L_d) as in Equation 11, and backpropagates it.

FIG. 4 shows an example of key frames extracted for different numbers of recursive iterations.

In FIG. 4, (a) shows the case of one recursive iteration (K=1), (b) the case of three recursive iterations (K=3), and (c) the case of five recursive iterations (K=5). In each of (a) to (c), the upper left frame is the key frame (F_key) selected from the input video (I), and the remaining frames are those selected in descending order of key frame probability, excluding the key frame (F_key). The value s shown with each frame is the importance score annotated in advance by multiple users and then normalized and averaged.

As shown in FIG. 4, the connections between semantically similar frames, that is, key frames with high semantic similarity, are progressively strengthened as the number of recursive iterations increases. However, the amount and time of computation grow with the number of iterations, while the semantic similarity between key frames is not strengthened much further. For efficiency, the number of recursive iterations may therefore be fixed in advance through experiments.

FIG. 5 shows a video summary generation method according to an embodiment of the present invention.

The video summary generation method of FIG. 5 is described with reference to FIGS. 1 to 3. First, an input video (I) of multiple frames for which a video summary is to be generated is received, the feature maps (X) extracted from the frames are regarded as node vectors to generate an initial feature graph (X⁰), and the similarity between the node vectors is calculated to generate an initial adjacency matrix (A⁰) (S10).

In the step of generating the initial feature graph (X⁰) and the initial adjacency matrix (A⁰) (S10), an input video (I) composed of a plurality of frames is first obtained (S11). Features are then extracted from each frame of the input video (I) according to a previously learned pattern estimation method to generate a plurality of feature maps (X) (S12). Once the feature maps (X) have been generated, each of them is regarded as a node vector representing a node of a graph composed of a plurality of nodes and a plurality of edges connecting the nodes to each other, and the initial feature graph (X⁰) is obtained (S13).

Once the initial feature graph (X⁰), which contains a node for each of the feature maps (X), has been obtained, the similarity between the nodes is calculated according to Equation 1 to obtain the initial adjacency matrix (A⁰) (S14).

Once the initial feature graph (X⁰) and the initial adjacency matrix (A⁰) have been obtained, they are applied to a plurality of graph convolutional networks in which a pattern estimation method has been learned in advance, and the pattern between the nodes of the initial feature graph (X⁰) is corrected to extract a correction graph (Z^k) (S20).

It is then determined whether the number of recursive iterations (k) has reached the predetermined number (K), that is, whether k ≥ K (S30).

If the number of recursive iterations (k) is less than the predetermined number (K), the correction graph (Z^k) is projected onto different linear embedding spaces, and the correction similarity (dA^k) between the two projected correction graphs (W_θZ^k, W_φZ^k) is calculated to obtain the adjacency matrix (A^k) for obtaining the next correction graph (Z^(k+1)) (S40).

Specifically, if the number of recursive iterations (k) is less than the predetermined number (K), the correction graph (Z^k) is weighted with different predetermined weight functions (W_θ, W_φ) so that it is projected onto different linear embedding spaces (S41). The similarity between the two projection correction graphs (W_θZ^k, W_φZ^k) is then calculated to obtain the correction similarity (dA^k) (S42). Once the correction similarity (dA^k) has been obtained, it is added to the previous adjacency matrix (A^(k-1)) to obtain the next adjacency matrix (A^k), and the correction graph (Z^k) is used as the next feature graph (X^k) (S43). The next adjacency matrix (A^k) and the next feature graph (X^k) are fed back to the plurality of graph convolutional networks in which the pattern estimation method has been learned in advance, so that the next correction graph (Z^(k+1)) is extracted from them (S20).
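The control flow of steps S20 to S43 can be sketched in Python as follows. The graph convolution and the similarity computation are passed in as callables so that only the loop structure is shown; the function name, the parameter names, and the toy callables are hypothetical. Following the flowchart literally, the adjacency matrix is not updated on the final pass, which is one possible reading of the text.

```python
def recursive_refinement(x0, a0, correlate, correction_similarity, num_iterations):
    """Alternate graph correlation (S20) and adjacency updates (S41 to S43) for K passes."""
    features, adjacency = x0, a0
    correction = None
    for k in range(1, num_iterations + 1):
        correction = correlate(adjacency, features)   # S20: extract Z^k
        if k >= num_iterations:                       # S30: stop once k >= K
            break
        delta = correction_similarity(correction)     # S41, S42: compute dA^k
        adjacency = [                                 # S43: A^k = A^(k-1) + dA^k
            [a + d for a, d in zip(a_row, d_row)]
            for a_row, d_row in zip(adjacency, delta)
        ]
        features = correction                         # S43: X^k = Z^k
    return correction, adjacency


# Toy stand-ins (not real graph convolutions) that make the loop easy to trace.
def toy_correlate(adjacency, features):
    return [[v + 1.0 for v in row] for row in features]


def toy_similarity(correction):
    n = len(correction)
    return [[1.0] * n for _ in range(n)]


final_graph, final_adjacency = recursive_refinement(
    [[0.0]], [[1.0]], toy_correlate, toy_similarity, 3
)
```

With the toy callables, three passes apply the correlation step three times and the adjacency update twice, which makes the loop easy to check by hand.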

If, however, the number of recursive iterations (k) is equal to or greater than the predetermined number (K), the semantic similarity between the nodes of the final correction graph (Z^K) is estimated from the final adjacency matrix (A^K) and the final correction graph (Z^K) according to a previously learned pattern estimation method, and key frames (F_key) are selected (S50).

To select the key frames (F_key), the final adjacency matrix (A^K) and the final correction graph (Z^K) are first applied to a separate graph convolutional network in which a pattern estimation method has been learned in advance, and a key frame probability map (Y) indicating the probability that each frame of the input video (I) is a key frame (F_key) is extracted according to the semantic similarity pattern between the nodes of the final correction graph (Z^K) (S51). The probability that each frame of the input video (I) is a key frame is then analyzed from the extracted key frame probability map (Y), and a plurality of key frames (F_key) are selected (S52). The selected key frames (F_key) are the frames of the input video (I) with high semantic similarity and can be regarded as the video summary.

FIG. 6 compares video summaries generated by the video summary generation method according to this embodiment with those generated by an existing video summary generation method.

In FIG. 6, (a) and (c) show video summaries generated by the previously published SUM-FCN technique, and (b) and (d) show video summaries generated by SumGraph, the video summarization technique according to this embodiment. The bar graph displayed above the selected frames in each of (a) to (d) shows the importance values labeled in the training data. The regions marked in red in each bar graph indicate the key frames selected by the respective technique.

As shown in FIG. 6, the video summary generation method according to this embodiment selects key frames that are visually more diverse than those of the existing technique and that correspond to the peaks of the labeled importance values. This indicates that the method can accurately estimate the semantic relationships between frames in order to generate an optimal, meaningful summary of the input video.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

100: initial graph generator 110: frame selection unit
120: feature map acquisition unit 130: initial graph acquisition unit
140: initial adjacency matrix acquisition unit 200: graph correlation unit
210 to 230: graph correlation modules 300: recursive graph acquisition unit
310: first projection unit 320: second projection unit
330: adjacent correction value acquisition unit 340: adjacency matrix acquisition unit
400: key frame extraction unit 410: correlation estimation unit
420: key frame selection unit

Claims

A video summary generation apparatus comprising:
an initial graph generator that receives an input video consisting of a plurality of frames, generates an initial feature graph by regarding a plurality of feature maps, each extracted from one of the frames according to a previously learned pattern estimation method, as node vectors, and generates an initial adjacency matrix by calculating the similarity between the node vectors;
a graph correlation unit that, using a plurality of graph convolution modules in which a pattern estimation method has been learned in advance, extracts a correction graph by correcting the pattern between the nodes of the initial feature graph or of a feature graph, based on the patterns of the initial feature graph and the initial adjacency matrix or of a recursively applied feature graph and adjacency matrix;
a recursive graph acquisition unit that receives the correction graph, weights it with different predetermined weight functions, and obtains an adjacency matrix and a feature graph for extracting the next correction graph based on the previous adjacency matrix and a correction similarity indicating the similarity between the two weighted correction graphs; and
a key frame extraction unit that, using a separate graph convolutional network in which a pattern estimation method has been learned in advance, selects a plurality of key frames by estimating, from the final correction graph obtained after a predetermined number of recursive iterations, the probability that each of the plurality of frames is a key frame according to the semantic similarity between the nodes of the final correction graph.

The apparatus of claim 1, wherein the initial graph generator comprises:
a feature map acquisition unit that generates the plurality of feature maps by extracting features from each of the plurality of frames of the input video according to a previously learned pattern estimation method;
an initial graph acquisition unit that obtains the initial feature graph by regarding each of the plurality of feature maps as a node of a graph composed of a plurality of nodes and a plurality of edges connecting the nodes to each other; and
an initial adjacency matrix acquisition unit that obtains the initial adjacency matrix by calculating the similarity between the plurality of nodes of the initial feature graph.

The apparatus of claim 2, wherein the initial graph acquisition unit
obtains the initial feature graph by converting each of the plurality of feature maps into a one-dimensional node vector.

The apparatus of claim 3, wherein the initial adjacency matrix acquisition unit
calculates the similarity between the node vectors (x₁, …, x_T) of the initial feature graph (X⁰) according to the equation

(where a_ij denotes an element of the initial adjacency matrix (A⁰), x_i and x_j denote the i-th and j-th node vectors among the node vectors (x₁, …, x_T) of the initial feature graph (X⁰), the superscript T denotes the transpose, and ∥·∥₂ denotes the L2 norm)
and thereby obtains the initial adjacency matrix (A⁰).

The apparatus of claim 2, wherein the initial graph generator
further comprises a frame selection unit that extracts frames from the plurality of frames of the input video at predetermined time intervals and passes the extracted frames to the feature map acquisition unit.

The apparatus of claim 2, wherein, in the graph correlation unit,
the plurality of graph convolution modules are connected in series, the first-stage module among them receives the initial feature graph or a recursively applied feature graph together with the initial adjacency matrix or a recursively applied adjacency matrix and estimates the pattern to extract a first intermediate feature graph, and each of the remaining modules receives the intermediate feature graph extracted by the preceding stage together with the adjacency matrix applied to the first-stage module and estimates the pattern to extract an intermediate feature graph or the correction graph.

The video summary generation apparatus of claim 6, wherein the recursive graph acquisition unit comprises:
a projection unit that receives the previously obtained correction graph and obtains two projection correction graphs by weighting the correction graph with different predetermined weight functions, so that the correction graph is projected onto different linear embedding spaces;
an adjacent correction value acquisition unit that obtains the correction similarity by calculating the similarity between the two projection correction graphs; and
an adjacency matrix acquisition unit that obtains the adjacency matrix by adding the obtained correction similarity to the previous adjacency matrix.

The video summary generation apparatus of claim 7, wherein the adjacent correction value acquisition unit calculates the correction similarity (dA^k) according to the equation

$$dA^{k} = \frac{(W_\theta Z^{k})^{T} (W_\phi Z^{k})}{\lVert W_\theta Z^{k} \rVert_2 \, \lVert W_\phi Z^{k} \rVert_2}$$

(where W_θ Z^k and W_φ Z^k denote the projection correction graphs obtained by weighting the correction graph (Z^k) with the weight functions (W_θ, W_φ), the superscript T denotes the transpose, and ∥·∥₂ denotes the L2 norm).
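
The projection, similarity, and addition steps can be sketched as below. This is an illustration under stated assumptions: the similarity is taken to be a cosine similarity normalized per node, node vectors are stored as rows (so the projection is written `Z @ W` rather than W Z), and the weight matrices are random stand-ins for learned weight functions. The function names are hypothetical.

```python
import numpy as np

def correction_similarity(Z, W_theta, W_phi, eps=1e-12):
    """Similarity between two linear projections of the correction graph Z (T x D)."""
    P = Z @ W_theta   # projection into the first embedding space
    Q = Z @ W_phi     # projection into the second embedding space
    # Normalize each projected node vector by its L2 norm (eps is a guard, an assumption)
    P = P / np.maximum(np.linalg.norm(P, axis=1, keepdims=True), eps)
    Q = Q / np.maximum(np.linalg.norm(Q, axis=1, keepdims=True), eps)
    return P @ Q.T    # (T x T) correction similarity

def update_adjacency(A_prev, Z, W_theta, W_phi):
    """New adjacency matrix = previous adjacency matrix + correction similarity."""
    return A_prev + correction_similarity(Z, W_theta, W_phi)

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 3))          # toy correction graph: 4 nodes, 3 features
W_theta = rng.standard_normal((3, 2))    # stand-ins for the two weight functions
W_phi = rng.standard_normal((3, 2))
A_prev = np.eye(4)                       # toy previous adjacency matrix
dA = correction_similarity(Z, W_theta, W_phi)
A_next = update_adjacency(A_prev, Z, W_theta, W_phi)
```

Because the two projections use different weights, the resulting correction need not be symmetric, unlike a plain cosine similarity of the unprojected nodes.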

The video summary generation apparatus of claim 7, wherein the key frame extraction unit comprises:
a correlation estimation unit that applies the final correction graph and the final adjacency matrix to a separate graph convolution network, in which the pattern estimation method has been learned in advance, and extracts a key frame probability map indicating the probability that each of the plurality of frames of the input video is a key frame, according to the semantic similarity pattern between the nodes of the final correction graph; and
a key frame selection unit that selects a plurality of key frames by analyzing, from the key frame probability map, the probability that each of the plurality of frames of the input video is a key frame.
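
One way to picture these two units is sketched below. The claim fixes neither the form of the separate graph convolution network nor the selection rule, so the single-layer sigmoid scorer and the top-k selection are assumptions chosen for brevity, and all names are hypothetical.

```python
import numpy as np

def key_frame_probabilities(Z, A, w):
    """Per-frame key frame probability from a one-layer graph convolution scorer."""
    logits = A @ Z @ w                      # one score per node (frame)
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid maps scores to (0, 1)

def select_key_frames(probs, k):
    """Pick the k most probable frames and return their indices in temporal order."""
    top = np.argsort(probs)[::-1][:k]       # indices sorted by descending probability
    return np.sort(top)

# Toy probability map for five frames
probs = np.array([0.1, 0.9, 0.4, 0.8, 0.2])
keys = select_key_frames(probs, k=2)        # frames 1 and 3 have the highest probability

# With zero weights every frame gets probability 0.5
flat = key_frame_probabilities(np.ones((3, 2)), np.eye(3), np.zeros(2))
```

A threshold on the probability, or a length budget for the summary, could replace the top-k rule without changing the rest of the sketch.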

A video summary generation method comprising:
receiving an input video consisting of a plurality of frames, generating an initial feature graph by treating each of a plurality of feature maps, extracted from the plurality of frames according to a previously learned pattern estimation method, as a node vector, and generating an initial adjacency matrix by calculating the similarity between the node vectors;
extracting a correction graph by using a plurality of graph convolution modules, in which a pattern estimation method has been learned in advance, to correct the pattern between the nodes of the initial feature graph or of a recursively applied feature graph, based on the pattern between the initial feature graph and the initial adjacency matrix or between the recursively applied feature graph and adjacency matrix;
obtaining an adjacency matrix and a feature graph for extracting the next correction graph, by receiving the correction graph and weighting it with different predetermined weight functions, based on a correction similarity indicating the similarity between the two weighted correction graphs and on the previous adjacency matrix; and
selecting a plurality of key frames by using a separate graph convolution network, in which the pattern estimation method has been learned in advance, to estimate the probability that each of the plurality of frames is a key frame according to the semantic similarity between the nodes of the final correction graph obtained by repeating the recursion a predetermined number of times.
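
The four steps of this method can be tied together in a short sketch. It is an illustration under several assumptions that the claim does not state: cosine similarity for both the initial adjacency and the correction similarity, a ReLU(A Z W) form for each graph convolution module, the correction graph being reused as the next recursive feature graph, a one-layer sigmoid scorer as the separate network, and top-k selection. Random matrices stand in for learned weights, and every name is hypothetical.

```python
import numpy as np

def cosine_rows(P, Q, eps=1e-12):
    """Pairwise cosine similarity between the rows of P and the rows of Q."""
    P = P / np.maximum(np.linalg.norm(P, axis=1, keepdims=True), eps)
    Q = Q / np.maximum(np.linalg.norm(Q, axis=1, keepdims=True), eps)
    return P @ Q.T

def summarize(X0, gcn_weights, W_theta, W_phi, w_out, num_recursions, k):
    A = cosine_rows(X0, X0)                  # step 1: initial adjacency matrix
    Z = X0                                   # initial feature graph
    for _ in range(num_recursions):          # recursion repeated a fixed number of times
        for W in gcn_weights:                # step 2: modules in series -> correction graph
            Z = np.maximum(A @ Z @ W, 0.0)
        # step 3: previous adjacency + similarity of the two projected correction graphs;
        # Z is reused as the next feature graph (an assumption of this sketch)
        A = A + cosine_rows(Z @ W_theta, Z @ W_phi)
    # step 4: key frame probability per frame, then top-k selection in temporal order
    probs = 1.0 / (1.0 + np.exp(-(A @ Z @ w_out)))
    keys = np.sort(np.argsort(probs)[::-1][:k])
    return keys, probs

rng = np.random.default_rng(0)
T, D = 8, 4                                              # 8 frames, 4 features each
X0 = rng.standard_normal((T, D))
gcn_weights = [0.1 * rng.standard_normal((D, D)) for _ in range(2)]
W_theta = rng.standard_normal((D, 3))
W_phi = rng.standard_normal((D, 3))
w_out = rng.standard_normal(D)
keys, probs = summarize(X0, gcn_weights, W_theta, W_phi, w_out, num_recursions=3, k=3)
```

The graph convolution weights are square (D x D) here so that the same modules can be applied again on every recursion without changing the feature dimension.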

The method of claim 10, wherein generating the initial adjacency matrix comprises:
generating the plurality of feature maps by extracting features from each of the plurality of frames of the input video according to a previously learned pattern estimation method;
obtaining the initial feature graph by treating each of the plurality of feature maps as a node of a graph consisting of a plurality of nodes and a plurality of edges connecting the plurality of nodes to each other; and
obtaining the initial adjacency matrix by calculating the similarity between the plurality of nodes of the initial feature graph.

The method of claim 11, wherein obtaining the initial feature graph comprises converting each of the plurality of feature maps into a one-dimensional node vector.

The method of claim 11, wherein obtaining the initial adjacency matrix comprises obtaining the initial adjacency matrix (A⁰) by calculating the similarity between the node vectors (x₁, …, x_T) of the initial feature graph (X⁰) according to the equation

$$a_{ij} = \frac{x_i^{T} x_j}{\lVert x_i \rVert_2 \, \lVert x_j \rVert_2}$$

(where a_ij denotes an element of the initial adjacency matrix (A⁰), x_i and x_j are the i-th and j-th node vectors, respectively, among the node vectors (x₁, …, x_T) of the initial feature graph (X⁰), the superscript T denotes the transpose, and ∥·∥₂ denotes the L2 norm).

The method of claim 11, wherein generating the initial adjacency matrix further comprises, before generating the plurality of feature maps, extracting frames from the plurality of frames of the input video at a predetermined time interval.

The method of claim 11, wherein, in extracting the correction graph, the plurality of graph convolution modules are connected in series; the first-stage graph convolution module among the plurality of graph convolution modules receives the initial feature graph or a recursive feature graph together with the initial adjacency matrix or a recursive adjacency matrix, and extracts a first intermediate feature graph by estimating the pattern; and each of the remaining graph convolution modules receives the intermediate feature graph extracted at the previous stage and the adjacency matrix applied to the first-stage graph convolution module, estimates the pattern, and extracts an intermediate feature graph or the correction graph.

The method of claim 15, wherein obtaining the adjacency matrix and the feature graph comprises:
obtaining two projection correction graphs by receiving the previously obtained correction graph and weighting the correction graph with different predetermined weight functions, so that the correction graph is projected onto different linear embedding spaces;
obtaining the correction similarity by calculating the similarity between the two projection correction graphs; and
obtaining the adjacency matrix by adding the obtained correction similarity to the previous adjacency matrix.

The method of claim 16, wherein obtaining the correction similarity comprises calculating the correction similarity (dA^k) according to the equation

$$dA^{k} = \frac{(W_\theta Z^{k})^{T} (W_\phi Z^{k})}{\lVert W_\theta Z^{k} \rVert_2 \, \lVert W_\phi Z^{k} \rVert_2}$$

(where W_θ Z^k and W_φ Z^k denote the projection correction graphs obtained by weighting the correction graph (Z^k) with the weight functions (W_θ, W_φ), the superscript T denotes the transpose, and ∥·∥₂ denotes the L2 norm).

The method of claim 16, wherein selecting the plurality of key frames comprises:
extracting a key frame probability map, indicating the probability that each of the plurality of frames of the input video is a key frame according to the semantic similarity pattern between the nodes of the final correction graph, by applying the final correction graph and the final adjacency matrix to a separate graph convolution network in which the pattern estimation method has been learned in advance; and
selecting a plurality of key frames by analyzing, from the key frame probability map, the probability that each of the plurality of frames of the input video is a key frame.