KR102148607B1

KR102148607B1 - Audio-video matching area detection apparatus and method

Info

Publication number: KR102148607B1
Application number: KR1020190090937A
Authority: KR
Inventors: 손광훈; 이지영
Original assignee: 연세대학교 산학협력단
Priority date: 2019-07-26
Filing date: 2019-07-26
Publication date: 2020-08-26

Abstract

The present invention relates to an audio-video matching area detection apparatus and method. The apparatus comprises: a feature map acquisition unit that extracts features from each of video data and audio data according to a pre-trained pattern estimation method to acquire a video feature map and an audio feature map; a semantic vector acquisition unit that converts the video feature map and the audio feature map into a video conversion feature map and an audio conversion feature map having the same predetermined dimension, and extracts features from each conversion feature map according to a pre-trained pattern estimation method to acquire a video semantic vector and an audio semantic vector; and a localization unit that combines the video feature map with the audio semantic vector in a predetermined manner to acquire a video emphasis map indicating the emphasis strength at each position of the video feature map according to the audio semantic vector, and combines the audio feature map with the video semantic vector in a predetermined manner to acquire an audio emphasis map indicating the emphasis strength at each position of the audio feature map according to the video semantic vector. The apparatus can thereby detect the audio section corresponding to an object recognized in a video, or detect the corresponding object area of a video from audio.

Description

Audio-video matching area detection apparatus and method {AUDIO-VIDEO MATCHING AREA DETECTION APPARATUS AND METHOD}

The present invention relates to an audio-video matching area detection apparatus and method, and more particularly to an apparatus and method capable of detecting the corresponding object area of a video from audio, or detecting the audio section corresponding to an object in a video.

With the recent growth of personal multimedia broadcasting, demand for simple multimedia editing technology is rising rapidly. Such editing frequently requires adding sound effects that suit the footage, or supplying footage that corresponds to a sound effect. Multimedia data such as video or audio of various objects can be readily obtained with existing search techniques, and object area localization techniques have been proposed to detect the area or section of the obtained video and audio that contains the image or sound of a specified object.

That is, various techniques have been studied for recognizing objects contained in a given video or audio, or for detecting the section of a video or audio that corresponds to a given object. However, their performance is limited by factors such as the diversity of objects and complex backgrounds.

Accordingly, many recent studies have performed localization, that is, extraction of object areas from video or audio, using artificial neural networks trained with deep learning techniques, and the use of deep learning has greatly improved the performance of object area localization for video.

In existing research, however, object area localization has mainly been performed either by detecting, separately for video and for audio, the object area for a given index term, or by recognizing the objects contained in the video or audio and extracting tags for them. Such approaches are vulnerable to synonyms for the same object, so detection performance drops sharply unless a precisely specified tag is provided for the desired object. They also cannot directly extract the audio section of an object from a video, or directly extract the object area contained in a video from audio. Moreover, because a textual cue such as an index term merely designates an object, it is convenient for determining whether the object is present or what its overall outline is, but it is ill-suited to indicating the area of a video where sound is produced, extracting the section of audio that contains the sound of a specific object, or removing ambient noise.

In multimedia editing in particular, video and audio are frequently out of synchronization. Because no criterion for matching video to audio has been available, an editor has had to perform the synchronization manually each time.

Korean Registered Patent No. 10-1900237 (registered September 13, 2018)

An object of the present invention is to provide an audio-video matching area detection apparatus and method capable of detecting the corresponding object area of a video from audio.

Another object of the present invention is to provide an audio-video matching area detection apparatus and method capable of recognizing an object in a video and detecting the audio section corresponding to the recognized object.

Yet another object of the present invention is to provide an audio-video matching area detection apparatus and method capable of automatically detecting sections for synchronizing audio and video.

Yet another object of the present invention is to provide an audio-video matching area detection apparatus and method that requires no annotated training data and is trained in a self-supervised manner based on the mutual semantic features of video and audio, so that detection of the sections for the same object in audio and in video can be learned simultaneously.

To achieve the above objects, an audio-video matching area detection apparatus according to an embodiment of the present invention includes: a feature map acquisition unit that extracts features from each of video data and audio data according to a pre-trained pattern estimation method to acquire a video feature map and an audio feature map; a semantic vector acquisition unit that converts the video feature map and the audio feature map into a video conversion feature map and an audio conversion feature map having the same predetermined dimension, and extracts features from each of the video conversion feature map and the audio conversion feature map according to a pre-trained pattern estimation method to acquire a video semantic vector and an audio semantic vector; and a localization unit that combines the video feature map with the audio semantic vector in a predetermined manner to acquire a video emphasis map indicating the emphasis strength at each position of the video feature map according to the audio semantic vector, and combines the audio feature map with the video semantic vector in a predetermined manner to acquire an audio emphasis map indicating the emphasis strength at each position of the audio feature map according to the video semantic vector.

The semantic vector acquisition unit may include: a video feature dimension conversion unit that receives the video feature map and converts it into the video conversion feature map; an audio feature dimension conversion unit that receives the audio feature map and converts it into the audio conversion feature map; a video semantic vector extraction unit whose pattern estimation method is trained in advance based on the video emphasis map reflecting the audio semantic vector, and which extracts the video semantic vector from the video conversion feature map; and an audio semantic vector extraction unit whose pattern estimation method is trained in advance based on the audio emphasis map reflecting the video semantic vector, and which extracts the audio semantic vector from the audio conversion feature map.

The localization unit may include: a video emphasis vector acquisition unit that acquires a video emphasis vector by matrix-multiplying the transpose of each sub-matrix of the video feature map by the audio semantic vector; an audio emphasis vector acquisition unit that acquires an audio emphasis vector by matrix-multiplying the transpose of each sub-matrix of the audio feature map by the video semantic vector; and an emphasis vector normalization unit that receives the video emphasis vector and the audio emphasis vector and normalizes them in a predetermined manner to acquire the video emphasis map and the audio emphasis map.

The emphasis vector normalization unit may normalize the video emphasis vector and the audio emphasis vector based on probabilities given by a softmax function.
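
The localization step described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the patented implementation: each position of a feature map is taken to be one D-dimensional vector (one "sub-matrix"), the semantic vector from the other modality is assumed to have the same length D, and normalization is a plain softmax over all positions. The helper names `dot`, `softmax`, and `emphasis_map` are hypothetical, and the numbers are made up.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emphasis_map(feature_rows, semantic_vector):
    """Score every position of a feature map against a semantic vector.

    feature_rows: list of D-dimensional vectors, one per position
                  (H*W spatial positions for video, M steps for audio).
    semantic_vector: D-dimensional vector from the other modality
                     (assumed to share the depth D).
    """
    raw = [dot(row, semantic_vector) for row in feature_rows]  # emphasis vector
    return softmax(raw)                                        # emphasis map


# Toy example: 3 positions, depth D = 2.
video_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_semantic = [2.0, 0.0]
video_emphasis = emphasis_map(video_rows, audio_semantic)
```

In this toy case the first and third positions align with the audio semantic vector and receive equal, larger weights than the second. The audio emphasis map would be obtained the same way by swapping the roles of the two modalities.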

The audio-video matching area detection apparatus may further include a learning unit for training the feature map acquisition unit and the semantic vector acquisition unit.

The learning unit may include: a video emphasis feature accumulation unit that accumulates the products of the elements of the video emphasis map and the video feature map to acquire a video accumulated emphasis feature map; an audio emphasis feature accumulation unit that accumulates the products of the elements of the audio emphasis map and the audio feature map to acquire an audio accumulated emphasis feature map; an audio emphasis feature conversion unit whose pattern estimation method is trained together with the feature map acquisition unit and the semantic vector acquisition unit during training, and which extracts features of the video accumulated emphasis feature map to acquire an audio emphasis feature vector; a video emphasis feature conversion unit whose pattern estimation method is trained together with the feature map acquisition unit and the semantic vector acquisition unit during training, and which extracts features of the audio accumulated emphasis feature map to acquire a video emphasis feature vector; and a loss backpropagation unit that calculates a loss by summing the difference between the video conversion feature map and the video emphasis feature vector and the difference between the audio conversion feature map and the audio emphasis feature vector, and backpropagates the calculated loss to the feature map acquisition unit, the semantic vector acquisition unit, the audio emphasis feature conversion unit, and the video emphasis feature conversion unit.
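
As a rough illustration of the quantities handled by the learning unit, the sketch below computes the accumulated emphasis features and the summed loss. It rests on several assumptions that the text does not specify: the learned emphasis feature conversion units are replaced by the identity, the "difference" is measured as a squared Euclidean distance, and backpropagation itself is omitted. The function names `accumulate`, `squared_distance`, and `matching_loss` are hypothetical, and all numbers are made up.

```python
def accumulate(emphasis, feature_rows):
    """Weighted sum of position vectors: sum_i emphasis[i] * feature_rows[i]."""
    depth = len(feature_rows[0])
    acc = [0.0] * depth
    for weight, row in zip(emphasis, feature_rows):
        for d in range(depth):
            acc[d] += weight * row[d]
    return acc


def squared_distance(a, b):
    """Squared Euclidean distance (the distance measure is an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching_loss(f_v, f_s, video_emphasis_vec, audio_emphasis_vec):
    """Sum of the two differences named in the text."""
    return (squared_distance(f_v, video_emphasis_vec)
            + squared_distance(f_s, audio_emphasis_vec))


# Toy feature maps (2 positions, depth 2) and toy emphasis maps.
video_rows = [[1.0, 0.0], [0.0, 1.0]]
audio_rows = [[0.5, 0.5], [1.0, 0.0]]
video_acc = accumulate([0.75, 0.25], video_rows)
audio_acc = accumulate([0.5, 0.5], audio_rows)

# Identity stands in for the learned conversion units (assumption).
audio_emphasis_vec = video_acc  # derived from the video accumulated map
video_emphasis_vec = audio_acc  # derived from the audio accumulated map

loss = matching_loss([0.5, 0.5], [0.75, 0.25],
                     video_emphasis_vec, audio_emphasis_vec)
```

In an actual training loop this scalar would be backpropagated through the feature extractors, the semantic vector extractors, and both conversion units, which requires an automatic differentiation framework and is outside the scope of this sketch.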

To achieve the above objects, an audio-video matching area detection method according to another embodiment of the present invention includes: extracting features from each of video data and audio data according to a pre-trained pattern estimation method to acquire a video feature map and an audio feature map; converting the video feature map and the audio feature map into a video conversion feature map and an audio conversion feature map having the same predetermined dimension; extracting features from each of the video conversion feature map and the audio conversion feature map according to a pre-trained pattern estimation method to acquire a video semantic vector and an audio semantic vector; and combining the video feature map with the audio semantic vector in a predetermined manner and combining the audio feature map with the video semantic vector in a predetermined manner, to acquire a video emphasis map indicating the emphasis strength at each position of the video feature map according to the audio semantic vector and an audio emphasis map indicating the emphasis strength at each position of the audio feature map according to the video semantic vector.

Accordingly, the audio-video matching area detection apparatus and method according to embodiments of the present invention can recognize an object in a video and detect the audio section corresponding to the recognized object, or detect the corresponding object area of a video from audio. Therefore, even when the exact tag for an object is unknown, the corresponding area or section of one of the video and the audio can be extracted from the other, so that sections for synchronizing audio and video can be detected automatically. In addition, training for audio and video can be performed simultaneously, which greatly improves training efficiency, and because training is performed in a self-supervised manner based on the mutual semantic correlation between audio and video, no separately annotated training data is required.

FIG. 1 shows the schematic structure of an audio-video matching area detection apparatus according to an embodiment of the present invention.
FIG. 2 shows the detailed configuration of the semantic vector acquisition unit of FIG. 1.
FIG. 3 shows the detailed configuration of the localization unit of FIG. 1.
FIG. 4 is a diagram for explaining the operation of each component of the audio-video matching area detection apparatus of FIG. 1.
FIG. 5 shows an audio-video matching area detection method according to an embodiment of the present invention.
FIGS. 6 and 7 show the results of extracting the area corresponding to the sound of an object in multimedia using the audio-video matching area detection method of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. Parts unrelated to the description are omitted for clarity, and identical reference numerals in the drawings denote identical members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that it excludes them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows the schematic structure of an audio-video matching area detection apparatus according to an embodiment of the present invention, FIG. 2 shows the detailed configuration of the semantic vector acquisition unit of FIG. 1, and FIG. 3 shows the detailed configuration of the localization unit of FIG. 1. FIG. 4 is a diagram for explaining the operation of each component of the audio-video matching area detection apparatus of FIG. 1.

Referring to FIGS. 1 and 4, the audio-video matching area detection apparatus according to this embodiment includes a multimedia acquisition unit 100, a feature map acquisition unit 200, a semantic vector acquisition unit 300, and a localization unit 400.

The multimedia acquisition unit 100 acquires the multimedia data in which the areas corresponding to the same object are to be detected. The multimedia data may include video data and audio data. The video data and audio data may have been captured together at the same place, as in a video clip, but they may also have been acquired separately. That is, they may be video data and audio data of similar objects acquired at different times and places.

The multimedia acquisition unit 100 may be implemented as a multimedia device that generates multimedia data directly, such as a video camera, or as a communication unit that receives multimedia data over a wired or wireless network, or a storage device that stores previously acquired multimedia data.

The multimedia acquisition unit 100 includes a video acquisition unit 110 that acquires video data and an audio acquisition unit 120 that acquires audio data. When the acquired multimedia data combines video data and audio data in a single file, as in a video clip, the video acquisition unit 110 and the audio acquisition unit 120 may separate the video data and the audio data from the multimedia data and acquire each individually.

When the video data contains a sequence of consecutive frames, the video acquisition unit 110 may also separate the frames and extract the image of each individual frame as the video data.

The feature map acquisition unit 200 extracts features from each of the video data and the audio data acquired by the multimedia acquisition unit 100 to acquire a video feature map (V) and an audio feature map (S). The feature map acquisition unit 200 includes a video feature extraction unit 210 that extracts features of the video data and an audio feature extraction unit 220 that extracts features of the audio data.

The video feature extraction unit 210 and the audio feature extraction unit 220 are each implemented with an artificial neural network whose pattern estimation method has been trained in advance, and they extract the features of the video data and of the audio data according to the trained pattern estimation method to acquire the video feature map (V) and the audio feature map (S). Here, the video feature extraction unit 210 may acquire, from the two-dimensional video data, a video feature map (V) in the form of a three-dimensional matrix (or vector) with real-valued (R) elements and with height (H), width (W), and depth (D), that is, V ∈ R^(H×W×D). The audio feature extraction unit 220 may acquire, from the one-dimensional audio data, an audio feature map (S) in the form of a two-dimensional matrix (or vector) with real-valued (R) elements and with intensity (M) and depth (D), that is, S ∈ R^(M×D). The dimensions and sizes of the video feature map (V) and the audio feature map (S) may, however, be adjusted in various ways.

Various artificial neural networks for acquiring a video feature map (V) and an audio feature map (S) from video data and audio data are publicly known. Here, as an example and as shown in FIG. 3, the artificial neural network is assumed to be implemented as a convolutional neural network (CNN) that includes at least one convolution layer (conv) and at least one pooling layer (pool).
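
To make the conv/pool structure concrete, the following is a minimal sketch of one convolution layer followed by one pooling layer applied to a one-dimensional audio signal. It is an illustration only: the kernels are hand-picked rather than learned, the ReLU activation and non-overlapping max pooling are assumptions, and the function names `conv1d`, `relu`, and `max_pool1d` are hypothetical. The result is a small feature map with M rows and D = 2 columns.

```python
def conv1d(signal, kernels):
    """Valid 1-D convolution (cross-correlation) with several kernels.

    Returns one row per time step, each holding one value per kernel,
    i.e. a (steps x D) feature map where D = len(kernels).
    """
    width = len(kernels[0])
    steps = len(signal) - width + 1
    return [
        [sum(signal[t + k] * kernel[k] for k in range(width))
         for kernel in kernels]
        for t in range(steps)
    ]


def relu(feature_map):
    """Element-wise rectifier (the choice of activation is an assumption)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool1d(feature_map, size):
    """Non-overlapping max pooling along the time axis."""
    depth = len(feature_map[0])
    pooled = []
    for start in range(0, len(feature_map) - size + 1, size):
        window = feature_map[start:start + size]
        pooled.append([max(row[d] for row in window) for d in range(depth)])
    return pooled


# Toy audio signal and two fixed kernels (D = 2); all values are made up.
audio = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
S = max_pool1d(relu(conv1d(audio, kernels)), size=2)
```

A practical extractor would stack several such layers with trained kernels, and the video branch would use two-dimensional convolutions to produce an H × W × D map.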

In this embodiment, the video feature extraction unit 210 and the audio feature extraction unit 220 may also be trained to acquire the feature maps (V, S) by backpropagation of the loss calculated in the learning unit 500 described later. However, because various artificial neural networks whose pattern estimation methods have already been trained to acquire a video feature map (V) and an audio feature map (S) are publicly available, such pre-trained networks may be used in some cases.

Meanwhile, the semantic vector acquisition unit 300 acquires a video semantic vector (h^v) and an audio semantic vector (h^s) from the video feature map (V) and the audio feature map (S), respectively, acquired by the feature map acquisition unit 200, according to a pre-trained pattern estimation method. The video semantic vector (h^v) is a feature vector extracted from the patterns of the video feature map (V) and indicates the position of the object along the spatial axes of the video data. The audio semantic vector (h^s) is a feature vector extracted from the patterns of the audio feature map (S) and indicates the position of the object along the time axis of the audio data.

The video semantic vector (h^v) and the audio semantic vector (h^s) are acquired so that the audio-video matching area detection apparatus of this embodiment can identify the area of a similar object contained in both the video data and the audio data, that is, so that the object area common to the video features and the audio features can be detected. In this embodiment, the video semantic vector (h^v) is acquired in order to detect the section of the audio data in which the object's sound occurs, and the audio semantic vector (h^s) is acquired in order to detect the area of the video data in which the object appears.

In general, the video data acquired by the multimedia acquisition unit 100 may contain not only a specific object but also various other objects and the surrounding background. Likewise, the audio data may contain various ambient sounds in addition to the sound produced by the specific object. Therefore, even if the video data and the audio data both contain information about a common object, it is not easy to detect the area containing that object in the video feature map (V) and the audio feature map (S) acquired separately from each. However, if the video feature extraction unit 210 and the audio feature extraction unit 220 of the feature map acquisition unit 200 have been trained properly, the video feature map (V) and the audio feature map (S) can be presumed to contain features of the common object. The semantic vector acquisition unit 300 therefore extracts, according to the trained pattern estimation method, the patterns of the key features that indicate the object's position in the video feature map (V) and the audio feature map (S) as the video semantic vector (h^v) and the audio semantic vector (h^s).

In this embodiment, the semantic vector acquisition unit 300 extracts feature vectors for detecting the area of the semantically identical object in feature maps acquired from two different kinds of data, video data and audio data. The extracted feature vectors are therefore defined as the video semantic vector (h^v) and the audio semantic vector (h^s).

The semantic vector acquisition unit 300 includes a video semantic vector acquisition unit 310 that acquires the video semantic vector (h^v) from the video feature map (V), and an audio semantic vector acquisition unit 320 that acquires the audio semantic vector (h^s) from the audio feature map (S).

Referring to FIG. 2, the video semantic vector acquisition unit 310 may include a video feature dimension conversion unit 311 and a video semantic vector extraction unit 312, and the audio semantic vector acquisition unit 320 may include an audio feature dimension conversion unit 321 and an audio semantic vector extraction unit 322.

비디오 특징 차원 변환부(311)와 오디오 특징 차원 변환부(321)는 특징맵 획득부(200)에서 인가되는 비디오 특징맵(V)과 오디오 특징맵(S)의 크기를 동일하게 일치시키기 위해 포함되는 구성이다. 상기한 바와 같이, 특징맵 획득부(200)에서 획득되는 비디오 특징맵(V)은 3차원(H × W × D) 행렬 형태로 획득되는 반면, 오디오 특징맵(S)은 2차원(M × D) 행렬 형태로 획득되므로, 서로 차원 및 크기가 상이하며, 이로 인해 공통의 객체에 대한 특징을 추출하기 어려우며, 추출하더라도 상호 적용이 용이하지 않다.The video feature dimensional conversion unit 311 and the audio feature dimensional conversion unit 321 are included to equally match the sizes of the video feature map V and the audio feature map S applied from the feature map acquisition unit 200 It is a configuration. As described above, the video feature map V obtained by the feature map acquisition unit 200 is obtained in the form of a three-dimensional (H × W × D) matrix, while the audio feature map S is a two-dimensional (M × D) Since it is obtained in the form of a matrix, the dimensions and sizes are different from each other, and thus it is difficult to extract features for a common object, and even if extracted, it is not easy to mutually apply.

이에 비디오 특징 차원 변환부(311)와 오디오 특징 차원 변환부(321)는 비디오 특징맵(V)과 오디오 특징맵(S)을 기지정된 방식으로 동일한 1차원 행렬로 변환한다.Accordingly, the video feature dimension transform unit 311 and the audio feature dimension transform unit 321 convert the video feature map V and the audio feature map S into the same one-dimensional matrix in a known manner.

For example, the video feature dimension conversion unit 311 may take the video feature map (V), a three-dimensional (H × W × D) matrix, and for each index along the depth (D) average the elements over the height (H) and width (W), thereby reducing the map to one dimension and obtaining the video conversion feature map (f^v ∈ R^D).

Similarly, the audio feature dimension conversion unit 321 receives the audio feature map (S), a two-dimensional (M × D) matrix, and converts it into a one-dimensional matrix to obtain the audio conversion feature map (f^s ∈ R^D).
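The dimension conversion described above can be illustrated with a minimal sketch. The function names `pool_video` and `pool_audio` are hypothetical, the feature maps are plain nested lists, and averaging the audio feature map over its M time positions is an assumption: the text only states that the audio side is converted "similarly".

```python
def pool_video(V):
    """Average an H x W x D nested list over height and width, giving a length-D list."""
    H, W, D = len(V), len(V[0]), len(V[0][0])
    return [sum(V[y][x][d] for y in range(H) for x in range(W)) / (H * W)
            for d in range(D)]


def pool_audio(S):
    """Average an M x D nested list over the M time positions (assumed pooling rule)."""
    M, D = len(S), len(S[0])
    return [sum(S[j][d] for j in range(M)) / M for d in range(D)]


# Toy inputs: V has H = 2, W = 2, D = 2; S has M = 3, D = 2.
V = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
S = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

f_v = pool_video(V)  # video conversion feature map, length D
f_s = pool_audio(S)  # audio conversion feature map, length D
```

Both outputs have length D, so the two modalities can be fed to extractors of the same input size.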

The video semantic vector extraction unit 312 and the audio semantic vector extraction unit 322 receive the video conversion feature map (f^v) and the audio conversion feature map (f^s), respectively. Each is implemented as an artificial neural network whose pattern estimation method has been learned in advance, and they obtain the video semantic vector (h^v) and the audio semantic vector (h^s), which are feature vectors of the supplied video conversion feature map (f^v) and audio conversion feature map (f^s).

The localization unit 400 obtains a video emphasis map (a^v) and an audio emphasis map (a^s) using the video feature map (V) and the audio feature map (S) obtained by the feature map acquisition unit 200 and the video semantic vector (h^v) and the audio semantic vector (h^s) obtained by the semantic vector acquisition unit 300.

Referring to FIGS. 1 and 3, the localization unit 400 may include a video localization unit 410 and an audio localization unit 420. The video localization unit 410 may include a video emphasis vector acquisition unit 411 and a video emphasis vector normalization unit 412, and the audio localization unit 420 may include an audio emphasis vector acquisition unit 421 and an audio emphasis vector normalization unit 422.

The video emphasis vector acquisition unit 411 receives the video feature map (V) and the audio semantic vector (h^s) and obtains the video emphasis vector (c^v) according to Equation 1, in which the transpose of each partial matrix of the video feature map (V) is multiplied by the audio semantic vector (h^s).

Here, v^T denotes the transpose of a partial matrix (v) of the three-dimensional video feature map (V).

The video emphasis vector (c^v) represents the emphasis strength at each position of the video feature map (V) according to the audio semantic vector (h^s).

The video emphasis vector normalization unit 412 then receives the video emphasis vector (c^v) and normalizes it in a predesignated manner to obtain the video emphasis map (a^v). For example, the video emphasis vector normalization unit 412 may normalize the video emphasis vector (c^v) into probabilities with a softmax function, as in Equation 2.
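As a rough sketch of this step, the following code computes one score per spatial position and normalizes the scores with a softmax. The names `emphasis_vector` and `softmax` are hypothetical, the video feature map is assumed to be flattened to HW rows of length D, and the semantic vector is assumed to have the same length D.

```python
import math


def emphasis_vector(positions, h):
    """Dot product of each D-dimensional position feature with the semantic vector h."""
    return [sum(p_d * h_d for p_d, h_d in zip(p, h)) for p in positions]


def softmax(c):
    """Numerically stable softmax over a list of scores."""
    m = max(c)
    exps = [math.exp(x - m) for x in c]
    total = sum(exps)
    return [e / total for e in exps]


# Toy video feature map flattened to HW = 3 positions with D = 2.
V_flat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_s = [2.0, 0.0]                     # toy audio semantic vector

c_v = emphasis_vector(V_flat, h_s)   # video emphasis vector, one score per position
a_v = softmax(c_v)                   # video emphasis map, entries sum to 1
```

Positions whose features align with the audio semantic vector receive larger weights in `a_v`. The same two functions would apply to the audio side by passing the M rows of the audio feature map and the video semantic vector.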

Meanwhile, the audio emphasis vector acquisition unit 421 receives the audio feature map (S) and the video semantic vector (h^v) and obtains, according to Equation 3, the audio emphasis vector (c^s), which represents the emphasis strength at each position of the audio feature map (S) according to the video semantic vector (h^v). In Equation 3, the transpose of each partial matrix of the audio feature map (S) is multiplied by the video semantic vector (h^v).

Here, s^T denotes the transpose of a partial matrix (s) of the two-dimensional audio feature map (S).

Like the video emphasis vector normalization unit 412, the audio emphasis vector normalization unit 422 may normalize the audio emphasis vector (c^s) into probabilities with a softmax function, as in Equation 4, to obtain the audio emphasis map (a^s).
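The images of Equations 1 to 4 are not reproduced in this text. Based only on the surrounding description (a transposed partial matrix multiplied by the semantic vector of the other modality, followed by softmax normalization), they would plausibly take the following form. The position subscripts on v and s and the summation index k are assumed notation, not taken from the original.

```latex
\begin{align}
c_i^{v} &= v_i^{T} h^{s}, & i &\in \{1, \dots, HW\} \tag{1} \\
a_i^{v} &= \frac{\exp(c_i^{v})}{\sum_{k=1}^{HW} \exp(c_k^{v})} \tag{2} \\
c_j^{s} &= s_j^{T} h^{v}, & j &\in \{1, \dots, M\} \tag{3} \\
a_j^{s} &= \frac{\exp(c_j^{s})}{\sum_{k=1}^{M} \exp(c_k^{s})} \tag{4}
\end{align}
```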

The video emphasis map (a^v) and the audio emphasis map (a^s) obtained by the localization unit 400 are matrices: the former gives, for each position of the video data, the strength of correspondence to the features of the audio data, and the latter gives, for each position of the audio data, the strength of correspondence to the features of the video data.

Accordingly, the video emphasis map (a^v) and the audio emphasis map (a^s) can represent the mutually corresponding object regions in the video data and the audio data, respectively.

Although FIG. 3 shows the video emphasis vector normalization unit 412 and the audio emphasis vector normalization unit 422 separately for ease of understanding, the two may be integrated into a single emphasis vector normalization unit.

Meanwhile, for the audio-video matching area detection apparatus to correctly detect the region of a common object in video data and audio data, the artificial neural networks of the feature map acquisition unit 200 and the semantic vector acquisition unit 300 must be trained in advance. The apparatus according to this embodiment may therefore further include a learning unit 500 for training the artificial neural networks of the feature map acquisition unit 200 and the semantic vector acquisition unit 300.

The learning unit 500 is needed only for the training process of the audio-video matching area detection apparatus and may be omitted in actual operation once the apparatus has been trained.

Referring again to FIG. 1, the learning unit 500 may include a video emphasis feature accumulation unit 510, an audio emphasis feature accumulation unit 520, a video emphasis feature conversion unit 530, an audio emphasis feature conversion unit 540, and a loss backpropagation unit 550.

First, the video emphasis feature accumulation unit 510 receives the video emphasis map (a^v) obtained by the video localization unit 410 and the video feature map (V) obtained by the video feature extraction unit 210, and accumulates the products of each element (a_i^v) of the video emphasis map (a^v) and the video feature map (V) to obtain the video cumulative emphasis feature map (Z^v), as in Equation 5. Here, i (i ∈ {1, ..., HW}) denotes the spatial position.

The audio emphasis feature accumulation unit 520 receives the audio emphasis map (a^s) obtained by the audio localization unit 420 and the audio feature map (S) obtained by the audio feature extraction unit 220, and accumulates the products of each element (a_j^s) of the audio emphasis map (a^s) and the audio feature map (S) to obtain the audio cumulative emphasis feature map (Z^s), as in Equation 6. Here, j (j ∈ {1, ..., M}) denotes the temporal position.
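The accumulation can be sketched as a weighted sum of position features. This is an illustration under assumptions: each emphasis weight is taken to multiply the D-dimensional feature at the same position, so the result is a length-D vector, and the function name `accumulate` is hypothetical.

```python
def accumulate(weights, features):
    """Weighted sum of D-dimensional feature vectors: sum_k weights[k] * features[k]."""
    D = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(D)]


# Video side: HW = 3 spatial positions, D = 2.
a_v = [0.5, 0.25, 0.25]
V_flat = [[2.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
Z_v = accumulate(a_v, V_flat)   # video cumulative emphasis feature

# Audio side: M = 2 temporal positions, D = 2.
a_s = [0.5, 0.5]
S = [[1.0, 3.0], [3.0, 5.0]]
Z_s = accumulate(a_s, S)        # audio cumulative emphasis feature
```

Because the emphasis maps sum to 1, each result is a convex combination of the position features, dominated by the most strongly emphasized positions.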

Meanwhile, the video emphasis feature conversion unit 530 and the audio emphasis feature conversion unit 540 are each implemented as an artificial neural network. They are trained together with the feature map acquisition unit 200 and the semantic vector acquisition unit 300, which allows the audio-video matching area detection apparatus to perform self-supervised learning on multimedia data that carries no separate annotations.

The video emphasis feature conversion unit 530 extracts features from the video cumulative emphasis feature map (Z^v) supplied by the video emphasis feature accumulation unit 510 to obtain an audio emphasis feature vector (g^s) used to train the audio feature extraction unit 220 and the audio semantic vector acquisition unit 320.

The audio emphasis feature conversion unit 540 extracts features from the audio cumulative emphasis feature map (Z^s) supplied by the audio emphasis feature accumulation unit 520 to obtain a video emphasis feature vector (g^v) used to train the video feature extraction unit 210 and the video semantic vector acquisition unit 310.

If training has been carried out properly, the video conversion feature map (f^v), obtained by converting the dimension of the video feature map (V) that holds the object features extracted from the video data, should be similar to the video emphasis feature vector (g^v), obtained by extracting features from the audio cumulative emphasis feature map (Z^s) emphasized by the video semantic vector (h^v) (f^v ≈ g^v). Likewise, the audio conversion feature map (f^s), obtained by converting the dimension of the audio feature map (S) that holds the object features extracted from the audio data, should be similar to the audio emphasis feature vector (g^s), obtained by extracting features from the video cumulative emphasis feature map (Z^v) emphasized by the audio semantic vector (h^s) (f^s ≈ g^s).

Accordingly, the video emphasis feature conversion unit 530 obtains the audio emphasis feature vector (g^s), which is used to compute a loss with which the audio feature extraction unit 220 and the audio semantic vector extraction unit 322 of the audio semantic vector acquisition unit 320 can be trained in a self-supervised manner. The audio emphasis feature conversion unit 540 obtains the video emphasis feature vector (g^v), which is used to compute a loss with which the video feature extraction unit 210 and the video semantic vector extraction unit 312 of the video semantic vector acquisition unit 310 can be trained in a self-supervised manner.

Meanwhile, the loss backpropagation unit 550 computes the loss (L), as in Equation 7, based on the difference between the video conversion feature map (f^v) obtained by the semantic vector acquisition unit 300 and the video emphasis feature vector (g^v) and the difference between the audio conversion feature map (f^s) and the audio emphasis feature vector (g^s).

Here, λ is a parameter that adjusts the relative importance of the video loss and the audio loss, and ∥·∥₂ denotes the L₂ norm.
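The loss equation itself is not reproduced in this text, so the following is only a sketch of one form consistent with the description: the sum of two L₂ distances, with λ weighting one of them. Placing λ on the audio term and leaving the norms unsquared are assumptions, as are the names `l2_norm` and `matching_loss`.

```python
import math


def l2_norm(x):
    """Euclidean (L2) norm of a list of numbers."""
    return math.sqrt(sum(v * v for v in x))


def matching_loss(f_v, g_v, f_s, g_s, lam=1.0):
    """Video distance plus lam times audio distance (lam placement is an assumption)."""
    video_term = l2_norm([a - b for a, b in zip(f_v, g_v)])
    audio_term = l2_norm([a - b for a, b in zip(f_s, g_s)])
    return video_term + lam * audio_term


# Toy vectors: only the video pair differs here, by a 3-4-5 triangle.
loss_video_only = matching_loss([3.0, 4.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], lam=0.5)
# Only the audio pair differs here, so the distance is scaled by lam.
loss_audio_only = matching_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [0.0, 0.0], lam=0.5)
```

The loss is zero only when both conversion feature maps coincide with their emphasis feature vectors, which is the condition f^v ≈ g^v and f^s ≈ g^s that training aims for.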

The loss backpropagation unit 550 then backpropagates the computed loss (L) to the video emphasis feature conversion unit 530 and the audio emphasis feature conversion unit 540, together with the feature map acquisition unit 200 and the semantic vector acquisition unit 300, to train them.

That is, in the audio-video matching area detection apparatus of this embodiment, the video emphasis feature conversion unit 530 and the audio emphasis feature conversion unit 540 undergo self-supervised learning together with the feature map acquisition unit 200 and the semantic vector acquisition unit 300. The loss backpropagation unit 550 may end training when the computed loss (L) is at or below a predesignated threshold loss (L_th) or when the number of iterations reaches a predesignated number of training iterations.
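The termination rule can be sketched as a simple loop. The function `train` and its parameters are hypothetical, and `fake_step` is only a stand-in for one forward and backward pass that returns the current loss.

```python
def train(step, loss_threshold, max_iterations):
    """Call step() until the loss is at or below the threshold or the cap is reached."""
    loss = float("inf")
    iterations = 0
    while iterations < max_iterations:
        loss = step()              # one training iteration, returns the loss (L)
        iterations += 1
        if loss <= loss_threshold:
            break                  # L <= L_th: stop early
    return iterations, loss


# Stand-in for a real training step: the loss simply halves on each call.
state = {"loss": 8.0}


def fake_step():
    state["loss"] /= 2.0
    return state["loss"]


iterations, final_loss = train(fake_step, loss_threshold=1.0, max_iterations=100)
```

With this toy step the loss reaches the threshold on the third iteration; a step whose loss never drops would instead run until `max_iterations`.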

However, as described above, the feature map acquisition unit 200 may use an artificial neural network that has already been trained, in which case the loss need not be backpropagated to the feature map acquisition unit 200. Alternatively, even if the feature map acquisition unit 200 has already been trained, the loss (L) may still be backpropagated to it for additional training. Even when additional training is performed, a feature map acquisition unit 200 that was previously trained for object detection can speed up training.

Meanwhile, as described above, the audio-video matching area detection apparatus according to this embodiment can readily detect the region of the same object even in video data and audio data acquired separately at different places or times. For training, however, it is preferable to use video data and audio data that correspond to each other, and thus to use moving pictures in which the video data and the audio data were acquired together at the same place and time.

FIG. 5 shows an audio-video matching area detection method according to an embodiment of the present invention.

Describing the method of FIG. 5 with reference to FIGS. 1 to 4, the audio-video matching area detection method may include an audio-video matching area detection step (S10) and a learning step (S20).

The learning step (S20) is used only for training before the method is put into operation and may be omitted in actual operation.

In the audio-video matching area detection step (S10), video data and audio data in which the matching area is to be detected are first acquired (S11). Features are then extracted from each of the acquired video data and audio data according to a pre-learned pattern estimation method to obtain the video feature map (V) and the audio feature map (S) (S12).

Because the dimensions of the obtained video feature map (V) and audio feature map (S) differ, both are converted to one dimension so that they have the same dimension, yielding the video conversion feature map (f^v) and the audio conversion feature map (f^s) (S13).

Next, common features are extracted from each of the video conversion feature map (f^v) and the audio conversion feature map (f^s) according to a pattern estimation method learned in advance for that purpose, yielding the video semantic vector (h^v) and the audio semantic vector (h^s) (S14).

Once the video semantic vector (h^v) and the audio semantic vector (h^s) have been obtained, the video feature map (V) and the audio semantic vector (h^s) are combined in a predesignated manner and normalized to obtain the video emphasis map (a^v), which indicates the spatially emphasized region of the video data corresponding to the audio data. The audio feature map (S) and the video semantic vector (h^v) are likewise combined in a predesignated manner and normalized to obtain the audio emphasis map (a^s), which indicates the temporally emphasized region of the audio data corresponding to the video data (S15).
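Steps S13 to S15 can be strung together in a compact sketch. Everything here is illustrative: the feature maps of step S12 are given as toy nested lists, mean pooling and dot-product scoring follow the examples in the description, and the learned semantic vector extractors are replaced by identity functions purely as placeholders. The names `mean_pool`, `attend`, and `detect` are hypothetical.

```python
import math


def mean_pool(rows):
    """Average a list of D-dimensional rows into a single length-D list."""
    D = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(D)]


def softmax(c):
    """Numerically stable softmax over a list of scores."""
    m = max(c)
    exps = [math.exp(x - m) for x in c]
    total = sum(exps)
    return [e / total for e in exps]


def attend(rows, h):
    """Score each row against h by dot product, then normalize with softmax."""
    return softmax([sum(a * b for a, b in zip(r, h)) for r in rows])


def detect(V_flat, S, semantic_v=lambda f: f, semantic_s=lambda f: f):
    """Return (a_v, a_s); the semantic extractors default to identity placeholders."""
    f_v, f_s = mean_pool(V_flat), mean_pool(S)    # S13: dimension conversion
    h_v, h_s = semantic_v(f_v), semantic_s(f_s)   # S14: semantic vectors
    a_v = attend(V_flat, h_s)                     # S15: video emphasis map
    a_s = attend(S, h_v)                          # S15: audio emphasis map
    return a_v, a_s


# Toy feature maps: HW = 4 spatial positions, M = 2 time positions, D = 2.
V_flat = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
S = [[3.0, 0.0], [0.0, 0.0]]
a_v, a_s = detect(V_flat, S)
```

In this toy input the first spatial position and the first time position share the same dominant feature channel, so both emphasis maps peak at index 0.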

In the learning step (S20), the video emphasis map (a^v) and the audio emphasis map (a^s) are obtained from the multimedia data for training, that is, from the video data and audio data acquired for training. The products of each element of these maps with the video feature map (V) and the audio feature map (S), respectively, are accumulated to obtain the video cumulative emphasis feature map (Z^v) and the audio cumulative emphasis feature map (Z^s) (S21).

Features of the obtained video cumulative emphasis feature map (Z^v) are then estimated and converted into the audio emphasis feature vector (g^s), and features of the audio cumulative emphasis feature map (Z^s) are estimated and converted into the video emphasis feature vector (g^v) (S22).

The loss (L) is then computed, as in Equation 7, based on the difference between the obtained video conversion feature map (f^v) and the video emphasis feature vector (g^v) and the difference between the audio conversion feature map (f^s) and the audio emphasis feature vector (g^s) (S23).

Once the loss (L) has been computed, it is backpropagated to perform training (S24).

Training of the audio-video matching area detection method is performed by repeating the entire audio-video matching area detection step (S10) and learning step (S20), and may end when the number of repetitions reaches a predesignated number of training iterations or when the computed loss (L) is at or below the predesignated threshold loss (L_th).

FIGS. 6 and 7 show results of extracting, from multimedia, the region corresponding to the sound of an object using the audio-video matching area detection method of the present invention.

As shown in (a) to (c) of each of FIGS. 6 and 7, when video data and audio data containing a common specific object are given, the audio-video matching area detection apparatus and method according to this embodiment can accurately detect the region of the video data corresponding to the audio data, as shown in (d) to (f), or the section of the audio data corresponding to the video data. Although FIGS. 6 and 7 display only the emphasized region of the video data corresponding to the audio data, because it is easier to visualize, the section of the audio data corresponding to the video data can also be extracted accurately.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and equivalent other embodiments are possible.

Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

100: multimedia acquisition unit 200: feature map acquisition unit
300: semantic vector acquisition unit 400: localization unit
500: learning unit 310: video semantic vector acquisition unit
320: audio semantic vector acquisition unit 410: video localization unit
420: audio localization unit

Claims

An audio-video matching area detection apparatus comprising:
a feature map acquisition unit that extracts features from each of video data and audio data according to a pre-learned pattern estimation method to obtain a video feature map and an audio feature map;
a semantic vector acquisition unit that converts the video feature map and the audio feature map into a video conversion feature map and an audio conversion feature map having the same predesignated dimension, and extracts features of each of the video conversion feature map and the audio conversion feature map according to a pre-learned pattern estimation method to obtain a video semantic vector and an audio semantic vector; and
a localization unit that combines the video feature map and the audio semantic vector in a predesignated manner to obtain a video emphasis map representing the emphasis strength at each position of the video feature map according to the audio semantic vector, and combines the audio feature map and the video semantic vector in a predesignated manner to obtain an audio emphasis map representing the emphasis strength at each position of the audio feature map according to the video semantic vector,
the apparatus further comprising a learning unit for training the feature map acquisition unit and the semantic vector acquisition unit,
wherein the learning unit comprises:
a video emphasis feature accumulation unit that accumulates the products of the elements of the video emphasis map and the video feature map to obtain a video cumulative emphasis feature map;
an audio emphasis feature accumulation unit that accumulates the products of the elements of the audio emphasis map and the audio feature map to obtain an audio cumulative emphasis feature map;
an audio emphasis feature conversion unit whose pattern estimation method is learned together with the feature map acquisition unit and the semantic vector acquisition unit during training, and which extracts features of the video cumulative emphasis feature map to obtain an audio emphasis feature vector;
a video emphasis feature conversion unit whose pattern estimation method is learned together with the feature map acquisition unit and the semantic vector acquisition unit during training, and which extracts features of the audio cumulative emphasis feature map to obtain a video emphasis feature vector; and
a loss backpropagation unit that computes a loss by summing the difference between the video conversion feature map and the video emphasis feature vector and the difference between the audio conversion feature map and the audio emphasis feature vector, and backpropagates the computed loss to the feature map acquisition unit, the semantic vector acquisition unit, the audio emphasis feature conversion unit, and the video emphasis feature conversion unit.

The apparatus of claim 1, wherein the semantic vector acquisition unit comprises:
a video feature dimension conversion unit that receives the video feature map and converts it into the video conversion feature map;
an audio feature dimension conversion unit that receives the audio feature map and converts it into the audio conversion feature map;
a video semantic vector extraction unit that extracts the video semantic vector from the video conversion feature map using a pattern estimation method learned in advance based on the video emphasis map reflecting the audio semantic vector; and
an audio semantic vector extraction unit that extracts the audio semantic vector from the audio conversion feature map using a pattern estimation method learned in advance based on the audio emphasis map reflecting the video semantic vector.

The apparatus of claim 1, wherein the localization unit comprises:
a video emphasis vector acquisition unit that obtains a video emphasis vector by multiplying the transpose of each partial matrix of the video feature map by the audio semantic vector;
an audio emphasis vector acquisition unit that obtains an audio emphasis vector by multiplying the transpose of each partial matrix of the audio feature map by the video semantic vector; and
an emphasis vector normalization unit that receives the video emphasis vector and the audio emphasis vector and normalizes them in a predesignated manner to obtain the video emphasis map and the audio emphasis map.

The apparatus of claim 3, wherein the emphasis vector normalization unit
normalizes the video emphasis vector and the audio emphasis vector based on probabilities given by a softmax function.

(Deleted)

An audio-video matching area detection method comprising:
Obtaining a video feature map and an audio feature map by extracting features from each of video data and audio data according to a previously learned pattern estimation method;
Converting the video feature map and the audio feature map into a video transform feature map and an audio transform feature map having the same predetermined dimension;
Obtaining a video semantic vector and an audio semantic vector by extracting features from each of the video transform feature map and the audio transform feature map according to a previously learned pattern estimation method; and
Obtaining a video enhancement map, which indicates the enhancement intensity for each position in the video feature map according to the audio semantic vector, by combining the video feature map and the audio semantic vector in a predetermined manner, and obtaining an audio enhancement map, which indicates the enhancement intensity for each position in the audio feature map according to the video semantic vector, by combining the audio feature map and the video semantic vector in a predetermined manner;
The method further comprising a learning step for training the step of obtaining the video feature map and the audio feature map and the step of obtaining the video semantic vector and the audio semantic vector,
Wherein the learning step comprises:
Accumulating the products of corresponding elements of the video enhancement map and the video feature map to obtain a video accumulation enhancement feature map;
Accumulating the products of corresponding elements of the audio enhancement map and the audio feature map to obtain an audio accumulation enhancement feature map;
Obtaining an audio enhancement feature vector by extracting features of the video accumulation enhancement feature map according to a pattern estimation method that is learned during training;
Obtaining a video enhancement feature vector by extracting features of the audio accumulation enhancement feature map according to a pattern estimation method that is learned during training; and
Calculating a loss by summing the difference between the video transform feature map and the video enhancement feature vector and the difference between the audio transform feature map and the audio enhancement feature vector, and backpropagating the calculated loss.
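
The accumulation and loss summation of the learning step can be sketched as below, under several simplifying assumptions that the claim does not recite: the accumulation is taken as a weighted sum over positions that yields a single vector, each transform feature map is assumed to be already reduced to one vector, and the "difference" is taken as a squared Euclidean distance. The learned feature extraction and the backpropagation are omitted, and all function names and values are hypothetical.

```python
def accumulate(enhancement_map, feature_map):
    """Weighted sum over positions: sum_i a_i * v_i (assumed accumulation)."""
    channels = len(feature_map[0])
    accumulated = [0.0] * channels
    for weight, features in zip(enhancement_map, feature_map):
        for c in range(channels):
            accumulated[c] += weight * features[c]
    return accumulated


def squared_distance(a, b):
    """Squared Euclidean distance, standing in for the claimed 'difference'."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_loss(video_transform_vec, video_enh_vec, audio_transform_vec, audio_enh_vec):
    """Sum of the video-side and audio-side differences."""
    return (squared_distance(video_transform_vec, video_enh_vec)
            + squared_distance(audio_transform_vec, audio_enh_vec))


# Hypothetical 2-position enhancement map and 2-channel feature map.
video_accumulated = accumulate([0.25, 0.75], [[4.0, 0.0], [0.0, 4.0]])
# Hypothetical vectors for the two loss terms.
loss = total_loss([1.0, 3.0], [1.0, 2.0], [0.0], [0.5])
```

In a trainable implementation the loss would be computed on tensors of an automatic differentiation framework so that it can be backpropagated; the plain lists here only illustrate the arithmetic.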

The method of claim 6, wherein obtaining the video semantic vector and the audio semantic vector comprises:
Extracting the video semantic vector from the video transform feature map according to a pattern estimation method learned in advance based on the video enhancement map reflecting the audio semantic vector; and
Extracting the audio semantic vector from the audio transform feature map according to a pattern estimation method learned in advance based on the audio enhancement map reflecting the video semantic vector.

The method of claim 6, wherein obtaining the video enhancement map and the audio enhancement map comprises:
Obtaining a video enhancement vector by multiplying the transpose of each partial matrix of the video feature map by the audio semantic vector;
Obtaining an audio enhancement vector by multiplying the transpose of each partial matrix of the audio feature map by the video semantic vector; and
Receiving the video enhancement vector and the audio enhancement vector and normalizing them in a predetermined manner to obtain the video enhancement map and the audio enhancement map.

The method of claim 8, wherein the normalizing step
normalizes the video enhancement vector and the audio enhancement vector into probabilities using a softmax function.

delete