KR20180101959A

KR20180101959A - Method and system for extracting Video feature vector using multi-modal correlation

Info

Publication number: KR20180101959A
Application number: KR1020170028561A
Authority: KR
Inventors: 양지훈; 이정헌
Original assignee: 서강대학교산학협력단
Priority date: 2017-03-06
Filing date: 2017-03-06
Publication date: 2018-09-14
Also published as: KR101910089B1

Abstract

An objective of the present invention is to generate a single feature vector representing a video. The present invention relates to a method for extracting a feature vector of a video and a system thereof. The method and system extract an image and audio from a video, extract a p-dimensional image feature vector of the image, extract a q-dimensional audio feature vector of the audio, match the feature vectors into d dimensions, normalize the image and audio feature vectors into respective unit vectors, and perform correlation pooling on the normalized image and audio feature vectors to extract a single feature vector representing the video.

Description

The present invention relates to a method and system for extracting a video feature vector, and more particularly to a method and system for extracting a single feature vector representing a video by using multi-modal correlation.

As machine learning algorithms are studied and their performance improves, research is under way to develop human-level machine learning techniques in order to realize human-level artificial intelligence, the ultimate goal of the field. Humans take in data through five senses: touch, sight, smell, taste, and hearing. Human-level intelligence therefore means being able to accept and learn from diverse kinds of data as humans do, and to make appropriate judgments about new data. Most current machine learning, however, is uni-modal: it accepts only one restricted kind of input.

Within a single modality, machine learning sometimes surpasses humans. In a real human environment, however, the input is never a single refined modality; many kinds of modalities must be recognized and combined to reach a judgment. To apply more human-like artificial intelligence to real life, a system must be able to recognize and learn from multiple modalities. Multi-modality-based machine learning research is therefore being conducted so that, by taking in diverse kinds of data together as a person does, learning becomes more flexible and performance improves.

A representative subject of multi-modality-based machine learning research is the video classification system, because a video is data that carries several modalities such as image, audio, and text. Nevertheless, most video classification systems classify videos using images alone. Visual information does make up most of a video, but not every image in a video fits the video's topic. When an image unrelated to the topic is shown, other modalities such as speech or text may still relate to the topic and can be used in place of the image to classify it. Video classification performance can therefore be improved through multi-modality-based machine learning that uses audio and text information together with image information.

Most existing video classification systems use only images. They therefore classify images with CNNs (Convolutional Neural Networks), the algorithm typically used for image classification in machine learning, and apply the result to video classification. If a video is T seconds long, extracting one image per second yields T images from that video. Images are extracted from the training videos, and the CNNs are fine-tuned on each image with the event class of the video it came from. When the T images extracted from a new video are fed to the fine-tuned CNNs, an event class is predicted for each of the T images. The event class predicted most frequently within a video is then taken as the event class of that video. This is a voting-based video classification system. Going further, there are event classification systems that apply an ensemble technique, using several feature vector extraction and classification algorithms on the images and assigning the event class that receives the most votes.
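The voting scheme described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any cited system: the function name `classify_video_by_vote` and the example labels are hypothetical, and the per-frame predictions are assumed to have been produced already by a fine-tuned classifier.

```python
from collections import Counter


def classify_video_by_vote(frame_labels):
    """Return the event class predicted most often across the sampled frames."""
    if not frame_labels:
        raise ValueError("need at least one per-frame prediction")
    # most_common(1) returns [(label, count)]; on a tie, the label that
    # appeared first in frame_labels is returned.
    return Counter(frame_labels).most_common(1)[0][0]


# Hypothetical per-second predictions for a 6-second video (T = 6).
frame_labels = ["parade", "parade", "birthday", "parade", "wedding", "birthday"]
video_label = classify_video_by_vote(frame_labels)
```

Here three of the six frames vote for the same class, so that class becomes the label of the whole video.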

A video classification technique that resembles voting but can be approached in various ways is pooling. It applies the pooling technique that CNNs use to extract representative features from an image, and obtains one feature vector that represents the several images extracted from a video. For the T images extracted from a video, a matrix of T feature vectors is obtained through CNNs, converted into a single representative vector, and used for classification. Pooling methods include average pooling, max pooling, local pooling, and pooling with additional learning (slow and late pooling); apart from average pooling, max pooling gives the highest performance.
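As a minimal sketch of the two simplest pooling variants named above, the following reduces a T × p matrix of per-frame feature vectors to a single p-dimensional vector. The function names and the toy 3 × 3 matrix are illustrative assumptions; real feature matrices would come from a CNN.

```python
def average_pool(features):
    """Element-wise mean over the T rows of a T x p feature matrix."""
    t = len(features)
    return [sum(column) / t for column in zip(*features)]


def max_pool(features):
    """Element-wise maximum over the T rows of a T x p feature matrix."""
    return [max(column) for column in zip(*features)]


# Toy matrix: T = 3 frames, p = 3 feature dimensions (made-up values).
frame_features = [
    [0.0, 2.0, 1.0],
    [4.0, 0.0, 1.0],
    [2.0, 1.0, 4.0],
]
avg_vector = average_pool(frame_features)
max_vector = max_pool(frame_features)
```

Both results have length p regardless of T, which is why pooling handles videos of any length.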

With the voting and pooling approaches, a feature vector representing a video can be obtained quickly and easily regardless of the video's length. Representative studies that instead treat video as time-series data and classify video events accordingly are the 3D CNNs and LSTM-based video classification systems. A 3D CNNs video classification system converts the 2D images into a 3D form that includes the video length and then learns it through CNNs, so that spatio-temporal information is included at the learning stage. An LSTM-based video classification system uses LSTM (Long Short-Term Memory), which is based on RNNs (Recurrent Neural Networks), a representative machine learning algorithm for processing time-series data. It learns time-series information through an LSTM on top of the feature vectors of images extracted by CNNs, and can thereby learn the temporal information of a video effectively.

However, a 3D CNNs video classification system accepts input of a fixed size, so it is difficult to use when the video length varies. The LSTM-based approach applies an LSTM on top of a CNNs-based image classification algorithm, so two kinds of networks are present. The number of parameters therefore grows, which increases the training data and time required, and it is difficult to share parameters between the two kinds of networks during training.

Korean Patent Registration No. 10-0792016
Korean Patent Publication No. 10-2007-0107628

An object of the present invention, devised to solve the problems described above, is to provide a method and system for extracting a feature vector of a video that can generate a single feature vector representing the video by using multi-modal correlation pooling.

Another object of the present invention is to provide a video classification system and method that apply the above method to classify the event of a video using a single feature vector extracted from the two modalities obtainable from the video, image and audio, thereby improving video event classification performance.

According to a first aspect of the present invention for achieving the above technical object, a method for extracting a feature vector of a video composed of images and audio comprises: (a) extracting an image feature vector for the images of the video; (b) extracting an audio feature vector for the audio of the video; (c) normalizing the image feature vector and the audio feature vector each to a unit vector; and (d) generating a feature vector for the video by integrating the normalized image feature vector and the normalized audio feature vector, whereby a single feature vector representing the video is extracted.

In the method according to the first aspect, step (d) preferably extracts a correlation coefficient between the normalized image feature vector and the normalized audio feature vector, and generates the feature vector for the video by correlation pooling of the two normalized vectors using the correlation coefficient as a weight.

In the method according to the first aspect, the correlation coefficient is preferably the Pearson correlation coefficient between the normalized image feature vector and the normalized audio feature vector.
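For reference, the Pearson correlation coefficient between two equal-length vectors can be computed as sketched below. This shows only the standard coefficient itself; how the invention applies it as a pooling weight is not spelled out in this passage and is not modeled here. The function name `pearson` and the toy vectors are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("coefficient is undefined for a constant vector")
    return cov / (dev_x * dev_y)


# Toy d = 3 vectors: perfectly aligned and perfectly opposed to the first.
r_pos = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to +1
r_neg = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # close to -1
```

The coefficient lies in [-1, 1], with values near +1 or -1 indicating a strong linear relationship between the two vectors.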

In the method according to the first aspect, step (d) preferably generates the feature vector for the video by average pooling of the normalized image feature vector and the normalized audio feature vector, using the correlation coefficient as the weight.

In the method according to the first aspect, step (c) preferably uses a single-layer neural network to match the dimension of the image feature vector extracted in step (a) with the dimension of the audio feature vector extracted in step (b), and normalizes the dimension-matched image feature vector and audio feature vector to unit vectors, a unit vector being a vector of magnitude 1 that retains the properties of the original image or audio feature vector.

A video classification method according to a second aspect of the present invention classifies the event of a video using the single feature vector representing the video that is extracted by the feature vector extraction method according to the first aspect.

A system for extracting a feature vector of a video according to a third aspect of the present invention comprises: an image/audio extraction module that extracts images and audio from a video; an image feature vector extraction module that extracts an image feature vector for the images extracted by the image/audio extraction module; an audio feature vector extraction module that extracts an audio feature vector for the audio extracted by the image/audio extraction module; a dimension matching module that uses a single-layer neural network to match the dimension of the image feature vector extracted by the image feature vector extraction module with the dimension of the audio feature vector extracted by the audio feature vector extraction module; a normalization module that normalizes the dimension-matched image feature vector and audio feature vector each to a unit vector; and a vector integration module that integrates the normalized image feature vector and audio feature vector to extract one feature vector representing the video. The system thereby extracts and provides a single feature vector representing the video.

In the system according to the third aspect, the vector integration module preferably extracts a correlation coefficient between the normalized image feature vector and the normalized audio feature vector, and generates the feature vector for the video by correlation pooling of the two normalized vectors using the correlation coefficient as a weight.

In the system according to the third aspect, the correlation coefficient is preferably the Pearson correlation coefficient between the normalized image feature vector and the normalized audio feature vector.

In the system according to the third aspect, the vector integration module preferably generates the feature vector for the video by average pooling of the normalized image feature vector and the normalized audio feature vector, using the correlation coefficient as the weight.

A video classification system according to a fourth aspect of the present invention classifies the event of a video using the single feature vector representing the video that is extracted by the feature vector extraction system according to the third aspect.

The method and system according to the present invention extract an image feature vector and an audio feature vector from a video, match their dimensions, normalize them to unit vectors, and then use correlation pooling to extract a single feature vector representing the video.

Furthermore, by using the video feature vector extracted in this way, the event of a video can be classified more efficiently.

In the feature vector extraction method according to the present invention, multi-modal learning is attempted, learning jointly from the images and audio extracted from a video rather than from a single modality, so that artificial intelligence can learn in a more human-like way. By proposing to normalize each modality to a unit vector, the present invention allows the modalities to be combined effectively at the multi-modal integration stage. In this way, the several kinds of feature vectors obtainable from one video can be integrated efficiently into one representative feature vector.

It was also confirmed that, when multiple modalities are used, the proposed correlation pooling improves performance over existing pooling methods. This in turn confirmed that multimedia learning is feasible, so that artificial intelligence can become more human-like.

FIG. 1 is a flowchart sequentially illustrating a method of extracting a feature vector of a video according to a preferred embodiment of the present invention.
FIG. 2 is a graph illustrating linear relationships according to the Pearson correlation coefficient.
FIG. 3 is a structural diagram of AlexNet.
FIG. 4 is a structural diagram of GoogLeNet.
FIG. 5 is a block diagram illustrating the overall system for extracting a feature vector of a video according to a preferred embodiment of the present invention.

The method and system according to the present invention are characterized in that an image feature vector and an audio feature vector of a video are extracted, their dimensions are matched, they are normalized to unit vectors, and correlation pooling is then used to extract a single feature vector representing the video.

Hereinafter, a method and system for extracting a feature vector of a video according to a preferred embodiment of the present invention are described in detail with reference to the accompanying drawings.

<Method for extracting a feature vector of a video>

First, the method of extracting a feature vector of a video according to a preferred embodiment of the present invention is described in detail. FIG. 1 is a flowchart sequentially illustrating this method.

Referring to FIG. 1, the method according to the present invention extracts a single feature vector representing a video composed of images and audio, and comprises: extracting an image feature vector for the images of the video (step 100); extracting an audio feature vector for the audio of the video (step 110); matching the dimensions of the image feature vector and the audio feature vector (step 120); normalizing the dimension-matched image feature vector and audio feature vector each to a unit vector (step 130); and generating a feature vector for the video by integrating the normalized image feature vector and the normalized audio feature vector (step 140). Each of these steps is described in detail below.

First, the step of extracting an image feature vector for the images of the video (step 100) is described in detail.

The frame rate of a video is the number of images displayed per second. If the frame rate is f FPS (frames per second) and the video is T seconds long, a total of f × T images can be extracted from it. The longer the video and the higher the frame rate, the more images are used for training, and the longer training takes. Moreover, even when f different images pass in one second, the human eye can hardly perceive every one of them, and the images shown within one second are mostly similar because they form a continuous motion or scene rather than changing abruptly. The method according to the present invention therefore extracts one image every second, using T images in total.
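The reduction from f × T frames to T images can be sketched as an index computation. The text only says that one image is taken every second; choosing the first frame of each second, as below, is an assumption, as are the function name `one_frame_per_second` and the example values.

```python
def one_frame_per_second(fps, duration_s):
    """Indices of the frames kept when sampling one image per second.

    A video of duration_s seconds at fps frames per second has
    fps * duration_s frames; keeping the first frame of each second
    (an assumed choice) leaves duration_s of them.
    """
    return [second * fps for second in range(duration_s)]


# Hypothetical 4-second clip at 30 FPS: 120 frames in total, 4 kept.
indices = one_frame_per_second(fps=30, duration_s=4)
total_frames = 30 * 4
```

With these values, only 4 of the 120 frames are passed on to feature extraction.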

For a video of length T, one image is taken at each second t; the set of the T images extracted in this way is expressed by Equation 1.

Image feature vectors are extracted using two CNNs (Convolutional Neural Networks) algorithms. CNNs, the algorithms typically used for image classification in machine learning, consist of several convolutional layers and pooling layers. A convolutional layer acts as a kind of filter: it extracts features from its input by convolving the input with a large number of learnable weights. A pooling layer reduces the dimension of the features extracted by the convolutional layers. Finally, the extracted features are classified through two neural network layers. The image corresponding to second t is thus converted by the CNNs algorithm into a p-dimensional feature vector, as in Equation 2.

Performing this for all T images yields a T × p feature vector matrix for the video, as in Equation 3.

One of the two CNNs algorithms is AlexNet (Krizhevsky, A., Sutskever, I., and Hinton, E. Imagenet classification with deep convolutional neural networks. Proceedings of Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.), which achieved the highest performance with an innovative architecture at ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) 2012, an image analysis competition. FIG. 3 is a structural diagram of AlexNet. As shown in FIG. 3, it consists of five stages of convolutional and pooling layers followed by three fully connected (fc) layers. In the present invention, an AlexNet model pre-trained on the 1000-class ILSVRC 2012 data is used, and a 4096-dimensional feature vector is extracted from fc7, the second of the fc layers.

The other CNNs algorithm is Inception-v3 (Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016), the third version of GoogLeNet (Inception) (Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... and Rabinovich, A. Going deeper with convolutions. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.), a CNNs-based image classification algorithm developed at Google. FIG. 4 is a structural diagram of GoogLeNet. As shown in FIG. 4, GoogLeNet uses nine Inception modules, each of which applies convolutional and pooling layers both widely and deeply. It also uses two auxiliary classifiers at intermediate points to address the problems that arise as the architecture becomes deeper. Inception-v3 improves on this so that the Inception module is used more efficiently, widely and deeply. In the present invention, as with AlexNet, an Inception-v3 model pre-trained on ILSVRC 2012 data is used, and a 2048-dimensional feature vector is extracted from the pool3 layer, which performs the average pooling for the final classifier.

Next, the step of extracting an audio feature vector for the audio of the video (step 110) is described in detail.

Unlike an image, audio is represented as time-series data. When audio is extracted from the video, the audio in the interval of ±0.5 seconds around each second t at which an image is extracted is therefore taken. The set of audio segments extracted in this way from a video of length T is expressed by Equation 4.
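The ±0.5 second audio window around each sampled second t can be sketched as a slice over raw samples. This is a minimal illustration under stated assumptions: the audio is taken to be a flat list of samples, the function name `audio_window` and the tiny sample rate are hypothetical, and clipping at the start and end of the signal is an assumed way to handle the boundaries, which the text does not specify.

```python
def audio_window(samples, sample_rate, t):
    """Samples in the one-second window centered on second t.

    The window covers [t - 0.5 s, t + 0.5 s) and is clipped to the
    bounds of the signal (an assumed boundary policy).
    """
    half = sample_rate // 2
    center = t * sample_rate
    start = max(0, center - half)
    end = min(len(samples), center + half)
    return samples[start:end]


sample_rate = 8  # hypothetical, tiny rate so the slice is easy to read
samples = list(range(3 * sample_rate))  # 3 seconds of dummy samples: 0..23
window = audio_window(samples, sample_rate, t=1)
first_window = audio_window(samples, sample_rate, t=0)
```

Each window would then be passed to an audio feature extractor to obtain one feature vector per sampled second.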

In the video feature vector extraction method according to the present invention, feature vectors are extracted from the audio using MFCC (Mel Frequency Cepstral Coefficient) (Sahidullah, M., and Saha, G. Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Communication, Vol. 54, No. 4, pp. 543-565, 2012.), a representation widely used in speech signal processing to express the characteristics of speech. The audio corresponding to t ± 0.5 seconds is converted through MFCC into a q-dimensional audio feature vector, as in Equation 5.

Performing this for all T audio segments yields a T × q audio feature vector matrix for the video, as in Equation 6.

Next, the step of matching the dimensions of the image feature vector and the audio feature vector (step 120) and the step of normalizing the dimension-matched image feature vector and audio feature vector each to a unit vector (step 130) are described in detail.

The image feature vector and the audio feature vector obtained in steps 100 and 110 have dimensions p and q respectively, so ordinary pooling techniques are difficult to apply to them directly. Their dimensions need to be made equal, and the correlation between the two must be reflected at the integration stage. A single-layer neural network is therefore used to map both to d dimensions. As the activation function of the neural network, ReLU (Rectified Linear Unit) (Nair, V., and Hinton, E. Rectified linear units improve restricted boltzmann machines. Proceedings of International Conference on Machine Learning, pp. 807-814, 2010.), which is typically used in deep learning algorithms, can be used, as in Equation 7.
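The dimension-matching step can be sketched as one fully connected layer with a ReLU activation per modality, mapping a p-dimensional and a q-dimensional vector to the same d dimensions. The weights, biases, toy sizes, and function names below are hypothetical; in practice the weights would be learned, and the exact layer definition in the source equations is not reproduced here.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def single_layer(vector, weights, bias):
    """Map an n-dimensional vector to d dimensions as ReLU(W v + b).

    weights is a d x n matrix given as a list of rows; bias has length d.
    """
    return [
        relu(sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


# Hypothetical toy sizes: p = 3 (image), q = 2 (audio), d = 2 (shared).
image_vec = [1.0, -2.0, 0.5]
audio_vec = [0.5, 1.0]
w_img = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # made-up weights, d x p
b_img = [0.0, 0.0]
w_aud = [[2.0, -1.0], [1.0, 1.0]]  # made-up weights, d x q
b_aud = [0.5, 0.0]

image_d = single_layer(image_vec, w_img, b_img)
audio_d = single_layer(audio_vec, w_aud, b_aud)
```

After this step both vectors have length d, so element-wise operations between the two modalities become possible. Note that ReLU clips negative pre-activations to zero but places no upper bound on the outputs.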

The feature vector extraction method according to the present invention aims to extract a new feature vector representing the video by combining the feature vectors of multiple modalities into one. However, as Equation (7) shows, the output of ReLU, the activation function of the neural network, is unbounded above. This can cause the distributions of the two sets of feature vectors to differ in scale at the pooling step, and if the difference is large, the features of the modality with the smaller scale will not be properly reflected. Normalizing to unit vectors as in Equation (8) resolves this problem and allows the correlation between the two modalities to be obtained efficiently.

In the present invention, each d-dimensional feature vector extracted through the single-layer neural network is normalized to a unit vector of magnitude 1 while the properties of the vector are preserved, as follows. Denoting the neural networks for image and audio by φ_img(·) and φ_aud(·), the dimensions are matched as in Equations (9) and (10) to obtain the d-dimensional image feature vector (V'^img) and audio feature vector (V'^aud).

The image and audio feature vectors finally obtained by converting these to unit vectors are given by Equations (11) and (12).
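The unit-vector normalization can be illustrated with a short sketch. It assumes that "magnitude 1" refers to the Euclidean (L2) norm, which is the usual reading but is not stated explicitly in the text. The handling of an all-zero vector (possible after ReLU) is also an assumption, and `to_unit_vector` is a hypothetical name.

```python
import math


def to_unit_vector(vector, eps=1e-12):
    """Scale a vector to Euclidean length 1 while keeping its direction.

    Assumption: an (almost) all-zero vector is returned unchanged,
    since it has no direction to preserve.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm < eps:
        return list(vector)
    return [x / norm for x in vector]


unit = to_unit_vector([3.0, 4.0])
print(unit)
```

After this step every row of the image and audio matrices has the same length, so neither modality dominates the pooling step purely because of scale.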

Next, the step of generating a feature vector for the video by integrating the normalized image feature vector and the normalized audio feature vector (step 140) is described in detail.

A single feature vector (u) representing the video is generated from the finally obtained image and audio feature vectors.

Correlation analysis is an analytical method for identifying the linear relationship between two variables. The strength of this linear relationship is expressed by a correlation coefficient, for which the Pearson correlation coefficient is commonly used. The Pearson correlation coefficient is obtained by dividing the covariance of the two variables, which measures how they vary together, by the product of their standard deviations, which measure how each variable varies on its own, and takes a value between -1 and 1. FIG. 2 is a graph illustrating the linear relationship for different values of the Pearson correlation coefficient. Referring to FIG. 2, when the distributions of the two variables are similar, the coefficient is close to 1 (positive correlation), and when the distributions are opposite, the coefficient is close to -1 (negative correlation). The closer the coefficient is to 0, the weaker the linear relationship.

Correlation coefficients have been used, for example, to spread out the distributions of several models so that individual networks specialize within an ensemble, and in correlation neural networks, which are used to learn multi-modal data with autoencoders. A correlation neural network takes two modalities as input and, when learning each feature vector through an autoencoder, uses the Pearson correlation coefficient with a certain weight in the process of minimizing the objective function. As a result, when the two modalities are input and learned, they can be induced to be correlated with each other.

The present invention uses pooling so that videos of variable length can be handled and the correlation between the two modalities can be reflected effectively. Max pooling, a representative pooling method, can be expressed as in Equation (13).

Max pooling finds the maximum value of each column of the feature vector matrix (V). Consequently, when image and audio feature vectors are input together as multiple modalities, there is no longer any point in keeping the two feature vectors paired. Therefore, in the feature vector extraction method according to a preferred embodiment of the present invention, average pooling as in Equation (14) is used so that, when image and audio feature vectors are input, the correlation can be reflected while the pairing of the two feature vectors is maintained.
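The two pooling methods discussed here can be compared with a small sketch over a T × d matrix stored as a list of rows. The function names are hypothetical. Both reduce any number of rows T to a single d-dimensional vector, which is why pooling suits videos of variable length.

```python
def max_pooling(matrix):
    """Column-wise maximum over the T rows of a T x d matrix."""
    return [max(column) for column in zip(*matrix)]


def average_pooling(matrix):
    """Column-wise mean over the T rows of a T x d matrix."""
    return [sum(column) / len(column) for column in zip(*matrix)]


V = [
    [1.0, 4.0],
    [3.0, 0.0],
    [2.0, 2.0],
]
print(max_pooling(V))      # [3.0, 4.0]
print(average_pooling(V))  # [2.0, 2.0]
```

In the max-pooled result the two entries come from different rows (time steps), whereas the average treats every row as a whole, which is what allows a per-time-step weight to be attached to each row.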

The Pearson correlation coefficient is used to reflect the correlation between the two modalities. The Pearson correlation coefficient corr(X, Y) between variables X and Y is given by Equation (15).
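For reference, the standard Pearson coefficient, the covariance divided by the product of the standard deviations, can be computed as in the sketch below. Returning 0.0 when one input has zero variance is an assumption made here to avoid division by zero. The text does not say how that case is handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The common 1/n factors of covariance and variances cancel, so sums suffice.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / denom


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # -1.0
```

In the setting of this document, x and y would be the d-dimensional image and audio feature vectors at the same time t.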

The correlation pooling used in the feature vector extraction method according to a preferred embodiment of the present invention is a method that, when performing average pooling, reflects the correlation coefficient between the image and audio feature vectors at time t as a weight. The closer the Pearson correlation coefficient is to 1, the more similar the distributions of the two variables are; in this case it can be assumed that both the image and the audio extracted from the video at time t are feature vectors representing the video. Conversely, if the Pearson correlation coefficient approaches -1, it can be assumed that only one of the two is a feature vector representing the video.

Therefore, in average pooling, the correlation coefficient between image and audio is computed at every time t so that, when both modalities are likely to represent the video, they have a greater influence on the video feature vector (u). In addition, since the correlation coefficient ranges from -1 to 1, using feature vectors normalized to unit vectors as input allows the correlation to be reflected more efficiently. Accordingly, as shown in Equation (16), correlation pooling is performed using the final image and audio feature vector matrices that were normalized to unit vectors in the previous step.
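Equation (16) is not reproduced in this text, so the following is only one plausible reading of correlation pooling: an average over time in which each time step is weighted by the Pearson coefficient between its image and audio vectors. The exact weighting, the way the two modalities are combined at each time step (an element-wise mean here), and the function names are all assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input (assumption)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0.0 else 0.0


def correlation_pooling(image_rows, audio_rows):
    """Hypothetical correlation-weighted average pooling.

    image_rows, audio_rows -- T x d matrices of unit-normalized vectors.
    Returns a d-dimensional vector u.
    """
    num_steps = len(image_rows)
    dim = len(image_rows[0])
    pooled = [0.0] * dim
    for img, aud in zip(image_rows, audio_rows):
        weight = pearson(img, aud)  # in [-1, 1]
        for j in range(dim):
            # Assumed combination: element-wise mean of the paired vectors.
            pooled[j] += weight * (img[j] + aud[j]) / 2.0
    return [value / num_steps for value in pooled]


image_rows = [[0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
audio_rows = [[0.6, 0.8, 0.0], [0.8, 0.6, 0.0]]
u = correlation_pooling(image_rows, audio_rows)
print(u)
```

In this sketch a time step whose image and audio vectors agree contributes with weight near 1, while a negatively correlated pair contributes with a negative weight. Whether negative coefficients should be used directly, clipped, or rescaled is a design choice that the surrounding text does not settle.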

Meanwhile, the video feature vector (u) obtained by correlation pooling through the above-described feature vector extraction method is passed to an SVM (Support Vector Machine) to obtain a probability distribution over event classes, and an event for the video is then selected through soft-max, so that the video can be classified.
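The final selection step can be sketched as a soft-max over per-class scores followed by picking the most probable class. The scores below are made-up numbers standing in for the output of a trained SVM, which is not shown, and the class names and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max: subtract the maximum before exponentiating."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_event(scores, class_names):
    """Return the most probable class name and the full probability list."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


# Placeholder per-class scores (assumption: produced by an SVM from the vector u).
scores = [0.1, 2.0, -1.0]
class_names = ["event_a", "event_b", "event_c"]
event, probs = pick_event(scores, class_names)
print(event, probs)
```

Because soft-max preserves the ordering of the scores, the selected class is the one with the highest score; the probabilities are mainly useful when a confidence value is needed alongside the label.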

<Feature vector extraction system for video>

Hereinafter, a feature vector extraction system for video according to a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 5 is a block diagram showing the overall feature vector extraction system for video according to a preferred embodiment of the present invention. Referring to FIG. 5, the feature vector extraction system (30) for video according to the present invention includes an image/audio extraction module (300), an image feature vector extraction module (310), an audio feature vector extraction module (312), a dimension matching module (320), a normalization module (330), and a vector integration module (340), and extracts and provides a single feature vector representing a video composed of images and audio.

The image/audio extraction module (300) extracts images and audio, respectively, from the video.

The image feature vector extraction module (310) extracts and provides image feature vectors for the images extracted by the image/audio extraction module. The feature vector extraction system according to a preferred embodiment of the present invention uses two CNN algorithms to convert each image into a p-dimensional feature vector, and for a video of length T extracts and provides a T × p image feature vector matrix as in Equation (3).

The audio feature vector extraction module (312) extracts and provides audio feature vectors for the audio extracted by the image/audio extraction module. The feature vector extraction system according to a preferred embodiment of the present invention extracts feature vectors from audio using MFCC (Mel Frequency Cepstral Coefficient), a method widely used in speech signal processing to represent the characteristics of speech. Each audio segment is thereby converted into a q-dimensional feature vector, and a T × q audio feature vector matrix as in Equation (6) is extracted from the video and provided.

The dimension matching module (320) uses a single-layer neural network to match the dimension of the image feature vector extracted by the image feature vector extraction module and the dimension of the audio feature vector extracted by the audio feature vector extraction module to a single common dimension. The dimension matching module (320) thus obtains and provides d-dimensional image and audio feature vectors.

The normalization module (330) normalizes each of the image and audio feature vectors, which both have d dimensions after dimension matching by the dimension matching module, into a unit vector and provides the result.

The vector integration module (340) extracts a correlation coefficient between the normalized image feature vector and the normalized audio feature vector and, using the correlation coefficient as a weight, performs correlation pooling on the normalized image and audio feature vectors to extract and provide a single feature vector (u) representing the video. The correlation coefficient is preferably the Pearson correlation coefficient between the normalized image feature vector and the normalized audio feature vector.

Meanwhile, the vector integration module preferably generates the feature vector for the video by average pooling of the normalized image and audio feature vectors, using the correlation coefficient as a weight in the average pooling.

The feature vector extraction system for video according to a preferred embodiment of the present invention, configured as described above, extracts images and audio from a video containing both, extracts p-dimensional image feature vectors and q-dimensional audio feature vectors through CNN algorithms and MFCC, respectively, matches their dimensions through a single-layer neural network, normalizes them to unit vectors, and then uses correlation pooling to generate and provide a single feature vector (u) representing the video.

Meanwhile, the video classification system according to the present invention extracts a single feature vector (u) for a video using the above-described feature vector extraction system, obtains a probability distribution over event classes through an SVM (Support Vector Machine), and selects an event for the video through soft-max, so that the video can be classified.

To verify the above-described video feature vector generation method according to the present invention, comparative experiments were performed under various conditions as shown in Table 1. However, because the proposed correlation pooling, which is intended to reflect multi-modal correlation, is hardly affected by the correlation coefficient when normalization is not performed, it was tested only with unit-vector normalization. For images, performance is additionally compared between feature vectors extracted with AlexNet and with Inception-v3.

YLI-MED data was used as experimental data in order to work with multiple modalities. It is a dataset used for Multimedia Event Detection (MED) research and consists of 1823 videos containing images and audio, drawn from YFCC100M (Yahoo Flickr Creative Commons 100 Million). The average video length is about 43 seconds, and there are 10 event classes. The data is split into 1000 training videos and 823 test videos; details for each class are given in Table 2.

For additional experiments, 1369 videos containing images and audio were collected from YouTube. The average video length is about 54 seconds, and there are 16 event classes in total. The data is split into 906 training videos and 463 test videos; information on each class is given in Table 3.

The results of the experiments on the YLI-MED and YouTube data under the conditions defined above are shown in Tables 4 and 5, respectively.

The experiments showed a large difference in performance depending on the CNN algorithm used to extract image feature vectors. This is because Inception-v3, as the most recent image classification algorithm, improves on the performance of AlexNet. Two algorithms were used for images not to compare the CNN algorithms themselves, but to verify the performance of the methods proposed in the present invention by experimenting with different feature extraction algorithms.

As Tables 4 and 5 show, the proposed unit-vector normalization improved performance for every pooling method. In addition, the proposed correlation pooling achieved the highest performance overall.

The present invention has been described above with reference to preferred embodiments, but these are merely examples and do not limit the invention. Those of ordinary skill in the art to which the invention pertains will understand that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the invention. Differences relating to such modifications and applications should be construed as falling within the scope of the invention as defined in the appended claims.

30: Video feature vector extraction system
300: Image/audio extraction module
310: Image feature vector extraction module
312: Audio feature vector extraction module
320: Dimension matching module
330: Normalization module
340: Vector integration module

Claims

1. A method for extracting a feature vector of a video comprising images and audio, the method comprising:
(a) extracting an image feature vector for an image of the video;
(b) extracting an audio feature vector for audio of the video;
(c) normalizing the image feature vector and the audio feature vector, respectively, using a unit vector; and
(d) generating a feature vector for the video by integrating the normalized image feature vector and the normalized audio feature vector,
whereby a single feature vector representing the video is extracted.

2. The method of claim 1, wherein step (d) comprises extracting a correlation coefficient between the normalized image feature vector and the normalized audio feature vector, and generating the feature vector for the video by correlation pooling of the normalized image feature vector and the normalized audio feature vector using the correlation coefficient as a weight.

3. The method of claim 2, wherein the correlation coefficient is a Pearson correlation coefficient for the normalized image feature vector and the audio feature vector.

4. The method of claim 2, wherein in step (d) the feature vector for the video is generated by average pooling of the normalized image feature vector and the normalized audio feature vector, with the correlation coefficient used as a weight.

5. The method of claim 1, wherein in step (c) the dimension of the image feature vector extracted in step (a) and the dimension of the audio feature vector extracted in step (b) are matched using a single-layer neural network, and the dimension-matched image feature vector and audio feature vector are each normalized using a unit vector,
wherein the unit vector is a vector of magnitude 1 that preserves the properties of the image feature vector and the audio feature vector.

6. A method for classifying a video, wherein the video is classified using a single feature vector representing the video extracted by the feature vector extraction method according to any one of claims 1 to 5.

7. A system for extracting a feature vector of a video, the system comprising:
an image/audio extraction module for extracting an image and audio, respectively, from a video;
an image feature vector extraction module for extracting an image feature vector for the image extracted by the image/audio extraction module;
an audio feature vector extraction module for extracting an audio feature vector for the audio extracted by the image/audio extraction module;
a dimension matching module for matching, using a single-layer neural network, the dimension of the image feature vector extracted by the image feature vector extraction module and the dimension of the audio feature vector extracted by the audio feature vector extraction module;
a normalization module for normalizing, using a unit vector, each of the image feature vector and the audio feature vector whose dimensions have been matched by the dimension matching module; and
a vector integration module for extracting a feature vector representing the video by integrating the normalized image feature vector and the normalized audio feature vector,
wherein the system extracts and provides a single feature vector representing the video.

8. The system of claim 7, wherein the vector integration module extracts a correlation coefficient between the normalized image feature vector and the normalized audio feature vector, and generates the feature vector for the video by correlation pooling of the normalized image feature vector and the normalized audio feature vector using the correlation coefficient as a weight.

9. The system of claim 8, wherein the correlation coefficient is a Pearson correlation coefficient for the normalized image feature vector and the audio feature vector.

10. The system of claim 8, wherein the vector integration module generates the feature vector for the video by average pooling of the normalized image feature vector and the normalized audio feature vector, with the correlation coefficient used as a weight.

11. A video classification system, wherein an event for a video is classified using a single feature vector representing the video extracted by the video feature vector extraction system according to any one of claims 7 to 10.