KR102505592B1

KR102505592B1 - Video captioning method based on multi-representation switching, recording medium and system for performing the same

Info

Publication number: KR102505592B1
Application number: KR1020210073548A
Authority: KR
Inventors: 이수원; 김희찬
Original assignee: 숭실대학교 산학협력단
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2023-03-02
Also published as: KR102505592B9; KR20220165061A

Abstract

A multi-representation switching-based video captioning method performed by the multi-representation switching-based video captioning system of the present invention may include: extracting object features frame by frame from an input video; extracting motion features based on the extracted object features; extracting grammar features based on a given word sequence; calculating weights for the object features and the motion features to compute weighted sums; and generating a description of the video based on the grammar features, the weighted-sum object features, and the weighted-sum motion features. This makes it possible to automatically generate multiple captions containing varied expressions for the video.

Description

Video captioning method based on multi-representation switching, recording medium and system for performing the same

The present invention relates to a multi-representation switching-based video captioning method, and a recording medium and a system for performing the same, and more particularly to a multi-representation switching-based video captioning method capable of automatically generating natural-language sentences for a video composed of consecutive images, and a recording medium and a system for performing the same.

Deep learning, which grew out of neural network research, is now a very active field, and in computer vision and natural language processing many deep learning approaches clearly outperform earlier machine learning methods.

Video description, which summarizes a video, is one of the research fields that combine image processing and natural language processing. In the past, computer vision research such as image classification and natural language processing research such as automatic summarization were pursued separately.

Recently, however, as the need for multi-modal data analysis such as automatic video caption generation and video surveillance has grown, techniques that integrate computer vision and natural language processing have been actively studied.

The prior art, however, concentrates on methods for extracting representations that capture the object and motion characteristics of a video, and such methods have difficulty modeling whether a given word expresses video information or is merely grammatically required.

In addition, the prior art solves the problem with methods of high computational complexity, which demand substantial resources for training and evaluation; this becomes an even greater problem when fast response times are required at deployment.

Republic of Korea Patent Publication No. 10-2015-0057591

The present invention has been devised to solve the above problems, and its object is to provide a multi-representation switching-based video captioning method that can describe the various features of the frames in a video and automatically generate multiple captions containing varied expressions, and a recording medium and a system for performing the same.

To achieve the above object, a multi-representation switching-based video captioning method performed by a multi-representation switching-based video captioning system according to an embodiment of the present invention includes: extracting object features frame by frame from an input video; extracting motion features based on the extracted object features; extracting grammar features based on a given word sequence; calculating weights for the object features and the motion features to compute weighted sums; and generating a description of the video based on the grammar features, the weighted-sum object features, and the weighted-sum motion features.

Here, extracting the object features may include mean pooling the extracted object features.

In extracting the grammar features, the grammar features may be extracted based on the given word sequence together with the concatenation of the all-frame mean vectors of the extracted object feature vectors and motion feature vectors.

Calculating the weighted sums may use an attention mechanism to calculate per-frame weights, and the extracted object features and motion features needed to generate a word of the video description may be selected through the attention mechanism each time a word is generated.

Generating the description may include: generating a word generation probability representation for each of three cases, namely using only the grammar features, using only the object features, and using only the motion features; computing a final word probability distribution based on the word generation probability representations; and generating, as the final word, the word with the highest probability in the computed word probability distribution.
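The three-way combination described above can be pictured with a small sketch. The source does not state how the three representations are combined into the final distribution, so the softmax gate below (`gate_logits`) is an assumption made purely for illustration, as are all function and variable names.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def switch_word_distribution(p_grammar, p_object, p_motion, gate_logits):
    """Mix three per-representation word distributions into one final distribution."""
    # Assumed gating: the source does not specify how the mix is computed.
    gates = softmax(gate_logits)
    return [
        gates[0] * g + gates[1] * o + gates[2] * m
        for g, o, m in zip(p_grammar, p_object, p_motion)
    ]

def pick_word(vocab, distribution):
    """Return the word with the highest probability in the final distribution."""
    best = max(range(len(vocab)), key=lambda i: distribution[i])
    return vocab[best]

vocab = ["a", "baby", "dancing"]
final = switch_word_distribution(
    p_grammar=[0.8, 0.1, 0.1],
    p_object=[0.1, 0.8, 0.1],
    p_motion=[0.1, 0.1, 0.8],
    gate_logits=[0.0, 2.0, 0.0],  # hypothetical gate favoring the object representation
)
print(pick_word(vocab, final))
```

With the gate leaning toward the object representation, the object word wins; shifting the gate toward the grammar or motion representation would favor 'a' or 'dancing' instead.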

Generating the description may also repeat the steps of generating the word generation probability representations, computing the final word probability distribution, and generating the word with the highest probability as the final word, thereby producing the word sequence for the video description.

In generating the description, the word sequence may also be generated using a beam search algorithm.
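As a rough illustration of beam search over word sequences, the sketch below keeps the best few partial captions by summed log-probability. The `toy_model` table is a hypothetical stand-in for the decoder's final word distribution, and the `<eos>` end token is likewise an assumption.

```python
import math

def beam_search(next_word_probs, beam_width, max_len, end_token="<eos>"):
    """Keep the beam_width best partial sequences by summed log-probability."""
    beams = [([], 0.0)]  # (word sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == end_token:
                candidates.append((words, score))  # finished sequences carry over unchanged
                continue
            for word, prob in next_word_probs(words).items():
                if prob > 0.0:
                    candidates.append((words + [word], score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
        if all(words[-1] == end_token for words, _ in beams):
            break
    return beams

def toy_model(prefix):
    # Hypothetical next-word probabilities, not taken from the source.
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"baby": 0.5, "dog": 0.5},
        ("the",): {"baby": 0.9, "dog": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

best_words, best_score = beam_search(toy_model, beam_width=2, max_len=4)[0]
print(best_words)
```

In this toy table a greedy choice would start with 'a' (probability 0.6), yet the beam keeps 'the' alive and ends with the higher-probability sequence, which is the usual motivation for beam search over picking the single best word at every step.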

Meanwhile, another embodiment of the present invention for achieving the above object may be a computer-readable recording medium on which a computer program for performing the multi-representation switching-based video captioning method is recorded.

Meanwhile, a multi-representation switching-based video captioning system according to an embodiment of the present invention for achieving the above object includes: an input unit that receives a video; and a processor that, when the video is input, generates a video caption using a deep learning model, and that receives video-description pairs as training data to train the deep learning model.

Here, the deep learning model may include: an encoder including a 2D convolutional neural network (CNN) module that extracts object features frame by frame from the input video and a first recurrent neural network (RNN) module that extracts motion features based on the extracted object features; and a decoder including a second RNN module that extracts grammar features using a given word sequence, an attention module that calculates per-frame weights for the object features and motion features to compute weighted sums, and a switch module that generates a description of the video based on the extracted grammar features, the weighted-sum object features, and the weighted-sum motion features.

The 2D CNN module may perform a mean pooling operation on the extracted object features.

The second RNN module may extract the grammar features based on the given word sequence together with the concatenation of the all-frame mean vectors of the extracted object feature vectors and motion feature vectors.

The attention module may use an attention mechanism to calculate per-frame weights, and the extracted object features and motion features needed to generate a word of the video description may be selected through the attention mechanism each time a word is generated.

The switch module may generate a word generation probability representation for each of three cases (using only the grammar features, using only the object features, and using only the motion features), compute a final word probability distribution based on those representations, and generate the word with the highest probability in that distribution as the final word; by repeating this process it may generate the word sequence for the video description.

The switch module may also generate the word sequence using a beam search algorithm.

According to one aspect of the present invention described above, providing a multi-representation switching-based video captioning method, and a recording medium and a system for performing the same, makes it possible to describe the various features of the image frames that make up a video and to automatically generate multiple captions containing varied expressions. It also makes it possible to generate captions automatically without preprocessing based on computer vision and natural language processing techniques and without a separate loss function.

FIG. 1 is a diagram illustrating data used for video captioning;
FIG. 2 is a diagram illustrating the structure of a deep learning model for multi-representation switching-based video captioning according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating object feature extraction in the deep learning model according to this embodiment;
FIG. 4 is a flowchart illustrating the multi-representation switching-based video captioning method according to this embodiment;
FIG. 5 is a diagram illustrating video captioning results according to this embodiment;
FIG. 6 is a graph illustrating video captioning results according to this embodiment;
FIG. 7 is a graph illustrating video captioning results according to this embodiment;
FIGS. 8 to 10 are diagrams illustrating video captioning results according to this embodiment; and
FIG. 11 is a block diagram illustrating a multi-representation switching-based video captioning system according to an embodiment of the present invention.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described here in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals denote the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the drawings.

FIG. 1 is a diagram illustrating data used for video captioning, FIG. 2 is a diagram illustrating the structure of a deep learning model for multi-representation switching-based video captioning according to an embodiment of the present invention, FIG. 3 is a diagram illustrating object feature extraction in the deep learning model according to this embodiment, and FIG. 4 is a flowchart illustrating the multi-representation switching-based video captioning method according to this embodiment.

The data used by the deep learning model for video captioning according to an embodiment of the present invention is a given video and a given description: the video consists of a sequence of images, and the description consists of a sequence of words. The relationship between the video and its caption is shown in FIG. 1.

The video in FIG. 1 shows a baby dancing on a chair. In the associated caption, 'baby', 'dancing', and 'chair' are words denoting the objects in the video and their motion, 'a' and 'is' are grammatically required words, and 'on' is a word expressing the relationship between objects. In other words, information about objects and motions in the video relates directly to words, whereas grammatically required words and words expressing relationships between objects can only be produced after learning the natural language.

Therefore, in order to learn the relationship between a video and its description without using a separate natural language processing method to distinguish articles, nouns, verbs, and so on, the deep learning model needs an internal structure that can learn these two kinds of words separately, as in this embodiment. To this end, the video captioning method according to an embodiment of the present invention is based on multi-representation switching.

The deep learning model according to this embodiment preferably has an encoder-decoder framework, as in the conceptual diagram of FIG. 2, so that it can learn video-description pairs and generate descriptions. The encoder (100) includes a 2D convolutional neural network (CNN) module (110) and a first recurrent neural network (RNN) module (120) for extracting features of the video, including objects and motions. The decoder (200) includes a second RNN module (210), an attention module (220), and a switch (switcher) module (230) for extracting word features and for selecting among the extracted object, motion, and word features to generate a description of the corresponding video.

Regarding the arrows in FIG. 2, a thin arrow means that a single vector is passed, and a thick arrow means that multiple vectors are passed. For example, the thick arrow from the 2D CNN module (110) to the first RNN module (120) means that the vector information of every frame output by the 2D CNN module (110) is passed to the first RNN module (120).

In the multi-representation switching-based video captioning method according to this embodiment, the 2D CNN module (110) first extracts object features for each frame in the video (S110). Extracting object features means extracting frame-level object information from the video. Before the object feature extraction step (S110), frame-level images from the video are input to the 2D CNN module (110) of the encoder (100); specifically, the 2D CNN module (110), which is provided to extract object features, may receive the vectors of the individual frames of the video.

Each frame is input as a 224x224x3 RGB frame, and each frame carries only RGB color information. Motion information between frames is preferably not used separately, and, in order to receive each frame as input, the frames may be processed independently of one another by a separate frame generation module (not shown).

The 2D CNN module (110) according to this embodiment uses ResNet50V2, which performs well with relatively few parameters. To extract object features of various granularities from the video, the outputs of several layers of the stacked ResNet50V2 are used: from the stack of convolution blocks, the blocks at which the output size changes are selected. Specifically, one frame is represented by six vectors, including the final layer, and the vector information extracted for each block is shown in FIG. 3.

The ResNet50V2 network is expressed as a function, which is defined as Equation 1 below.

[Equation 1]

The learnable parameters of the 2D CNN module (110) can be readily inferred from the existing ResNet50V2 structure and are omitted for readability. The individual object features are listed in Table 1 below.

[Table 1]

Layer output (layer name)    Dimension
Pool1_pool                   56x56x64
Conv2_block3_out             28x28x256
Conv3_block4_out             14x14x512
Conv4_block5_out             7x7x1024
Conv5_block6_out             7x7x2048
predictions                  1000

The layer names in Table 1 may differ depending on the deep learning library.

To learn motion features from these object representations, each object feature is mean-pooled and then embedded into a common dimension. The final object feature of the i-th frame obtained in this way is defined as Equation 2 below.

[Equation 2]

Here, the embedding matrix for each object representation is a learnable parameter. The last dimension of the output vector of each representation is taken from Table 1; for the first representation, for example, it is 64. The pooling operator in Equation 2 denotes mean pooling, and the object feature of the i-th frame is a vector of the common embedding dimension.

The mean-pooled per-frame object features extracted by the 2D CNN module (110) may be passed to the second RNN module (210).

Next, the first RNN module (120) extracts motion features using the extracted per-frame information (S120). Extracting motion features means using the per-frame information to extract motion information arising from the differences between frames. The first RNN module (120) computes motion information for each frame position; for example, the motion information at the third position is extracted using the information of the first, second, and third frames.

Based on the object features extracted by the 2D CNN module (110), the first RNN module (120) can learn the motion features of objects; for this purpose, the first RNN module (120) preferably uses a long short-term memory (LSTM) network.

The LSTM of the encoder (100) is expressed as a function, and the hidden state of the LSTM at the i-th frame is defined as Equation 3 below.

[Equation 3]

The learnable parameters of the first RNN module (120) can be readily inferred from the existing LSTM structure and are omitted for readability. The initial hidden state of the LSTM of the encoder (100) is defined as the zero vector, and in this embodiment the hidden states are treated as the motion features.
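The way a recurrent state accumulates information over frames can be sketched with a much simpler cell than the one the embodiment prefers. The code below uses a plain element-wise tanh recurrence, not an LSTM, and the weights `w_in` and `w_rec` are hypothetical scalars; it only illustrates that the state starts at the zero vector and that the state at frame i depends on frames 1 through i.

```python
import math

def recurrent_step(frame_feature, hidden, w_in, w_rec):
    """One step of a plain tanh recurrent cell (a simplification, not an LSTM)."""
    return [math.tanh(w_in * x + w_rec * h) for x, h in zip(frame_feature, hidden)]

def motion_features(frame_features, w_in=0.5, w_rec=0.5):
    hidden = [0.0] * len(frame_features[0])  # initial hidden state is the zero vector
    states = []
    for feature in frame_features:
        hidden = recurrent_step(feature, hidden, w_in, w_rec)
        states.append(hidden)  # the state after frame i summarizes frames 1..i
    return states

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = motion_features(frames)
print(len(states), len(states[0]))
```

Changing only the first frame changes the third state as well, which mirrors the statement that the motion information at the third position draws on the first, second, and third frames.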

Next, the second RNN module (210) extracts grammar features (S130).

As shown in FIG. 2, the decoder (200) according to this embodiment consists of two parts: the first is a grammar feature extraction (textual feature extraction) part that includes the second RNN module (210), and the second is a description generation part that includes the attention module (220) and the switch module (230).

Since the given description is a sequence of words, the second RNN module (210) of the decoder (200) can extract grammar features from the given word sequence. That is, using the words of the previous steps, it can extract and learn what word information is needed to generate the next word. For this purpose it preferably uses an LSTM, like the first RNN module (120) described above.

Similarly to the first RNN module (120) of the encoder (100), the LSTM of the second RNN module (210) of the decoder (200) is expressed as a function, and the hidden state of the LSTM at the t-th step is defined as Equation 4 below.

[Equation 4]

The learnable parameters of the second RNN module (210) can likewise be readily inferred from the existing LSTM structure and are omitted for readability.

Here, each word is represented by a one-hot vector, which is multiplied by an embedding matrix to give the embedding vector of that word in the embedding dimension. The embedding matrix is a learnable parameter. The initial hidden state and cell state of the LSTM of the decoder (200) are defined as the final hidden state and cell state of the encoder (100).
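The one-hot and embedding-matrix step amounts to a row lookup, which the following sketch makes concrete. The four-word vocabulary and the matrix values are hypothetical, chosen only so the result is easy to check by eye.

```python
def one_hot(index, size):
    """Vector with 1.0 at the word's index and 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_word(one_hot_vector, embedding_matrix):
    """Multiply a one-hot row vector by a (vocab x d) embedding matrix."""
    dims = len(embedding_matrix[0])
    return [
        sum(one_hot_vector[v] * embedding_matrix[v][d] for v in range(len(embedding_matrix)))
        for d in range(dims)
    ]

vocab = ["a", "baby", "is", "dancing"]
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # hypothetical weights
vector = embed_word(one_hot(vocab.index("baby"), len(vocab)), embedding_matrix)
print(vector)
```

Because only one entry of the one-hot vector is non-zero, the product simply selects the matrix row for that word, which is why libraries usually implement it as an index lookup.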

The concatenation of the all-frame mean of the object feature vectors and the all-frame mean of the motion feature vectors is defined as Equation 5 below.

[Equation 5]

That is, because the given video description carries information about the video, the grammar features are extracted based on a summary of the video: the mean of the per-frame object features and the mean of the per-position motion features.
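The video summary described here, a concatenation of two frame-averaged vectors, can be sketched in a few lines. The helper names and the toy two-frame inputs are assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors, one vector per frame."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]

def video_summary(object_features, motion_features):
    """Concatenate the frame-averaged object vector with the frame-averaged motion vector."""
    return mean_vector(object_features) + mean_vector(motion_features)

object_features = [[1.0, 3.0], [3.0, 5.0]]   # one vector per frame
motion_features = [[0.0, 2.0], [2.0, 4.0]]   # one hidden state per frame
print(video_summary(object_features, motion_features))  # [2.0, 4.0, 1.0, 3.0]
```

The result has the object dimensions followed by the motion dimensions, so its length is the sum of the two feature sizes regardless of how many frames the video has.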

Next, the attention module (220) computes a weighted sum of the object features and a weighted sum of the motion features (S140). When actual words are generated from the information extracted above, the information needed to generate the word at each step is selected through the attention module (220). Because the information needed is the information related to the word to be generated at the current step, weights for the important information are calculated from the object features and motion features, and the information is condensed into a weighted sum.

The attention module (220) can also select, through the attention mechanism, the object features and motion features needed each time a word is generated.

Because the decoder (200) including the attention module (220) learns using only the overall information of the video and the previous-word information, it needs to refer to the important frame information when generating a word. To this end, the attention module (220) according to this embodiment uses an attention mechanism to refer to the information output from the encoder (100), and that information may be defined as a context.

First, the object context at the t-th step is the weighted sum of the object representations, as in Equation 6 below.

[Equation 6]

여기서

는 t번째 단어와 k 번째 프레임 간의 객체 정보에 대한 주의 (attention) 점수를 의미하며, 이는 하기 수학식 7과 같다. here

Means an attention score for object information between the t-th word and the k-th frame, which is expressed in Equation 7 below.

[수학식 7][Equation 7]

이 때

는 표현 소프트맥스 함수(represents a softmax function)이고,

,

는 학습 가능한 파라미터이다. 그리고,

는 입력 i번째 영상 벡터

와 출력 t번째 은닉 상태 벡터

와의 주의집중 점수를 의미하며, 이상의 수학식 6과 수학식 7에서

는 모든 입력

={1, 2, …, I}에 대한 합을 계산하기 위하여 k로 변수를 선언한 것으로 동일한 정의를 따르게 된다.At this time

is the representation softmax function,

,

is a learnable parameter. and,

is the input ith image vector

and the output tth hidden state vector

Means the attention score of and, in

Equations

6 and 7 above

is all input

={1, 2, … , I} follows the same definition as declaring a variable as k to calculate the sum for I}.

Meanwhile, the motion context at the same step is the weighted sum of the motion representations, as in Equation 8 below.

[Equation 8]

The weight in Equation 8 is the attention score between the k-th motion representation and the grammar representation of the current step, and can be defined as in Equation 9 below.

[Equation 9]

The score in Equation 9 is the attention score between the i-th input hidden state vector and the t-th output hidden state vector. In Equations 8 and 9, the index k is again declared only to compute the sum over all inputs i = {1, 2, …, I} and follows the same definition as i.
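The object and motion contexts above share one pattern: score every frame against the current decoder state, normalize the scores with a softmax, and take the weighted sum of the frame representations. The sketch below illustrates that pattern only. The scoring function (a dot product of two linearly projected vectors), the names `attention_context`, `w_frame`, and `w_hidden`, and the toy inputs are assumptions for illustration; the source states only that the scores pass through a softmax and involve learnable parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(matrix, vector):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention_context(frame_vectors, hidden_state, w_frame, w_hidden):
    """Return (attention weights, context vector) for one decoding step."""
    query = project(w_hidden, hidden_state)
    # Assumed score: dot product of the projected hidden state and projected frame.
    scores = [
        sum(q * k for q, k in zip(query, project(w_frame, frame)))
        for frame in frame_vectors
    ]
    weights = softmax(scores)  # one weight per frame, summing to 1
    dim = len(frame_vectors[0])
    # Context = weighted sum of the frame representations.
    context = [
        sum(a * frame[d] for a, frame in zip(weights, frame_vectors))
        for d in range(dim)
    ]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical projection matrices
frames = [[1.0, 0.0], [0.0, 1.0]]     # two toy frame representations
weights, context = attention_context(frames, [1.0, 0.0], identity, identity)
```

In this toy case the first frame aligns with the hidden state, so it receives the larger weight and dominates the context vector.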

Through the above process, this embodiment obtains the object and motion information of the video together with the previous-word information.

The switch module 230 then generates the word sequence for the description (S150).

Specifically, word generation probability logits are produced from the grammar features computed by the second recurrent neural network module 210 and from the weighted-sum object features and weighted-sum motion features computed by the attention module 220. At the t-th step, the next-word probability logits based on each of these three fixed-size representations are defined as in Equation 10 below.

[Equation 10]

Here, the function applied in Equation 10 is an activation function, and the weight terms are learnable parameters.

This embodiment includes the switch module 230 so that, when the next word is generated, the necessary information can be appropriately selected from what these representations carry. The probability logits described above represent, respectively, the next-word information obtained using only the grammar features, only the object features, and only the motion features.

The switch module 230 obtains the final word probability distribution by multi-representation switching, using Equation 11 below.

[Equation 11]

Equation 11 contains two learnable parameters, and each importance term in it indicates the importance of the corresponding representation when decoding the t-th word; these terms are defined as in Equation 12 below.

[Equation 12]

Equation 12 likewise contains a learnable parameter.
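The switching step can be pictured as mixing the three sets of next-word logits with per-step importance weights and normalizing the result. The sketch below is a simplified illustration under assumptions: the importance weights are taken as a softmax over three raw scores passed in as `switch_scores`, the mixture is a plain weighted sum of logits, and the learnable parameters of Equations 11 and 12 are omitted. The function name and the toy vocabulary are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def switch_word_distribution(grammar_logits, object_logits, motion_logits,
                             switch_scores):
    """Mix three logit vectors with importance weights, then normalize."""
    # Assumed form: importance of each representation = softmax of raw scores.
    switch_weights = softmax(switch_scores)
    mixed = [
        switch_weights[0] * g + switch_weights[1] * o + switch_weights[2] * m
        for g, o, m in zip(grammar_logits, object_logits, motion_logits)
    ]
    return switch_weights, softmax(mixed)

vocab = ["is", "squirrel", "eating"]   # toy vocabulary
switch_weights, word_probs = switch_word_distribution(
    grammar_logits=[2.0, 0.0, 0.0],    # grammar-only view prefers "is"
    object_logits=[0.0, 2.0, 0.0],     # object-only view prefers "squirrel"
    motion_logits=[0.0, 0.0, 2.0],     # motion-only view prefers "eating"
    switch_scores=[0.0, 5.0, 0.0],     # this step relies on the object view
)
best_word = vocab[word_probs.index(max(word_probs))]
```

Because the object representation receives almost all of the importance weight at this step, the final distribution follows the object-only logits.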

The switch module 230 generates the word with the highest probability in the word probability distribution and repeats this process to produce the word sequence for the video description. The word sequence may be generated with the beam search algorithm, a kind of best-first search.

Thanks to multi-representation switching through the switch module 230, the decoder 200 in this embodiment can consider the object and motion information of the video and the textual information in the description separately, which makes it possible to model a single object or motion with several different words. In other words, the switch module 230 makes it possible to choose, among the representations produced from the grammar features, object features, and motion features, the one best suited to generating the next word.

The loss function used to train the deep learning model is the negative log-likelihood of the ground-truth words, defined as in Equation 13 below.

[Equation 13]
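As a small illustration of a negative log-likelihood loss over a word sequence, the sketch below sums the negative log of the probability that each step's distribution assigns to the ground-truth word. Summing over steps (rather than averaging) and the function name `negative_log_likelihood` are assumptions; the source states only that the loss is the negative log-likelihood of the correct words.

```python
import math

def negative_log_likelihood(step_distributions, target_ids):
    """Sum of -log p(target word) over the decoding steps."""
    return -sum(
        math.log(distribution[target])
        for distribution, target in zip(step_distributions, target_ids)
    )

# Two decoding steps over a toy two-word vocabulary.
distributions = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]                      # indices of the ground-truth words
loss = negative_log_likelihood(distributions, targets)
```

The loss shrinks toward zero as the model puts more probability on each ground-truth word.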

FIG. 5 is a diagram illustrating video captioning results according to this embodiment, FIGS. 6 and 7 are graphs illustrating video captioning results according to this embodiment, and FIGS. 8 to 10 are diagrams illustrating video captioning results according to this embodiment.

For an objective evaluation of caption generation by the captioning method described above, the Microsoft Research Video Description (MSVD) dataset, which is widely used for video captioning, was used. MSVD is an open-domain dataset of videos from a variety of sources such as cooking and movies. It contains 1,970 videos, and each video has about 41 descriptions on average.

The descriptions were tokenized with the wordpunct tokenizer of the Natural Language Toolkit (NLTK) and converted to lowercase. Apart from this tokenization, no preprocessing based on natural language processing techniques was performed. The dataset contains about 570,000 words in total, the average description is about 7 tokens long, and the vocabulary has 9,745 entries. Following previous studies, 1,200 videos were used for training, 100 for validation, and the remaining 670 for testing. A total of 20 frames were sampled at 2 frames per second, and each frame was cropped to a square and resized to 224x224 pixels. To center the RGB values, each value was divided by 127.5 and then 1 was subtracted. Sample data are shown in FIG. 5, where only five descriptions are shown.
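The preprocessing described above can be sketched with the standard library alone. This is an illustration under assumptions: the regular expression only approximates word/punctuation tokenization and is not the NLTK tokenizer itself, frame sampling is expressed as index selection at a fixed stride, and the helper names (`tokenize`, `sample_frame_indices`, `center_rgb`) are hypothetical. Cropping and resizing are omitted.

```python
import re

def tokenize(description):
    """Lowercase, then split into runs of word characters or punctuation."""
    return re.findall(r"\w+|[^\w\s]+", description.lower())

def sample_frame_indices(video_fps, total_frames, sample_fps=2.0, max_frames=20):
    """Pick frame indices at `sample_fps`, keeping at most `max_frames`."""
    stride = video_fps / sample_fps
    indices = []
    for i in range(max_frames):
        index = int(i * stride)
        if index >= total_frames:
            break  # the video is shorter than max_frames samples
        indices.append(index)
    return indices

def center_rgb(value):
    """Map an RGB value in [0, 255] to [-1, 1]: divide by 127.5, subtract 1."""
    return value / 127.5 - 1.0

tokens = tokenize("A squirrel is eating a nut.")
indices = sample_frame_indices(video_fps=30.0, total_frames=300)
```

With a 30 fps, 300-frame clip the sketch selects every 15th frame, giving exactly 20 indices; shorter clips simply yield fewer frames.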

Looking at FIG. 5 in detail, FIG. 5(a) shows that the same entity, 'squirrel', is referred to with different words such as 'Small animal', 'chipmunk', and 'hamster'. The 'nut' being eaten is likewise described as 'peanut' or simply 'food'. FIG. 5(b) shows a similar pattern and also contains misspellings such as 'ingrediants'. In the video of FIG. 5(c) there is an onion, which can be mistaken for an orange without careful attention. The descriptions in the MSVD data are written mostly in the progressive form.

For transfer learning, this embodiment used the Microsoft Common Objects in Context (MSCOCO) and Flickr30k image datasets. Their images and descriptions were preprocessed in the same way as the MSVD dataset.

For the video captioning method according to this embodiment, ResNet50V2 pre-trained on ImageNet was used, and the dimensions of the frame embedding, the recurrent neural networks (RNNs), the internal projection of the attention mechanism, and the word logits were all set to 512. The maximum input and output lengths were set to 20 and 10, respectively. The leaky rectified linear unit was used as the activation function, and the Adam optimizer was used with a learning rate of 5e-5, β1 of 0.9, β2 of 0.999, and ε of 1e-7. Only words that appear in the MSVD training set and the image datasets were used; with this setting, the vocabulary consists of 21,992 words. The hyperparameters of the proposed method are summarized in Table 2.

Table 2 hyperparameter values (hyperparameter symbols omitted): 20; 512; 10; 512; [56, 28, 14, 7, 7, 1]; 21,992; 512

Beam search is a greedy tree search algorithm that limits the number of candidates kept while searching for the best node to the beam size. Many researchers have used beam search to generate word sequences from a trained natural language generation model. In this setting, the goal of the search is to find the combination of words with the highest likelihood. In the present invention, candidates that produce the end token are set aside and stored separately. The candidates explored up to the maximum length are then pooled with the stored candidates that end in the end token, and the best word sequence among them is selected. Because the given descriptions are short, no separate length penalty of the kind used in machine translation or automatic summarization was applied. Beam search with a beam size of 5 was used to generate the descriptions.
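The search procedure described above (keep the best candidates per step, set aside any candidate that emits the end token, then pick the best sequence from the pooled finished and unfinished candidates, with no length penalty) can be sketched as follows. The function `beam_search`, the callback `next_log_probs`, and the toy transition table are illustrative assumptions, not the patented implementation.

```python
import math

def beam_search(next_log_probs, start_token, end_token, beam_size=5, max_len=10):
    """Return the (tokens, log-probability) pair with the highest score."""
    beams = [([start_token], 0.0)]
    finished = []  # candidates that produced the end token, stored separately
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_prob in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Pool finished candidates with those cut off at max_len; no length penalty.
    return max(finished + beams, key=lambda item: item[1])

def toy_model(tokens):
    """Hypothetical next-token log-probabilities keyed by the last token."""
    table = {
        "<s>": {"a": math.log(0.6), "the": math.log(0.4)},
        "a": {"cat": math.log(0.5), "dog": math.log(0.5)},
        "the": {"cat": math.log(0.9), "dog": math.log(0.1)},
        "cat": {"</s>": 0.0},
        "dog": {"</s>": 0.0},
    }
    return table[tokens[-1]]

best_tokens, best_score = beam_search(toy_model, "<s>", "</s>", beam_size=2)
greedy_tokens, greedy_score = beam_search(toy_model, "<s>", "</s>", beam_size=1)
```

In this toy table a beam of size 2 recovers the sequence starting with "the" (probability 0.36), whereas a beam of size 1 commits to the locally best first word "a" and ends with probability 0.30.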

As in previous studies, Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), and Consensus-based Image Description Evaluation (CIDEr) were used as evaluation metrics. BLEU is widely used to evaluate machine translation and measures the n-gram precision of an automatically generated description against the ground truth; BLEU-4 was used here. METEOR is likewise widely used in machine translation evaluation and measures the word matching rate between a generated description and the ground truth using linguistically informed matching such as morphological (stem) and synonym matching. CIDEr is widely used to evaluate image captioning models and measures n-gram similarity weighted by term frequency-inverse document frequency (TF-IDF). For all three metrics, higher values indicate better performance.
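To make "n-gram precision against the ground truth" concrete, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on: each candidate n-gram is credited at most as many times as it occurs in any single reference. This is only one component; full BLEU-4 also combines the precisions for n = 1 to 4 and applies a brevity penalty, both omitted here. The function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Modified n-gram precision of a candidate against several references."""
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(
        min(count, max_reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())

reference = "the cat is on the mat".split()
repeated = ["the"] * 7
unigram_precision = clipped_precision(repeated, [reference], 1)
bigram_precision = clipped_precision(
    "a man is eating".split(), ["a man is eating food".split()], 2
)
```

Clipping is what stops a degenerate caption such as seven repetitions of "the" from scoring a perfect unigram precision: only two occurrences are credited.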

For fast training, a workstation equipped with an RTX 3090 graphics card was used, and the present invention was implemented in TensorFlow 2.3.

For the experiments, the model was first pre-trained on the image datasets for up to 10 epochs with a batch size of 32. It was then trained on the MSVD dataset for a further 30,000 steps with a batch size of 8, and the parameters were saved every 2,000 steps during training. Each saved set of parameters is referred to as a single parameter. In this study, 15 single parameters were saved.

To further improve the performance of the present invention, these single parameters were ensembled. The ensemble method either takes the arithmetic mean of the single parameters' values or takes their weighted mean according to each single parameter's METEOR score. The five best-performing single parameters were first selected for ensembling, and combinations of the best two, the best three, and so on up to the best five were considered for both the arithmetic-mean and the weighted-mean ensembles. Among the resulting 23 candidate parameters (the 15 single parameters plus four arithmetic-mean and four weighted-mean ensembles), the one with the best validation performance was finally fine-tuned using the validation data.
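Parameter ensembling of this kind averages the saved parameter values themselves rather than the models' outputs. The sketch below shows both variants on toy checkpoints. Treating each checkpoint as a dictionary of flat float lists, and making the weights directly proportional to the METEOR scores, are assumptions for illustration; the source does not specify how the scores are turned into weights.

```python
def average_parameters(checkpoints, weights=None):
    """Element-wise (weighted) mean of parameters across checkpoints."""
    if weights is None:
        weights = [1.0] * len(checkpoints)  # plain arithmetic mean
    total = sum(weights)
    averaged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(size)
        ]
    return averaged

# Two toy checkpoints with one parameter tensor each (flattened).
checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
arithmetic = average_parameters(checkpoints)
# Hypothetical validation scores used as weights (second checkpoint counts 3x).
weighted = average_parameters(checkpoints, weights=[1.0, 3.0])
```

With real METEOR scores, which differ by only a few percent between checkpoints, proportional weights would give a result close to the arithmetic mean.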

First, the single parameters were validated with the same beam search algorithm and evaluation metrics used for testing; their BLEU-4 and METEOR scores are shown in FIG. 6. In FIG. 6, the highest score is marked with a circle and annotated with its value, and the index of the optimal parameter is marked with a vertical dotted line. Based on the highest BLEU-4 and METEOR scores on the MSVD validation data shown in FIG. 6, the single parameter saved at 6,000 (3 × 2,000) steps was selected as the optimal parameter. For ensembling, the five best single parameters were selected according to METEOR score. The BLEU-4 and METEOR scores of the ensemble parameters are shown in FIG. 7, where FIG. 7(a) shows the arithmetic-mean ensemble parameters and FIG. 7(b) shows the ensemble parameters weighted by METEOR score.

Through validation, the optimal parameters with METEOR scores of 32.96, 33.51, and 33.54 were selected for testing. The parameter that achieved the highest METEOR score was then trained on the MSVD validation set for just one epoch with a learning rate of 1e-5, yielding the fine-tuned parameter. The method based on the optimal single parameter is denoted "Multi-Representation Switching with single parameter (MRS-s)", and the methods based on the optimal arithmetic-mean ensemble parameter and the optimal weighted-mean ensemble parameter are denoted "MRS-ea" and "MRS-ew", respectively. Finally, the method based on the fine-tuned parameter is denoted "MRS-ew+".

The experimental results are shown in Table 3 below. The methods used for comparison are sorted in ascending order of METEOR score.

Models              BLEU-4   METEOR   CIDEr
Meanpool [16]       30.7     27.7     -
SA-LSTM [4]         41.9     29.6     51.7
S2VT [2]            -        29.8     -
LSTM-GAN [8]        42.9     30.4     -
GRU-RCN [5]         43.3     31.6     68.0
BAE [6]             42.5     32.4     63.5
h-RNN [7]           49.9     32.6     65.8
PickNet [24]        46.1     33.1     76.0
STAT_V [25]         52.0     33.3     73.8
TSA-ED [26]         51.7     34.0     74.9
RecNetlocal [27]    52.3     34.1     80.3
MRS-s               51.8     32.0     64.9
MRS-ea              52.4     32.9     74.8
MRS-ew              53.3     32.9     73.1
MRS-ew+             54.3     34.0     80.3

As Table 3 shows, the video captioning method according to the present invention performs consistently well on all three metrics. Compared with the single model MRS-s, the ensemble models MRS-ea and MRS-ew improved BLEU-4 and METEOR only slightly but improved CIDEr substantially. In particular, the fine-tuned model MRS-ew+ scored higher than most existing video captioning methods. These scores were reached after only 6,000 steps for MRS-s and 30,000 steps for the ensemble models (MRS-ea and MRS-ew). With a batch size of 8, one epoch is about 6,000 steps, so the models can be regarded as having been trained for only about 1 epoch and 5 epochs, respectively.
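The step-to-epoch conversion can be checked with simple arithmetic. The sketch assumes that each (video, description) pair counts as one training example and that the dataset-wide average of about 41 descriptions per video also holds for the 1,200 training videos; neither assumption is stated explicitly in the source.

```python
train_videos = 1200            # training split size
descriptions_per_video = 41    # approximate average, assumed for the split
batch_size = 8

training_pairs = train_videos * descriptions_per_video   # 49,200 pairs
steps_per_epoch = training_pairs / batch_size             # about 6,000 steps

epochs_single = 6000 / steps_per_epoch      # checkpoint used by MRS-s
epochs_ensemble = 30000 / steps_per_epoch   # full MSVD training run
```

Under these assumptions one epoch is 6,150 steps, so 6,000 steps is just under one epoch and 30,000 steps is just under five.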

In terms of training time and procedure, the method was compared with PickNet, STAT_V, TSA-ED, and RecNet, whose performance is similar to that of the video captioning method according to this embodiment. PickNet was trained in three stages, each for up to 100 epochs. In the first stage, the encoder-decoder networks are trained with a negative log-likelihood loss on the target words. In the second stage, the picking network is trained by reinforcement learning, and in the last stage the two networks are trained jointly.

TSA-ED was trained for up to 30 epochs, or training was stopped when validation performance had not improved for 20 epochs. RecNet, similarly to TSA-ED, was trained until the validation loss had not changed for 20 epochs. In contrast, the present invention was trained for only up to 5 epochs. RecNet additionally used a reconstruction loss for training. Compared with PickNet, the video captioning method according to the present invention performed better on all metrics without reinforcement learning. Compared with TSA-ED, the proposed method scored higher on BLEU-4 and CIDEr and equal on METEOR. Compared with RecNet, the present invention scored 2 points higher on BLEU-4, 0.1 points lower on METEOR, and equal on CIDEr.

STAT_V used pre-trained 2D CNN, C3D, and R-CNN networks for video feature extraction. Because the MSVD dataset has no separate tagging of the visual content, the R-CNN used in STAT_V cannot be fine-tuned in an end-to-end manner. Such a structure has the limitation that the feature extraction stage and the description generation stage must remain separate during training and evaluation. In contrast, the video captioning method according to the present invention does not separate feature extraction from description generation and can fine-tune every network smoothly, and it performed better than STAT_V on all metrics.

These results indicate that the structure proposed in the present invention is effective at extracting information from a given pair of video and description. This is because the video captioning method according to this embodiment can be trained in a fully end-to-end manner, without any preprocessing based on computer vision or natural language processing techniques and without additional loss functions.

To examine qualitative differences, the descriptions generated for the test data by the video captioning method according to this embodiment were compared. Examples for which appropriate captions were generated are shown in FIG. 8. As shown, the video captioning method according to the present invention generally produced appropriate descriptions for the given videos, and the better-performing models appeared to use richer expressions.

Examples in which words were matched to the wrong object and to the correct object are shown in FIG. 9. In several frames of the violin-playing scene on the left, the violin appears to have been misrecognized as a flute because of the angle between the face and the bow. The right side shows a frame from a scene in which a man plays the flute; the shapes of the face and instrument there resemble those on the left. In another example, FIG. 10, the left image shows a tomato that was not recognized properly, presumably because of its color. The right frame shows a gray rabbit playing with a pink rabbit doll; every model recognized the rabbit as a dog, but the model according to this embodiment did recognize the rabbit doll.

As described above, this embodiment discloses a multi-representation switching-based video captioning method for modeling the characteristics of video and caption pairs, and multi-representation switching makes it possible to extract important information efficiently from a given video and caption pair. In experiments on the MSVD dataset, the present invention was trained as intended and achieved a BLEU-4 score of 54.3, a METEOR score of 34.0, and a CIDEr score of 80.3. These scores surpass existing methods with very few training steps and without separate preprocessing or loss functions based on computer vision or natural language processing.

FIG. 11 is a block diagram illustrating a multi-representation switching-based video captioning system 300 according to an embodiment of the present invention. The system can be implemented as a computing system that includes a communication unit 310, an input unit 320, a storage unit 330, an output unit 340, and a processor 350.

The communication unit 310 is provided to exchange necessary information with external devices and external networks, and through it the system can receive training data or a video for which captions are to be generated.

The input unit 320 is an input means for receiving user commands and can receive a video for caption generation. The output unit 340 displays the process and results of multi-representation switching-based video captioning and may include a display.

The storage unit 330 stores the program for performing the video captioning method and provides the storage space the processor 350 needs to operate, temporarily or permanently storing the data processed by the processor 350. It may include a volatile or non-volatile storage medium, but the scope of the present invention is not limited thereto. The storage unit 330 may also store data accumulated while the video captioning method is performed.

The processor 350 comprises a CPU and GPUs for training the deep learning model for multi-representation switching-based video captioning described above and for generating video captions with the trained model. The processor 350 can control the entire process of providing the video captioning method.

The multi-representation switching-based video captioning method of the present invention can be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

Although various embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or perspective of the present invention.

100: encoder 110: 2D convolutional neural network module
120: first recurrent neural network module 200: decoder
210: second recurrent neural network module 220: attention module
230: switch module 300: system
310: communication unit 320: input unit
330: storage unit 340: output unit
350: processor

Claims

A multi-representation switching-based video captioning method performed by a multi-representation switching-based video captioning system, the method comprising:
extracting object features on a frame-by-frame basis from an input video;
extracting motion features based on the extracted object features;
extracting grammatical features based on a given word sequence;
calculating weights for the object features and the motion features and computing weighted sums thereof; and
generating a description of the video based on the extracted grammatical features, the weighted-sum object features, and the weighted-sum motion features.

The method of claim 1,
wherein extracting the object features comprises performing mean pooling on the extracted object features.
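
As an illustration only, mean pooling can be sketched as an element-wise average over a set of feature vectors. The claim does not state which axis is pooled, so the sketch assumes a list of equal-length vectors (for example, the spatial positions of one frame's CNN feature map); the function name `mean_pool` is hypothetical.

```python
def mean_pool(vectors):
    """Element-wise average of a non-empty list of equal-length vectors.

    Assumption: the pooled axis is the list itself (e.g. spatial
    positions of one frame); the patent text does not specify this.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```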

The method of claim 2,
wherein, in extracting the grammatical features, the grammatical features are extracted based on a combination of the given word sequence and an average vector, over all frames, of the extracted object feature vectors and the motion feature vectors.

The method of claim 3,
wherein calculating the weighted sum comprises calculating a weight for each frame using an attention mechanism, and
wherein, each time a word of the video description is generated, the extracted object features and motion features needed to generate that word are selected through the attention mechanism.
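
The per-frame weighting recited above can be sketched as follows. This is a minimal sketch, not the claimed implementation: it assumes dot-product scoring between a decoder state (`query`) and each frame feature, followed by a softmax. The claims do not specify the scoring function, and the names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frame_features):
    """Return (per-frame weights, weighted sum of frame features).

    Assumption: the score of a frame is the dot product of `query`
    with that frame's feature vector.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, frame_features))
        for i in range(dim)
    ]
    return weights, context


# A zero query scores every frame equally, so the weights are uniform
# and the weighted sum reduces to the plain average of the frames.
weights, context = attend([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
```

Under this assumption the function would be called once per generated word and once per feature type (object and motion), each time with the current decoder state as the query.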

The method of claim 4,
wherein generating the description comprises:
generating a word generation probability expression for each of the cases in which only the grammatical features are used, only the object features are used, and only the motion features are used;
calculating a final word probability distribution based on the word generation probability expressions; and
generating, as a final word, the word having the highest probability value in the calculated final word probability distribution.
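
One way to picture this step is shown below. The claim does not state how the three per-representation distributions are combined, so the sketch assumes a convex mixture weighted by switch probabilities that sum to 1. The function names and all numeric values are made up for illustration.

```python
def switch_distribution(switch_probs, word_dists):
    """Mix per-representation word distributions into one distribution.

    Assumption: final P(w) = sum_k switch_probs[k] * word_dists[k][w].
    """
    vocab = set().union(*(d.keys() for d in word_dists.values()))
    return {
        w: sum(switch_probs[k] * word_dists[k].get(w, 0.0) for k in word_dists)
        for w in vocab
    }


def pick_word(final_dist):
    """Return the word with the highest probability."""
    return max(final_dist, key=final_dist.get)


# Toy values, not taken from the patent.
switch_probs = {"grammar": 0.2, "object": 0.5, "motion": 0.3}
word_dists = {
    "grammar": {"a": 0.9, "dog": 0.05, "runs": 0.05},
    "object": {"a": 0.1, "dog": 0.8, "runs": 0.1},
    "motion": {"a": 0.1, "dog": 0.2, "runs": 0.7},
}
final_dist = switch_distribution(switch_probs, word_dists)
best_word = pick_word(final_dist)
```

With these toy values the object representation dominates the mixture, so the selected word is the object word.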

The method of claim 5,
wherein generating the description comprises generating a word sequence for the video description by repeating the steps of generating the word generation probability expressions, calculating the final word probability distribution, and generating the word having the highest probability value as the final word.

The method of claim 6,
wherein, in generating the description, the word sequence is generated through a beam search algorithm.
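
For readers unfamiliar with beam search, a generic sketch follows. It is not the patent's decoder: `next_word_probs` is a hypothetical callback standing in for whatever produces the final word probability distribution for a given prefix, and `toy_model`, the `<eos>` token, and all probabilities are illustrative assumptions.

```python
import math


def beam_search(next_word_probs, beam_width=3, max_len=10, eos="<eos>"):
    """Return the highest-scoring word sequence found by beam search.

    `next_word_probs(prefix)` must return a dict {word: probability}.
    Scores are summed log-probabilities.
    """
    beams = [(0.0, [])]  # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            for word, p in next_word_probs(words).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, words in candidates[:beam_width]:
            if words[-1] == eos:
                finished.append((logp, words))
            else:
                beams.append((logp, words))
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses cut off at max_len
    if not finished:
        return []
    return max(finished, key=lambda c: c[0])[1]


def toy_model(prefix):
    """Hypothetical next-word table used only to exercise the search."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


wide = beam_search(toy_model, beam_width=2)
greedy = beam_search(toy_model, beam_width=1)
```

In this toy table, a beam width of 1 (greedy decoding) commits to the locally best first word "a", while a beam width of 2 keeps "the" alive and finds the sequence with the higher overall probability.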

A computer-readable recording medium having recorded thereon a computer program for performing the multi-representation switching-based video captioning method according to claim 1.

(Deleted)

A multi-representation switching-based video captioning system comprising:
an input unit that receives a video; and
a processor that generates a caption for the video using a deep learning model when the video is input, and that trains the deep learning model on pairs of a video and its description received as training data,
wherein the deep learning model comprises:
an encoder including a 2D convolutional neural network (CNN) module that extracts object features frame by frame from the input video, and a first recurrent neural network (RNN) module that extracts motion features based on the extracted object features; and
a decoder including a second recurrent neural network module that extracts grammatical features using a given word sequence, an attention module that calculates per-frame weights for the object features and the motion features and computes weighted sums thereof, and a switch module that generates a description of the video based on the extracted grammatical features, the weighted-sum object features, and the weighted-sum motion features.

The system of claim 10,
wherein the 2D convolutional neural network (CNN) module performs a mean pooling operation on the extracted object features.

The system of claim 11,
wherein the second recurrent neural network module extracts the grammatical features based on a combination of the given word sequence and an average vector, over all frames, of the extracted object feature vectors and the motion feature vectors.

The system of claim 12,
wherein the attention module calculates a weight for each frame using an attention mechanism, and
wherein, each time a word of the video description is generated, the extracted object features and motion features needed to generate that word are selected through the attention mechanism.

The system of claim 13,
wherein the switch module generates a word generation probability expression for each of the cases in which only the grammatical features are used, only the object features are used, and only the motion features are used, calculates a final word probability distribution based on the word generation probability expressions, and generates a word sequence for the video description by repeating a process of generating, as a final word, the word having the highest probability value in the calculated final word probability distribution.

The system of claim 14,
wherein the switch module generates the word sequence through a beam search algorithm.