KR101936947B1

KR101936947B1 - Method for temporal information encoding of the video segment frame-wise features for video recognition

Info

Publication number: KR101936947B1
Application number: KR1020170165382A
Authority: KR
Inventors: 김영석; 권희승
Original assignee: 포항공과대학교 산학협력단
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2019-01-09
Also published as: WO2019112385A1

Abstract

Disclosed is a method for encoding a video feature point which represents temporal information of a video using proposed pooling equations for video recognition. The method according to the present specification comprises the following steps: (a) receiving feature points generated for each frame of a video; (b) encoding the feature points through integrated pooling (sum pooling), max pooling, gradient pooling, standard deviation pooling, gradient max pooling, and gradient standard deviation pooling, separately; and (c) calculating one vector value by connecting each feature point calculated in step (b).

Description

Method for temporal information encoding of the video segment frame-wise features for video recognition

The present invention relates to frame-integrated encoding of video, and more particularly to encoding for improving recognition of the motion of a person or object in a video.

Video recognition, like image recognition, is applicable to a variety of industries. It can be applied not only to the most directly related fields of video retrieval and video surveillance systems, but also to medical image diagnosis, autonomous driving, human-robot interaction, and intelligent robots.

General video recognition technology aims to understand human behavior through video: it recognizes and analyzes human motion in short video segments of about 5 to 10 seconds. Because video basically consists of 25 to 30 image frames per second, video recognition borrows many methodologies from image recognition. Specifically, when frame-wise feature points are extracted from a video for recognition, feature point extraction methods from image recognition are used. A feature point encoding method then integrates the frame-wise feature points into a comprehensive feature point for the whole video, which completes the video recognition pipeline.

Feature point extraction for video recognition mainly relies on extraction methods from the image recognition field. Traditional hand-designed extraction methods include optical flow, HOG (Histogram of Oriented Gradients), HoF (Histogram of Flow), and MBH (Motion Boundary Histogram). Notably, motion-related extraction methods are mainly used, in order to analyze movement within the video. With the development of deep learning, learning feature points through a Convolutional Neural Network (CNN) has also become a widely used extraction approach for video recognition. All frames of the video are extracted, and the RGB image of each frame is used as the network input to learn frame-wise feature points. In addition, to extract motion feature points, the dense optical flow of the video is sometimes computed in advance to produce optical flow images, which are then used as the network input. However, all of these methods extract frame-wise features and therefore cannot take the overall temporal information of the video into account.

Representative feature point encoding methods used together with the above extraction methods to account for the temporal information of a video are the Bag of Words (BoW) technique and the Improved Fisher Vector (IFV) technique. BoW clusters the extracted feature points to build a histogram that represents the overall characteristics of the video. IFV clusters feature points more flexibly than BoW and models the overall characteristics of the video through the relationships among feature points. Both take information from all frames into account, but neither models change over time, such as frame order.

Methods that account for temporal information while performing feature point extraction and encoding together include 3D convolutional neural networks and the fusion of a convolutional neural network with a recurrent neural network. The 3D convolutional approach treats the time axis of the video as an additional dimension, takes the video as a three-dimensional input, and designs the convolutional network in three dimensions. The fusion approach attaches a Long Short-Term Memory (LSTM) recurrent neural network to the back end of a convolutional neural network so that time-axis information is recognized. Both are supervised learning models, and both are hard to train because they require large amounts of data and the models are highly complex.

Korean Registered Patent Publication No. 10-1575857, Korean Registered Patent Publication No. 10-1762400, Korean Registered Patent Publication No. 10-1563297

The present specification aims to provide a video feature point encoding method that represents the temporal information of a video using the proposed pooling equations, for use in video recognition.

The present specification is not limited to the problem mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above problem, a video feature point encoding method according to the present specification may include: (a) receiving feature points generated for each frame of a video; and (b) encoding the feature points by at least one method selected from the group consisting of standard deviation pooling, gradient max pooling, and gradient standard deviation pooling.

According to one embodiment of the present specification, the standard deviation pooling may encode, as a feature point, the standard deviation of the frame-wise feature point values along the time axis.

In this case, the standard deviation pooling may be encoded using an equation defined in terms of the result of standard deviation pooling; N, the number of frames; and the time-axis mean of the feature values.

According to another embodiment of the present specification, the gradient max pooling may compute the gradient values of the frame-wise feature point values and encode their maximum and minimum along the time axis as feature points.

In this case, the gradient max pooling may be encoded using an equation whose results are, respectively, the maximum and the minimum of the feature point gradient values.

According to yet another embodiment of the present specification, the gradient standard deviation pooling may compute the gradient values of the frame-wise feature point values and encode their standard deviation along the time axis as a feature point.

In this case, the gradient standard deviation pooling may be encoded using the equation given for it in the detailed description.

To solve the above problem, a video feature point encoding method according to the present specification may include: (a) receiving feature points generated for each frame of a video; (b) encoding the feature points separately with each of sum pooling, max pooling, gradient pooling, standard deviation pooling, gradient max pooling, and gradient standard deviation pooling; and (c) concatenating the feature points calculated in step (b) into a single vector value.

The video feature point encoding method according to the present specification may further include (d) normalizing the vector value encoded in step (c) by subtracting the mean and dividing by the standard deviation.
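The normalization in step (d) can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say whether the mean and standard deviation are taken over the elements of one encoded vector or per dimension across a dataset, so the sketch assumes the former, uses the population standard deviation (division by the element count), and guards against a zero standard deviation. The function name `normalize` is hypothetical.

```python
import math


def normalize(vector):
    """Subtract the mean of the vector and divide by its standard deviation."""
    n = len(vector)
    mean = sum(vector) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in vector) / n)
    if std == 0.0:
        # A constant vector has no spread; return zeros instead of dividing by zero.
        return [0.0] * n
    return [(v - mean) / std for v in vector]


encoded = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = normalize(encoded)
print(normalized)
```

If normalization is instead meant per dimension across many videos, the same arithmetic would be applied column-wise over a set of encoded vectors.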

The video feature point encoding method according to the present specification may be implemented as a computer program that is written to perform each step of the method on a computer and is recorded on a computer-readable recording medium.

To solve the above problem, a video feature point encoding apparatus according to the present specification may include: a feature point receiving unit that receives feature points generated for each frame of a video; and a pooling unit that encodes the data generated by the feature point generation unit by at least one method selected from the group consisting of standard deviation pooling, gradient max pooling, and gradient standard deviation pooling.

To solve the above problem, a video feature point encoding apparatus according to the present specification may include: a feature point receiving unit that receives feature points generated for each frame of a video; a pooling unit that encodes the feature points separately with each of sum pooling, max pooling, gradient pooling, standard deviation pooling, gradient max pooling, and gradient standard deviation pooling; and a feature point concatenation unit that concatenates the feature points encoded by the pooling unit into a single vector value. In this case, the apparatus may further include a normalization unit that subtracts the mean from the vector value calculated by the feature point concatenation unit and divides it by the standard deviation.

Other specific details of the invention are included in the detailed description and the drawings.

According to one aspect of the present specification, a video containing a specific action is compressed and represented as a single new integrated feature vector. The integrated feature vector of the video can then achieve high video recognition performance in combination with a general-purpose classifier such as a support vector machine.

According to another aspect of the present specification, new feature point pooling methods that capture statistical information about the feature points and information about their change over time can be used alongside the existing feature point pooling methods. This compresses more varied temporal information than existing methods and is robust even when encoding motions with heavy shaking and motions that require long-term observation.

According to yet another aspect of the present specification, because feature point pooling is by nature used together with frame-wise feature point extraction, it is compatible with a variety of existing extraction methods for video recognition, such as traditional image recognition extraction methods and convolutional neural network extraction methods.

According to yet another aspect of the present specification, the method is fully unsupervised and requires no training, so it can be used regardless of the amount of data. It can therefore be used effectively even when the dataset is small.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a flowchart briefly illustrating a video feature point encoding method according to the present specification.
FIG. 2 is a flowchart of a method of performing standard deviation pooling according to one embodiment of the present specification.
FIG. 3 is a flowchart of a method of performing gradient max pooling according to another embodiment of the present specification.
FIG. 4 is a flowchart of a method of performing gradient standard deviation pooling according to another embodiment of the present specification.
FIG. 5 is a reference diagram briefly illustrating a video feature point encoding method according to yet another embodiment of the present specification.

The advantages and features of the invention disclosed herein, and the methods of achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings. However, the present specification is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure complete and to fully convey the scope of the specification to a person of ordinary skill in the art to which it belongs (hereinafter "a person skilled in the art"); the scope of rights is defined only by the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the scope of rights of the present specification. In the present specification, the singular form also includes the plural unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more elements other than those stated. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the stated elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are of course not limited by these terms. The terms are used only to distinguish one element from another. Therefore, a first element mentioned below may of course be a second element within the technical idea of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by a person of ordinary skill in the art to which this specification belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined. Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Briefly defined, the video feature point encoding method according to the present specification is a feature pooling method. Feature pooling applies simple pooling equations along the time axis to the extracted frame-wise feature points, and several studies have verified that it can represent time-axis information more effectively than the conventional BoW and IFV techniques. The method according to the present specification uses new pooling methods that differ from existing feature pooling methods and, further, combines them with the existing methods to produce an integrated feature vector for the video.

FIG. 1 is a flowchart briefly illustrating a video feature point encoding method according to the present specification.

The video feature point encoding method according to the present specification may include a step (S10) of generating feature points and a step (S20) of pooling the generated feature points.

In the step (S10) of generating the feature points of the video, frame-wise feature points of the video may be generated by an existing feature point extraction method. Frame-wise feature point generation is a well-known technique familiar to those skilled in the art and is not the technical core of the present specification, so a detailed description is omitted.

Next, in step S20, the feature points may be encoded by at least one method selected from the group consisting of standard deviation pooling, gradient max pooling, and gradient standard deviation pooling.

FIG. 2 is a flowchart of a method of performing standard deviation pooling according to one embodiment of the present specification.

The standard deviation pooling generates a new feature point by extracting statistical information from the frame-wise feature point values along the time axis: the standard deviation of the frame-wise feature point values along the time axis may be encoded as the new feature point.

The standard deviation pooling may be encoded using Equation 1 below.

<Equation 1>

Equation 1 is defined in terms of the result of standard deviation pooling; N, the number of frames; and the time-axis mean of the feature values.
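As a rough sketch of standard deviation pooling under stated assumptions: the input is taken to be a list of N frame-wise feature vectors of equal length D, and the standard deviation is computed per feature dimension along the time axis. Dividing by N (rather than N - 1) is an assumption, and the name `std_pooling` is hypothetical.

```python
import math


def std_pooling(frames):
    """Per-dimension standard deviation over time for N frame-wise feature vectors."""
    n = len(frames)
    dims = len(frames[0])
    pooled = []
    for k in range(dims):
        values = [frame[k] for frame in frames]
        mean_k = sum(values) / n  # time-axis mean of the k-th feature value
        # Assumption: population variance (divide by N).
        variance_k = sum((v - mean_k) ** 2 for v in values) / n
        pooled.append(math.sqrt(variance_k))
    return pooled


# Four frames, two feature dimensions; the second dimension never changes.
frames = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0], [7.0, 10.0]]
result = std_pooling(frames)
print(result)
```

A dimension that stays constant over the clip pools to zero, so the output highlights the feature dimensions that vary over time.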

FIG. 3 is a flowchart of a method of performing gradient max pooling according to another embodiment of the present specification.

The gradient max pooling generates new feature points by extracting how the feature points change over time. After the gradient values of the extracted frame-wise feature point values are computed, their maximum and minimum along the time axis may be encoded as new feature points. Observing the maximum and minimum gradient values can be expected to raise the recognition rate for the part where the motion changes most, or for the moment of impact in the video.

The gradient max pooling may be encoded using Equation 2 below.

<Equation 2>

The results of Equation 2 are, respectively, the maximum and the minimum of the feature point gradient values.
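Gradient max pooling might be sketched as below. The gradient is assumed to be the difference between a frame's feature value and the previous frame's value, which is how the specification describes the gradient for gradient pooling; at least two frames are required. The names `temporal_gradient` and `gradient_max_pooling` are hypothetical.

```python
def temporal_gradient(frames):
    """Frame-to-frame differences: (N - 1) vectors for N frames."""
    return [[cur[k] - prev[k] for k in range(len(cur))]
            for prev, cur in zip(frames, frames[1:])]


def gradient_max_pooling(frames):
    """Per-dimension maximum and minimum of the temporal gradient."""
    grads = temporal_gradient(frames)
    dims = len(frames[0])
    grad_max = [max(g[k] for g in grads) for k in range(dims)]
    grad_min = [min(g[k] for g in grads) for k in range(dims)]
    return grad_max, grad_min


frames = [[0.0, 5.0], [1.0, 5.0], [4.0, 2.0], [3.0, 2.0]]
grad_max, grad_min = gradient_max_pooling(frames)
print(grad_max, grad_min)
```

In this toy clip the largest jump in both dimensions happens between the second and third frames, which is the kind of abrupt change the pooling is meant to capture.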

FIG. 4 is a flowchart of a method of performing gradient standard deviation pooling according to another embodiment of the present specification.

Gradient standard deviation pooling likewise generates new feature points by extracting how the feature points change over time. After the gradient values of the extracted frame-wise feature point values are computed, their standard deviation along the time axis may be encoded as a feature point. Observing the standard deviation of the gradient values makes it possible to identify the feature points that vary widely in each video and to extract feature points common to each action.

The gradient standard deviation pooling may be encoded using Equation 3 below.

<Equation 3>
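A minimal sketch of gradient standard deviation pooling, under the same assumptions as elsewhere in this description: the gradient is the difference between consecutive frames, and the standard deviation divides by the number of gradient samples (N - 1 for N frames). Both choices, and the name `gradient_std_pooling`, are assumptions made for illustration.

```python
import math


def gradient_std_pooling(frames):
    """Per-dimension standard deviation of frame-to-frame differences."""
    dims = len(frames[0])
    grads = [[cur[k] - prev[k] for k in range(dims)]
             for prev, cur in zip(frames, frames[1:])]
    pooled = []
    for k in range(dims):
        values = [g[k] for g in grads]
        mean_k = sum(values) / len(values)
        # Assumption: population variance over the gradient samples.
        variance_k = sum((v - mean_k) ** 2 for v in values) / len(values)
        pooled.append(math.sqrt(variance_k))
    return pooled


frames = [[0.0, 1.0], [2.0, 1.0], [2.0, 1.0], [6.0, 1.0]]
grad_std = gradient_std_pooling(frames)
print(grad_std)
```

The first dimension changes at an uneven pace and gets a nonzero value, while the static second dimension pools to zero.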

Of the standard deviation pooling, gradient max pooling, and gradient standard deviation pooling described above, any one may be used alone, two may be used together, or all three may be used. Furthermore, combined with an existing frame-wise feature point extraction method, they can produce feature points that represent the temporal information of the entire video.

FIG. 5 is a reference diagram briefly illustrating a video feature point encoding method according to yet another embodiment of the present specification.

Referring to FIG. 5, assume first that frame-wise feature points of the video have been generated by existing feature point extraction methods. The frame-wise feature points are then grouped per video. In FIG. 5, D denotes the dimension of the feature point and N denotes the number of frames in the video. The grouped frame-wise feature points then undergo pooling along the time axis; that is, the N×D matrix is reduced to a D-dimensional vector. In the present specification, six pooling methods in total (sum pooling, max pooling, gradient pooling, standard deviation pooling, gradient max pooling, and gradient standard deviation pooling) yield six pooling results in total. These are concatenated into a single vector. After a feature point normalization step, a video-level integrated feature vector covering the entire temporal information is complete.
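The pooling and concatenation stages of FIG. 5 can be sketched end to end as follows. This is an illustration under several assumptions rather than the patented implementation: gradients are consecutive-frame differences, standard deviations divide by the sample count, gradient pooling sums the positive and negative gradients separately, and the two-part results (gradient pooling and gradient max pooling) each contribute two D-dimensional parts, so the concatenated vector has length 8D. The normalization step is left out. All function names are hypothetical.

```python
import math


def column(rows, k):
    """Values of dimension k across all rows (time axis)."""
    return [row[k] for row in rows]


def std(values):
    """Population standard deviation (assumed normalization)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def encode_video(frames):
    """Concatenate six temporal poolings of an N x D list of frame features."""
    dims = len(frames[0])
    grads = [[c[k] - p[k] for k in range(dims)] for p, c in zip(frames, frames[1:])]
    cols = [column(frames, k) for k in range(dims)]
    gcols = [column(grads, k) for k in range(dims)]
    parts = [
        [sum(c) for c in cols],                                # sum pooling
        [max(c) for c in cols],                                # max pooling
        [sum((g for g in gc if g > 0), 0.0) for gc in gcols],  # gradient pooling, positive
        [sum((g for g in gc if g < 0), 0.0) for gc in gcols],  # gradient pooling, negative
        [std(c) for c in cols],                                # standard deviation pooling
        [max(gc) for gc in gcols],                             # gradient max pooling, max
        [min(gc) for gc in gcols],                             # gradient max pooling, min
        [std(gc) for gc in gcols],                             # gradient std pooling
    ]
    return [value for part in parts for value in part]


frames = [[1.0, 0.0], [2.0, 4.0], [4.0, 2.0]]  # N = 3 frames, D = 2
video_vector = encode_video(frames)
print(len(video_vector), video_vector)
```

Because every pooling collapses the time axis, the output length depends only on D, so clips with different frame counts map to vectors of the same size.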

The three existing feature point pooling methods used in the present specification are sum pooling, max pooling, and gradient pooling. Sum pooling adds all of the extracted frame-wise feature point values along the time axis to form a single vector, and max pooling takes only the maximum of the extracted frame-wise feature point values along the time axis and forms a vector from it. Gradient pooling obtains the gradient values of the extracted frame-wise feature points and then separates the positive and negative values along the time axis to form a single vector.

Equation 4 below expresses sum pooling. It is defined in terms of the value of the k-th feature point, among the D-dimensional feature points, of the t-th frame among the N frames; the start frame and the end frame, each expressed as a time-series index; and the result of sum pooling. As Equation 4 shows, sum pooling adds the feature values along the time axis.

<Equation 4>
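A short sketch of sum pooling over a frame range, assuming zero-based frame indices with an exclusive end index (the indexing convention of the original equation is not recoverable here). The name `sum_pooling` and its parameters are hypothetical.

```python
def sum_pooling(frames, start=0, end=None):
    """Sum each feature dimension over frames[start:end] (end exclusive)."""
    window = frames[start:end]
    dims = len(window[0])
    return [sum(frame[k] for frame in window) for k in range(dims)]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
whole_clip = sum_pooling(frames)
from_second_frame = sum_pooling(frames, start=1)
print(whole_clip, from_second_frame)
```

Reordering the frames leaves the result unchanged, which is why sum pooling alone cannot represent temporal order.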

Equation 5 below expresses max pooling. It is defined in terms of the result of max pooling; the remaining values are defined as in Equation 4. As Equation 5 shows, max pooling takes the maximum of the feature values along the time axis.

<Equation 5>
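Max pooling can be sketched in the same style, again assuming zero-based indices with an exclusive end index. The name `max_pooling` and its parameters are hypothetical.

```python
def max_pooling(frames, start=0, end=None):
    """Per-dimension maximum over frames[start:end] (end exclusive)."""
    window = frames[start:end]
    dims = len(window[0])
    return [max(frame[k] for frame in window) for k in range(dims)]


frames = [[1.0, 9.0], [7.0, 2.0], [3.0, 5.0]]
peak = max_pooling(frames)
print(peak)
```

Each output element may come from a different frame: here the first dimension peaks in the second frame and the second dimension in the first.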

Equation 6 below expresses gradient pooling. Its results are the gradient pooling vectors for the positive values and the negative values, respectively. As Equation 6 shows, the gradient is computed along the time axis as the difference between a frame's feature point and the previous frame's feature point.

<Equation 6>
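The gradient computation described above, followed by the positive and negative split, might look like the sketch below. Summing the positive gradients and the negative gradients separately is an assumption about how the two result vectors are formed; the text only states that positive and negative values are separated. The name `gradient_pooling` is hypothetical.

```python
def gradient_pooling(frames):
    """Return (positive, negative) sums of frame-to-frame differences per dimension."""
    dims = len(frames[0])
    # Gradient: feature value of a frame minus that of the previous frame.
    grads = [[cur[k] - prev[k] for k in range(dims)]
             for prev, cur in zip(frames, frames[1:])]
    positive = [sum((g[k] for g in grads if g[k] > 0), 0.0) for k in range(dims)]
    negative = [sum((g[k] for g in grads if g[k] < 0), 0.0) for k in range(dims)]
    return positive, negative


frames = [[0.0, 5.0], [2.0, 4.0], [1.0, 4.0], [4.0, 1.0]]
positive, negative = gradient_pooling(frames)
print(positive, negative)
```

Keeping increases and decreases in separate vectors prevents them from cancelling, so a feature that rises and then falls is distinguishable from one that never moves.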

Meanwhile, the video feature point encoding method according to the present specification may be implemented as a computer program written to perform each step of the method and recorded on a computer-readable recording medium.

So that the computer can read the program and execute the methods implemented as the program, the computer program may include code written in a computer language such as C, C++, JAVA, or machine language that the processor (CPU) of the computer can read through the device interface of the computer. Such code may include functional code related to the functions that define what is needed to execute the methods, and may include execution-procedure control code needed for the processor to execute those functions in a predetermined order. Such code may further include memory-reference code indicating at which location (address) in the internal or external memory of the computer the additional information or media needed to execute the functions should be referenced. In addition, when the processor needs to communicate with another remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with that remote computer or server using the communication module of the computer, and what information or media should be transmitted and received during communication.

The storage medium refers not to a medium that stores data for a short moment, such as a register, cache, or memory, but to a medium that stores data semi-permanently and can be read by a device. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices. That is, the program may be stored on various recording media on various servers that the computer can access, or on various recording media on the user's computer. The medium may also be distributed over network-connected computer systems so that computer-readable code is stored in a distributed manner.

In addition, the video feature point encoding method according to the present specification may be implemented as a video feature point encoding apparatus including a feature point generation unit and a pooling unit.

The feature point generation unit may generate feature points for each frame of the video.

The pooling unit may encode the data generated by the feature point generation unit by at least one method selected from the group consisting of standard deviation pooling, gradient max pooling, and gradient standard deviation pooling.
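As a rough illustration of the three pooling methods named above, the following NumPy sketch operates on an array of per-frame feature points of shape (N, D). It is not the patented implementation. The function names are hypothetical, and two choices are assumptions rather than statements of the specification: the gradient is taken as the first-order difference between consecutive frames, and the standard deviation is the population form (division by the number of samples).

```python
import numpy as np


def temporal_gradient(features: np.ndarray) -> np.ndarray:
    """First-order difference between consecutive frames (assumed gradient).

    features has shape (N, D); the result has shape (N - 1, D).
    """
    return np.diff(features, axis=0)


def std_pooling(features: np.ndarray) -> np.ndarray:
    """Standard deviation of each feature dimension along the time axis."""
    return features.std(axis=0)


def gradient_max_pooling(features: np.ndarray) -> np.ndarray:
    """Maximum and minimum temporal gradient per dimension, concatenated (2 * D values)."""
    grad = temporal_gradient(features)
    return np.concatenate([grad.max(axis=0), grad.min(axis=0)])


def gradient_std_pooling(features: np.ndarray) -> np.ndarray:
    """Standard deviation of the temporal gradient per dimension."""
    return temporal_gradient(features).std(axis=0)


# Toy input: N = 4 frames, D = 2 feature dimensions (made-up values).
features = np.array([[0.0, 1.0],
                     [1.0, 1.0],
                     [3.0, 0.0],
                     [2.0, 2.0]])

print(std_pooling(features))
print(gradient_max_pooling(features))
print(gradient_std_pooling(features))
```

Each function reduces the time axis, so a video segment of any length maps to a vector whose size depends only on D.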

The specific algorithms of the feature point generation unit and the pooling unit have been described above, so a repeated description is omitted.

According to an embodiment of the present specification, the pooling unit may encode the feature points separately by sum pooling, max pooling, gradient pooling, standard deviation pooling, gradient max pooling, and gradient standard deviation pooling.

Furthermore, the video feature point encoding apparatus according to the present specification may further include a feature point concatenation unit that concatenates the feature points encoded by the pooling unit into a single vector value, and a normalization unit that normalizes the vector value produced by the feature point concatenation unit by mean subtraction and division by the standard deviation.
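The concatenation and normalization stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the function name `concat_and_normalize` and the `eps` guard are hypothetical, and the mean and standard deviation are assumed to be computed over the elements of the single concatenated vector, since the text does not say whether the statistics come from the vector itself or from a dataset. Only sum pooling and max pooling are used as stand-ins for the full set of pooling outputs.

```python
import numpy as np


def concat_and_normalize(pooled_vectors, eps: float = 1e-8) -> np.ndarray:
    """Concatenate pooled feature vectors, then subtract the mean and
    divide by the standard deviation.

    Assumption: the statistics are taken over the concatenated vector itself.
    eps is a hypothetical guard against division by zero.
    """
    vector = np.concatenate([np.ravel(v) for v in pooled_vectors])
    return (vector - vector.mean()) / (vector.std() + eps)


# Toy input: N = 4 frames, D = 2 feature dimensions (made-up values).
features = np.array([[0.0, 1.0],
                     [1.0, 1.0],
                     [3.0, 0.0],
                     [2.0, 2.0]])

# Stand-ins for the pooling unit's outputs: sum pooling and max pooling.
pooled = [features.sum(axis=0), features.max(axis=0)]

encoded = concat_and_normalize(pooled)
print(encoded.shape)
```

In a fuller pipeline the list passed to `concat_and_normalize` would hold one vector per pooling method, so the encoded length grows with the number of methods but stays independent of the number of frames.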

The video feature point encoding apparatus described in connection with the embodiments of the present specification may be implemented directly in hardware, as a software module executed by hardware, or as a combination of the two. The software module may reside in random access memory (RAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which this specification belongs.

The embodiments of the present specification have been described above with reference to the accompanying drawings, but a person of ordinary skill in the art to which this specification belongs will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A video feature point encoding method comprising:
(a) receiving feature points generated for each frame of a video; and
(b) encoding the feature points by at least one method selected from the group consisting of standard deviation pooling, gradient max pooling, and gradient standard deviation pooling,
wherein the standard deviation pooling is encoded using the following equation,
[Equation: standard deviation pooling]
[symbol]: the result of standard deviation pooling
N: the number of frames
[symbol]: the time-averaged mean of the feature point values
k: the index of the k-th feature point
t: the index of the t-th frame
t^s: the start frame
t^e: the end frame
f_k(t): the value of the k-th feature point among the D-dimensional feature points of the t-th frame of the N frames
wherein the gradient max pooling is encoded using the following equation,
[Equation: gradient max pooling]
[symbols]: the maximum and minimum of the feature point gradient values, respectively
wherein the gradient standard deviation pooling is encoded using the following equation,
[Equation: gradient standard deviation pooling]
and wherein step (b) obtains a feature point representing temporal information of the entire video by encoding with at least one of the standard deviation pooling, the gradient max pooling, and the gradient standard deviation pooling.
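The equations of this claim are not reproduced in the text. One plausible reading, consistent with the symbol definitions given in the claim, is sketched below. Everything beyond N, k, t, t^s, t^e, and f_k(t) is assumed notation: the names \bar{f}_k, g_k(t), \bar{g}_k, and the p symbols are hypothetical, the gradient is assumed to be a first-order frame difference, and the 1/N (rather than 1/(N-1)) normalization is an assumption.

```latex
% Assumed notation; N = t^e - t^s + 1 frames.
\bar{f}_k = \frac{1}{N} \sum_{t=t^s}^{t^e} f_k(t), \qquad
p_k^{\mathrm{std}} = \sqrt{\frac{1}{N} \sum_{t=t^s}^{t^e} \bigl( f_k(t) - \bar{f}_k \bigr)^2 }

% Assumed gradient: first-order difference between consecutive frames.
g_k(t) = f_k(t) - f_k(t-1), \qquad t = t^s + 1, \dots, t^e

% Gradient max pooling: maximum and minimum of the gradient values.
p_k^{+} = \max_{t} \, g_k(t), \qquad
p_k^{-} = \min_{t} \, g_k(t)

% Gradient standard deviation pooling over the N - 1 gradient values.
\bar{g}_k = \frac{1}{N-1} \sum_{t=t^s+1}^{t^e} g_k(t), \qquad
p_k^{\mathrm{gstd}} = \sqrt{\frac{1}{N-1} \sum_{t=t^s+1}^{t^e} \bigl( g_k(t) - \bar{g}_k \bigr)^2 }
```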

The method according to claim 1,
wherein the standard deviation pooling encodes, as a feature point, the standard deviation along the time axis of the per-frame feature point values.

delete

The method according to claim 1,
wherein the gradient max pooling computes gradient values of the per-frame feature point values and encodes their maximum and minimum along the time axis as feature points.

delete

The method according to claim 1,
wherein the gradient standard deviation pooling computes gradient values of the per-frame feature point values and encodes their standard deviation along the time axis as a feature point.

delete

A video feature point encoding method comprising:
(a) receiving feature points generated for each frame of a video;
(b) encoding the feature points separately by sum pooling, max pooling, gradient pooling, standard deviation pooling, gradient max pooling, and gradient standard deviation pooling;
(c) concatenating the feature points encoded in step (b) into a single vector value; and
(d) normalizing the vector value calculated in step (c) by mean subtraction and division by the standard deviation.

delete

A computer program recorded on a computer-readable recording medium, the computer program being written to cause a computer to perform each step of the video feature point encoding method according to claim 1, claim 2, claim 4, claim 6 and claim 8.

A video feature point encoding apparatus comprising:
a feature point receiving unit that receives feature points generated for each frame of a video; and
a pooling unit that encodes the feature points received by the feature point receiving unit by at least one method selected from the group consisting of standard deviation pooling, gradient max pooling, and gradient standard deviation pooling,
wherein the standard deviation pooling is encoded using the following equation,
[Equation: standard deviation pooling]
[symbol]: the result of standard deviation pooling
N: the number of frames

[symbols]: the maximum and minimum of the feature point gradient values, respectively

and wherein the pooling unit obtains a feature point representing temporal information of the entire video by encoding with at least one of the standard deviation pooling, the gradient max pooling, and the gradient standard deviation pooling.

The apparatus of claim 11,
wherein the standard deviation pooling encodes, as a feature point, the standard deviation along the time axis of the per-frame feature point values.

delete

The apparatus of claim 11,
wherein the gradient max pooling computes gradient values of the per-frame feature point values and encodes their maximum and minimum along the time axis as feature points.

delete

The apparatus of claim 11,
wherein the gradient standard deviation pooling computes gradient values of the per-frame feature point values and encodes their standard deviation along the time axis as a feature point.

delete

A video feature point encoding apparatus comprising:
a feature point receiving unit that receives feature points generated for each frame of a video;
a pooling unit that encodes the feature points separately by sum pooling, max pooling, gradient pooling, standard deviation pooling, gradient max pooling, and gradient standard deviation pooling;
a feature point concatenation unit that concatenates the feature points encoded by the pooling unit into a single vector value; and
a normalization unit that normalizes the vector value calculated by the feature point concatenation unit by mean subtraction and division by the standard deviation.

delete