KR102263017B1

KR102263017B1 - Method and apparatus for high-speed image recognition using 3d convolutional neural network

Info

Publication number: KR102263017B1
Application number: KR1020190005188A
Authority: KR
Inventors: 이영주; 김영석; 박군호; 이현훈
Original assignee: 포항공과대학교 산학협력단
Priority date: 2019-01-15
Filing date: 2019-01-15
Publication date: 2021-06-08
Also published as: KR20200092509A; KR102263017B9

Abstract

A high-speed image recognition method and apparatus using a 3D CNN (3-dimensional convolutional neural network) are disclosed. The method comprises: inputting each of first video clips, from among the video clips constituting an input video, to a 3D CNN; obtaining, for each of the first video clips, the result values of a softmax function computed through the 3D CNN; calculating a score margin from the obtained result values; comparing the calculated score margin with a preset threshold; and, in response to the comparison, determining whether to input the remaining video clips of the input video, other than the first video clips, to the 3D CNN. The computation speed of image recognition can thereby be improved.

Description

High-speed image recognition method and apparatus using 3D CNN {METHOD AND APPARATUS FOR HIGH-SPEED IMAGE RECOGNITION USING 3D CONVOLUTIONAL NEURAL NETWORK}

The present invention relates to a method and apparatus for high-speed image recognition using a 3D CNN, and more particularly to a technique that speeds up computation by performing the 3D CNN network operation for image recognition on only some of the input video clips and, based on the result, omitting part of the network operation for subsequent video clips.

As artificial intelligence technology advances, deep learning, a class of machine learning algorithms that trains computers to perform human-like reasoning through high-level abstraction, is being actively studied. Deep learning techniques use various artificial neural networks, such as deep neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), which are trained on a training set and then perform inference on input data.

In particular, the convolutional neural network, which has attracted attention for its excellent performance in image classification, includes one or more convolutional layers.

Image recognition using convolutional neural networks is mainly used to identify objects contained in an image or to recognize the actions of an object (or a person). The 3D convolutional neural network (3-dimensional convolutional neural network), which is mainly used for action recognition, takes as input not a single 2D image but a 3D video volume composed of a plurality of 2D images.

A conventional 3D convolutional neural network uses a deep network and therefore requires substantial resources to handle its large computational load and number of parameters, which makes it difficult to implement on the limited resources of small devices, including IoT (Internet of Things) devices.

An object of the present invention, devised to solve the above problem, is to provide a high-speed image recognition method using a 3D CNN.

Another object of the present invention, devised to solve the above problem, is to provide a high-speed image recognition apparatus using a 3D CNN.

To achieve the above object, one aspect of the present invention provides a high-speed image recognition method using a 3D CNN.

The high-speed image recognition method using a 3D CNN may include: inputting each of first video clips, from among the video clips constituting an input video, to a 3D CNN (3-dimensional convolutional neural network); obtaining, for each of the first video clips, the result values of a softmax function computed through the 3D CNN; calculating a score margin from the obtained result values; comparing the calculated score margin with a preset threshold; and, in response to the comparing, determining whether to input the remaining video clips of the input video, other than the first video clips, to the 3D CNN.

The score margin may be the difference between the largest value and the second-largest value among the result values.

The determining of whether to input the remaining video clips to the 3D CNN may include, if the score margin is greater than the threshold, performing image recognition on the input video using only the result values, without inputting the video clips that follow the first video clips to the 3D CNN.

The determining of whether to input the remaining video clips to the 3D CNN may include, if the score margin is less than the threshold, inputting the video clips that follow the first video clips to the 3D CNN.

The obtaining of the result values may further include accumulating and storing, in a memory, the result values obtained by computing the softmax function.

The threshold may be determined according to at least one of the type of terminal performing the image recognition, its computing capability, the type of the input video, the resolution of the input video, and the number of frames constituting the input video.

Each of the video clips constituting the input video may consist of a preset number of temporally consecutive frames from among the plurality of frames constituting the input video.

To achieve the above object, another aspect of the present invention provides a high-speed image recognition apparatus using a 3D CNN.

The high-speed image recognition apparatus using a 3D CNN may include at least one processor and a memory storing instructions that direct the at least one processor to perform at least one step.

The at least one step may include: inputting each of first video clips, from among the video clips constituting an input video, to a 3D CNN (3-dimensional convolutional neural network); obtaining, for each of the first video clips, the result values of a softmax function computed through the 3D CNN; calculating a score margin from the obtained result values; comparing the calculated score margin with a preset threshold; and, in response to the comparing, determining whether to input the remaining video clips of the input video, other than the first video clips, to the 3D CNN.

With the high-speed image recognition method and apparatus using a 3D CNN according to the present invention as described above, computation for subsequent video clips is omitted depending on the score margin, which improves computation speed and lowers system resource requirements.

In addition, image recognition using a 3D CNN can be performed even on various devices with limited resources.

FIG. 1 is an exemplary diagram for explaining a two-dimensional convolutional neural network according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining a 3D CNN according to an embodiment of the present invention.
FIGS. 3A and 3B are histograms for explaining score margin values according to an embodiment of the present invention.
FIG. 4 is a flowchart of a high-speed image recognition method using a 3D CNN according to a first embodiment of the present invention.
FIG. 5 is a flowchart of a high-speed image recognition method that dynamically uses a 3D CNN according to a second embodiment of the present invention.
FIG. 6 is a block diagram of a high-speed image recognition apparatus using a 3D CNN according to the first embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to cover all modifications, equivalents, and substitutes falling within its spirit and technical scope. Like reference numerals are used for like elements throughout the description of the drawings.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be termed a second element, and similarly a second element may be termed a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram for explaining a two-dimensional convolutional neural network according to an embodiment of the present invention.

Referring to FIG. 1, the basic layer structure of a two-dimensional convolutional neural network (hereinafter, CNN) can be seen. Specifically, the 2D CNN may include a convolutional layer (10) that receives an input image and performs a convolution operation to output a feature map, an activation layer (11) that normalizes the output of the convolutional layer (10) using an activation function, and a pooling layer (12) that samples or pools the output of the activation layer (11) to extract representative features. The connected structure of the convolutional layer (10), the activation layer (11), and the pooling layer (12) may be repeated several times. In addition, a fully connected layer (13), which combines the features extracted through the pooling layer (12), may be connected after this structure, followed by a softmax layer (14) that normalizes the output of the fully connected layer (13) using a softmax function.

The convolutional layer (10) may perform a convolution operation between the input image and a filter. The filter may be defined as a region, measured in pixels, whose element values are convolved with the pixels of the input image. The extent of this region in pixels may be referred to as the filter size, and the filter is generally expressed as a matrix. The convolutional layer (10) may repeat the convolution operation between the filter and the input image while sliding the filter across the input image in the horizontal and vertical directions. The interval by which the filter moves at each step is defined as the stride. For example, with a stride of 2, the filter moves two pixels at a time while being convolved with the input image. Because the size of the output image (or feature map) may shrink as convolutional layers (10) are repeated, the convolutional layer may perform padding to adjust the size of the output feature map. Here, padding may be the process of filling the region outside the border of the input image with a specific value (for example, 0).
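The effect of the stride and of zero padding can be illustrated with a minimal pure-Python sketch. It is not the implementation of the embodiments; the function name `conv2d` and its parameters are assumptions introduced here for illustration. As is common in CNN frameworks, the sketch computes a cross-correlation (the filter is not flipped).

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a zero-padded single-channel `image` (lists of lists)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])

    # Zero padding: surround the image with `padding` rows/columns of zeros.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    # Number of filter positions along each axis for the given stride.
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1

    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Multiply the filter with the window it currently covers and sum.
            for a in range(kh):
                for b in range(kw):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
strided = conv2d(image, [[1, 1], [1, 1]], stride=2)        # 2x2 feature map
same_size = conv2d(image, [[0, 0, 0], [0, 1, 0], [0, 0, 0]], padding=1)  # 4x4
```

With a 2x2 filter and a stride of 2, the 4x4 input shrinks to a 2x2 feature map, whereas a 3x3 filter with one pixel of zero padding keeps the 4x4 size, which is the size adjustment that padding provides.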

The activation function in the activation layer (11) transforms features extracted as a value (or matrix) into nonlinear values; a sigmoid function, a ReLU function, or the like may be used. Although FIG. 1 shows the activation layer (11) separately from the convolutional layer (10) for convenience of description, the activation layer (11) may also be interpreted as being included in the convolutional layer (10).

The pooling layer (12) selects features that represent the feature map by performing subsampling or pooling on the extracted feature map. Max pooling, which extracts the largest value in a given region of the feature map, average pooling, which extracts the mean value, or the like may be performed. The pooling layer (12) need not always follow the activation layer (11) and may be applied selectively.
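Max pooling and average pooling can be sketched as follows. This is an illustrative example only; the function name `pool2d`, the non-overlapping square window, and the `mode` parameter are assumptions made here and are not specified in the description.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size windows of a 2D feature map."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, h - size + 1, size):
        row = []
        for j in range(0, w - size + 1, size):
            # Collect the values inside the current window.
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))                 # max pooling
            else:
                row.append(sum(window) / len(window))   # average pooling
        out.append(row)
    return out


fm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
max_pooled = pool2d(fm, size=2, mode="max")
avg_pooled = pool2d(fm, size=2, mode="average")
```

Both variants reduce the 4x4 feature map to 2x2; they differ only in which representative value is kept for each window.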

The fully connected layer (13) is generally located at the end of the CNN. It combines the features extracted through the convolutional layer (10), the activation layer (11), and the pooling layer (12) to determine which class the input belongs to.

Specifically, the fully connected layer (13) may vectorize all pixels of the input feature map, multiply them by the respective parameter values, and aggregate the results to output the class with the largest value. The softmax layer (14) may use a softmax function to express the output values of the fully connected layer (13) as probability values between 0 and 1. For example, the softmax function normalizes all of its inputs to values between 0 and 1 such that the outputs always sum to 1. Although FIG. 1 shows the softmax layer (14) separately from the fully connected layer (13) for convenience of description, it may also be interpreted as being included in the fully connected layer (13).
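The normalizing property of the softmax function described above can be checked with a short sketch. The function name `softmax` and the sample input values are assumptions for illustration; subtracting the maximum before exponentiation is a common numerical-stability measure and is not taken from the description.

```python
import math


def softmax(logits):
    """Map arbitrary real scores to values in (0, 1) that sum to 1."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical fully connected layer outputs for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

The class with the largest fully connected output also receives the largest probability, so the predicted class is unchanged by the normalization.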

FIG. 2 is an exemplary diagram for explaining a 3D CNN according to an embodiment of the present invention.

A three-dimensional convolutional neural network (hereinafter, 3D CNN) can be understood as an artificial neural network that extends the 2D CNN of FIG. 1 by one dimension along the time axis. The 2D CNN of FIG. 1 generally receives an image as input and is mainly used for purposes such as classifying the input image or identifying objects within it based on the image's spatial characteristics.

However, a 2D CNN is limited in that it cannot process video data, which contains temporal information. A 3D CNN, by contrast, performs convolution, pooling, and other operations that also take the temporal component of the video data into account, and can therefore extract features that reflect the temporal properties of the video data.

Specifically, referring to FIG. 2, the input video (20), which is video data composed of a plurality of frames (or pictures) along the time axis, is first divided into a plurality of video clips (21), and each video clip may be used as an input to the 3D CNN (22). Each video clip (21) consists of a preset number of frames (the number of frames the 3D CNN can process at once); for example, a video clip (21) may consist of frames that are consecutive on the time axis. Each frame (f=0 and f=1 in the example of FIG. 2) may consist of K channels, and each channel may be an image with a resolution of W·H. For example, if each frame is an RGB image, there may be three channels, one for each of the R (red), G (green), and B (blue) components.
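Dividing the input video into clips of a preset number of consecutive frames can be sketched as below. The function name `split_into_clips` is hypothetical, and two choices are assumptions not stated in the description: the clips do not overlap, and trailing frames that cannot fill a whole clip are dropped.

```python
def split_into_clips(frames, clip_len):
    """Group frames into non-overlapping clips of `clip_len` consecutive frames.

    Assumption: leftover frames at the end that do not fill a clip are dropped.
    """
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]


# Frame indices stand in for real K-channel W x H frames.
clips = split_into_clips(list(range(10)), clip_len=4)
```

Here a 10-frame input yields two 4-frame clips, and the last two frames are left unused under the stated assumption.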

The structure of the 3D CNN (22) is basically the same as or similar to that of the 2D CNN of FIG. 1, but differs in that it uses the video data along the time axis as well. For example, as in 2D convolution, the convolutional layer of the 3D CNN (22) performs convolution while the filter moves as if scanning the image, but the filter also moves along the time axis by the stride value. The pooling layer of the 3D CNN (22) extends the pooling layer (12) described with reference to FIG. 1 by one dimension along the time axis, so that it can use the pixel values along the time axis as well. The fully connected layer of the 3D CNN (22), like the fully connected layer (13) of FIG. 1, vectorizes all pixels in the last feature map and computes their weighted sum with the parameters, and the softmax layer of the 3D CNN (22) may operate in the same way as the softmax layer (14) of FIG. 1.
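The extension of the convolution along the time axis can be illustrated with a single-channel sketch. The function name `conv3d`, the separate temporal and spatial stride parameters, and the absence of padding are assumptions made for brevity; an actual 3D CNN layer would also sum over the K input channels and apply many filters.

```python
def conv3d(volume, kernel, stride_t=1, stride_s=1):
    """Cross-correlate volume[t][y][x] with kernel[dt][dy][dx], no padding."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(0, T - kt + 1, stride_t):          # move along the time axis
        plane = []
        for y in range(0, H - kh + 1, stride_s):      # move vertically
            row = []
            for x in range(0, W - kw + 1, stride_s):  # move horizontally
                acc = 0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += (volume[t + dt][y + dy][x + dx]
                                    * kernel[dt][dy][dx])
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three 2x2 frames of ones, convolved with a 2x2x2 filter of ones.
clip = [[[1, 1], [1, 1]] for _ in range(3)]
ones = [[[1, 1], [1, 1]] for _ in range(2)]
features = conv3d(clip, ones)
```

Each output value combines pixels from two consecutive frames, which is how temporal information enters the feature map.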

Because the 3D CNN (22) learns with the video data along the time axis taken into account, it can be advantageous for learning human motion that changes over time. However, because the video data along the time axis must be considered together, it requires more parameters and computation than a 2D CNN.

Accordingly, the present invention proposes a method that reduces the computational load of a 3D CNN and performs image recognition at high speed.

FIGS. 3A and 3B are histograms for explaining score margin values according to an embodiment of the present invention.

In a typical 3D CNN, as shown in FIG. 2, softmax values are computed by the same 3D CNN for all video clips constituting the input video, and the video is recognized using the computed softmax values. However, computing softmax values through the 3D CNN for every video clip involves a large amount of computation, which lowers processing speed. In particular, small terminals that can use only limited computing resources struggle to handle such a load, so a method that reduces computation and recognizes video at high speed is needed.

As a means of solving this problem, an embodiment of the present invention defines the concept of a score margin. The score margin may be defined by Equation 1 below.

[Equation 1]
$$\text{score margin} = V_{\text{softmax1}} - V_{\text{softmax2}}$$

Referring to Equation 1, the score margin may be defined as the difference between the largest value ($V_{\text{softmax1}}$) and the second-largest value ($V_{\text{softmax2}}$) among the result values of the softmax function computed so far for the video clips through the 3D CNN. Since the values produced by the softmax function lie between 0 and 1, the score margin also lies between 0 and 1.
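The score margin can be computed directly from a list of softmax values, as in the following minimal sketch. The function name `score_margin` and the sample values are assumptions for illustration only.

```python
def score_margin(softmax_values):
    """Difference between the largest and second-largest softmax values."""
    top1, top2 = sorted(softmax_values, reverse=True)[:2]
    return top1 - top2


confident = score_margin([0.05, 0.90, 0.05])   # one class clearly dominates
ambiguous = score_margin([0.45, 0.40, 0.15])   # two classes are close
```

A margin near 1 means a single class dominates, while a margin near 0 means the two best candidates are nearly tied.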

To determine how strongly the score margin of Equation 1 relates to the success or failure of image recognition, score margins were computed on the UCF101 dataset; the resulting graphs are shown in FIGS. 3A and 3B.

Referring first to FIG. 3A, which shows the distribution of video data (vertical axis) over score margin values (horizontal axis) for cases where image recognition succeeded, by far the most data have a score margin between 0.9 and 1.

Referring to FIG. 3B, which shows the same distribution for cases where image recognition failed, more data are distributed toward smaller score margin values.

Taken together, FIGS. 3A and 3B indicate that if the score margin is sufficiently large, image recognition of the input video can be judged to have succeeded using only the video clips analyzed by the 3D CNN so far, so there is little need to perform further recognition on the subsequent video clips. The following describes a method that evaluates the score margin computed from the video clips input to the 3D CNN so far and, when image recognition is judged to have succeeded, either skips inputting the subsequent video clips to the 3D CNN or analyzes the subsequent video clips using a 3D CNN of lower computational complexity.

도 4는 본 발명의 제1 실시예에 따른 3D CNN을 이용한 고속 영상 인식 방법에 대한 흐름도이다.4 is a flowchart of a high-speed image recognition method using a 3D CNN according to the first embodiment of the present invention.

도 4를 참조하면, 제1 실시예에 따른 3D CNN을 이용한 고속 영상 인식 방법은, 입력 영상을 구성하는 영상 클립들 중 제1 영상 클립들을 각각 3D CNN(3-dimension Convolutional Neural Network)에 입력하는 단계(S100), 상기 제1 영상 클립들 각각에 대하여 상기 3D CNN을 통해 소프트맥스 함수(softmax function)를 연산한 결과값들을 획득하는 단계(S110), 획득된 결과값들을 이용하여 스코어 마진(score margin)을 산출하는 단계(S120), 산출된 스코어 마진을 미리 설정된 임계값과 비교하는 단계(S130) 및 상기 비교하는 단계에 대한 응답으로, 상기 입력 영상을 구성하는 영상 클립들 중 상기 제1 영상 클립들을 제외한 나머지 영상 클립들을 상기 3D CNN에 입력할지 여부를 결정하는 단계(S140)를 포함할 수 있다.Referring to FIG. 4 , in the high-speed image recognition method using the 3D CNN according to the first embodiment, first image clips among image clips constituting an input image are respectively input to a 3D CNN (3-dimension convolutional neural network). Step S100, obtaining result values of calculating a softmax function through the 3D CNN for each of the first video clips (S110), and using the obtained result values to obtain a score margin margin) calculating (S120), comparing the calculated score margin with a preset threshold (S130), and in response to the comparing, the first image among the video clips constituting the input image It may include the step of determining whether to input the remaining video clips excluding the clips to the 3D CNN (S140).

Here, the first video clips may refer to the single first video clip to be input to the 3D CNN, or to a plurality of video clips starting from the first video clip.

The score margin may be the difference between the largest value and the second-largest value among the result values. For example, the score margin may be defined according to Equation 1.
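As an illustration only, the score margin described above can be sketched in Python as follows. The function name `score_margin` and the example score vectors are hypothetical and do not come from the specification; the sketch assumes the result values are given as one list of per-class softmax scores.

```python
from typing import Sequence


def score_margin(softmax_values: Sequence[float]) -> float:
    """Return the gap between the largest and second-largest class scores.

    `softmax_values` is assumed to hold one softmax score per class.
    """
    if len(softmax_values) < 2:
        raise ValueError("need at least two class scores")
    # Sort in descending order and keep the two highest scores.
    top1, top2 = sorted(softmax_values, reverse=True)[:2]
    return top1 - top2


# Hypothetical score vectors for a three-class problem.
confident = score_margin([0.05, 0.85, 0.10])  # one class clearly dominates
ambiguous = score_margin([0.40, 0.45, 0.15])  # two classes are nearly tied
```

A large margin, as in the first vector, corresponds to the case in which recognition is likely to have succeeded; a small margin, as in the second, corresponds to the case in which further video clips may be needed.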

The determining of whether to input the remaining video clips to the 3D CNN (S140) may include, if the score margin is greater than the threshold, performing image recognition on the input video using only the result values, without inputting the video clips after the first video clips to the 3D CNN. Accordingly, the final image recognition result can be derived from the 3D CNN analysis of the first video clips alone, and the analysis of the video clips after the first video clips can be omitted.

The determining of whether to input the remaining video clips to the 3D CNN (S140) may include, if the score margin is less than the threshold, inputting the video clips after the first video clips to the 3D CNN.

Accordingly, each time a video clip after the first video clips is input, the score margin is recalculated and compared with the threshold, which determines whether to input the next video clip or to derive the final image recognition result at the current stage and end image recognition.

The obtaining of the result values (S110) may further include accumulating and storing, in a memory, the result values obtained by computing the softmax function. That is, the result values are continuously accumulated and stored, so that when the next video clip is input to the 3D CNN, the softmax result values computed for that clip are added to the stored result values and the score margin of step S120 is calculated from the combined values.
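The flow of steps S100 to S140, including the accumulation of result values, can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the specification does not state how accumulated softmax values are combined, so the sketch averages them per class; `recognize_early_exit`, `average_scores`, `fake_model`, the parameter `num_first`, and all numeric values are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def average_scores(stored: List[List[float]]) -> List[float]:
    """Per-class mean of the stored softmax vectors (an assumed combination rule)."""
    return [sum(column) / len(stored) for column in zip(*stored)]


def recognize_early_exit(
    clips: Iterable[object],
    model: Callable[[object], Sequence[float]],
    threshold: float,
    num_first: int = 1,
) -> Tuple[int, int]:
    """Return (predicted class index, number of clips actually processed)."""
    stored: List[List[float]] = []  # accumulated softmax results (S110)
    for clip in clips:
        stored.append(list(model(clip)))  # S100/S110: run the 3D CNN on one clip
        if len(stored) < num_first:
            continue  # still collecting the first video clips
        top1, top2 = sorted(average_scores(stored), reverse=True)[:2]
        if top1 - top2 > threshold:  # S120/S130: margin versus threshold
            break  # S140: skip the remaining clips
    if not stored:
        raise ValueError("input video has no clips")
    mean = average_scores(stored)
    label = max(range(len(mean)), key=mean.__getitem__)
    return label, len(stored)


# Hypothetical stand-in for a 3D CNN: a lookup of fixed softmax vectors.
fake_scores = {
    "c0": [0.40, 0.45, 0.15],
    "c1": [0.10, 0.85, 0.05],
    "c2": [0.20, 0.70, 0.10],
}


def fake_model(clip: object) -> Sequence[float]:
    return fake_scores[clip]


label, clips_used = recognize_early_exit(["c0", "c1", "c2"], fake_model, threshold=0.3)
```

In this example the margin after the first clip is small, so a second clip is processed; the averaged margin then exceeds the threshold and the third clip is never passed to the model.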

FIG. 5 is a flowchart of a high-speed image recognition method that dynamically uses a 3D CNN according to a second embodiment of the present invention.

Referring to FIG. 5, the high-speed image recognition method that dynamically uses a 3D CNN (3-dimension Convolutional Neural Network) according to the second embodiment may include: inputting each of first video clips, from among the video clips constituting an input video, to a 3D CNN (S200); obtaining, for each of the first video clips, result values of a softmax function computed through the 3D CNN (S210); calculating a score margin using the obtained result values (S220); comparing the calculated score margin with a preset threshold (S230); and, in response to the comparing, determining whether to input the video clip following the first video clips, from among the video clips constituting the input video, to the same network as the 3D CNN (S240).

The determining of whether to input the next video clip to the same network as the 3D CNN (S240) may include, if the score margin is greater than the threshold, inputting the video clip following the first video clips to a network that is the same as or shallower than the 3D CNN. That is, if the score margin is greater than the threshold, the image recognition result inferred from the video clips input so far is likely to be correct, so inputting the next video clip to a network that is the same as or shallower than the 3D CNN used for the current inference can improve the computation speed. Here, a shallow network may mean a network with a small number of convolutional layers or with low computational complexity.

The determining of whether to input the next video clip to the same network as the 3D CNN (S240) may include, if the score margin is less than the threshold, inputting the video clip following the first video clips to a network deeper than the 3D CNN. That is, if the score margin is less than the threshold, the image recognition result inferred from the video clips input so far is likely to be wrong, so the next video clip is input to a network deeper than the 3D CNN used for the current inference, which can improve recognition accuracy. Here, a deep network may mean a network with a large number of convolutional layers or with high computational complexity.

The obtaining of the result values (S210) may further include accumulating and storing, in a memory, the result values obtained by computing the softmax function. That is, the result values are continuously accumulated and stored, and the softmax result values for the next video clip are added to the previously stored result values to calculate the score margin of step S220.

Once the network to which the next video clip is to be input has been determined in step S240, the video clip following the first video clips is input to that network and steps S210 to S240 are repeated, so that the network used for each of the video clips constituting the input video is determined dynamically.

In addition, if the next video clip in step S240 is the last video clip, the last video clip is input to the network determined in step S240 to compute the softmax function, and the softmax result values computed so far are combined to derive the final image recognition result, after which image recognition ends.
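The dynamic network selection of steps S200 to S240 can be sketched as follows. This is an illustrative sketch under several assumptions that are not in the specification: the candidate networks are held in a list ordered from shallowest to deepest, the selection moves by one level per clip, accumulated softmax values are averaged per class, and a margin equal to the threshold is treated like a margin below it. The names `recognize_dynamic`, `shallow`, `deep`, and `start_level` are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Network = Callable[[object], Sequence[float]]


def recognize_dynamic(
    clips: Iterable[object],
    networks: List[Network],  # assumed to be ordered from shallowest to deepest
    threshold: float,
    start_level: int = 0,
) -> Tuple[int, List[int]]:
    """Return (predicted class index, network level used for each clip)."""
    level = start_level
    stored: List[List[float]] = []  # accumulated softmax results (S210)
    levels_used: List[int] = []
    mean: List[float] = []
    for clip in clips:
        stored.append(list(networks[level](clip)))  # S200/S210
        levels_used.append(level)
        mean = [sum(column) / len(stored) for column in zip(*stored)]
        top1, top2 = sorted(mean, reverse=True)[:2]  # S220
        if top1 - top2 > threshold:  # S230
            level = max(level - 1, 0)  # S240: same or shallower network
        else:
            level = min(level + 1, len(networks) - 1)  # S240: deeper network
    if not stored:
        raise ValueError("input video has no clips")
    # Final result from all softmax values computed so far.
    label = max(range(len(mean)), key=mean.__getitem__)
    return label, levels_used


# Hypothetical stand-ins for a shallow and a deep 3D CNN.
def shallow(clip: object) -> Sequence[float]:
    return [0.50, 0.40, 0.10]


def deep(clip: object) -> Sequence[float]:
    return [0.90, 0.05, 0.05]


label, levels_used = recognize_dynamic(
    ["c0", "c1", "c2", "c3"], [shallow, deep], threshold=0.3
)
```

In this example the first clip yields a small margin on the shallow network, so the second clip goes to the deep network; the accumulated margin then exceeds the threshold and the remaining clips return to the shallow network.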

Meanwhile, the first embodiment of FIG. 4 and the second embodiment of FIG. 5 may be implemented in combination. Both embodiments use the score margin defined in the present invention, either to omit subsequent computation or to change the network that is applied. Accordingly, the score margin calculated through steps S100 to S120 of the first embodiment may be compared with a first threshold to apply step S140 of the first embodiment, and the same score margin may be compared with a second threshold to apply step S240 of the second embodiment. The first threshold and the second threshold may be set to different values, but setting them to the same value is not excluded.

FIG. 6 is a block diagram of a high-speed image recognition apparatus using a 3D CNN according to the first embodiment of the present invention.

Referring to FIG. 6, the high-speed image recognition apparatus 100 using a 3D CNN according to the first embodiment may include at least one processor 110 and a memory 120 that stores instructions directing the at least one processor 110 to perform at least one step.

Here, the at least one processor 110 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed. Each of the memory 120 and the storage device 160 may be composed of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory 120 may be composed of at least one of read-only memory (ROM) and random access memory (RAM).

In addition, the high-speed image recognition apparatus 100 using a 3D CNN may include a transceiver 130 that communicates over a wireless network. The apparatus 100 may further include an input interface device 140, an output interface device 150, a storage device 160, and the like. The components of the apparatus 100 may be connected by a bus 170 and communicate with one another.

Examples of the high-speed image recognition apparatus 100 using a 3D CNN include a communication-capable desktop computer, laptop computer, notebook, smartphone, tablet PC, mobile phone, smart watch, smart glasses, e-book reader, portable multimedia player (PMP), portable game console, navigation device, digital camera, digital multimedia broadcasting (DMB) player, digital audio recorder, digital audio player, digital video recorder, digital video player, and personal digital assistant (PDA).

Meanwhile, the high-speed image recognition method that dynamically uses a 3D CNN according to the second embodiment of FIG. 5 may also be performed on an apparatus having the hardware configuration shown in FIG. 6. In that case, the method of FIG. 5 may be implemented and performed as instructions executed by the processor, in the same manner as in the apparatus of FIG. 6; a detailed description is omitted to avoid repetition.

The methods according to the present invention may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Examples of the computer-readable medium include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as at least one software module to perform the operations of the present invention, and vice versa.

In addition, all or part of the configuration or functions of the method or apparatus described above may be implemented in combination or implemented separately.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

A high-speed image recognition method using a 3D CNN (3-dimension convolutional neural network), comprising:
inputting two or more first video clips, from among video clips constituting an input video, to a 3D CNN, and computing a softmax function through the 3D CNN for each of the two or more first video clips to obtain softmax values;
calculating one score margin indicating a probability of successful recognition by using the softmax values;
comparing the score margin with a preset threshold; and
in response to the comparing, determining whether to input an additional video clip from among the video clips constituting the input video to the 3D CNN, and performing image recognition on the input video using only the first video clips, or using the first video clips and the additional video clip.

The high-speed image recognition method of claim 1,
wherein the score margin is a difference between the largest value and the second-largest value among the softmax values.

The high-speed image recognition method of claim 1,
wherein the determining of whether to input the additional video clip to the 3D CNN comprises:
if the score margin is greater than the threshold, performing image recognition on the input video using only the first video clips, without inputting the video clips after the first video clips to the 3D CNN.

The high-speed image recognition method of claim 1,
wherein the determining of whether to input the additional video clip to the 3D CNN comprises:
if the score margin is less than the threshold, inputting the video clips after the first video clips to the 3D CNN, and performing image recognition on the input video using the first video clips and the additional video clip.

The high-speed image recognition method of claim 1,
wherein the obtaining of the softmax values comprises:
accumulating and storing, in a memory, the softmax values obtained by computing the softmax function.

The high-speed image recognition method of claim 1,
wherein the threshold is determined according to at least one of a type of a terminal performing image recognition, a computing capability of the terminal, a type of the input video, a resolution of the input video, and the number of frames constituting the input video.

The high-speed image recognition method of claim 1,
wherein each of the video clips constituting the input video is composed of a preset number of temporally consecutive frames among a plurality of frames constituting the input video.

A high-speed image recognition device using a 3D CNN (3-dimension convolutional neural network), comprising:
at least one processor; and
a memory storing instructions that direct the at least one processor to perform at least one step,
wherein the at least one step comprises:
inputting two or more first video clips, from among video clips constituting an input video, to a 3D CNN, and computing a softmax function through the 3D CNN for each of the two or more first video clips to obtain softmax values;
calculating one score margin indicating a probability of successful recognition by using the softmax values;
comparing the score margin with a preset threshold; and
in response to the comparing, determining whether to input an additional video clip from among the video clips constituting the input video to the 3D CNN, and performing image recognition on the input video using only the first video clips, or using the first video clips and the additional video clip.

The high-speed image recognition device of claim 8,
wherein the score margin is a difference between the largest value and the second-largest value among the softmax values.

The high-speed image recognition device of claim 8,
wherein the determining of whether to input the additional video clip to the 3D CNN comprises:
if the score margin is greater than the threshold, performing image recognition on the input video using only the first video clips, without inputting the video clips after the first video clips to the 3D CNN.

The high-speed image recognition device of claim 8,
wherein the determining of whether to input the additional video clip to the 3D CNN comprises:
if the score margin is less than the threshold, inputting the video clips after the first video clips to the 3D CNN, and performing image recognition on the input video using the first video clips and the additional video clip.

The high-speed image recognition device of claim 8,
wherein the obtaining of the softmax values comprises:
accumulating and storing, in a memory, the softmax values obtained by computing the softmax function.

The high-speed image recognition device of claim 8,
wherein the threshold is determined according to at least one of a type of a terminal performing image recognition, a computing capability of the terminal, a type of the input video, a resolution of the input video, and the number of frames constituting the input video.

The high-speed image recognition device of claim 8,
wherein each of the video clips constituting the input video is composed of a preset number of temporally consecutive frames among a plurality of frames constituting the input video.