KR102357000B1 - Action Recognition Method and Apparatus in Untrimmed Videos Based on Artificial Neural Network

Info

Publication number: KR102357000B1
Application number: KR1020200029743A
Authority: KR
Inventors: 김은태; 성홍제; 현준혁
Original assignee: 연세대학교 산학협력단
Priority date: 2020-03-10
Filing date: 2020-03-10
Publication date: 2022-01-27
Also published as: KR20210114257A

Abstract

Disclosed are a method and an apparatus for artificial-neural-network-based action recognition in untrimmed videos. A processor receives a video to be analyzed, selects a subset of frame images from the video for each preset interval along the time axis, recognizes the scene and the action in the selected frame images, and labels the selected frame images with feature values corresponding to the recognized scene and action. In this way, an artificial neural network trained only on clip-level scene and action information from untrimmed videos finds in which frames, and in which spatial regions, the scene and the action appear.

Description

Action Recognition Method and Apparatus in Untrimmed Videos Based on Artificial Neural Network

The present invention relates to an action recognition method and apparatus, and more particularly to a method and apparatus for recognizing actions in untrimmed videos based on an artificial neural network.

Deep learning has recently been widely used to solve a variety of problems in computer vision. A deep-learning-based system, however, needs a large amount of training data. Moreover, to build a system suitable for real-world use, the network must be trained on data that resembles real conditions. Untrimmed video is well suited to this, and scene and action recognition are the most basic capabilities required for real-world deployment. Because training data must carry accurate information, it has to be annotated by people, and because untrimmed video contains many irrelevant frames, finding the frames in which a scene and an action appear takes a great deal of time when building an untrimmed-video dataset. Marking, in every frame, the region where the scene and the action occur is also very labor-intensive. By contrast, labeling each clip (an interval of about 5 seconds) with the scene and action it contains requires comparatively little labor.

An object of the present invention is to provide a method and an apparatus for artificial-neural-network-based action recognition in untrimmed videos, in which a processor receives a video to be analyzed, selects a subset of frame images from the video for each preset interval along the time axis, recognizes the scene and the action in the selected frame images, and labels the selected frame images with feature values corresponding to the recognized scene and action, so that an artificial neural network trained on clip-level scene and action information from untrimmed videos finds in which frames, and in which spatial regions, the scene and the action appear.

Another object is to use a single artificial neural network that derives the scene region and the action region together, applying a multitasking method that solves two or more different tasks, so that the regions of a scene and an action that are highly correlated with each other are derived jointly.

Other objects of the present invention that are not stated explicitly may additionally be considered to the extent that they can be readily inferred from the following detailed description and its effects.

To solve the above problem, an action recognition method according to an embodiment of the present invention includes, by a processor: receiving a video to be analyzed and selecting a subset of frame images from the video for each preset interval along the time axis; recognizing a scene and an action in the selected frame images; and labeling the selected frame images with feature values corresponding to the recognized scene and action.

Here, recognizing the scene and the action in the selected frame images includes: extracting, using a first convolutional neural network, a first feature tensor for scene recognition from the selected frame images; and extracting, using a second convolutional neural network, a second feature tensor for object recognition from the selected frame images.

Here, the second convolutional neural network has been trained on an object recognition dataset similar to the action information to be recognized.

Here, recognizing the scene and the action in the selected frame images further includes: performing an operation using an attention function to reduce the spatio-temporal first feature tensor to a first feature vector of a preset size; and performing an operation using an attention function to reduce the spatio-temporal second feature tensor to a second feature vector of a preset size.

Here, recognizing the scene and the action in the selected frame images further includes: performing a class conversion operation that converts the dimension of the first feature vector or the second feature vector so that the two vectors can be operated on together; and extracting a combined feature vector by summing the first feature vector and the class-converted second feature vector and performing a transformer operation using a multitask transformer unit that extracts features corresponding to the spatio-temporal region.

Here, performing the transformer operation using the multitask transformer unit includes: receiving the first feature vector at a query input unit and generating a fully connected feature value from the input first feature vector and predetermined connection weights; generating a convolutional feature value by applying a convolutional transform to the selected frame images; performing a matrix multiplication of the fully connected feature value and the convolutional feature value; and summing the fully connected feature value and the convolutional feature value on which the matrix multiplication was performed, then normalizing the result.
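The four steps of the transformer operation can be illustrated with a minimal pure-Python sketch. The specification names only a matrix multiplication followed by a sum and a normalization. The scaled dot-product scores, the softmax, and the layer normalization used here are assumptions borrowed from the standard transformer design, and the names `transformer_unit`, `w_q`, and `conv_feats` are hypothetical.

```python
import math

def linear(x, w):
    """Fully connected projection: y[j] = sum_i x[i] * w[i][j]."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def transformer_unit(query_vec, conv_feats, w_q):
    """One illustrative transformer step.

    query_vec  -- the first feature vector (length D_in)
    conv_feats -- N convolutional feature vectors (length D each),
                  one per spatio-temporal location
    w_q        -- D_in x D connection weights (assumed to be learned)
    """
    # Step 1: fully connected feature value from the query vector.
    q = linear(query_vec, w_q)
    d = len(q)
    # Steps 2-3: matrix product between the query and every convolutional
    # feature; scaling and softmax are assumptions, not from the source.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in conv_feats]
    weights = softmax(scores)
    attended = [sum(w * f[j] for w, f in zip(weights, conv_feats)) for j in range(d)]
    # Step 4: sum with the fully connected feature value, then normalize.
    out = layer_norm([a + b for a, b in zip(q, attended)])
    return out, weights

# Toy input: a 2-dimensional query and three spatio-temporal locations.
w_q = [[1.0, 0.0], [0.0, 1.0]]
conv_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
out, weights = transformer_unit([1.0, 0.0], conv_feats, w_q)
```

In this sketch, `weights` holds one value per spatio-temporal location, so the locations with the largest weights are the ones that contributed most to the output vector.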

Here, the transformer operation includes a first through a third transformer operation, and recognizing the scene and the action in the selected frame images further includes classifying the scene and the action by applying a fully-connected layer to the combined feature vector extracted by the third transformer operation.

An action recognition apparatus according to an embodiment of the present invention includes an image acquisition unit that acquires, from outside, a video to be analyzed containing the action information to be recognized, a memory that stores one or more instructions, and a processor that executes the one or more instructions stored in the memory, wherein the processor recognizes a scene and an action from the video to be analyzed based on an artificial neural network.

Here, the processor performs: selecting a subset of frame images from the video to be analyzed for each preset interval along the time axis; recognizing a scene and an action in the selected frame images; and labeling the selected frame images with feature values corresponding to the recognized scene and action.

Here, the processor performs: extracting, using a first convolutional neural network, a first feature tensor for scene recognition from the selected frame images; and extracting, using a second convolutional neural network, a second feature tensor for object recognition from the selected frame images.

Here, the processor performs: an operation using an attention function to reduce the spatio-temporal first feature tensor to a first feature vector of a preset size; and an operation using an attention function to reduce the spatio-temporal second feature tensor to a second feature vector of a preset size.

Here, the processor performs: a class conversion operation that converts the dimension of the first feature vector or the second feature vector so that the two vectors can be operated on together; and extracting a combined feature vector by summing the first feature vector and the class-converted second feature vector and performing a transformer operation using a multitask transformer unit that extracts features corresponding to the spatio-temporal region.

Here, the processor performs: receiving the first feature vector and generating a fully connected feature value from the input first feature vector and predetermined connection weights; generating a convolutional feature value by applying a convolutional transform to the selected frame images; performing a matrix multiplication of the fully connected feature value and the convolutional feature value; and summing the fully connected feature value and the convolutional feature value on which the matrix multiplication was performed, then normalizing the result.

Here, the transformer operation includes a first through a third transformer operation, and the processor classifies the scene and the action by applying a fully-connected layer to the combined feature vector extracted by the third transformer operation.

As described above, according to embodiments of the present invention, a processor receives a video to be analyzed, selects a subset of frame images from the video for each preset interval along the time axis, recognizes the scene and the action in the selected frame images, and labels the selected frame images with feature values corresponding to the recognized scene and action. An artificial neural network trained on clip-level scene and action information from untrimmed videos can thereby find in which frames, and in which spatial regions, the scene and the action appear.

In addition, by using a single artificial neural network that derives the scene region and the action region together, and by applying a multitasking method that solves two or more different tasks, the regions of a scene and an action that are highly correlated with each other can be derived jointly.

Even effects that are not explicitly mentioned here, namely the effects described in the following specification that are expected from the technical features of the present invention and their provisional effects, are treated as if they were described in the specification of the present invention.

FIG. 1 is a block diagram of an action recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the processor of an action recognition apparatus according to an embodiment of the present invention.
FIG. 3 shows the artificial neural network structure of an action recognition apparatus and method according to an embodiment of the present invention.
FIG. 4 shows an example of the labeling performed by an action recognition apparatus and method according to an embodiment of the present invention.
FIG. 5 shows the structure of the MTx operation of an action recognition apparatus and method according to an embodiment of the present invention.
FIG. 6 shows the structure of the AQPr operation of an action recognition apparatus and method according to an embodiment of the present invention.
FIG. 7 shows experimental results obtained with an action recognition apparatus and method according to an embodiment of the present invention.
FIGS. 8 to 10 are flowcharts of an action recognition method according to an embodiment of the present invention.

Hereinafter, the method and apparatus for artificial-neural-network-based action recognition in untrimmed videos according to the present invention are described in more detail with reference to the drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described. Parts irrelevant to the description are omitted to explain the invention clearly, and identical reference numerals in the drawings denote identical members.

The suffixes "module" and "unit" attached to components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not in themselves carry distinct meanings or roles.

The present invention relates to a method and apparatus for artificial-neural-network-based action recognition in untrimmed videos.

FIG. 1 is a block diagram of an action recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, an action recognition apparatus (1) according to an embodiment of the present invention includes a processor (10), an image acquisition unit (20), a memory (30), and an I/O interface (40).

The action recognition apparatus (1) according to an embodiment of the present invention is an apparatus for recognizing actions in untrimmed videos. It applies the QKV (Query, Key, Value) concept used in the Multitask Transformer Network, a kind of memory network among deep learning architectures, in an algorithm by which an artificial neural network trained on clip-level scene and action information from untrimmed videos finds in which frames, and in which spatial regions, the scene and the action appear.

The action recognition apparatus (1) according to an embodiment of the present invention aims to find the specific spatio-temporal regions in which the scene and the action each appear, using an artificial neural network trained on scene and action information labeled per clip in untrimmed videos.

The processor (10) recognizes the scene and the action from the video to be analyzed based on an artificial neural network.

The processor (10) of the action recognition apparatus (1) according to an embodiment of the present invention uses weakly supervised learning, that is, a method in which a network is trained on data carrying little information and then learns to produce more detailed and specific information. A representative example of weakly supervised learning in deep learning is CAM (Class Activation Map). In this method, an artificial neural network is trained on data labeled at image level and then produces results at pixel level. During training, the final layers that classify the feature tensor extracted by the convolution filters consist of GAP (Global Average Pooling) and an FC (Fully Connected) layer. The feature tensor passes through GAP and is pooled over the spatial dimensions into a feature vector, and the FC layer classifies this feature vector, so the network is trained at image level. To produce pixel-level results, GAP is removed from the network. The feature tensor extracted by the convolution filters then passes directly through the FC layer, so the feature vector at each spatial location of the feature tensor is classified by the FC layer, which yields a classification per spatial location (pixel level).
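The relationship between the image-level training path (GAP followed by FC) and the pixel-level inference path (FC applied at every location) can be sketched in a few lines of pure Python. This is a toy illustration of the general CAM idea, not the patent's network; the function names and the small feature map are hypothetical.

```python
def global_average_pool(feature_map):
    """Average a feature map of shape H x W x C over all spatial locations."""
    h, w = len(feature_map), len(feature_map[0])
    c = len(feature_map[0][0])
    return [
        sum(feature_map[y][x][k] for y in range(h) for x in range(w)) / (h * w)
        for k in range(c)
    ]

def fc(vec, weights, bias):
    """Fully connected layer: one score per class (weights[cls][k])."""
    return [sum(wk * vk for wk, vk in zip(row, vec)) + b for row, b in zip(weights, bias)]

def image_level_scores(feature_map, weights, bias):
    """Training path: GAP, then FC, giving one score vector per image."""
    return fc(global_average_pool(feature_map), weights, bias)

def class_activation_map(feature_map, weights, bias):
    """Inference path: GAP removed, FC applied at every spatial location."""
    return [[fc(vec, weights, bias) for vec in row] for row in feature_map]

# Toy 2 x 2 feature map with 2 channels, and a 2-class FC layer.
feature_map = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.0, 2.0]],
]
weights = [[1.0, -1.0], [0.0, 1.0]]
bias = [0.1, -0.2]

scores = image_level_scores(feature_map, weights, bias)
cam = class_activation_map(feature_map, weights, bias)
```

Because the FC layer is affine, averaging the per-location scores in `cam` reproduces `scores`, which is why the same FC weights can be reused for both paths.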

There is also the T-CAM (Temporal Class Activation Map) method, which moves this expansion of information from the spatial domain to the temporal domain. In the network that was trained at image level, the GAP that pooled over the spatial domain is reconfigured to pool over the temporal domain, so that the network is trained at video level. As in CAM, the GAP is then removed when results are produced, so that a classification is obtained for each time step.

Beyond these, CAM is widely used as a representative approach in deep-learning-based weakly supervised learning. The processor (10) of the action recognition apparatus (1) according to an embodiment of the present invention instead uses a weakly supervised learning method that differs substantially from CAM to localize video-level information into spatio-temporal regions. The artificial neural network of the processor (10) is based on the Multitask Transformer Network. The Multitask Transformer Network extends the Action Transformer Network, which classified only the action in a video, so that the scene and the action are classified together.

The Action Transformer Network adapts the Transformer Network, which was used for natural language processing (language translation), so that it can be applied to video. The Transformer Network is an artificial neural network structure that uses the QKV (Query, Key, Value) concept, taken from the Memory Network family of structures, to achieve high-performance natural language processing.

In CAM, the network has a Convolution Filter - GAP - FC structure, and the final GAP-FC part is mandatory. This heavily constrains the design of the network, which makes high performance hard to expect even in the image(video)-level classification that is learned. Poor image(video)-level classification in turn leads to poor localization in the spatio-temporal domain, which is the ultimate goal. The QKV concept, in contrast, can be used anywhere in a network. This design freedom yields high performance, and the concept has been widely adopted recently. In addition, because CAM uses different network structures for training and for producing results (the GAP present during training is removed when results are produced), good performance is hard to expect. Weakly supervised learning based on the QKV concept uses the same network structure for training and for producing results, so a well-trained network can be expected to perform well.

The image acquisition unit (20) acquires, from outside, a video to be analyzed containing the action information to be recognized. Once the video is acquired through a separate input unit, the processor recognizes the scene and the action in it.

The memory (30) may store programs (one or more instructions) for the processing and control performed by the processor (10).

The I/O interface (40) is a device to which a connection medium for linking systems or equipment can be attached. In the present invention it connects the image acquisition unit and the processor.

FIG. 2 is a diagram illustrating the processor of an action recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the processor (10) of the action recognition apparatus (1) according to an embodiment of the present invention includes a scene recognition feature extraction unit (100), an object recognition feature extraction unit (200), an AQPr operation unit (300), a CCM operation unit (400), an MTx operation unit (500), and a feature value classification unit (600).

The artificial neural network structure used in the processor (10) of the action recognition apparatus (1) according to an embodiment of the present invention is described in detail below with reference to FIG. 3.

The processor (10) may be divided into a plurality of modules by function, or a single processor may perform all of the functions.

The processor (10) recognizes the scene and the action from the video to be analyzed based on an artificial neural network. Specifically, it selects frames from the video for each preset interval, recognizes the scene and the action in the selected frames, and labels each selected frame with feature values corresponding to the recognized scene and action.

The scene recognition feature extraction unit (100) uses a first convolutional neural network to extract a first feature tensor for scene recognition from the selected frames.

The object recognition feature extraction unit (200) uses a second convolutional neural network to extract a second feature tensor for object recognition from the selected frames.

The AQPr operation unit (300) includes a first AQPr operation unit (300a) and a second AQPr operation unit (300b). The first AQPr operation unit (300a) performs an operation using an attention function to reduce the spatio-temporal first feature tensor to a first feature vector of a preset size.

The second AQPr operation unit (300b) performs an operation using an attention function to reduce the spatio-temporal second feature tensor to a second feature vector of a preset size.

The CCM operation unit (400) performs a class conversion operation that converts the dimension of the first feature vector or the second feature vector so that the two vectors can be operated on together.

The MTx operation unit (500) extracts a combined feature vector by summing the first feature vector and the class-converted second feature vector and performing a transformer operation using a multitask transformer unit that extracts features corresponding to the spatio-temporal region.

The MTx operation unit (500) includes a first through a third MTx operation unit, which perform the first through third transformer operations. The feature value classification unit (600) classifies the scene and the action by applying a fully-connected layer to the combined feature vector extracted by the third transformer operation.

FIG. 3 shows the artificial neural network structure of an action recognition apparatus and method according to an embodiment of the present invention.

본 발명의 일 실시예에 따른 행동 인식 장치 및 방법은 비 정제 동영상에서 약 지도 학습법으로 장소와 행동이 나타나는 시공간 영역을 찾는 방법을 제안한다.A behavior recognition apparatus and method according to an embodiment of the present invention proposes a method of finding a space-time region in which a place and an action appear in a weakly supervised learning method in an unrefined video.

본 발명의 일 실시예에 따른 행동 인식 장치(1)의 프로세서(10)에서 사용되는 인공신경망 구조인 멀티태스크 트랜스포머 네트워크(Multitask Transformer Network)는 QKV 컨셉을 사용하고 있으며, 이 QKV컨셉을 활용하여 학습할 땐 video-level의 장소 및 행동 정보로 학습하지만, 결과는 장소와 행동이 시공간 영역 중 어디에 해당하는지를 도출해낸다.The Multitask Transformer Network, which is an artificial neural network structure used in the processor 10 of the behavior recognition device 1 according to an embodiment of the present invention, uses the QKV concept, and learns by using the QKV concept. It learns from video-level place and behavior information, but the result derives which place and behavior correspond to in the space-time domain.

또한, 본 발명은 장소와 행동의 영역을 함께 도출해내는 단일 인공신경망을 사용한다. 이렇게 서로 다른 2가지 이상의 태스크(task)를 해결하는 방법을 멀티태스킹(multitasking)이라 하는데, 각각의 task인 장소와 행동은 서로 연관성이 높기 때문에, 장소와 행동 각각 독립적인 인공신경망을 사용하는 것 보다 더 좋은 성능을 기대할 수 있다.In addition, the present invention uses a single artificial neural network that derives the domains of places and actions together. This method of solving two or more different tasks is called multitasking. Since places and actions, which are each task, are highly related to each other, it is better than using an artificial neural network that is independent of places and actions. Better performance can be expected.

As shown in FIG. 3, the Multitask Transformer Network used in the processor (10) of the action recognition apparatus (1) according to an embodiment of the present invention takes as input a segment, which is a fixed interval of an untrimmed video, and outputs the place and action of that segment. The entire untrimmed video cannot be fed in at once because of the memory limit of the graphics card used for the parallel computation that accelerates deep learning. Each input segment therefore consists of 32 frames covering 5 seconds, that is, 32 RGB images sampled at 6.4 fps. The Multitask Transformer Network of FIG. 3 thus outputs the place and action of each 5-second segment, and over the whole untrimmed video it reports the place and action at 5-second intervals.
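The segment sampling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source frame rate of 30 fps, the uniform spacing of the sampled frames, the dropping of a trailing partial segment, and the function name `segment_frame_indices` are all assumptions. The text specifies only 32 frames per 5-second segment.

```python
def segment_frame_indices(num_frames, source_fps=30.0, segment_seconds=5.0,
                          frames_per_segment=32):
    """Return, for each 5-second segment, the source-frame indices to sample.

    Assumptions: frames are sampled uniformly in time and a trailing
    partial segment is dropped.
    """
    frames_in_segment = int(round(source_fps * segment_seconds))  # e.g. 150
    num_segments = num_frames // frames_in_segment
    step = frames_in_segment / frames_per_segment  # source frames per sample
    segments = []
    for s in range(num_segments):
        start = s * frames_in_segment
        segments.append([start + int(k * step)
                         for k in range(frames_per_segment)])
    return segments


# A 15-second clip at the assumed 30 fps yields three segments of 32 frames.
segments = segment_frame_indices(num_frames=450)
effective_fps = 32 / 5.0  # 6.4 sampled frames per second
```

Each inner list can then be used to gather the 32 RGB images that form one network input.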

The Multitask Transformer Network is trained with stochastic gradient descent (SGD), as is common in deep learning, and the CoVieW 2019 dataset is preferably used as the training data for experimentally demonstrating the performance of the present invention. The CoVieW 2019 dataset labels untrimmed videos at 5-second intervals with three items: the place (scene), the action, and an importance score indicating how important that 5-second segment is within the video. An example of the labeling for one untrimmed video in the CoVieW 2019 dataset is shown in FIG. 4 below.

The artificial neural network structure of FIG. 3 is described in detail as follows. First, the place recognition feature extraction unit (100) and the object recognition feature extraction unit (200) extract features from the frames of the segment using a Places365 2D CNN and an ImageNet 2D CNN, which are trained on the Places365 and ImageNet datasets respectively. Each 2D CNN is a ResNet18, an artificial neural network that takes an RGB (3-channel) image of size H×W×3 (height × width × RGB channels) as input and outputs a feature tensor of size H/32×W/32×512. In the present invention, a segment of size 32×224×224×3 (temporal length × height × width × RGB channels) is fed in, and a feature of size 32×7×7×512 is extracted from each 2D CNN.
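The shape arithmetic above can be checked with a short sketch. The helper name `backbone_output_shape` is hypothetical, and the sketch assumes only what the text states: the backbone divides height and width by 32, outputs 512 channels, and leaves the temporal length unchanged because each frame passes through the 2D CNN independently.

```python
def backbone_output_shape(segment_shape, stride=32, channels=512):
    """Shape of the feature tensor for a (T, H, W, 3) input segment.

    Assumption: height and width shrink by `stride`; the temporal length
    is unchanged because every frame is processed on its own.
    """
    t, h, w, c = segment_shape
    if c != 3:
        raise ValueError("expected RGB input with 3 channels")
    if h % stride or w % stride:
        raise ValueError("height and width must be multiples of the stride")
    return (t, h // stride, w // stride, channels)


feature_shape = backbone_output_shape((32, 224, 224, 3))
```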

Here, the Places365 2D CNN is trained on Places365, a place recognition dataset, and therefore extracts place-related features; the ImageNet 2D CNN is trained on ImageNet, an object recognition dataset, and therefore extracts object-related features. Although the ImageNet branch is ultimately used to recognize actions, it uses a 2D CNN trained on an object recognition dataset rather than on an action recognition dataset. The reason is that a 2D (two-dimensional) CNN takes images, not videos, as input and so must be trained on a dataset of images; since no image-based action recognition dataset was available, a 2D CNN trained on an object recognition dataset, the task closest to action recognition, was used. The 2D CNNs are pre-trained on other datasets because the CoVieW 2019 dataset provides only 1,500 untrimmed videos in total, a relatively small amount of data for training a deep learning structure, which causes overfitting. Pre-training on Places365 (about 8 million images) and ImageNet (about 1.2 million images), which provide far more data, allows the 2D CNNs to extract general features.

Then, as shown in FIG. 3, the feature extracted by the 2D CNN (convolution feature) is input to the AQPr (Attentional Query Processor), which implements the first AQPr operation unit (300a) and the second AQPr operation unit (300b), and is also used as an input to the MTx (Multitask Transformer units), which implement the MTx operation unit (500). The operation in AQPr on the 32×7×7×512 feature extracted by the 2D CNN proceeds as in Equation 1.

The terms of Equation 1 are the input feature, the output of AQPr, the sigmoid function, and a trainable weight (trainable parameter); InstanceNorm denotes Instance Normalization, a normalization method that also includes trainable weights. AQPr thus reduces the 32×7×7×512 spatiotemporal feature tensor to a feature vector of size 512. This vector is added to the feature vector extracted by the CCM (class conversion matrix), which implements the CCM operation unit (400) of FIG. 3, and the sum is used as the query input of MTx.

For the CCM of FIG. 3, let n be the number of channels of the feature vector (in FIG. 3, n = 512 where the CCM output is added to the output of AQPr, and n = 256 where it is added to the remaining "MTx->concat->" outputs). The operation in the CCM, which maps its input vector to its output vector, then proceeds as in Equation 2. The remaining terms of Equation 2 are the ReLU function and a trainable weight.

The operation of MTx is performed as shown in FIG. 5 below.

In FIG. 3, solid lines indicate paths along which gradients are propagated during training, and gray dotted lines indicate paths along which gradients are not propagated. Because no gradient is propagated through the CCM, which is where the place and action features (information) are shared, each MTx is trained only on its own task, either place or action. That is, the MTx units drawn without a pattern in the upper row of FIG. 3 specialize in place, and the hatched MTx units in the lower row specialize in action.

To elaborate on the Multitask Transformer Network of FIG. 3: as noted above, information about place and action is shared through the CCM, and AQPr aggregates the spatiotemporal feature tensor into a single feature vector (feature aggregation). As in the operation of FIG. 5 below, MTx finds the region of interest in the memory (the features extracted by the 2D CNN) through a matrix multiplication with the query feature, computed as an inner product at each spatiotemporal position, and then performs a second matrix multiplication to extract the features corresponding to that region. In other words, the first matrix multiplication locates the spatiotemporal region of interest, and the second extracts the features of that region. The subsequent Layer Norm, FC, and other layers act like an ordinary MLP (multi-layer perceptron) and embed the feature in a high-dimensional space.

In the first matrix multiplication, the goal is to find, within the Memory (32×7×7×512, or 32×7×7×128 after Key Embedding), the spatiotemporal region that best matches the feature carrying the information sought for place/action recognition (the Query, 1×512, or 1×128 after the Query Embedding FC). The multiplication produces a 32×7×7×1 feature, which gives the probability that the place/action is present at each spatiotemporal position. In the second matrix multiplication, this 32×7×7×1 feature is multiplied with the 32×7×7×128 feature produced by Value Embedding, yielding a 1×128 feature that represents the place/action over the 32×7×7 spatiotemporal region.

That is, the Query (1×512, or 1×128 after the FC) is the place/action feature extracted previously (the input to the MTx); the feature from Key Embedding (32×7×7×128) is the spatiotemporal feature extracted for matching against the Query; and the feature from Value Embedding (32×7×7×128) is the action/place feature at each spatiotemporal position.
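The two matrix multiplications can be sketched on a toy example as follows. This is an illustration under stated assumptions rather than the network itself: the embedding layers (FC and 1x1x1 Conv) are replaced by random vectors, the spatiotemporal grid is flattened to N = T·H·W positions, no scaling factor is applied before the softmax, and the name `attend` and the toy sizes are hypothetical.

```python
import math
import random


def attend(query, keys, values):
    """Single-query attention over a flattened spatiotemporal grid.

    query:  list of d floats (embedded query, 1 x d)
    keys:   list of N lists of d floats (key embedding, N = T*H*W positions)
    values: list of N lists of d floats (value embedding)
    Returns (weights, pooled): N weights summing to 1 and a d-dim feature.
    """
    # First matrix multiplication: inner product of the query with each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax over all positions (shifted by the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Second matrix multiplication: weighted sum of the value vectors.
    d = len(values[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
    return weights, pooled


random.seed(0)
T, H, W, d = 4, 2, 2, 8  # toy sizes; the text uses 32, 7, 7 and 128
N = T * H * W
query = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
weights, pooled = attend(query, keys, values)
```

Here `weights` plays the role of the 32×7×7×1 map and `pooled` the role of the 1×128 feature.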

FIG. 4 shows an example of the labeling used by the action recognition apparatus and method according to an embodiment of the present invention.

The Multitask Transformer Network is trained with stochastic gradient descent (SGD), as is common in deep learning, and the CoVieW 2019 dataset is used as the training data for experimentally demonstrating the performance of the present invention. The CoVieW 2019 dataset labels untrimmed videos at 5-second intervals with three items: the place (scene), the action, and an importance score indicating how important that 5-second segment is within the video. An example of the labeling for one untrimmed video in the CoVieW 2019 dataset is shown in FIG. 4.

FIG. 5 shows the MTx operation structure of the action recognition apparatus and method according to an embodiment of the present invention.

The MTx operation unit (500) adds the first feature vector to the second feature vector on which the class conversion operation has been performed, and extracts a combined feature vector by performing a transformer operation using a multitask transformer unit.

The multitask transformer unit of the MTx operation unit (500) includes a query input unit (510), a memory input unit (520), and at least one fully-connected layer (530).

The MTx operation unit (500) receives the first feature vector through the query input unit (510) and generates a fully-connected feature value from the input first feature vector and predetermined connection weights.

In addition, a convolution feature value is generated by applying a convolution transform to the selected frame images input through the memory input unit (520).

A matrix multiplication of the fully-connected feature value and the convolution feature value is performed, and normalization is performed on the sum of the fully-connected feature value and the convolution feature value on which the matrix multiplication has been performed.

In FIG. 5, the Query, which implements the query input unit (510), is the input entering from the left of the MTx block in FIG. 3, and the Memory is the input entering from above or below the MTx block. FC denotes a fully connected layer, 1x1x1 Conv denotes a 1×1×1 convolution, and Layer Norm denotes Layer Normalization, which, like the InstanceNorm mentioned above, is a normalization method that includes trainable weights. Softmax denotes the softmax function. Dropout is used to mitigate overfitting. The two operator symbols in FIG. 5 denote matrix multiplication and element-wise sum (ordinary addition), respectively. The red arrow does not denote any special operation; it indicates that the feature tensor extracted at that point is the one used in the weakly supervised learning method of the present invention, described later. The rightmost arrow of the MTx in FIG. 5 is the final output, where a feature vector is extracted.

Then, as shown in FIG. 3, the features extracted by two MTx units pass through "concat", short for concatenation, which means that the two vectors of equal size are simply joined end to end into a single vector of twice that size.

The feature vector obtained after passing three times through the pairs of stacked MTx units in FIG. 3 is finally input to a single FC (fully connected layer) in the feature value classification unit (600), which classifies the place (scene) and the action.

The Multitask Transformer Network of FIG. 3 is an artificial neural network that classifies the place and action of a 5-second segment (video clip), and for training it needs only the place and action labels for every 5 seconds of video. A network trained in this way learns to identify the place and action well even though no one tells it which spatiotemporal region to look at, at the red arrow of FIG. 5. The feature tensor extracted at the red arrow therefore indicates the spatiotemporal region in which the place and action appear. Because no training labels are provided for these regions, using this tensor to represent the regions of place and action constitutes weakly supervised learning: the network is trained only with comparatively easy labels, the action and place for every 5 seconds, and then expresses the spatiotemporal regions in which the place and action appear.

To be precise, the feature tensor Y extracted at the arrow is computed as in Equation 3 from the Query Embedding feature vector, obtained by passing the query through the FC, and the Key Embedding feature tensor, obtained by passing the memory through the 1x1x1 Conv. (T is time, H is height, and W is width, with T = 32, H = 7, and W = 7 as described earlier.)

Here, softmax is the softmax function, which makes the elements of the feature tensor Y sum to 1. Each position in the spatiotemporal domain therefore holds a single scalar value, which represents the degree to which the place or action appears at that position.
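As an illustration of this normalization, one form consistent with the description is given below. It is a sketch under assumptions: the softmax is taken jointly over all T·H·W positions, no scaling factor is applied, and the symbols q (the Query Embedding vector) and K (the Key Embedding tensor) are names chosen here for illustration only.

```latex
Y_{t,h,w} = \frac{\exp\left(q^{\top} K_{t,h,w}\right)}
                 {\sum_{t'=1}^{T}\sum_{h'=1}^{H}\sum_{w'=1}^{W}
                  \exp\left(q^{\top} K_{t',h',w'}\right)},
\qquad
\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{w=1}^{W} Y_{t,h,w} = 1 .
```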

However, this T×H×W = 32×7×7 size is that of the 2D CNN output, whose height (H) and width (W) are 1/32 of those of the original image. The height and width of this tensor are therefore enlarged 32 times by bilinear interpolation to match the original image size. For visualization, low values are shown in blue and high values in red, which gives results such as those in FIG. 7 below.

Finally, the importance values for each of the 32 (time) × 7 (height) × 7 (width) positions, output by the Softmax in MTx, are enlarged to the original size of 32 (time) × 224 (height) × 224 (width) using bilinear interpolation.

Because the CNN applies pooling that halves the height and width five times, it extracts features whose height and width are 1/32 of those of the input image. The 32×7×7 importance values used in the present invention are therefore not compressed in the temporal direction but only in the spatial directions, by a factor of 1/32. Since the size does not change along the time axis, when a single 224×224 frame is considered, each cell of the 7×7 feature covers a 32×32 area of the 224×224 image. For visualization, the 7×7 feature is enlarged to 224×224; any interpolation method may be used as long as it enlarges the 7×7 feature to 224×224. Bilinear interpolation is adopted here, as in CAM, a related weakly supervised learning method.
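The enlargement of one 7×7 importance map to 224×224 can be sketched as follows. The text specifies only that bilinear interpolation is used; the half-pixel alignment, the clamping at the borders, and the function name `bilinear_resize` are assumptions made for this illustration.

```python
def bilinear_resize(grid, out_h, out_w):
    """Enlarge a 2D list of floats to out_h x out_w by bilinear interpolation.

    Assumptions: pixel centers follow the half-pixel convention and
    coordinates are clamped at the borders.
    """
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row center back to input coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along the width, then along the height.
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# One 7x7 importance map (a single time step), enlarged to 224x224.
small = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
large = bilinear_resize(small, 224, 224)
cell_size = 224 // 7  # each 7x7 cell covers a 32x32 patch of the frame
```

Applying the same function to each of the 32 time steps leaves the temporal length unchanged, as the text describes.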

FIG. 6 shows the AQPr operation structure of the action recognition apparatus and method according to an embodiment of the present invention.

As shown in FIG. 3, the feature extracted by the 2D CNN (convolution feature) is input to the AQPr (Attentional Query Processor) and is also used as an input to the MTx (Multitask Transformer units). The operation in AQPr on the 32×7×7×512 feature extracted by the 2D CNN proceeds as in Equation 1 above.

AQPr reduces the 32×7×7×512 spatiotemporal feature tensor to a feature vector of size 512. This vector is added to the feature vector extracted by the CCM (class conversion matrix) of FIG. 3, and the sum is used as the query input of MTx.

FIG. 7 shows experimental results obtained with the action recognition apparatus and method according to an embodiment of the present invention.

FIG. 7 shows experimental results for two segments (video clips). "Input video" shows the original images, "scene" shows the spatiotemporal region of the place, and "action" shows the spatiotemporal region of the action. In the upper of the two results, the place region is spread evenly over most of the frame, whereas the action region is concentrated on the people. This agrees well with the place "concert" and the action "performance" that the segment actually depicts. In the lower result, the regions for place and action are concentrated at certain times only, and within those frames on certain areas. This occurs when the place or action is especially prominent in a particular region of particular frames, so that the spatiotemporal region concentrates on those frames. Accordingly, the action "cycling" is localized on the cyclist, and for the place "park" no region appears in the indoor shots while the outdoor shots are well highlighted.

FIGS. 8 to 10 are flowcharts of the action recognition method according to an embodiment of the present invention.

Referring to FIG. 8, the action recognition method according to an embodiment of the present invention includes: a step (S100) in which a processor receives a video to be analyzed and selects some frame images from it for each preset interval along the time axis; a step (S200) of recognizing a place and an action in the selected frame images; and a step (S300) of labeling the selected frame images with feature values according to the recognized place and action.

Referring to FIG. 9, in the step (S200) of recognizing a place and an action in the selected frames, a first feature tensor for place recognition is extracted from the selected frames using a first convolutional neural network in step S210.

In step S220, a second feature tensor for object recognition is extracted from the selected frames using a second convolutional neural network.

Here, the second convolutional neural network is trained on an object recognition dataset, which is similar to the action information to be recognized.

In step S230, an operation using an attention function is performed to reduce the first feature tensor, defined over the spatiotemporal domain, to a first feature vector of a preset size.

In step S240, an operation using an attention function is performed to reduce the second feature tensor, defined over the spatiotemporal domain, to a second feature vector of a preset size.

In step S250, a class conversion operation is performed that converts the dimension of the first feature vector or the second feature vector so that subsequent operations can be carried out.

In step S260, the first feature vector is added to the second feature vector on which the class conversion operation has been performed, and a combined feature vector is extracted by performing a transformer operation using a multitask transformer unit, which extracts the features corresponding to the spatiotemporal region.

The transformer operation in step S260 includes first to third transformer operations.

In step S270, the place and the action are classified by passing the combined feature vector obtained from the third transformer operation through a fully-connected layer.

Referring to FIG. 10, in the step (S260) of performing the transformer operation using the multitask transformer unit, the first feature vector is received in step S261, and a fully-connected feature value is generated from the input first feature vector and predetermined connection weights.

In step S262, a convolution feature value is generated by applying a convolution transform to the selected frame images.

In step S263, a matrix multiplication of the fully-connected feature value and the convolution feature value is performed.

In step S264, normalization is performed on the sum of the fully-connected feature value and the convolution feature value on which the matrix multiplication has been performed.
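The sum-then-normalize step S264 can be sketched as follows. This is a simplified illustration: it assumes the normalization is Layer Normalization over the feature dimension, omits the trainable scale and shift, and uses an illustrative epsilon; the function name `add_and_layer_norm` is hypothetical.

```python
import math


def add_and_layer_norm(query_feature, attended_feature, eps=1e-5):
    """Element-wise sum of two equal-length vectors, then layer normalization
    (zero mean and approximately unit variance across the feature dimension).

    Assumptions: the trainable scale and shift of Layer Normalization are
    omitted, and eps is an illustrative value.
    """
    if len(query_feature) != len(attended_feature):
        raise ValueError("vectors must have the same length")
    summed = [a + b for a, b in zip(query_feature, attended_feature)]
    mean = sum(summed) / len(summed)
    var = sum((x - mean) ** 2 for x in summed) / len(summed)
    return [(x - mean) / math.sqrt(var + eps) for x in summed]


normalized = add_and_layer_norm([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5])
```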

Finding every spatiotemporal region in which a place and an action occur in an untrimmed video requires a great deal of human labor, whereas labeling the place and action per video unit (about 5 seconds) is comparatively much easier. Using the QKV concept, the present invention proposes a weakly supervised learning method that finds the spatiotemporal regions in which places and actions appear in an untrimmed video from video-level place and action labels alone. By using a single artificial neural network that derives the spatiotemporal regions of both place and action, the present invention can train a robust network that exploits the correlation between place and action, so better performance can be expected than from a network that derives only the place or only the action.

The above description is merely one embodiment of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to implement it in modified forms without departing from its essential characteristics. Therefore, the scope of the present invention is not limited to the embodiments described above and should be construed to include the various embodiments within a scope equivalent to what is set forth in the claims.

Claims

An action recognition method comprising:
receiving, by a processor, an analysis target image and selecting some frame images from the analysis target image for each preset section along the time axis;
recognizing a place and an action in the selected frame images; and
labeling the selected frame images with feature values according to the recognized place and action,
wherein recognizing the place and the action in the selected frame images comprises:
extracting a first feature tensor for place recognition of the selected frame images using a first convolutional neural network; and
extracting a second feature tensor for object recognition of the selected frame images using a second convolutional neural network,
wherein a class conversion matrix (CCM) operation unit is used to share feature information of the place and the action without propagating a gradient, and
wherein, using multi-task transformer unit (MTx) operation units, an MTx belonging to a first group intensively learns the place and an MTx belonging to a second group intensively learns the action.

delete

The method of claim 1,
wherein the second convolutional neural network is trained on an object recognition dataset similar to the action information to be recognized.

The method of claim 1,
wherein recognizing the place and the action in the selected frame images further comprises:
extracting the first feature tensor of the spatiotemporal domain as a first feature vector of a preset size by performing an operation using an attention function; and
extracting the second feature tensor of the spatiotemporal domain as a second feature vector of a preset size by performing an operation using an attention function.

The method of claim 4,
wherein recognizing the place and the action in the selected frame images comprises:
performing a class conversion operation that transforms the first feature vector or the second feature vector through a dimensional transformation so that the two can be operated on together; and
extracting a combined feature vector by summing the first feature vector and the second feature vector on which the class conversion operation has been performed and performing a transformer operation using a multi-task transformer unit that extracts features corresponding to the spatiotemporal domain.

The method of claim 5,
wherein performing the transformer operation using the multi-task transformer unit comprises:
receiving the first feature vector at a query input unit and generating a fully connected feature value from the input first feature vector and a predetermined connection weight;
generating a convolution feature value through a convolutional transformation of the selected frame images;
performing a matrix multiplication operation on the fully connected feature value and the convolution feature value; and
performing normalization by summing the fully connected feature value and the convolution feature value on which the matrix multiplication operation has been performed.

The method of claim 5,
wherein the transformer operation includes a first transformer operation through a third transformer operation, and
wherein recognizing the place and the action in the selected frame images comprises:
classifying the place and the action by applying a fully connected layer to the combined feature vector extracted by performing the third transformer operation.

An action recognition apparatus comprising:
an image acquisition unit that acquires, from outside, an analysis target image containing action information to be recognized;
a memory storing one or more instructions; and
a processor that executes the one or more instructions stored in the memory,
wherein the processor recognizes a place and an action from the analysis target image based on an artificial neural network by:
selecting some frame images from the analysis target image for each preset section along the time axis;
recognizing a place and an action in the selected frame images; and
labeling the selected frame images with feature values according to the recognized place and action,
wherein the processor extracts a first feature tensor for place recognition of the selected frame images using a first convolutional neural network, and
extracts a second feature tensor for object recognition of the selected frame images using a second convolutional neural network,
wherein the processor uses a class conversion matrix (CCM) operation unit to share feature information of the place and the action without propagating a gradient, and
wherein, using multi-task transformer unit (MTx) operation units, an MTx belonging to a first group intensively learns the place and an MTx belonging to a second group intensively learns the action.

delete

The apparatus of claim 8,
wherein the second convolutional neural network is trained on an object recognition dataset similar to the action information to be recognized.

The apparatus of claim 8,
wherein the processor extracts the first feature tensor of the spatiotemporal domain as a first feature vector of a preset size by performing an operation using an attention function, and
extracts the second feature tensor of the spatiotemporal domain as a second feature vector of a preset size by performing an operation using an attention function.

The apparatus of claim 12,
wherein the processor performs a class conversion operation that transforms the first feature vector or the second feature vector through a dimensional transformation so that the two can be operated on together, and
extracts a combined feature vector by summing the first feature vector and the second feature vector on which the class conversion operation has been performed and performing a transformer operation using a multi-task transformer unit that extracts features corresponding to the spatiotemporal domain.

The apparatus of claim 13,
wherein the processor receives the first feature vector and generates a fully connected feature value from the input first feature vector and a predetermined connection weight,
generates a convolution feature value through a convolutional transformation of the selected frame images,
performs a matrix multiplication operation on the fully connected feature value and the convolution feature value, and
performs normalization by summing the fully connected feature value and the convolution feature value on which the matrix multiplication operation has been performed.

The apparatus of claim 13,
wherein the transformer operation includes a first transformer operation through a third transformer operation, and
wherein the processor classifies the place and the action by applying a fully connected layer to the combined feature vector extracted by performing the third transformer operation.