KR102174656B1

KR102174656B1 - Apparatus and method for recognizing activity and detecting activity area in video

Info

Publication number: KR102174656B1
Application number: KR1020190034501A
Authority: KR
Inventors: 손광훈; 이지영
Original assignee: 연세대학교 산학협력단
Priority date: 2019-03-26
Filing date: 2019-03-26
Publication date: 2020-11-05
Also published as: KR20200119386A

Abstract

The present invention provides an apparatus and method for video action recognition and action region detection that are trained on training videos annotated only with action labels, which reduces the time and cost of acquiring training videos, and that perform action localization by recognizing the action of an object in a video and accurately extracting the action region.

Description

Video action recognition and action area detection device and method {APPARATUS AND METHOD FOR RECOGNIZING ACTIVITY AND DETECTING ACTIVITY AREA IN VIDEO}

The present invention relates to an apparatus and method for video action recognition and action region detection, and more particularly to an apparatus and method that can recognize the action of an object in a video and extract the action region.

Recognizing the action of an object in a video and extracting the action region is essential in many video applications, such as video surveillance, video summarization, and video captioning. Various techniques for detecting objects in video have been published, and techniques for recognizing an object's action have advanced greatly as a result. However, accurately extracting the location of an action has remained limited in performance for several reasons, including the diversity of actions and complex backgrounds.

Accordingly, much recent research has addressed action localization, which extracts the action region of an object from a video, using artificial neural networks trained with deep learning techniques. Deep learning has greatly improved the performance of action localization on video.

In existing deep learning techniques, the artificial neural network is trained in a fully supervised manner. Training therefore requires that ground truth labels for the action boundaries of objects in the training video be fully annotated.

However, manually annotating the boundary of every action in a video is very inefficient in both time and cost. Moreover, the boundary of each action may be judged subjectively by each annotator, so the artificial neural network may be trained inaccurately.

Korean Patent Registration No. 10-1900237 (registered September 13, 2018)

An object of the present invention is to provide an apparatus and method for video action recognition and action region detection that can be trained by weakly-supervised learning using training videos annotated only with simple, easily obtained action labels.

Another object of the present invention is to provide an apparatus and method for video action recognition and action region detection that are trained by weakly-supervised learning and can perform action localization on video.

Still another object of the present invention is to provide an apparatus and method for video action recognition and action region detection that can accurately extract the action region of an object from a video.

To achieve the above objects, a video action recognition and action region detection apparatus according to an embodiment of the present invention includes: an object tubelet acquisition unit that, according to a pre-learned pattern estimation method, searches each of a plurality of frames of a video for a bounding box, which is a region containing a predetermined object, and generates an object tubelet by linking corresponding bounding boxes across the frames; a tubelet adjustment unit whose pattern estimation method is learned in advance by weakly-supervised learning using action training videos annotated with action labels, and which obtains a tubelet by adjusting the sizes of the bounding boxes of the object tubelet; a feature map acquisition unit that converts the optimal bounding boxes of the tubelet into a tubelet image by temporal average pooling and generates a feature map by extracting features of the tubelet image according to a pre-learned pattern estimation method; an action weight acquisition unit that obtains an action weight from the feature map and applies it to the corresponding feature map to obtain a weighted feature map; and an action recognition and region determination unit that calculates action class scores indicating the degree to which the weighted feature map corresponds to each of a plurality of predetermined action classes, selects the action corresponding to the tubelet according to the action class scores, and outputs position information of the optimal bounding boxes included in the tubelet.

The action recognition and region determination unit includes an artificial neural network and may obtain the action class score (λ^n(c)) for the n-th tubelet (P_n) among the N tubelets (N is a natural number) included in the video according to the equation

(Here, α^n is the action weight, y^n is the feature map, and w_T(c, d) is the d-th element of the action class classifier for identifying the designated action class (c ∈ {1, ..., C}) and represents a weight of a computation layer of the artificial neural network.)
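One reading of the symbol definitions above is a linear classifier applied to the weighted feature map. The following Python sketch is illustrative only: it assumes that α^n is a scalar, that the feature map y^n is flattened to D values, and that the score is the inner product of w_T(c, ·) with α^n·y^n. The function name `action_class_scores` is hypothetical and does not come from the specification.

```python
def action_class_scores(alpha, y, w_T):
    """Score one tubelet against every action class (assumed linear form).

    alpha: scalar action weight of the tubelet (assumption: scalar)
    y:     feature map flattened to a list of D floats (assumption)
    w_T:   classifier weights as a C x D nested list, w_T[c][d]
    """
    # Weighted feature map: the action weight scales every feature element.
    weighted = [alpha * v for v in y]
    # One score per action class: inner product of class weights and features.
    return [sum(w_cd * v for w_cd, v in zip(w_c, weighted)) for w_c in w_T]
```

With two classes and a two-element feature map, `action_class_scores(0.5, [2.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])` yields one score per class.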

The action recognition and region determination unit may select, from among the action class scores, those equal to or greater than a predetermined reference action class score, output the action class corresponding to each selected score as the action of the object, and output position information of the optimal bounding boxes of the tubelet corresponding to the selected score.

When a plurality of action class scores for the same tubelet are equal to or greater than the reference action class score, the action recognition and region determination unit may, according to a predetermined setting, either output the single action class with the highest action class score or output all of the action classes whose scores are equal to or greater than the reference action class score.
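The selection rule described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `select_actions` and the `top_only` flag (standing in for the "predetermined setting") are hypothetical.

```python
def select_actions(scores, threshold, top_only=True):
    """Pick action classes for one tubelet from its per-class scores.

    scores:    list of action class scores, indexed by class
    threshold: reference action class score
    top_only:  hypothetical flag for the "predetermined setting";
               True keeps only the best class, False keeps every passing class
    """
    # Keep (class index, score) pairs that reach the reference score.
    above = [(c, s) for c, s in enumerate(scores) if s >= threshold]
    if top_only and above:
        # Several classes passed: report only the highest-scoring one.
        return [max(above, key=lambda cs: cs[1])]
    return above
```

An empty result means no class reached the reference score, so no action would be reported for that tubelet.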

The video action recognition and action region detection apparatus may further include a learning unit for weakly-supervised training of the tubelet adjustment unit, the feature map acquisition unit, the action weight acquisition unit, and the action recognition and region determination unit based on action training videos annotated only with action labels. In response to an action training video, the learning unit may obtain a video feature map by summing the weighted feature maps output by the action weight acquisition unit 150 for all action tubelets (P_n), obtain a video action class score from the video feature map, obtain the difference between the video action class score and the action label of the action training video as an action loss, and backpropagate the action loss to perform weakly-supervised learning.
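The video-level loss described above can be sketched as follows. The specification says only that the "difference" between the video action class score and the action label is used as the action loss; the sigmoid and binary cross-entropy below are assumptions chosen for illustration, as are the function name `video_action_loss` and the flattened list representation of the feature maps.

```python
import math

def video_action_loss(weighted_maps, w_T, labels):
    """Video-level action loss from per-tubelet weighted feature maps.

    weighted_maps: one list of D floats per tubelet (weighted feature maps)
    w_T:           C x D classifier weights
    labels:        C values in {0, 1}, the video-level action labels
    """
    dim = len(weighted_maps[0])
    # Video feature map: element-wise sum over all tubelets.
    video_feat = [sum(m[d] for m in weighted_maps) for d in range(dim)]
    loss = 0.0
    for w_c, y_c in zip(w_T, labels):
        score = sum(w * v for w, v in zip(w_c, video_feat))
        p = 1.0 / (1.0 + math.exp(-score))   # assumed sigmoid activation
        p = min(max(p, 1e-7), 1.0 - 1e-7)    # clamp to avoid log(0)
        # Assumed binary cross-entropy between score and label.
        loss -= y_c * math.log(p) + (1 - y_c) * math.log(1.0 - p)
    return loss
```

Because only the video-level label enters the loss, no per-tubelet or per-frame annotation is needed, which is the point of the weakly-supervised setup.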

The object tubelet acquisition unit may, according to a pre-learned pattern estimation method, search each of a plurality of frames of the video for a bounding box, which is a region containing a predetermined object, obtain for each bounding box an object score indicating the probability that the object to be detected is present, and generate the object tubelet using bounding boxes whose object score is equal to or greater than a predetermined reference object score. The learning unit may train the object tubelet acquisition unit by weakly-supervised learning: given object training videos annotated only with object labels, it obtains the difference between the object score obtained by the object tubelet acquisition unit and the object label annotated in the object training video as an object loss and backpropagates it.

To achieve the above objects, a video action recognition and action region detection method according to another embodiment of the present invention includes: searching, according to a pre-learned pattern estimation method, each of a plurality of frames of a video for a bounding box, which is a region containing a predetermined object, and generating an object tubelet by linking corresponding bounding boxes across the frames; obtaining a tubelet by adjusting the sizes of the bounding boxes of the object tubelet according to a pattern estimation method learned by weakly-supervised learning using action training videos annotated with action labels; converting the optimal bounding boxes of the tubelet into a tubelet image by temporal average pooling and generating a feature map by extracting features of the tubelet image according to a pre-learned pattern estimation method; obtaining an action weight from the feature map and applying it to the corresponding feature map to obtain a weighted feature map; and calculating action class scores indicating the degree to which the weighted feature map corresponds to each of a plurality of predetermined action classes, selecting the action corresponding to the tubelet according to the action class scores, and outputting position information of the optimal bounding boxes included in the tubelet.

Accordingly, the video action recognition and action region detection apparatus and method according to embodiments of the present invention are trained on training videos annotated only with action labels, which reduces the time and cost of acquiring training videos. They can also perform action localization by recognizing the action of an object in a video and accurately extracting the action region.

FIG. 1 shows a schematic structure of a video action recognition and action region detection apparatus according to an embodiment of the present invention.
FIG. 2 shows a detailed configuration of the object tubelet acquisition unit of FIG. 1.
FIG. 3 shows an example of an action training video for weakly-supervised learning.
FIG. 4 shows an example of bounding boxes resized by the tubelet adjustment unit of FIG. 1.
FIGS. 5 and 6 show examples of action localization results produced by the video action recognition and action region detection apparatus according to the present embodiment.
FIG. 7 shows a video action recognition and action region detection method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by its practice, reference must be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described therein.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. Parts irrelevant to the description are omitted for clarity, and like reference numerals in the drawings denote like members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that it excludes them, unless specifically stated otherwise. In addition, terms such as "... unit", "...er", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 shows a schematic structure of a video action recognition and action region detection apparatus according to an embodiment of the present invention, and FIG. 2 shows a detailed configuration of the object tubelet acquisition unit of FIG. 1.

Referring to FIG. 1, the video action recognition and action region detection apparatus according to the present embodiment includes an object tubelet acquisition unit 110, a tubelet adjustment unit 120, a dimension conversion unit 130, a feature extraction unit 140, an action weight acquisition unit 150, an action recognition and region determination unit 160, and a learning unit 170.

The object tubelet acquisition unit 110 acquires a video on which action localization is to be performed, and generates object tubelets by detecting regions containing an object in a plurality of frames of the acquired video. The object tubelet acquisition unit 110 includes an artificial neural network whose pattern estimation method is learned in advance by weakly-supervised learning, detects a predetermined object in the frames of the video, and generates an object tubelet by linking the regions in which the object is detected across consecutive frames.

Referring to FIG. 2, the object tubelet acquisition unit 110 may include a video providing unit 111, a frame grouping unit 112, an object detection unit 113, and a tubelet generation unit 114.

The video providing unit 111 acquires a video, composed of a plurality of frames, on which action localization is to be performed. Here, action localization distinguishes the action region of at least one object in the video: it recognizes the action of the object and extracts, within each frame, the region containing the recognized action.

The frame grouping unit 112 divides a video containing many consecutive frames into groups of a predetermined number of frames (here, as an example, T frames, where T is a natural number), denoted (f_t, t = {1, ..., T}). A video generally contains a very large number of frames, and an object rarely stays the same across all of them. In some cases, such as streaming, all frames of the video cannot be obtained at once. The frame grouping unit 112 therefore groups the video into units of a predetermined number of frames so that objects in the video can be detected easily. The number of frames per group can be set in various ways; as an example, frames may be grouped in units of eight.
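The grouping step can be sketched in a few lines. This is an illustration under assumptions: the function name `group_frames` is hypothetical, and keeping a shorter final group when the frame count is not a multiple of T is a choice made here, since the specification does not say how leftover frames are handled.

```python
def group_frames(frames, T=8):
    """Split a frame sequence into consecutive groups of T frames.

    The last group may hold fewer than T frames (assumption: it is kept
    rather than dropped or padded).
    """
    return [frames[i:i + T] for i in range(0, len(frames), T)]
```

For a 20-frame clip and T = 8, this produces groups of 8, 8, and 4 frames, each of which would be passed to the object detector separately.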

The object detection unit 113 includes an artificial neural network whose pattern estimation method is learned in advance. According to the learned pattern estimation method, it detects objects in the grouped frames and detects bounding boxes (B = {B_1, ..., B_T}) for the object regions in which the detected objects appear.

The object detection unit 113 may be trained in advance with the object to be detected designated. In this embodiment, the apparatus is assumed to recognize human actions and detect their action regions, so the object detection unit 113 detects regions containing a person in the grouped frames.

The object detection unit 113 may detect the object region in which an object appears in the grouped frames (f_t) as a rectangular bounding box (B) and output the coordinates of the detected bounding box (B).

The object detection unit 113 may be implemented, as an example, as a convolutional neural network (CNN), and may be trained by weakly-supervised learning for ease of training.

When the object detection unit 113 is trained by fully supervised learning, object detection performance is excellent, but training requires a large number of training videos in which ground truth labels for the object boundaries, that is, the bounding boxes (B), are fully annotated. Because such annotated training videos are essentially produced by hand, they are not easy to obtain.

In contrast, in this embodiment the object detection unit 113 is trained by weakly-supervised learning on object training videos annotated only with simple object labels.

An object training video annotated only with simple object labels is a video for which only an object label is provided for the video as a whole, with no separate annotation of object regions. For example, in this embodiment an object training video is annotated only with an object label such as person, dog, cat, or goat, and no separate annotation is provided for the object region in which the object appears.

Because weakly-supervised learning on videos with only simple object labels does not require manual annotation of object region boundaries, a large number of object training videos can be produced quickly, easily, and at low cost.

The object detection unit 113 may also obtain, for each bounding box (B), an object score (h^*) indicating the probability that the object to be detected is present, and may output the box as a valid bounding box (B) only when the obtained object score (h^*) is equal to or greater than a predetermined reference object score. This improves the reliability of object detection by the object detection unit 113.

According to the pre-learned pattern estimation method, the object detection unit 113 may extract an object feature (x_n) of the bounding box (B_n) detected in the n-th frame (f_n) among the frames (f_t), and obtain a per-frame object score (h^n) from the extracted object feature (x_n) according to Equation 1.

(Here, w_h(d) is the d-th element of the object classifier (w_h) for identifying the designated object and represents a weight of a computation layer (for example, a convolution layer) of the artificial neural network.)

The object detection unit 113 may apply average pooling and a sigmoid function to the per-frame object scores (h^n) obtained for each of the frames (f_t) to obtain, according to Equation 2, the object score (h^*), which is the probability that an object is present in the bounding boxes (B) of the frames (f_t).
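The two scoring steps can be sketched as follows. The sketch assumes that Equation 1 is an inner product between the classifier weights w_h(d) and the object feature, and that Equation 2 averages the per-frame scores before applying the sigmoid; both forms are inferred from the symbol descriptions and are not confirmed here. The function names are hypothetical.

```python
import math

def frame_object_score(x, w_h):
    """Per-frame object score: assumed inner product of weights and feature."""
    return sum(w * v for w, v in zip(w_h, x))

def object_score(frame_features, w_h):
    """Object score over a frame group: average pooling, then sigmoid.

    frame_features: one feature list per frame, all of the same length
    w_h:            object classifier weights, one per feature element
    """
    per_frame = [frame_object_score(x, w_h) for x in frame_features]
    mean = sum(per_frame) / len(per_frame)   # average pooling over frames
    return 1.0 / (1.0 + math.exp(-mean))     # sigmoid maps the mean to (0, 1)
```

The result lies strictly between 0 and 1, so it can be compared directly with a reference object score when deciding whether to keep a bounding box.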

When the frames contain a plurality of objects, the object detection unit 113 may also detect regions for various combinations of those objects. For example, multiple objects may appear in a video spaced apart from each other, adjacent to each other, or partially overlapping. The object detection unit 113 detects objects that are spaced apart as separate object regions. For objects that are adjacent or partially overlapping, it detects a separate object region for each object and can also detect an object region containing the adjacent or overlapping objects together. As an example, the object detection unit 113 may be configured to additionally detect such a combined object region when objects are adjacent or overlapping so that at least parts of their object regions overlap.

The object detection unit 113 detects bounding boxes (B = {B_1, ..., B_T}) corresponding to each of the T frames (f_1, ..., f_T). When multiple object regions are detected in a frame (f_t), it may detect a corresponding number (N) of bounding boxes (B_t^n, where n = 1, ..., N).

Once the object detection unit 113 has detected the object regions, the tubelet generation unit 114 generates an object tubelet by linking the object regions detected for the same object across the frames. That is, a tubelet may be obtained as the set of bounding boxes (B) for the regions containing the same object in the grouped frames.

When bounding boxes (B_t^m, B_{t-1}^n) (where m, n ∈ {1, ..., N}) are obtained in two consecutive frames (f_{t-1}, f_t), the tubelet generation unit 114 obtains a link score (E_link) between bounding box (B_t^m) and bounding box (B_{t-1}^n) according to Equation 3.

(Here, h(B_t^n) is the object score indicating the probability that the bounding box contains the designated object; E_feat(B_t^m, B_{t-1}^n) represents the similarity between the L2-normalized features of bounding box (B_t^m) and bounding box (B_{t-1}^n); E_IoU(B_t^m, B_{t-1}^n) is the overlap score between bounding box (B_t^m) and bounding box (B_{t-1}^n), measured as the Intersection over Union (IoU); and β_1 and β_2 are parameters that control the weights of the feature similarity and the overlap score, respectively.)

The link score (E_link) takes a larger value, indicating a stronger link, the more similar the object features are and the more the object regions overlap in the two consecutive frames (f_{t-1}, f_t).
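A link score with these ingredients can be sketched as follows. The way the terms are combined here (sum of the two object scores plus β_1 times the feature similarity plus β_2 times the IoU) is an assumption made for illustration; the exact form of Equation 3 is not reproduced. Boxes are taken as (x1, y1, x2, y2) corner tuples, the feature similarity is taken as the cosine of the two feature vectors, and all function names are hypothetical.

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def feat_similarity(fa, fb):
    """Inner product of L2-normalized features (cosine similarity)."""
    na = math.sqrt(sum(v * v for v in fa))
    nb = math.sqrt(sum(v * v for v in fb))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(fa, fb)) / (na * nb)

def link_score(box_a, box_b, h_a, h_b, feat_a, feat_b, beta1=1.0, beta2=1.0):
    """Assumed combination: object scores plus weighted similarity and overlap."""
    return (h_a + h_b
            + beta1 * feat_similarity(feat_a, feat_b)
            + beta2 * iou(box_a, box_b))
```

Under this form, two boxes that coincide and carry identical features receive the maximum similarity and overlap terms, matching the statement that similar, overlapping detections are linked most strongly.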

To generate the tubelet for the n-th object in frame (f_t), a linking path with index (π_t(n)) according to Equation 4 is constructed, producing the object tubelet (O^n) expressed by Equation 5.

(Here, l ∈ {1, ..., N} and t ∈ {2, ..., T}.)
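One way to build such a linking path is a greedy frame-by-frame search, sketched below. It assumes that the path index for frame t is the candidate l that maximizes the link score against the box chosen in frame t-1, and that the tubelet is the sequence of boxes along the path; this is an interpretation of the description, not a transcription of Equations 4 and 5. The function name `link_tubelet` and the `score` callback are hypothetical, and indices are 0-based.

```python
def link_tubelet(boxes_per_frame, start_index, score):
    """Greedily link one box per frame into a tubelet.

    boxes_per_frame: list over frames, each a list of candidate boxes
    start_index:     index n of the box in the first frame
    score:           callable score(current_box, previous_box) -> float,
                     standing in for the link score
    Returns (path, tubelet): chosen indices and the chosen boxes.
    """
    path = [start_index]
    for t in range(1, len(boxes_per_frame)):
        prev_box = boxes_per_frame[t - 1][path[-1]]
        candidates = boxes_per_frame[t]
        # Choose the candidate index l with the highest link score.
        best = max(range(len(candidates)),
                   key=lambda l: score(candidates[l], prev_box))
        path.append(best)
    tubelet = [boxes_per_frame[t][i] for t, i in enumerate(path)]
    return path, tubelet
```

Passing a different `start_index` for each box in the first frame yields one tubelet per detected object.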

다만 객체 튜블릿을 획득하는 다양한 방식이 기존에 공개되어 있으므로, 경우에 따라서 객체 튜블릿 획득부(110)는 기존의 방식으로 미리 학습되어 객체 튜블릿(Oⁿ)을 생성 할 수도 있다.However, since various methods of acquiring an object tube have been previously disclosed, in some cases, the object tube acquiring unit 110 may be learned in advance in a conventional manner to generate an object tube (O ⁿ ).

튜블릿 조절부(120) 또한 패턴 추정 방식이 미리 학습된 인공 신경망을 포함하여 객체 튜블릿 획득부(110)에서 획득된 객체 튜블릿(Oⁿ)을 인가받고, 인가된 객체 튜블릿(Oⁿ) 각각에서 경계 박스(B)들의 크기를 조절한다. 즉 객체 튜블릿(Oⁿ) 각각의 크기를 조절하여 튜블릿(P_n)을 획득한다.The tubelet controller 120 also receives the object tubelet O ⁿ obtained from the object tubelet acquisition unit 110 including an artificial neural network in which the pattern estimation method is learned in advance, and receives the applied object tubelet O ⁿ ) Adjust the size of the bounding boxes (B) in each. That is, the size of each of the object tubelets O ⁿ is adjusted to obtain a tubelet P _n .

상기한 바와 같이, 객체 튜블릿 획득부(110)가 약지도 학습되는 경우, 학습용 비디오를 매우 용이하게 획득할 수 있으나, 경계 박스(B)의 검출 성능은 완전 지도 학습 방식보다 낮아질 수 있다. 즉 경계 박스(B)가 객체가 나타나는 객체 영역에 정확하게 대응하지 않고, 불필요한 영역을 포함하여 추출될 수 있다. 이에 튜블릿 조절부(120)는 경계 박스(B)가 정확하게 객체 영역만을 지정하도록 객체 튜블릿의 경계 박스(B)에서 이러한 불필요한 영역을 제거하도록 한다.As described above, when the object tubelet acquisition unit 110 is weakly supervised learning, it is possible to very easily acquire a training video, but the detection performance of the bounding box B may be lower than that of the fully supervised learning method. That is, the bounding box B does not exactly correspond to the object region in which the object appears, and may be extracted including an unnecessary region. Accordingly, the tublet control unit 120 removes the unnecessary area from the bounding box B of the object tublet so that the bounding box B accurately designates only the object area.

The tubelet adjustment unit 120 adjusts the size of a bounding box B by reducing the offsets in width (u_tⁿ) and height (v_tⁿ) relative to the center position of the n-th bounding box (B_tⁿ) of the t-th frame (f_t).

Specifically, the tubelet adjustment unit 120 obtains, according to the pre-learned pattern estimation method, an adjustment width (∇u_tⁿ) for the width (u_tⁿ) and an adjustment height (∇v_tⁿ) for the height (v_tⁿ) of the bounding box (B_tⁿ), and from the obtained adjustment width (∇u_tⁿ) and adjustment height (∇v_tⁿ) obtains a resized optimal bounding box as in Equation 6.

The tubelet adjustment unit 120 may include a convolutional neural network composed of a plurality of convolution layers and at least one activation function layer (here, ReLU as an example) to obtain the adjustment width (∇u_tⁿ) and the adjustment height (∇v_tⁿ).
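The resizing step can be illustrated with a minimal sketch. Equation 6 is not reproduced in this text, so the code assumes the simplest reading of the description: the box keeps its center, and the predicted adjustments are subtracted from its width and height. The `Box` type, the `refine_box` helper, and the clamping are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its center (cx, cy), width u and height v."""
    cx: float
    cy: float
    u: float
    v: float


def refine_box(box: Box, du: float, dv: float) -> Box:
    """Shrink a box about its center by the predicted adjustments.

    Assumption: the adjustments are subtracted from the width and height.
    The exact form of Equation 6 may differ.
    """
    # Clamp the adjustments so the refined box never has a negative size.
    du = min(max(du, 0.0), box.u)
    dv = min(max(dv, 0.0), box.v)
    return Box(box.cx, box.cy, box.u - du, box.v - dv)


loose = Box(cx=50.0, cy=40.0, u=30.0, v=60.0)
tight = refine_box(loose, du=6.0, dv=10.0)
print(tight)
```

In the apparatus the adjustments would come from the convolutional network of the tubelet adjustment unit 120; here they are passed in as plain numbers.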

FIG. 3 shows an example of an action learning video for weakly supervised learning, and FIG. 4 shows an example of bounding boxes resized by the tubelet adjustment unit of FIG. 1.

After the object tubelet acquisition unit 110 has been trained by weakly supervised learning on the object learning video, the tubelet adjustment unit 120 may in turn be trained by weakly supervised learning on the object tubelets that the trained object tubelet acquisition unit 110 obtains from the action learning video. Here, the action learning video is simply a video annotated with an action label. For example, as shown in FIG. 3, it may be a video containing a single action annotated with an action label such as diving, golf, ice dancing, or fencing.

FIG. 4 shows, over a number of consecutive frames, the bounding boxes (B_tⁿ) detected by the object tubelet acquisition unit 110 and the optimal bounding boxes adjusted by the tubelet adjustment unit 120. As shown in FIG. 4, the bounding boxes (B_tⁿ) are detected as regions that are relatively large compared with the region in which the object appears and therefore include margins, whereas the optimal bounding boxes fit the object region very tightly.

The dimension conversion unit 130 performs time average pooling along the time axis on the plurality of optimal bounding boxes of each tubelet (P_n), thereby converting each three-dimensional tubelet (P_n), which contains the plurality of optimal bounding boxes, into a two-dimensional tubelet image.
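A minimal NumPy sketch of time average pooling over a tubelet follows. The text states only that the boxes are averaged along the time axis. Cropping each frame to its box and resizing the crops to a common size before averaging are assumptions, as are the helper names `crop_and_resize` and `tubelet_to_image`, the `(x0, y0, x1, y1)` box layout, and the nearest-neighbour resizing.

```python
import numpy as np


def crop_and_resize(frame, box, size):
    """Crop box = (x0, y0, x1, y1) from an H x W x C frame and resize the
    crop to size x size pixels by nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    ys = np.linspace(y0, y1 - 1, size).round().astype(int)
    xs = np.linspace(x0, x1 - 1, size).round().astype(int)
    return frame[np.ix_(ys, xs)]


def tubelet_to_image(frames, boxes, size=8):
    """Average the resized crops of one tubelet over time, turning a
    T x size x size x C stack into a single size x size x C image."""
    crops = [crop_and_resize(f, b, size) for f, b in zip(frames, boxes)]
    return np.stack(crops, axis=0).astype(np.float64).mean(axis=0)


# Four constant frames with pixel values 0, 1, 2 and 3.
frames = np.stack([np.full((32, 32, 3), t, dtype=np.float64) for t in range(4)])
boxes = [(4, 4, 20, 28)] * 4
image = tubelet_to_image(frames, boxes)
print(image.shape, image[0, 0])
```

Because every pixel is averaged over the four frames, the resulting tubelet image has the value 1.5 everywhere.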

The feature extraction unit 140 receives the tubelet image and extracts its features according to a pre-learned pattern estimation method to obtain a feature map (yⁿ).

The action weight acquisition unit 150 obtains an action weight (αⁿ) from the feature map obtained by the feature extraction unit 140 according to a pre-learned pattern estimation method, and applies the obtained action weight to the corresponding feature map (yⁿ) to obtain a weighted feature map (αⁿyⁿ).

Here, the action weight (αⁿ) is a weight indicating the action level, that is, the degree of motion, of the object in the tubelet image. The action weight acquisition unit 150 obtains the action weight (αⁿ) and applies it to the feature map (yⁿ) because, even if the object tubelet acquisition unit 110 detects an object and obtains an object tubelet (Oⁿ), that tubelet is meaningless for action localization, which detects the action region of an object, if the object in it does not move.

The dimension conversion unit 130 and the feature extraction unit 140 may be integrated into a single feature map acquisition unit.

The action recognition and region determination unit 160 receives the weighted feature map (αⁿyⁿ) and, according to a pre-learned pattern estimation method, classifies it into at least one of a plurality of predetermined action classes. The action recognition and region determination unit 160 may also be implemented as an artificial neural network.

The action recognition and region determination unit 160 first obtains, according to Equation 7, the action class scores (λⁿ(c) = {λⁿ(1), ..., λⁿ(C)}), which indicate the degree to which the weighted feature map (αⁿyⁿ) of each tubelet (P_n) corresponds to each of the plurality of predetermined action classes.

(Here, w_T(c, d) is the d-th element of the action class classifier (w_T) for identifying the designated action class (c ∈ {1, ..., C}), and represents a weight of an operation layer (for example, a convolution layer) of the artificial neural network.)

In Equation 7, sⁿ(c) is a classification score indicating the relevance of the n-th tubelet (P_n) to class (c), and may be expressed as in Equation 8.

(Here, sⁿ = [sⁿ(1), ..., sⁿ(C)]ᵀ is the vector of classification scores over the C action classes.)

Using Equation 8, Equation 7 can be rewritten as Equation 9.
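The equation images are not reproduced in this text. One reading that is consistent with the surrounding definitions, offered here as an assumption rather than as the patent's exact notation, is that the classifier is linear in the feature map, so the scalar action weight factors out of the sum over the feature index d:

```latex
\begin{align}
\lambda^{n}(c) &= \sum_{d} w_T(c,d)\,\alpha^{n}\,y^{n}(d)
  && \text{(assumed form of Equation 7)} \\
s^{n}(c) &= \sum_{d} w_T(c,d)\,y^{n}(d)
  && \text{(assumed form of Equation 8)} \\
\lambda^{n}(c) &= \alpha^{n} \sum_{d} w_T(c,d)\,y^{n}(d) = \alpha^{n}\,s^{n}(c)
  && \text{(assumed form of Equation 9)}
\end{align}
```

Under this reading, the action class score is the classification score of the tubelet scaled by how strongly the tubelet is judged to contain motion.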

That is, as in Equation 9, the action class score (λⁿ(c)) of the n-th tubelet (P_n) for class (c) is obtained from the action weight (αⁿ) and the classification score (sⁿ(c)).
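As a numerical illustration, the sketch below computes per-tubelet action class scores under the assumption that the classifier is a single linear layer, so that the score is the action weight times the classification score. The function name `action_class_scores`, the array shapes, and the example values are hypothetical.

```python
import numpy as np


def action_class_scores(alpha, y, w_t):
    """Per-tubelet action class scores.

    alpha: (N,)   action weight of each tubelet
    y:     (N, D) feature map of each tubelet, flattened to D elements
    w_t:   (C, D) classifier weights w_T(c, d), assumed to be one linear layer
    Returns an (N, C) array whose entry [n, c] is alpha[n] * s[n, c].
    """
    s = y @ w_t.T                 # classification scores s^n(c)
    return alpha[:, None] * s     # scale each tubelet's scores by its weight


alpha = np.array([0.9, 0.1])
y = np.array([[1.0, 2.0], [3.0, 4.0]])
w_t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = action_class_scores(alpha, y, w_t)
print(scores)
```

The second tubelet has larger raw classification scores, but its low action weight pushes its action class scores below those of the first.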

Once the action class scores (λⁿ(c)) of each tubelet (P_n) for the plurality of action classes (c) have been obtained, the action recognition and region determination unit 160 selects the action class scores (λⁿ(c)) that are equal to or greater than a predetermined reference action class score, and extracts the tubelets (P_n) corresponding to the selected action class scores (λⁿ(c)) as action tubelets. It then obtains the optimal bounding boxes and the action class (c) of each extracted action tubelet and outputs them as the result of action localization. That is, it outputs the type of action together with the object region in which the action appears in the video.

A single tubelet (P_n) may have action class scores (λⁿ(c)) equal to or greater than the reference action class score for several action classes (c). That is, one tubelet (P_n) may correspond to a plurality of action classes. In this case, depending on a predetermined setting, the action recognition and region determination unit 160 may output only the one action class (c) with the highest action class score (λⁿ(c)), or may output all of the action classes (c) whose scores are equal to or greater than the reference action class score.
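The selection rule described above can be sketched as follows. The function name `select_action_tubelets`, the `top1` flag that stands in for the "predetermined setting", and the example scores and threshold are all illustrative assumptions.

```python
import numpy as np


def select_action_tubelets(scores, threshold, top1=True):
    """Map each selected tubelet index to its list of action class indices.

    scores:    (N, C) action class scores, one row per tubelet
    threshold: reference action class score
    top1:      if True, keep only the best class per tubelet; otherwise
               keep every class whose score reaches the threshold
    """
    selected = {}
    for n, row in enumerate(scores):
        above = np.flatnonzero(row >= threshold)
        if above.size == 0:
            continue  # no class reaches the threshold: not an action tubelet
        if top1:
            selected[n] = [int(above[np.argmax(row[above])])]
        else:
            selected[n] = [int(c) for c in above]
    return selected


scores = np.array([[0.2, 0.8, 0.6],
                   [0.1, 0.3, 0.2],
                   [0.7, 0.1, 0.9]])
one_per_tubelet = select_action_tubelets(scores, threshold=0.5, top1=True)
all_above = select_action_tubelets(scores, threshold=0.5, top1=False)
print(one_per_tubelet, all_above)
```

The output for each selected tubelet would then be its class indices together with its resized bounding boxes. The second tubelet is dropped because none of its scores reaches the threshold.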

The learning unit 170 is a component for training the action recognition and action region detection apparatus by weakly supervised learning. It is added only while training is performed and may be omitted once training is complete.

The learning unit 170 may first train the object tubelet acquisition unit 110 by weakly supervised learning using the object learning video, and may then use the trained object tubelet acquisition unit 110 and the action learning video to train the tubelet adjustment unit 120, the dimension conversion unit 130, the feature extraction unit 140, the action weight acquisition unit 150, and the action recognition and region determination unit 160 by weakly supervised learning.

The learning unit 170 receives an object score (h*), which indicates the probability that an object to be detected is present in the bounding box (B) obtained when the object learning video is applied to the object tubelet acquisition unit 110. It then obtains the difference between the object score (h*) and the object label annotated in the object learning video as an object loss, and backpropagates the loss to the object tubelet acquisition unit 110 to train it by weakly supervised learning. The learning unit 170 may compute the object loss with a known function such as, for example, the standard multi-label cross-entropy loss.
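For reference, a minimal sketch of a multi-label cross-entropy loss is given below. It assumes that the scores are already probabilities in [0, 1], that the labels are binary indicators of which classes are present, and that the loss is averaged over classes. The text names the loss only as an example of a known function, so these details, the function name, and the example values are assumptions.

```python
import numpy as np


def multilabel_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over classes.

    probs:  per-class probabilities in [0, 1]
    labels: per-class 0/1 indicators from the video-level annotation
    eps:    clipping constant that keeps the logarithms finite
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=np.float64)
    per_class = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(per_class.mean())


labels = [1, 0, 0]
good = multilabel_cross_entropy([0.9, 0.1, 0.2], labels)
bad = multilabel_cross_entropy([0.2, 0.7, 0.6], labels)
print(good, bad)
```

A prediction that agrees with the labels gives a smaller loss than one that contradicts them, which is the signal that backpropagation uses.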

Once the object tubelet acquisition unit 110 has been trained by weakly supervised learning, the learning unit 170 applies the action learning video to the object tubelet acquisition unit 110 and sums, as in Equation 10, the weighted feature maps (αⁿyⁿ) output by the action weight acquisition unit 150 for all action tubelets (P_n), thereby obtaining a video feature map (y*) that represents the features of the action tubelets at the video level.

The learning unit 170 then obtains a video action class score (λ(c)) from the video feature map (y*) according to Equation 11, which is similar to Equation 7.
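The video-level aggregation can be sketched as below. Equations 10 and 11 are not reproduced in this text. The code assumes that the video feature map is the plain sum of the weighted feature maps and that the video action class score applies the same linear classifier weights to it. The function name, shapes, and values are hypothetical.

```python
import numpy as np


def video_action_scores(alpha, y, w_t):
    """Video feature map and video action class scores.

    alpha: (N,)   action weight of each tubelet
    y:     (N, D) feature map of each tubelet, flattened to D elements
    w_t:   (C, D) classifier weights, assumed to be one linear layer
    Returns y_star with shape (D,) and the scores with shape (C,).
    """
    y_star = (alpha[:, None] * y).sum(axis=0)  # sum of weighted feature maps
    return y_star, w_t @ y_star


alpha = np.array([0.9, 0.1])
y = np.array([[1.0, 2.0], [3.0, 4.0]])
w_t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_star, lam = video_action_scores(alpha, y, w_t)
print(y_star, lam)
```

Under the linearity assumption the video score equals the sum of the per-tubelet action class scores, so a video-level label can supervise all tubelets at once.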

Once the video action class score (λ(c)) has been obtained, the learning unit 170 obtains the difference between the video action class score (λ(c)) and the action label of the action learning video as an action loss and backpropagates it, thereby training the tubelet adjustment unit 120, the dimension conversion unit 130, the feature extraction unit 140, the action weight acquisition unit 150, and the action recognition and region determination unit 160 by weakly supervised learning. The learning unit 170 may compute the action loss with a known function such as, for example, the standard multi-label cross-entropy loss.

FIGS. 5 and 6 show examples of the results of action localization performed by the video action recognition and action region detection apparatus according to the present embodiment.

In FIG. 5, (a) and (b) show the results of action localization for basketball and ice dancing, respectively, and in FIG. 6, (a) to (d) show the results of action localization for diving, soccer, basketball, and cycling, respectively. To allow the performance of the apparatus according to the present embodiment to be compared, FIGS. 5 and 6 also show the ground truth labels, which were annotated beforehand by manual work or similar means.

As shown in FIGS. 5 and 6, the video action recognition and action region detection apparatus according to the present embodiment can accurately extract the region in which the action of an object occurs even though it is trained by weakly supervised learning.

FIG. 7 shows a video action recognition and action region detection method according to an embodiment of the present invention.

The video action recognition and action region detection method of FIG. 7 is described with reference to FIGS. 1 to 6. First, the learning unit 170 trains the object tubelet acquisition unit 110 by weakly supervised learning using an object learning video annotated with object labels (S12). The learning unit 170 may train the object tubelet acquisition unit 110 by obtaining, as an object loss, the difference between the object label and the object score (h*) that the object tubelet acquisition unit 110 outputs in response to the object learning video, and backpropagating the loss to the object tubelet acquisition unit 110.

The learning unit 170 then uses the trained object tubelet acquisition unit 110 and an action learning video annotated with the action labels of objects to train the tubelet adjustment unit 120, the dimension conversion unit 130, the feature extraction unit 140, the action weight acquisition unit 150, and the action recognition and region determination unit 160 by weakly supervised learning (S12).

The learning unit 170 obtains the video feature map (y*) according to Equation 10 from the weighted feature maps (αⁿyⁿ) output by the action weight acquisition unit 150 for all action tubelets (P_n), and obtains the video action class score (λ(c)) from the video feature map (y*). It may then perform weakly supervised learning by obtaining the difference between the video action class score (λ(c)) and the action label of the action learning video as an action loss and backpropagating it.

After training, the video action recognition and action region detection apparatus receives a video on which action localization is to be performed, and the object tubelet acquisition unit 110, whose pattern estimation method has been learned by weakly supervised learning, detects in the video the bounding boxes (B), which are regions containing a predetermined object, and obtains object tubelets (Oⁿ) (S21). The number of object tubelets (Oⁿ) obtained may vary with the number of objects in the video.

For each bounding box (B_tⁿ) of the obtained object tubelets (Oⁿ), the tubelet adjustment unit 120 adjusts the size of the bounding box (B) as in Equation 6 according to the pattern estimation method learned by weakly supervised learning, thereby obtaining tubelets (P_n) composed of optimal bounding boxes.

Once the tubelets (P_n) have been obtained, the dimension conversion unit 130 performs time average pooling along the time axis on the plurality of optimal bounding boxes of each tubelet (P_n) to convert it into a tubelet image (S23).

Then, according to the pre-learned pattern estimation methods, the feature extraction unit 140 extracts the features of the tubelet image to obtain a feature map (yⁿ), and the action weight acquisition unit 150 obtains an action weight (αⁿ) from the feature map and applies it to the corresponding feature map (yⁿ) to obtain a weighted feature map (αⁿyⁿ) (S24).

The action recognition and region determination unit 160 obtains, as in Equation 9, the action class scores (λⁿ(c) = {λⁿ(1), ..., λⁿ(C)}), which indicate the degree to which the weighted feature map (αⁿyⁿ) of each tubelet (P_n) corresponds to each of the plurality of predetermined action classes (S25).

Among the obtained action class scores (λⁿ(c)), those equal to or greater than the predetermined reference action class score are selected, and the tubelets (P_n) corresponding to the selected action class scores (λⁿ(c)) are extracted as action tubelets. The optimal bounding boxes and the action class (c) of each extracted action tubelet are then obtained and output (S26).

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible from them.

Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

110: object tubelet acquisition unit 120: tubelet adjustment (regression) unit
130: dimension conversion unit 140: feature extraction unit
150: action weight acquisition unit 160: action recognition and region determination unit
170: learning unit

Claims

A video action recognition and action region detection apparatus comprising:
an object tubelet acquisition unit that searches, according to a pre-learned first pattern estimation method, each of a plurality of frames of a video for a bounding box, which is a region containing a predetermined object, and generates an object tubelet by linking the corresponding bounding boxes across the plurality of frames;
a tubelet adjustment unit in which a second pattern estimation method is pre-learned by weakly supervised learning using an action learning video annotated with an action label, and which obtains a tubelet by adjusting the sizes of the plurality of bounding boxes of the object tubelet;
a feature map acquisition unit that converts the plurality of resized bounding boxes of the tubelet into a tubelet image by time average pooling and generates a feature map by extracting features of the tubelet image according to a pre-learned third pattern estimation method;
an action weight acquisition unit that obtains an action weight from the feature map and applies it to the corresponding feature map to obtain a weighted feature map; and
an action recognition and region determination unit that calculates an action class score indicating the degree to which the weighted feature map corresponds to each of a plurality of predetermined action classes, selects the action corresponding to the tubelet according to the action class score, and outputs position information of the resized bounding boxes included in the tubelet.

The apparatus of claim 1, wherein the tubelet adjustment unit
obtains, according to the second pattern estimation method, an adjustment width (∇u_tⁿ) for the width (u_tⁿ) and an adjustment height (∇v_tⁿ) for the height (v_tⁿ) of each of the plurality of bounding boxes (B_tⁿ) of the object tubelet, and acquires the resized bounding box from the obtained adjustment width (∇u_tⁿ) and adjustment height (∇v_tⁿ) according to the equation

[equation image not reproduced]

The apparatus of claim 1, wherein the action recognition and region determination unit
comprises an artificial neural network and obtains the action class score (λⁿ(c)) for the n-th tubelet (P_n) among N (N being a natural number) tubelets included in the video according to the equation

[equation image not reproduced]

(where αⁿ is the action weight, yⁿ is the feature map, and w_T(c, d) is the d-th element of the action class classifier for identifying the designated action class (c ∈ {1, ..., C}) and represents a weight of an operation layer of the artificial neural network).

The apparatus of claim 1, wherein the action recognition and region determination unit
selects, among the action class scores, an action class score equal to or greater than a predetermined reference action class score, outputs the action class corresponding to the selected action class score as the action of the object, and outputs position information of the resized bounding boxes of the tubelet corresponding to the selected action class score.

The apparatus of claim 4, wherein the action recognition and region determination unit,
when a plurality of action class scores for the same tubelet are equal to or greater than the reference action class score, outputs, according to a predetermined setting, either the one action class with the highest action class score or all of the action classes whose scores are equal to or greater than the reference action class score.

The apparatus of claim 1, further comprising
a learning unit that trains the tubelet adjustment unit, the feature map acquisition unit, the action weight acquisition unit, and the action recognition and region determination unit by weakly supervised learning based on an action learning video annotated only with action labels,
wherein the learning unit
obtains a video feature map by summing the weighted feature maps output by the action weight acquisition unit (150) for all action tubelets (P_n) in response to the action learning video, obtains a video action class score from the video feature map, and performs weakly supervised learning by obtaining the difference between the video action class score and the action label of the action learning video as an action loss and backpropagating it.

The apparatus of claim 6, wherein the object tubelet acquisition unit
searches, according to the first pattern estimation method, each of the plurality of frames of the video for a bounding box, which is a region containing a predetermined object, also obtains an object score indicating the probability that an object to be detected is present in each bounding box, and generates the object tubelet using the bounding boxes whose object score is equal to or greater than a predetermined reference object score, and
wherein the learning unit
receives an object learning video annotated only with object labels, obtains the difference between the object score obtained by the object tubelet acquisition unit and the object label annotated in the object learning video as an object loss, and backpropagates it, thereby training the object tubelet acquisition unit by weakly supervised learning.

A video action recognition and action region detection method comprising:
searching, according to a pre-learned first pattern estimation method, each of a plurality of frames of a video for a bounding box, which is a region containing an object, and generating an object tubelet by linking the corresponding bounding boxes across the plurality of frames;
obtaining a tubelet by adjusting the sizes of the plurality of bounding boxes of the object tubelet according to a second pattern estimation method learned by weakly supervised learning using an action learning video annotated with an action label;
converting the resized bounding boxes of the tubelet into a tubelet image by time average pooling, and generating a feature map by extracting features of the tubelet image according to a pre-learned third pattern estimation method;
obtaining an action weight from the feature map and applying it to the corresponding feature map to obtain a weighted feature map; and
calculating an action class score indicating the degree to which the weighted feature map corresponds to each of a plurality of predetermined action classes, selecting the action corresponding to the tubelet according to the action class score, and outputting position information of the resized bounding boxes included in the tubelet.

The method of claim 8, wherein outputting the position information comprises
obtaining, using an artificial neural network, the action class score (λⁿ(c)) for the n-th tubelet (P_n) among N (N being a natural number) tubelets included in the video according to the equation

[equation image not reproduced]

(where αⁿ is the action weight, yⁿ is the feature map, and w_T(c, d) is the d-th element of the action class classifier for identifying the designated action class (c ∈ {1, ..., C}) and represents a weight of an operation layer of the artificial neural network).

The method of claim 8, wherein outputting the position information comprises:
selecting, among the action class scores, an action class score equal to or greater than a predetermined reference action class score;
outputting the action class corresponding to the selected action class score as the action of the object; and
outputting position information of the resized bounding boxes of the tubelet corresponding to the selected action class score.

The method of claim 8, further comprising
performing weakly supervised learning based on the action learning video annotated only with action labels,
wherein the weakly supervised learning comprises:
obtaining a video feature map (y*) by summing the weighted feature maps (αⁿyⁿ) output for all action tubelets (P_n) obtained in response to the action learning video;
obtaining a video action class score (λ(c)) from the video feature map (y*); and
obtaining the difference between the video action class score (λ(c)) and the action label of the action learning video as an action loss and backpropagating it.

The method of claim 8, wherein generating the object tubelet comprises:
searching, according to the first pattern estimation method, each of the plurality of frames of the video for a bounding box, which is a region containing a predetermined object;
obtaining, together with each bounding box, an object score indicating the probability that an object to be detected is present in that bounding box; and
generating the object tubelet using the bounding boxes whose object score is equal to or greater than a predetermined reference object score,
wherein the weakly supervised learning further comprises
receiving an object learning video annotated only with object labels, obtaining the difference between the object score obtained by the object tubelet acquisition unit and the object label annotated in the object learning video as an object loss, and backpropagating it, thereby training the object tubelet acquisition unit.