KR20190054702A

KR20190054702A - Method and apparatus for detecting action of object in video stream

Info

Publication number: KR20190054702A
Application number: KR1020170151575A
Authority: KR
Inventors: 이성환; 조남규; 윤다혜
Original assignee: 고려대학교 산학협력단
Priority date: 2017-11-14
Filing date: 2017-11-14
Publication date: 2019-05-22
Also published as: KR102008290B1

Abstract

The present invention provides a method for identifying an action of an object from an image, which is capable of correctly detecting the action, and a device thereof. More specifically, the method for identifying an action of an object from an image comprises: a step of acquiring an image segment; a step of determining whether an action of an object exists in the image segment through a first neural network; a step of generating one or more future images successive to the image segment through a second neural network to form an integrated image segment when the action of the object exists in the image segment; and a step of detecting an action type and time point of the object in the integrated image segment through a third neural network, wherein the first neural network is independently learned and the second and third neural networks are dependently learned.

Description

METHOD AND APPARATUS FOR DETECTING ACTION OF OBJECT IN VIDEO STREAM

The present invention relates to a method and apparatus for recognizing, from a video, the action of an object included in the video.

In the field of computer vision, action recognition technology receives color (RGB) or depth images from an image sensor and classifies the actions of objects in the video based on the resulting image information. Because image sensors can be easily installed and operated in a wide variety of places, they have recently been used for monitoring dangerous situations, detecting specific events, and the like. To apply action recognition across application areas, however, it is important to decide which features of the video to use according to the situation in which the technology is to be used. Features that can be used for video fall into two broad categories: hand-crafted features designed directly by researchers, and deep features extracted through deep learning. Widely used hand-crafted features include the Dense Trajectories descriptor, which represents a person's trajectory information, and the Histogram of Oriented Gradients (HOG) descriptor, which represents appearance information. For deep features, a Recurrent Neural Network (RNN), which learns temporal information, and a Convolutional Neural Network (CNN), which learns appearance information, are mainly used. Recently, owing to rapid advances in deep learning, deep-learning-based descriptors have become the representative choice.

Action recognition technology has thus advanced alongside deep learning and now achieves high performance. However, because conventional action recognition uses videos in which one specific action occurs as training data, it remains limited when applied to real-life video, in which various actions occur in succession or which contains frames where no action occurs. That is, recognizing such continuous actions requires a preprocessing step that cuts the video into clips, each containing only a single action, before the recognition technique is applied. Recognizing continuous actions without this preprocessing therefore requires a technique that detects the portions of the video in which an action occurs and, at the same time, recognizes which action occurred.

U.S. Patent No. 9,648,035 (Title of invention: User behavioral risk assessment)

The present invention is intended to solve the above-described problems of the prior art, and aims to provide a method and system that detect actions more accurately by learning, for action recognition in real-time video, the start and end points of the interval in which an action occurs together with the action type.

As a technical means for achieving the above technical object, a method according to a first aspect of the present invention, by which an object recognition apparatus recognizes an action of an object in a video, comprises: acquiring an image segment; determining, through a first neural network, whether an action of an object exists in the image segment; if an action of an object exists in the image segment, generating, through a second neural network, one or more future image frames successive to the image segment to form an integrated image segment; and detecting, through a third neural network, an action type and an action time point of the object in the integrated image segment. Here, the first neural network is trained independently, and the second neural network and the third neural network are trained dependently.

In addition, an action recognition apparatus according to a second aspect of the present invention includes: a memory storing a program for recognizing an action of an object included in a video; and a processor for executing the program. As the program is executed, the processor acquires an image segment; determines, through a first neural network, whether an action of an object exists in the image segment; if an action of an object exists in the image segment, generates, through a second neural network, one or more future image frames successive to the image segment to form an integrated image segment; and detects, through a third neural network, the action type and action time point of the object in the integrated image segment. Here, the first neural network is trained independently, and the second neural network and the third neural network are trained dependently.

In addition, a third aspect of the present invention provides a computer-readable recording medium on which a program for performing the method of the first aspect on a computer is recorded.

According to the above-described means for solving the problem, an embodiment of the present invention can, by generating future image frames, detect dangerous situations or recognize specific situations in an environment where video is input in real time. In addition, an embodiment of the present invention classifies a video into intervals in which an action occurs and intervals in which it does not, thereby detecting the intervals in which an action occurs, and further detects the action time point (i.e., the end or start of the action, and the like) of an action in the video. Because actions are thus detected repeatedly by relying only on visual information, without prior learning about time, the embodiment can be applied efficiently to robot control and other tasks that require continuous observation.

FIG. 1 shows an action recognition apparatus according to an embodiment of the present invention.
FIG. 2 shows an example of a neural network according to an embodiment of the present invention.
FIG. 3 shows the configuration of an action recognition apparatus according to an embodiment of the present invention.
FIG. 4 shows an example of a detection network according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a method by which the processor of FIG. 3 recognizes an action of an object in a video, according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a processor according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. Also, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

Hereinafter, an embodiment of the present invention is described in detail with reference to the drawings.

FIG. 1 shows an action recognition apparatus 10 according to an embodiment of the present invention.

As shown in FIG. 1, the action recognition apparatus 10 recognizes an action of an object from a real-time video 12 using a neural network 11. The neural network 11 may be a set of algorithms that, using the results of statistical machine learning, extracts various kinds of attribute information from the real-time video and identifies the action of an object in the real-time video based on the extracted attribute information. The neural network 11 may also be implemented as software, an engine, or the like for executing this set of algorithms. A neural network implemented as software, an engine, or the like may be executed by a processor in the action recognition apparatus 10.

Meanwhile, the action recognition apparatus 10 may be an imaging device such as a camera, a closed-circuit television (CCTV), or a black box, but is not limited thereto; it may be a computing device that includes an imaging device or that communicates with an imaging device to receive video. Non-limiting examples include a smartphone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a laptop, a media player, a micro server, an IoT hub, an IoT server, a navigation device, a kiosk, and a home appliance.

FIG. 2 shows an example of the neural network 11 according to an embodiment of the present invention.

As shown in FIG. 2, the neural network 11 according to an embodiment of the present invention includes a proposal network 210, a future image generator network (future frame generation network) 220, and a detection network 230.

First, the proposal network 210 receives an image segment 13 and, by abstracting the various attributes contained in the image segment, can determine whether the image segment 13 contains an action of an object (i.e., the presence or absence of an action). Here, the image segment 13 is a set of a plurality of consecutive image frames and may be acquired in real time. Abstracting the attributes in the image segment may mean detecting attribute information in the image segment, such as object appearance information and object motion information, and determining, from among the detected attribute information, the key attributes that can represent the image segment.

The proposal network 210 may include a plurality of abstraction layers 211 and a binary classifier 212, and the action recognition apparatus 10 may extract attribute information from the image segment 13 based on the plurality of abstraction layers 211. Non-limiting examples of the attribute information include polygons, edges, depth, sharpness, saturation, brightness, and the spatiotemporal change values of these. The action recognition apparatus 10 obtains a feature map based on the attribute information extracted at the last abstraction layer. Here, the feature map is a combination of the extracted attribute information and may include at least one attribute vector representing the attributes of the image segment.

The feature map may be applied as input data to the binary classifier 212. The binary classifier 212 can determine, based on the feature map, whether an action of an object exists in the image segment. The action recognition apparatus 10 can obtain the presence or absence of an object action as the result of inputting the feature map to the binary classifier 212.

The future image generator network 220 operates when the result of the proposal network 210 indicates that an action of an object exists in the image segment 13. The future image generator network 220 generates one or more image frames successive to the image segment.

Specifically, the future image generator network 220 includes a plurality of abstraction layers 221 that extract various kinds of attribute information from the image segment 13, and a plurality of generator layers 222 that generate, based on the extracted attribute information, one or more future image frames successive to the image segment. Non-limiting examples of the attribute information of the image segment include polygons, edges, depth, sharpness, saturation, brightness, and the spatiotemporal change values of these.

The action recognition apparatus 10 may obtain a feature map combining the attribute information extracted at the last of the plurality of abstraction layers 221 and apply it as input data to the first generator layer. The plurality of generator layers 222 generate future image frames by transforming the attribute information based on previously learned information or by synthesizing feature information of other pre-stored images. Here, the previously learned information is the spatiotemporal change value of each piece of attribute information according to the object's action, and is expressed as the parameters of each generator layer. Non-limiting examples of the feature information of the other images include polygons, edges, depth, sharpness, saturation, brightness, and the spatiotemporal change values of these.

In addition, depending on the implementation, the future image generator network 220 may further include a plurality of discriminator layers (not shown) that determine the probability that a generated future image frame is substantially continuous with the image segment and feed this probability back to the abstraction layers 221 and the generator layers 222. Such a future image generator network may be, for example, a Laplacian Generative Adversarial Network, but is not limited thereto.

The action recognition apparatus 10 then forms an extended image segment by appending the one or more future image frames obtained through the future image generator network 220 to the image segment 13.

Next, the detection network 230 receives the extended image segment and, by abstracting the various attributes contained in it, can detect the action type and action time point contained in the extended image segment. Here, the action types may be determined by the training data of the detection network 230; non-limiting examples include a person's exercise, attack, accident, or injury, and a machine's failure or accident. The action time point indicates where, between the start and end of the action, the action of the object contained in the extended image segment falls.

The detection network 230 includes a plurality of abstraction layers 231 and one or more classifiers 232 that determine the action type and the action time point. The action recognition apparatus 10 may extract attribute information from the extended image segment based on the plurality of abstraction layers 231, and apply the feature map obtained from the attribute information extracted at the last abstraction layer as input data to the classifiers 232. For example, the detection network 230 may apply the feature map as input data to a first classifier (not shown) to identify the action type of the object in the integrated image segment, and apply the result of the first classifier and/or the feature map as input data to a second classifier (not shown) to detect whether the time point at which the object's action appears in the integrated image segment corresponds to the end or the start of that action type. Here, the first classifier may be a multi-class classifier and the second classifier may be a binary classifier, but they are not limited thereto; the first and second classifiers may also be implemented as a plurality of binary classifiers.
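The two-classifier arrangement can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the feature map is reduced to a flat feature vector, both classifiers are plain linear models, and the names `classify_action_type`, `classify_boundary`, `detect`, the label set `ACTION_TYPES`, and all weights are assumptions made only for this example.

```python
import math
from typing import List, Sequence, Tuple

ACTION_TYPES = ["exercise", "attack", "accident"]  # hypothetical label set


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """One output score per weight row (bias omitted for brevity)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def classify_action_type(features: Sequence[float],
                         weights: Sequence[Sequence[float]]) -> List[float]:
    """First classifier: multi-class probabilities over action types."""
    return softmax(linear(weights, features))


def classify_boundary(features: Sequence[float], type_probs: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Second classifier: binary probability that the segment lies at the
    start or end of the action. Its input is the feature vector
    concatenated with the first classifier's output."""
    x = list(features) + list(type_probs)
    (score,) = linear([weights], x)
    return 1.0 / (1.0 + math.exp(-score))


def detect(features: Sequence[float],
           type_weights: Sequence[Sequence[float]],
           boundary_weights: Sequence[float]) -> Tuple[str, float, float]:
    """Return (action type, its probability, boundary probability)."""
    probs = classify_action_type(features, type_weights)
    best = max(range(len(probs)), key=probs.__getitem__)
    p_boundary = classify_boundary(features, probs, boundary_weights)
    return ACTION_TYPES[best], probs[best], p_boundary
```

Feeding the first classifier's output into the second corresponds to the "result of the first classifier and/or the feature map" option in the description; passing only the feature vector would be an equally valid reading.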

FIG. 3 shows the configuration of the action recognition apparatus 10 according to an embodiment of the present invention. The action recognition apparatus 10 includes a memory 310 and a processor 320.

The memory 310 may store programs (one or more instructions) for the processing and control of the processor 320. The programs stored in the memory 310 may be divided into a plurality of modules according to function. According to an embodiment, the memory 310 may store a program that recognizes, from a video, the action of an object included in the video. The program may include a neural network module.

The neural network module may include the plurality of layers and classifiers included in the proposal network, the future image generator network, and the detection network. Each abstraction layer included in these networks may include a 3D convolutional layer, comprising one or more instructions for extracting attribute information of an image from the input video to generate a feature map, and/or a pooling layer, comprising one or more instructions for determining a representative value from the extracted attribute information. Each generator layer included in the future image generator network may include a deconvolutional layer comprising one or more instructions for transforming the extracted attribute information based on the feature map or for synthesizing feature information of other pre-stored images.
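To make the roles of the 3D convolutional layer and the pooling layer concrete, the following is a minimal pure-Python sketch, assuming a single-channel volume indexed as [time][row][column], stride 1, and no padding. The function names `conv3d_valid` and `max_pool_global` are hypothetical, and the operation is the cross-correlation that deep learning frameworks conventionally call convolution.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][col]


def conv3d_valid(volume: Volume, kernel: Volume) -> Volume:
    """Single-channel 3D cross-correlation, stride 1, no padding."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                # Weighted sum over the kernel's time, height and width extent.
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def max_pool_global(volume: Volume) -> float:
    """Pooling in its simplest form: one representative value for the volume."""
    return max(v for plane in volume for row in plane for v in row)


# A kernel spanning two frames responds to change over time at each pixel.
temporal_diff = [[[-1.0]], [[1.0]]]
frames = [[[0.0, 0.0]], [[0.0, 1.0]], [[0.0, 1.0]]]  # 3 frames of 1x2 pixels
response = conv3d_valid(frames, temporal_diff)
```

Because the kernel extends along the time axis, a single layer can respond to motion as well as appearance, which is consistent with the "spatiotemporal change values" listed among the attribute information.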

FIG. 4 shows an example of a detection network according to an embodiment of the present invention.

The processor 320 may include one or more cores (not shown), a graphics processing unit (not shown), and/or a connection path (e.g., a bus) for exchanging signals with other components.

According to an embodiment, the processor 320 may process in parallel the one or more instructions included in each network in the neural network module.

Hereinafter, a method by which the processor 320 recognizes an action of an object in a video is described with reference to FIG. 5.

First, the processor 320 acquires an image segment from a video (S510). The video may be acquired in real time from an image sensor provided in the action recognition apparatus 10, or may be received from an external imaging device. The image segment is a set of consecutive image frames.

Thereafter, the processor 320 determines, through the proposal network, whether an action of an object exists in the image segment (S520).

As described above, the proposal network includes a plurality of abstraction layers and a binary classifier. The processor 320 extracts attribute information of the image segment through the plurality of abstraction layers. For example, the processor 320 may extract straight-line information from an image frame using a first layer among the plurality of layers. The device may then apply the extracted straight-line information as input data to a second layer connected to the first layer, and extract the change values of the straight lines from the second layer. In this manner, the device can extract various kinds of attribute information by inputting the image segment to each of the plurality of layers or by applying the attribute information extracted from the preceding layer as input data. The processor 320 then combines the attribute information extracted at the last layer to extract a feature map.

The processor 320 may apply the extracted feature map as input data to the binary classifier to determine whether an action of an object exists in the image segment.

If it is determined that no action of an object exists in the image segment, the processor 320 repeats steps S510 and S520 for the next image segment. In this way, the processor 320 can minimize unnecessary computational load. Meanwhile, the next image segment may overlap the current image segment by including some of its image frames (for example, the last image frame of the current image segment).
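The overlapping acquisition of successive image segments can be sketched as a sliding window over a frame stream. This is only an illustration under assumptions: the segment length and overlap are free parameters that the description does not fix, the helper name `iter_segments` is hypothetical, and a trailing group of frames shorter than one segment is simply dropped.

```python
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def iter_segments(frames: Iterable[Frame], length: int,
                  overlap: int = 1) -> Iterator[List[Frame]]:
    """Yield segments of `length` consecutive frames; each new segment
    starts with the last `overlap` frames of the previous one."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    buffer: List[Frame] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == length:
            yield list(buffer)
            # Keep the tail of this segment as the head of the next one.
            buffer = buffer[length - overlap:]


segments = list(iter_segments(range(10), length=4, overlap=1))
```

With `overlap=1`, the last frame of each segment is repeated as the first frame of the next, matching the example given in the description; `overlap=0` yields disjoint segments.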

If, however, an action of an object exists in the image segment, the processor 320 generates, through the future image generator network, one or more future image frames successive to the image segment to form an integrated image segment (S530).

The future image generator network includes a plurality of abstraction layers and a plurality of generator layers. As in the proposal network, the processor 320 obtains a feature map through the plurality of abstraction layers included in the future image generator network, and generates, through the plurality of generator layers and based on the feature map, one or more future image frames in which the attribute information of the image segment is transformed or the feature information of other pre-stored images is synthesized. The processor 320 forms the integrated image segment by appending the one or more future image frames to the image segment.

In addition, the future image generator network may further include a plurality of discriminator layers that determine the probability that the one or more generated future image frames are substantially continuous with the image segment and feed this probability back to the abstraction layers and the generator layers. The processor 320 may repeatedly generate future image frames based on the probability value computed through the plurality of discriminator layers, raising the probability value of the discriminator layers and thereby generating more accurate future image frames.
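One way to picture the repeated generation guided by the discriminator's probability is the loop below. It is a simplified stand-in, not the adversarial feedback into the layers that the description refers to: the generator and discriminator are passed in as opaque callables, the previous probability is handed to the generator as its only feedback, and the stopping threshold `target` and round limit `max_rounds` are assumptions introduced for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = List[float]  # stand-in for an image frame


def refine_future_frames(
    segment: Sequence[Frame],
    generate: Callable[[Sequence[Frame], Optional[float]], List[Frame]],
    discriminate: Callable[[Sequence[Frame], List[Frame]], float],
    target: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[List[Frame], float]:
    """Regenerate future frames until the discriminator's continuity
    probability reaches `target` or `max_rounds` generations have been
    made; return the best attempt and its probability."""
    best = generate(segment, None)
    best_p = discriminate(segment, best)
    for _ in range(max_rounds - 1):
        if best_p >= target:
            break
        candidate = generate(segment, best_p)  # previous score as feedback
        p = discriminate(segment, candidate)
        if p > best_p:
            best, best_p = candidate, p
    return best, best_p
```

Keeping the best attempt rather than the latest one guarantees that the returned probability never decreases across rounds, which is one simple reading of "raising the probability value".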

Thereafter, the processor 320 detects, through the detection network, the action type and action time point of the object in the integrated image segment (S540).

As described above, the detection network includes a plurality of abstraction layers and one or more classifiers. The processor 320 extracts attribute information of the integrated image segment through the plurality of abstraction layers included in the detection network, and combines the attribute information extracted at the last abstraction layer to obtain a feature map. The processor 320 then applies the feature map as input data to the one or more classifiers. For example, the processor 320 may apply the feature map as input data to a first classifier to identify the action type of the object in the integrated image segment, and apply the result of the first classifier and/or the feature map as input data to a second classifier to detect whether the time point at which the object's action appears in the integrated image segment corresponds to the end or the start of that action type.

Thereafter, the processor 320 may repeat steps S510 to S540 to detect the action type and the action time point of the object from an image received in real time. In addition, when the detected action type corresponds to a preset abnormal action, the processor 320 may announce information on the action type and the action time point through a notification device provided in the behavior recognition apparatus 10, or may transmit that information to an external device.
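As a non-limiting sketch, the repeated execution of steps S510 to S540 and the notification of abnormal actions may be expressed as the loop below. The three networks are passed in as plain callables, and the names `monitor`, `has_action`, `extend`, `detect`, `abnormal`, and `notify` are hypothetical. The assignment of S520 to the proposal network and S530 to the future image generator network is inferred from the order of the steps in the description.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

Detection = Tuple[str, str]  # (action type, action time point)


def monitor(
    segments: Iterable[Sequence],
    has_action: Callable[[Sequence], bool],
    extend: Callable[[Sequence], Sequence],
    detect: Callable[[Sequence], Detection],
    abnormal: Set[str],
    notify: Callable[[str, str], None],
) -> List[Detection]:
    """Run the acquire / propose / generate / detect cycle over a stream of segments."""
    results: List[Detection] = []
    for segment in segments:                      # S510: acquire an image segment
        if not has_action(segment):               # S520: proposal network
            continue                              # no action, so skip generation and detection
        future = extend(segment)                  # S530: future image generator network
        integrated = list(segment) + list(future)
        action_type, time_point = detect(integrated)  # S540: detection network
        results.append((action_type, time_point))
        if action_type in abnormal:               # preset abnormal action
            notify(action_type, time_point)
    return results
```

The `notify` callable stands in for either the notification device of the apparatus or a transmission to an external device.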

Meanwhile, each neural network of steps S520 to S540 can perform the above-described attribute information extraction, classification, future image frame generation, and the like by virtue of training performed in advance. For example, the processor 320 may train each layer and classifier included in each network so as to increase the accuracy of results obtained before step S510, namely the result of classifying image segments as action or non-action, the result of generating future image frames contiguous to an image segment, and the result of determining the action type and the action time point from an image segment to which the future image frames have been added. This is described in more detail below with reference to FIG. 6.

FIG. 6 is a diagram for explaining the processor 320 according to an embodiment of the present invention. Referring to FIG. 6, the processor 320 includes a data learning unit 610 and a data recognition unit 620.

The data learning unit 610 trains the proposal network independently, and trains the future image generator network and the detection network interdependently. Here, training may consist of adjusting parameters (e.g., weights) of each layer and/or classifier, or of omitting and/or adding at least one layer in the computation.

Specifically, the data learning unit 610 independently trains the proposal network on a first criterion for detecting, from its input data (i.e., an image segment), whether an action of an object exists. For example, the data learning unit 610 may train the proposal network by inputting a training image segment together with information on whether an action of an object exists in that training image segment (i.e., "action" or "non-action").

In addition, the data learning unit 610 trains the future image generator network by inputting a training image segment and the frames contiguous to that training image segment. The data learning unit 610 also trains the detection network by inputting the output of the future image generator network (i.e., an integrated image segment) together with the action type and the action time point of the object in the integrated image segment. At this time, the data learning unit 610 further inputs the value of the loss function of the detection network to the future image generator network, so that the future image generator network is trained based on the value of that loss function. Here, the loss function may be, by way of example, a negative log-likelihood function, but is not limited thereto. By performing interdependent training of the future image generator network and the detection network in this way, the data learning unit 610 enables the action type and the action time point of the object to be detected more accurately.
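The interdependent training described above may be illustrated with the following toy sketch, in which a single negative log-likelihood value computed at the detection side drives the update of both the generator parameter and the detector parameter. Everything here is an assumption made for illustration: the scalar "generator" `g`, the scalar logistic "detector" `w`, the finite-difference gradients, and the names `detection_nll` and `joint_step` are hypothetical and are not the networks of the specification.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def detection_nll(g: float, w: float, x: float, label: int) -> float:
    """Negative log-likelihood of `label` under a toy generator + detector pipeline."""
    future = g * x                # toy generator: one future value predicted from the segment
    integrated = x + future       # toy integrated segment (segment plus future value)
    p = sigmoid(w * integrated)   # toy detector: P(label == 1)
    p_label = p if label == 1 else 1.0 - p
    return -math.log(p_label)


def joint_step(g: float, w: float, x: float, label: int,
               lr: float = 0.1, eps: float = 1e-6):
    """Update generator and detector from the same detection loss."""
    base = detection_nll(g, w, x, label)
    # Finite-difference gradients keep the sketch dependency-free.
    grad_g = (detection_nll(g + eps, w, x, label) - base) / eps
    grad_w = (detection_nll(g, w + eps, x, label) - base) / eps
    return g - lr * grad_g, w - lr * grad_w
```

Because `grad_g` is derived from the detector's loss, the generator is pushed toward future values that make the correct label more likely, which is the sense in which the two networks are trained dependently in this sketch.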

The data recognition unit 620 may determine the action type and the action time point of an object from an image segment based on the criteria learned through the data learning unit 610. Since this has been described above with reference to FIGS. 1 to 5, a detailed description is omitted.

Meanwhile, at least one of the data learning unit 610 and the data recognition unit 620 may be manufactured in the form of at least one hardware chip and mounted on the behavior recognition apparatus 10. Alternatively, at least one of the data learning unit 610 and the data recognition unit 620 may be implemented as a software module. When at least one of the data learning unit 610 and the data recognition unit 620 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. The computer-readable medium may also include computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although the method and system of the present invention have been described in connection with specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

10: Behavior recognition apparatus
11: Neural network
12: Real-time image 13: Image segment
210: Proposal network
220: Future image generator network
230: Detection network
310: Memory 320: Processor
610: Data learning unit 620: Data recognition unit

Claims

A method by which a behavior recognition apparatus recognizes an action of an object in an image, the method comprising:
obtaining an image segment;
determining, through a first neural network, whether an action of an object exists in the image segment;
when an action of the object exists in the image segment, constructing an integrated image segment by generating, through a second neural network, one or more future image frames contiguous to the image segment; and
detecting an action type and an action time point of the object in the integrated image segment through a third neural network,
wherein the first neural network is trained independently, and the second neural network and the third neural network are trained dependently on each other.

The method according to claim 1,
wherein the second neural network is trained based on a value of a loss function of the third neural network, and the third neural network is trained based on an output of the second neural network.

The method according to claim 1,
wherein the determining of whether an action of the object exists comprises:
acquiring a feature map by combining attribute information extracted from a last layer among a plurality of layers included in the first neural network; and
applying the feature map as input data to a binary classifier included in the first neural network to determine whether an action of the object exists in the image segment.

The method according to claim 3,
wherein the attribute information includes at least one of a polygon, an edge, a depth, a sharpness, a saturation, and a brightness included in the image segment, and temporal/spatial variation values thereof.

The method according to claim 1,
wherein the constructing of the integrated image segment comprises:
acquiring a feature map by combining attribute information extracted from a last layer among a plurality of first layers included in the second neural network;
generating, through a plurality of second layers included in the second neural network, one or more future image frames in which attribute information of the image segment is modified based on the feature map or in which feature information of other previously stored images is synthesized; and
constructing the integrated image segment by adding the one or more future image frames to the image segment.

The method according to claim 5,
wherein the constructing of the integrated image segment further comprises:
determining, through a plurality of third layers included in the second neural network, a probability that the generated one or more future image frames are substantially contiguous to the image segment; and
feeding the determined probability value back to the first layers and the second layers to regenerate the one or more future image frames.

The method according to claim 1,
wherein the detecting of the action type and the action time point comprises:
acquiring a feature map by combining attribute information extracted from a last layer among a plurality of layers included in the third neural network; and
applying the feature map as input data to a first classifier to identify the action type of the object in the integrated image segment, and applying a result of the first classifier as input data to a second classifier to detect whether the time point at which the action of the object appears in the integrated image segment corresponds to an end or a start of the action type.

A behavior recognition apparatus comprising:
a memory storing a program for recognizing an action of an object included in an image; and
a processor for executing the program,
wherein the processor, as the program is executed,
acquires an image segment and determines, through a first neural network, whether an action of an object exists in the image segment, and
when an action of the object exists in the image segment, constructs an integrated image segment by generating, through a second neural network, one or more future image frames contiguous to the image segment, and detects an action type and an action time point of the object in the integrated image segment through a third neural network,
wherein the first neural network is trained independently, and the second neural network and the third neural network are trained dependently on each other.

The apparatus according to claim 8,
wherein the second neural network is trained based on a value of a loss function of the third neural network, and the third neural network is trained based on an output of the second neural network.

The apparatus according to claim 8,
wherein the processor
acquires a feature map by combining attribute information extracted from a last layer among a plurality of layers included in the first neural network, and applies the feature map as input data to a binary classifier included in the first neural network to determine whether an action of the object exists in the image segment.

The apparatus according to claim 8,
wherein the processor
acquires a feature map by combining attribute information extracted from a last layer among a plurality of first layers included in the second neural network, generates, through a plurality of second layers included in the second neural network, one or more future image frames in which attribute information of the image segment is modified based on the feature map or in which feature information of other previously stored images is synthesized,
and constructs the integrated image segment by adding the one or more future image frames to the image segment.

The apparatus according to claim 8,
wherein the processor
acquires a feature map by combining attribute information extracted from a last layer among a plurality of layers included in the third neural network, applies the feature map as input data to a first classifier to identify the action type of the object in the integrated image segment, and applies a result of the first classifier as input data to a second classifier to detect whether the time point at which the action of the object appears in the integrated image segment corresponds to an end or a start of the action type.

A computer-readable recording medium on which a program for causing a computer to perform the method according to any one of claims 1 to 7 is recorded.