KR20220072499A

KR20220072499A - Method, apparatus and system for recognizing behavior based on multi-view video

Info

Publication number: KR20220072499A
Application number: KR1020200160179A
Authority: KR
Inventors: 김동칠; 박성주
Original assignee: 한국전자기술연구원
Priority date: 2020-11-25
Filing date: 2020-11-25
Publication date: 2022-06-02

Abstract

The present invention relates to a multi-view video-based behavior recognition method, apparatus, and system. A behavior recognition method according to an embodiment of the present invention is performed by an electronic device and recognizes the behavior of an object or event using videos captured simultaneously from n different views (where n is a natural number greater than or equal to 3). The method includes: training n classifiers according to a machine learning technique, using training data that includes input data on videos of different views and result data on the behavior types of objects or events in the videos; inputting videos of n different views of a target area to the respective trained classifiers to obtain the n behavior types output by the classifiers; and deriving, using an ensemble technique (ensemble of classifiers), at least one behavior type from among the n behavior types obtained from the classifiers as a result.

Description

Multi-view image-based behavior recognition method, apparatus and system

The present invention relates to a multi-view video-based behavior recognition method, apparatus, and system, and more particularly to a method, apparatus, and system for recognizing various behaviors of an object or event within a target area using multi-view video in which the target area is captured simultaneously from various directions or angles.

There is a technology that analyzes a video captured by a camera to recognize various behaviors of an object or event in the video, namely video-based behavior recognition technology. Conventional video-based behavior recognition technology (hereinafter, the "conventional technology") determines the behavior of an object by analyzing a single video captured by one camera.

Meanwhile, to serve a variety of surveillance purposes and targets, there has recently been demand for systems (e.g., intelligent CCTV systems) that provide multiple views of a target area captured simultaneously from various directions or angles. In particular, such a system needs at least three cameras to be installed. For example, such an intelligent CCTV system may be a community crime prevention system, an automatic object tracking system, an illegal parking and stopping enforcement system, a road crime prevention system, a traffic signal and speeding enforcement system, an illegal waste dumping enforcement system, or the like.

However, it is difficult for the conventional technology to analyze behavior recognition in multi-view videos captured by three or more such cameras, because the conventional technology is merely a behavior recognition technology for a single video. That is, the conventional technology recognizes behavior through complex image processing algorithms, so applying it as-is to multi-view video greatly increases the amount of computation and greatly slows processing, making it impossible to handle real-time environments and the like.

In addition, the conventional technology suffers from reduced behavior recognition accuracy when an object is occluded or its behavior is complex.

KR10-2019-0050551 A

To solve the problems of the conventional technology described above, an object of the present invention is to provide a method, apparatus, and system that perform behavior recognition for an object or event more quickly and accurately on videos of a target area captured simultaneously in three or more views (multi-view).

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

A behavior recognition method according to an embodiment of the present invention for solving the above problems is performed by an electronic device and recognizes the behavior of an object or event using videos captured simultaneously from n different views (where n is a natural number greater than or equal to 3). The method includes: training n classifiers according to a machine learning technique, using training data that includes input data on videos of different views and result data on the behavior types of objects or events in the videos; inputting videos of n different views of a target area to the respective trained classifiers to obtain the n behavior types output by the classifiers; and deriving, using an ensemble technique (ensemble of classifiers), at least one behavior type from among the n behavior types obtained from the classifiers as a result.

In the training step, the input data for each of the n classifiers may include a video of a different view.

In the obtaining step, videos of different views of the target area may be input to the respective classifiers.
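The obtaining and deriving steps can be illustrated with a minimal sketch. The source does not say which ensemble technique is used, so simple majority voting is assumed here, and the names `Classifier` and `recognize` are hypothetical. Each classifier is modeled as a plain callable that maps one view's video to a behavior label.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical stand-in: a trained classifier is any callable that maps
# one view's video to a behavior-type label.
Classifier = Callable[[object], str]


def recognize(classifiers: Sequence[Classifier], views: Sequence[object]) -> List[str]:
    """Feed the i-th view to the i-th classifier, then combine the n labels.

    Majority voting is an assumption; the source only says "ensemble of classifiers".
    """
    if len(classifiers) != len(views) or len(views) < 3:
        raise ValueError("expected n >= 3 classifiers and exactly one view per classifier")
    labels = [clf(view) for clf, view in zip(classifiers, views)]
    counts = Counter(labels)
    top = max(counts.values())
    # Return every label that reaches the top vote count, so that ties
    # yield more than one label ("at least one behavior type").
    return sorted(label for label, count in counts.items() if count == top)
```

With three views where two classifiers agree, the agreed label is returned alone; if all three disagree, all three labels come back and a downstream rule would have to break the tie.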

The training step may include: performing calibration among the cameras so that, in the n videos of different views captured by the cameras, the object or event is located in a specific region of each video; and training the n classifiers using the n videos of different views captured by the calibrated cameras.

The specific region may correspond to the central region of the video.

In the training step, the input data for each of the n classifiers may include a video of a different view, information on the frame in which an object or event appears in that video, and information on the frame in which the object or event that appeared disappears from that video.
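One way to picture a single training record is sketched below. The class name, field names, and the choice to treat the disappearance frame as an exclusive end are all assumptions made for illustration; the source specifies only that the input data carries the video, the appearance frame, and the disappearance frame, paired with a behavior-type label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    """Hypothetical training record for one view's classifier."""
    view_index: int       # which of the n views (0-based) the video belongs to
    video_path: str       # hypothetical location of the video file
    appear_frame: int     # frame index at which the object or event appears
    disappear_frame: int  # frame index at which it disappears
    behavior: str         # result data: the behavior-type label

    def __post_init__(self) -> None:
        if self.appear_frame < 0 or self.disappear_frame < self.appear_frame:
            raise ValueError("require 0 <= appear_frame <= disappear_frame")

    def active_frames(self) -> range:
        """Frames in which the object or event is visible (end exclusive, by assumption)."""
        return range(self.appear_frame, self.disappear_frame)
```

Restricting training to `active_frames()` is one plausible use of the two frame markers, since frames outside that span contain no object or event to label.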

A behavior recognition apparatus according to an embodiment of the present invention recognizes the behavior of an object or event using videos captured simultaneously from n different views (where n is a natural number greater than or equal to 3), and includes: a memory that stores n classifiers pre-trained according to a machine learning technique, using training data that includes input data on videos of different views and result data on the behavior types of objects or events in the videos; and a control unit that uses each classifier stored in the memory to control behavior recognition for an object or event in videos of n different views of a target area.

A behavior recognition apparatus according to another embodiment of the present invention recognizes the behavior of an object or event using videos captured simultaneously from n different views (where n is a natural number greater than or equal to 3), and includes: a communication unit that receives n classifiers pre-trained according to a machine learning technique, using training data that includes input data on videos of different views and result data on the behavior types of objects or events in the videos; and a control unit that uses each classifier received by the communication unit to control behavior recognition for an object or event in videos of n different views of a target area.

The control unit may input the videos of n different views of the target area to the respective classifiers to obtain the n behavior types output by the classifiers, and may use an ensemble technique (ensemble of classifiers) to derive, as the result, at least one behavior type from among the n behavior types obtained from the classifiers.

A behavior recognition system according to an embodiment of the present invention includes: a plurality of cameras that simultaneously capture videos of n different views (where n is a natural number greater than or equal to 3); and an electronic device that recognizes the behavior of an object or event using the videos of different views captured by the cameras.

The electronic device may include: a memory that stores n classifiers pre-trained according to a machine learning technique, using training data that includes input data on videos of different views and result data on the behavior types of objects or events in the videos; and a control unit that uses each classifier received by the communication unit to control behavior recognition for an object or event in videos of n different views of a target area.

The cameras may be CCTV cameras. That is, the behavior recognition system according to an embodiment of the present invention may be a CCTV system.

The present invention configured as described above has the advantage of being able to process behavior recognition for multiple videos rather than a single video.

That is, by analyzing videos of a target area captured simultaneously in three or more views (multi-view) with a learning model of a machine learning technique, the present invention can perform behavior recognition for an object or event in the videos more quickly and accurately, and therefore has the advantage of being able to handle real-time environments and the like.

In addition, the present invention processes multiple videos simultaneously and makes the final behavior recognition decision by applying an ensemble technique to the analysis results of the machine learning models, which has the advantage of further improving behavior recognition accuracy.

In addition, when applied to CCTV systems such as the video security systems used by public institutions or the private sector, the present invention has the advantage of enabling crime prevention, a higher arrest rate, and assurance of personal safety.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a block diagram of a system 10 according to an embodiment of the present invention.
FIG. 2 is a block diagram of a behavior recognition device 200 according to an embodiment of the present invention.
FIG. 3 is a block diagram of the control unit 250 in the behavior recognition device 200 according to an embodiment of the present invention.
FIG. 4 is a flowchart of a behavior recognition method according to an embodiment of the present invention.
FIG. 5 is a detailed flowchart of steps S100 and S200 of the behavior recognition method according to an embodiment of the present invention.
FIG. 6 shows an example of videos captured by the cameras 100 of the system 10 according to an embodiment of the present invention.
FIG. 7 shows an example of the videos that result from performing calibration among the cameras 100 of FIG. 6.
FIG. 8 shows the behavior recognition process performed in step S200 of the behavior recognition method according to an embodiment of the present invention.

The above objects and means of the present invention, and the effects they provide, will become clearer from the following detailed description taken with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of known technology related to the invention are omitted where they are judged likely to obscure the gist of the invention unnecessarily.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural unless the wording specifically states otherwise. In this specification, terms such as "include", "comprise", "be provided with", and "have" do not exclude the presence or addition of one or more components other than those mentioned.

In this specification, terms such as "or" and "at least one" may refer to one of the words listed together, or to a combination of two or more of them. For example, "A or B" and "at least one of A and B" may include only one of A and B, or may include both A and B.

In this specification, information presented in descriptions that follow "for example" and similar expressions, such as recited characteristics, variables, or values, may not match exactly. Effects such as variations arising from tolerances, measurement errors, limits of measurement accuracy, and other commonly known factors should not be taken to limit the embodiments of the invention.

In this specification, when a component is described as being 'connected' or 'coupled' to another component, it may be directly connected or coupled to that other component, but it should be understood that another component may also be present in between. In contrast, when a component is described as being 'directly connected' or 'directly coupled' to another component, it should be understood that no other component is present in between.

In this specification, when a component is described as being 'on' or 'in contact with' another component, it may be directly touching or connected to that other component, but it should be understood that yet another component may be present in between. In contrast, when a component is described as being 'directly on' or 'in direct contact with' another component, it may be understood that no other component is present in between. Other expressions describing relationships between components, such as 'between' and 'directly between', may be interpreted in the same way.

In this specification, terms such as 'first' and 'second' may be used to describe various components, but the components should not be limited by these terms. Nor should these terms be interpreted as limiting the order of the components; they may be used to distinguish one component from another. For example, a 'first component' may be named a 'second component', and similarly a 'second component' may be named a 'first component'.

Unless otherwise defined, all terms used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are clearly and specifically defined.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system 10 according to an embodiment of the present invention.

The system 10 according to an embodiment of the present invention collects videos of a plurality of location areas and analyzes them, and can recognize the behavior of an object or event in those videos. That is, the system 10 may be a system for protecting life and property, for example through crime prevention or disaster monitoring, based on the display or analysis of video.

For instance, the system 10 according to an embodiment of the present invention may be a CCTV (closed-circuit television) system. A CCTV system transmits video from a specific building or facility to specific recipients over wired or wireless equipment, and has a wide range of uses, including industrial, educational, medical, traffic control surveillance, and disaster prevention applications, as well as in-house delivery of video information.

Referring to FIG. 1, the system 10 according to an embodiment of the present invention may include a plurality of cameras 100 and a behavior recognition device 200.

There are n cameras 100 (100-1, 100-2, …, 100-n), where n is a natural number greater than or equal to 3. The cameras 100 capture the monitored area simultaneously from various directions or angles and can thereby provide n multi-view videos, each with a different view. For example, the cameras 100 may be CCTV cameras, but are not limited thereto.

For instance, each camera 100 may be implemented as a rotating camera. The rotating camera may be a PTZ (pan-tilt-zoom) camera capable of pan, tilt, and zoom functions. Here, the pan function refers to the camera 100 rotating along a first direction (e.g., the horizontal direction), and the tilt function refers to the camera 100 rotating along a second direction different from the first (e.g., the vertical direction). The zoom function refers to the camera 100 enlarging its capture of the monitored area around the central region of the current video. That is, n rotating cameras can provide multiple views by capturing the monitored area from various directions or angles.

Each camera 100 transmits its captured video information to the behavior recognition device 200 over a wired or wireless transmission path. A camera 100 may also transmit its captured video information to the behavior recognition device 200 over a network such as the Internet. For example, a camera 100 may be a camera without an IP address on the network or an IP camera with an assigned IP address, but is not limited thereto.

Meanwhile, because the cameras 100 capture the monitored area from different directions or angles, the videos they capture have different views, that is, they form a multi-view. As many as n such videos of different views can be captured simultaneously, one per camera 100. That is, a first video is captured by the first camera 100-1, a second video by the second camera 100-2, and an n-th video by the n-th camera 100-n, and each of these videos may be a moving picture carrying information on a time series of consecutive video frames.

In each video, a 'point' may refer to any single pixel in a video frame, and a 'region' may refer to a number of adjoining points (i.e., a number of pixels) in a video frame. For instance, the 'center point' of a region in a video may refer to the pixel located at the center of that region. Likewise, the 'center coordinate point' of a video may refer to the pixel located at the center of its frame, and the 'central region' may be a region that includes the center coordinate point.
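These definitions can be made concrete with a small sketch. The source says only that the central region is a region containing the center coordinate point; the axis-aligned rectangle used here, its `fraction` parameter, and both function names are assumptions chosen for illustration.

```python
from typing import Tuple


def center_point(width: int, height: int) -> Tuple[int, int]:
    """Pixel at the center of a frame of the given size (integer division)."""
    return (width // 2, height // 2)


def in_center_region(x: int, y: int, width: int, height: int,
                     fraction: float = 0.5) -> bool:
    """True if pixel (x, y) lies in a centered rectangle covering `fraction`
    of the frame's width and height. The rectangle shape and the default
    fraction are illustrative assumptions, not taken from the source.
    """
    cx, cy = center_point(width, height)
    half_w = width * fraction / 2
    half_h = height * fraction / 2
    return abs(x - cx) <= half_w and abs(y - cy) <= half_h
```

A check like this could be applied to an object's center point in each of the n views to confirm that calibration has placed the object in the central region.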

The behavior recognition device 200 receives and processes the video information captured by the cameras 100 and controls the recognition of various behaviors of objects or events in the videos. That is, the behavior recognition device 200 may show the received video information on its display for monitoring and similar purposes. In addition, the behavior recognition device 200 may control training and behavior recognition for the n videos captured by the cameras 100 according to the behavior recognition method described later.

Beyond this, the behavior recognition device 200 may also analyze the received video information. For instance, the behavior recognition device 200 may recognize an object or event under surveillance, such as a person or vehicle, in the received video information and, if the recognized behavior is an abnormal behavior (e.g., a parking violation, traffic signal violation, intrusion, fire, violence, theft, or illegal dumping), convey information about it to the user through various warning devices.
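The abnormal-behavior check described above can be sketched as a lookup followed by a message. The label strings, the `ABNORMAL_BEHAVIORS` set, the `make_alert` function, and the message format are all hypothetical; the source lists the kinds of abnormal behavior but not how they are encoded or how the warning device is driven.

```python
from typing import Iterable, Optional

# Hypothetical label names for the abnormal behaviors listed in the text.
ABNORMAL_BEHAVIORS = frozenset({
    "parking_violation", "signal_violation", "intrusion",
    "fire", "violence", "theft", "illegal_dumping",
})


def make_alert(behavior: str, camera_ids: Iterable[int]) -> Optional[str]:
    """Return a warning message for an abnormal behavior, or None otherwise."""
    if behavior not in ABNORMAL_BEHAVIORS:
        return None
    cams = ", ".join(str(c) for c in camera_ids)
    return f"ALERT: {behavior} detected (cameras: {cams})"
```

In this sketch the returned string stands in for whatever signal the warning device actually accepts; normal behaviors produce no message at all.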

That is, the behavior recognition device 200 performs control processing for the cameras 100 or for the video information captured by the cameras 100, and may be an electronic device or computing network capable of computing.

For example, the electronic device may be a desktop PC (personal computer), laptop PC, tablet PC, netbook computer, workstation, PDA (personal digital assistant), smartphone, smartpad, mobile phone, or the like, but is not limited thereto.

FIG. 2 is a block diagram of the behavior recognition device 200 according to an embodiment of the present invention.

As shown in FIG. 2, the behavior recognition device 200 may include an input unit 210, a communication unit 220, a display 230, a memory 240, and a control unit 250.

The input unit 210 generates input data in response to input from various users (such as a surveillance operator) and may include various input means. For example, the input unit 210 may include a keyboard, keypad, dome switch, touch panel, touch key, touch pad, mouse, menu button, and the like, but is not limited thereto. For instance, the input unit 210 may receive commands for various settings.

The communication unit 220 is a component that communicates with other devices. For example, the communication unit 220 may perform wireless communication such as 5G (5th generation communication), LTE-A (Long Term Evolution-Advanced), LTE (Long Term Evolution), Bluetooth, BLE (Bluetooth Low Energy), NFC (near field communication), or Wi-Fi, or wired communication such as cable communication, but is not limited thereto. For instance, the communication unit 220 may receive the video information captured by each camera 100 from that camera, and may receive information on pre-trained learning models (the classifiers and the ensemble model) from other devices. The communication unit 220 may also transmit video information, information on trained learning models, behavior recognition results, and the like to other devices.

The display 230 displays various image data on a screen and may be configured as a non-emissive panel or an emissive panel. For example, the display 230 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a micro-electromechanical systems (MEMS) display, or an electronic paper display, but is not limited thereto. In addition, the display 230 may be combined with the input unit 210 and implemented as a touch screen or the like. For instance, the display 230 may display the video information captured by the cameras 100, setting information, behavior recognition results, and the like on the screen.

The memory 240 stores various kinds of information required for the operation of the behavior recognition apparatus 200. The stored information may include the video information captured by the cameras 100, setting information, information on the learning models, information on the ensemble technique, behavior recognition results, and program information related to the behavior recognition method described below, but is not limited thereto. For example, depending on its type, the memory 240 may be of a hard disk type, a magnetic media type, a CD-ROM (compact disc read-only memory) type, an optical media type, a magneto-optical media type, a multimedia card micro type, a flash memory type, a ROM (read-only memory) type, or a RAM (random access memory) type, but is not limited thereto. In addition, depending on its purpose or location, the memory 240 may be a cache, a buffer, a main memory, an auxiliary memory, or a separately provided storage system, but is not limited thereto.

The control unit 250 may perform various control operations of the behavior recognition apparatus 200. That is, the control unit 250 may control execution of the behavior recognition method described below and may control the operation of the remaining components of the behavior recognition apparatus 200, namely the input unit 210, the communication unit 220, the display 230, and the memory 240. For example, the control unit 250 may include a processor, which is hardware, or a process, which is software executed on that processor, but is not limited thereto.

FIG. 3 is a block diagram of the control unit 250 in the behavior recognition apparatus 200 according to an embodiment of the present invention.

The control unit 250 controls execution of the behavior recognition method according to an embodiment of the present invention and, as shown in FIG. 3, may include a learning unit 251, a calibration unit 252, a classifier processing unit 253, and an ensemble processing unit 254. For example, the learning unit 251, the calibration unit 252, the classifier processing unit 253, and the ensemble processing unit 254 may each be a hardware component of the control unit 250 or a software process executed by the control unit 250, but are not limited thereto.

Hereinafter, the behavior recognition method according to the present invention is described in more detail.

FIG. 4 is a flowchart of the behavior recognition method according to an embodiment of the present invention, and FIG. 5 is a detailed flowchart of S100 and S200 of the behavior recognition method according to an embodiment of the present invention.

The behavior recognition method according to an embodiment of the present invention recognizes the behavior of an object or event in videos captured simultaneously by n cameras 100 from n different views and, as shown in FIG. 4, includes S100 and S200.

S100 is a step of training the learning models (each classifier and the ensemble model) using training data. Here, a classifier is a learning model that is trained according to a machine learning technique and performs a classification function, and n classifiers may be provided, equal to the number of videos having different views. This is because each classifier is trained on the video of a different view, and each trained classifier likewise performs classification for behavior recognition on the video of a different view. That is, the first classifier may be trained using the first video, the second classifier using the second video, and the n-th classifier using the n-th video. Likewise, the trained first classifier may perform classification for behavior recognition using the first video, the trained second classifier using the second video, and the trained n-th classifier using the n-th video.

Note that each video used in S100 is a video prepared for training (that is, a training-data video) and is included in the input data of the training data. The output data of the training data includes a label for the behavior type of the object or event in the video. This is described in detail below.

Specifically, as shown in FIG. 5, S100 may include S101 to S105.

First, in S101, n cameras 100 that capture the monitored area from various directions or angles are installed at multiple points. For instance, n rotating cameras may be installed at multiple points.

FIG. 6 shows an example of the videos captured by the cameras 100 of the system 10 according to an embodiment of the present invention, and FIG. 7 shows an example of the videos resulting from calibration between the cameras 100 of FIG. 6.

Next, in S102, the calibration unit 252 performs calibration between the cameras 100. Here, calibration refers to adjusting the shooting direction of each camera 100 so that the object or event is located within the same specific region in each video. That is, the calibration unit 252 may pan or tilt each camera 100 so that, in the n videos of different views captured by the cameras 100, the object or event is located in the specific region of the frame.

For example, the specific region may be the central region of the video frame, but is not limited thereto. Such calibration keeps the position of the object or event consistent across the videos included in the training data, which can improve training efficiency and thereby classification accuracy.

For instance, as shown in FIG. 6, three rotating cameras 100-1, 100-2, and 100-3 may be arranged to capture a monitored area that includes object regions OR1, OR2, and OR3 containing an object O. The center points of OR1, OR2, and OR3 are referred to as CP1, CP2, and CP3, respectively. In the first video, OR1 (CP1) is located in the central region of the frame, whereas in the second and third videos, OR2 and OR3 (CP2 and CP3) are each located outside the central region.

Accordingly, as shown in FIG. 7, calibration may be performed by leaving the shooting direction of the first rotating camera 100-1 unchanged and adjusting the shooting directions of the second and third rotating cameras 100-2 and 100-3. As a result of this calibration, OR2 and OR3 (CP2 and CP3) move to OR2' and OR3' (CP2' and CP3'), respectively. That is, OR2' and OR3' (CP2' and CP3') may each be located in the central region of the frame, and in particular the moved points CP2' and CP3' may each coincide with the center coordinate of the frame.
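The centering described for FIG. 6 and FIG. 7 can be illustrated numerically. The following is only a minimal sketch under stated assumptions, not the calibration procedure of the embodiment: the `ObjectRegion` type, the field-of-view parameters, and the linear pixel-to-angle mapping (a small-angle approximation) are all hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObjectRegion:
    """Hypothetical object region (such as OR2) in pixel coordinates."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        """Center point of the region (such as CP2)."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def pan_tilt_correction(region: ObjectRegion,
                        frame_w: int, frame_h: int,
                        hfov_deg: float, vfov_deg: float) -> Tuple[float, float]:
    """Approximate pan and tilt angles (degrees) that would bring the
    region center to the frame center.

    Assumption: pixel offset maps linearly to angle, which is only a
    rough approximation for real lenses.
    """
    cx, cy = region.center()
    dx = cx - frame_w / 2          # positive: object is right of center
    dy = cy - frame_h / 2          # positive: object is below center
    pan = dx / frame_w * hfov_deg
    tilt = -dy / frame_h * vfov_deg  # image y axis points downward
    return pan, tilt
```

A region already centered in the frame yields a correction of zero on both axes, which corresponds to the first rotating camera 100-1 being left unchanged.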

Next, in S103, the learning unit 251 acquires videos of n different views. That is, rather than a single-view video, it may acquire the videos of n different views captured by the cameras 100.

Next, in S104, the learning unit 251 uses the acquired videos to generate the ground truth (GT) of the training data for machine learning. The training data includes pairs of input data and output data: the input data includes the acquired video of a given view, and the output data (result data) includes a label (that is, a behavior name) for the behavior type of the object or event in the video. Hereinafter, the training data for the k-th classifier (where k is a natural number from 1 to n) is referred to as the “k-th training data”, and the input data and output data of the k-th training data are referred to as the “k-th input data” and the “k-th output data”, respectively.

For example, the first to n-th result data may include labels for behavior types such as parking violation, traffic signal violation, intrusion, fire, violence, theft, or illegal dumping, but are not limited thereto.

In particular, across the training data, the first to n-th input data preferably include videos of mutually different views. That is, the first input data may include the first video, the second input data the second video, and the n-th input data the n-th video.

In addition, in each set of training data, the input data may further include, besides the video of the respective view, information on the frame in which the object or event appears in that video (hereinafter, the “first frame”) and information on the frame in which the object or event that had appeared disappears from that video (hereinafter, the “second frame”). Because each video is a moving picture consisting of a time series of consecutive frames, this supplies the input data with temporal information about the behavior of the object or event (that is, its start time and end time). In this case, the first input data may include the first video, information on the first frame in the first video, and information on the second frame in the first video. Likewise, the second input data may include the second video, information on the first frame in the second video, and information on the second frame in the second video, and the n-th input data may include the n-th video, information on the first frame in the n-th video, and information on the second frame in the n-th video.
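As a rough illustration of the k-th training data described above, the following sketch pairs each view's video with its first frame, second frame, and behavior label. The `TrainingSample` and `build_ground_truth` names, the use of file paths to stand for videos, and the 0-based frame indices are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TrainingSample:
    """Hypothetical k-th training data: k-th input data plus k-th output data."""
    view_index: int    # k, counted from 1 to n
    video_path: str    # stands in for the k-th video
    first_frame: int   # frame in which the object or event appears
    second_frame: int  # frame in which the object or event disappears
    label: str         # behavior type (k-th output data)

    def __post_init__(self) -> None:
        if self.view_index < 1:
            raise ValueError("view_index starts at 1")
        if not 0 <= self.first_frame <= self.second_frame:
            raise ValueError("expected 0 <= first_frame <= second_frame")


def build_ground_truth(views: Sequence[Tuple[str, int, int]],
                       label: str) -> List[TrainingSample]:
    """Build one labeled sample per view from (path, first, second) tuples."""
    if len(views) < 3:
        raise ValueError("the method assumes n >= 3 views")
    return [TrainingSample(k, path, first, second, label)
            for k, (path, first, second) in enumerate(views, start=1)]
```

The first and second frames are kept per view because the same behavior may enter and leave each camera's field of view at different times.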

Next, in S105, the learning unit 251 trains the n classifiers using the prepared training data. That is, the learning unit 251 may train the n classifiers according to a machine learning technique using training data generated from the n videos of different views captured by the calibrated cameras 100. Here, each classifier may be trained by a supervised machine learning technique.
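The one-classifier-per-view arrangement of S105 can be sketched as follows. This is a toy illustration under strong assumptions: a nearest-centroid classifier over small feature vectors stands in for whatever supervised model is actually used, and the feature vectors stand in for features extracted from each view's video. Neither choice comes from the description.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class NearestCentroidClassifier:
    """Toy supervised classifier used only as a stand-in for a real model."""

    def __init__(self) -> None:
        self.centroids: Dict[str, List[float]] = {}

    def fit(self, features: Sequence[Sequence[float]],
            labels: Sequence[str]) -> "NearestCentroidClassifier":
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = defaultdict(int)
        for vec, label in zip(features, labels):
            if label not in sums:
                sums[label] = [0.0] * len(vec)
            sums[label] = [s + v for s, v in zip(sums[label], vec)]
            counts[label] += 1
        # Mean feature vector per behavior type.
        self.centroids = {lab: [s / counts[lab] for s in total]
                          for lab, total in sums.items()}
        return self

    def predict(self, vec: Sequence[float]) -> str:
        def sq_dist(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(vec, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def train_per_view(
    training_sets: Sequence[Tuple[Sequence[Sequence[float]], Sequence[str]]]
) -> List[NearestCentroidClassifier]:
    """Train the k-th classifier only on the k-th view's training data."""
    return [NearestCentroidClassifier().fit(feats, labs)
            for feats, labs in training_sets]
```

Keeping the classifiers independent means each one can specialize in how a behavior looks from its own viewpoint; combining their outputs is left to the ensemble step.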

For example, the machine learning technique may include an artificial neural network, a decision tree, Gaussian process regression, a support vector machine, random forests, or deep learning, but is not limited thereto. The deep learning technique may include a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), Deep Q-Networks, and the like, but is not limited thereto.

A classifier may include multiple layers and, through training, acquire a function describing the relationship between input data and output data. That is, the classifier represents the relationship between input data and output data with multiple layers, and this set of representation layers is also called a “neural network”. When input data is fed into a trained classifier, output data determined by those layers is produced. Each layer in the neural network consists of at least one filter, and each filter may have a matrix of weights.
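To make the idea of layers holding weight matrices concrete, the sketch below runs a forward pass through a small fully connected network. It is an assumption-laden simplification: dense layers with ReLU and softmax are used here in place of the filters mentioned above, and the weights in any usage are arbitrary rather than learned.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weights, bias)


def dense(x: Sequence[float], weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """One layer: multiply the input by a weight matrix and add a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(values: Sequence[float]) -> List[float]:
    peak = max(values)                       # subtract max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Apply each layer in turn; ReLU between layers, softmax at the end."""
    out = list(x)
    for i, (weights, bias) in enumerate(layers):
        out = dense(out, weights, bias)
        if i < len(layers) - 1:
            out = relu(out)
    return softmax(out)
```

The returned list holds one probability per behavior type, and the index of the largest entry would be the classifier's predicted type.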

Also in S105, after training the classifiers, the learning unit 251 may additionally train an ensemble model, which is a learning model based on an ensemble technique (ensemble of classifiers). Here, the ensemble technique uses the result data output (predicted) by each classifier to derive at least one of the n predicted behavior types as the final result.

For example, the ensemble technique may be a first method based on majority rule or voting, such as voting, bagging, pasting, or random forest, or a second method based on boosting, such as AdaBoost or gradient boosting. The first method may be preferable, but the technique is not limited thereto.

That is, in the ensemble model, the output data of each classifier may be included in its input data, and the label of the behavior type into which the user wants the object or event in the videos to be finally classified may be included in its output data.
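The simplest voting-based combination can be sketched as plain majority (hard) voting over the n predicted behavior types. This is one possible reading of the first method, not necessarily the trained ensemble model of the embodiment; the `hard_vote` name and the choice to return every tied label are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[str]) -> List[str]:
    """Return the behavior type(s) predicted by the most classifiers.

    Ties are not broken: every label sharing the top count is returned,
    in order of first appearance, so the result holds at least one type.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    return [label for label, count in counts.items() if count == top]
```

With three views, two classifiers agreeing on a type outvote the third; if all three disagree, all three types are returned and a downstream rule would have to choose among them.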

As a result of S100, a learning model can be trained that, for videos captured from three or more different views, performs behavior recognition by applying an ensemble technique to the results of multiple classifiers.

However, S100 described above need not be performed by the behavior recognition apparatus 200 and may instead be performed by another device. In this case, S100 is performed by the control unit of the other device, and the resulting trained learning models (each classifier and the ensemble model) may be transmitted to the behavior recognition apparatus 200 through the communication unit 220.

FIG. 8 shows the behavior recognition process performed according to S200 of the behavior recognition method according to an embodiment of the present invention.

Next, S200 is a step of performing behavior recognition using the trained learning models, that is, each classifier and the ensemble model.

Specifically, as shown in FIGS. 5 and 8, S200 may include S201 to S203.

First, in S201, the classifier processing unit 253 uses each of the n trained classifiers to obtain outputs (predictions) of n behavior types. That is, the videos of n different views of the monitored area are input into the respective trained classifiers, and the n behavior types output by the classifiers are obtained. Here, each video is an actual surveillance video acquired from a camera 100, for example in real time, and, unlike the training-data videos, carries no label.

That is, the first video may be input into the trained first classifier to obtain a first output for the behavior type, the second video may be input into the trained second classifier to obtain a second output, and the n-th video may be input into the trained n-th classifier to obtain an n-th output.

The videos of n different views input to the classifiers may be videos captured after calibration of the cameras 100, but are not limited thereto. As in S102 described above, this calibration of the cameras 100 may be performed by the control unit 250, that is, by the calibration unit 252 or the like.

Meanwhile, in addition to the video information, the classifier processing unit 253 may further input into each classifier temporal information about the behavior of the object or event (that is, information on the start time and end time of the behavior). In this case, the classifier processing unit 253 may use any of various object or event extraction algorithms to derive temporal information on the frames in which the object or event appears and disappears.
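Deriving the appearance and disappearance frames from a detector's per-frame output might look like the sketch below. The `appearance_span` helper is hypothetical, the per-frame boolean flags are assumed to come from some external object or event detector, and the second frame is taken here to be the last frame in which the object is still detected, which is one possible interpretation.

```python
from typing import Optional, Sequence, Tuple


def appearance_span(detected: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Return (first_frame, second_frame) as 0-based frame indices.

    detected[i] is True when the object or event is found in frame i.
    Returns None when it never appears. Gaps between the first and last
    detection are ignored.
    """
    first = next((i for i, hit in enumerate(detected) if hit), None)
    if first is None:
        return None
    # Scan from the end to find the last frame with a detection.
    from_end = next(i for i, hit in enumerate(reversed(detected)) if hit)
    last = len(detected) - 1 - from_end
    return first, last
```

Dividing the two indices by the frame rate would give the start and end times of the behavior in seconds.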

Next, in S202, the ensemble processing unit 254 uses the ensemble technique (ensemble of classifiers) to derive, as the result, at least one of the n behavior types obtained from the classifiers. That is, the output of each classifier is input into the trained ensemble model, and at least one behavior type is finally determined as the behavior of the object or event in the n videos.
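Steps S201 and S202 together amount to a short pipeline: route the k-th view to the k-th classifier, then hand the n predictions to the ensemble. The sketch below shows only that wiring. The `recognize` function and the callable signatures are assumptions, and both the classifiers and the ensemble are passed in as plain functions rather than trained models.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

View = TypeVar("View")


def recognize(classifiers: Sequence[Callable[[View], str]],
              views: Sequence[View],
              ensemble: Callable[[List[str]], List[str]]
              ) -> Tuple[List[str], List[str]]:
    """Run S201 then S202 and return (per-view types, final types)."""
    if len(classifiers) != len(views):
        raise ValueError("one classifier is required per view")
    if len(views) < 3:
        raise ValueError("the method assumes n >= 3 views")
    # S201: the k-th classifier only ever sees the k-th view.
    per_view = [clf(view) for clf, view in zip(classifiers, views)]
    # S202: the ensemble reduces n predictions to at least one final type.
    return per_view, ensemble(per_view)
```

Returning the per-view predictions alongside the final result keeps both available for display, since S203 may show the outcome of either S201 or S202.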

Next, in S203, the control unit 250 may cause the result of behavior recognition according to S201 or S202 to be displayed on the display 230. The control unit 250 may also transmit the result of behavior recognition to another device through the communication unit 220.

In addition, after S200, if the recognized behavior is an abnormal behavior (for example, parking violation, traffic signal violation, intrusion, fire, violence, theft, or illegal dumping), the control unit 250 may deliver information about it to the user through various warning devices.
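The abnormal-behavior check can be sketched as a simple membership test followed by a notification callback. The label strings, the `ABNORMAL_BEHAVIORS` set, and the `notify` callback are placeholders invented for this example; the description does not specify how warning devices are driven.

```python
from typing import Callable, Iterable, List

# Hypothetical label strings for the abnormal behaviors listed above.
ABNORMAL_BEHAVIORS = frozenset({
    "parking violation", "traffic signal violation", "intrusion",
    "fire", "violence", "theft", "illegal dumping",
})


def dispatch_alerts(final_types: Iterable[str],
                    notify: Callable[[str], None]) -> List[str]:
    """Call notify once for each recognized type that is abnormal.

    Returns the abnormal types in the order they were recognized.
    """
    abnormal = [t for t in final_types if t in ABNORMAL_BEHAVIORS]
    for behavior in abnormal:
        notify(behavior)
    return abnormal
```

Passing the notifier as a callback keeps the check independent of whether the warning is a siren, a message, or an on-screen indicator.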

The present invention configured as described above has the advantage of being able to perform behavior recognition on multiple videos rather than a single video. That is, by analyzing videos of a target area captured simultaneously from three or more views with a machine learning model, the present invention can recognize the behavior of an object or event in the videos more quickly and accurately, and can therefore cope with real-time environments. In addition, the present invention processes multiple videos simultaneously and combines the analysis results of the machine learning models through an ensemble technique to make the final determination, which can further improve the accuracy of behavior recognition. Furthermore, when applied to a CCTV system such as a video security system used by public institutions or the private sector, the present invention can help prevent crime, improve arrest rates, and protect personal safety.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims set forth below and their equivalents.

10: CCTV system 100: camera unit
110: fixed camera 120: rotating camera
200: CCTV operating device 210: input unit
220: communication unit 230: display
240: memory 250: control unit
251: learning unit 252: calibration unit
253: classifier processing unit 254: ensemble processing unit

Claims

1. A behavior recognition method performed by an electronic device for recognizing a behavior of an object or event using videos captured simultaneously from n different views (where n is a natural number of 3 or more), the method comprising:
training n classifiers according to a machine learning technique using training data that each include input data on a video of a different view and result data on the behavior type of an object or event in the video;
inputting videos of n different views of a target area into the respective trained classifiers to obtain n behavior types output by the classifiers; and
deriving, as a result, at least one behavior type from among the n behavior types obtained from the classifiers by using an ensemble technique (ensemble of classifiers).

2. The method of claim 1,
wherein, in the training step, the input data for the n classifiers respectively include videos of different views, and
in the obtaining step, the videos of different views of the target area are input into the respective classifiers.

3. The method of claim 2,
wherein the training step comprises:
performing calibration between the cameras so that, in the n videos of different views captured by the cameras, the object or event is located in a specific region of the frame; and
training the n classifiers using the n videos of different views captured by the calibrated cameras.

4. The method of claim 3,
wherein the specific region is a central region of the frame.

5. The method of claim 2,
wherein, in the training step, the input data for each of the n classifiers includes a video of a different view, information on the frame in which the object or event appears in that video, and information on the frame in which the object or event that appeared in that video disappears.

6. An apparatus for recognizing a behavior of an object or event using videos captured simultaneously from n different views (where n is a natural number of 3 or more), the apparatus comprising:
a memory storing n classifiers pre-trained according to a machine learning technique using training data that each include input data on a video of a different view and result data on the behavior type of an object or event in the video; and
a control unit that controls recognition of the behavior of an object or event in videos of n different views of a target area by using each classifier stored in the memory,
wherein the control unit
inputs the videos of n different views of the target area into the respective classifiers to obtain n behavior types output by the classifiers, and derives, as a result, at least one behavior type from among the n behavior types obtained from the classifiers by using an ensemble technique (ensemble of classifiers).

7. An apparatus for recognizing a behavior of an object or event using videos captured simultaneously from n different views (where n is a natural number of 3 or more), the apparatus comprising:
a communication unit that receives n classifiers pre-trained according to a machine learning technique using training data that each include input data on a video of a different view and result data on the behavior type of an object or event in the video; and
a control unit that controls recognition of the behavior of an object or event in videos of n different views of a target area by using each classifier received through the communication unit,
wherein the control unit
inputs the videos of n different views of the target area into the respective classifiers to obtain n behavior types output by the classifiers, and derives, as a result, at least one behavior type from among the n behavior types obtained from the classifiers by using an ensemble technique (ensemble of classifiers).

8. A system comprising:
a plurality of cameras that simultaneously capture n different views (where n is a natural number of 3 or more); and
an electronic device that recognizes a behavior of an object or event by using the videos of different views captured by the cameras,
wherein the electronic device comprises:
a memory storing n classifiers pre-trained according to a machine learning technique using training data that each include input data on a video of a different view and result data on the behavior type of an object or event in the video; and
a control unit that controls recognition of the behavior of an object or event in videos of n different views of a target area by using each classifier received from the communication unit,
wherein the control unit
inputs the videos of n different views of the target area into the respective classifiers to obtain n behavior types output by the classifiers, and derives, as a result, at least one behavior type from among the n behavior types obtained from the classifiers by using an ensemble technique (ensemble of classifiers).

9. The system of claim 8,
wherein the cameras are cameras of a CCTV system.