KR102514301B1

KR102514301B1 - Device for identifying the situation of object's conduct using sensor fusion

Info

Publication number: KR102514301B1
Application number: KR1020200014926A
Authority: KR
Inventors: 김종원; 김진수
Original assignee: 주식회사 딥아이
Priority date: 2020-02-07
Filing date: 2020-02-07
Publication date: 2023-03-27
Also published as: KR20210100937A

Abstract

The present invention relates to behavior analysis technology using heterogeneous sensor fusion. More specifically, it relates to a behavior analysis device that improves object detection accuracy by fusing object information from an RGB image with object information from a thermal image, and that analyzes partial images of the detected object in time series, rather than the entire image, so that a relatively lightweight system can determine what situation the object's behavior indicates. A behavior analysis device using heterogeneous sensor fusion according to one embodiment includes an object detection unit that detects an object by fusing first feature information extracted from an RGB image of a first sensor with second feature information extracted from a thermal image of a second sensor, and a situation determination unit that determines the situation of the object by analyzing partial images of the detected object in time series.

Description

Behavior analysis device using heterogeneous sensor fusion {DEVICE FOR IDENTIFYING THE SITUATION OF OBJECT'S CONDUCT USING SENSOR FUSION}

The present invention relates to behavior analysis technology using heterogeneous sensor fusion. More specifically, it relates to a behavior analysis device that improves object detection accuracy by fusing object information from an RGB image with object information from a thermal image, and that analyzes partial images of the detected object in time series, rather than the entire image, so that a relatively lightweight system can determine what situation the object's behavior indicates.

As the range of human activity expands, lifesaving operations are required in a growing variety of settings for accidents such as fires, airplane crashes, and the sinking of fishing boats. In maritime accidents in particular, tides and currents quickly carry the vessel and the people in distress out of the search area, so, unlike on land, information about a wide accident area must be collected quickly and accurately.

However, under current rescue guidelines and technology, an observer simply watches video of the accident site and passively collects the situational information needed for lifesaving. Moreover, maritime accidents that occur at night cause the greatest loss of life, yet night-time search, where visibility is hard to secure, still relies on visual inspection using searchlights or flares, which consumes a great deal of manpower and time.

Beyond detecting people in distress at sea, demand is growing for monitoring services that recognize abnormal situations by analyzing video, such as detecting intruders in security applications or monitoring abnormal behavior of particular animals in zoos.

Object detection based on traditional computer vision image processing identifies objects by processing the entire image and therefore requires very heavy hardware. Mobile search devices such as drones, however, have limited power and hardware, which makes traditional image processing techniques difficult to apply.

To overcome this problem, relatively lightweight object detection techniques such as CNNs have been proposed. However, a CNN only detects the object itself in a single still frame; it cannot determine what the object is doing or what situation the object is in.

In addition, existing object detection techniques rely on features defined by hand. Typically, color, size, and shape are defined as representative features of an object, and the object is detected by extracting those features from the captured image.

However, such techniques are limited in that features must be extracted anew whenever the object's characteristics change, and the specific characteristics of an object are difficult to generalize. As a result, the reliability of detection performance inevitably suffers.

There is therefore a need for technology that improves the reliability of object detection and also analyzes the behavior of the detected object to determine its situation.

One problem to be solved by the present invention is to provide active, intelligent object detection technology that overcomes the passivity and low reliability of conventional image-processing-based object detection.

Another problem to be solved by the present invention is to provide technology that identifies an object's behavior pattern with relatively light system resources by analyzing a partial image containing the detected object rather than the entire image.

Yet another problem to be solved by the present invention is to provide technology that fuses an RGB sensor and a thermal image sensor so that each compensates for the functional limitations of the other.

To solve the above problems, the present invention proposes, as one embodiment, a behavior analysis device using heterogeneous sensor fusion that includes: an object detection unit that detects an object by fusing first feature information extracted from an RGB image of a first sensor with second feature information extracted from a thermal image of a second sensor; and a situation determination unit that determines the situation of the object by analyzing partial images of the detected object in time series.

The behavior analysis device may further include a preprocessing unit that, before the RGB image and the thermal image are input to the object detection unit, performs viewing angle correction on the RGB image and the thermal image and performs edge correction on the thermal image.

The object detection unit may include: a first detection module that extracts, from the RGB image, first feature information including a bounding box (bb_c) and a confidence score (S_C) of the object; a second detection module that extracts, from the thermal image, second feature information including a bounding box (bb_t) and a confidence score (S_T) of the object; a sensor fusion module that re-estimates the per-image confidence scores (S_C, S_T) so that the outputs of the first detection module and the second detection module correlate more meaningfully with the ground truth; and a final detection module that derives final feature information based on the re-estimated confidence scores (S_C', S_T').

Here, the sensor fusion module may include: an extraction module that uses the first feature information and the second feature information to derive fusion feature information (F) including at least one of the per-image confidence scores (S_C, S_T), the per-image bounding boxes (bb_c, bb_t), the mean values (μ_x, μ_y) of the center coordinates of the per-image bounding boxes, the variance values (σ_x, σ_y) of the center coordinates of the per-image bounding boxes, and the minimum bounding box (bb_m) that contains all of the per-image bounding boxes; and a learning module that inputs the fusion feature information (F) to a multi-layer perceptron whose target value is set to the Intersection over Union (IoU), that is, the relative area of overlap between each per-image bounding box (bb_c, bb_t) and the ground-truth bounding box (bb_g).
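As a rough illustration of the fusion feature information (F) and the IoU target described above, the following Python sketch builds F from one RGB detection and one thermal detection. The box layout (x_min, y_min, x_max, y_max), the function names, the ordering of entries in F, and the use of population variance are assumptions made for this sketch; the specification does not fix them.

```python
from statistics import mean, pvariance

# Assumption: a box is a tuple (x_min, y_min, x_max, y_max) in pixel coordinates.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def center(box):
    """Center coordinates (x, y) of a box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def fusion_features(s_c, bb_c, s_t, bb_t):
    """Build a hypothetical fusion feature vector F from the two detections."""
    xs = [center(bb_c)[0], center(bb_t)[0]]
    ys = [center(bb_c)[1], center(bb_t)[1]]
    # Smallest box that contains both per-image boxes (bb_m).
    bb_m = (min(bb_c[0], bb_t[0]), min(bb_c[1], bb_t[1]),
            max(bb_c[2], bb_t[2]), max(bb_c[3], bb_t[3]))
    return [s_c, s_t, *bb_c, *bb_t,
            mean(xs), mean(ys),            # mu_x, mu_y
            pvariance(xs), pvariance(ys),  # sigma_x, sigma_y (as variances)
            *bb_m]

def iou_targets(bb_c, bb_t, bb_g):
    """Training targets: IoU of each per-image box with the ground-truth box."""
    return iou(bb_c, bb_g), iou(bb_t, bb_g)
```

In training, F would be the input to the multi-layer perceptron and the pair returned by `iou_targets` would serve as the regression target, so that the re-estimated scores track how well each sensor's box actually matches the ground truth.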

The situation determination unit may include: a computation module that derives an output value at time t (t being an integer of 1 or more) by reflecting, as a weight, the feature information of the object detected at time t-1 in the feature information of the object detected at time t; and an analysis module that determines the situation of the object by analyzing the sequence of output values of the computation module from the time the object is detected to a preset time.

The computation module may use a recurrent neural network (RNN) that includes an input layer (x), a hidden layer (h), and an output layer (y), with a recurrent edge added to the hidden layer to pass the detection information of the object at time t-1 on to time t.
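The recurrence described above can be sketched in a few lines of Python. This is a minimal plain RNN under assumed conventions: a tanh activation, the weight names `w_xh`, `w_hh`, `w_hy`, and a zero initial hidden state are all choices made for this sketch, not details taken from the specification.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One time step: the hidden state mixes the input at t with the state from t-1."""
    from_input = matvec(w_xh, x_t)
    from_past = matvec(w_hh, h_prev)  # the recurrent edge
    return [math.tanh(a + r + b) for a, r, b in zip(from_input, from_past, b_h)]

def run_sequence(xs, w_xh, w_hh, b_h, w_hy):
    """Feed a sequence of per-frame feature vectors and collect the outputs y_t."""
    h = [0.0] * len(b_h)  # assumed zero initial state
    outputs = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        outputs.append(matvec(w_hy, h))
    return outputs
```

With a one-dimensional state and all weights set to 1, feeding the inputs `[[1.0], [0.0]]` yields a nonzero second output even though the second input is zero, which shows the information from time t-1 being carried into time t.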

The computation module may also use a long short-term memory (LSTM) network, in which a memory block for storing parameters is added to the recurrent neural network.

The behavior analysis device of this embodiment may further include at least one of: a communication module for transmitting the determined situation to a remote control terminal over a wireless network; a movement module for moving the behavior analysis device to a destination; and a GPS module for determining the location of the behavior analysis device.

According to embodiments of the present invention, deep learning enables intelligent, active object detection. This overcomes two limitations of conventional image-processing-based object detection: the burden of entering new feature information every time the object's features are updated, and the difficulty of generalizing the specific features of an object.

According to embodiments of the present invention, the entire image does not need to be processed to analyze an object's behavior, so the behavior pattern of an object can be identified with sufficient quality even on a resource-constrained mobile device such as a drone.

According to embodiments of the present invention, the thermal image sensor compensates for a limitation of the RGB sensor, namely degraded detection performance when external light is insufficient or excessive, which enables reliable object detection and behavior analysis.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a configuration diagram of the behavior analysis system of Embodiment 1 and the behavior analysis device included in that system.
FIG. 2 is a conceptual diagram illustrating the operating principle of a CNN.
FIG. 3 is a block diagram illustrating the operation of the object detection unit.
FIG. 4 is a conceptual diagram illustrating in detail the operating principle of the sensor fusion module.
FIG. 5 is a conceptual diagram explaining the operating algorithm of an RNN.
FIG. 6 is a conceptual diagram illustrating the operating algorithm of the situation determination unit.
FIG. 7 is a conceptual diagram intuitively illustrating the principle of the algorithm of FIG. 6.
FIG. 8 is a configuration diagram of the behavior analysis system of Embodiment 2 and the behavior analysis device included in that system.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

In describing the present invention, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the invention.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in this application, should not be interpreted in an idealized or excessively formal sense.

The term "module" as used in this specification means a unit that processes a specific function or operation, and may refer to hardware, software, or a combination of hardware and software.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

In the present invention, an "object" refers to an analysis target contained in an RGB image or a thermal image and may be defined in various ways depending on the configured conditions, for example as a person, a vehicle, or a pet. This specification assumes that the object is a person, but is not necessarily limited to this.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted.

<Embodiment 1>

Embodiment 1 relates to a behavior analysis device that performs image capture, object detection, and situation determination on its own.

FIG. 1 is a configuration diagram of the behavior analysis system of Embodiment 1 and the behavior analysis device included in that system.

As shown in FIG. 1, the behavior analysis system of Embodiment 1 includes a behavior analysis device (1) and a remote control terminal (2).

The behavior analysis device (1) includes a first sensor (10) that captures an RGB image, a second sensor (20) that captures a thermal image, an object detection unit (200) that detects an object by fusing first feature information extracted from the RGB image of the first sensor (10) with second feature information extracted from the thermal image of the second sensor (20), and a situation determination unit (300) that determines the situation of the object by analyzing partial images of the detected object in time series. It may further include a preprocessing unit (100) that corrects the RGB image and the thermal image to facilitate object detection.

To aid understanding, it is assumed that, in the system of Embodiment 1, the behavior analysis device (1) is implemented as a drone that searches for people in distress after a maritime accident, and the control terminal (2) is implemented as a mobile communication terminal connected to the drone over a wireless communication network.

In this case, the behavior analysis device (1) may further include a GPS module (30) for determining its own location, a communication module (40) for transmitting the situation determined by the situation determination unit (300) and/or the location information obtained by the GPS module (30) to the remote control terminal over a wireless network, and a movement module (50) for moving the behavior analysis device to a destination.

The movement module (50) may be implemented with wheels, caterpillar tracks, propellers, jet thrusters, quadcopter fans, or various similar means of movement. When the device is implemented as a flying drone, as assumed above, quadcopter fans may be used.

Before the RGB image and the thermal image are input to the object detection unit (200), the preprocessing unit (100) performs viewing angle correction on the RGB image and the thermal image and performs edge correction on the thermal image. Optionally, only one of the viewing angle correction and the edge correction may be performed.

The first sensor (10) and the second sensor (20) each generate an image by projecting a three-dimensional object onto two dimensions. Because the sensors are mounted at different positions, the captured images differ according to viewing angle even when both sensors look at the same object. The sensor fusion described later is easier when the positional information of the RGB image and the thermal image of the same scene coincides, so the positional information of the two images must be aligned. The preprocessing unit (100) therefore matches the viewing angles of the still thermal images collected from the thermal camera and the still color images collected from the color camera.

In addition, whereas the first sensor (10) acquires an RGB image rendered with a very large number of pixels and colors, the second sensor (20) acquires a grayscale thermal image, which does not accurately represent the edges or contours of the objects it contains. The preprocessing unit (100) therefore corrects object edges by applying an edge filter that raises the visual contrast of the thermal image. Examples of edge filters include the Sobel edge detection filter and the Canny edge detection filter; any other filter with the same function may also be applied.
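To make the edge filter step concrete, the following Python sketch computes a Sobel gradient magnitude over a grayscale thermal frame stored as a list of rows. The function name, the pure-Python representation, and leaving border pixels at zero are assumptions of this sketch; a real preprocessing unit would more likely rely on an optimized image library.

```python
def sobel_magnitude(img):
    """Sobel gradient magnitude of a grayscale image given as a list of rows.

    Border pixels are left at 0.0 because the 3x3 window does not fit there.
    """
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical gradient kernel
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = sum(kx[j][i] * img[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```

On a frame with a vertical temperature step, the response is large along the step and zero in flat regions, which is the kind of contour emphasis the edge correction is meant to provide.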

The following description focuses on the object detection unit (200) and the situation determination unit (300).

The object detection unit (200) may use a convolutional neural network (CNN) to detect objects in the RGB image and the thermal image.

FIG. 2 is a conceptual diagram illustrating the operating principle of a CNN.

As shown in FIG. 2, a CNN is a network that extracts features from input image data through convolution operations and uses the extracted features to detect objects contained in the image data. It consists of feature extraction layers, which extract the feature information of an object, and fully connected layers, which classify the object using the extracted feature information. In other words, a CNN can be regarded as one of the object detection techniques that derive geometric information such as the size and position of a specific object in the image data (localization) and classify that object (classification).

The feature extraction layers consist of multiple convolution layers. The convolution layer at any given stage generates a new feature map by convolving its input data (image data or a feature map) with a convolution kernel that acts as a filter, and passes the generated feature map to the convolution layer of the next stage. Each convolution layer consists of weights and biases, which are parameters whose values change during training, and the values of the convolution kernels are adjusted automatically during training so that information useful for object detection, such as color, shape, and contour, is extracted.
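A single convolution layer of the kind described above can be sketched as follows. This is a minimal single-channel illustration with no padding and stride 1; the function name and the fixed kernels are assumptions of this sketch, whereas in a trained CNN the kernel values and bias would be learned.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide a kernel over a 2-D input and return the resulting feature map.

    No padding, stride 1, single channel. As in most deep learning
    frameworks, the kernel is applied without flipping (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[bias + sum(kernel[j][i] * image[y + j][x + i]
                        for j in range(kh) for i in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

# Stacking layers: the feature map from one stage is the input to the next.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
stage1 = conv2d(image, [[1, 0], [0, 1]])   # 2x2 feature map
stage2 = conv2d(stage1, [[1, 1], [1, 1]])  # 1x1 feature map
```

The second call consumes the output of the first, which mirrors how each stage's feature map becomes the input of the next convolution layer.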

The fully connected layers consist of a flatten layer, which converts the multi-dimensional feature map produced by the feature extraction layers into one dimension, and a multi-layer perceptron (MLP). The one-dimensional feature map is input to the MLP, which serves as the learning algorithm, and the parameters are adjusted accordingly.

The object detection unit (200) runs the CNN feature extraction layers on the RGB image and on the thermal image separately, and then runs the fully connected layers by inputting a flatten layer to which sensor fusion has been applied into the multi-layer perceptron. In this way it re-evaluates the reliability of each image's features and makes the final detection using the more reliable features.

FIG. 3 is a block diagram illustrating the operation of the object detection unit (200).

As shown in FIG. 3, the object detection unit (200) includes a first detection module (210), a second detection module (220), a sensor fusion module (230), and a final detection module (240).

The first detection module (210) detects an object in the RGB image and outputs the feature information of that object. The second detection module (220) detects an object in the thermal image and outputs the feature information of that object. That is, the first detection module (210) and the second detection module (220) can be understood as running the CNN feature extraction layers described above on the RGB image and the thermal image, respectively.

Depending on the number of stages used for object detection, the CNN feature extraction layers are classified as a single-stage detector or a two-stage detector. A two-stage detector works in two steps: it first proposes regions of interest (ROI) in the image data where an object is likely to exist, and then detects objects within those regions. A single-stage detector, by contrast, locates the objects in the image data and detects them at the same time.

Examples of single-stage detectors include YOLO v1, v2, and v3, SSD, DSSD, DSOD, RetinaNet, RefineDet, and M2Det. Examples of two-stage detectors include R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, and Mask R-CNN.

The following description uses an example in which the first detection module (210) and the second detection module (220) use YOLO (You Only Look Once), a real-time object detection algorithm that is a single-stage detector, but a person skilled in the art could readily substitute any of the other algorithms listed above. The first detection module (210) is also referred to as YOLO-C because it operates on RGB images, and the second detection module (220) is also referred to as YOLO-T because it operates on thermal images.

YOLO divides each image into an S x S grid and computes a confidence value for each grid cell. The confidence reflects how accurately an object is recognized within the cell. The initial bounding boxes are far from the actual objects, but by computing the confidence and adjusting the box positions, the bounding box with the highest recognition accuracy can be obtained. An object class score is computed to decide whether a cell contains an object. As a result, a total of S x S x N objects are predicted. Most of the cells have low confidence, so to raise confidence, neighboring cells are merged and a threshold is then applied to remove unnecessary predictions.
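The pruning of low-confidence predictions can be illustrated with confidence thresholding followed by non-maximum suppression (NMS), a post-processing step commonly paired with YOLO-style detectors. Treating the merge-and-threshold step as NMS is an assumption of this sketch, as are the function names, the box layout (x_min, y_min, x_max, y_max), and the default thresholds.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(detections, score_thr=0.5, iou_thr=0.5):
    """Keep confident, non-overlapping boxes.

    detections: list of (box, score) pairs.
    Boxes scoring below score_thr are dropped; among the rest, a box is kept
    only if it does not overlap an already kept, higher-scoring box by more
    than iou_thr.
    """
    confident = [d for d in detections if d[1] >= score_thr]
    confident.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in confident:
        if all(iou(box, kept_box) <= iou_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the 0.9 box, a distant box with score 0.7 survives, and a box with score 0.2 is discarded by the threshold.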

The sensor fusion module 230 fuses the detection results of the first detection module 210 and the second detection module 220, either to raise the accuracy of object detection or to allow an object to be detected even when one of the two sensors fails to detect it.

The first sensor 10, which captures RGB images, renders visible light reflected from objects as high-resolution image data, much as human vision does, and therefore shows the shape and color of the object to be detected better than any of the other recognition sensors.

However, it is vulnerable to the external environment: when there is too little light, or conversely when the incoming light suddenly becomes excessive, the object to be detected does not appear clearly in the image data. For example, at night when light is scarce, or when the sensor faces the sun directly and is momentarily swamped by external light, the detection performance of an RGB image-based object detection system drops sharply.

In contrast, the second sensor 20, which captures thermal images, converts the infrared radiant energy emitted by an object into temperature and renders it as image data. Because the second sensor 20 is less affected by the external environment, it represents the object to be detected well even when the amount of light suddenly becomes excessive, and in particular its object detection performance at night, when light is scarce, tends to be higher than that of other sensors. However, because the resolution of its image data is low, its object detection accuracy tends to be lower.

The sensor fusion module 230 fuses the detection results based on the bounding boxes and confidence scores output by YOLO-C and YOLO-T, so that the strengths and weaknesses of the two sensors complement each other and high reliability and detection performance can be achieved.

FIG. 4 is a conceptual diagram showing the operating principle of the sensor fusion module 230 in detail.

As shown in FIGS. 3 and 4, the sensor fusion module 230 includes an extraction module 231, which partly corresponds to the flattening layer of a CNN, and a learning module 232, which corresponds to an MLP.

The extraction module 231 selects the confidence scores (S_C, S_T) and the bounding boxes (bb_c, bb_t) from the feature information output by the first detection module 210 and the second detection module 220, computes the mean (μ_x, μ_y) and the variance (σ_x, σ_y) of the center coordinates of the two bounding boxes, and computes the minimum bounding box (bb_m) that contains both bounding boxes (bb_c, bb_t). Here, as described above, a bounding box (bb_c, bb_t, bb_m) refers to information comprising the center coordinates, width, and height of the rectangular region containing the object.

When the confidence scores (S_C, S_T), the bounding boxes (bb_c, bb_t), the mean of the center coordinates (μ_x, μ_y), the variance of the center coordinates (σ_x, σ_y), and the minimum bounding box (bb_m), selected or computed in this way, are defined as the input data, the input data can be expressed by the following equation as the fusion feature information (F) that is input to the MLP.

[Equation 1]

F = (S_C, S_T, bb_c, bb_t, μ_x, μ_y, σ_x, σ_y, bb_m)
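As a minimal sketch, the fusion feature information F of Equation 1 may be assembled as follows. The names `Box`, `enclosing_box`, and `fusion_features` are hypothetical, and the use of the population variance over the two center coordinates is an assumption, since the specification does not state which variance estimator is used.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Bounding box given by center coordinates, width, and height."""
    cx: float
    cy: float
    w: float
    h: float


def enclosing_box(a: Box, b: Box) -> Box:
    """Smallest axis-aligned box containing both a and b (bb_m)."""
    left = min(a.cx - a.w / 2, b.cx - b.w / 2)
    right = max(a.cx + a.w / 2, b.cx + b.w / 2)
    top = min(a.cy - a.h / 2, b.cy - b.h / 2)
    bottom = max(a.cy + a.h / 2, b.cy + b.h / 2)
    return Box((left + right) / 2, (top + bottom) / 2,
               right - left, bottom - top)


def fusion_features(s_c: float, s_t: float,
                    bb_c: Box, bb_t: Box) -> Tuple[float, ...]:
    """Flatten (S_C, S_T, bb_c, bb_t, mu, sigma, bb_m) into one vector."""
    mu_x = (bb_c.cx + bb_t.cx) / 2
    mu_y = (bb_c.cy + bb_t.cy) / 2
    # Assumption: population variance of the two center coordinates.
    var_x = ((bb_c.cx - mu_x) ** 2 + (bb_t.cx - mu_x) ** 2) / 2
    var_y = ((bb_c.cy - mu_y) ** 2 + (bb_t.cy - mu_y) ** 2) / 2
    bb_m = enclosing_box(bb_c, bb_t)
    return (s_c, s_t, *bb_c, *bb_t, mu_x, mu_y, var_x, var_y, *bb_m)
```

With each box flattened to four numbers, F in this sketch has 18 components, which would fix the width of the MLP input layer under the same assumptions.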

The learning module 232 learns a multidimensional nonlinear mapping (for example, a multi-layer perceptron, MLP) between the bounding boxes (bb_c, bb_t) output by YOLO-C and YOLO-T and the ground-truth bounding box (bb_g). Through MLP training, the confidence scores are reset so that the outputs of YOLO-C and YOLO-T correlate more meaningfully with the ground truth.

Specifically, the learning module 232 sets the target value of the MLP to the Intersection over Union (IoU), that is, the area of the intersection relative to the area of the union, between the bounding boxes (bb_c, bb_t) output by YOLO-C and YOLO-T and the ground-truth bounding box (bb_g), and is thereby trained so that the reset scores correlate more strongly with the ground truth than the original confidence scores do.
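By way of example, the IoU used as the MLP target may be computed as in the following sketch. The function name `iou` is hypothetical, boxes are assumed to be (center x, center y, width, height) tuples as in the definition of the bounding box above, and forming one target per detector, namely iou(bb_c, bb_g) and iou(bb_t, bb_g), is an assumption about how the targets are arranged.

```python
from typing import Tuple

BoxT = Tuple[float, float, float, float]  # (cx, cy, w, h)


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over Union of two center-format boxes."""
    # Convert each box to its corner coordinates.
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    # Width and height of the overlap, clamped at zero when disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```

A detection that coincides exactly with the ground truth thus receives a target of 1, and a detection that does not overlap it at all receives a target of 0.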

The final detection module 240 refers to the two bounding boxes (bb_c, bb_t) output by the sensor fusion module 230 and the confidence scores (S_C', S_T') updated by the MLP, and derives the final bounding box and confidence score (bb_f, S_F) according to a predetermined rule.

The above describes how the object detection unit 200 performs optimal object detection using an algorithm derived from a CNN. A CNN can extract objects with high accuracy, but running a CNN over a very large number of images is very difficult. Accordingly, in another embodiment, the object detection unit 200 may reduce the number of bounding boxes required for object classification by using the R-CNN algorithm, which takes regions into account.

Next, the operation of the situation determination unit 300 is described in detail.

The situation determination unit 300 determines the situation of the object by analyzing the image of the object detected by the object detection unit 200, that is, the image of the specific region containing the object.

When an image captured by a drone contains many objects, for example many people, traditional image processing demands enormous computing resources, which a drone, with its limited power and computing resources, cannot supply. The present invention therefore proposes a two-stage analysis technique: first, a person is found using an object detection network such as a CNN; then only the images of the specific region containing that person are analyzed as a time series to determine what the person's behavior means.

In particular, it is difficult to judge a person's behavior from a single still frame. For example, even if a drone system for finding people in distress has detected a person floating on the sea using an object detection network such as a CNN, a single still image of that person waving an arm is not enough to tell whether the person is swimming normally in a safe situation or calling for rescue in a distress situation. Therefore, the person is first detected by the object detection network, and the type of movement, such as swimming or a call for rescue, is then determined by analyzing the sequence of the person's actions over several temporally consecutive image frames of the detected region rather than a single image frame.
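The second stage described above operates only on the detected region of each frame. A minimal sketch of building that partial-image time series is given below; the names `crop` and `object_sequence` are hypothetical, frames are assumed to be two-dimensional lists of pixel values, and one bounding box per frame is assumed to be available from the object detection network.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]                    # rows of pixel values
BoxT = Tuple[float, float, float, float]   # (cx, cy, w, h) in pixels


def crop(frame: Frame, box: BoxT) -> Frame:
    """Cut the region described by box out of one frame."""
    cx, cy, w, h = box
    height, width = len(frame), len(frame[0])
    # Clamp the box to the frame so partially visible objects still work.
    x1, x2 = max(0, int(cx - w / 2)), min(width, int(cx + w / 2))
    y1, y2 = max(0, int(cy - h / 2)), min(height, int(cy + h / 2))
    return [row[x1:x2] for row in frame[y1:y2]]


def object_sequence(frames: Sequence[Frame],
                    boxes: Sequence[BoxT]) -> List[Frame]:
    """Partial images of the detected object, in time order."""
    return [crop(f, b) for f, b in zip(frames, boxes)]
```

Only these cropped regions, rather than the full frames, would then be passed to the time-series analysis, which is what keeps the second stage comparatively lightweight.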

When several temporally consecutive image frames are defined as time-series data, the time-series data are dynamic data whose length varies with time and whose characteristics change. In particular, analyzing time-series data that span a long period requires grasping the context (sequence) of the data from the point at which acquisition began to the final point, and existing MLPs and CNNs are limited in their ability to grasp the context of such long time-series data.

Accordingly, the present invention uses a recurrent neural network (RNN) as an example algorithm for effectively analyzing multiple video frames of the object detected by the CNN. The RNN determines the situation of the object by analyzing the time-series data as they are input sequentially.

To this end, the situation determination unit 300 comprises a calculation module (not shown), which derives an output value at time t (t being an integer of 1 or more) by applying, as a weight, the feature information of the object detected at time t-1 to the feature information of the object detected at time t, and an analysis module (not shown), which determines the situation of the object by analyzing the sequence of output values of the calculation module from the time the object is detected until a preset time.

FIG. 5 is a conceptual diagram explaining the operating algorithm of an RNN.

As shown in FIG. 5, an RNN combines an MLP with a recurrent edge (W); when it receives time-series data, it also receives its past output value through the recurrent edge. Because the output of the RNN at time t is thus influenced by the output at the previous time t-1, the RNN can be defined as a network with a feedback structure that contains 'memory' information for remembering past values.

FIG. 6 is a conceptual diagram showing the operating algorithm of the situation determination unit 300, and FIG. 7 is a conceptual diagram that intuitively illustrates the principle of the algorithm of FIG. 6.

As shown in FIG. 6, unlike the ordinary RNN of FIG. 5, which applies the recurrent edge to a single piece of data (for example, a single still frame), the RNN of the situation determination unit 300 is a so-called "unfolded RNN", in which the feedback structure of the RNN recurs over data arranged in time order (for example, a plurality of consecutive video frames).

Like an MLP, the RNN of the calculation module (not shown) finds the parameters that predict, from the given input data, an output value close to the target value; as time advances, it receives new data and computes the output value using the weights from the previous time step.

As shown in FIG. 6, the RNN consists of an input layer (x), a hidden layer (h), and an output layer (y), with a recurrent edge added to the hidden layer. The recurrent edge (W) carries the information from time t-1 to time t.

The inputs X = (x^(1), x^(2), ..., x^(T)) are fed in sequentially over time, and the hidden-layer value is computed by applying the weight U to the input data X. The hidden-layer value h^(t) at time t is computed from the input value x^(t) and the hidden-layer value h^(t-1) at the immediately preceding time t-1, and the outputs Y = (y^(1), y^(2), ..., y^(T)) are produced over time. This process is expressed by Equation 2.

[Equation 2]

h^(t) = f(h^(t-1) * W + x^(t) * U) (for t >= 2)

h^(t) = f(x^(t) * U) (for t = 1)

Here, f(x) is an activation function.
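For illustration, the recurrence of Equation 2 may be sketched in plain Python as follows. The names `matvec` and `rnn_hidden_states` are hypothetical, vectors are treated as row vectors multiplied on the right by the weight matrices U and W, and tanh is assumed as the activation function f, which the specification leaves unspecified. The output layer y is omitted because its formula is not given.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(vec: Sequence[float], mat: Matrix) -> Vector:
    """Row vector (length n) times an n x m matrix, giving length m."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]


def rnn_hidden_states(xs: Sequence[Vector], U: Matrix, W: Matrix,
                      f: Callable[[float], float] = math.tanh
                      ) -> List[Vector]:
    """Hidden-layer values h^(1), ..., h^(T) following Equation 2."""
    hs: List[Vector] = []
    for t, x in enumerate(xs):
        pre = matvec(x, U)                 # x^(t) * U
        if t > 0:                          # t >= 2 in the 1-based indexing
            rec = matvec(hs[-1], W)        # h^(t-1) * W
            pre = [p + r for p, r in zip(pre, rec)]
        hs.append([f(p) for p in pre])
    return hs
```

Because each h^(t) is computed from h^(t-1), the last hidden-layer value depends on every frame in the sequence, which is the property the analysis module relies on.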

The analysis module determines the situation of the object by analyzing the sequence of output values of the calculation module (not shown) from the time the object is detected until a preset time.

Meanwhile, when time-series data are processed with an RNN and the data are long because they were acquired over a long period, the 'vanishing gradient' problem may arise, in which the gradients used to update the parameters shrink toward zero. Accordingly, in another embodiment, the calculation module (not shown) of the situation determination unit 300 may use an LSTM (Long Short-Term Memory) network, which adds memory blocks to the RNN in order to handle long time-series data.

Because the hidden-layer value at time t is influenced by the hidden-layer value at the previous time t-1, the LSTM grasps the context of the input data up to the end of the video sequence and computes the final output value y^(T); the situation of the object is determined through this process of grasping the context.
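The specification does not give the LSTM equations. As a sketch only, the widely used LSTM cell formulation with input, forget, and output gates may be written for scalar inputs and states as follows; the names `sigmoid`, `lstm_step`, and `run_lstm`, the scalar simplification, and the parameter layout are all assumptions made for this example. The cell state c plays the role of the memory block mentioned above.

```python
import math
from typing import Dict, Sequence, Tuple

Params = Dict[str, Tuple[float, float, float]]  # gate -> (w_x, w_h, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float,
              p: Params) -> Tuple[float, float]:
    """One time step of a scalar LSTM cell; returns (h, c)."""
    def gate(name: str, act) -> float:
        w_x, w_h, b = p[name]
        return act(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much old memory to keep
    o = gate("output", sigmoid)       # how much memory to expose
    g = gate("candidate", math.tanh)  # candidate memory content
    c = f * c_prev + i * g            # updated cell state (memory)
    h = o * math.tanh(c)              # updated hidden-layer value
    return h, c


def run_lstm(xs: Sequence[float], p: Params) -> float:
    """Final hidden-layer value after consuming the whole sequence."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, p)
    return h
```

When the forget gate stays close to 1, the cell state is carried forward almost unchanged from step to step, which is how this formulation retains information over long sequences.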

<Embodiment 2>

Embodiment 2 concerns the case in which a photographing terminal captures and transmits the images, while a behavior analysis server at a remote location detects the object and determines the situation.

FIG. 8 is a configuration diagram of the behavior analysis system according to Embodiment 2 and of the behavior analysis device included in that system.

The behavior analysis system of Embodiment 2 comprises a photographing terminal 3 and a behavior analysis server 4.

To aid understanding, assume that in the system of Embodiment 2 the photographing terminal 3 is implemented as a CCTV camera that records customer behavior in a department store, and the behavior analysis server 4 is implemented as a security server connected to the CCTV camera over a wireless or wired communication network.

The photographing terminal 3 includes a first sensor 11 that captures an RGB image of the store, a second sensor 21 that captures a thermal image of the same scene, and a communication module 41 for transmitting the two captured images to the behavior analysis server 4 over a wired or wireless network. The first sensor 11 and the second sensor 21 are configured identically to the first sensor 10 and the second sensor 20 of Embodiment 1 described above.

The behavior analysis server 4 comprises an object detection unit 201, which detects an object by fusing first feature information extracted from the RGB image of the first sensor 11 with second feature information extracted from the thermal image of the second sensor 21, and a situation determination unit 301, which determines the situation of the object by analyzing the time-series images of the detected object; it may further include a pre-processing unit 101, which corrects the RGB and thermal images to facilitate object detection, and a communication module 42 for communicating with the photographing terminal 3. The pre-processing unit 101, the object detection unit 201, and the situation determination unit 301 are configured identically to the pre-processing unit 100, the object detection unit 200, and the situation determination unit 300 of Embodiment 1 described above.

The embodiments above describe the behavior analysis device as used mainly for lifesaving and search, but its use need not be limited to this.

For example, in a brick-and-mortar clothing store, customers can be classified by age and sex, and behavior analysis can then reveal which products women in their 30s try on most often and which they buy most often. Likewise, CCTV installed in a daycare center can be used to analyze during which classes a particular child is active and which behavioral characteristics appear at each age. Behavior classification data analyzed in this way can be used in a variety of fields, for example to identify buyers' preferences when drawing up a product launch plan.

Those skilled in the art will readily understand that all or some of the functions of the behavior analysis device or behavior analysis system described above may be implemented as software in which a series of program instructions, data files, data structures, and the like, alone or in combination, are translated into machine language, or that such software may be provided on a computer-readable recording medium. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and USB memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices may be configured to operate as one or more software modules in order to perform the operations of the present invention.

Although the present invention has been described above with reference to several embodiments, those of ordinary skill in the art will understand that it may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

1: behavior analysis device 2: control terminal
3: photographing terminal 4: behavior analysis server
10, 11: first sensor 20, 21: second sensor
30: GPS module 40, 41: communication module
50: movement module
100: pre-processing unit 200: object detection unit
210: first detection module 220: second detection module
230: sensor fusion module 240: final detection module
300: situation determination unit

Claims

A behavior analysis device using heterogeneous sensor fusion, comprising:
an object detection unit that detects a human object by fusing first feature information extracted from an RGB image of a first sensor with second feature information extracted from a thermal image of a second sensor; and
a situation determination unit that analyzes partial images of the detected object as a time series to determine whether the person's behavior is behavior in a distress situation,
wherein the situation determination unit comprises:
a calculation module that derives an output value at time t (t being an integer of 1 or more) by applying, as a weight, the feature information of the object detected at time t-1 to the feature information of the object detected at time t; and
an analysis module that determines whether the person's behavior is behavior in a distress situation by analyzing the sequence of output values of the calculation module from the time the object is detected until a preset time.

The behavior analysis device using heterogeneous sensor fusion of claim 1, further comprising a pre-processing unit that, before the RGB image and the thermal image are input to the object detection unit, performs viewing-angle correction on the RGB image and the thermal image and performs edge correction on the thermal image.

The behavior analysis device using heterogeneous sensor fusion of claim 1, wherein the object detection unit comprises:
a first detection module that extracts, from the RGB image, first feature information including a bounding box (bb_c) and a confidence score (S_C) of the object;
a second detection module that extracts, from the thermal image, second feature information including a bounding box (bb_t) and a confidence score (S_T) of the object;
a sensor fusion module that resets the confidence score (S_C, S_T) for each image so that the outputs of the first detection module and the second detection module correlate with the ground truth; and
a final detection module that derives final feature information based on the reset confidence scores (S_C', S_T').

The behavior analysis device using heterogeneous sensor fusion of claim 3, wherein the sensor fusion module comprises:
an extraction module that uses the first feature information and the second feature information to derive fusion feature information (F) including at least one of the confidence score (S_C, S_T) for each image, the bounding box (bb_c, bb_t) for each image, the mean (μ_x, μ_y) of the center coordinates of the bounding boxes for the images, the variance (σ_x, σ_y) of the center coordinates of the bounding boxes for the images, and a minimum bounding box (bb_m) containing all of the bounding boxes for the images; and
a learning module that inputs the fusion feature information (F) to a multi-layer perceptron whose target value is set to the Intersection over Union (IoU) between the bounding box (bb_c, bb_t) for each image and the ground-truth bounding box (bb_g).

The behavior analysis device using heterogeneous sensor fusion of claim 3 or 4, wherein the first detection module and the second detection module use the You Only Look Once (YOLO) algorithm for object detection.

(Deleted)

The behavior analysis device using heterogeneous sensor fusion of claim 1, wherein the calculation module uses a recurrent neural network that includes an input layer (x), a hidden layer (h), and an output layer (y), the hidden layer having an added recurrent edge that carries the detection information of the object at time t-1 to time t.

The behavior analysis device using heterogeneous sensor fusion of claim 7, wherein the calculation module uses a long short-term memory (LSTM) network in which a memory block for storing parameters is added to the recurrent neural network.

The behavior analysis device using heterogeneous sensor fusion of claim 1, further comprising a communication module that transmits the determined situation over a wireless network to a control terminal at a remote location.

The behavior analysis device using heterogeneous sensor fusion of claim 9, further comprising:
a movement module that moves the behavior analysis device to a destination; and
a GPS module that determines the location of the behavior analysis device.