KR101563297B1

KR101563297B1 - Method and apparatus for recognizing action in video

Info

Publication number: KR101563297B1
Application number: KR1020140048687A
Authority: KR
Inventors: 서일홍; 박영빈; 장준익
Original assignee: 한양대학교 산학협력단
Priority date: 2014-04-23
Filing date: 2014-04-23
Publication date: 2015-10-26

Abstract

The present invention relates to a method and an apparatus for recognizing an action in a video. The apparatus for recognizing an action: extracts one or more RGB images and joint data from an input video by using Kinect; stratifies the RGB images and the joint data to extract attributes for action recognition; extracts and combines a first attribute value for the RGB images and a second attribute value for the joint data according to the stratification; extracts at least one third attribute value; groups the third attribute value into a predetermined number of groups by using K-means clustering; converts the groups into a histogram exhibiting an action pattern; and recognizes the action pattern from the histogram.

Description

Method and apparatus for recognizing an action in a video

The present invention relates to a method of recognizing an action in a video, and more particularly to a method of recognizing an action by extracting the data needed for action recognition using Kinect and layering the extracted data to learn features.

Features designed from existing image or pose data work well in specific domains, but they are hard to apply when the data or the kinds of actions change. Methods that recognize actions with such designed features have therefore relied on features that experts designed with considerable time and effort. However, improving performance this way takes a great deal of time and effort, and the resulting features remain hard to apply to other sensors and data.

Meanwhile, the prior art document cited below proposes a system and method for recognizing, in real time, human actions contained in video captured by a video camera. However, that prior art recognizes actions by designing image-based features. It therefore has critical drawbacks: designing the features takes a great deal of time, and the designed features can be used only with a specific sensor.

From this perspective, there is a need for a technical means that efficiently reduces the time spent designing the features needed for action recognition and that non-experts can easily extend and apply.

Patent Document 1: Korean Laid-Open Patent Publication No. 10-2010-0104272 (September 29, 2010)

Accordingly, a first problem to be solved by the present invention is to provide a method of recognizing an action in a video by extracting RGB images and joint data from an input video using Kinect, layering the extracted data to learn features, combining the learned features, clustering them, and then training a support vector machine (SVM).

A second problem to be solved by the present invention is to provide an apparatus for recognizing an action in a video by extracting RGB images and joint data from an input video using Kinect, layering the extracted data to learn features, combining the learned features, clustering them, and then training an SVM.

To achieve the first object, an embodiment of the present invention provides a method of recognizing an action in a video, the method comprising: extracting, by an action recognition apparatus, one or more RGB images and joint data from an input video using Kinect; layering, by the action recognition apparatus, the RGB images and the joint data respectively so as to extract features for action recognition; deriving, by the action recognition apparatus, one or more third feature values by extracting and combining one or more first feature values for the RGB images and one or more second feature values for the joint data according to the respective layering; grouping, by the action recognition apparatus, the third feature values into a predetermined number of groups using K-means clustering, and converting the groups into a histogram representing action patterns; and recognizing, by the action recognition apparatus, an action pattern from the histogram.

In the method according to the above embodiment, joint data including the relative 3D positions of body joints obtained through the Kinect may be stored sequentially over time, thereby deriving the trajectories of the joints over time.

In the method according to the above embodiment, the extracted RGB images and joint data may each be extracted from the same frame of the video.

In the method according to the above embodiment, the layer may have the RGB images and joint data as its input layer and the first and second feature values as its output layer.

In the method according to the above embodiment, the histogram may indicate the frequency of specific action patterns in the input video.

In the method according to the above embodiment, the third feature value may be a vector, and the K-means clustering may learn as many centroids as the predetermined number of groups.

In the method according to the above embodiment, recognizing the action pattern may include the action recognition apparatus looking up the action type mapped to the histogram and recognizing the action pattern according to that action type.

To achieve the second object, an embodiment of the present invention provides an apparatus for recognizing an action in a video, the apparatus comprising: an extraction unit that extracts one or more RGB images and joint data from an input video using Kinect; a processing unit that layers the RGB images and the joint data respectively so as to extract features for action recognition, extracts and combines one or more first feature values for the RGB images and one or more second feature values for the joint data according to the respective layering to derive one or more third feature values, groups the third feature values into a predetermined number of groups using K-means clustering, and converts the groups into a histogram representing action patterns; and a recognition unit that recognizes an action pattern from the histogram.

In the apparatus according to the above embodiment, the extraction unit may extract the RGB images and the joint data from the same frame of the input video.

In the apparatus according to the above embodiment, the processing unit may configure a layer whose input layer is the RGB images and joint data and whose output layer is the first and second feature values.

In the apparatus according to the above embodiment, the third feature value may be a vector, and the K-means clustering may learn as many centroids as the predetermined number of groups.

In the apparatus according to the above embodiment, the recognition unit may look up the action type mapped to the histogram and recognize the action pattern according to that action type.

According to the present invention, RGB images and joint data are extracted from an input video using Kinect, the extracted data is layered to learn features, and the learned features are combined, clustered, and then used to train an SVM. This efficiently reduces the time spent designing the features needed for action recognition and allows non-experts to extend and apply the approach easily.

FIG. 1 is a flowchart illustrating a method of recognizing an action in a video adopted by embodiments of the present invention.
FIG. 2A is a diagram illustrating the hierarchical structure of independent component analysis according to an embodiment of the present invention.
FIG. 2B is a diagram illustrating a result of layering according to an embodiment of the present invention.
FIG. 2C is a diagram illustrating model parameters learned and connected according to the layering according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a method of applying deep learning to transformed pose information according to an embodiment of the present invention.
FIG. 4 is another diagram illustrating a method of recognizing an action in a video according to an embodiment of the present invention.
FIG. 5 is a block diagram illustrating an apparatus for recognizing an action in a video adopted by embodiments of the present invention.
FIG. 6 is a diagram comparing results according to an embodiment of the present invention with the prior art.

Before describing the embodiments of the present invention, the problems that arise in existing action recognition systems are reviewed, and the technical means adopted by the embodiments to solve these problems are then outlined.

Several feature-learning methods can be used to overcome the limits of features hand-designed for recognizing actions in video. Recently, approaches based on deep learning have been successful: in many areas, such as handwritten digit recognition, object recognition, and scene recognition, image-based features learned with deep learning deliver high performance. In particular, on the Hollywood dataset, where image-based action recognition is known to be difficult, features learned with deep learning perform far better than methods that use conventionally designed features.

Pose-based action recognition, another approach to action recognition, has mainly relied on accurate joint data. Accurate joint data can be obtained from motion capture data and converted into pose-based features to recognize actions. However, accurate joint data is hard to obtain without motion capture equipment and is costly. Recently, Kinect has made it cheap and easy to obtain joint data, although the data is noisy. Even so, designing which features are useful once such joint data has been converted into pose information still requires too much time and effort.

Accordingly, the embodiments of the present invention propose a technical means that not only learns pose-based features with deep learning from joint data obtained from Kinect but also combines them with image-based features, so that the strengths and weaknesses of image-based and pose-based action recognition complement each other.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings. In the following description and the accompanying drawings, detailed descriptions of well-known functions or configurations that could obscure the subject matter of the present invention are omitted. Note also that the same components are denoted by the same reference numerals wherever possible throughout the drawings.

FIG. 1 is a flowchart illustrating a method of recognizing an action in a video adopted by embodiments of the present invention. The action recognition apparatus extracts one or more RGB images and joint data from an input video using Kinect; layers the RGB images and the joint data respectively so as to extract features for action recognition; extracts and combines one or more first feature values for the RGB images and one or more second feature values for the joint data according to the respective layering to derive one or more third feature values; groups the third feature values into a predetermined number of groups using K-means clustering; converts the groups into a histogram representing action patterns; and recognizes an action pattern from the histogram.

More specifically, in step S110 the action recognition apparatus extracts one or more RGB images and joint data from the input video using Kinect. In other words, a Kinect device can be used to acquire RGB images, depth images, and joint (skeleton pose) data for action recognition. The OpenNI library for Kinect can be used to acquire the RGB images, depth images, and joint data.

In addition, joint data including the relative 3D positions of body joints obtained through the Kinect can be stored sequentially over time to derive the trajectories of the joints over time. That is, the joint data can be acquired with the Kinect OpenNI library, and the acquired joint data gives, for 20 body joints, the 3D position relative to the Kinect along the X, Y, and Z axes. Accumulating this information continuously therefore yields the skeleton trajectory, which describes how each joint moved over time.
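The accumulation of per-frame joint positions into trajectories can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `accumulate_trajectories` and the list-of-tuples frame layout are assumptions, and the frames are synthetic values standing in for data that would come from the Kinect.

```python
from collections import defaultdict

NUM_JOINTS = 20  # the description mentions 20 body joints


def accumulate_trajectories(frames):
    """Turn per-frame joint positions into per-joint trajectories.

    frames: list of frames, each a list of NUM_JOINTS (x, y, z) tuples
            (hypothetical layout chosen for this sketch).
    Returns a dict mapping joint index -> list of (x, y, z) in time order.
    """
    trajectories = defaultdict(list)
    for frame in frames:
        if len(frame) != NUM_JOINTS:
            raise ValueError("each frame must contain 20 joints")
        for joint_index, position in enumerate(frame):
            trajectories[joint_index].append(tuple(position))
    return dict(trajectories)


# Two synthetic frames: every joint moves 0.1 along the X axis.
frame_0 = [(0.0, float(j), 2.0) for j in range(NUM_JOINTS)]
frame_1 = [(0.1, float(j), 2.0) for j in range(NUM_JOINTS)]
trajectories = accumulate_trajectories([frame_0, frame_1])
```

Each entry of `trajectories` then holds one joint's positions in time order, which is the skeleton trajectory the paragraph above describes.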

In addition, the extracted RGB images and joint data can each be extracted from the same frame of the video. In other words, the RGB images, depth images, and joint data can each be extracted with the Kinect OpenNI library, and the extracted data streams are synchronized frame by frame. For example, the first RGB frame, the first depth frame, and the first joint data record may be data captured at the same time.
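As a small illustration of this frame-level synchronization, the sketch below pairs the three streams by frame index. The function name `synchronize` and the dictionary layout are assumptions made for this example, and the string placeholders stand in for real image and pose data.

```python
def synchronize(rgb_frames, depth_frames, joint_frames):
    """Pair up per-frame data that was captured at the same time."""
    if not len(rgb_frames) == len(depth_frames) == len(joint_frames):
        raise ValueError("all three streams must have the same frame count")
    return [
        {"frame": index, "rgb": rgb, "depth": depth, "joints": joints}
        for index, (rgb, depth, joints) in enumerate(
            zip(rgb_frames, depth_frames, joint_frames)
        )
    ]


# Placeholder strings stand in for real images and joint records.
records = synchronize(["rgb0", "rgb1"], ["depth0", "depth1"], ["pose0", "pose1"])
```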

In step S120, the action recognition apparatus layers the RGB images and the joint data respectively so as to extract features for action recognition.

Steps S120 and S130 are described in more detail below with reference to FIGS. 2A to 2C.

FIG. 2A is a diagram illustrating the hierarchical structure of independent component analysis according to an embodiment of the present invention, and FIG. 2B is a diagram illustrating a result of layering according to an embodiment of the present invention.

More specifically, FIGS. 2A and 2B may show independent component analysis (Independent Subspace Analysis: ICA), one of the models used in the embodiments of the present invention to learn features for action recognition. In the independent component analysis model that forms the layer, the input layer may be the RGB images and joint data, and the output layer may be the first and second feature values.

In other words, the input layer is the data extracted in step S110, and the output layer may be a feature (or component) layer. If the input layer is denoted X, X may be an observed random variable; if the output layer is denoted S, S may be an unobserved random variable. Here, the input layer X is a two-dimensional image reshaped into a one-dimensional vector, and the input layer X and the output layer S may be fully connected to each other. Each connecting line is denoted w_i and is a model parameter, which is what must be learned. Because the network is fully connected between S and X, the number of w_i may be (dimension of S) × (dimension of X).
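The reshaping of a 2D image into the vector X and the resulting parameter count can be checked with a short sketch. The helper names `flatten` and `parameter_count` and the toy 2 × 3 image are assumptions made for illustration only.

```python
def flatten(image):
    """Reshape a 2D image (a list of rows) into a 1D input vector X."""
    return [pixel for row in image for pixel in row]


def parameter_count(dim_s, dim_x):
    """Number of weights when S and X are fully connected."""
    return dim_s * dim_x


image = [[1, 2, 3],
         [4, 5, 6]]          # toy 2 x 3 image
x = flatten(image)           # 6-dimensional input vector
num_weights = parameter_count(dim_s=4, dim_x=len(x))  # 4 * 6 weights
```

With a 6-dimensional X and a 4-dimensional S, the fully connected layer has 24 weights, matching the (dimension of S) × (dimension of X) rule.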

During learning, therefore, only the input X is known, while S and the w_i are unknown. Most learning in machine learning uses optimization, which proceeds as follows. First, an objective function to be optimized is designed. An optimization algorithm then searches for the set of model parameters w_i that minimizes (or, depending on the case, maximizes) that function. Once the model parameters are learned, rendering the model parameters connected to s_i, one element of S, that is, the vector w_i, as an image may give a result like FIG. 2C.

FIG. 2C shows a total of twelve w_i vectors rendered as images. Each w_i vector can act as a feature detector that detects a specific pattern in the image. After learning, during recognition, if a pattern similar in form to the w_i feature detector is present in the image, the output value of s_i becomes large; otherwise, the output value of s_i may be small. Thus, during learning the independent component analysis model learns and updates the w_i, and during recognition after learning the output of the model may be the S vector. Analyzing the S vector reveals which patterns are present in the image.
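The idea that a detector responds strongly to a matching pattern can be illustrated with a plain inner product between a weight vector and a flattened patch. This is a simplified sketch under an assumption: the source does not state the exact response function of the model, so the inner product and the toy stripe pattern are illustrative only.

```python
def detector_response(w_i, x):
    """Inner product of one weight vector w_i with the input vector x."""
    return sum(w * v for w, v in zip(w_i, x))


# Hypothetical detector for an alternating stripe pattern in a 4-pixel patch.
w_stripes = [1, -1, 1, -1]

matching_patch = [1, -1, 1, -1]   # same form as the detector
uniform_patch = [1, 1, 1, 1]      # no stripe pattern present

strong = detector_response(w_stripes, matching_patch)
weak = detector_response(w_stripes, uniform_patch)
```

The matching patch yields a larger response than the uniform patch, mirroring the behavior of s_i described above.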

Returning to FIG. 1, the steps after S120 are now described.

In step S130, the action recognition apparatus extracts and combines one or more first feature values for the RGB images and one or more second feature values for the joint data according to the respective layering, thereby deriving one or more third feature values.

More specifically, to learn features through the deep network that was layered in step S120 to learn features for action recognition, an output vector S is first obtained through a trained independent component analysis (ICA) model. The obtained vector S can then be used as the input X for training a higher-level ICA, and this process can be repeated to stack further levels. With this procedure, the w_i of a higher-level ICA are generally trained to detect more complex features than the lower-level feature detectors. The form of the feature detectors may differ somewhat depending on whether the input X is the RGB images or the joint data extracted in step S110, but the learning process is the same.
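The stacking described here, where the output S of one level becomes the input X of the next, can be sketched as a chain of linear layers. This is a hedged illustration: training is omitted, the weights are hand-picked placeholders rather than learned values, and the purely linear layer is an assumption, since the source does not specify the response function.

```python
def apply_layer(weights, x):
    """Compute the output vector S of one level from its input X."""
    return [sum(w * v for w, v in zip(w_i, x)) for w_i in weights]


def apply_stack(layers, x):
    """Feed the output S of each level into the next level as its X."""
    for weights in layers:
        x = apply_layer(weights, x)
    return x


# Placeholder weights: level 1 maps 4 -> 3 dimensions, level 2 maps 3 -> 2.
level_1 = [[1, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 1, 1]]
level_2 = [[1, 1, 0],
           [0, 0, 1]]

final_s = apply_stack([level_1, level_2], [1, 2, 3, 4])
```

Here the first level produces [1, 2, 7], and the second level turns that into the final 2-dimensional S vector.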

Thus, a deep network for image-based (appearance-based) features and a deep network for pose-based features are trained separately. When the training data is then fed again into each network as input, the final output of the image-based deep network, an S vector (the first feature value), and the final output of the pose-based deep network, an S vector (the second feature value), are obtained. If each of these S vectors has 10 dimensions, a final output S vector (the third feature value) with 20 dimensions in total is obtained.
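The combination of the two feature values can be sketched as a simple concatenation, which matches the 10 + 10 = 20 dimension count given above. The function name `combine_features` and the numeric values are placeholders for this example.

```python
def combine_features(appearance_feature, pose_feature):
    """Concatenate the first and second feature values into the third."""
    return list(appearance_feature) + list(pose_feature)


first_feature = [0.1] * 10    # placeholder output of the image-based network
second_feature = [0.2] * 10   # placeholder output of the pose-based network
third_feature = combine_features(first_feature, second_feature)
```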

FIG. 3 is a diagram illustrating a method of applying deep learning to transformed pose information according to an embodiment of the present invention.

More specifically, in the upper part of FIG. 3, the full pose information is divided into five parts (body, left arm, right arm, left leg, and right leg), and each part is input to its own independent network, so that the first-layer networks can be trained on each part's data sampled over a time span t₁. Then, in the lower part of FIG. 3, each trained first-layer network is copied as many times as needed and partial samples are passed through the copies. The input to the second-layer network is sampled over a time span t₂ that is longer than that of the first-layer input data, that is, t₂ > t₁. Data of length t₂ is subsampled to match the first-layer input size t₁ and passed through the copied networks for each part. The subsampling uses equal intervals, and the interval may be set arbitrarily by the user. Once the data for every part has passed through the networks, the outputs are gathered and used as the input for training the second-layer network. Because whole-body data is learned over a longer time than in the first-layer networks, spatio-temporal abstraction can be achieved.
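The equal-interval subsampling of a length-t₂ sequence into windows of the first-layer input size t₁ can be sketched as follows. The function name `subsample_windows`, the `interval` parameter, and the integer sequence are assumptions for illustration; in the described method each window would go through a copy of the trained first-layer network.

```python
def subsample_windows(sequence, t1, interval):
    """Cut windows of length t1 from a longer sequence at equal intervals."""
    if t1 > len(sequence):
        raise ValueError("t1 must not exceed the sequence length t2")
    return [
        sequence[start:start + t1]
        for start in range(0, len(sequence) - t1 + 1, interval)
    ]


t2_sequence = list(range(10))   # stands in for one body part over t2 = 10 frames
windows = subsample_windows(t2_sequence, t1=4, interval=2)
```

With t₂ = 10, t₁ = 4, and an interval of 2, four windows are produced, starting at frames 0, 2, 4, and 6.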

Returning to FIG. 1, the steps after S130 are now described.

In step S140, the action recognition apparatus groups the third feature values into a predetermined number of groups using K-means clustering and converts the groups into a histogram representing action patterns. Here, the histogram indicates the frequency of specific action patterns in the input video, the third feature value is a vector, and the K-means clustering can learn as many centroids as the predetermined number of groups.

More specifically, the K-means clustering is applied to the final output S vector (the third feature value) of the deep networks learned in step S130, whose dimension is assumed here to be M. For example, one S vector is generated per data sample, so ten thousand samples yield ten thousand M-dimensional vectors. Applying K-means clustering groups these ten thousand samples, and the grouping learns a centroid at the center of each group. That is, if the clustering produces five groups, each of the five centroids may be an M-dimensional vector. The number of groups can be set in advance by the user, and the initial vector value of each centroid may generally be set to random values.
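A minimal K-means (Lloyd's algorithm) sketch in plain Python is shown below. It is an illustration, not the patented implementation: the source mentions random initialization, whereas this example takes the initial centroids as an argument so that the result is reproducible, and the 2-dimensional samples are toy data standing in for M-dimensional S vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(vector, centroids[k]))


def k_means(samples, initial_centroids, iterations=10):
    """Alternate between assigning samples and recomputing centroids."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for sample in samples:
            groups[nearest_centroid(sample, centroids)].append(sample)
        for k, group in enumerate(groups):
            if group:  # keep the previous centroid if a group is empty
                centroids[k] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = k_means(samples, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Each learned centroid has the same dimension as the samples, as the description states for the M-dimensional case.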

By analogy with documents, each cluster corresponds to a word, and a document can be represented by a histogram of how often each word appears in it, which varies with the type of document (for example, a political article or a sports article). Likewise, in action recognition, a video containing a person's action can be represented by a histogram of how often specific patterns appear in the video. The clusters can therefore be converted into a histogram using the bag-of-words representation.
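The bag-of-words conversion can be sketched by assigning each feature vector of a video to its nearest centroid and counting the assignments. The function name `build_histogram`, the centroids, and the feature vectors below are illustrative assumptions, not values from the source.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_histogram(feature_vectors, centroids):
    """Count how many feature vectors fall into each cluster."""
    histogram = [0] * len(centroids)
    for vector in feature_vectors:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(vector, centroids[k]))
        histogram[nearest] += 1
    return histogram


centroids = [(0.0, 1.0), (10.0, 11.0)]                 # two "words"
video_features = [(0.0, 0.0), (1.0, 1.0), (9.0, 10.0)]  # one video's vectors
histogram = build_histogram(video_features, centroids)
```

Two vectors fall into the first cluster and one into the second, so this video is represented by the counts [2, 1].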

In step S150, the action recognition apparatus recognizes an action pattern from the histogram. In other words, the action recognition apparatus looks up the action type mapped to the histogram and recognizes the action pattern according to that action type.

More specifically, the histogram obtained in step S140 is used together with information on the action type, that is, which action was being observed when that histogram was obtained. SVM training on this data yields mapping information that indicates which action a given histogram corresponds to, and actions can be recognized from this mapping.

FIG. 4 is another diagram illustrating a method for recognizing a behavior in a video according to an embodiment of the present invention. The method may include an image-based feature learning process, a posture-based feature learning process, a process of combining the image-based and posture-based features, a vector quantization process using K-means clustering, and a χ²-kernel support vector machine (x2-kernel Support Vector Machine) learning and classification process that uses the labels of the input data. FIG. 4 includes components corresponding to each step of FIG. 1 described above, so the detailed description of FIG. 1 applies in place of a separate description of FIG. 4.
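The text names a chi-squared kernel SVM but does not give the kernel formula. The sketch below assumes one common form, the exponential chi-squared kernel k(x, y) = exp(-γ Σ (x_i - y_i)² / (x_i + y_i)) for non-negative histograms; the function name `chi2_kernel` and the parameter `gamma` are illustrative, not taken from the patent.

```python
import math
from typing import Sequence


def chi2_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Exponential chi-squared kernel between two non-negative histograms (assumed form)."""
    distance = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            distance += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * distance)


# Hypothetical bag-of-words histograms for three videos.
h1 = [0.25, 0.5, 0.25]
h2 = [0.25, 0.5, 0.25]
h3 = [1.0, 0.0, 0.0]

same = chi2_kernel(h1, h2)       # identical histograms give the maximum similarity
different = chi2_kernel(h1, h3)  # dissimilar histograms give a smaller value
```

A matrix of such kernel values over all training histograms could be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.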

FIG. 5 is a block diagram illustrating an apparatus for recognizing a behavior in a video as adopted by embodiments of the present invention. The apparatus (50) includes components corresponding to each step of FIG. 1 described above. To avoid duplicate description, its functions are outlined here with a focus on the detailed configuration of the system.

The extracting unit (51) extracts one or more RGB images and joint data from the input video using a Kinect.

The processing unit (52) layers the RGB images and the joint data respectively so as to extract features for behavior recognition, extracts and combines one or more first feature values for the RGB images and one or more second feature values for the joint data according to the respective layering to derive one or more third feature values, groups the third feature values into a predetermined number of groups using K-means clustering, and converts the groups into a histogram representing a behavior pattern.
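The way the first and second feature values are combined is not detailed in this passage. As a rough sketch, assuming simple per-frame concatenation (an assumption of this example, as are the name `combine_features` and the toy values), the third feature value could be formed like this:

```python
from typing import List, Sequence


def combine_features(image_feature: Sequence[float],
                     joint_feature: Sequence[float]) -> List[float]:
    """Form a combined ("third") feature vector by concatenation (an assumption)."""
    return list(image_feature) + list(joint_feature)


first = [0.2, 0.7, 0.1]  # hypothetical image-based feature for one frame
second = [1.5, -0.3]     # hypothetical posture-based feature for the same frame
third = combine_features(first, second)
```

Concatenation keeps both modalities in one vector so that a single K-means codebook can be learned over the combined features.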

The recognition unit (53) recognizes a behavior pattern from the histogram.

In addition, the extracting unit (51) extracts the RGB images and the joint data from the same frame of the input video.

In addition, the processing unit (52) constructs a hierarchy in which the input layer is the RGB images and the joint data, and the output layer is the first feature values and the second feature values.

In addition, in the K-means clustering, the third feature values are vectors, and as many center points (centroids) are learned as the predetermined number of groups.
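Learning one center point per group can be sketched with Lloyd's algorithm, the standard K-means procedure. The names `kmeans` and `squared_distance`, the fixed iteration cap, and the deterministic initialization from the first k vectors are assumptions made so that the example is reproducible; they are not taken from the patent.

```python
from typing import List, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[Vector]:
    """Learn k centroids with Lloyd's algorithm.

    Assumption: centroids start at the first k points so the example is
    deterministic; practical systems usually randomize the initialization.
    """
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dims = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
                )
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids


# Toy feature vectors (hypothetical) forming two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)]
centroids = kmeans(points, k=2)
```

Here `k` plays the role of the predetermined number of groups, and the returned centroids act as the "words" of the bag-of-words histogram.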

In addition, the recognition unit (53) retrieves the behavior type mapped to the histogram and recognizes the behavior pattern according to that behavior type.

FIG. 6 is a diagram comparing results obtained according to an embodiment of the present invention with those of the prior art.

More specifically, FIG. 6 shows confusion matrices obtained as experimental results according to an embodiment of the present invention. The values inside each matrix are percentages, and each row sums to 100. The indices on the left of a matrix denote the labels of the input data, and the indices below it denote the output labels. Accordingly, the higher the values of the diagonal entries, where the input and output labels coincide, the better the hit rate.
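The row-normalized confusion matrix and the mean of its diagonal can be computed as in the following sketch. The counts are hypothetical and are not data from the patent; the function names `row_normalize_percent` and `mean_diagonal` are likewise illustrative.

```python
from typing import List


def row_normalize_percent(counts: List[List[int]]) -> List[List[float]]:
    """Convert raw counts to percentages so that each non-empty row sums to 100."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result


def mean_diagonal(matrix: List[List[float]]) -> float:
    """Average of the diagonal entries, i.e. the mean per-class hit rate."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Hypothetical counts for three behavior classes:
# rows are input (true) labels, columns are output (predicted) labels.
counts = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
]
percent = row_normalize_percent(counts)
accuracy = mean_diagonal(percent)
```

Because each row is normalized separately, the mean of the diagonal weights every behavior class equally, regardless of how many test samples each class has.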

The top panel (61) is the confusion matrix resulting from image-based behavior recognition; the mean of its diagonal entries is 67.5%. The middle panel (62) is the confusion matrix obtained when only posture-based features are used; the mean of its diagonal entries is 56.3%. The bottom panel (63) is the confusion matrix obtained when behavior is recognized using the combined posture-based and image-based features proposed by the present invention; the mean of its diagonal entries is 76.8%. In other words, the proposed method of recognizing behavior with the combined features shows a tendency for the two feature types to complement each other.

In addition, Table 1 below evaluates recognition performance using a benchmark data set called the HudaAct dataset.

In Table 1, RGB-STIP, RGBHOGHOF, RGB-SITP, and DepthDescriptor are state-of-the-art techniques, and their results differ from that of the method proposed by the present invention by only 0.9%. The proposed method therefore shows almost no performance gap relative to world-class recognition techniques, and its advantage becomes apparent when the time and cost required to develop such techniques are taken into account.

Almost all existing behavior recognition techniques have used hand-designed features. Designing features by hand requires a deep understanding of computer vision or of the characteristics of joint data, so it has been nearly impossible without expert knowledge of the data source. Moreover, when the types of behavior to be recognized differ, for example behaviors that mainly involve the hands, behaviors that involve the legs, or behaviors performed at a distance, the features and behavior recognition systems developed so far could not be applied well in every case. In other words, developers of behavior recognition systems must hand-design features that perform well in their own domain, which is very difficult for the reasons stated above.

Accordingly, the learning-based system proposed by the present invention provides a technical means that achieves performance similar to that of systems using expert-designed features. The proposed learning framework can be implemented easily by combining existing open-source software and the like. Whether the behavior to be recognized mainly involves the hands, involves the legs, or is performed at a distance, the relevant features can be learned automatically from the corresponding training data alone, without having to design them by hand.

Furthermore, because complex behaviors can be recognized well with a single Kinect camera, the invention can be readily applied to unmanned surveillance or service robots. Depending on the application, other algorithms can be added, or the invention can be applied to other sensors that extract posture information.

Meanwhile, embodiments of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include implementation in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the present invention belongs.

The present invention has been described above with a focus on its various embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of their equivalents should be construed as being included in the present invention.

50: Apparatus for recognizing a behavior in a video
51: Extracting unit
52: Processing unit
53: Recognition unit

Claims

Extracting, by a behavior recognition apparatus, one or more RGB images and joint data from an input video using a Kinect;
Layering, by the behavior recognition apparatus, the RGB images and the joint data respectively so as to extract features for behavior recognition;
Extracting and combining, by the behavior recognition apparatus, one or more first feature values for the RGB images and one or more second feature values for the joint data according to the respective layering, to derive one or more third feature values;
Grouping, by the behavior recognition apparatus, the third feature values into a predetermined number of groups using K-means clustering, and converting the groups into a histogram representing a behavior pattern; and
Recognizing, by the behavior recognition apparatus, a behavior pattern from the histogram.

The method according to claim 1,
wherein the joint data, which includes the relative 3D positions of body joints, is stored sequentially over time through the Kinect so that the trajectories of the joints over time are derived.

The method according to claim 1,
wherein the RGB images and the joint data are each extracted from the same frame of the input video.

The method according to claim 1,
wherein the layering constructs a hierarchy in which the input layer is the RGB images and the joint data, and the output layer is the first feature values and the second feature values.

The method according to claim 1,
wherein the histogram represents the frequency of specific behavior patterns in the input video.

The method according to claim 1,
wherein, in the K-means clustering, the third feature values are vectors and as many centroids are learned as the predetermined number of groups.

The method according to claim 1,
wherein the recognizing of the behavior pattern comprises
retrieving, by the behavior recognition apparatus, a behavior type mapped to the histogram, and recognizing the behavior pattern according to the behavior type.

An extracting unit for extracting one or more RGB images and joint data from an input video using a Kinect;
A processing unit for layering the RGB images and the joint data respectively so as to extract features for behavior recognition, extracting and combining one or more first feature values for the RGB images and one or more second feature values for the joint data according to the respective layering to derive one or more third feature values, grouping the third feature values into a predetermined number of groups using K-means clustering, and converting the groups into a histogram representing a behavior pattern; and
A recognition unit for recognizing a behavior pattern from the histogram.

The apparatus according to claim 8,
wherein the extracting unit
extracts the RGB images and the joint data from the same frame of the input video.

The apparatus according to claim 8,
wherein the processing unit
constructs a hierarchy in which the input layer is the RGB images and the joint data, and the output layer is the first feature values and the second feature values.

The apparatus according to claim 8,
wherein, in the K-means clustering, the third feature values are vectors and as many centroids are learned as the predetermined number of groups.

The apparatus according to claim 8,
wherein the recognition unit
retrieves a behavior type mapped to the histogram and recognizes the behavior pattern according to the behavior type.