KR102347639B1

KR102347639B1 - Devices for recognizing human behavior through spatial information in video data

Info

Publication number: KR102347639B1
Application number: KR1020200096684A
Authority: KR
Inventors: 이영구; 엠디 아제르 우딘
Original assignee: 경희대학교 산학협력단
Priority date: 2019-11-28
Filing date: 2020-08-03
Publication date: 2022-01-06
Also published as: KR20210066694A

Abstract

An apparatus for recognizing human behavior according to an embodiment of the present invention includes a data storage unit that stores video data containing human behavior, a spatial information generating unit that generates spatial information by extracting spatial features from the video data, and a dynamic information generating unit that generates dynamic information using temporal and spatial information in the video data. The dynamic information generating unit extracts three consecutive frames of 3x3 size (Former Frame, Current Frame, and Next Frame) from the video data, takes the pixel values of each consecutive frame as first pixel values, and generates a WVLGTP (Weber's law Volume Local Gradient Ternary Pattern) by applying a gradient operation and a threshold operation to the first pixel values. Feature vectors for temporal information and for spatial information are generated from the WVLGTP, and the dynamic information consists of these two feature vectors. The apparatus further includes a recognition unit that recognizes human behavior from the spatial information and the dynamic information using an SVM (Support Vector Machine), a supervised learning model for pattern recognition.

Description

DEVICES FOR RECOGNIZING HUMAN BEHAVIOR THROUGH SPATIAL INFORMATION IN VIDEO DATA

The present invention relates to an apparatus that recognizes human behavior by analyzing video data, and more particularly to an apparatus that recognizes human behavior by considering both the background spatial information of the video data and the dynamic information of the subject in the video data.

Human behavior recognition has applications in many fields, such as video surveillance, sports video analysis, and movie retrieval. It refers to identifying what behavior a human is performing in image or video data. The most important capability in human behavior recognition is distinguishing and classifying complex human behaviors in video data. It is also important that the recognition result stay the same regardless of differences in clothing, body shape, or individual traits. Extensive research on human behavior is under way in this regard.

Despite this research, many problems in human behavior recognition remain. For example, recognition accuracy decreases as the video frame sequence becomes longer, and recognition becomes unreliable when the background is complex or the boundary between the subject and the surrounding space is ambiguous. Changes in lighting and similar factors can also lead to inconsistent recognition results.

To solve these problems and improve the accuracy of human behavior recognition, approaches that jointly consider the background spatial information of video frames and dynamic information, obtained by extracting the dynamic features of the subject in the video data, are emerging.

The present invention is intended to solve the problems described above, and an object thereof is to provide an apparatus that recognizes human behavior in video data through spatial information generated by extracting spatial features of the background of the video data.

Another object of the present invention is to provide a human behavior recognition apparatus that extracts temporal and spatial features from video data using Weber's law.

Another object of the present invention is to improve on the accuracy of human behavior recognition achieved with the conventional VLBP method by introducing a new concept, WVLGTP, for analyzing the dynamic information of a subject in video data.

Another object of the present invention is to improve the accuracy of human behavior recognition by generating separate feature vectors for temporal features and spatial information when generating the dynamic information needed to recognize human behavior in video data.

According to an embodiment of the present invention, the average accuracy of human behavior recognition in video data can be improved by clearly distinguishing the motion of the human behavior from the spatial information of the background.

According to an embodiment of the present invention, behavior recognition accuracy is improved because the temporal and spatial features of the subject are separated and a distinct feature vector is generated for each when recognizing human behavior in video data.

According to an embodiment of the present invention, noise in the center pixel value can be handled by the WVLGTP, which is generated on the basis of Weber's law.

FIG. 1 is a block diagram of an apparatus for recognizing human behavior according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining how the spatial information generating unit generates spatial information according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining how the dynamic information generating unit generates dynamic information according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a method of generating a multi-scale WVLGTP according to another embodiment of the present invention.
FIG. 5 is a diagram for explaining the averaging operation used when generating a multi-scale WVLGTP according to an embodiment of the present invention.
FIG. 6 is a diagram of the test data used to evaluate the performance of the human behavior recognition apparatus according to an embodiment of the present invention.
FIG. 7 shows the results of an experiment evaluating human behavior recognition performance on the KTH data set according to an embodiment of the present invention.
FIG. 8 shows the results of an experiment evaluating human behavior recognition performance on the UCF Sports Action data set according to an embodiment of the present invention.
FIG. 9 shows the results of an experiment evaluating human behavior recognition performance on the UT-Interaction data set according to an embodiment of the present invention.
FIG. 10 shows the results of an experiment evaluating human behavior recognition performance on the Hollywood2 data set according to an embodiment of the present invention.
FIG. 11 shows the results of an experiment evaluating human behavior recognition performance on the UCF-101 data set according to an embodiment of the present invention.
FIG. 12 shows the results of an experiment evaluating behavior recognition performance with the multi-scale WVLGTP according to an embodiment of the present invention.

The objects, features, and advantages described above are explained in detail below with reference to the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, a detailed description of known technology is omitted where it would unnecessarily obscure the gist of the invention.

In the drawings, the same reference numerals denote the same or similar elements, and the combinations described in the specification and claims may be combined in any manner. Unless otherwise specified, a reference to the singular may include one or more, and a singular expression may also include the plural.

The terminology used herein is for the purpose of describing particular exemplary embodiments only and is not intended to be limiting. Singular expressions used herein may also be intended to include the plural unless the sentence clearly indicates otherwise. The term "and/or" includes any one of, and all combinations of, the associated listed items. The terms "comprises", "comprising", "includes", "including", "has", and "having" are inclusive: they specify the stated features, integers, steps, operations, elements, and/or components, and do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The steps, processes, and operations of the methods described herein are not to be construed as necessarily being performed in the particular order discussed or illustrated, unless an order of performance is specifically fixed. It should also be understood that additional or alternative steps may be used.

In addition, each component may be implemented as its own hardware processor, the components may be integrated into a single hardware processor, or the components may be combined with one another and implemented as a plurality of hardware processors.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an apparatus for recognizing human behavior according to an embodiment of the present invention.

The human behavior recognition apparatus 100 may include a data storage unit 110 that stores video data containing human behavior; a spatial information generating unit 130 that generates spatial information by extracting spatial features from the video data stored in the data storage unit 110; a dynamic information generating unit 150 that extracts a plurality of consecutive frames from the video data and generates dynamic information through a gradient operation on pixel values; and a recognition unit 170 that recognizes human behavior with an SVM (Support Vector Machine) algorithm, using the spatial information generated by the spatial information generating unit 130 and the dynamic information generated by the dynamic information generating unit 150.

The video data stored in the data storage unit 110 may be video data containing human behavior. The video data may show actions performed by a single person or actions performed by several people. For example, single-person actions may include walking, running, jogging, boxing, clapping, dancing, diving, swinging a golf club, lifting, skateboarding, playing soccer, responding to an electronic device, driving, getting out of a car, catching a baseball, drumming, spinning a hula hoop, and playing with a yo-yo. Actions performed by several people may include shaking hands, hugging, taekwondo, pointing, punching, pushing, kissing, and parading. These are examples only and should not be construed as limiting.

The spatial information generating unit 130, the dynamic information generating unit 150, and the recognition unit 170 are described in detail below.

FIG. 2 is a diagram for explaining how the spatial information generating unit generates spatial information according to an embodiment of the present invention.

The background spatial information of video data is often closely tied to the human actions taking place in it. The more clearly the motion information of the human behavior is distinguished from the spatial information of the background, the higher the recognition accuracy of the human behavior recognition apparatus can be.

According to an embodiment of the present invention, the spatial information generating unit may generate spatial information by extracting spatial features of the background of the video data. The spatial information generating unit may use the video data stored in the data storage unit as input data 131. Each RGB frame of the input data 131 may have a size of 299x299x3, but the frame size is not limited to this value.

The spatial information generating unit may use the Inception-Resnet-v2 convolutional network model to extract deep spatial features from the video data. FIG. 2 shows the deep spatial feature extraction architecture 132 based on the Inception-Resnet-v2 network model.

The spatial feature extraction architecture 132 may include a combination of Reduction layers 135 and Inception Resnet layers 134, and may further include a Stem layer 133 placed before the Inception Resnet layers 134. More specifically, the spatial feature extraction architecture 132 may have a structure in which one Stem layer 133, three Inception Resnet layers 134, and two Reduction layers 135 are arranged in sequence. These layers may be followed by an average pooling layer 136 and a fully connected layer 137 with 1000 channels to complete the spatial feature extraction architecture 132. The Stem layer 133 may include preliminary convolution operations applied to the input before the subsequent layers of the spatial feature extraction architecture 132. Each Inception Resnet layer 134 performs convolution operations and may be connected to a Reduction layer 135, which adjusts the height and width of the grid. The Stem layer 133 and the Inception Resnet layers 134 extract spatial features, and the average pooling layer 136 reduces the dimensionality of the individual spatial features while keeping the extracted features from changing with illumination and transformations. The fully connected layer 137 with 1000 channels may include a Rectified Linear Unit (ReLU), which provides a nonlinear function and reduces training time.

The spatial information generating unit may feed the video data into the spatial feature extraction architecture 132 to extract the spatial features of the background of the input data. The extracted spatial features may be aggregated to generate spatial information 138 for the video data. The spatial information 138 may be used as input data for the recognition unit of the human behavior recognition apparatus. The method by which the dynamic information generating unit generates dynamic information is described below.
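
The aggregation step can be sketched as follows. The description does not say how the per-frame features are combined, so this minimal sketch assumes an element-wise mean over frames. The function name `aggregate_spatial_features` and the toy two-channel vectors are hypothetical stand-ins for the 1000-channel outputs of layer 137.

```python
from typing import List, Sequence


def aggregate_spatial_features(per_frame: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one video-level vector.

    Assumption: aggregation is an element-wise mean; the source text only
    says that the extracted features are aggregated.
    """
    if not per_frame:
        raise ValueError("at least one frame feature vector is required")
    dim = len(per_frame[0])
    if any(len(vec) != dim for vec in per_frame):
        raise ValueError("all feature vectors must have the same length")
    n = len(per_frame)
    return [sum(vec[i] for vec in per_frame) / n for i in range(dim)]


# Toy example: two frames, two feature channels each.
video_feature = aggregate_spatial_features([[1.0, 2.0], [3.0, 4.0]])
print(video_feature)  # [2.0, 3.0]
```

Other aggregations (for example max pooling over frames) would fit the same interface; the mean is used here only because it is the simplest choice.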

FIG. 3 is a diagram for explaining how the dynamic information generating unit generates dynamic information according to an embodiment of the present invention.

The dynamic information generating unit may extract both temporal and spatial features and turn them into dynamic information. It does so through a new concept, the WVLGTP (Weber's law based Volume Local Gradient Ternary Pattern).

Conventionally, the VLBP (Volume Local Binary Pattern) method has been used to process pixel values in a spatiotemporal volume. VLBP is similar to LBP (Local Binary Pattern). More specifically, VLBP compares the gray values of adjacent pixels within the spatiotemporal volume and assigns each a binary value of 1 or 0. That is, taking the center pixel value as the reference, an adjacent pixel is assigned 1 if its value is higher than the center pixel value and 0 if it is lower.
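
The binary comparison at the heart of LBP and VLBP can be sketched for a single frame as follows. This is an illustration of the comparison step only, not a full VLBP over three frames. The neighbour values are the Current Frame values used in the magnitude vector example later in this description; treating ties as 1 and packing the bits most-significant-first are assumptions, since the text does not specify either.

```python
from typing import List, Sequence


def binary_pattern(center: int, neighbours: Sequence[int]) -> List[int]:
    """Return 1 for each neighbour at or above the centre value, else 0.

    Assumption: ties (neighbour == centre) map to 1, as in the usual LBP
    definition; the text above does not cover the equal case.
    """
    return [1 if n >= center else 0 for n in neighbours]


def bits_to_int(bits: Sequence[int]) -> int:
    """Interpret the bit list as a binary number, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


bits = binary_pattern(60, [70, 20, 105, 10, 120, 10, 30, 180])
print(bits, bits_to_int(bits))  # [1, 0, 1, 0, 1, 0, 0, 1] 169
```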

WVLGTP is a further development of the VLBP method and is a new concept introduced in the present invention to generate dynamic information. According to an embodiment of the present invention, dynamic information based on WVLGTP may be generated through several computation steps applied to pixel values. These steps are described in detail with reference to FIG. 3.

In FIG. 3, the dynamic information generating unit may use the video data stored in the data storage unit as input data 151. The dynamic information generating unit may extract three consecutive video frames 152 from the input data 151. The three consecutive frames may be referred to as the Former Frame, Current Frame, and Next Frame, or as the past, present, and future frames.
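
Extracting the three consecutive frames can be sketched as a sliding window over the frame sequence. The window stride of one frame and the function name `frame_triplets` are assumptions made for illustration; the description only states that three consecutive frames are extracted.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def frame_triplets(frames: Sequence[Frame]) -> Iterator[Tuple[Frame, Frame, Frame]]:
    """Yield (former, current, next) for every frame that has both neighbours.

    Assumption: the window advances one frame at a time.
    """
    for i in range(1, len(frames) - 1):
        yield frames[i - 1], frames[i], frames[i + 1]


# Frames are represented by labels here; real frames would be pixel arrays.
for former, current, nxt in frame_triplets(["f0", "f1", "f2", "f3"]):
    print(former, current, nxt)
```

The first and last frames of a clip never appear as the Current Frame in this sketch, because they lack a Former or Next Frame.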

The dynamic information generating unit may obtain first pixel values 153 of 3x3 size for the three consecutive frames. The first pixel values are the gray-level pixel values of the three consecutive frames. The dynamic information generating unit may perform a gradient operation on the first pixel values, and the results of the gradient operation are taken as second pixel values 154.

The gradient operation used to obtain the second pixel values 154 computes the absolute value of the difference between the first pixel value 153 of the center pixel and each adjacent pixel value. The dynamic information generating unit may perform this gradient operation on the first pixel values 153 of the three consecutive frames and generate the second pixel values 154 as the result.

For example, the center pixel value is 100 for the Former Frame, 60 for the Current Frame, and 70 for the Next Frame, so the absolute differences from the adjacent pixel values are calculated with respect to the first pixel values 100, 60, and 70, respectively.
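
The gradient operation can be sketched for one 3x3 patch as follows. The patch layout and the clockwise reading order are hypothetical, chosen so that the neighbours of the Current Frame centre value 60 come out in the order 70, 20, 105, 10, 120, 10, 30, 180 used in the magnitude vector example later in this description; the actual arrangement is defined by FIG. 3.

```python
from typing import List, Sequence

# Clockwise neighbour positions (row, col) in a 3x3 patch, starting from the
# pixel above the centre. Assumption: this ordering is one reading of the text.
CLOCKWISE = [(0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0), (0, 0)]


def gradient_values(patch: Sequence[Sequence[int]]) -> List[int]:
    """Absolute difference between each neighbour and the centre of a 3x3 patch."""
    if len(patch) != 3 or any(len(row) != 3 for row in patch):
        raise ValueError("patch must be 3x3")
    center = patch[1][1]
    return [abs(patch[r][c] - center) for r, c in CLOCKWISE]


# Hypothetical layout of the Current Frame patch (centre value 60).
current = [
    [180, 70, 20],
    [30, 60, 105],
    [10, 120, 10],
]
print(gradient_values(current))  # [10, 40, 45, 50, 60, 50, 30, 120]
```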

The second pixel value 154 of the center pixel of the Current Frame may be calculated in a different way. The center pixel of the Current Frame carries the most noise, so it needs to be adjusted. In addition, because the time difference between the three consecutive frames extracted from the video data is very short, handling the noise of the center pixel of the Current Frame can improve the accuracy of human behavior recognition. The second pixel value 154 of the center pixel of the Current Frame may therefore be replaced with the mean of the second pixel values 154 of its adjacent pixels, as computed by the gradient operation. For example, if the second pixel values obtained by the gradient operation for the adjacent pixels of the Current Frame are 10, 40, 45, 50, 60, 50, 30, and 120, their mean, approximately 50.62, may be used as the second pixel value 154 of the center pixel of the Current Frame. Alternatively, the first pixel value 153 of the center pixel of the Current Frame may be used unchanged as its second pixel value 154.
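
The replacement of the center value by the neighbours' mean can be checked with a few lines of Python. The function name `center_second_value` is hypothetical; the input values are the ones given in the example above.

```python
from statistics import fmean
from typing import Sequence


def center_second_value(neighbour_gradients: Sequence[float]) -> float:
    """Mean of the neighbours' gradient values, used in place of the centre."""
    return fmean(neighbour_gradients)


value = center_second_value([10, 40, 45, 50, 60, 50, 30, 120])
print(value)  # 50.625
```

The exact mean is 405 / 8 = 50.625, which the description gives as 50.62.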

The dynamic information generating unit may additionally use a threshold operation to process the center pixel effectively. The threshold operation may be based on Weber's law, as described in detail below.

Weber's law can be expressed by Equation 1:

$$\frac{\Delta I}{I} = K \qquad (1)$$

Here, $\Delta I$ is the increment threshold, that is, the smallest difference that is noticeable, $I$ is the initial intensity, and $K$ is a constant. In other words, Weber's law states that for two stimuli of the same kind, the minimum difference needed to tell them apart is proportional to the intensity of the initial stimulus. More specifically, given an initial stimulus, the stronger that stimulus is, the larger the difference in intensity must be for the next stimulus to be distinguished from it.
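
Weber's law can be illustrated numerically as follows. This is a minimal sketch in which the Weber fraction K = 0.1 is a hypothetical value chosen only for the example, and treating a change exactly at the threshold as noticeable is also an assumption.

```python
def increment_threshold(intensity: float, k: float) -> float:
    """Smallest noticeable change for a stimulus of the given intensity (Delta I = K * I)."""
    return k * intensity


def is_noticeable(initial: float, new: float, k: float) -> bool:
    """True if the change from `initial` to `new` reaches the Weber threshold."""
    return abs(new - initial) >= increment_threshold(initial, k)


K = 0.1  # hypothetical Weber fraction, chosen only for illustration
print(increment_threshold(100, K))  # 10.0
print(is_noticeable(100, 105, K))   # False
print(is_noticeable(100, 115, K))   # True
```

With the same K, a stimulus of intensity 200 would need a change of 20 to be noticed, which is the proportionality the law describes.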

According to an embodiment of the present invention, the value T_wc may be calculated using Weber's law, as described by Equation 2.

Here, one symbol of Equation 2 (not preserved in this text) denotes the Weber fraction value, and another denotes the second pixel value of the center pixel. C_N is the first pixel value of an adjacent pixel in the Current Frame, and C_C is the first pixel value of the center pixel in the Current Frame.

The dynamic information generating unit may perform a threshold operation based on the second pixel values obtained through the gradient operation and on T_wc, the threshold derived from Weber's law. More specifically, the threshold operation is described by Equation 3.

The pixel values obtained through the threshold operation are referred to as third pixel values 155. Here T, an adaptive local threshold, is the value that must be derived in order to convert the second pixel values into a ternary pattern. P and R denote the number of adjacent pixels and the radius, respectively. The value of T may be calculated as the median of an expression that is not preserved in this text. According to an embodiment of the present invention, T may be calculated as 8. The third pixel values obtained through the threshold operation are used when the dynamic information generating unit generates the feature vector for temporal information and the feature vector for spatial information.
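
Converting values into a ternary pattern can be sketched as follows. Because Equation 3 is not available in this text, the sketch assumes the conventional local ternary pattern rule, with a reference value and a tolerance band around it; the patent's own rule, which involves T_wc, may differ in detail. The function name, the reference value 50.625, and the use of T = 8 as the tolerance are illustrative assumptions, and the output is not meant to reproduce the values shown in FIG. 3.

```python
from typing import List, Sequence


def ternary_code(values: Sequence[float], reference: float, tolerance: float) -> List[int]:
    """Map each value to +1, 0, or -1 relative to a reference band.

    Assumption: the conventional local-ternary-pattern rule is used:
    +1 at or above reference + tolerance, -1 at or below reference - tolerance,
    0 inside the band. The patent's Equation 3 may differ in detail.
    """
    codes = []
    for v in values:
        if v >= reference + tolerance:
            codes.append(1)
        elif v <= reference - tolerance:
            codes.append(-1)
        else:
            codes.append(0)
    return codes


# Illustrative inputs only: gradient values from the Current Frame example,
# a hypothetical reference of 50.625, and a tolerance of 8.
print(ternary_code([10, 40, 45, 50, 60, 50, 30, 120], 50.625, 8))
# [-1, -1, 0, 0, 1, 0, -1, 1]
```

The middle state is what distinguishes a ternary pattern from a binary one: values close to the reference are coded 0 instead of being forced to either side, which makes the code less sensitive to small fluctuations.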

Conventional VLBP takes eight neighboring pixels from each of three consecutive frames, which produces a large feature vector. However, the feature vectors generated by VLBP are highly ambiguous, which was one cause of reduced accuracy in human behavior recognition. According to an embodiment of the present invention, the dynamic information generating unit may therefore generate temporal information and spatial information separately when generating dynamic information.

The feature vector 156 for temporal information is generated by comparing the Former Frame and the Next Frame with the Current Frame, and the feature vector 157 for spatial information is generated on the basis of the Current Frame.

The feature vector 156 for temporal information may be generated by applying the Lower Pattern and Upper Pattern operations to the third pixel values 155 produced by the threshold operation. The Lower Pattern operation converts 1 and 0 to 0 and converts -1 to 1. The Upper Pattern operation converts 1 to 1 and converts 0 and -1 to 0. When the feature vector for temporal information is generated, the eight adjacent pixels are read clockwise, starting from the topmost pixel relative to the center pixel. For example, if the third pixel values of the Former Frame are (1 -1 0 1 1 1 -1 -1)₃, applying the Lower Pattern operation yields (01000011)₂. The feature vector 156 for temporal information therefore has two results, one from the Lower Pattern operation and one from the Upper Pattern operation.

Meanwhile, the feature vector 157 for spatial information may be generated by applying the Lower Pattern and Upper Pattern operations to the Current Frame. For example, if the third pixel values of the Current Frame are (-1 0 0 1 -1 0 0 -1)₃, the Lower Pattern operation yields the feature vector (10001001)₂ for spatial information.
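The two conversion rules can be illustrated with a minimal Python sketch. The function names `lower_pattern` and `upper_pattern` are hypothetical and chosen only for illustration; the inputs are the two ternary codes given in the examples above, and the Upper Pattern output is simply what the stated rule produces for the Former Frame code, since the text does not list it.

```python
def lower_pattern(ternary):
    """Lower Pattern rule from the text: -1 becomes 1, while 0 and 1 become 0."""
    return "".join("1" if t == -1 else "0" for t in ternary)


def upper_pattern(ternary):
    """Upper Pattern rule from the text: 1 stays 1, while 0 and -1 become 0."""
    return "".join("1" if t == 1 else "0" for t in ternary)


# Ternary codes from the examples, read clockwise from the top neighbor.
former_frame_code = [1, -1, 0, 1, 1, 1, -1, -1]
current_frame_code = [-1, 0, 0, 1, -1, 0, 0, -1]

print(lower_pattern(former_frame_code))   # 01000011
print(upper_pattern(former_frame_code))   # 10011100
print(lower_pattern(current_frame_code))  # 10001001
```

Splitting one ternary code into two binary codes keeps each resulting pattern within 8 bits, which is the usual motivation for this kind of lower/upper decomposition.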

According to yet another embodiment of the present invention, an auxiliary concept called a magnitude vector may be used when generating the feature vector for spatial information. According to an embodiment of the present invention, computing the magnitude vector can improve the accuracy of human behavior recognition. Rather than using the magnitude vector directly, two statistics, its mean and its variance, may be used to retain the magnitude information.

The mean and variance of the magnitude vector can be described using Equation 4 below.

The magnitude vector M_C may be formed from the absolute values of the differences between each adjacent pixel value and the central pixel value of the Current Frame.

For example, with a central pixel value of 60 in the Current Frame and adjacent pixel values of 70, 20, 105, 10, 120, 10, 30, and 180, the absolute differences give the magnitude vector M_C = (10, 40, 45, 50, 60, 50, 30, 120).
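As a quick check of this example, the following Python sketch computes the absolute differences directly. The variable names are illustrative only. Note that |10 − 60| evaluates to 50, so the sixth entry of the magnitude vector is 50.

```python
# Central pixel value and its eight adjacent pixel values from the example.
center = 60
neighbors = [70, 20, 105, 10, 120, 10, 30, 180]

# Magnitude vector M_C: absolute difference between each neighbor and the center.
magnitude_vector = [abs(p - center) for p in neighbors]

print(magnitude_vector)  # [10, 40, 45, 50, 60, 50, 30, 120]
```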

Once the magnitude vector M_C is obtained, its mean and variance can be calculated. For example, in an embodiment of the present invention, the mean of the magnitude vector is calculated as 50.62 and its variance as 890.23. These two values are the mean and the variance of the magnitude vector of the Current Frame sub-frame. In addition, a threshold value is used to convert the magnitude vector into a ternary number. This value is described as the average of the absolute differences between the central pixel value and the adjacent pixel values in the Current Frame, and in the present invention it is 45.
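The stated mean and variance can be reproduced with Python's standard `statistics` module. This is a sketch under one assumption: the variance is the population variance (division by the number of elements, 8), which is the convention that matches the figures 50.62 and 890.23 in the text.

```python
import statistics

# Magnitude vector for the example: |neighbor - 60| for each adjacent pixel.
center = 60
neighbors = [70, 20, 105, 10, 120, 10, 30, 180]
magnitude_vector = [abs(p - center) for p in neighbors]

mean_m = statistics.fmean(magnitude_vector)     # arithmetic mean
var_m = statistics.pvariance(magnitude_vector)  # population variance (assumed)

print(mean_m)  # 50.625
print(var_m)   # 890.234375
```

The sample variance (division by 7) would give a different figure, so the population form appears to be the one intended.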

When generating the feature vector for spatial information, a threshold operation is also required for the mean and variance of the magnitude vector. The threshold operation takes the difference between the mean of the magnitude vector and a reference value, compares that difference with the threshold value, and assigns a value of 1, -1, or 0. Likewise, the difference between the variance of the magnitude vector and its reference value is compared with the threshold value, and a value of 1, -1, or 0 is assigned. For example, in the present invention, the mean of the magnitude vector is 50.62 and the reference value is 62.1, so the difference is -11.48; since this lies between -45 and 45, a value of 0 is assigned. The variance of the magnitude vector is 890.23 and the reference value is 815.6, so the difference is 74.63; since this is greater than 45, a value of 1 is assigned. Therefore, the result of the threshold operation on the magnitude vector is (0 1)₃.
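The threshold operation on the mean and variance can be sketched as follows. The function name `ternary_threshold` and its parameter names are hypothetical. How a difference exactly equal to ±45 is handled is not stated in the text, so the sketch assumes such a difference maps to 0.

```python
def ternary_threshold(value, reference, threshold):
    """Return 1, -1, or 0 by comparing (value - reference) with +/- threshold.

    Assumption: a difference exactly equal to +/- threshold maps to 0.
    """
    diff = value - reference
    if diff > threshold:
        return 1
    if diff < -threshold:
        return -1
    return 0


# Figures from the example: mean 50.62 vs 62.1, variance 890.23 vs 815.6, threshold 45.
mean_code = ternary_threshold(50.62, 62.1, 45)       # difference -11.48
variance_code = ternary_threshold(890.23, 815.6, 45)  # difference 74.63

print((mean_code, variance_code))  # (0, 1)
```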

The dynamic information generator may apply the Lower Pattern and Upper Pattern operations to the result of the threshold operation on the magnitude vector when generating the feature vector for spatial information. For example, the Lower Pattern operation on the spatial code yields (10001001)₂, and appending the Lower Pattern of the mean and variance code gives (10001001 00)₂. In this way, the dynamic information generator may generate the final feature vector 158 for spatial information, taking the mean and variance of the magnitude vector into account.
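The concatenation step can be sketched in Python as below. The helper names are hypothetical, and the Upper Pattern result is not given in the text; it is shown only as what the stated conversion rule would produce for the same inputs.

```python
def lower_pattern(ternary):
    """-1 becomes 1; 0 and 1 become 0."""
    return "".join("1" if t == -1 else "0" for t in ternary)


def upper_pattern(ternary):
    """1 stays 1; 0 and -1 become 0."""
    return "".join("1" if t == 1 else "0" for t in ternary)


spatial_code = [-1, 0, 0, 1, -1, 0, 0, -1]  # third pixel values of the Current Frame
magnitude_code = [0, 1]                      # thresholded mean and variance

# Final spatial descriptor: 8 pattern bits followed by 2 magnitude bits.
final_lower = lower_pattern(spatial_code) + lower_pattern(magnitude_code)
final_upper = upper_pattern(spatial_code) + upper_pattern(magnitude_code)

print(final_lower)  # 1000100100
print(final_upper)  # 0001000001
```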

For the three consecutive frames extracted from the video data, the dynamic information generator may generate a plurality of feature vectors for temporal information and feature vectors for spatial information, and may generate the dynamic information by combining all of the generated vectors. A method of generating a multi-scale WVLGTP according to yet another embodiment of the present invention is described below.

FIG. 4 is a diagram for explaining a method of generating a multi-scale WVLGTP according to yet another embodiment of the present invention. A WVLGTP may be generated only for adjacent pixels at a distance of 1 from the central pixel, that is, for a 3x3 block of video frame pixels. According to another embodiment of the present invention, a multi-scale WVLGTP may be generated for adjacent pixels at a distance of 2 from the central pixel, that is, for a 5x5 block of video frame pixels. When generating a multi-scale WVLGTP, the pixel block size of the video frame is not to be construed as limited to 5x5 and may be larger. Once the multi-scale WVLGTP is generated, feature vectors for spatial information may be generated in the same manner as described for FIG. 3, and the generated feature vectors may be aggregated to generate the dynamic information. A detailed description of generating the dynamic information duplicates that of FIG. 3 and is therefore omitted.

According to yet another embodiment of the present invention, the dynamic information may also be generated by aggregating the feature vector for spatial information obtained through the multi-scale WVLGTP with the feature vector for spatial information obtained through the WVLGTP. More specifically, two or more WVLGTPs may be generated using blocks of different sizes, and a feature vector for spatial information may be generated from each WVLGTP. All of the resulting feature vectors for spatial information may then be aggregated to generate the dynamic information.

In describing the generation of the multi-scale WVLGTP with reference to FIG. 4, steps that duplicate those of FIG. 3 are omitted. Referring to FIG. 4, the dynamic information generator may use the video data stored in the data storage unit as input data 161. The dynamic information generator may extract three consecutive video frames 162 from the input data 161.

For the three consecutive frames, second pixel values and third pixel values may be generated for the 5x5 block of adjacent pixels at a distance of 2 from the central pixel, through the gradient operation, the threshold operation, and so on, in the same manner as described for FIG. 3. Likewise, a feature vector for temporal information and a feature vector 163 for spatial information may be generated in the same manner as described for FIG. 3.

However, when generating the multi-scale WVLGTP for adjacent pixels at a distance of 2 from the central pixel, an additional step is required.

This step, an averaging operation, is performed before the gradient operation. The averaging operation prevents the result of human behavior recognition from becoming ambiguous due to changes in the illumination of the video frame. The averaging operation is described in detail with reference to FIG. 5.

FIG. 5 is a diagram for explaining the averaging operation used when generating a multi-scale WVLGTP according to an embodiment of the present invention.

When generating the multi-scale WVLGTP, an averaging operation may additionally be performed on the video frame 164. All pixels out to a distance of 3 from the central pixel are collected to form the pixel values 165.

For pixels in the pixel values 165 at a distance of 1 from the central pixel, the pixel values are used as they are. As a result, the pixel values 166 may be generated for the distance-1 case by applying the values unchanged.

When the distance to the adjacent pixel is 2, the pixel values 167 are generated by taking, for each adjacent pixel, the average of three pixels located beside, above, or below it. For example, the value 149 at the upper right of the pixel values 167 is the average of 152, 140, and 156.

When the distance to the adjacent pixel is 3, the pixel values 168 are generated by taking, for each adjacent pixel, the average of five pixel values. For example, the value 140 at the upper right of the pixel values 168 is the average of 148, 129, 128, 137, and 156.

This averaging operation prevents ambiguity in the result caused by changes in the illumination of the video frame.
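The two numerical examples of the averaging operation can be checked with a short Python sketch. The function name `averaged_value` is hypothetical, and rounding to the nearest integer is an assumption; it is consistent with both examples (149.33 gives 149 and 139.6 gives 140), whereas truncation would give 139 for the second.

```python
def averaged_value(pixels):
    """Average a group of pixel values and round to the nearest integer (assumed)."""
    return round(sum(pixels) / len(pixels))


# Distance 2: three pixels are averaged per adjacent pixel.
distance2_value = averaged_value([152, 140, 156])

# Distance 3: five pixels are averaged per adjacent pixel.
distance3_value = averaged_value([148, 129, 128, 137, 156])

print(distance2_value)  # 149
print(distance3_value)  # 140
```

Which three or five pixels are grouped for each position is determined by FIG. 5, so the sketch takes the groups as given inputs rather than deriving them from coordinates.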

Returning to FIG. 1, the recognition unit of the human behavior recognition device is described.

In FIG. 1, the recognition unit 170 may predict human behavior by applying an SVM (Support Vector Machine) algorithm to the spatial information generated by the spatial information generating unit 130 and the dynamic information generated by the dynamic information generator 150. The SVM algorithm is a type of supervised learning model that takes feature vectors as input data, learns the features of the input data, and classifies the data according to those features. The actions represented by the feature vectors may be classified using a nonlinear SVM with an RBF kernel function. In addition, C and γ may be used as parameters, selected through 5-fold cross-validation. The design and results of experiments verifying the performance of the human behavior recognition device of the present invention are described below with reference to FIGS. 6 to 12.
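The text does not spell out the RBF kernel, so the following Python sketch uses the standard textbook definition, K(x, y) = exp(−γ‖x − y‖²), as an assumption. The function name `rbf_kernel` and the toy vectors are illustrative only. The kernel width parameter, commonly written γ, is one of the two parameters tuned by cross-validation; the other, C, is the SVM regularization strength and does not appear in the kernel itself.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors have similarity 1; similarity decays with distance.
same = rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.5)
different = rbf_kernel([1.0, 0.0], [0.0, 1.0], gamma=0.5)  # exp(-0.5 * 2) = exp(-1)

print(same)       # 1.0
print(different)  # approximately 0.3679
```

A larger γ makes the similarity fall off faster, which yields a more flexible decision boundary, so γ and C are usually tuned together.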

FIG. 6 is a diagram showing the data sets used in experiments to verify the performance of the human behavior recognition device according to an embodiment of the present invention.

To verify the performance of the human behavior recognition device of the present invention, the KTH data set, the UCF sports action data set, the UT-Interaction data set, the Hollywood2 data set, and the UCF-101 data set may be used.

The UCF-101 data set consists of a large number of videos, and the UT-Interaction data set consists of videos showing interactions between two people. The Hollywood2 data set consists of data showing complex activities rather than simple actions such as those in the KTH data set. In addition, each video in the Hollywood2 data set contains noticeable camera motion and rapid scene changes, which makes it the most difficult of these data sets for human behavior recognition.

FIG. 6(a) shows sample data from the KTH data set. The KTH data set contains 100 sequences of actions such as walking, running, jogging, boxing, and hand clapping, performed by 25 people.

FIG. 6(b) shows sample data from the UCF sports action data set. The UCF sports action data set may include video data having a resolution of 150. The UCF sports action data set may include walking, running, kicking, lifting, diving, golf swing, horseback riding, skateboarding, swing-side, and swing-bench actions. The UCF sports action data set may consist of video data captured from various viewpoints and video data containing a large amount of camera movement.

FIG. 6(c) shows sample data from the UT-Interaction data set. The UT-Interaction data set consists of 120 videos with a resolution of 720 × 480. The UT-Interaction data set may include video data for six actions: shaking hands, hugging, kicking, pushing, pointing, and punching. In addition, the videos in the UT-Interaction data set show people performing the six actions under 15 different clothing conditions.

FIG. 6(d) shows sample data from the Hollywood2 data set. The Hollywood2 data set consists of 3669 videos covering 12 human actions.

The Hollywood2 data set may include action data such as answering a phone, driving a car, eating, fighting, getting out of a car, shaking hands, hugging, kissing, running, sitting down, sitting up, and standing up. The Hollywood2 data set contains video data with fast camera movement and rapid scene changes.

FIG. 6(e) shows sample data from the UCF-101 data set. The UCF-101 data set, consisting of 13320 videos, is the largest of these action data sets. The UCF-101 data set contains 101 action classes, and its videos may be taken from YouTube. The UCF-101 data set may be divided into 25 groups, each covering 4 to 7 action sequences.

In the experimental design for verifying the performance of the human behavior recognition device of the present invention, the SVM algorithm, which is the supervised learning model of the recognition unit, was trained on 70% of the video data for each of the five data sets. The behavior recognition performance of the device was then evaluated on the remaining 30% of the video data. The performance of the device was also verified through 5-fold cross-validation.
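The 70/30 split and the 5-fold partition can be sketched with the Python standard library. Everything here is illustrative: the helper names, the fixed random seed, the placeholder video identifiers, and the choice to run the folds over the training portion are assumptions, since the text does not specify how the two procedures were combined.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle the items and split them into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def k_folds(items, k=5):
    """Yield (train, validation) pairs; fold i validates on every k-th item from i."""
    items = list(items)
    for i in range(k):
        validation = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, validation


videos = [f"video_{i:03d}" for i in range(100)]  # placeholder identifiers
train_set, test_set = train_test_split(videos)
fold_sizes = [len(validation) for _, validation in k_folds(train_set)]

print(len(train_set), len(test_set))  # 70 30
print(fold_sizes)                     # [14, 14, 14, 14, 14]
```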

To verify the performance of the human behavior recognition device of the present invention, the number of frames per video may be fixed during the experiments by sampling at equal time intervals. The number of sampled video frames may be set to 35. The experimental results on the performance of the human behavior recognition device of the present invention are described below.
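One way to pick 35 frames at equal intervals is sketched below. The function name `sample_frame_indices` and the choice of taking the first frame of each interval are assumptions; the text only states that a fixed number of frames is sampled at equal time intervals.

```python
def sample_frame_indices(total_frames, num_samples=35):
    """Return num_samples frame indices spaced evenly across the video."""
    if total_frames < num_samples:
        raise ValueError("video has fewer frames than the requested sample count")
    step = total_frames / num_samples
    # Take the first frame of each of the num_samples equal-length intervals.
    return [int(i * step) for i in range(num_samples)]


indices = sample_frame_indices(350)

print(len(indices))             # 35
print(indices[:3], indices[-1])  # [0, 10, 20] 340
```

Fixing the frame count gives every video a descriptor of the same length, which is what a fixed-dimension classifier such as an SVM requires.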

FIG. 7 is a diagram showing the results of an experiment verifying human behavior recognition performance on the KTH data set according to an embodiment of the present invention. Referring to FIG. 7, the human behavior recognition device of the present invention shows its highest accuracy in the hand clapping category of the KTH data set. FIG. 7 shows the average behavior recognition accuracy on the KTH data set for three models: a model that analyzes only spatial information using the Inception-Resnet-v2 convolutional network model, a model that analyzes dynamic information using WVLGTP, and a model that analyzes both the spatial information and the dynamic information using WVLGTP. The average recognition accuracies of the three models were 94.9%, 94.4%, and 96.5%, respectively.

FIG. 7 also shows a comparison between the model of the present invention and other human behavior recognition models on the KTH data set. The model that analyzes both spatial information and dynamic information using WVLGTP is more accurate than the existing VLBP approach. The model of the present invention, which analyzes spatial information using the Inception-Resnet-v2 convolutional network model together with dynamic information using WVLGTP, has better behavior recognition accuracy than VLBP, LBP-TOP, MBP, ALMD, Extended LBP-TOP, LTP-TOP, and 3D Gradient LBP. The recognition accuracy of the human behavior recognition device of the present invention is 96.5%, whereas iDT, VLBP, and ALMD combined with the Inception-Resnet-v2 convolutional network model achieve 95.7%, 90.1%, and 93.7%, respectively, confirming the relatively superior performance of the present invention.

FIG. 8 is a diagram showing the results of an experiment verifying human behavior recognition performance on the UCF sports action data set according to an embodiment of the present invention. FIG. 8 shows the average behavior recognition accuracy on the UCF sports action data set for three models: a model that analyzes only spatial information using the Inception-Resnet-v2 convolutional network model, a model that analyzes dynamic information using WVLGTP, and a model that analyzes both the spatial information and the dynamic information using WVLGTP. The average recognition accuracies of the three models are 92.9%, 93.3%, and 94.6%, respectively. On the UCF sports action data set, the human behavior recognition device of the present invention shows high recognition accuracy in the diving, horseback riding, and swing-bench categories.

FIG. 8 also shows a comparison between the model of the present invention and other human behavior recognition models on the UCF sports action data set. The model of the present invention, which analyzes both spatial information and dynamic information using WVLGTP, has better behavior recognition accuracy than the SFT model and the MAE model. On the UCF sports action data set, the model of the present invention has a recognition accuracy of 94.6%.

FIG. 9 is a diagram showing the results of an experiment verifying human behavior recognition performance on the UT-Interaction data set according to an embodiment of the present invention. FIG. 9 shows the average behavior recognition accuracy on the UT-Interaction data set for three models: a model that analyzes only spatial information using the Inception-Resnet-v2 convolutional network model, a model that analyzes dynamic information using WVLGTP, and a model that analyzes both the spatial information and the dynamic information using WVLGTP. The average recognition accuracies of the three models are 96.7%, 97.7%, and 98.7%, respectively.

FIG. 9 also shows a comparison between the model of the present invention and other human behavior recognition models on the UT-Interaction data set. The model of the present invention, which analyzes both spatial information and dynamic information using WVLGTP, has better behavior recognition accuracy than VLBP, MBP, ALMD, and 3D Gradient LBP. On the UT-Interaction data set, the model of the present invention has a recognition accuracy of 98.67%.

FIG. 10 is a diagram showing the results of an experiment verifying human behavior recognition performance on the Hollywood2 data set according to an embodiment of the present invention. FIG. 10 shows the average behavior recognition accuracy on the Hollywood2 data set for three models: a model that analyzes only spatial information using the Inception-Resnet-v2 convolutional network model, a model that analyzes dynamic information using WVLGTP, and a model that analyzes both the spatial information and the dynamic information using WVLGTP. The average recognition accuracies of the three models are 67.04%, 66.74%, and 70.3%, respectively. On the Hollywood2 data set, the human behavior recognition device of the present invention shows high recognition accuracy in the running, kissing, and getting-up categories.

FIG. 10 also shows a comparison between the model of the present invention and other human behavior recognition models on the Hollywood2 data set. On the Hollywood2 data set, the highest accuracy was obtained by the iDT model combined with the Inception-Resnet-v2 convolutional network model. On the Hollywood2 data set, the model of the present invention, which analyzes both spatial information and dynamic information using WVLGTP, has a recognition accuracy of 70.3%.

FIG. 11 is a diagram showing the results of an experiment verifying human behavior recognition performance on the UCF-101 data set according to an embodiment of the present invention. FIG. 11 shows a comparison between the model of the present invention and other human behavior recognition models on the UCF-101 data set. The model that analyzes both spatial information and dynamic information using WVLGTP is more accurate than VLBP, MBP, ALMD, and 3D Gradient LBP. It is also 2.2% more accurate than the iDT model combined with the Inception-Resnet-v2 convolutional network model. The model of the present invention has a recognition accuracy of 94.9% on the UCF-101 data set.

FIG. 12 is a diagram showing the results of an experiment verifying the behavior recognition performance of the multi-scale WVLGTP according to an embodiment of the present invention. Referring to FIG. 12, for all data sets, the best recognition accuracy is obtained when the multi-scale WVLGTP is generated using both the 8 adjacent pixels at a distance of 1 from the central pixel and the 16 adjacent pixels at a distance of 2 from the central pixel. In other words, the model that generates dynamic information using both the WVLGTP for 3x3 pixel blocks and the multi-scale WVLGTP for 5x5 pixel blocks of the video frame has the highest average recognition accuracy.
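The neighbor counts of 8 and 16 follow from taking the outer ring of a 3x3 and a 5x5 block, which can be checked with a short Python sketch. The function name `ring_offsets` is hypothetical, and interpreting "distance" as the Chebyshev (square-ring) distance is an assumption that is consistent with those counts.

```python
def ring_offsets(radius):
    """Offsets (dy, dx) of pixels at Chebyshev distance `radius` from the center."""
    span = range(-radius, radius + 1)
    return [(dy, dx) for dy in span for dx in span
            if max(abs(dy), abs(dx)) == radius]


ring1 = ring_offsets(1)  # outer ring of a 3x3 block
ring2 = ring_offsets(2)  # outer ring of a 5x5 block

print(len(ring1), len(ring2))  # 8 16
```

Under this interpretation a ring at distance r contains 8r pixels, so blocks larger than 5x5 contribute progressively longer patterns.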

Taken together, the experimental results above confirm that human behavior recognition accuracy improves in the model that uses spatial information from the Inception-Resnet-v2 convolutional network model together with dynamic information from the WVLGTP model. Furthermore, the model that generates dynamic information using both the WVLGTP for 3x3 pixel blocks and the multi-scale WVLGTP for 5x5 pixel blocks is more accurate than the model that analyzes dynamic information using only the WVLGTP for 3x3 pixel blocks.

The embodiments of the present invention disclosed in this specification and the drawings are merely specific examples presented to explain the technical content of the present invention and to aid understanding of it, and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that other modifications based on the technical idea of the present invention can be implemented in addition to the embodiments disclosed herein.

Claims

A human behavior recognition device comprising:
a data storage unit for storing video data including human behavior;
a spatial information generating unit for generating spatial information by extracting spatial features from the video data;
a dynamic information generator for generating dynamic information by utilizing temporal information and spatial information from the video data,
wherein the dynamic information generator extracts three consecutive frames of 3x3 size, namely a Former Frame, a Current Frame, and a Next Frame, from the video data, sets the pixel values of each consecutive frame as first pixel values, generates a WVLGTP (Weber's law Volume Local Gradient Ternary Pattern) through a gradient operation and a threshold operation on the first pixel values, and generates feature vectors for the temporal information and the spatial information through the generated WVLGTP, and
wherein the dynamic information is the feature vector for the temporal information and the feature vector for the spatial information; and
a recognition unit for recognizing human behavior from the spatial information and the dynamic information through an SVM (Support Vector Machine) algorithm, which is a supervised learning model for pattern recognition.

The device of claim 1, wherein the gradient operation sets, as the second pixel value of each adjacent pixel, the absolute value of the difference between that adjacent pixel's first pixel value and the central pixel value of the first pixel values, and uses the central pixel value of the first pixel values unchanged as the second pixel value of the central pixel.
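As an illustration of the gradient operation recited above, the following minimal Python sketch applies it to a single 3x3 patch. The function name `gradient_operation` and the sample pixel values are hypothetical and are not taken from the specification.

```python
def gradient_operation(patch):
    """Map a 3x3 patch of first pixel values to second pixel values.

    Each adjacent pixel becomes |adjacent - center|; the center pixel keeps
    its first pixel value unchanged, as the claim describes.
    """
    center = patch[1][1]
    second = [[abs(value - center) for value in row] for row in patch]
    second[1][1] = center  # central pixel value is carried over as is
    return second


# Hypothetical first pixel values for one frame
first_pixel_values = [
    [52, 60, 58],
    [49, 55, 61],
    [50, 57, 70],
]
second_pixel_values = gradient_operation(first_pixel_values)
print(second_pixel_values)  # [[3, 5, 3], [6, 55, 6], [5, 2, 15]]
```

The same function would be applied independently to the Former Frame, Current Frame, and Next Frame patches.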

The device of claim 2, wherein the second pixel value of the central pixel is the average of the second pixel values of the adjacent pixels.

The device of claim 2, wherein the threshold operation calculates, for the second pixel values, the absolute value of the difference from a threshold value based on Weber's law, compares the absolute value with an arbitrary value, and assigns one of the values 1, -1, and 0 as the third pixel value.
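The claim does not give the exact formula for the threshold operation, so the following Python sketch is only one plausible reading. It assumes a conventional local-ternary-pattern comparison in which the tolerance is proportional to a reference value, in the spirit of Weber's law (the just-noticeable difference grows in proportion to the stimulus). The function name `ternary_threshold`, the constant `weber_k`, and the choice of the mean neighbor gradient as the reference are all assumptions.

```python
def ternary_threshold(neighbor_gradients, reference, weber_k=0.25):
    """Assign 1, -1, or 0 to each second pixel value (assumed reading).

    Assumption: tolerance = weber_k * reference (Weber-style proportional
    threshold). Values above reference + tolerance map to 1, values below
    reference - tolerance map to -1, and everything in between maps to 0.
    """
    tolerance = weber_k * reference
    codes = []
    for value in neighbor_gradients:
        if value > reference + tolerance:
            codes.append(1)
        elif value < reference - tolerance:
            codes.append(-1)
        else:
            codes.append(0)
    return codes


# Hypothetical second pixel values of the 8 adjacent pixels
gradients = [3, 5, 3, 6, 6, 5, 2, 15]
reference = sum(gradients) / len(gradients)  # assumed reference: mean gradient
third_pixel_values = ternary_threshold(gradients, reference)
print(third_pixel_values)  # [-1, 0, -1, 0, 0, 0, -1, 1]
```

A proportional tolerance makes the coding less sensitive to small fluctuations in regions where the gradients are already large, which is the usual motivation for a Weber-style threshold.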

The device of claim 1, wherein, for the three consecutive frames, the feature vector for the spatial information is generated based on the third pixel values of the Current Frame, and the feature vector for the temporal information is generated based on the Former Frame and the Next Frame.

The device of claim 5, wherein the feature vector for the spatial information is generated by generating a magnitude vector of the Current Frame and utilizing the mean and variance of the generated magnitude vector.

The device of claim 1, wherein the dynamic information is a feature vector for the temporal information and the spatial information generated through multi-scale WVLGTP, and the multi-scale WVLGTP is generated by extracting 5x5 pixel values from the consecutive frames and applying an averaging operation, the gradient operation, and the threshold operation.

The device of claim 7, wherein the averaging operation calculates, for an arbitrary pixel, the average of a plurality of pixel values adjacent to that pixel and uses the average as the pixel value of that pixel.
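The averaging operation can be sketched as follows. This is a minimal Python illustration under stated assumptions: "adjacent" is taken to mean the up to 8 surrounding pixels that lie inside the patch, and the pixel itself is excluded from the average. Neither detail is specified in the claim, and the function name `average_neighbors` is hypothetical.

```python
def average_neighbors(patch):
    """Replace every pixel with the mean of its adjacent pixels.

    Assumptions: neighbors are the surrounding pixels within the patch
    bounds (fewer than 8 at the borders), and the pixel itself is excluded.
    """
    rows, cols = len(patch), len(patch[0])
    averaged = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                patch[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            averaged[r][c] = sum(neighbors) / len(neighbors)
    return averaged


# Hypothetical 5x5 patch: a single bright pixel at the center
patch = [[0] * 5 for _ in range(5)]
patch[2][2] = 8
smoothed = average_neighbors(patch)
print(smoothed[1][1], smoothed[2][2])  # 1.0 0.0
```

In a multi-scale pipeline, the smoothed 5x5 values would then pass through the gradient and threshold operations.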

The device of claim 7, wherein the dynamic information combines the feature vector for the temporal information and the spatial information generated through the multi-scale WVLGTP and the feature vector for the temporal information and the spatial information generated through the WVLGTP into a single feature vector for the temporal information and the spatial information.

The device of claim 6, wherein the magnitude vector consists of the absolute values of the differences between each adjacent pixel value and the central pixel value among the first pixel values of the Current Frame.
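To make the magnitude vector and its use in the spatial feature concrete, the following Python sketch computes the magnitude vector of one 3x3 Current Frame patch and then its mean and variance. The function names and sample values are hypothetical, and the use of the population variance (dividing by the number of elements) is an assumption, since the claims do not say which variance is meant.

```python
def magnitude_vector(patch):
    """Return |adjacent - center| for the 8 adjacent pixels of a 3x3 patch."""
    center = patch[1][1]
    return [
        abs(patch[r][c] - center)
        for r in range(3)
        for c in range(3)
        if (r, c) != (1, 1)
    ]


def mean_and_variance(values):
    """Mean and population variance (assumed variant) of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


# Hypothetical first pixel values of the Current Frame
current_frame = [
    [52, 60, 58],
    [49, 55, 61],
    [50, 57, 70],
]
magnitudes = magnitude_vector(current_frame)
mean, variance = mean_and_variance(magnitudes)
print(magnitudes)      # [3, 5, 3, 6, 6, 5, 2, 15]
print(mean, variance)  # 5.625 14.484375
```

How the mean and variance enter the final spatial feature vector is not detailed in the claims, so the sketch stops at computing them.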