KR20230017126A

KR20230017126A - Action recognition system based on deep learning and the method thereof

Info

Publication number: KR20230017126A
Application number: KR1020220069397A
Authority: KR
Inventors: 노승민; 빌랄 무하마드; 맥수드 무아잠; 야스민 사다프; 하산 나잠
Original assignee: 중앙대학교 산학협력단
Priority date: 2021-07-26
Filing date: 2022-06-08
Publication date: 2023-02-03

Abstract

A deep learning-based action recognition system according to an embodiment of the present invention comprises: a data preprocessing unit that groups overlapping action classes in a video image data set, extracts frames from each video image, resizes the frames to the frame size required by a deep learning pipeline (DLPL) CNN learning model, and pre-trains the DLPL CNN learning model; a transfer learning unit that trains and fine-tunes the DLPL CNN learning model pre-trained in the data preprocessing unit by applying an additional data set; a deep feature extraction unit that extracts high-dimensional deep features by learning frame-level spatial information from a visual data stream using the pre-trained DLPL CNN learning model fine-tuned in the transfer learning unit; an encoder unit that compresses the high-dimensional deep features extracted by the deep feature extraction unit into a low-dimensional feature map; and an adjustment module unit that learns temporal information from the feature map compressed by the encoder unit and iteratively fine-tunes the previously trained model by applying the changed part of new video image data to it. Accordingly, action recognition errors can be reduced through action analysis that accurately differentiates similar human actions and postures.

Description

Deep learning-based action recognition system and method {ACTION RECOGNITION SYSTEM BASED ON DEEP LEARNING AND THE METHOD THEREOF}

The present invention relates to a deep learning-based action recognition system and method, and in particular to a deep learning-based action recognition system and method that can accurately differentiate similar or partially overlapping actions in long video data by extracting and learning deep features through a spatio-temporal framework.

In general, a human action recognition task identifies the various activities that humans perform in a video data stream. Human action recognition mainly addresses questions such as 'which action is being detected and recognized' and 'where in the video data stream is the action performed'. A video data stream carries both spatial and temporal information about an action, and a great deal of hidden information that helps recognition, including the spatio-temporal information and the relationships within it, can be extracted from the stream.

Human action recognition can be divided into several categories, such as human-to-human actions, human-object interactions, human body movements, sub-activities or atomic actions, gestures, group activities, events, and behaviors. Atomic actions such as walking and running are less complex and easier to recognize. Activities such as cooking, however, combine many sub-activities and are therefore complex and difficult to recognize.

Most video data today contains many hidden patterns and much hidden information that can be used for human action recognition (HAR). HAR can be applied in many areas, including behavior analysis, fraud detection, detection of suspicious or abnormal behavior, monitoring of elderly people, intelligent video surveillance, human-computer interaction systems, and robot vision.

However, occlusion, viewpoint changes, and lighting make the HAR task more difficult. In addition, some action classes contain similar or partially overlapping actions, which is the largest contributor to misclassification and remains an unresolved problem.

To solve the problems described above, the present invention provides a deep learning-based action recognition system and method that learn the overlapping action patterns in long video data through a framework based on a CNN algorithm and perform fine-tuning using a pre-trained deep learning pipeline (DLPL) model.

A deep learning-based action recognition system according to an embodiment of the present invention comprises: a data preprocessing unit that groups overlapping action classes in a data set of video images, extracts frames from each video image, resizes the frames to the frame size required by a deep learning pipeline (DLPL) CNN learning model, and pre-trains the model; a transfer learning unit that trains and fine-tunes the DLPL CNN learning model pre-trained in the data preprocessing unit by applying an additional data set; a deep feature extraction unit that extracts high-dimensional deep features by learning frame-level spatial information from a visual data stream using the pre-trained DLPL CNN learning model fine-tuned in the transfer learning unit; an encoder unit that compresses the high-dimensional deep features extracted by the deep feature extraction unit into a low-dimensional feature map; and an adjustment module unit that learns temporal information from the feature map compressed by the encoder unit and iteratively fine-tunes the previously trained model by applying the changed part of new video image data to it.

Here, in particular, the deep learning pipeline (DLPL) CNN learning model includes DenseNet201, InceptionV3, ResNet101V2, ResNet152V2, VGG16, VGG19, and Xception, and any one of these pre-trained CNN learning models is used.

Here, in particular, the transfer learning unit removes the last classification layer of the pre-trained DLPL CNN model, adds new layers for the new data set, and then trains and fine-tunes the model.

Here, in particular, the deep feature extraction unit removes the last fully connected layer (FCL) from the DLPL CNN learning model and adds some additional layers so that the model learns spatial patterns and relationships in the visual data stream for the data set, and extracts the deep features by providing the output of the layer just before the final fully connected SoftMax layer to the next network.

Here, in particular, the encoder unit uses a deep autoencoder to learn a minimal representation of the input data, reconstructs an output as close as possible to the original input data, and outputs a low-dimensional feature map.

Here, in particular, the adjustment module unit uses a long short-term memory (LSTM) network and a recurrent neural network (RNN) to learn the long-term temporal context.

In addition, a deep learning-based action recognition method according to an embodiment of the present invention comprises the steps of: grouping overlapping action classes in a data set of video images, extracting frames from each video image, resizing the frames to the frame size required by a deep learning pipeline (DLPL) CNN learning model, and pre-training the model; training and fine-tuning the pre-trained DLPL CNN learning model by applying an additional data set; extracting high-dimensional deep features by learning frame-level spatial information from a visual data stream using the fine-tuned pre-trained DLPL CNN learning model; compressing the extracted high-dimensional deep features into a low-dimensional feature map; and learning temporal information from the compressed feature map and iteratively fine-tuning the previously trained model by applying the changed part of new video image data to it.

Here, in particular, in the training and fine-tuning step, the last classification layer of the pre-trained DLPL CNN model is removed, new layers are added for the new data set, and the model is then trained and fine-tuned.

Here, in particular, in the step of extracting the deep features, the last fully connected layer (FCL) is removed from the pre-trained DLPL CNN learning model and some additional layers are added so that the model learns spatial patterns and relationships in the visual data stream for the data set, and the deep features are extracted by providing the output of the layer just before the final fully connected SoftMax layer to the next network.

Here, in particular, in the step of compressing into the low-dimensional feature map, a deep autoencoder is used to learn a minimal representation of the input data, an output as close as possible to the original input data is reconstructed, and a low-dimensional feature map is output.

Here, in particular, in the step of fine-tuning and learning, a long short-term memory (LSTM) network and a recurrent neural network (RNN) are used to learn the long-term temporal context.

According to an embodiment disclosed in the present invention, action recognition errors can be reduced through action analysis that accurately differentiates similar human actions and postures.

FIG. 1 is a diagram schematically showing the configuration of a deep learning-based action recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing a spatio-temporal human action recognition (HAR) framework according to an embodiment of the present invention.
FIG. 3 is a diagram schematically showing the architecture of the deep autoencoder of the present invention.
FIG. 4 is a diagram schematically showing the architecture of the RNN model of the present invention.
FIG. 5 is a diagram schematically showing the architecture of the LSTM model of the present invention.
FIG. 6 is a diagram showing the iterative fine-tuning process by which the adjustment module unit of the present invention applies the changes in a new video.
FIG. 7 is a flowchart of a deep learning-based action recognition method according to an embodiment of the present invention.

Since the present invention can be modified in various ways and can have several embodiments, specific embodiments are described in detail with reference to the drawings. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. Like reference numerals are used for like elements throughout the description of the drawings.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, a first component may be termed a second component, and similarly a second component may be termed a first component, without departing from the scope of the present invention. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Throughout the specification and claims, when a part is said to include a certain component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a diagram schematically showing the configuration of a deep learning-based action recognition system according to an embodiment of the present invention, and FIG. 2 is a diagram schematically showing a spatio-temporal human action recognition (HAR) framework according to an embodiment of the present invention.

As shown in FIG. 1, the deep learning-based action recognition system 100 comprises a data preprocessing unit 110, a transfer learning unit 120, a deep feature extraction unit 130, an encoder unit 140, and an adjustment module unit 150.

As shown in FIG. 2, the data preprocessing unit 110 groups overlapping action classes in a data set of video images, extracts frames from each video image, resizes the frames to the frame size required by the deep learning pipeline (DLPL) CNN learning model, and pre-trains the model.

More specifically, the DLPL CNN learning model may include DenseNet201, InceptionV3, ResNet101V2, ResNet152V2, VGG16, VGG19, and Xception, and pre-training is performed using these models.

For example, as data preprocessing, action classes in the UCF-101 data set that have similar parts or partially overlapping actions can be grouped. For the UCF-101 data set, the total number of groups containing overlapping actions is 18, and each group consists of 2 to 5 action classes. After the overlapping action classes are grouped, frames are extracted from each video and then resized to the input shape required by each of the seven pre-trained CNN models. Frames are resized to (224, 224, 3) for the DenseNet201, ResNet101V2, ResNet152V2, VGG16, and VGG19 pre-trained CNN models, and to (299, 299, 3) for InceptionV3 and Xception.
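As a minimal sketch of the per-model resizing step, the lookup table below records the input shape for each of the seven backbones, and a small helper resizes one frame accordingly. The helper names `resize_nearest` and `prepare_frame`, the nested-list frame representation, and the use of nearest-neighbour sampling are assumptions made for illustration; the source does not state which interpolation is used.

```python
# Input shapes (height, width, channels) expected by each backbone.
INPUT_SHAPES = {
    "DenseNet201": (224, 224, 3),
    "ResNet101V2": (224, 224, 3),
    "ResNet152V2": (224, 224, 3),
    "VGG16": (224, 224, 3),
    "VGG19": (224, 224, 3),
    "InceptionV3": (299, 299, 3),
    "Xception": (299, 299, 3),
}


def resize_nearest(frame, out_h, out_w):
    """Resize a frame (list of rows of pixels) with nearest-neighbour sampling.

    Nearest-neighbour is an assumption chosen for brevity.
    """
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def prepare_frame(frame, model_name):
    """Resize one extracted frame to the input shape of the chosen backbone."""
    out_h, out_w, _ = INPUT_SHAPES[model_name]
    return resize_nearest(frame, out_h, out_w)


# Usage: a dummy 4 x 6 RGB frame resized for two different backbones.
dummy = [[(r, c, 0) for c in range(6)] for r in range(4)]
vgg_frame = prepare_frame(dummy, "VGG16")
xcp_frame = prepare_frame(dummy, "Xception")
print(len(vgg_frame), len(vgg_frame[0]))
print(len(xcp_frame), len(xcp_frame[0]))
```

In practice an image library would perform the resize; the table is the part that matters, since feeding a frame of the wrong shape to a backbone fails at the first layer.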

The transfer learning unit 120 trains and fine-tunes the DLPL CNN learning model pre-trained in the data preprocessing unit 110 by applying an additional data set.

More specifically, the transfer learning unit 120 removes the last classification layer of the pre-trained DLPL CNN model, adds new layers for the new data set, and then trains and fine-tunes the model.

In other words, transfer learning is a machine learning technique that learns useful information and hidden patterns in one domain and then transfers them to another, related domain, so that what a model has learned on one task can be generalized to a related problem. A model trained on one task can serve as the starting point for training a model on a related task, and fine-tuning such a pre-trained model is much faster than training a model from scratch because it requires less computation. Another advantage of transfer learning is that only a small amount of data is needed, because the pre-trained model has already been trained for a long time on a very large data set using many optimization strategies. Among the many transfer learning strategies, one is to remove the last classification layer, add new layers, and then train and fine-tune the pre-trained model together with the layers added at its end on the new data set.
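The head-replacement strategy described above can be sketched structurally as follows. This is only an illustration under stated assumptions: the layer records, their names, and the helpers `replace_head` and `unfreeze_last` are hypothetical, no real deep learning library is used, and the class count of 101 is merely an example matching the number of classes in UCF-101.

```python
def replace_head(layers, num_classes):
    """Drop the last classification layer, freeze the rest, add a new head."""
    backbone = [dict(layer, trainable=False) for layer in layers[:-1]]
    new_head = {"name": "new_softmax", "units": num_classes, "trainable": True}
    return backbone + [new_head]


def unfreeze_last(layers, n):
    """Fine-tuning step: make the last n layers trainable again."""
    cut = max(len(layers) - n, 0)
    return [
        dict(layer, trainable=(i >= cut or layer["trainable"]))
        for i, layer in enumerate(layers)
    ]


# Hypothetical pre-trained model with a 1000-class classifier on top.
pretrained = [
    {"name": "conv_block_1", "units": 64, "trainable": True},
    {"name": "conv_block_2", "units": 128, "trainable": True},
    {"name": "softmax_1000", "units": 1000, "trainable": True},
]

# Step 1: train only the new head on the new data set.
head_only = replace_head(pretrained, num_classes=101)
# Step 2: fine-tune by also unfreezing the last backbone block.
fine_tune = unfreeze_last(head_only, 2)

print([(layer["name"], layer["trainable"]) for layer in head_only])
print([(layer["name"], layer["trainable"]) for layer in fine_tune])
```

Training the new head first and unfreezing backbone layers afterwards is a common ordering, because it keeps large gradients from the randomly initialized head from disturbing the pre-trained weights.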

The deep feature extraction unit 130 extracts high-dimensional deep features by learning frame-level spatial information from the visual data stream using the pre-trained DLPL CNN learning model fine-tuned in the transfer learning unit 120. In other words, the deep feature extraction unit 130 removes the last fully connected layer (FCL) from the DLPL CNN learning model and adds some additional layers so that the model learns spatial patterns and relationships in the visual data stream for the data set, and extracts the deep features by providing the output of the layer just before the final fully connected SoftMax layer to the next network.

More specifically, video data contains an enormous amount of useful information hidden in the visual data stream. This information includes color intensity changes, edges and shapes, motion and texture patterns, and long-term temporal context.

Therefore, to take advantage of transfer learning on this information, seven pre-trained CNN models are used: DenseNet201, InceptionV3, ResNet101V2, ResNet152V2, VGG16, VGG19, and Xception.

These CNN models are used to extract frame-level deep features by learning spatial patterns and relationships in the visual data stream. The CNN models may be pre-trained on a data set of about 1.2 million color images covering 1000 different classes.

Here, the deep features extracted by the deep feature extraction unit 130 are features in a high-dimensional space that represent both the local and the global characteristics of the spatial information in a video frame.

To extract these deep features, the last few fully connected layers (FCL) at the end of the pre-trained CNN model are removed and some additional layers are added so that the model can be trained on the data set. After the layers are added at the end of the network, the pre-trained CNN model is fine-tuned so that it can be used for action recognition in videos.

In addition, to extract the deep features, the deep feature extraction unit 130 may capture the output of the layer just before the final fully connected SoftMax layer and provide it to the next network for representation learning. For example, consider a convolutional layer with a (5 x 5) kernel that outputs 200 feature maps of dimension (150 x 100), with a stride of 1 and 'same' padding. For an RGB input image of size (150, 100, 3), that is, with three color channels, the total number of parameters is (5 x 5 x 3 + 1) x 200 = 15,200.

Each of the 200 feature maps consists of (150 x 100) neurons, and each neuron computes a weighted sum of its (5 x 5 x 3) = 75 inputs. With 32-bit floating-point values, the output of the layer occupies (200 x 150 x 100 x 32) = 96 million bits (12 MB) for each instance of the data set. During prediction, the RAM occupied by one layer can be released as soon as the next layer has been computed, but during training everything computed in the forward pass must be preserved for the backward pass.
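The arithmetic in this example can be checked with a few lines of code. The function name `conv_layer_cost` and its parameter names are assumptions introduced for illustration; the numbers passed in are the ones from the example (5 x 5 kernel, 3 input channels, 200 feature maps of 150 x 100 neurons, 32-bit values).

```python
def conv_layer_cost(kernel_h, kernel_w, in_channels, n_maps, out_h, out_w, bits=32):
    """Parameter count and output memory of one convolutional layer."""
    inputs_per_neuron = kernel_h * kernel_w * in_channels
    parameters = (inputs_per_neuron + 1) * n_maps  # +1 for the bias of each map
    output_bits = n_maps * out_h * out_w * bits    # one value per output neuron
    return {
        "inputs_per_neuron": inputs_per_neuron,
        "parameters": parameters,
        "output_bits": output_bits,
        "output_megabytes": output_bits / 8 / 1_000_000,
    }


cost = conv_layer_cost(5, 5, 3, 200, 150, 100)
print(cost)
```

The parameter count is small because the kernel weights are shared across all positions, whereas the memory cost scales with the size of the output feature maps.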

Meanwhile, to detect the position of a specific object in a 2D image, a weighting function can be expressed as Equation 1 below.

[Equation 1]

$$s(t) = (x * w)(t) = \sum_{a} x(a)\, w(t - a)$$

Here, $x(t)$ represents the position at time $t$ and $w(t)$ is the convolution kernel. The position of the object can be calculated as a simple moving average (SMA) using convolution. The kernel for the SMA can be expressed as Equation 2 below.

[Equation 2]

$$w(t) = \begin{cases} 1/n, & 0 \le t < n \\ 0, & \text{otherwise} \end{cases}$$

Here, $n$ is the window size of the simple moving average. The convolution for a 2D image can be calculated as Equation 3 below.

[Equation 3]

$$S(i, j) = (I * K)(i, j) = \sum_{a} \sum_{b} I(a, b)\, K(i - a, j - b)$$

Here, $I$ is the 2D image on which the convolution is performed and $K$ is a 2D smoothing kernel. The output of a single neuron in a convolutional layer can be calculated as Equation 4 below.

[Equation 4]

$$z_{i,j,k} = b_k + \sum_{u=0}^{f_h - 1} \sum_{v=0}^{f_w - 1} \sum_{k'=0}^{f_{n'} - 1} x_{i',j',k'}\, w_{u,v,k',k}, \qquad i' = i\, s_h + u, \quad j' = j\, s_w + v$$

Here, $z_{i,j,k}$ is the output of the neuron in row $i$ and column $j$ of feature map $k$ of convolutional layer $l$; $s_h$ is the vertical stride, $s_w$ is the horizontal stride, $f_h$ is the height of the receptive field, and $f_w$ is the width of the receptive field. In addition, $f_{n'}$ is the number of feature maps in the previous layer $(l - 1)$, and $x_{i',j',k'}$ is the output of the neuron in the previous layer at row $i'$, column $j'$, and feature map or channel $k'$. The bias is denoted by $b_k$, and $w_{u,v,k',k}$ is the connection weight between any neuron in feature map $k$ of layer $l$ and its input at row $u$ and column $v$ of feature map $k'$.
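As a minimal sketch of the two ideas in this passage, the code below computes a simple moving average by 1D convolution and the output of a single neuron of a convolutional layer as a bias plus a weighted sum over its receptive field. The function names, the nested-list data layout, and the sample values are assumptions made for illustration; padding is not handled, so the caller must keep the receptive field inside the input.

```python
def convolve_1d(x, w):
    """Discrete convolution s(t) = sum_a x(a) * w(t - a), fully overlapping part only."""
    n = len(w)
    return [sum(x[t - k] * w[k] for k in range(n)) for t in range(n - 1, len(x))]


def sma_kernel(n):
    """Simple-moving-average kernel: n equal weights of 1/n."""
    return [1.0 / n] * n


def conv_neuron_output(x, w, b, i, j, s_h=1, s_w=1):
    """Output of the neuron at row i, column j of one feature map.

    x: previous-layer output indexed x[row][col][channel]
    w: weights of this feature map indexed w[u][v][channel]
    b: bias of this feature map
    s_h, s_w: vertical and horizontal strides
    """
    f_h, f_w, f_n = len(w), len(w[0]), len(w[0][0])
    total = b
    for u in range(f_h):
        for v in range(f_w):
            for k in range(f_n):
                total += x[i * s_h + u][j * s_w + v][k] * w[u][v][k]
    return total


# Moving average of hypothetical object positions with a window of 3.
positions = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = convolve_1d(positions, sma_kernel(3))

# One neuron over a 3 x 3 single-channel input with a 2 x 2 kernel of ones.
image = [[[r * 3 + c] for c in range(3)] for r in range(3)]
kernel = [[[1], [1]], [[1], [1]]]
neuron = conv_neuron_output(image, kernel, b=0.5, i=1, j=1)
print(smoothed, neuron)
```

With a kernel of ones the neuron simply sums the four inputs under its receptive field and adds the bias, which makes the indexing easy to verify by hand.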

FIG. 3 is a diagram schematically showing the architecture of the deep autoencoder of the present invention, FIG. 4 is a diagram schematically showing the architecture of the RNN model of the present invention, FIG. 5 is a diagram schematically showing the architecture of the LSTM model of the present invention, and FIG. 6 is a diagram showing the iterative fine-tuning process by which the adjustment module unit of the present invention applies the changes in a new video.

The encoder unit 140 compresses the high-dimensional deep features extracted by the deep feature extraction unit 130 into a low-dimensional feature map. Here, the encoder unit 140 uses a deep autoencoder to learn a minimal representation of the input data, reconstructs an output as close as possible to the original input data, and outputs the low-dimensional feature map.

More specifically, as shown in FIG. 3, the encoder unit 140 may be a deep autoencoder, a neural network that learns a representation of the input data and can reconstruct from it an output as close as possible to the original input. The reconstruction performed by the encoder unit 140 is a learning process, not a memorization process, and generates the output from the input data in an unsupervised manner. A deep autoencoder is a neural network of this architecture with multiple hidden layers.

The autoencoder of the encoder unit 140 consists of two parts, an encoder and a decoder. The encoder learns the smallest representation of the input data that the decoder can then use, and the decoder reconstructs the output closest to the original input data with the help of the internal representation generated by the encoder. The internal representation generated by the encoder is called the latent space, and it contains a probability distribution generated by the encoder that is supplied to the decoder. The goal of the decoder is to generate output data from the latent space; because it generates samples from a probability distribution close to that of the original input, the autoencoder can also be used for generative modeling. In general, the latent space holds useful features in fewer dimensions than the original input, and learning this lower-dimensional representation makes it possible to obtain the most salient features of the given training data.
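The encoder and decoder roles described above are often summarized by a reconstruction objective. The following is a sketch in standard autoencoder notation; the symbols $x$, $z$, $\hat{x}$, $f_\theta$, $g_\phi$, $d$, and $d'$ are assumptions introduced here and do not appear in the source, and the squared-error loss is one common choice rather than one confirmed by the source.

```latex
\begin{aligned}
z &= f_\theta(x), \qquad x \in \mathbb{R}^{d},\; z \in \mathbb{R}^{d'},\; d' < d \\
\hat{x} &= g_\phi(z) \\
\mathcal{L}(\theta, \phi) &= \lVert x - \hat{x} \rVert_2^2
\end{aligned}
```

Under this reading, $z$ is the low-dimensional feature map that is passed on to the next stage, and the decoder $g_\phi$ is needed only during training to measure how much information the compression preserves.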

The adjustment module unit 150 learns temporal information from the feature map compressed by the encoder unit 140 and iteratively fine-tunes the previously trained model by applying the changed part of new video image data to it. Here, the adjustment module unit 150 uses a long short-term memory (LSTM) network and a recurrent neural network (RNN) to learn the long-term temporal context.

More specifically, as shown in FIG. 4, the recurrent neural network (RNN) model of the adjustment module unit 150 is a neural network that uses previous data to learn temporal information from a frame sequence. The output of the previous step is passed to the next step, and the 'memory' of the RNN keeps track of what has been learned so far. Because the same parameters are used to perform the same operation on every input across all hidden layers, the parameter complexity is reduced; keeping the weights and biases identical for all layers also turns independent activations into dependent ones. Inputs are fed to the network one time step at a time. The current state is computed from both the current input and the previous network state. The current internal state, called the hidden state, then becomes the previous state for the next time step. After all time steps have been completed in the same way, the output is computed from the final state and compared with the actual (target) output to obtain the error. This error is back-propagated through the network to update the weights and biases for the next iteration. Through this mechanism, the RNN can learn temporal features from the frame sequence of the visual data stream.
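The recurrence described above, in which the same weights are reused at every time step and each hidden state is carried into the next step, can be sketched as follows. The tanh activation, the weight values, and the function name rnn_forward are assumptions chosen for illustration; the specification does not fix them, and training by back-propagation is omitted.

```python
import math


def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a simple recurrent layer over a sequence of per-frame feature vectors.

    The same weights (w_xh, w_hh, b_h) are shared by every time step.
    Returns the hidden state after each step.
    """
    hidden = [0.0] * len(b_h)  # initial hidden state
    states = []
    for x in inputs:  # one time step per frame
        new_hidden = []
        for j in range(len(b_h)):
            total = b_h[j]
            total += sum(w_xh[j][k] * x[k] for k in range(len(x)))
            total += sum(w_hh[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden  # current state becomes the previous state
        states.append(hidden)
    return states


# Hypothetical weights for 2-dimensional inputs and a 2-dimensional hidden state.
w_xh = [[0.5, -0.3], [0.2, 0.8]]
w_hh = [[0.1, 0.4], [-0.2, 0.3]]
b_h = [0.0, 0.0]

states_a = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w_xh, w_hh, b_h)
states_b = rnn_forward([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w_xh, w_hh, b_h)
```

The two sequences differ only in their first frame, yet their final hidden states differ as well, which is the 'memory' effect the paragraph refers to.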

In addition, as shown in FIG. 5, the long short-term memory (LSTM) model of the adjustment module unit 150 is a model designed to address the vanishing and exploding gradient problems of the RNN model. An LSTM recurrent unit keeps track of previously learned knowledge and discards irrelevant data. To this end, every LSTM recurrent unit maintains a vector called the internal cell state. The LSTM uses four gates. The forget gate uses a sigmoid activation function and determines how much of the previous data is forgotten. The input gate (or update gate) uses a sigmoid activation function and determines how much information is stored in the internal cell state. The input modulation gate (or relevance gate) uses a hyperbolic tangent activation function and modulates the information on which the input gate operates. The output gate uses a sigmoid activation function and generates the output from the internal cell state.
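A single LSTM step with the four gates named above can be sketched as follows. This follows the commonly used LSTM formulation (cell state updated as forget gate times previous cell state plus input gate times modulated input); the parameter layout and the function name lstm_step are assumptions for illustration, and the specification does not give the equations explicitly.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each gate name to (W, U, b): input weights, recurrent weights, bias.
    Returns the new hidden state and the new internal cell state.
    """
    def gate(name, activation):
        w, u, b = params[name]
        return [
            activation(
                b[j]
                + sum(w[j][k] * x[k] for k in range(len(x)))
                + sum(u[j][k] * h_prev[k] for k in range(len(h_prev)))
            )
            for j in range(len(b))
        ]

    forget = gate("forget", sigmoid)            # how much old cell state to keep
    input_gate = gate("input", sigmoid)         # how much new information to store
    candidate = gate("modulation", math.tanh)   # modulated new information
    output_gate = gate("output", sigmoid)       # how much of the cell state to emit

    c_new = [f * c + i * g
             for f, c, i, g in zip(forget, c_prev, input_gate, candidate)]
    h_new = [o * math.tanh(c) for o, c in zip(output_gate, c_new)]
    return h_new, c_new


# With all-zero parameters every sigmoid gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved. This makes the step easy to check by hand.
zero_gate = ([[0.0]], [[0.0]], [0.0])
zero_params = {name: zero_gate
               for name in ("forget", "input", "modulation", "output")}
h_out, c_out = lstm_step([1.0], [0.0], [2.0], zero_params)
```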

The adjustment module unit 150 then performs fine-tuning through an iterative learning model for new data streams.

Specifically, as shown in FIG. 6, the fine-tuning module architecture addresses the fact that, because the characteristics of video data change over time, a previously trained model cannot adapt to changes in the video data and cannot correctly learn tasks from a new data stream. The framework therefore iteratively updates itself on new data so as to recognize changes in the surrounding environment. To retrain the learning model iteratively, data streams whose predicted confidence score is greater than a specific threshold are used, so that various conditions in the surrounding environment can be accommodated. The threshold can be chosen according to the deployment setting and user requirements, taking environmental effects into account. Once enough data has been collected, the previously trained model is fine-tuned so that it adapts to the environmental changes in the new data. The trained model can then be applied to many other areas, such as patient activity monitoring and the detection of fraud and abnormal behavior in video for real-time surveillance.
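The selection of data for iterative fine-tuning described above can be sketched as a small buffer that accepts only predictions above a confidence threshold and reports when enough samples have been gathered. The class name FineTuneBuffer, the threshold of 0.9, the minimum of two samples, and the clip identifiers are all hypothetical; the specification leaves these values to the deployment setting and user requirements, and the fine-tuning call itself is not shown.

```python
class FineTuneBuffer:
    """Collects confidently predicted clips until a fine-tuning round is due."""

    def __init__(self, threshold, min_samples):
        self.threshold = threshold      # minimum confidence for a clip to be kept
        self.min_samples = min_samples  # samples needed before fine-tuning
        self.samples = []

    def offer(self, clip_id, predicted_label, confidence):
        """Keep the clip only if its predicted confidence exceeds the threshold."""
        if confidence > self.threshold:
            self.samples.append((clip_id, predicted_label))
            return True
        return False

    def ready(self):
        """True once enough data has been collected for a fine-tuning round."""
        return len(self.samples) >= self.min_samples

    def drain(self):
        """Hand over the collected samples and start a new collection round."""
        batch, self.samples = self.samples, []
        return batch


buffer = FineTuneBuffer(threshold=0.9, min_samples=2)
stream = [
    ("clip-1", "walking", 0.97),
    ("clip-2", "running", 0.55),  # below the threshold, so it is skipped
    ("clip-3", "running", 0.93),
]
accepted = [buffer.offer(*item) for item in stream]
batch = buffer.drain() if buffer.ready() else []
```

In a deployment along these lines, the drained batch would be the input to the next fine-tuning pass of the previously trained model.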

FIG. 7 is a flowchart of a deep learning-based action recognition method according to an embodiment of the present invention.

As shown in FIG. 7, in the deep learning-based action recognition method according to an embodiment of the present invention, overlapping action classes in a data set of video images are first grouped, frames are extracted from each video image and resized to a frame size suited to the deep learning pipeline (DLPL) CNN learning model, and the model is pre-trained (S710).

Here, the deep learning pipeline (DLPL) CNN learning models include DenseNet201, InceptionV3, ResNet101V2, ResNet152V2, VGG16, VGG19, and Xception, and it is preferable to use any one of these pre-trained CNN learning models.
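The frame extraction and resizing of step S710 can be sketched as follows. The sketch assumes evenly spaced frame sampling and nearest-neighbour resizing on frames represented as nested lists; both choices, along with the function names, are illustrative assumptions, since the specification does not state how frames are selected or interpolated. The target size would depend on the chosen CNN model.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick evenly spaced frame indices from a clip of total_frames frames."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    return [i * total_frames // num_samples for i in range(num_samples)]


def resize_nearest(frame, out_height, out_width):
    """Resize a frame (list of pixel rows) with nearest-neighbour sampling."""
    in_height, in_width = len(frame), len(frame[0])
    return [
        [frame[r * in_height // out_height][c * in_width // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]


# Four frames from a ten-frame clip, and a 2x2 frame enlarged to 4x4.
indices = sample_frame_indices(total_frames=10, num_samples=4)
resized = resize_nearest([[1, 2], [3, 4]], out_height=4, out_width=4)
```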

Next, transfer learning and fine-tuning are performed by applying an additional data set to the pre-trained deep learning pipeline (DLPL) CNN learning model (S720).

More specifically, in the transfer learning and fine-tuning step, the last classification layer of the pre-trained deep learning pipeline (DLPL) CNN model is removed and new layers are added, which are then trained and fine-tuned on the new data set.

Subsequently, the fine-tuned, pre-trained deep learning pipeline (DLPL) CNN learning model is applied to learn frame-level spatial information from the visual data stream and extract high-dimensional deep features (S730). For deep feature extraction, the last fully connected layer (FCL) is removed from the pre-trained DLPL CNN learning model and some additional layers are added so that the model learns the spatial patterns and relationships in the visual data stream for the data set; the output of the layer immediately before the final fully connected SoftMax layer is then passed to the next network as the deep features.

Here, the deep features are features in a high-dimensional space that represent both the local and the global characteristics of the spatial information in a video frame.

Next, the extracted high-dimensional deep features are compressed into a low-dimensional feature map (S740). In this compression step, a deep autoencoder is used to learn a minimal representation of the input data and to reconstruct the output closest to the original input data, and the low-dimensional feature map is output.

More specifically, to produce the low-dimensional feature map, a deep autoencoder can be applied: a neural network that, based on a learned representation of the input data, reconstructs an output as close as possible to the original input. The reconstruction is the result of a learning process, not a memorization process, and generates the output from the input data in an unsupervised manner. The deep autoencoder can be built as a neural network of this architecture with multiple hidden layers.

The deep autoencoder may be composed of two parts: an encoder, which learns the smallest representation of the input data that the decoder can subsequently use, and a decoder, which reconstructs the output closest to the original input data with the help of the internal representation generated by the encoder. The internal representation generated by the encoder is defined as the latent space, and the latent space contains the probability distribution generated by the encoder and supplied to the decoder. The goal of the decoder is to generate output data from the latent space, producing samples from a probability distribution close to that of the original input, which allows the autoencoder to be used for generative modeling. In general, the latent space holds useful features in fewer dimensions than the original input. Learning this lower-dimensional representation makes it possible to obtain the most salient features of the given training data.

Subsequently, temporal information is learned from the compressed feature map, and the previously trained model is iteratively fine-tuned by applying the changed portion of new video image data to it (S750). In this step, a long short-term memory (LSTM) and a recurrent neural network (RNN) are used to learn long-term temporal context.

The recurrent neural network (RNN) model is a neural network that uses previous data to learn temporal information from a frame sequence. The output of the previous step is passed to the next step, and the 'memory' of the RNN keeps track of what has been learned so far. Because the same parameters are used to perform the same operation on every input across all hidden layers, the parameter complexity is reduced; keeping the weights and biases identical for all layers also turns independent activations into dependent ones. Inputs are fed to the network one time step at a time. The current state is computed from both the current input and the previous network state. The current internal state, called the hidden state, then becomes the previous state for the next time step. After all time steps have been completed in the same way, the output is computed from the final state and compared with the actual (target) output to obtain the error. This error is back-propagated through the network to update the weights and biases for the next iteration. Through this mechanism, the RNN can learn temporal features from the frame sequence of the visual data stream.

In addition, the long short-term memory (LSTM) model is a model designed to address the vanishing and exploding gradient problems of the RNN model. An LSTM recurrent unit keeps track of previously learned knowledge and discards irrelevant data. To this end, every LSTM recurrent unit maintains a vector called the internal cell state. The LSTM uses four gates. The forget gate uses a sigmoid activation function and determines how much of the previous data is forgotten. The input gate (or update gate) uses a sigmoid activation function and determines how much information is stored in the internal cell state. The input modulation gate (or relevance gate) uses a hyperbolic tangent activation function and modulates the information on which the input gate operates. The output gate uses a sigmoid activation function and generates the output from the internal cell state.

Fine-tuning through an iterative learning model is then performed for new data streams.

Specifically, the learning model for fine-tuning addresses the fact that, because the characteristics of video data change over time, a previously trained model cannot adapt to changes in the video data and cannot correctly learn tasks from a new data stream. The framework therefore iteratively updates itself on new data so as to recognize changes in the surrounding environment. To retrain the learning model iteratively, data streams whose predicted confidence score is greater than a specific threshold are used, so that various conditions in the surrounding environment can be accommodated. The threshold can be chosen according to the deployment setting and user requirements, taking environmental effects into account. Once enough data has been collected, the previously trained model is fine-tuned so that it adapts to the environmental changes in the new data. The trained model can then be applied to many other areas, such as patient activity monitoring and the detection of fraud and abnormal behavior in video for real-time surveillance.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. Computer-readable media can be any available media that can be accessed by a computer and include volatile and nonvolatile media as well as removable and non-removable media. Computer-readable media may also include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically carry computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or in another transport mechanism, and include any information delivery media.

Although the method and system of the present invention have been described with reference to specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present invention is illustrative, and a person of ordinary skill in the art to which the invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

110: data preprocessing unit
120: transfer learning unit
130: deep feature extraction unit
140: encoder unit
150: adjustment module unit

Claims

A deep learning-based action recognition system comprising:
a data preprocessing unit that groups overlapping action classes in a data set of video images, extracts frames from each video image, resizes the frames to a frame size suited to a deep learning pipeline (DLPL) CNN learning model, and pre-trains the model;
a transfer learning unit that trains and fine-tunes the deep learning pipeline (DLPL) CNN learning model pre-trained in the data preprocessing unit by applying an additional data set;
a deep feature extraction unit that extracts high-dimensional deep features by learning frame-level spatial information from a visual data stream using the pre-trained deep learning pipeline (DLPL) CNN learning model fine-tuned in the transfer learning unit;
an encoder unit that compresses the high-dimensional deep features extracted by the deep feature extraction unit into a low-dimensional feature map; and
an adjustment module unit that learns temporal information from the feature map compressed by the encoder unit and iteratively fine-tunes the previously trained model by applying the changed portion of new video image data to it.

The deep learning-based action recognition system of claim 1,
wherein the deep learning pipeline (DLPL) CNN learning models include DenseNet201, InceptionV3, ResNet101V2, ResNet152V2, VGG16, VGG19, and Xception, and any one of these pre-trained CNN learning models is used.

The deep learning-based action recognition system of claim 1,
wherein the transfer learning unit removes the last classification layer of the pre-trained deep learning pipeline (DLPL) CNN model and adds new layers, which are trained and fine-tuned on a new data set.

The deep learning-based action recognition system of claim 1,
wherein the deep feature extraction unit removes the last fully connected layer (FCL) from the pre-trained deep learning pipeline (DLPL) CNN learning model, adds some additional layers to learn the spatial patterns and relationships in the visual data stream for the data set, and extracts the deep features by providing the output of the layer before the final fully connected SoftMax layer to the next network.

The deep learning-based action recognition system of claim 1,
wherein the encoder unit learns a minimal representation of the input data using a deep autoencoder, reconstructs the output closest to the original input data, and outputs the low-dimensional feature map.

The deep learning-based action recognition system of claim 1,
wherein the adjustment module unit uses a long short-term memory (LSTM) and a recurrent neural network (RNN) to learn long-term temporal context.

A deep learning-based action recognition method comprising:
grouping overlapping action classes in a data set of video images, extracting frames from each video image, resizing the frames to a frame size suited to a deep learning pipeline (DLPL) CNN learning model, and pre-training the model;
performing transfer learning and fine-tuning by applying an additional data set to the pre-trained deep learning pipeline (DLPL) CNN learning model;
extracting high-dimensional deep features by learning frame-level spatial information from a visual data stream using the fine-tuned, pre-trained deep learning pipeline (DLPL) CNN learning model;
compressing the extracted high-dimensional deep features into a low-dimensional feature map; and
learning temporal information from the compressed feature map and iteratively fine-tuning the previously trained model by applying the changed portion of new video image data to it.

The deep learning-based action recognition method of claim 7,
wherein the deep learning pipeline (DLPL) CNN learning models include DenseNet201, InceptionV3, ResNet101V2, ResNet152V2, VGG16, VGG19, and Xception, and any one of these pre-trained CNN learning models is used.

The deep learning-based action recognition method of claim 7,
wherein, in the transfer learning and fine-tuning step, the last classification layer of the pre-trained deep learning pipeline (DLPL) CNN model is removed and new layers are added, which are trained and fine-tuned on a new data set.

The deep learning-based action recognition method of claim 7,
wherein extracting the deep features comprises removing the last fully connected layer (FCL) from the pre-trained deep learning pipeline (DLPL) CNN learning model, adding some additional layers to learn the spatial patterns and relationships in the visual data stream for the data set, and providing the output of the layer before the final fully connected SoftMax layer to the next network.

The deep learning-based action recognition method of claim 7,
wherein, in the step of compressing into the low-dimensional feature map, a minimal representation of the input data is learned using a deep autoencoder, the output closest to the original input data is reconstructed, and the low-dimensional feature map is output.

The deep learning-based action recognition method of claim 7,
wherein, in the fine-tuning and learning step, a long short-term memory (LSTM) and a recurrent neural network (RNN) are used to learn long-term temporal context.

A computer-readable recording medium recording a program for performing the method according to any one of claims 7 to 12 on a computer.