KR102529876B1

KR102529876B1 - A Self-Supervised Sampler for Efficient Action Recognition, and Surveillance Systems with Sampler

Info

Publication number: KR102529876B1
Application number: KR1020220143407A
Authority: KR
Inventors: 최동걸; 서민석; 이상우
Original assignee: 한밭대학교 산학협력단
Priority date: 2022-11-01
Filing date: 2022-11-01
Publication date: 2023-05-09

Abstract

An action recognition method of an action recognition model including a sampler according to another embodiment of the present invention comprises: a receiving step in which the sampler trained by the above learning method receives a downsampled video clip; a score prediction step in which the trained sampler predicts an irrelevance score for each frame of the downsampled video clip; a frame removal step of removing action-irrelevant frames based on the predicted irrelevance score of each frame; and an action recognition step in which the action recognition model performs action recognition on frames selected from those remaining after the irrelevant frames are removed from the video clip.

Description

A Self-Supervised Sampler for Efficient Action Recognition, and Surveillance Systems with Sampler

The present invention relates to a self-supervised sampler for action recognition and a video surveillance system using the sampler, and more particularly, to a self-supervised sampler for efficient action recognition in a real-world surveillance system and a video surveillance system using the sampler.

Artificial intelligence systems that take images as input and perform inference have been widely commercialized, and in certain fields artificial intelligence capable of replacing experts has been developed. Video-based artificial intelligence systems, however, have not yet reached human-level recognition performance, and even when such a system is built, it consumes far more computation than image-based artificial intelligence, so it has not yet been widely commercialized in real applications. Therefore, if a sampler capable of improving both the amount of computation and the recognition performance is developed, a system based on action recognition built using the sampler is expected to have high marketability.

Action recognition is one of the major research topics widely applied in robotics applications such as video surveillance, gesture recognition, and human-robot interaction. The general prediction method of a CNN-based action recognition model is simply to receive all frames in dense form and use the average of the predictions output by the model. However, this prediction method is inefficient because all frames are used equally regardless of whether an action is present. Dense prediction is even more inefficient in real-time action recognition applications where the input video is unrefined. Therefore, to improve the prediction method of action recognition models that receive such dense input, a CNN-based sampler with a very small amount of computation that selects the frames related to the representative action appearing in the video is required.

Korean Patent Registration No. 10-1986002 (2019.06.04.)

S. Benaim et al., "SpeedNet: Learning the speediness in videos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9919-9928.
D. Kim, D. Cho, and I. S. Kweon, "Self-supervised video representation learning with space-time cubic puzzles," in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, 2019, pp. 8545-8552.

Accordingly, the present invention has been devised to solve the problems of the prior art described above, and an object of the present invention is to provide a self-supervised sampler for action recognition and a video surveillance system using the same.

However, the objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the problems described above, a method for training a CNN-based self-supervised sampler according to an embodiment of the present invention comprises: a receiving step of receiving a first video and a second video for training; a mixing step of extracting one irrelevant frame from the received second video and mixing the extracted frame into the first video to generate a first video having an irrelevant frame; and a learning step of learning to identify which frame in the mixed first video is the mixed frame.

Preferably, the mixing step randomly mixes one irrelevant frame of a clip of the received second video into a clip of the received first video to generate a clip of the first video having an irrelevant frame, and the learning step trains the sampler with the mixed clip of the first video as input so that it finds the index of the irrelevant frame, wherein the clip of the first video having one irrelevant frame is the input of the sampler and the index of that frame is the target label.

Preferably, the clip of the first video having the irrelevant frame is obtained by: a process of shuffling a mini-batch $X$ along the batch axis to obtain a shuffled mini-batch $\bar{X} \in \mathbb{R}^{B \times T \times C \times H \times W}$, where $B$ is the batch size, $T$ is the temporal dimension, $C$ is the number of channels, and $H$ and $W$ are the spatial dimensions; and a process of, for each video clip $X_b$ in mini-batch $X$, randomly selecting one frame $X_b(t)$ at time $t$ and then exchanging it with the frame $\bar{X}_b(t)$ of the $b$-th clip in the shuffled mini-batch $\bar{X}$ at time $t$.

Preferably, the sampler $S$ takes one prepared video clip $\tilde{X}_b$, then generates an irrelevance score $\hat{Y}_b$ for each frame in the video clip, and updates the parameters $\Theta$ based on a loss function.

An action recognition method of an action recognition model including a sampler according to another embodiment of the present invention comprises: a receiving step in which the sampler trained by the above learning method receives a downsampled video clip; a score prediction step in which the trained sampler predicts an irrelevance score for each frame of the downsampled video clip; a frame removal step of removing action-irrelevant frames based on the predicted irrelevance score of each frame; and an action recognition step in which the action recognition model performs action recognition on frames selected from those remaining after the irrelevant frames are removed from the video clip.

Preferably, the frame removal step compares the predicted irrelevance score of each frame with a threshold and removes the frames whose score is higher than the threshold.

Preferably, the sampler network $S$ consists of five 3D-CNN layers. When $X \in \mathbb{R}^{B \times T \times C \times H \times W}$ denotes one downsampled input mini-batch consisting of video clips, where $B$ is the batch size, $T$ is the temporal dimension, $C$ is the number of channels, and $H$ and $W$ are the spatial dimensions, the trained sampler predicts the irrelevance score for each frame of the $b$-th video clip $X_b$ in mini-batch $X$ as

$$\hat{Y}_b = S(X_b; \Theta),$$

where $\hat{Y}_b$ and $\Theta$ denote the irrelevance scores and the learnable parameters of the sampler, respectively, and the irrelevance score of each frame in a video clip indicates how far that frame is from the frames of the video clip as a whole.

Preferably, the frames selected from those remaining after the irrelevant frames are removed from the video clip are selected by either a process of removing frames whose irrelevance score is greater than a threshold $\tau$ and then selecting a fixed number $N$ of frames at regular intervals, or a process of setting a threshold $\tau$ and then selecting only the frames whose irrelevance score is smaller than $\tau$.

Preferably, the action recognition step recognizes the action in the video by applying a dense prediction method with the selected frames as input.

A video surveillance system using a sampler according to another embodiment of the present invention comprises: an object detection unit that receives video from a CCTV camera and detects and localizes the people in the video using an object detection model for real-time object detection; an object tracking unit that recognizes an individual ID for each detected person using an object tracking model; a self-supervised sampler that receives, for each recognized ID, a video clip consisting of a plurality of frames, removes the shuffled frames from the video clip, and outputs the remaining frames to an action recognition model; and the action recognition model, which recognizes each person's action using the remaining frames received from the self-supervised sampler.

Preferably, the sampler receives a downsampled video clip, predicts an irrelevance score for each frame of the downsampled video clip, and removes action-irrelevant frames based on the predicted irrelevance score of each frame, and the action recognition model performs action recognition on frames selected from those remaining after the irrelevant frames are removed from the video clip.

Preferably, the sampler is a sampler trained by the above learning method.

A method of operating a video surveillance system using a sampler according to another embodiment of the present invention comprises: an object detection step in which an object detection unit receives video from a CCTV camera and detects and localizes the people in the video using an object detection model for real-time object detection; an object tracking step in which an object tracking unit recognizes an individual ID for each detected person using an object tracking model; a frame selection step in which a self-supervised sampler receives, for each recognized ID, a video clip consisting of a plurality of frames, removes the shuffled frames from the video clip, and outputs the remaining frames to an action recognition model; and an action recognition step in which the action recognition model recognizes each person's action using the remaining frames received from the self-supervised sampler.
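The four steps of the operating method above can be pictured as a per-frame loop. The following is a minimal sketch for illustration only, not the patented implementation: `detect`, `track`, `sampler`, and `recognize` are hypothetical callables standing in for the object detection model, the object tracking model, the self-supervised sampler, and the action recognition model, and the fixed buffer length `clip_len` is an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x, y, w, h) bounding box


def run_surveillance(
    frames: Iterable[object],
    detect: Callable[[object], List[Box]],
    track: Callable[[object, List[Box]], Dict[Hashable, Box]],
    sampler: Callable[[List[object]], List[object]],
    recognize: Callable[[List[object]], str],
    clip_len: int = 16,
) -> List[Tuple[Hashable, str]]:
    """Detect people, assign IDs, buffer one clip per ID, filter it, recognize."""
    buffers: Dict[Hashable, List[object]] = defaultdict(list)
    results: List[Tuple[Hashable, str]] = []
    for frame in frames:
        boxes = detect(frame)              # object detection step
        tracks = track(frame, boxes)       # object tracking step: ID -> box
        for person_id, box in tracks.items():
            buffers[person_id].append((frame, box))
            if len(buffers[person_id]) == clip_len:
                clip = buffers.pop(person_id)
                kept = sampler(clip)       # frame selection step
                if kept:
                    # action recognition step on the remaining frames
                    results.append((person_id, recognize(kept)))
    return results
```

Buffering per ID keeps each person's clip separate, so the sampler and the recognition model always see frames that belong to a single tracked person.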

A computer-readable recording medium according to another embodiment of the present invention records a program for executing the method for training the CNN-based self-supervised sampler or the action recognition method of the action recognition model including the sampler.

In an experiment on the subway-station abnormal behavior detection dataset published on AI Hub, the self-supervised sampler for action recognition according to an embodiment of the present invention allowed the same action recognition model to perform prediction with only about one third of the computation, while performance also improved by about 7.5%.

Using the self-supervised sampler for action recognition of the present invention can greatly reduce the computation of video-input artificial intelligence models researched and developed in the future, which eases the burden of real-time operation and computational efficiency when building a system; an improvement in the recognition performance of the action recognition model can also be expected.

FIG. 1 shows the process of generating training data for the sampler of the present invention.
FIG. 2 shows the overall pipeline of action recognition using the sampler of the present invention during the inference stage.
FIG. 3 shows a method for training a self-supervised sampler according to another embodiment of the present invention.
FIG. 4 shows an action recognition method of an action recognition model including a sampler according to another embodiment of the present invention.
FIG. 5 is a block diagram of a real-time surveillance system in a subway station using a self-supervised sampler according to another embodiment of the present invention.
FIG. 6 shows a method of operating a real-time surveillance system in a subway station using a self-supervised sampler according to another embodiment of the present invention.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the present invention, detailed descriptions of related known technologies are omitted where they would obscure the gist of the invention.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present invention relates to a CNN-based sampler with a very small amount of computation that selects the frames related to the representative action appearing in a video, in order to improve the prediction method of action recognition models that receive dense input.

The CNN-based sampler of the present invention is trained by mixing one unrelated frame, that is, a frame extracted from another video, into an input video and then finding it. The sampler trained in this way is used to refine the frames to be used as input to the action recognition model. The sampler can be combined with any action recognition model and greatly improves efficiency in terms of prediction time and memory.

The CNN-based sampler of the present invention is applied to CNN-based action recognition models and to artificial intelligence networks that take video as input. Most artificial intelligence models require a high-performance graphics processing unit, and the scale of the model that can be deployed in a system depends on the performance of that unit. Artificial intelligence that takes video as input captures motion information along the time axis and recognizes actions based on it, and therefore requires far more computation than image-based artificial intelligence. Accordingly, the amount of computation is one of the key factors to consider when developing and researching real applications based on action recognition. The CNN-based sampler model of the present invention can greatly alleviate this problem and can also improve the performance of the action recognition model.

Hereinafter, the motivation, the network structure, the sampler training procedure, and action recognition using the sampler are described for the self-supervised sampler for action recognition of the present invention.

The motivation of the present invention is that the input frames of an action recognition model include redundant or irrelevant frames. Removing redundant or irrelevant frames from an input video clip can therefore reduce computation and improve accuracy. To achieve this, the sampler model of the present invention is trained to measure the relevance of each frame to the entire clip. Because the sampler is an auxiliary model for faster inference, it is also designed to be lightweight. Training such a sampler model with supervised learning would require expensive annotations of the irrelevant frame positions in videos. Instead of such expensive annotations, the present invention uses a simple self-supervised method that synthetically generates videos having an irrelevant frame by mixing: given two videos, one frame from the first video is inserted into the second video, and the sampler model is trained to identify which frame in the second video is the mixed frame. No human annotation is required. After training, the sampler model is expected to be able to find irrelevant frames in real videos.

The network structure is described below.

The sampler network $S$ of the present invention consists of five 3D-CNN layers. $X \in \mathbb{R}^{B \times T \times C \times H \times W}$ denotes one downsampled input mini-batch consisting of video clips, where $B$ is the batch size, $T$ is the temporal dimension, $C$ is the number of channels, and $H$ and $W$ are the spatial dimensions. The sampler generates the irrelevance score for each frame of the $b$-th video clip $X_b$ in mini-batch $X$ as in Equation 1:

$$\hat{Y}_b = S(X_b; \Theta) \qquad (1)$$

Here, $\hat{Y}_b$ and $\Theta$ denote the irrelevance scores and the learnable parameters of the sampler, respectively. The irrelevance score of each frame in a video clip indicates how far that frame is from the frames of the video clip as a whole. These irrelevance scores are used to remove frames for action recognition.

The sampler training procedure is described below.

This section describes a self-supervised learning method for building a sampler model that can find irrelevant frames without human effort. To this end, one irrelevant frame is randomly mixed into a given video clip to create the input clip of the sampler. The sampler is then trained to find the index of the irrelevant frame. That is, a video clip having one irrelevant frame is the input, and the index of that frame is the target label.

FIG. 1 shows the process of generating training data for the sampler of the present invention. As shown in FIG. 1, a random number of frames are mixed at random positions within the same mini-batch.

Specifically, $X$ is shuffled along the batch axis to obtain $\bar{X}$. For each video clip $X_b$ in mini-batch $X$, one frame $X_b(t)$ at time $t$ is randomly selected and then exchanged, as shown in FIG. 1, with the frame $\bar{X}_b(t)$ of the $b$-th clip in the shuffled mini-batch $\bar{X}$ at time $t$. This process is expressed as Equation 2:

$$\tilde{X}_b(i) = \begin{cases} \bar{X}_b(i), & i = t \\ X_b(i), & i \neq t \end{cases} \qquad Y_b(i) = \begin{cases} 1, & i = t \\ 0, & i \neq t \end{cases} \qquad (2)$$

Here, $\tilde{X}$ and $Y$ are the inputs and labels for training the sampler. $Y_b$ is a one-hot vector for each video clip $\tilde{X}_b$: $Y_b(i)$ is 1 at the position $i = t$ where the frame was exchanged and 0 elsewhere. One-hot encoding is a vector representation in which the dimension of the vector equals the size of the target label set, the index corresponding to the label is assigned 1, and all other indices are assigned 0. A vector expressed in this way is called a one-hot vector.
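The data generation procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the inventors' code: a mini-batch is modeled as a list of clips of equal length, each clip is a list of opaque frame objects, and the function name `make_training_batch` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def make_training_batch(
    batch: Sequence[Sequence[object]], rng: random.Random
) -> Tuple[List[List[object]], List[List[int]]]:
    """Mix one frame from a shuffled copy of the batch into each clip.

    Returns the mixed clips and a one-hot label per clip marking the
    position of the exchanged frame.
    """
    num_clips = len(batch)
    # Shuffle along the batch axis: order[b] is the clip that sits at
    # position b of the shuffled mini-batch.
    order = list(range(num_clips))
    rng.shuffle(order)

    inputs: List[List[object]] = []
    labels: List[List[int]] = []
    for b, clip in enumerate(batch):
        t = rng.randrange(len(clip))       # random time index for this clip
        mixed = list(clip)
        mixed[t] = batch[order[b]][t]      # frame at time t of the shuffled batch
        one_hot = [0] * len(clip)
        one_hot[t] = 1                     # label marks the exchanged position
        inputs.append(mixed)
        labels.append(one_hot)
    return inputs, labels
```

A plain random permutation can map a clip to itself, in which case the exchanged frame is unchanged; the description does not say whether this case is handled specially, so the sketch leaves it as is.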

In the training stage, the sampler $S$ of the present invention takes one prepared video clip $\tilde{X}_b$ and then generates an irrelevance score $\hat{Y}_b$ for each frame in the video clip. The parameters $\Theta$ are updated based on the loss function of Equation 3.
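Because the label is a one-hot vector over the frame positions, one common choice for this kind of index-prediction objective is softmax cross-entropy. The sketch below assumes that choice purely for illustration; it should not be read as the form of Equation 3, and the function name `cross_entropy` is hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(scores: Sequence[float], one_hot: Sequence[int]) -> float:
    """Softmax cross-entropy between per-frame scores and a one-hot label.

    Computed as log-sum-exp(scores) minus the score at the labeled index,
    which avoids taking the logarithm of an underflowed probability.
    """
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    target = sum(y * s for y, s in zip(one_hot, scores))
    return log_norm - target
```

Under this assumed loss, the value shrinks as the score at the exchanged position grows relative to the other positions, which matches the stated goal of training the sampler to find the index of the irrelevant frame.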

FIG. 2 shows the overall pipeline of action recognition using the sampler of the present invention during the inference stage. As shown in FIG. 2, the sampler receives a downsampled video clip as input and generates irrelevance scores. Action-irrelevant frames are removed based on the predicted irrelevance score of each frame. Finally, action recognition is performed using only the remaining selected frames.

Action recognition using the sampler is described in more detail with reference to FIG. 2.

The ultimate goal of the sampler of the present invention is to improve the performance of the action recognition model in terms of both accuracy and computational cost. To this end, all frames of a video clip are input to the trained sampler $S$, which outputs an irrelevance score for each frame. Although the input of the sampler at the inference stage differs from its input at the training stage, the trained sampler can still score each frame of the video clip by its irrelevance to the clip as a whole. Finally, frames with high irrelevance scores are removed from the input video clip, and the remaining frames are provided as input to the action recognition model. Two versions of the frame removal method are used. The first method (V1) removes frames whose irrelevance score is greater than a threshold $\tau$ and then selects a fixed number $N$ of frames at regular intervals (after sorting the irrelevance scores in ascending order). The second method (V2) sets a threshold $\tau$ and uses only the frames whose irrelevance score is smaller than $\tau$. The existing dense prediction method is then applied to the selected frames to recognize the action in the video. Dense prediction is a prediction method commonly used in action recognition: the entire video is divided into clips of a certain length, all of the clips are given to the action recognition model, and all of the predictions are averaged.
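The two removal variants and dense prediction described above can be sketched as follows. This is one possible reading rather than a definitive implementation: the function names are hypothetical, `model` is an assumed callable that returns a list of class probabilities for a clip, and returning the V1 picks in temporal order is an assumption the description does not state.

```python
from typing import Callable, List, Sequence


def select_v1(scores: Sequence[float], tau: float, n: int) -> List[int]:
    """V1: drop frames scoring above tau, then pick n at regular intervals."""
    if n <= 0:
        return []
    kept = [i for i, s in enumerate(scores) if s <= tau]
    kept.sort(key=lambda i: scores[i])     # ascending irrelevance score
    if len(kept) <= n:
        return sorted(kept)
    step = len(kept) / n
    picked = [kept[int(k * step)] for k in range(n)]
    return sorted(picked)                  # assumption: restore temporal order


def select_v2(scores: Sequence[float], tau: float) -> List[int]:
    """V2: keep every frame whose irrelevance score is below tau."""
    return [i for i, s in enumerate(scores) if s < tau]


def dense_predict(
    frames: Sequence[object],
    clip_len: int,
    model: Callable[[Sequence[object]], List[float]],
) -> List[float]:
    """Split frames into clips, run the model on each, average the outputs."""
    if not frames:
        raise ValueError("no frames left after removal")
    clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
    preds = [model(clip) for clip in clips]
    num_classes = len(preds[0])
    return [sum(p[c] for p in preds) / len(preds) for c in range(num_classes)]
```

V1 yields at most $N$ frames, so the cost of the recognition model is bounded in advance, whereas V2 lets the number of frames vary with the content of the clip.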

FIG. 3 shows a method for training a self-supervised sampler according to another embodiment of the present invention.

As shown in FIG. 3, the method for training a CNN-based self-supervised sampler comprises a receiving step (S100), a mixing step (S200), and a learning step (S300).

In the receiving step (S100), a first video and a second video are received for training the self-supervised sampler.

In the mixing step (S200), one irrelevant frame is extracted from the received second video, and the extracted frame is mixed into the first video to generate a first video having an irrelevant frame. In the mixing step, one irrelevant frame of a clip of the received second video is randomly mixed into a clip of the received first video to generate a clip of the first video having an irrelevant frame.

In the learning step (S300), the sampler learns to identify which frame in the mixed first video is the mixed frame. In the learning step, the sampler is trained with the mixed clip of the first video as input so that it finds the index of the irrelevant frame.

The clip of the first video having one irrelevant frame is the input to the sampler, and the index of that frame is the target label. The clip is obtained by shuffling the mini-batch $X$ along the batch axis to obtain a shuffled mini-batch $\tilde{X}$ and then, for each video clip $X_b$ in $X$, exchanging one randomly selected frame $X_b(t)$ with the frame $\tilde{X}_b(t)$ of the $b$-th clip in $\tilde{X}$. The sampler $f$ receives one prepared video clip $\hat{X}_b$, generates the irrelevance score $S_b$ of each frame in that video clip, and updates the parameter $\Theta$ based on a loss function.
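The mixing procedure described above (shuffle the mini-batch along the batch axis, swap one randomly chosen frame per clip with the same-time frame of the shuffled batch, and use a one-hot vector over time as the label) can be illustrated with a small, framework-free Python sketch. The name `mix_batch` is hypothetical, frames are treated as opaque objects, and all clips are assumed to have the same length T. An actual implementation would presumably operate on tensors.

```python
import random
from typing import Any, List, Sequence, Tuple

Clip = Sequence[Any]  # one video clip: a sequence of T frames


def mix_batch(
    batch: Sequence[Clip], rng: random.Random
) -> Tuple[List[List[Any]], List[List[int]]]:
    """Swap one random frame per clip with the same-time frame of a shuffled batch."""
    order = list(range(len(batch)))
    rng.shuffle(order)  # shuffle along the batch axis
    shuffled = [batch[i] for i in order]

    mixed, labels = [], []
    for clip, donor in zip(batch, shuffled):
        t = rng.randrange(len(clip))  # random time index for this clip
        new_clip = list(clip)  # copy so the original batch is not modified
        # Note: if the shuffle maps a clip onto itself, this swap changes nothing.
        new_clip[t] = donor[t]
        one_hot = [0] * len(clip)
        one_hot[t] = 1  # target label: index of the swapped frame
        mixed.append(new_clip)
        labels.append(one_hot)
    return mixed, labels


# Toy batch: 3 clips of 4 frames, each frame tagged as (clip index, time index).
toy = [[(b, t) for t in range(4)] for b in range(3)]
mixed, labels = mix_batch(toy, random.Random(0))
print(labels)
```

Because the labels come from the mixing itself, no human annotation is needed, which is what makes the training self-supervised.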

FIG. 4 illustrates an action recognition method of an action recognition model including a sampler according to another embodiment of the present invention.

As shown in FIG. 4, the action recognition method of the action recognition model including the sampler comprises a receiving step (S1100), a score prediction step (S1200), a frame removal step (S1300), and an action recognition step (S1400).

In the receiving step (S1100), the sampler trained by the training method of FIG. 3 receives a downsampled video clip.

In the score prediction step (S1200), the trained sampler predicts an irrelevance score for each frame of the downsampled video clip. The sampler network $f$ consists of five 3D-CNN layers. Let $X \in \mathbb{R}^{B \times T \times C \times H \times W}$ denote one downsampled input mini-batch of video clips, where $B$ is the batch size, $T$ is the temporal dimension, $C$ is the number of channels, and $H$ and $W$ are the spatial dimensions. The trained sampler predicts the irrelevance score of each frame of the $b$-th video clip $X_b$ in the mini-batch $X$ as in Equation 1. The irrelevance score of each frame in a video clip indicates how far that frame is from the video clip as a whole.

In the frame removal step (S1300), action-irrelevant frames are removed based on the irrelevance score predicted for each frame. Each frame's predicted irrelevance score is compared with a threshold, and frames whose score is higher than the threshold are removed.

In the action recognition step (S1400), the action recognition model performs action recognition on frames selected from those that remain after the irrelevant frames are removed from the video clip. The frames are selected by one of two processes: removing frames whose irrelevance score is greater than the threshold τ and then selecting a fixed number N of frames at regular intervals, or setting a threshold τ and selecting only frames whose irrelevance score is smaller than τ. The selected frames are then used as input, and the dense prediction method is applied to recognize the action in the video.

Hereinafter, a surveillance system using the self-supervised sampler of the present invention is described.

FIG. 5 is a block diagram of a real-time surveillance system in a subway station using the self-supervised sampler of the present invention. As shown in FIG. 5, the system comprises a self-supervised sampler (100), an action recognition model (200), an object detection unit (300), and an object tracking unit (400).

The surveillance system according to another embodiment of the present invention, shown in FIG. 5, is intended to respond quickly, with minimal human labor, to the various accidents that frequently occur in subways. It recognizes various behaviors such as normal behavior, falls on stairs, falls on escalators, fainting, vandalism, assault, and theft, and alerts the user when an abnormal situation occurs. Unlike existing benchmark datasets and action recognition methods, which recognize actions at the video level, the surveillance system of the present invention recognizes the actions of individuals within a video.

Accordingly, the object detection unit (300) and the object tracking unit (400) are used in this surveillance system to track individuals in the video. To build such a system, datasets for the action recognition, object detection, and object tracking tasks are collected, a model is trained for each task, and the trained models are integrated into the subway station surveillance system.

The reason for using the sampler in real-world scenarios is as follows. The surveillance system of the present invention is expected to receive a video stream from a CCTV (500) and recognize individual actions in real time. In untrimmed video, however, redundant, overlapping frames can make up most of the video, so a sampler that trims the video is needed in real scenarios. The sampler can be used to filter out frames in which tracking failed as well as redundant frames, making the final prediction both more accurate and more computationally efficient.

The pipeline of the real-time surveillance system of the present invention is shown in FIG. 5. First, the object detection unit (300) detects and localizes people using an object detection model for real-time object detection, such as the YOLOv5s model, and the object tracking unit (400) recognizes the individual ID of each detected person using an object tracking model such as the DeepSort model. Then, 16 frames for each extracted ID are given to the action recognition model as input to recognize each individual's action. However, because the object detection unit (300) and the object tracking unit (400) are not perfect, the 16 frames do not always belong to exactly the same ID, and if the mixed IDs are not removed, the prediction will be inaccurate, as shown by the red line (Rline) in FIG. 5. Therefore, as shown by the green line (Gline) in FIG. 5, the shuffled frames are removed by the self-supervised sampler, and a prediction that is efficient in both computation and performance is carried out.
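The per-ID flow described above (collect 16 frames for each tracked ID, let the sampler discard frames that belong to a different person, then classify what remains) might be organized as in the following Python sketch. Everything here is hypothetical scaffolding: `make_step`, `sampler`, and `recognizer` are placeholder names, detection and tracking are assumed to have already produced one cropped frame per track ID, and a simple V2-style threshold is assumed for the removal rule.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List, Sequence

CLIP_LEN = 16  # frames collected per ID, as stated in the text


def make_step(
    sampler: Callable[[Sequence[Any]], Sequence[float]],
    recognizer: Callable[[Sequence[Any]], Any],
    tau: float,
    clip_len: int = CLIP_LEN,
) -> Callable[[Dict[Hashable, Any]], Dict[Hashable, Any]]:
    """Build a per-frame step function that buffers crops for each track ID."""
    buffers: Dict[Hashable, List[Any]] = defaultdict(list)

    def step(crops_by_id: Dict[Hashable, Any]) -> Dict[Hashable, Any]:
        results = {}
        for track_id, crop in crops_by_id.items():
            buf = buffers[track_id]
            buf.append(crop)
            if len(buf) < clip_len:
                continue  # keep buffering until a full clip is available
            scores = sampler(buf)  # one irrelevance score per frame
            kept = [f for f, s in zip(buf, scores) if s < tau]
            if kept:
                results[track_id] = recognizer(kept)
            buf.clear()  # start a fresh clip for this ID
        return results

    return step
```

In use, `step` would be called once per incoming video frame with the tracker's output, and it returns a prediction for an ID only on the frame that completes that ID's clip.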

A sufficient amount of high-quality training data is essential for building a deep learning-based surveillance system. Several datasets have been built in subway stations, but because of privacy, security, CCTV access rights, accidents, and similar issues, datasets covering abnormal behavior recognition and vulnerable object tracking in subway stations are still lacking. Therefore, high-risk accidents that frequently occur in subway stations are investigated, and abnormal behaviors and vulnerable objects are defined. An abnormal behavior recognition dataset and a vulnerable object tracking dataset are then built using more than 60 CCTVs in subway stations.

The abnormal behavior recognition dataset consists of approximately 7,300 five-minute clips covering 13 types of abnormal behavior. Video frames with a resolution of 3,840×2,160 are obtained from the CCTV (500) of FIG. 5. The dataset also contains bounding boxes for more than 1 million objects exhibiting abnormal behavior. The abnormal behaviors defined in the dataset are hidden-camera sexual misconduct, escalator falls, falls caused by environmental factors, stair falls, drunken behavior, loitering, fainting, vandalism, kidnapping, assault, theft, wrong-direction approach, and unauthorized entry.

The vulnerable object tracking dataset is built through a process similar to that of the abnormal behavior recognition dataset. It consists of approximately 7,000 clips, each about five minutes long, covering six types of target objects: wheelchairs, visually impaired persons, loiterers, strollers, infants, and children. More than 250,000 objects are labeled.

The abnormal behavior recognition dataset and the vulnerable object tracking dataset are each divided into a training set and a test set at a ratio of 7:3 based on the number of clips.
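As a rough illustration of a clip-level 7:3 split, the sketch below divides a list of clip identifiers into training and test portions. The function name, the use of a seeded shuffle, and the round figure of 7,300 clips are assumptions for the example; the text gives only approximate clip counts and does not say how clips were assigned to each split.

```python
import random
from typing import Iterable, List, Tuple


def split_clips(
    clip_ids: Iterable[int], train_parts: int = 7, test_parts: int = 3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split clip identifiers into train/test sets by clip count."""
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)  # assumed: random assignment of clips
    cut = len(ids) * train_parts // (train_parts + test_parts)
    return ids[:cut], ids[cut:]


train, test = split_clips(range(7300))
print(len(train), len(test))  # 5110 2190
```

Splitting by whole clips rather than by frames keeps all frames of a clip in the same partition, so frames from one recording never appear in both training and test data.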

Implementation details of the real-time surveillance system of the present invention are described with reference to FIG. 5.

The surveillance system of the present invention is designed to take 3,840×2,160 input video from one CCTV and process it on a single Nvidia RTX 3090 GPU. The YOLOv5s model is used for real-time object detection. It is trained on a combination of CrowdHuman, WiderPerson, and the object tracking dataset described above. The final model used in the surveillance system achieves an AP of 92.0 on the test split of the object tracking dataset. DeepSort is adopted as the object tracking model, with the Market-1501 dataset and the object tracking training dataset used as training data. Used together, the YOLOv5s and DeepSort models achieve a MOTA of 72 on the object tracking test dataset, and the object tracking test dataset is used in the surveillance system.

To apply the self-supervised sampler of the present invention to the surveillance system, the sampler is trained on the proposed abnormal behavior recognition training dataset. Specifically, the input is downsampled to 128×128 before being passed to the sampler, and random cropping and rotation are used as augmentation during training.

An experiment for the self-supervised sampler V1 is designed by feeding 64 consecutive video frames to the sampler and extracting a fixed 16 frames based on the result. In the self-supervised sampler V2 experiment, a minimum of 16 and a maximum of 64 video frames are input.

Finally, the MARS model is used for real-time action recognition without optical flow. MARS is initialized with Kinetics-400 pre-trained weights, and only the last fully connected layer is fine-tuned on the abnormal behavior recognition training dataset. When ground-truth (GT) IDs and person bounding boxes are used, the trained MARS model achieves 94% performance.

To verify the effectiveness of the self-supervised sampler in the subway station surveillance system, dense, uniform, and random sampling and the self-supervised sampler V1 and V2 were compared on the abnormal behavior recognition test set. Table 1 shows the performance of every compared method for each abnormal behavior class. Results obtained using each object's GT box and ID are marked "with GT".

As shown in Table 1 (quantitative results; all evaluations were performed using 3D-ResNeXt101 on the abnormal behavior recognition dataset), when the object detection unit and the object tracking unit are used instead of GT, IDs become mixed within clips, and the performance of dense, uniform, and random sampling degrades. The normal class in particular shows 100% performance when GT tracking is used.

If GT is not used, performance drops by more than 30% under the dense, uniform, and random sampling strategies. This means that the normal class is more sensitive than the other classes to the performance of the object detection unit and the object tracking unit. Surprisingly, when the self-supervised sampler is used (both V1 and V2), performance on the normal class reaches 100% again. This shows that the sampler effectively removes the mixed IDs from clips, which is the purpose of the present invention. Given that most behavior in real-world surveillance video is normal, so that minimizing the false alarm rate is important, the self-supervised sampler of the present invention can be considered sufficiently useful.

FIG. 6 illustrates a method of operating a real-time surveillance system in a subway station using a self-supervised sampler according to another embodiment of the present invention.

As shown in FIG. 6, the method of operating the real-time surveillance system in a subway station using the self-supervised sampler comprises an object detection step (S2100), an object tracking step (S2200), a frame selection step (S2300), and an action recognition step (S2400).

In the object detection step (S2100), the object detection unit receives video from a CCTV and detects and localizes people in the video using an object detection model for real-time object detection.

In the object tracking step (S2200), the object tracking unit recognizes the individual ID of each detected person using an object tracking model.

In the frame selection step (S2300), the self-supervised sampler receives, for each recognized ID, a video clip consisting of a plurality of frames, removes the shuffled frames from the video clip, and outputs the remaining frames to the action recognition model. The sampler receives a downsampled video clip, predicts an irrelevance score for each frame of the downsampled video clip, and removes action-irrelevant frames based on the predicted irrelevance score of each frame.

In the action recognition step (S2400), the action recognition model recognizes each person's action using the remaining frames received from the self-supervised sampler. The action recognition model receives as input frames selected from those that remain after the irrelevant frames are removed from the video clip, and performs action recognition.

Although the present invention has been described above with reference to limited embodiments and drawings, it is not limited to those embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description. Therefore, the spirit of the present invention should be understood only from the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the spirit of the present invention.

In addition, the sampler training method and the action recognition method according to the present invention can be implemented as computer-readable code on a computer-readable recording medium. A computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, optical data storage devices, hard disks, and flash drives, as well as media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

Claims

A method of training a CNN-based self-supervised sampler, the method comprising:
a receiving step of receiving a first video and a second video for training;
a mixing step of extracting one irrelevant frame from the received second video and mixing the extracted frame into the first video to generate a first video having an irrelevant frame; and
a learning step of learning to identify which frame of the mixed first video is the mixed-in frame,
wherein the mixing step randomly mixes one irrelevant frame of a clip of the received second video into a clip of the received first video to generate a clip of the first video having the irrelevant frame,
wherein the clip of the first video having the irrelevant frame is obtained by:
a process of shuffling a mini-batch $X$ along the batch axis to obtain $\tilde{X} \in \mathbb{R}^{B \times T \times C \times H \times W}$, where $B$ is the batch size, $T$ is the temporal dimension, $C$ is the number of channels, and $H$ and $W$ are the spatial dimensions; and
a process of, for each video clip $X_b$ in the mini-batch $X$, randomly selecting one frame $X_b(t)$ at time $t$ and exchanging it with the frame $\tilde{X}_b(t)$ of the $b$-th clip in the shuffled mini-batch at time $t$,
wherein the exchange process is expressed by the following formula,

$$\hat{X}_b(t) = \tilde{X}_b(t), \qquad Y_b(t) = 1,$$

where $\hat{X}$ and $Y$ are the inputs and labels for training the self-supervised sampler, $Y_b$ is a one-hot vector for each video clip $\hat{X}_b$, and $Y_b(t)$ is 1 when the frame is exchanged and 0 otherwise.

The method of claim 1, wherein, in the learning step, the self-supervised sampler is trained with the mixed clip of the first video as input to find the index of the irrelevant frame, the clip of the first video having one irrelevant frame being the input to the self-supervised sampler and the index of that frame being the target label.

(Deleted)

The method of claim 2, wherein the self-supervised sampler $f$ receives one prepared video clip $\hat{X}_b$, generates the irrelevance score $S_b$ of each frame in the video clip, and updates the parameter $\Theta$ based on a loss function.

An action recognition method of an action recognition model including a self-supervised sampler, the method comprising:
a receiving step in which the self-supervised sampler trained by the training method of claim 1 receives a downsampled video clip;
a score prediction step of predicting an irrelevance score for each frame of the downsampled video clip;
a frame removal step of removing action-irrelevant frames based on the predicted irrelevance score of each frame; and
an action recognition step in which the action recognition model performs action recognition on frames selected from those remaining after the irrelevant frames are removed from the video clip.

The method of claim 5, wherein the frame removal step compares each frame's predicted irrelevance score with a threshold and removes frames whose score is higher than the threshold.

The method of claim 6, wherein the self-supervised sampler network $f$ consists of five 3D-CNN layers, $X \in \mathbb{R}^{B \times T \times C \times H \times W}$ denotes one downsampled input mini-batch of video clips, where $B$ is the batch size, $T$ is the temporal dimension, $C$ is the number of channels, and $H$ and $W$ are the spatial dimensions, and the trained self-supervised sampler predicts the irrelevance score of each frame of the $b$-th video clip $X_b$ in the mini-batch $X$ as follows,

$$S_b = f(X_b; \Theta),$$

where $S_b$ and $\Theta$ denote the irrelevance scores and the learnable parameters of the self-supervised sampler, respectively, and the irrelevance score of each frame in a video clip indicates how far that frame is from the video clip as a whole.

The method of claim 5, wherein the frames selected from those remaining after the irrelevant frames are removed from the video clip are selected by either a process of removing frames whose irrelevance score is greater than a threshold τ and then selecting a fixed number N of frames at regular intervals, or a process of setting a threshold τ and selecting only frames whose irrelevance score is smaller than τ.

The method of claim 5, wherein the action recognition step recognizes the action in the video by applying a dense prediction method with the selected frames as input.

A video surveillance system using a self-supervised sampler, comprising:
an object detection unit that receives video from a CCTV and detects and localizes people in the video using an object detection model for real-time object detection;
an object tracking unit that recognizes the individual ID of each detected person using an object tracking model;
a self-supervised sampler that is trained by the training method of claim 1, receives, for each recognized ID, a video clip consisting of a plurality of frames, removes shuffled frames from the video clip, and outputs the remaining frames to an action recognition model; and
the action recognition model, which recognizes each person's action using the remaining frames received from the self-supervised sampler.

The video surveillance system of claim 10, wherein the self-supervised sampler receives a downsampled video clip, predicts an irrelevance score for each frame of the downsampled video clip, and removes action-irrelevant frames based on the predicted irrelevance score of each frame, and
wherein the action recognition model performs action recognition on frames selected from those remaining after the irrelevant frames are removed from the video clip.

(Deleted)

A method of operating a video surveillance system using a self-supervised sampler, the method comprising:
an object detection step in which an object detection unit receives video from a CCTV and detects and localizes people in the video using an object detection model for real-time object detection;
an object tracking step in which an object tracking unit recognizes the individual ID of each detected person using an object tracking model;
a frame selection step in which the self-supervised sampler trained by the training method of claim 1 receives, for each recognized ID, a video clip consisting of a plurality of frames, removes shuffled frames from the video clip, and outputs the remaining frames to an action recognition model; and
an action recognition step in which the action recognition model recognizes each person's action using the remaining frames received from the self-supervised sampler.

The method of claim 13, wherein the self-supervised sampler receives a downsampled video clip, predicts an irrelevance score for each frame of the downsampled video clip, and removes action-irrelevant frames based on the predicted irrelevance score of each frame, and
wherein the action recognition model performs action recognition on frames selected from those remaining after the irrelevant frames are removed from the video clip.