KR102574273B1 - Method and apparatus for detecting video motion

Info

Publication number: KR102574273B1
Application number: KR1020220142983A
Authority: KR
Inventors: 김영휘; 남성현; 김선주
Original assignee: 국방과학연구소; 연세대학교 산학협력단
Priority date: 2022-10-31
Filing date: 2022-10-31
Publication date: 2023-09-06

Abstract

The present invention relates to video motion detection. A video motion detection apparatus may include: a frame input unit that receives video frames in a streaming manner; a storage unit that outputs probability values for action time points from the video frames; and an action instance generator that generates an action instance based on the probability values and outputs a video motion detection result on a per-instance basis.

Description

Video motion detection device and method {METHOD AND APPARATUS FOR DETECTING VIDEO MOTION}

The present invention relates to a technique for detecting actions in video in real time.

Understanding what is happening in a video is a long-standing research problem and remains a difficult task in computer vision and machine learning.

Recent advances in deep learning have attracted many researchers to this field and have brought considerable progress in a variety of video understanding tasks, such as action classification, temporal action localization, video question answering, and video summarization. Most of this work treats video understanding in an offline setting, in which the entire video is accessible at inference time.

As streaming video from live streaming services and security surveillance cameras becomes more common, research on how to process video streams continues to grow. In particular, streaming video is now widely used as a tool for remote communication, for example in lectures, video conferences, and YouTube channels.

A streaming video understanding task takes a video of unknown length as input. The entire video cannot be accessed; only the frames from the past up to the currently streamed frame are available.

To overcome these constraints, techniques that analyze streaming video frame by frame and techniques that estimate video actions have been studied, but they have difficulty predicting actions accurately and most are optimized for the offline setting.

Accordingly, there is a need for a new video motion detection technique that can analyze streaming video in real time in an online setting.

Korean Registered Patent Publication No. 10-2420887 (registration published July 15, 2022)
Korean Registered Patent Publication No. 10-0420747 (registration published March 2, 2004)

An embodiment of the present invention proposes a video motion detection apparatus and method that address the online temporal action localization problem by analyzing streaming video in real time in units of action instances.

An embodiment of the present invention also proposes a video motion detection apparatus and method that achieve high accuracy by taking into account the context of an entire action in an online video.

An embodiment of the present invention further proposes a video motion detection apparatus and method that provide low-latency, high-performance detection results by monitoring the context of an action and changes in that context.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

According to an embodiment of the present invention, there may be provided a video motion detection apparatus including: a frame input unit that receives video frames in a streaming manner; a storage unit that outputs probability values for action time points from the video frames; and an action instance generator that generates an action instance based on the probability values and outputs a video motion detection result on a per-instance basis.

Here, the storage unit may include: a feature extractor that encodes the video frames to extract features; an action end detector that searches the features in a forward pass to detect an end point of an action; and an action start detector that searches the features in a backward pass to detect a start point of the action.

The action end detector may include: a multi-head detector that outputs two kinds of probability values from the features; and an end detection refinement unit that outputs a final action end probability value based on the two kinds of probability values.

The multi-head detector may include: an action detection head that outputs a probability value that an action is present; and an action end detection head that outputs an action end probability value, that is, a probability that an end point of the action is present.

Each of the action detection head and the action end detection head may include a long short-term memory (LSTM) network, a fully-connected (FC) layer, and a softmax layer.

The end detection refinement unit may include an FC layer and a softmax layer.

The action start detector may include an LSTM network, an FC layer, and a softmax layer.

The action start detector may output an action start probability value, that is, a probability that a start point of the action is present.

The action start detector may continue the backward search until the point at which the action start probability value, having risen once, falls.

The action start detector may determine, as the final action start probability value, the action start probability value of the frame immediately preceding the point at which the value falls.

The action instance generator may invoke the action start detector when the action end probability value exceeds a preset threshold.

The action instance generator may generate the action instance based on the start point and the end point of the action, and the action instance may include the final action end probability value, the final action start probability value, and the class of the action.

The feature extractor may store the features in the form of a sequence.

According to an embodiment of the present invention, there may be provided a video motion detection method of a video motion detection apparatus, the method including: receiving video frames in a streaming manner; outputting probability values for action time points from the video frames; and generating an action instance based on the probability values and outputting a video motion detection result on a per-instance basis.

Here, outputting the probability values may include: encoding the video frames to extract features; searching the features in a forward direction to detect an end point of an action; and searching the features in a backward direction to detect a start point of the action.

Detecting the end point may include: outputting two kinds of probability values from the features; and outputting a final action end probability value based on the two kinds of probability values.

Outputting the two kinds of probability values may include: outputting a probability value that an action is present; and outputting an action end probability value, that is, a probability that an end point of the action is present.

Detecting the start point may include searching in the backward direction until the point at which the action start probability value, having risen once, falls.

Detecting the start point may include determining, as the final action start probability value, the action start probability value of the frame immediately preceding the point at which the value falls.

The step of detecting the start point of the action may be performed when the action end probability value exceeds a preset threshold.

According to an embodiment of the present invention, there may be provided a computer-readable recording medium storing a computer program, wherein the computer program includes instructions for causing a processor to perform a video motion detection method, and the method includes: receiving video frames in a streaming manner; outputting probability values for action time points from the video frames; and generating an action instance based on the probability values and outputting a video motion detection result on a per-instance basis.

According to an embodiment of the present invention, there may be provided a computer program stored in a computer-readable recording medium, wherein the computer program includes instructions for causing a processor to perform a video motion detection method, and the method includes: receiving video frames in a streaming manner; outputting probability values for action time points from the video frames; and generating an action instance based on the probability values and outputting a video motion detection result on a per-instance basis.

According to an embodiment of the present invention, the online temporal action localization problem is addressed by generating the start point, end point, and class of an action in a streaming video, which enables real-time motion detection in units of instances. In addition, the end point of an action is found first in the forward pass, which provides the high responsiveness required of online tasks, and the start point is then found in the backward pass, which takes the context of the entire action into account and secures high accuracy. Furthermore, when detecting the end of an action, the multi-head detector captures the context of the action and the end detection refinement unit observes how that context changes, so that detection can be performed with lower latency and higher performance than with a single action end detector.

FIG. 1 is a block diagram illustrating a video motion detection apparatus 1 according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a specific configuration of the storage unit 20 of FIG. 1.
FIG. 3 is a block diagram showing a specific configuration of the action end detector 210 of FIG. 2.
FIG. 4 is a block diagram showing a specific configuration of the multi-head detector 212 of FIG. 3.
FIG. 5 is a conceptual diagram illustrating, by way of example, the detailed configuration and operation of the storage unit 20 of the video motion detection apparatus 1 according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted unless they are actually necessary. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of a user or operator. Their definitions should therefore be based on the content of this specification as a whole.

As techniques for analyzing streaming video, there are research approaches that analyze the video on a frame-by-frame basis, such as online action detection and online detection of action start.

Online action detection and online detection of action start predict the action class or the action start point for each frame, but this does not necessarily lead to an understanding at the level of action instances. For example, for a 30-frame video clip containing three consecutive jump actions, the online action detection approach outputs 30 action predictions, one per frame, and cannot predict that there are three jump actions.
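
The limitation in the 30-frame example can be illustrated with a small sketch. The label strings and the helper `group_runs` are hypothetical and are not taken from the source. Under the assumption that the three jumps follow one another without a gap, merging consecutive identical frame-level predictions produces a single run, so the number of instances cannot be recovered from the per-frame output alone.

```python
from itertools import groupby


def group_runs(frame_labels):
    """Merge consecutive identical per-frame labels into (label, start, end) runs."""
    runs = []
    index = 0
    for label, group in groupby(frame_labels):
        length = len(list(group))
        runs.append((label, index, index + length - 1))
        index += length
    return runs


# Hypothetical frame-level output: 30 frames, every frame predicted as "jump",
# although the clip actually contains three back-to-back jumps.
frame_predictions = ["jump"] * 30
runs = group_runs(frame_predictions)
print(len(frame_predictions), "frame predictions ->", len(runs), "run(s):", runs)
```

Naive grouping of this kind only separates instances when some other label happens to fall between them, which is why explicit start and end point detection is useful.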

Accordingly, an embodiment of the present invention applies a process of grouping frames into action instances so that the online temporal action localization task can be performed with an online action detection approach.

An embodiment of the present invention extends online detection of action start to online detection of action end, and proposes a technique that detects both start and end points and applies them to the online temporal action localization task.

To detect the start point and end point of an action as a consistent pair, an embodiment of the present invention proposes a technique for determining the most reliable end point candidate of an action, which is needed to group the frames between the action start and the action instance.

An embodiment of the present invention proposes a video motion detection apparatus and method that address the online temporal action localization problem by analyzing streaming video in real time in units of action instances, achieve high accuracy by taking into account the context of an entire action in an online video, and provide low-latency, high-performance detection results by monitoring the context of an action and changes in that context.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating a video motion detection apparatus 1 according to an embodiment of the present invention.

As shown in FIG. 1, the video motion detection apparatus 1 according to an embodiment of the present invention may include a frame input unit 10, a storage unit 20, and an action instance generator 30.

The frame input unit 10 may receive video frames in real time in a streaming manner.

The storage unit 20 may output probability values for action time points from the video frames input through the frame input unit 10.

The action instance generator 30 may generate an action instance based on the probability values from the storage unit 20 and output a video motion detection result on a per-instance basis.

FIG. 2 is a block diagram showing a specific configuration of the storage unit 20.

As shown in FIG. 2, the storage unit 20 may include a feature extractor 200, an action end detector 210, and an action start detector 220.

The feature extractor 200 may receive video frames from the frame input unit 10 in real time and extract features from them.

The action end detector 210 may search the features extracted by the feature extractor 200 in a forward pass to detect the end point of an action.

As shown in FIG. 3, the action end detector 210 may include a multi-head detector 212 and an end detection refinement unit 214.

The multi-head detector 212 may output two different kinds of probability values from the features extracted by the feature extractor 200.

The end detection refinement unit 214 may output a final action end probability value based on the two kinds of probability values output from the multi-head detector 212.

FIG. 4 is a block diagram showing a specific configuration of the multi-head detector 212 within the action end detector 210 of FIG. 3. The multi-head detector 212 according to an embodiment of the present invention may include an action detection head 212a and an action end detection head (action end head) 212b.

The action detection head 212a may output a probability value that an action is present.

The action end detection head 212b may output an action end probability value, that is, a probability that an end point of the action is present.

Referring again to FIGS. 1 and 2, the action start detector 220 is a means for outputting an action start probability value, that is, a probability that a start point of the action is present. It may search the features extracted by the feature extractor 200 in a backward pass to detect the start point of the action.

The action start detector 220 may continue the backward search until the point at which the action start probability value, having risen once, falls, and may determine the action start probability value of the frame immediately preceding that point as the final action start probability value.
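
The stopping rule of the backward search can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the start probabilities are taken as a precomputed list rather than produced step by step by an LSTM, "the frame immediately preceding that point" is interpreted as the frame visited just before the fall during the backward scan, and the function name `backward_start_search` and the fallback behaviour at the beginning of the buffer are hypothetical.

```python
def backward_start_search(start_probs, end_index):
    """Scan backward from end_index; stop once the probability has risen and then falls.

    Returns (frame_index, probability) of the frame visited just before the fall.
    """
    risen = False
    prev_index, prev_prob = end_index, start_probs[end_index]
    for index in range(end_index - 1, -1, -1):
        prob = start_probs[index]
        if prob > prev_prob:
            risen = True  # the start probability is climbing toward a peak
        elif prob < prev_prob and risen:
            # First fall after a rise: the previously visited frame is the peak.
            return prev_index, prev_prob
        prev_index, prev_prob = index, prob
    # Assumed fallback: the buffer was exhausted before a fall was observed.
    return prev_index, prev_prob


# Hypothetical per-frame start probabilities; the end point was detected at frame 6.
start_probs = [0.1, 0.2, 0.9, 0.6, 0.3, 0.2, 0.1]
print(backward_start_search(start_probs, end_index=6))
```

Because the scan stops at the first fall after a rise, the search typically visits only the frames of the current action rather than the whole stored feature sequence.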

Here, the action start detector 220 may be invoked by the action instance generator 30 when the action end probability value output from the action end detection head 212b of FIG. 4 exceeds a preset threshold. That is, the action instance generator 30 may monitor the action end probability value of the action end detector 210 and, depending on that value, command the invocation of the action start detector 220.
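
The threshold-based invocation can be sketched as a simple monitoring loop. The function `run_stream`, the stub start detector, and the threshold value 0.5 are illustrative assumptions; the source states only that the threshold is preset.

```python
def run_stream(end_probs, threshold, start_detector):
    """Invoke start_detector(frame_index) whenever the end probability exceeds threshold.

    Returns a list of (end_frame, start_frame) pairs.
    """
    pairs = []
    for frame_index, end_prob in enumerate(end_probs):
        if end_prob > threshold:
            # The start detector is only called on demand, after an end is detected.
            pairs.append((frame_index, start_detector(frame_index)))
    return pairs


def stub_start_detector(end_frame):
    """Hypothetical stand-in: pretend every action started three frames earlier."""
    return max(0, end_frame - 3)


end_probs = [0.1, 0.2, 0.8, 0.3, 0.95]
print(run_stream(end_probs, threshold=0.5, start_detector=stub_start_detector))
```

Calling the start detector only after an end point is detected keeps the per-frame cost of the forward pass low, since the backward search runs on demand.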

Here, the action instance generator 30 may generate the above-described action instance based on the start point and the end point of the action.

The action instance generated by the action instance generator 30 may include the final action end probability value generated by the action end detector 210 and the final action start probability value generated by the action start detector 220. The action instance may further include the class of the action.

Accordingly, the action instance generator 30 may generate one action instance by combining the final action end probability value, the final action start probability value, and the class of the action.

FIG. 5 is a conceptual diagram illustrating, by way of example, the detailed configuration and operation of the storage unit 20 of the video motion detection apparatus 1 according to an embodiment of the present invention.

As shown in FIG. 5, the feature extractor 200 may gather a preset number of the video frames input in real time through the frame input unit 10 and encode them. The feature extractor 200 may generate a feature by encoding this preset number of video frames and may store the generated features in the form of a sequence.
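
The gather-encode-store behaviour of the feature extractor can be sketched as a small buffer. The class `FeatureBuffer`, the use of non-overlapping chunks, and the toy `mean_encoder` (which stands in for a real video encoder and treats each frame as a single number) are assumptions made for illustration only.

```python
class FeatureBuffer:
    """Collect a preset number of frames, encode them, and keep the features in order."""

    def __init__(self, chunk_size, encoder):
        self.chunk_size = chunk_size
        self.encoder = encoder
        self.pending = []   # frames waiting to fill the next chunk
        self.features = []  # stored feature sequence, oldest first

    def push(self, frame):
        """Add one streamed frame; return True when a new feature was produced."""
        self.pending.append(frame)
        if len(self.pending) == self.chunk_size:
            self.features.append(self.encoder(self.pending))
            self.pending = []
            return True
        return False


def mean_encoder(frames):
    """Toy encoder (assumption): average the frame values of a chunk."""
    return sum(frames) / len(frames)


buffer = FeatureBuffer(chunk_size=4, encoder=mean_encoder)
for frame in range(10):  # ten hypothetical scalar "frames"
    buffer.push(frame)
print(buffer.features, "pending:", buffer.pending)
```

Keeping the features as an ordered sequence is what allows the same stored data to be read forward for end detection and backward for start detection.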

The action end detector 210 is a means for detecting the end point of an action by searching the features extracted by the feature extractor 200 in the forward direction, and may include the multi-head detector 212 and the end detection refinement unit 214.

The multi-head detector 212 may in turn include the action detection head 212a and the action end detection head 212b, each of which may have a neural network structure including a long short-term memory (LSTM) network, a fully-connected (FC) layer, and a softmax layer. Accordingly, the action detection head 212a may output a probability value that an action is present, and the action end detection head 212b may output an action end probability value, that is, a probability that an end point of the action is present.

The end detection refinement unit 214 is a means for outputting a final action end probability value based on the two kinds of probability values output from the action detection head 212a and the action end detection head 212b of the multi-head detector 212, and may have a neural network structure including an FC layer and a softmax layer.
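
The FC-plus-softmax structure of the refinement step can be sketched in plain Python. This is a sketch under assumptions: the two head outputs are simply concatenated for a single time step, the weights and bias are hand-picked hypothetical values rather than trained parameters, and the function names are illustrative. The source does not specify how the two probability values are combined before the FC layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def refine_end(action_probs, end_probs, weights, bias):
    """Concatenate the two head outputs, apply one FC layer, then softmax."""
    inputs = list(action_probs) + list(end_probs)
    logits = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical head outputs as [negative, positive] probability pairs.
action_probs = [0.2, 0.8]  # from the action detection head
end_probs = [0.3, 0.7]     # from the action end detection head
weights = [[1.0, 0.0, 1.0, 0.0],  # logit for "no end"
           [0.0, 1.0, 0.0, 1.0]]  # logit for "end"
bias = [0.0, 0.0]
print(refine_end(action_probs, end_probs, weights, bias))
```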

The action start detector 220 is a means for detecting the start point of an action by searching the features extracted by the feature extractor 200 in the backward direction, and may have a neural network structure including an LSTM network, an FC layer, and a softmax layer.

The action instance generator 30 may generate one action instance by combining the detected action end point (final action end probability value) and action start point (final action start probability value).

In FIG. 5, the upper graph shows the final action end probability value and the lower graph shows the final action start probability value. In each graph, the x-axis represents video frames and the y-axis represents the probability value.

Here, the action instance may include a vector of the action class, the class probability value, the start point frame, and the end point frame. Accordingly, the output provided by the video motion detection apparatus 1 from the action instance generated by the action instance generator 30 may be expressed as an action range, an action type, and the like. The action range may mean the action start point and action end point described above, and the action type may mean, for example, a specific sport such as volleyball or basketball.
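
The four-element action instance described above could be represented as a small record type. The field names, the `ActionInstance` class, the inclusive frame range, and the example values are assumptions for illustration; the source specifies only which quantities the vector contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionInstance:
    """One detected action: class, class probability, start frame, end frame."""

    action_class: str
    class_prob: float
    start_frame: int
    end_frame: int

    def as_vector(self):
        """Return the instance in the order listed in the description."""
        return (self.action_class, self.class_prob, self.start_frame, self.end_frame)

    def num_frames(self):
        """Length of the action range, assuming both end points are inclusive."""
        return self.end_frame - self.start_frame + 1


# Hypothetical detection result for a streamed sports clip.
instance = ActionInstance("volleyball", 0.91, start_frame=120, end_frame=185)
print(instance.as_vector(), instance.num_frames())
```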

According to the embodiments of the present invention described above, the online temporal action localization problem is addressed by generating the start point, end point, and class of an action in a streaming video, which enables real-time motion detection in units of instances. The end point of an action is found first in the forward pass, which provides the high responsiveness required of online tasks, and the start point is then found in the backward pass, which takes the context of the entire action into account and secures high accuracy. In addition, when detecting the end of an action, the multi-head detector captures the context of the action and the end detection refinement unit observes how that context changes, so that detection can be performed with lower latency and higher performance than with a single action end detector.

한편, 첨부된 블록도의 각 블록과 흐름도의 각 단계의 조합들은 컴퓨터 프로그램 인스트럭션들에 의해 수행될 수도 있다. 이들 컴퓨터 프로그램 인스트럭션들은 범용 컴퓨터, 특수용 컴퓨터 또는 기타 프로그램 가능한 데이터 프로세싱 장비의 프로세서에 탑재될 수 있으므로, 컴퓨터 또는 기타 프로그램 가능한 데이터 프로세싱 장비의 프로세서를 통해 수행되는 그 인스트럭션들이 블록도의 각 블록에서 설명된 기능들을 수행하는 수단을 생성하게 된다.Meanwhile, combinations of each block of the accompanying block diagram and each step of the flowchart may be performed by computer program instructions. Since these computer program instructions may be loaded into a processor of a general-purpose computer, special-purpose computer, or other programmable data processing equipment, the instructions executed by the processor of the computer or other programmable data processing equipment are described in each block of the block diagram. It creates means to perform functions.

These computer program instructions may also be stored in a computer-usable or computer-readable recording medium (or memory) that can direct a computer or other programmable data processing equipment to implement functions in a particular manner, so that the instructions stored in that recording medium (or memory) can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagrams.

The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to create a computer-executed process. The instructions that run on the computer or other programmable data processing equipment can thereby provide steps for executing the functions described in each block of the block diagrams.

In addition, each block may represent a module, a segment, or a portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks may occur out of order. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in reverse order, depending on the functions involved.

1: video motion detection device
10: frame input unit
20: storage unit
30: action instance generation unit
200: feature extraction unit
210: action end detection unit
212: multi-head detection unit
212a: action detection head
212b: action end detection head
214: action end detection improvement unit
220: action start detection unit

Claims

A video motion detection device comprising:
a frame input unit configured to input video frames in a streaming manner;
a storage unit configured to output probability values for action time points from the video frames; and
an action instance generation unit configured to generate an action instance based on the probability values and to output a video motion detection result on a per-instance basis,
wherein the storage unit comprises:
a feature extraction unit configured to extract features by encoding the video frames;
an action end detection unit configured to detect an end point of an action by searching the features in a forward pass; and
an action start detection unit configured to detect a start point of the action by searching the features in a backward pass,
wherein the action end detection unit comprises:
a multi-head detection unit configured to output two types of probability values from the features; and
an action end detection improvement unit configured to output a final action end probability value based on the two types of probability values,
wherein the action start detection unit outputs an action start probability value indicating that the start point of the action exists, continues the backward search until the action start probability value has increased once and then decreases, and determines the action start probability value of the frame preceding the point of decrease as a final action start probability value,
wherein the action instance generation unit generates the action instance based on the start point and the end point of the action, and invokes the action start detection unit when the action end probability value exceeds a preset threshold, so that the action start detection unit is invoked depending on the action end probability value, and
wherein the action instance includes the final action end probability value, the final action start probability value, the class of the action, and the class probability value.

(Deleted)

The video motion detection device of claim 1,
wherein the multi-head detection unit comprises:
an action detection head configured to output a probability value that an action is present; and
an action end detection head configured to output an action end probability value indicating that the end point of the action exists.

The video motion detection device of claim 4,
wherein each of the action detection head and the action end detection head
includes a long short-term memory (LSTM) network, a fully connected (FC) layer, and a softmax layer.

The video motion detection device of claim 1,
wherein the action end detection improvement unit
includes an FC layer and a softmax layer.

The video motion detection device of claim 1,
wherein the action start detection unit
includes an LSTM network, an FC layer, and a softmax layer.

(Deleted)

The video motion detection device of claim 1,
wherein the feature extraction unit
stores the features in sequence form.

A video motion detection method of a video motion detection device, the method comprising:
inputting video frames in a streaming manner;
outputting probability values for action time points from the video frames; and
generating an action instance based on the probability values and outputting a video motion detection result on a per-instance basis,
wherein outputting the probability values comprises:
encoding the video frames to extract features;
detecting an end point of an action by searching the features in a forward direction; and
detecting a start point of the action by searching the features in a reverse direction,
wherein detecting the end point of the action comprises:
outputting two types of probability values from the features; and
outputting a final action end probability value based on the two types of probability values,
wherein detecting the start point of the action comprises searching in the reverse direction until an action start probability value, which indicates that the start point of the action exists, has increased once and then decreases,
wherein detecting the start point of the action further comprises determining the action start probability value of the frame preceding the point of decrease as a final action start probability value,
wherein outputting the video motion detection result on a per-instance basis comprises generating the action instance based on the start point and the end point of the action, the detecting of the start point being invoked when the action end probability value exceeds a preset threshold, so that detection of the start point depends on the action end probability value, and
wherein the action instance includes the final action end probability value, the final action start probability value, the class of the action, and the class probability value.

(Deleted)

15. The method of claim 14,
wherein outputting the two types of probability values comprises:
outputting a probability value that an action is present; and
outputting an action end probability value indicating that the end point of the action exists.

(Deleted)

A computer-readable recording medium storing a computer program,
wherein the computer program
includes instructions for causing a processor to perform a video motion detection method,
the method comprising:
inputting video frames in a streaming manner;
outputting probability values for action time points from the video frames; and
generating an action instance based on the probability values and outputting a video motion detection result on a per-instance basis,
wherein outputting the probability values comprises:
encoding the video frames to extract features;
detecting an end point of an action by searching the features in a forward direction; and
detecting a start point of the action by searching the features in a reverse direction,
wherein detecting the end point of the action comprises:
outputting two types of probability values from the features; and
outputting a final action end probability value based on the two types of probability values,
wherein detecting the start point of the action comprises searching in the reverse direction until an action start probability value, which indicates that the start point of the action exists, has increased once and then decreases,
wherein detecting the start point of the action further comprises determining the action start probability value of the frame preceding the point of decrease as a final action start probability value,
wherein outputting the video motion detection result on a per-instance basis comprises generating the action instance based on the start point and the end point of the action, the detecting of the start point being invoked when the action end probability value exceeds a preset threshold, so that detection of the start point depends on the action end probability value, and
wherein the action instance includes the final action end probability value, the final action start probability value, the class of the action, and the class probability value.

A computer program stored on a computer-readable recording medium,
wherein the computer program
includes instructions for causing a processor to perform a video motion detection method,
the method comprising:
inputting video frames in a streaming manner;
outputting probability values for action time points from the video frames; and
generating an action instance based on the probability values and outputting a video motion detection result on a per-instance basis,
wherein outputting the probability values comprises:
encoding the video frames to extract features;
detecting an end point of an action by searching the features in a forward direction; and
detecting a start point of the action by searching the features in a reverse direction,
wherein detecting the end point of the action comprises:
outputting two types of probability values from the features; and
outputting a final action end probability value based on the two types of probability values,
wherein detecting the start point of the action comprises searching in the reverse direction until an action start probability value, which indicates that the start point of the action exists, has increased once and then decreases,
wherein detecting the start point of the action further comprises determining the action start probability value of the frame preceding the point of decrease as a final action start probability value,
wherein outputting the video motion detection result on a per-instance basis comprises generating the action instance based on the start point and the end point of the action, the detecting of the start point being invoked when the action end probability value exceeds a preset threshold, so that detection of the start point depends on the action end probability value, and
wherein the action instance includes the final action end probability value, the final action start probability value, the class of the action, and the class probability value.