KR102243049B1

KR102243049B1 - Apparatus and method for providing clip video from cctv image

Info

Publication number: KR102243049B1
Application number: KR1020190098595A
Authority: KR
Inventors: 조용범
Original assignee: 건국대학교 산학협력단
Priority date: 2019-08-13
Filing date: 2019-08-13
Publication date: 2021-04-20
Also published as: KR20210019685A

Abstract

An apparatus and method for providing a clip video from CCTV video are disclosed. A method of providing a clip video from CCTV video according to an embodiment of the present application may include: receiving a bit stream of the CCTV video; receiving a user input associated with a predetermined situation to be searched for; determining, through a pre-trained machine learning model and based on the bit stream and the user input, a GOP (Group of Picture) that matches the predetermined situation; determining a GOP set including the determined GOP, the GOPs preceding the determined GOP, and the GOPs following it; and decoding the bit stream corresponding to the GOP set to provide a clip video corresponding to the GOP set.

Description

Device and method for providing clip video from CCTV video {APPARATUS AND METHOD FOR PROVIDING CLIP VIDEO FROM CCTV IMAGE}

The present application relates to an apparatus and method for providing a clip video from CCTV video.

Recently, the number of closed-circuit televisions (CCTVs) installed for various purposes has been steadily increasing. In addition, thanks to advances in imaging technology, CCTV video is becoming sharper and higher in resolution. Accordingly, ever larger storage space must be secured to store high-definition CCTV video, and CCTV video needs to be compressed with high efficiency.

In particular, as crimes and thefts become more frequent, users' expectations of what a CCTV system can do are also rising. Existing CCTV systems are mainly used after the fact, to identify the offender once a crime or theft has occurred. However, in CCTV video recorded continuously over a long period, it is very difficult to pinpoint the moment at which a crime or theft occurred or the moment at which a particular person appears. Conventionally, the only options were for an observer to watch the entire duration of the CCTV video, or to narrow the search to a certain time interval based on witness statements or other evidence.

In this regard, no apparatus or method is known or in common use that automatically detects, in recorded CCTV video, a specific portion containing a motion or object that matches a predetermined situation the observer wants to search for, and provides a clip video of a predetermined length centered on that portion.

The background art of the present application is disclosed in Korean Registered Patent Publication No. 10-1176743.

The present application is intended to solve the above-described problems of the prior art, and aims to provide an apparatus and method that detect, based on a predetermined situation the user wants to search for, the specific portion of CCTV video that matches that situation, and provide the user with a clip video centered on that portion.

The present application is also intended to provide an apparatus and method for providing a clip video from CCTV video that do not need to decode the entire duration of a long CCTV video, so that a large amount of data can be analyzed with few computational resources while searching for a specific situation.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problems, a method of providing a clip video from CCTV video according to an embodiment of the present application may include: receiving a bit stream of the CCTV video; receiving a user input associated with a predetermined situation to be searched for; determining, through a pre-trained machine learning model and based on the bit stream and the user input, a GOP (Group of Picture) that matches the predetermined situation; determining a GOP set including the determined GOP, the GOPs preceding the determined GOP, and the GOPs following it; and decoding the bit stream corresponding to the GOP set to provide a clip video corresponding to the GOP set.

In addition, determining the GOP that matches the predetermined situation may include: dividing the bit stream in units of GOPs, each including a plurality of frames; obtaining header data from the bit stream; obtaining motion data for each of the divided GOPs based on the header data; generating a partially decoded image using the obtained motion data; and determining, through the pre-trained machine learning model and based on the partially decoded image, the GOP corresponding to a partially decoded image that contains motion associated with the predetermined situation as the GOP that matches the predetermined situation.

In addition, generating the partially decoded image may comprise generating the partially decoded image based on motion composition (MC), in units of the prediction units (PUs) of the GOP.

In addition, the preceding GOPs may include a preset number of GOPs that temporally precede the determined GOP, and the following GOPs may include a preset number of GOPs that temporally follow the determined GOP.

In addition, the preceding GOPs may include two or three GOPs that temporally precede the determined GOP, and the following GOPs may include two or three GOPs that temporally follow the determined GOP.

In addition, in determining the GOP set, the GOP set may be determined such that the total playback time of the clip video is 60 seconds or less.
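
The GOP set selection described above can be illustrated with a minimal sketch. The function name `select_gop_set`, the per-GOP duration `gop_seconds`, and the rule of shrinking the window symmetrically to honor the 60-second limit are assumptions made for illustration; the application itself only states the number of neighboring GOPs and the upper bound on playback time.

```python
def select_gop_set(hit, total_gops, neighbors=2, gop_seconds=2.0, max_seconds=60.0):
    """Return the 1-based GOP numbers of the clip around the matched GOP `hit`.

    Assumption: every GOP plays for `gop_seconds`, so a window with k neighbors
    on each side lasts (2 * k + 1) * gop_seconds.
    """
    if not 1 <= hit <= total_gops:
        raise ValueError("hit must be a valid GOP number")
    # Largest symmetric neighbor count that keeps the clip within max_seconds.
    # If a single GOP is already longer than max_seconds, only the hit is kept.
    k_cap = max(int((max_seconds / gop_seconds - 1) // 2), 0)
    k = min(neighbors, k_cap)
    # Clamp the window to the first and last GOP of the stream.
    first = max(hit - k, 1)
    last = min(hit + k, total_gops)
    return list(range(first, last + 1))


print(select_gop_set(10, 100, neighbors=2))                    # GOPs 8 to 12
print(select_gop_set(10, 100, neighbors=3, gop_seconds=15.0))  # GOPs 9 to 11
```

In the second call, three neighbors on each side would give a 105-second clip, so the sketch falls back to one neighbor per side (45 seconds).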

In addition, the predetermined situation may include at least one of a situation associated with a body movement of a person being recorded and a situation in which a predetermined object is detected in the CCTV video.

In addition, the pre-trained machine learning model may have been trained by a deep learning technique.

Meanwhile, an apparatus for providing a clip video from CCTV video according to an embodiment of the present application may include: an input receiving unit that receives a bit stream of the CCTV video and receives a user input associated with a predetermined situation to be searched for; a GOP determination unit that determines, through a pre-trained machine learning model and based on the bit stream and the user input, a GOP (Group of Picture) that matches the predetermined situation; and a clip video providing unit that determines a GOP set including the determined GOP, the GOPs preceding the determined GOP, and the GOPs following it, and decodes the bit stream corresponding to the GOP set to provide a clip video corresponding to the GOP set.

In addition, the GOP determination unit may include: a GOP dividing unit that divides the bit stream in units of GOPs, each including a plurality of frames; a data acquisition unit that obtains header data from the bit stream and obtains motion data for each of the divided GOPs based on the header data; a partial decoding unit that generates a partially decoded image using the obtained motion data; and a machine learning unit that determines, through the pre-trained machine learning model and based on the partially decoded image, the GOP corresponding to a partially decoded image that contains motion associated with the predetermined situation as the GOP that matches the predetermined situation.

In addition, the partial decoding unit may generate the partially decoded image based on motion composition (MC), in units of the prediction units (PUs) of the GOP.

In addition, the clip video providing unit may determine the GOP set such that the total playback time of the clip video is 60 seconds or less.

The above-described means for solving the problems are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to the above-described means, a specific portion of CCTV video that matches a predetermined situation the user wants to search for can be detected, and a clip video centered on that portion can be provided to the user.

According to the above-described means, it is not necessary to decode the entire duration of a long CCTV video, so a large amount of data can be analyzed with few computational resources while searching for a specific situation.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a schematic configuration diagram of a clip video providing system based on CCTV video, including an apparatus for providing a clip video from CCTV video according to an embodiment of the present application.
FIG. 2 is a diagram for explaining the process by which the apparatus for providing a clip video from CCTV video according to an embodiment of the present application determines a GOP that matches the predetermined situation to be searched for and provides a clip video.
FIG. 3 is a schematic configuration diagram of the apparatus for providing a clip video from CCTV video according to an embodiment of the present application.
FIG. 4 is a schematic configuration diagram of the GOP determination unit according to an embodiment of the present application.
FIG. 5 is an operation flowchart of a method of providing a clip video from CCTV video according to an embodiment of the present application.
FIG. 6 is an operation flowchart of a method of determining a GOP that matches a predetermined situation according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can readily practice them. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where yet another member exists between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a schematic configuration diagram of a clip video providing system based on CCTV video, including an apparatus for providing a clip video from CCTV video according to an embodiment of the present application.

Referring to FIG. 1, a clip video providing system 10 based on CCTV video according to an embodiment of the present application may include an apparatus 100 for providing a clip video from CCTV video according to an embodiment of the present application (hereinafter referred to as the "clip video providing apparatus 100"), a CCTV recording device 200, a user terminal 20, and a network 30.

For example, the user terminal 20 may include any kind of wired or wireless communication device, such as a smartphone, SmartPad, tablet PC, desktop computer, or laptop, as well as PCS (Personal Communication System), GSM (Global System for Mobile communication), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), and WiBro (Wireless Broadband Internet) terminals.

The clip video providing apparatus 100, the user terminal 20, and the CCTV recording device 200 may be connected to one another through the network 30. The network 30 refers to a connection structure that allows information to be exchanged between nodes such as terminals and servers. Examples of such a network include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5G network, a WiMAX (Worldwide Interoperability for Microwave Access) network, the Internet, a LAN (Local Area Network), a wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Wi-Fi network, a Bluetooth network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

The clip video providing apparatus 100 may receive a bit stream of CCTV video from the CCTV recording device 200. According to an embodiment of the present application, the CCTV recording device 200 may include, or be linked to, a memory that stores the recorded CCTV video or a bit stream obtained by compressing (encoding) the recorded CCTV video with an encoder, and the clip video providing apparatus 100 may receive the bit stream of the CCTV video from that memory.

In addition, a plurality of CCTV recording devices 200 may be provided, each recording a different space or a different direction. In this case, the clip video providing apparatus 100 may receive a bit stream of CCTV video from each of the plurality of CCTV recording devices 200.

The clip video providing apparatus 100 may receive, from the user terminal 20, a user input associated with a predetermined situation to be searched for. Here, the predetermined situation may include at least one of a situation associated with a body movement of a person recorded in the CCTV video and a situation in which a predetermined object is detected in the CCTV video. It is not limited to these, however, and may be determined in consideration of the various situations that the user of the user terminal 20 wants to search for in the CCTV video. In addition, the predetermined situation may relate to the purpose for which the user (observer) wants to check the CCTV video.

By way of example, the predetermined situation may include a situation in which a body movement of a recorded person, a predetermined object itself, or a movement of a predetermined object is detected, of the kind that may appear in a crime situation such as injury, assault, theft, flight, or arson, or in a disaster situation such as a traffic accident, fire, or explosion.

More specifically, with respect to crime situations, a situation associated with a body movement may be one in which the recorded person is detected climbing over a wall, fleeing, harming someone else, or breaking a door or window. A situation in which an object related to a crime is detected may be one in which a weapon, a vehicle, or a specific tool is detected.

However, the above description of the predetermined situation to be searched for should be understood as an example given to aid understanding.

In addition, according to an embodiment of the present application, the user input associated with the predetermined situation may take various forms, including a voice input type, in which what the user says to a user terminal 20 equipped with a voice recognition function is taken as the input, and a text or keyword input type, in which what the user types through a separate input device such as a keyboard or touch screen is taken as the input.

In addition, according to an embodiment of the present application, the user input may be received as follows: the clip video providing apparatus 100 provides the user terminal 20 with a list of the situations it is able to search for, in the form of a list, combo box, check boxes, or the like, and the user (observer) who receives the list selects a specific searchable situation from it.

In addition, the user input according to an embodiment of the present application may include information on the recording time period, the recorded area, the recording direction, and the like of the CCTV video that the user (observer) wants to search. For example, the user may enter a specific date (or interval), a specific location, or a specific recording direction for which the CCTV video is to be checked.

In addition, the user input according to an embodiment of the present application may include information on the recorded person in the CCTV video that the user (observer) wants to search for. For example, when the user (observer) wants to search for a predetermined situation involving a male subject, this intent may be included in the user input. In this regard, the pre-trained machine learning model described later may take this intent into account when determining the GOP that matches the situation to be searched for. That is, even if there is a GOP containing the body movement the user (observer) is looking for, that GOP may be judged not to match the predetermined situation if the recorded person is female.
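
The kinds of information a user input may carry can be pictured as a small record type. The following is only an illustrative sketch: the class `SearchRequest`, its field names, and the helper `matches_subject` are hypothetical and do not appear in the application, and the attribute check stands in for a judgment that the application assigns to the machine learning model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class SearchRequest:
    """Hypothetical container for the user input described above."""
    situation: str                      # e.g. "climbing over a wall"
    start: Optional[datetime] = None    # recording period to search (optional)
    end: Optional[datetime] = None
    location: Optional[str] = None      # recorded area
    direction: Optional[str] = None     # recording direction
    subject_attributes: dict = field(default_factory=dict)  # e.g. {"gender": "male"}


def matches_subject(request, detected_attributes):
    """True when every requested subject attribute equals the detected one."""
    return all(
        detected_attributes.get(key) == value
        for key, value in request.subject_attributes.items()
    )


request = SearchRequest("climbing over a wall", subject_attributes={"gender": "male"})
print(matches_subject(request, {"gender": "female"}))  # the GOP would be rejected
```

A request without subject attributes accepts any detected subject, which mirrors the case where the user states no such intent.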

In addition, although FIG. 1 shows the user input associated with the predetermined situation being entered through the user terminal 20 and passed to the clip video providing apparatus 100, the present application is not limited to this. Depending on the embodiment, the apparatus may be implemented so that the user (observer) can enter the user input directly into the clip video providing apparatus 100.

The clip video providing apparatus 100 may determine, through a pre-trained machine learning model and based on the bit stream and the user input, a GOP (Group of Picture) that matches the predetermined situation.

Here, a GOP (Group of Picture) may be understood as a unit into which the CCTV video (or the entire bit stream of the CCTV video) is divided, and one GOP may include a plurality of frames. Depending on the embodiment, the GOP may also be referred to as a GOF (Group of Frame). By way of example, the clip video providing apparatus 100 may divide the CCTV video (or its entire bit stream) into N GOPs (GOP 1, GOP 2, ..., GOP N), with each GOP including M frames (Frame 1, Frame 2, ..., Frame M). Throughout this specification, GOP numbers follow playback order: the higher the number between GOP 1 and GOP N, the later the GOP is played. That is, the first GOP in playback order is GOP 1, and the last is GOP N.
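
The numbering convention above can be sketched as follows. This is a simplified illustration that assumes the frames are already available as a list in playback order and that every GOP has the same length M; the function name `split_into_gops` is hypothetical, and a real encoded bit stream would instead be divided at the GOP boundaries that the encoder wrote.

```python
def split_into_gops(frames, frames_per_gop):
    """Group frames (in playback order) into GOPs numbered from 1.

    Returns a dict mapping GOP number -> list of frames. The last GOP may hold
    fewer than `frames_per_gop` frames when the count does not divide evenly.
    """
    if frames_per_gop <= 0:
        raise ValueError("frames_per_gop must be positive")
    gops = {}
    for start in range(0, len(frames), frames_per_gop):
        gop_number = start // frames_per_gop + 1  # GOP 1 is played first
        gops[gop_number] = frames[start:start + frames_per_gop]
    return gops


gops = split_into_gops(list(range(1, 11)), frames_per_gop=4)
print(gops)  # GOP 1 and GOP 2 hold four frames each, GOP 3 holds the last two
```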

In addition, the clip video providing apparatus 100 may determine a GOP set including the determined GOP, the GOPs preceding the determined GOP, and the GOPs following it.

The following describes in detail, with reference to FIG. 2, the process by which the clip video providing apparatus 100 receives the bit stream and the user input, determines on that basis the GOP that matches the situation the user wants to search for, and then provides the user with a clip video based on the determined GOP.

Referring to FIG. 2, the clip video providing apparatus 100 may first divide the bit stream of the CCTV video in units of GOPs, each including a plurality of frames. The divided GOPs 40 may include N GOPs as described above (GOP 1, GOP 2, ..., GOP N).

In addition, the clip video providing apparatus 100 may obtain header data from the bit stream. According to an embodiment of the present application, the header data may be obtained from a header file containing various kinds of image information about the CCTV video, which is the video to be decoded.

For example, as image information about the CCTV video, the header data may include information associated with the color, the luminance, the background area, and the motion of each frame (or each GOP) included in the CCTV video. That the header data may include these types of information can be understood to mean that it may include the position or address, within the entire encoded (compressed) bit stream of the CCTV video, of the bits that represent a given type of information.

In addition, the clip video providing apparatus 100 may obtain motion data for each of the divided GOPs 40 based on the header data.

According to an embodiment of the present application, the motion data may refer to motion-related information contained in the bit stream, as identified on the basis of the header data. For example, the motion data may be specific bits, parameters, flags, motion vectors, or the like that carry information associated with the movement of an object within the GOP.

According to an embodiment of the present application, by analyzing the motion data obtained on the basis of the header data, the clip video providing apparatus 100 may determine whether movement (motion) of a person or object is captured in the partial video corresponding to each GOP.

예를 들어, 정지된 영상만이 촬영된 부분에 대한 GOP와 연계된 헤더 데이터의 경우, 모션 데이터가 포함되지 않거나 모션 데이터 값이 디폴트(Default) 값으로 지정되어 있을 수 있다. 이 경우, 클립 영상 제공 장치(100)는 해당 GOP는 움직임이 포착되지 않는 것으로 판단할 수 있다. 이렇듯 움직임이 포착되지 않는 GOP는 탐색하고자 하는 소정의 상황과 무관한 것으로 분석되어 후술할 기 학습된 기계 학습 모델로 입력되지 않을 수 있다.For example, in the case of header data linked to a GOP for a portion in which only a still image is captured, motion data may not be included or a motion data value may be designated as a default value. In this case, the clip image providing apparatus 100 may determine that the GOP does not capture motion. As such, a GOP in which motion is not captured may be analyzed as being irrelevant to a predetermined situation to be searched, and thus may not be input into a previously learned machine learning model to be described later.

요약하면, 본원의 일 실시예에 따른 클립 영상 제공 장치(100)는, 헤더 데이터에 기초하여 분할된 GOP(40) 각각의 모션 데이터를 획득 및 분석함으로써, GOP에 대응하는 부분 영상이 움직임(모션)을 포함하여, 탐색하고자 하는 소정의 상황과 연계될 가능성이 있는 후보 GOP를 추릴 수 있다.In summary, the clip image providing apparatus 100 according to an embodiment of the present application acquires and analyzes motion data of each of the GOPs 40 divided based on header data, so that a partial image corresponding to the GOP moves (motion ), a candidate GOP that is likely to be linked to a predetermined situation to be searched may be deduced.
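
The candidate-selection step summarized above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `Gop` class, the `DEFAULT_MV` constant, and the choice of motion vectors as the motion data are hypothetical and are introduced only for illustration. The sketch assumes motion data has already been extracted from the header data for each GOP.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed default value meaning "no motion" (hypothetical).
DEFAULT_MV: Tuple[int, int] = (0, 0)


@dataclass
class Gop:
    """Hypothetical container for one GOP and its extracted motion data."""
    index: int
    # Motion vectors (dx, dy); an empty list means no motion data was present.
    motion_vectors: List[Tuple[int, int]] = field(default_factory=list)


def has_motion(gop: Gop) -> bool:
    """True if any motion vector differs from the default value."""
    return any(mv != DEFAULT_MV for mv in gop.motion_vectors)


def candidate_gops(gops: List[Gop]) -> List[Gop]:
    """Keep only GOPs in which motion is captured; the rest are never
    passed on to the machine learning model."""
    return [g for g in gops if has_motion(g)]


gops = [
    Gop(1),                      # no motion data at all
    Gop(2, [(0, 0), (0, 0)]),    # motion data set to the default value
    Gop(3, [(0, 0), (3, -1)]),   # motion captured
]
selected = [g.index for g in candidate_gops(gops)]
```

In this sketch only GOP 3 survives the filter, so only that GOP would be partially decoded and analyzed.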

The clip image providing apparatus 100 may also generate a partially decoded image using the obtained motion data. Here, generating a partially decoded image using the motion data may be understood as performing partial decoding, based on the motion data, on each candidate GOP or each of the divided GOPs 40, so as to produce an image that reflects motion-related information at a level sufficient to judge whether the GOP is related to the predetermined situation to be searched for.

For example, the partially decoded image may include motion-related information but omit background-region information, color-related information, and the like.

That is, when analyzing the motion of objects in the candidate GOPs or the divided GOPs 40, the clip image providing apparatus 100 according to an embodiment of the present application excludes background-region information, color information, and other information unrelated to object motion, and performs partial decoding using only the motion data. It can therefore analyze CCTV images quickly and with fewer computational resources.

According to an embodiment of the present application, the clip image providing apparatus 100 may generate the partially decoded image based on motion composition (MC), in units of the prediction units (PUs) of the divided GOPs 40.

The High Efficiency Video Coding (HEVC) standard adopts an encoding/decoding scheme based on block-based spatial prediction and temporal prediction. HEVC divides a picture to be encoded into rectangular blocks called coding tree units (CTUs). Each CTU may be further divided into smaller rectangular blocks called coding units (CUs), and each CU may include one or more blocks called prediction units (PUs). A PU may be used as the unit for performing the spatial or temporal prediction described above. For example, when a CU is encoded in intra mode, each PU of the CU may have its own spatial prediction direction. When a CU is encoded in inter mode, each PU of the CU may have its own motion vector and associated reference picture.

In this regard, the clip image providing apparatus 100 according to an embodiment of the present application may generate the partially decoded image based on motion composition, in units of the prediction units (PUs) that are encoded in inter mode for temporal prediction in each of the divided GOPs 40.

The clip image providing apparatus 100 may also use the pre-trained machine learning model, with the partially decoded images as input, to determine the GOP corresponding to a partially decoded image that contains motion associated with the predetermined situation to be searched for as the GOP 41 matching the predetermined situation. In this specification, the GOP 41 matching the predetermined situation may also be referred to as the determined GOP 41 (that is, the GOP determined by the pre-trained machine learning model).

Here, the partially decoded image may be the feature on which the machine learning model of the present application is trained. That is, the machine learning model may be a model that predicts, from a partially decoded image, which situation the motion or object identified in that image is associated with. Furthermore, the machine learning model may judge whether the predicted situation matches the predetermined situation that the user (observer) wants to search for, and output the result of that judgment.

For example, when the predetermined situation that the user (observer) wants to search for is "a person falling down", the machine learning model may analyze the partially decoded image of each of the divided GOPs 40 and, when the motion found in a partially decoded image matches "a person falling down", determine the GOP corresponding to that partially decoded image as the GOP 41 matching the predetermined situation.

Referring to FIG. 2, the machine learning model may determine a single GOP 41 matching the predetermined situation; however, the clip image providing apparatus 100 according to an embodiment of the present application may also be implemented to determine a plurality of GOPs 41 matching the predetermined situation.

For example, the machine learning model may probabilistically analyze whether each GOP matches the predetermined situation and determine the GOP with the highest matching probability as the single GOP 41 matching the predetermined situation. As another example, the machine learning model may determine every GOP whose matching probability is greater than or equal to a preset threshold probability as a GOP 41 matching the predetermined situation.
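
The two selection strategies just described (highest probability versus threshold) can be sketched as follows. This is an illustrative sketch only: the function names and the dictionary that maps GOP indices to model probabilities are assumptions, and the probabilities are made-up example values rather than results from the disclosed model.

```python
from typing import Dict, List


def select_best(probs: Dict[int, float]) -> List[int]:
    """Return the single GOP index with the highest matching probability."""
    if not probs:
        return []
    return [max(probs, key=probs.get)]


def select_above(probs: Dict[int, float], threshold: float) -> List[int]:
    """Return every GOP index whose probability is >= the preset threshold."""
    return sorted(i for i, p in probs.items() if p >= threshold)


# Hypothetical per-GOP probabilities of matching the searched situation.
probs = {3: 0.20, 7: 0.90, 12: 0.60}

best = select_best(probs)          # single determined GOP
many = select_above(probs, 0.5)    # possibly several determined GOPs
```

With these example values, the first strategy yields GOP 7 alone, while a threshold of 0.5 yields GOP 7 and GOP 12.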

According to an embodiment of the present application, the pre-trained machine learning model may have been trained using a deep learning technique. This is only an illustrative description to aid understanding and should not be construed as limiting the application of other machine learning approaches to the present idea. For example, the model may be based on machine learning techniques such as a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), logistic regression, or a random forest.

In addition, according to an embodiment of the present application, the pre-trained machine learning model may be a model that has received a large number of images in advance (in particular, partially decoded images that reflect motion-related information) and has repeatedly learned the correlation between the motion-related information of each image and the situation that the motion represents. Accordingly, when a partially decoded image of one of the divided GOPs 40 is newly input to the pre-trained machine learning model, the model can predict, based on the learned correlation, which situation the motion or object contained in the partially decoded image is associated with. Furthermore, the clip image providing apparatus 100 according to an embodiment of the present application may judge whether the situation predicted by the pre-trained machine learning model for the partially decoded image matches the predetermined situation that the user (observer) wants to search for, and thereby determine the GOP 41 matching the predetermined situation, which serves as the basis of the clip image 50. In addition, as CCTV images are repeatedly provided to the clip image providing apparatus 100, training of the machine learning model may accumulate, so that its prediction or analysis performance may improve.

The clip image providing apparatus 100 may also determine a GOP set 44 that includes the determined GOP 41 together with the preceding GOPs 42 and the following GOPs 43 of the determined GOP 41.

According to an embodiment of the present application, the preceding GOPs 42 may include a preset number of GOPs that temporally precede the determined GOP 41. For example, the preceding GOPs 42 may include two or three GOPs that temporally precede the determined GOP, but are not limited thereto.

For example, when the clip image providing apparatus 100 determines GOP 7 as the GOP matching the predetermined situation that the user wants to search for, the preceding GOPs 42 may include GOP 5 and GOP 6 if they comprise two GOPs, or GOP 4 to GOP 6 if they comprise three GOPs.

According to an embodiment of the present application, the following GOPs 43 may include a preset number of GOPs that temporally follow the determined GOP 41. For example, the following GOPs 43 may include two or three GOPs that temporally follow the determined GOP, but are not limited thereto.

For example, when the clip image providing apparatus 100 determines GOP 7 as the GOP matching the predetermined situation that the user wants to search for, the following GOPs 43 may include GOP 8 and GOP 9 if they comprise two GOPs, or GOP 8 to GOP 10 if they comprise three GOPs.

Likewise, when the clip image providing apparatus 100 determines GOP 7 as the GOP matching the predetermined situation and the preceding GOPs 42 and the following GOPs 43 each comprise three GOPs, the GOP set 44 may be determined to include GOP 4 to GOP 10.
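
The GOP 7 example can be reproduced with a short Python sketch. The function name `gop_set` is hypothetical, and clamping the window to the range 1..N at the start and end of the stream is an assumption; the description does not state how boundary cases are handled.

```python
def gop_set(determined: int, n_before: int, n_after: int, n_total: int) -> range:
    """Return the 1-based GOP indices of the set around the determined GOP.

    Clamping to [1, n_total] is an assumption for GOPs near either end
    of the stream.
    """
    first = max(1, determined - n_before)
    last = min(n_total, determined + n_after)
    return range(first, last + 1)


# Determined GOP 7 in a stream assumed to contain 20 GOPs.
three_each = list(gop_set(7, 3, 3, 20))   # GOP 4 to GOP 10
two_each = list(gop_set(7, 2, 2, 20))     # GOP 5 to GOP 9
near_start = list(gop_set(2, 3, 3, 20))   # clamped at GOP 1
```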

The clip image providing apparatus 100 may also decode the bit stream corresponding to the GOP set 44 to provide the clip image 50 corresponding to the GOP set 44.

According to an embodiment of the present application, when the pre-trained machine learning model determines a plurality of GOPs 41, a GOP set 44 may be determined for each determined GOP 41, resulting in a plurality of GOP sets 44. Accordingly, a plurality of clip images 50, one for each GOP set 44, may be provided. Here, the clip image providing apparatus 100 may be implemented to provide, together with each clip image 50, the probability that the clip image matches the predetermined situation to be searched for.

In addition, depending on the embodiment, when providing a plurality of clip images 50, the clip image providing apparatus 100 may provide them in descending order of the probability of matching the predetermined situation to be searched for.
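
As a small illustration of providing several clips together with their probabilities in descending order, the following Python sketch may help. The `Clip` type and the probability values are hypothetical and stand in for whatever the apparatus actually returns.

```python
from typing import List, NamedTuple, Tuple


class Clip(NamedTuple):
    """Hypothetical record for one clip: its GOP range and match probability."""
    gop_range: Tuple[int, int]
    probability: float


def order_clips(clips: List[Clip]) -> List[Clip]:
    """Sort clips so that the most probable match is presented first."""
    return sorted(clips, key=lambda c: c.probability, reverse=True)


clips = [
    Clip((2, 8), 0.61),
    Clip((20, 26), 0.93),
    Clip((40, 46), 0.75),
]
ordered = order_clips(clips)
```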

In addition, according to an embodiment of the present application, when a plurality of GOPs 41 matching the predetermined situation are determined and their determined GOPs 41, preceding GOPs 42, or following GOPs 43 partially overlap, the clip image providing apparatus 100 may merge the overlapping GOP sets 44 and provide a single clip image 50.

For example, suppose that GOP 5 and GOP 8 are determined as GOPs 41 matching the predetermined situation, and that the preceding GOPs 42 and the following GOPs 43 each comprise three GOPs. The GOP set 44 for GOP 5, the first GOP 41 matching the predetermined situation, is then GOP 2 to GOP 8, and the GOP set 44 for GOP 8, the second GOP 41 matching the predetermined situation, is GOP 5 to GOP 11. Because the following GOPs 43 of GOP 5 overlap the preceding GOPs 42 of GOP 8, the clip image providing apparatus 100 may provide the clip image 50 based on a single GOP set 44 comprising GOP 2 to GOP 11.
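
The merging described above behaves like a standard interval merge. The sketch below represents each GOP set as an inclusive `(first, last)` pair; that representation and the name `merge_gop_sets` are assumptions made for illustration. Only sets that share at least one GOP are merged, in line with the overlap condition in the text.

```python
from typing import List, Tuple

GopRange = Tuple[int, int]  # inclusive (first GOP index, last GOP index)


def merge_gop_sets(sets: List[GopRange]) -> List[GopRange]:
    """Merge GOP sets that share at least one GOP into a single set."""
    merged: List[GopRange] = []
    for first, last in sorted(sets):
        if merged and first <= merged[-1][1]:
            # Overlaps the previous set: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


# Determined GOPs 5 and 8 with three preceding and three following GOPs each.
overlapping = merge_gop_sets([(2, 8), (5, 11)])
separate = merge_gop_sets([(2, 8), (20, 26)])
```

For the GOP 5 and GOP 8 example the result is the single set GOP 2 to GOP 11, whereas sets that do not overlap are left as separate clips.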

In addition, according to an embodiment of the present application, the clip image providing apparatus 100 may determine the GOP set 44 so that the total playback time of the clip image 50 is 60 seconds or less. Here, determining the GOP set 44 may be understood as determining at least one of the number of GOPs to be included in the preceding GOPs 42 and the number of GOPs to be included in the following GOPs 43.

However, the total playback time of the clip image 50 is not limited to 60 seconds. The total playback time may be determined differently depending on the type of the predetermined situation that the user wants to search for (that is, the number of GOPs included in the preceding GOPs 42 or the following GOPs 43 may be determined differently). Depending on the embodiment, the user input that the user of the user terminal 20 sends to the clip image providing apparatus 100 may include the playback time that the user wants for the clip image. For example, when a user who wants to check a five-minute clip sets the playback time of the clip image to 5 minutes (300 seconds), the clip image providing apparatus 100 may determine the number of GOPs to be included in the preceding GOPs 42 or the following GOPs 43 so that the playback time of the clip image corresponding to the GOP set 44 comes within a predetermined margin of the configured 5 minutes (300 seconds).
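
One possible way to derive the number of preceding and following GOPs from a target playback time is sketched below. The description does not specify this calculation, so several assumptions are made and should be read as such: every GOP is assumed to have the same fixed duration (`gop_seconds`), the surrounding context is split as evenly as possible with any odd GOP placed after the determined GOP, and the determined GOP is always included even if it alone exceeds the target.

```python
from typing import Tuple


def context_counts(target_seconds: float, gop_seconds: float) -> Tuple[int, int]:
    """Return (preceding, following) GOP counts so that the clip does not
    exceed target_seconds, assuming a fixed GOP duration (hypothetical)."""
    # Number of whole GOPs that fit in the target; at least the determined GOP.
    total = max(1, int(target_seconds // gop_seconds))
    extra = total - 1            # GOPs available besides the determined one
    before = extra // 2
    after = extra - before       # an odd leftover GOP goes after the event
    return before, after


# Default 60-second limit and a user-requested 300 seconds, with 2-second GOPs.
default_counts = context_counts(60, 2.0)
five_minutes = context_counts(300, 2.0)
```

With 2-second GOPs, a 60-second limit gives 14 preceding and 15 following GOPs (30 GOPs in total), and a 300-second request gives 74 and 75.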

Once the GOP set 44 has been determined, the clip image providing apparatus 100 may decode the bit stream corresponding to the GOP set 44 to provide the clip image 50 corresponding to the GOP set 44.

According to an embodiment of the present application, the clip image providing apparatus 100 may fully decode the bit stream corresponding to the GOP set 44 in order to generate the clip image 50 corresponding to the GOP set 44. The clip image 50 is generated by the clip image providing apparatus 100 and is ultimately provided to the user (observer); it is the portion of the CCTV image that is provided so that the user can check it visually. Therefore, unlike the partial decoding used to generate the partially decoded images that are input to the machine learning model, the full decoding used to generate the clip image 50 may refer to decoding that takes into account all of the information contained in the bit stream.

That is, whereas the partially decoded image input to the machine learning model is decoded only to the level needed to identify motion-related information and therefore does not include background-region information, color-related information, and the like, the fully decoded clip image 50 may be generated by reflecting all such information, including background-region information and color-related information.

In addition, according to an embodiment of the present application, the clip image providing apparatus 100 may provide the generated clip image 50 to the user terminal 20 or to a separate video playback device (not shown).

To summarize, the clip image providing apparatus 100 according to an embodiment of the present application may receive from the user a user input describing the situation that the user wants to search for, receive the compressed bit stream of a CCTV image, identify on that basis the section of the CCTV image that matches the situation, and provide a clip image in which a predetermined amount of playback time is added before and after the identified section. Accordingly, the user can obtain the desired footage simply by conveying the situation to be searched for to the clip image providing apparatus 100, without having to check or play back the entire time span of the CCTV image captured by the CCTV image capturing apparatus 200.

With a conventional CCTV system, it was very difficult for an observer to locate, within CCTV footage recorded continuously over a long period, the footage close to the time at which a given situation occurred. The entire bit stream of the CCTV image had to be decoded, and without additional information such as witness statements or other evidence, the observer had to watch the entire time span of the CCTV image. In contrast, the clip image providing apparatus 100 of the present application can search for the portion related to the situation that the user (observer) wants to find regardless of the total length of the recorded CCTV image, and can provide a short clip image 50 obtained by decoding the bit stream of the portion found.

In addition, with respect to the amount of computation and the processing speed required to decode footage for the user, a conventional CCTV system has no choice but to decode the entire given bit stream; it cannot decode only a specific portion of the bit stream that satisfies a given condition and provide that portion to the observer. In contrast, the clip image providing apparatus 100 of the present application can generate the clip image 50 by decoding only the portion of the bit stream that corresponds to the part associated with the predetermined situation that the observer wants to search for. It can therefore provide the footage that the user (observer) wants more quickly and with fewer computational resources than a conventional CCTV system. In other words, because the clip image providing apparatus 100 only needs to decode the bit stream corresponding to the portion finally found, the amount of computation required for decoding can be reduced substantially.

Moreover, as described above, the portion related to the situation that the user (observer) wants to find can be searched for by considering only motion-related information (that is, based on the motion data) rather than all of the information contained in the bit stream. A large amount of CCTV footage can therefore be analyzed quickly and with fewer computational resources, so that the relevant portion can be determined promptly.

FIG. 3 is a schematic configuration diagram of an apparatus for providing a clip image from a CCTV image according to an embodiment of the present application.

Referring to FIG. 3, the apparatus 100 for providing a clip image from a CCTV image according to an embodiment of the present application (hereinafter, the "clip image providing apparatus 100") may include an input receiving unit 110, a GOP determination unit 120, and a clip image providing unit 130.

The input receiving unit 110 may receive the bit stream of a CCTV image and a user input associated with the predetermined situation to be searched for. According to an embodiment of the present application, the input receiving unit 110 may receive the bit stream of the CCTV image from the CCTV image capturing apparatus 200 and the user input from the user terminal 20.

The GOP determination unit 120 may determine, through the pre-trained machine learning model and based on the bit stream and the user input received by the input receiving unit 110, the GOP (Group of Picture) matching the predetermined situation that the user wants to search for. The detailed configuration and operation of the GOP determination unit 120 are described later with reference to FIG. 4.

The clip image providing unit 130 may determine the GOP set 44, which includes the determined GOP 41 together with the preceding GOPs 42 and the following GOPs 43 of the determined GOP 41.

According to an embodiment of the present application, the clip image providing unit 130 may determine the GOP set 44 so that the total playback time of the clip image is 60 seconds or less.

The clip image providing unit 130 may also decode the bit stream corresponding to the GOP set 44 to provide the clip image 50 corresponding to the GOP set 44.

FIG. 4 is a schematic configuration diagram of the GOP determination unit according to an embodiment of the present application.

Referring to FIG. 4, the GOP determination unit 120 according to an embodiment of the present application may include a GOP division unit 121, a data acquisition unit 122, a partial decoding unit 123, and a machine learning unit 124.

The GOP division unit 121 may divide the bit stream into GOPs, each of which includes a plurality of frames.

The data acquisition unit 122 may obtain header data from the bit stream and, based on the obtained header data, obtain motion data for each of the divided GOPs.

The partial decoding unit 123 may generate a partially decoded image using the obtained motion data. According to an embodiment of the present application, the partial decoding unit 123 may generate the partially decoded image based on motion composition (MC), in units of the prediction units (PUs) of a GOP (for example, each of the divided GOPs 40).

The machine learning unit 124 may use the pre-trained machine learning model, with the partially decoded images as input, to determine the GOP corresponding to a partially decoded image that contains motion associated with the predetermined situation to be searched for as the GOP matching the predetermined situation.

Hereinafter, the operation flow of the present application is briefly described based on the details given above.

FIG. 5 is an operation flowchart of a method of providing a clip image from a CCTV image according to an embodiment of the present application.

The method of providing a clip image from a CCTV image shown in FIG. 5 may be performed by the apparatus 100 for providing a clip image from a CCTV image described above. Accordingly, even where omitted below, the description given for the apparatus 100 applies equally to the description of the method of providing a clip image from a CCTV image.

Referring to FIG. 5, in step S510, the input receiving unit 110 may receive a bit stream for a CCTV image.

Next, in step S520, the input receiving unit 110 may receive a user input associated with a predetermined situation to be searched.

Next, in step S530, the GOP determination unit 120 may determine, through a pre-trained machine learning model and based on the bit stream and the user input, a GOP (Group of Picture) that matches the predetermined situation to be searched.

Next, in step S540, the clip image providing unit 130 may determine a GOP set 44 that includes the determined GOP 41 together with a preceding GOP 42 and a following GOP 43 of the determined GOP.

Next, in step S550, the clip image providing unit 130 may decode the bit stream corresponding to the GOP set 44 to provide a clip image 50 corresponding to the GOP set 44.

In the above description, steps S510 to S550 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
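As a rough illustration of step S540, the sketch below picks the indices of the GOPs that would make up a GOP set around a determined GOP. The function name `select_gop_set`, its parameters, the default of two neighbours on each side, the uniform per-GOP duration `gop_seconds`, and the trimming strategy are all assumptions made for this example. The claims state only that two or three preceding and following GOPs may be included and that the total playback time of the clip may be limited to 60 seconds or less.

```python
from typing import List, Optional


def select_gop_set(
    determined: int,
    total_gops: int,
    preceding: int = 2,
    following: int = 2,
    gop_seconds: Optional[float] = None,
    max_seconds: float = 60.0,
) -> List[int]:
    """Return indices of the determined GOP plus its neighbours.

    The window is clamped to the valid index range. If a per-GOP duration
    is given, the outermost neighbours are dropped (widest side first)
    until the clip fits within max_seconds; the determined GOP is kept.
    """
    if not 0 <= determined < total_gops:
        raise ValueError("determined GOP index is out of range")

    first = max(determined - preceding, 0)
    last = min(determined + following, total_gops - 1)

    if gop_seconds is not None:
        while (last - first + 1) * gop_seconds > max_seconds and last > first:
            if determined - first >= last - determined:
                first += 1  # drop the earliest preceding GOP
            else:
                last -= 1   # drop the latest following GOP

    return list(range(first, last + 1))
```

For example, `select_gop_set(5, 10)` selects the determined GOP with two neighbours on each side, and a call near the start of the stream simply returns fewer preceding GOPs. Only the bit stream belonging to the returned indices would then need full decoding in step S550.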

FIG. 6 is an operation flowchart of a method of determining a GOP that matches a predetermined situation according to an embodiment of the present application.

The method of determining a GOP that matches a predetermined situation shown in FIG. 6 may be performed by the apparatus 100 for providing a clip image from a CCTV image described above. Therefore, even where details are omitted below, the description given for the apparatus 100 applies equally to the method of determining a GOP that matches a predetermined situation.

Referring to FIG. 6, in step S610, the GOP dividing unit 121 may divide the bit stream in units of GOPs, each including a plurality of frames.

Next, in step S620, the data acquisition unit 122 may acquire header data from the bit stream.

Next, in step S630, the data acquisition unit 122 may acquire motion data of each of the divided GOPs 40 based on the header data.

Next, in step S640, the partial decoding unit 123 may generate a partially decoded image using the acquired motion data.

Next, in step S650, the machine learning unit 124 may determine, through a pre-trained machine learning model and based on the partially decoded images, that a GOP corresponding to a partially decoded image containing motion associated with the predetermined situation to be searched is the GOP 41 that matches the predetermined situation.

In the above description, steps S610 to S650 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
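The flow of steps S610 to S650 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names `split_into_gops` and `find_matching_gops` are hypothetical, the stream is represented as a plain sequence of frame-type letters, a new GOP is assumed to start at every I-frame, and the motion extraction, partial decoding, and trained model are passed in as caller-supplied callables rather than implemented here.

```python
from typing import Any, Callable, List, Sequence


def split_into_gops(frame_types: Sequence[str]) -> List[List[int]]:
    """Group frame indices into GOPs, starting a new GOP at each "I" frame.

    Frames that appear before the first I-frame are kept in a leading group.
    """
    gops: List[List[int]] = []
    for index, frame_type in enumerate(frame_types):
        if frame_type == "I" or not gops:
            gops.append([])
        gops[-1].append(index)
    return gops


def find_matching_gops(
    gops: Sequence[Any],
    extract_motion: Callable[[Any], Any],
    partially_decode: Callable[[Any], Any],
    matches_situation: Callable[[Any], bool],
) -> List[int]:
    """Return indices of GOPs whose partially decoded image matches.

    extract_motion    stands in for steps S620 and S630 (header, then motion data)
    partially_decode  stands in for step S640
    matches_situation stands in for the pre-trained model used in step S650
    """
    matched: List[int] = []
    for index, gop in enumerate(gops):
        motion = extract_motion(gop)
        image = partially_decode(motion)
        if matches_situation(image):
            matched.append(index)
    return matched
```

Passing the stages in as callables keeps the sketch independent of any particular codec library or model framework, neither of which is named in the source.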

The method of providing a clip image from a CCTV image according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

In addition, the method of providing a clip image from a CCTV image described above may also be implemented in the form of a computer program or application that is stored in a recording medium and executed by a computer.

The foregoing description of the present application is for illustrative purposes, and those of ordinary skill in the art to which the present application pertains will understand that it can easily be modified into other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

10: system for providing a clip image based on a CCTV image
100: apparatus for providing a clip image from a CCTV image
110: input receiving unit
120: GOP determination unit
121: GOP dividing unit
122: data acquisition unit
123: partial decoding unit
124: machine learning unit
130: clip image providing unit
200: CCTV imaging device
20: user terminal
30: network
40: divided GOPs
41: determined GOP
42: preceding GOP
43: following GOP
44: GOP set
50: clip image

Claims

A method for providing a clip image from a CCTV image, the method comprising:
receiving a bit stream obtained by compressing a CCTV image;
receiving a user input associated with a predetermined situation to be searched;
determining a GOP (Group of Picture) corresponding to the predetermined situation through a pre-trained machine learning model, based on the bit stream and the user input;
determining a GOP set including the determined GOP, and a preceding GOP and a following GOP of the determined GOP; and
decoding a bit stream corresponding to the GOP set to provide a clip image corresponding to the GOP set,
wherein the determining of the GOP corresponding to the predetermined situation comprises:
dividing the compressed bit stream in units of GOPs each including a plurality of frames;
obtaining header data from the compressed bit stream;
obtaining motion data of each of the divided GOPs based on the header data;
generating a partially decoded image using the obtained motion data; and
determining, through the machine learning model and based on the partially decoded image, a GOP corresponding to a partially decoded image that includes motion associated with the predetermined situation to be searched as the GOP corresponding to the predetermined situation,
and
wherein, in the generating of the partially decoded image,
data including at least one of color information, luminance information, and background area information included in the header data is not decoded, and only the motion data is decoded.

(Deleted)

The method of claim 1,
wherein the generating of the partially decoded image comprises
generating the partially decoded image based on motion composition (MC), in units of prediction units (PUs) of the GOP.

The method of claim 1,
wherein the preceding GOP includes a predetermined number of GOPs that temporally precede the determined GOP,
the following GOP includes a predetermined number of GOPs that temporally follow the determined GOP,
the preceding GOP
includes two or three GOPs that temporally precede the determined GOP, and
the following GOP
includes two or three GOPs that temporally follow the determined GOP.

The method of claim 4,
wherein the determining of the GOP set comprises
determining the GOP set so that the total playback time of the clip image is 60 seconds or less.

The method of claim 1,
wherein the predetermined situation
includes at least one of a situation associated with a body movement of a subject and a situation in which a predetermined object is detected in the CCTV image.

The method of claim 1,
wherein the pre-trained machine learning model
is trained by a deep learning technique.

An apparatus for providing a clip image from a CCTV image, the apparatus comprising:
an input receiving unit that receives a bit stream obtained by compressing a CCTV image and receives a user input associated with a predetermined situation to be searched;
a GOP determination unit that determines a GOP (Group of Picture) corresponding to the predetermined situation through a pre-trained machine learning model, based on the bit stream and the user input; and
a clip image providing unit that determines a GOP set including the determined GOP, and a preceding GOP and a following GOP of the determined GOP, and provides a clip image corresponding to the GOP set by decoding a bit stream corresponding to the GOP set,
wherein
the GOP determination unit comprises:
a GOP dividing unit that divides the compressed bit stream in units of GOPs each including a plurality of frames;
a data acquisition unit that acquires header data from the compressed bit stream and acquires motion data of each of the divided GOPs based on the header data;
a partial decoding unit that generates a partially decoded image using the acquired motion data; and
a machine learning unit that determines, through the machine learning model and based on the partially decoded image, a GOP corresponding to a partially decoded image that includes motion associated with the predetermined situation to be searched as the GOP corresponding to the predetermined situation,
and
wherein the partial decoding unit
does not decode data including at least one of color information, luminance information, and background area information included in the header data, and decodes only the motion data.

(Deleted)

The apparatus of claim 8,
wherein the partial decoding unit
generates the partially decoded image based on motion composition (MC), in units of prediction units (PUs) of the GOP.

The apparatus of claim 8,
wherein the clip image providing unit
determines the GOP set so that the total playback time of the clip image is 60 seconds or less.

A computer-readable recording medium on which a program for executing, on a computer, the method of any one of claims 1 and 3 to 7 is recorded.