KR102147100B1

KR102147100B1 - Method for generating video synopsis by identifying target object using plurality of image devices and system for performing the same

Info

Publication number: KR102147100B1
Application number: KR1020180039313A
Authority: KR
Inventors: 김익재; 조정현; 최희승; 남기표
Original assignee: 한국과학기술연구원
Priority date: 2018-04-04
Filing date: 2018-04-04
Publication date: 2020-08-25
Also published as: KR20190119229A

Abstract

Embodiments of the present invention use a plurality of pre-installed cameras to identify a specific object of interest among the objects included in the videos, and then generate a video summary that condenses the footage into a short duration. In addition, a movement path of the object of interest may be generated from the location information of the imaging devices that captured the videos used for the video summary, and provided to the user.

Description

A method of generating a video summary by identifying an object of interest using a plurality of imaging devices, and a system for performing the same.

Embodiments of the present invention relate to generating a video synopsis and, more particularly, to a technique for identifying an object of interest (target object) that a user wishes to monitor in videos captured by one or more cameras, and for generating a video summary of that object of interest.

Imaging devices such as CCTV cameras and dashboard cameras (black boxes) have recently come into wide use in daily life. The video they capture is useful in many fields and is used especially actively in public safety, for example in security and criminal investigation. For instance, videos from multiple imaging devices are used to efficiently determine the movement path of a suspect or a missing person. However, when a video is long, the viewer has the inconvenience of monitoring the recorded footage all the way to the end.

FIG. 1 is a conceptual diagram of generating a video summary according to the prior art. Referring to FIG. 1, as one means of addressing this inconvenience, Korean Patent Laid-Open Publication No. 10-2008-0082936 discloses a method of extracting only the moving objects included in a video and generating a summary video of those moving objects. With such a summary video, the moving objects can be viewed in condensed form in a short time without watching the entire video.

However, the above prior art only summarizes video captured by a single imaging device, so it is of limited use when a plurality of videos captured by a plurality of imaging devices must be used to track the movement path of a criminal or a missing person.

Furthermore, to generate the summary video, portions representing non-spatially overlapping appearances of an object are copied from at least three different input frames into at least two consecutive frames for display, and the prior art relies on a tracking-by-detection approach that unconditionally extracts every moving object as a summary target. As a result, when the same object leaves the frame and re-enters, it is extracted as a separate object when the video sequence is generated, and monitoring becomes difficult when there are many moving objects. These factors limit its use in the public safety field.

Patent Publication No. 10-2008-0082963

The present invention provides a method of generating a video summary by identifying an object of interest using a plurality of imaging devices, and a system for performing the same.

A method of generating a video summary by identifying an object of interest according to one aspect of the present invention may include: detecting event objects in a video captured by an imaging device; extracting an attribute of each event object frame by frame; identifying event objects by comparing, between frames, the attributes of the event objects included in each frame; clustering the identified event objects according to their attributes; upon receiving a user input that includes an attribute of the object of interest, filtering for the cluster corresponding to the attribute of the object of interest, based on the attributes of the event objects and the attribute of the object of interest; generating tubes for the cluster corresponding to the object of interest; and generating a video summary of the object of interest based on the tubes.
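The clustering and filtering steps of this method can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each identified event object is a dictionary with an `id` and a single categorical attribute such as `type`; the function names `cluster_by_attribute` and `filter_clusters` are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def cluster_by_attribute(event_objects, attribute):
    """Group identified event objects by the value of one attribute.

    event_objects: list of dicts, each with an "id" key and the given
    attribute key (an assumed, illustrative data layout).
    """
    clusters = defaultdict(list)
    for obj in event_objects:
        clusters[obj[attribute]].append(obj["id"])
    return dict(clusters)


def filter_clusters(clusters, wanted_value):
    """Return the ids in the cluster matching the user's attribute value."""
    return clusters.get(wanted_value, [])


objects = [
    {"id": 1, "type": "car"},
    {"id": 2, "type": "person"},
    {"id": 3, "type": "car"},
]
clusters = cluster_by_attribute(objects, "type")
cars = filter_clusters(clusters, "car")
```

Clustering before the user query arrives means the filter step reduces to a lookup, which is one plausible reason to order the steps this way.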

In one embodiment, identifying the event object may include comparing the attribute of the event object with the attribute of the object of interest and, when the difference between the attributes is less than or equal to a predetermined first threshold, identifying them as the same event object.
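A minimal sketch of threshold-based identification follows. It assumes that attributes are numeric vectors and that the "difference" is a Euclidean distance; the patent does not fix either choice. The helper `link_adjacent_frames` is hypothetical and shows one greedy way to carry identities from one frame to the next.

```python
import math


def attribute_distance(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_same_object(attr_a, attr_b, first_threshold):
    """Same object when the attribute difference is within the threshold."""
    return attribute_distance(attr_a, attr_b) <= first_threshold


def link_adjacent_frames(prev_objects, curr_attrs, first_threshold, next_id):
    """Greedily match detections in the current frame to the previous frame.

    prev_objects: dict mapping object id -> attribute vector (previous frame).
    curr_attrs: list of attribute vectors detected in the current frame.
    Returns (dict id -> attribute vector for the current frame, next free id).
    """
    unused = dict(prev_objects)
    curr_objects = {}
    for attr in curr_attrs:
        best = min(unused, key=lambda i: attribute_distance(unused[i], attr),
                   default=None)
        if best is not None and is_same_object(unused[best], attr, first_threshold):
            curr_objects[best] = attr      # same object as in the previous frame
            del unused[best]
        else:
            curr_objects[next_id] = attr   # no close match: treat as a new object
            next_id += 1
    return curr_objects, next_id


prev = {0: [0.0, 0.0], 1: [10.0, 10.0]}
curr, next_free = link_adjacent_frames(
    prev, [[10.0, 11.0], [0.5, 0.0], [50.0, 50.0]], 2.0, 2)
```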

In one embodiment, identifying the event object may compare the attributes of event objects between two frames that are adjacent in temporal order.

In one embodiment, identifying the event object may further include, for one or more event objects each identified as the same event object, comparing attributes among those event objects.

In one embodiment, the videos may each be captured by one of a plurality of imaging devices located at different points.

The method may further include identifying an event object across at least two of the videos.

In one embodiment, extracting the attribute of the event object may use an attribute extraction model that includes at least one of a convolution layer, a pooling layer, and a fully connected layer.

In one embodiment, extracting the attribute of the event object may use any one of Local Binary Pattern (LBP), Histogram of Oriented Gradients (HOG), and Scale-Invariant Feature Transform (SIFT).
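Of the hand-crafted descriptors named here, LBP is the simplest to show in full. The sketch below implements the basic 3x3 LBP on a grayscale image stored as a list of rows. The bit ordering (clockwise from the top-left neighbour) and the `>=` comparison are common conventions assumed here; the patent does not specify a variant.

```python
# Neighbour offsets, clockwise starting from the top-left pixel.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code of the pixel at (r, c); requires a full 3x3 neighbourhood."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
code = lbp_code(patch, 1, 1)
```

The resulting histogram can serve as the attribute vector that is compared between frames.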

In addition, identifying the event object may use a Kalman filter based on a kernelized correlation filter (KCF).

In one embodiment, extracting the attribute of the event object may further include, when the position of the event object is extracted frame by frame, extracting at least one of a movement direction and a movement pattern of the event object based on its per-frame positions after the event object has been identified.

In one embodiment, generating the video summary for the cluster may generate the video summary by minimizing spatial collisions while maintaining temporal consistency among the tubes corresponding to the object of interest.
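One simple way to read "temporal consistency with minimal spatial collision" is a greedy placement: tubes keep their original chronological order, and each tube is shifted to the earliest synopsis frame at which its bounding boxes collide with no tube already placed. The sketch below is an assumption-laden illustration of that idea, not the optimization the patent actually uses; a tube is modelled as an original start frame plus a list of per-frame boxes `(x1, y1, x2, y2)`.

```python
def boxes_overlap(a, b):
    """True when two axis-aligned boxes (x1, y1, x2, y2) intersect."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def collision_count(boxes, start, placed):
    """Number of frames in which a tube starting at `start` hits a placed tube."""
    count = 0
    for other_start, other_boxes in placed:
        for t, box in enumerate(boxes):
            k = start + t - other_start
            if 0 <= k < len(other_boxes) and boxes_overlap(box, other_boxes[k]):
                count += 1
    return count


def place_tubes(tubes):
    """Assign a synopsis start frame to each tube.

    tubes: list of (original_start_frame, boxes). Start frames are
    non-decreasing in original chronological order, and each tube is
    shifted later until it collides with nothing already placed.
    """
    placed = []
    earliest = 0
    for _, boxes in sorted(tubes, key=lambda t: t[0]):
        start = earliest
        while collision_count(boxes, start, placed) > 0:
            start += 1
        placed.append((start, boxes))
        earliest = start
    return [start for start, _ in placed]


tube_a = (100, [(0, 0, 10, 10)] * 3)
tube_b = (500, [(5, 5, 15, 15)] * 2)          # overlaps tube_a spatially
tube_c = (900, [(100, 100, 110, 110)] * 2)    # far from both
starts = place_tubes([tube_a, tube_b, tube_c])
```

Requiring zero collisions lengthens the synopsis; a practical system would more likely trade a collision cost against total length.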

In one embodiment, the method may further include providing the user with a movement path of the object of interest, based on the location information of the imaging devices that captured the event objects corresponding to the object of interest.
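The path-generation step can be sketched as follows, assuming each sighting of the object of interest is a `(timestamp, camera_id)` pair and that camera locations are known as `(latitude, longitude)` tuples. Both data layouts and the function name `build_movement_path` are illustrative assumptions, and the coordinates in the example are made up.

```python
def build_movement_path(sightings, camera_locations):
    """Order sightings by time and map each to its camera's location.

    Consecutive sightings from the same camera are collapsed into one
    waypoint, keeping the time of first appearance.
    """
    path = []
    for timestamp, camera_id in sorted(sightings):
        if not path or path[-1][1] != camera_id:
            path.append((timestamp, camera_id, camera_locations[camera_id]))
    return path


locations = {"A": (37.60, 127.04), "B": (37.61, 127.05), "C": (37.62, 127.05)}
sightings = [(30, "B"), (10, "A"), (20, "A"), (40, "C")]
path = build_movement_path(sightings, locations)
```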

A computer-readable recording medium according to another aspect of the present invention is readable by a computer and may store program instructions operable by the computer. When executed by a processor of the computer, the program instructions may cause the processor to perform the method of generating a video summary by identifying an object of interest according to the embodiments described above.

A method of generating a video summary by identifying an object of interest according to yet another aspect of the present invention may include: detecting event objects in a plurality of videos captured by imaging devices; extracting a first attribute of each event object frame by frame; identifying event objects by comparing, between frames, the attributes of the event objects included in each frame; generating a tube for each identified event object; extracting a second attribute of the event object based on its tube; clustering the identified event objects according to their attributes; upon receiving a user input, filtering for the cluster corresponding to the attribute of the object of interest, based on the attributes of the event objects and the attribute of the object of interest; and generating a video summary of the object of interest based on the filtered tubes.

In one embodiment, extracting the second attribute of the event object may include: calculating the density of the tube of the event object; determining whether the density exceeds a predetermined second threshold; determining a region that exceeds the second threshold to be a key region; and extracting at least one of a movement direction and a movement pattern of the object of interest based on the key region.
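The patent text here does not define how tube density is computed. As one hedged interpretation, the sketch below divides the frame into square grid cells, counts how many of the tube's per-frame box centres fall in each cell, and treats cells whose count exceeds the second threshold as the key region. The movement direction is then taken as the displacement between the first and last centres inside that region. All names and the grid scheme are assumptions.

```python
def _cell(point, cell_size):
    """Grid cell index containing a point (x, y)."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def key_region(centers, cell_size, second_threshold):
    """Cells whose centre count (the assumed density) exceeds the threshold."""
    density = {}
    for point in centers:
        cell = _cell(point, cell_size)
        density[cell] = density.get(cell, 0) + 1
    return {cell for cell, n in density.items() if n > second_threshold}


def movement_direction(centers, cell_size, region):
    """Displacement (dx, dy) from the first to the last centre in the key region."""
    inside = [p for p in centers if _cell(p, cell_size) in region]
    if len(inside) < 2:
        return None
    (x0, y0), (x1, y1) = inside[0], inside[-1]
    return (x1 - x0, y1 - y0)


centers = [(1, 1), (2, 1), (3, 1), (12, 1), (21, 1), (22, 1), (23, 1)]
region = key_region(centers, cell_size=10, second_threshold=2)
direction = movement_direction(centers, 10, region)
```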

In one embodiment, the first attribute may include the type of the event object, and the second attribute may include at least one of a movement direction and a movement pattern of the event object.

A system for generating a video summary by identifying an object of interest according to yet another aspect of the present invention may include a plurality of imaging devices and a video processing unit. The video processing unit may include: an event object detection unit that detects event objects in videos captured by the plurality of imaging devices; an attribute extraction unit that extracts attributes of the event objects; an identification unit that identifies event objects by comparing their attributes between frames; a clustering unit that clusters the identified event objects according to their attributes; a filtering unit that, upon receiving a user input including an attribute of the object of interest, filters for the cluster corresponding to the attribute of the object of interest, based on the attributes of the event objects and the attribute of the object of interest; a tube generation unit that generates tubes for the cluster corresponding to the object of interest; and a video summary unit that generates a video summary of the object of interest based on the tubes.

The method of generating a video summary according to one aspect of the present invention can identify event objects in each video captured by a plurality of cameras and generate a video summary of the object of interest, among those event objects, that the user wishes to monitor.

Because the video summary is generated for the specific object of interest that the user wishes to monitor, the user does not have to monitor the footage again to find the object of interest among all the moving objects included in a conventional video summary. This saves the user both time and physical effort.

In addition, identifying event objects across one or more frames resolves the problem of the same object appearing more than once. This keeps the size of the video summary efficient.

In addition, the movement path of the object of interest can be tracked by additionally using information such as the locations of the plurality of cameras.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned here will be clearly understood by those skilled in the art from the description of the claims.

FIG. 1 is a conceptual diagram of generating a video summary according to the prior art.
FIG. 2 is a schematic block diagram of a video summary generation system according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram of generating a video summary according to an embodiment of the present invention.
FIG. 4 is a diagram showing the result of extracting event objects according to an embodiment of the present invention.
FIG. 5 is a schematic structural diagram of an attribute extraction model based on a convolutional neural network (CNN) according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram of vehicle samples used to train the attribute extraction model according to an embodiment of the present invention.
FIG. 7 is a conceptual diagram of a video summary generated by the video summary generation system according to an embodiment of the present invention.
FIG. 8 is a flowchart of a video summary generation method according to an embodiment of the present invention.
FIG. 9 is a diagram showing a video summary generated for objects of interest having a specific size according to an embodiment of the present invention.
FIG. 10 is a diagram showing a video summary generated for objects of interest of a specific type according to an embodiment of the present invention.
FIG. 11 is a flowchart of a process of extracting motion-related attributes based on position.
FIG. 12 is a diagram showing an extracted key region according to an embodiment of the present invention.
FIG. 13 is a conceptual diagram of a process in which a specific direction is extracted as an attribute of an event object according to an embodiment of the present invention.
FIG. 14 is a diagram showing a video summary generated for objects of interest having another specific direction according to an embodiment of the present invention.

The terminology used herein is only for describing particular embodiments and is not intended to limit the present invention. Singular forms used herein include plural forms unless the wording clearly indicates otherwise. As used in the specification, "comprising" specifies a particular characteristic, region, integer, step, operation, element, and/or component, and does not exclude the presence or addition of other characteristics, regions, integers, steps, operations, elements, and/or components.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless so defined.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

FIG. 2 is a schematic block diagram of a video summary generation system 1 according to an embodiment of the present invention, and FIG. 3 is a conceptual diagram of generating a video summary according to an embodiment of the present invention.

Referring to FIG. 3, unlike the conventional video summary technique of FIG. 1, the video summary generation system 1 generates a video summary for the specific object of interest that the user wishes to monitor. The user therefore does not have to monitor the footage again to find the object of interest among all the moving objects included in a conventional video summary, which saves the user both time and physical effort.

Referring again to FIG. 2, the video summary generation system 1 includes a plurality of imaging devices (100A, 100B, ..., 100N) and a video processing unit 200 that receives the videos captured by the imaging devices (100A, 100B, ..., 100N) over a wired and/or wireless network (not shown).

The imaging devices (100A, 100B, ..., 100N) are devices that directly photograph objects and generate images consisting of a single frame or videos consisting of consecutive frames, and include cameras, CCTV, and the like.

In another embodiment, the imaging devices (100A, 100B, ..., 100N) may further include components that assist image capture depending on the situation, such as an infrared camera. In yet another embodiment, the imaging devices (100A, 100B, ..., 100N) may use a scanner to generate further, different types of images and videos.

In one embodiment, the plurality of imaging devices (100A, 100B, ..., 100N) are positioned at different points, each with a fixed field of view. In this case, the video summary generation system 1 generates a video summary using some or all of the plurality of videos captured by the respective imaging devices (100A, 100B, ..., 100N).

Because the plurality of imaging devices (100A, 100B, ..., 100N) are located at different positions, each imaging device may transmit its unique identification information, location information, and the like to the video processing unit 200 together with the video data.

Communication over the wired and/or wireless network may use any communication method by which objects can network with one another, and is not limited to wired communication, wireless communication, 3G, 4G, or other methods. For example, the wired and/or wireless network (2) may refer to a communication network using one or more communication methods selected from the group consisting of Local Area Network (LAN), Metropolitan Area Network (MAN), Global System for Mobile Network (GSM), Enhanced Data GSM Environment (EDGE), High Speed Downlink Packet Access (HSDPA), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Zigbee, Wi-Fi, Voice over Internet Protocol (VoIP), LTE Advanced, IEEE 802.16m, WirelessMAN-Advanced, HSPA+, 3GPP Long Term Evolution (LTE), Mobile WiMAX (IEEE 802.16e), UMB (formerly EV-DO Rev. C), Flash-OFDM, iBurst and MBWA (IEEE 802.20) systems, HIPERMAN, Beam-Division Multiple Access (BDMA), World Interoperability for Microwave Access (Wi-MAX), and ultrasonic communication, but is not limited thereto.

The imaging devices (100A, 100B, ..., 100N) and the video processing unit 200 according to the embodiments may be entirely hardware, entirely software, or partly hardware and partly software. For example, the product distribution server may collectively refer to hardware with data processing capability and the operating software that drives it. In this specification, terms such as "unit", "system", and "device" are intended to refer to a combination of hardware and the software driven by that hardware. For example, the hardware may be a data processing device that includes a CPU (Central Processing Unit), a GPU (Graphic Processing Unit), or another processor. The software may refer to a running process, an object, an executable, a thread of execution, a program, and the like.

Referring again to FIG. 2, in one embodiment the video processing unit 200 includes an event object detection unit 210, an attribute extraction unit 220, an identification unit 230, a clustering unit 240, a filtering unit 250, a tube generation unit 260, a video summary unit 270, and a path tracking unit 280. In one embodiment, the video processing unit 200 may further include a database 290. Here, a database means a collection of large amounts of structured, unstructured, or semi-structured data, and may store data related to auctions already conducted in the past, among other things. Structured data is data stored in fixed fields, such as relational databases and spreadsheets. Unstructured data is data not stored in fixed fields, such as text documents, images, videos, and audio data. Semi-structured data is data that is not stored in fixed fields but includes metadata or a schema, such as XML, HTML, and text.

The units (210, 220, 230, 240, 250, 260, 270, 280, 290) that make up the video processing unit 200 according to the embodiments are not necessarily intended to refer to physically separate components. That is, although they are shown as separate blocks in FIG. 2, depending on the embodiment some or all of the event object detection unit 210, the attribute extraction unit 220, the identification unit 230, the clustering unit 240, the filtering unit 250, the tube generation unit 260, the video summary unit 270, the path tracking unit 280, and the database 290 may be integrated in a single device (for example, a display device). Each unit (210, 220, 230, 240, 250, 260, 270, 280, 290) is merely a functional division of the computing device on which it is implemented, according to the operations that device performs, and the units need not be provided independently of one another.

Depending on the embodiment, one or more of the event object detection unit 210, the attribute extraction unit 220, the identification unit 230, the clustering unit 240, the filtering unit 250, the tube generation unit 260, the video summary unit 270, the path tracking unit 280, and the database 290 may of course be implemented as separate devices that are physically distinct from one another. For example, the database 290 may be implemented as an external device physically separate from the remaining units 210 to 280.

The event object detection unit 210 detects event objects in the plurality of videos captured by the plurality of imaging devices (100A, 100B, ..., 100N).

In one embodiment, for more efficient event object detection, the event object detection unit 210 extracts the background and the objects separately, thereby separating the background from the objects, and then detects the event objects.

In this specification, an event object is an object that is moving, and the background generally refers not only to the surroundings of an object but also to objects that do not move. The separated background may be processed in various ways, such as background light strengthening and background identification.

In one embodiment, the event object detection unit 210 may extract the background and detect event objects by applying to the video an improved Mixture of Gaussians (MoG) that uses a Gaussian mixture model (GMM). Through the improved MoG, the event object detection unit 210 samples background frames over a certain period to generate a background model, and calculates the image difference for each time frame. The event object detection unit 210 then obtains the difference between the generated background model and each video frame to detect the event objects in each frame.
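To make the background-model idea concrete, the sketch below keeps a single running Gaussian (mean and variance) per pixel and marks a pixel as foreground when it deviates from the mean by more than `k` standard deviations. This is a deliberate simplification of the mixture model described above, which keeps several Gaussians per pixel; the class name and all parameter values are illustrative assumptions. Frames are grayscale images stored as lists of rows.

```python
class RunningGaussianBackground:
    """Per-pixel single-Gaussian background model (simplified stand-in for MoG)."""

    def __init__(self, first_frame, learning_rate=0.05, k=2.5, min_var=4.0):
        self.mean = [[float(v) for v in row] for row in first_frame]
        self.var = [[min_var for _ in row] for row in first_frame]
        self.lr = learning_rate
        self.k = k
        self.min_var = min_var

    def apply(self, frame):
        """Return a mask (255 = foreground, 0 = background) and update the model."""
        mask = []
        for r, row in enumerate(frame):
            mask_row = []
            for c, value in enumerate(row):
                diff = value - self.mean[r][c]
                foreground = diff * diff > (self.k ** 2) * self.var[r][c]
                mask_row.append(255 if foreground else 0)
                if not foreground:
                    # Only background pixels adapt, so slow lighting drift
                    # is absorbed while moving objects are not.
                    self.mean[r][c] += self.lr * diff
                    self.var[r][c] = max(
                        self.min_var,
                        (1 - self.lr) * self.var[r][c] + self.lr * diff * diff,
                    )
            mask.append(mask_row)
        return mask


model = RunningGaussianBackground([[10, 10], [10, 10]])
mask = model.apply([[10, 11], [200, 10]])
```

In practice a library implementation of a mixture-based background subtractor would normally be used instead of per-pixel Python loops.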

In this way, by applying the improved MoG using the GMM to the video, the event object detection unit 210 senses the amount of change across the entire image caused by changes in the lighting component, based on the images over a certain period of time, extracts the background, and reduces background extraction errors caused by lighting changes. As a result, it can perform high-quality background extraction with few lighting-induced errors at a fast extraction speed.

In another embodiment, the event object detection unit 210 may extract the background and detect event objects by applying K-nearest neighbor (KNN), aggregate channel features (ACF), or the like.

Additionally, the event object detection unit 210 may further perform a dilation morphological operation on the detected event object. This reduces the error rate caused by shadows and noise that arise during event object detection.
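As a minimal sketch of the dilation operation, assuming a binary foreground mask stored as nested lists and a square structuring element (the function name `dilate` and the `radius` parameter are illustrative, not taken from the embodiment):

```python
def dilate(mask, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element:
    an output pixel is 1 if any pixel in its neighborhood is 1."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                        out[y][x] = 1
    return out


# Two foreground fragments separated by a one-pixel gap (e.g., caused by noise).
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
dilated = dilate(mask)
```

After dilation the two fragments merge into a single connected region, which is the kind of gap filling that helps keep one moving object from being split into several detections.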

FIG. 4 is a diagram illustrating the result of extracting an event object according to an embodiment of the present invention.

Referring to FIG. 4, the event object detection unit 210 may render the background of a video frame in black and the objects in white, and then extract the background and detect the event object.

The attribute extraction unit 220 extracts the attributes of an event object that serve as the criteria for identifying it. The attribute extraction unit 220 extracts the attributes of the event object for each frame.

In this specification, attributes are criteria by which event objects can be distinguished from one another. They may include attributes associated with an event object within a single frame, such as type (e.g., vehicle, bicycle, person), size, color, location, gender or whether accessories such as a hat are worn in the case of a person, and vehicle model or license plate in the case of a vehicle. They may further include attributes associated with a single event object across multiple frames, such as the movement direction and movement pattern of the event object.

The attribute extraction unit 220 extracts the attributes of an event object using at least one of various artificial intelligence (AI) techniques (e.g., deep learning, support vector machine (SVM)) and handcrafted feature detection algorithms (e.g., local binary pattern (LBP), histogram of oriented gradients (HOG), scale-invariant feature transform (SIFT)).

In an embodiment, the attribute extraction unit 220 extracts the attributes of an event object using an attribute extraction model based on a convolutional neural network (CNN). The attribute extraction model includes at least one of a convolution layer, a pooling layer, and a fully connected layer.

FIG. 5 is a schematic structural diagram of a CNN-based attribute extraction model according to an embodiment of the present invention.

Referring to FIG. 5, the attribute extraction model used to extract the attributes of an event object includes three sub-layers, each including a convolution layer and a pooling layer. Each sub-layer is configured to apply an activation function such as ReLU to the feature map output through the convolution layer and the pooling layer.

The convolution layer extracts features from the image (size: 32×32) of the event object detected in each frame using a convolution filter of a predetermined size, and outputs the extracted features as a feature map.
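To make the convolution step concrete, the following sketch slides a small filter over a toy image and applies ReLU to the resulting feature map. The 2×2 filter, the 3×4 image, and the function names are assumptions chosen for brevity; the embodiment operates on 32×32 patches with learned filters.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and return the
    feature map of weighted sums (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    feature_map = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Element-wise ReLU activation."""
    return [[max(0, v) for v in row] for row in feature_map]


# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = relu(conv2d_valid(image, kernel))
```

The feature map is strongest in the column where the edge lies, which is the sense in which a convolution filter "extracts" a feature.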

The pooling layer then slides a window of a predetermined size over the feature values in the feature map with a fixed stride, converting the feature values inside each window into a representative value and thereby scaling the size of the feature map. This process is referred to as sub-sampling or pooling. By pooling the feature map, the attribute extraction model scales it down to allow efficient computation, and abstracts the image to make it robust to distortion.

In an embodiment, the pooling layer performs pooling through max pooling, which extracts the maximum value within each window. However, the pooling is not limited thereto, and various pooling methods may be used, such as average pooling, which extracts the average value, and L2-norm pooling.
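The windowed pooling described above can be sketched as follows. The 2×2 window, stride of 2, and the names `pool2d` and `reducer` are assumptions for illustration; the reducer argument switches between max pooling and average pooling.

```python
def pool2d(feature_map, window=2, stride=2, reducer=max):
    """Slide a window over the feature map with the given stride and
    replace each window by one representative value (max by default)."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for y in range(0, h - window + 1, stride):
        row = []
        for x in range(0, w - window + 1, stride):
            values = [feature_map[y + dy][x + dx]
                      for dy in range(window) for dx in range(window)]
            row.append(reducer(values))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
max_pooled = pool2d(feature_map)                # max pooling
avg_pooled = pool2d(feature_map, reducer=mean)  # average pooling
```

Both variants reduce the 4×4 map to 2×2; they differ only in which representative value is kept for each window.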

The attribute extraction model may include two or more fully connected layers. A fully connected layer has the same form as a layer of a conventional neural network: every input node is connected to every output node.

The weights of the convolution filters and the node weights of the fully connected layers may be learned in advance from a plurality of training images. For example, the attribute extraction model may be trained to extract the type of an event object using training images of three types (person, car, bicycle).

FIG. 6 is an exemplary diagram of car patches used to train the attribute extraction model according to an embodiment of the present invention. In this embodiment, the attribute extraction model may learn about cars based on the various car images of FIG. 6. As a result, it can determine the type of an event object to be one of person, car, and bicycle, and extract this as an attribute (i.e., the type).

Through this training, the convolution filters and the nodes of the fully connected layers acquire filter weights and node weights that output feature values suitable for analyzing attributes such as the type of an event object and for locating the event object within a video frame.

Additionally, a Softmax function may be further applied to the result values output from the fully connected layer. The Softmax function calculates the probability that the input data belongs to each of a plurality of classes. The input event object can therefore be classified (i.e., determined) into one of two or more classes. As in the embodiment of FIG. 6, when the attribute extraction model is designed to classify event objects as person, vehicle, or bicycle, the Softmax function calculates the probability that the input event object belongs to each of person, vehicle, and bicycle.
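A short sketch of the Softmax step, assuming three raw scores from the last fully connected layer (the score values and the `CLASSES` list are made up for illustration):

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1.
    Subtracting the maximum first keeps exp() numerically stable."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["person", "vehicle", "bicycle"]

# Hypothetical outputs of the last fully connected layer for one event object.
scores = [1.0, 3.0, 0.5]
probabilities = softmax(scores)
predicted_type = CLASSES[probabilities.index(max(probabilities))]
```

The class with the highest probability is taken as the type attribute of the event object; here that is the second class, since it has the largest raw score.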

Additionally, dropout may be further applied to the output part for data efficiency.

Using such an attribute extraction model, the attribute extraction unit 220 can extract the type (e.g., person, bicycle, car) of the event object included in each frame. It can also extract finer-grained attributes of objects of the same type (e.g., gender and whether accessories such as a hat are worn for a person; vehicle model and license plate for a vehicle).

However, the analysis is not limited thereto, and event objects may be analyzed into a wider variety of types and more detailed attributes. Such analysis may be performed through attribute extraction models modified into various structures.

In another embodiment, the attribute extraction model may include a first sub-model and a second sub-model. Here, the first sub-model analyzes the type of the event object, like the CNN-based attribute extraction model having the structure of FIG. 5. The second sub-model may extract a second attribute by which objects of the same type can be classified further.

Here, the second attribute may be subordinate to the attribute of the first sub-model. For example, the attribute extraction model may be configured to extract the type of an event object (e.g., vehicle) with the first sub-model, and then apply the event object classified as a vehicle to a second sub-model, which is designed to detect the positions of digits and extract their features in order to read a license plate, thereby extracting the license plate.

In yet another embodiment, the attribute extraction model may be configured to extract a first attribute and a second attribute subordinate to the first attribute without including a plurality of sub-models.

In addition, the attribute extraction unit 220 may further extract attributes associated with a single event object across multiple frames, such as the movement direction and movement pattern of the event object.

The CNN-based attribute extraction model of FIG. 5 can extract the location of the event object in each frame.

When the location of the event object has been extracted among its attributes, the attribute extraction unit 220 may extract further attributes. In an embodiment, when the location of the event object has been extracted for each frame, after the same event object is identified in each frame, the attribute extraction unit 220 calculates the frame-by-frame change in the position of that event object to obtain its motion trajectory, and may further extract the movement direction and movement pattern based on the motion trajectory. This is described in more detail below with reference to FIGS. 11 to 14.
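One simple way to turn per-frame positions into a trajectory and a movement direction is sketched below. It assumes positions are (x, y) pixel coordinates of one already-identified event object and that the direction is summarized as the angle of the net displacement; the actual procedure of FIGS. 11 to 14 may differ.

```python
import math


def motion_trajectory(positions):
    """Frame-to-frame displacement vectors of one event object."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])]


def movement_direction(positions):
    """Angle in degrees of the net displacement from first to last position
    (0 degrees = positive x axis)."""
    dx = positions[-1][0] - positions[0][0]
    dy = positions[-1][1] - positions[0][1]
    return math.degrees(math.atan2(dy, dx))


# Positions of the same event object in four consecutive frames.
positions = [(10, 50), (14, 50), (19, 51), (25, 50)]
steps = motion_trajectory(positions)
angle = movement_direction(positions)
```

Here the object drifts slightly up and down but its net movement is along the positive x axis, so the summarized direction is 0 degrees.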

Additionally, the attribute extraction unit 220 may generate, for each event object, an event object attribute table containing the extracted attribute-related information. The event object attribute table may also be referred to as an event object attribute vector or matrix. The event object attribute table is stored in the database 290.

Additionally, the attribute extraction unit 220 may generate, for each event object, an event object information table that includes attribute-related information such as the attribute table and video-related information (imaging device information, frame information, time information, etc.).

The identification unit 230 is a means for identifying event objects. It identifies whether an event object included in a first frame and an event object included in a second frame are the same event object. Here, the first frame and the second frame refer to any two different frames included in one video.

In an embodiment, the identification unit 230 compares the attributes of an event object with the attributes of the object of interest and identifies them as the same event object when the difference between the attributes is less than a predetermined range. The identification unit 230 may calculate the difference between the attributes of the event objects of each frame using at least one of Euclidean distance, cosine distance, Mahalanobis distance, and joint Bayesian. However, the calculation is not limited thereto, and various methods for measuring similarity may be used.
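Two of the listed measures can be sketched as follows, assuming each event object's attributes have been encoded as a numeric vector. The vectors, the threshold of 0.5, and the helper `is_same_object` are illustrative assumptions.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_same_object(a, b, threshold=0.5, distance=euclidean_distance):
    """Treat two attribute vectors as the same event object when their
    distance is below the threshold."""
    return distance(a, b) < threshold


# Hypothetical attribute vectors for three detections.
eoa1 = [0.90, 0.10, 0.30]
eoa2 = [0.88, 0.12, 0.31]   # same object, next frame
eob2 = [0.10, 0.80, 0.70]   # a different object

same = is_same_object(eoa1, eoa2)
different = is_same_object(eoa1, eob2)
```

The appropriate threshold depends on how the attributes are encoded and on the chosen distance, so in practice it would be tuned on labeled examples.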

As an example, assume a first video in which the first imaging device 100A has captured two different event objects (EOA and EOB). The first video includes a first frame and a second frame; the first frame includes event object EOA1 and event object EOB1, and the second frame includes event object EOA2 and event object EOB2.

The attribute extraction unit 220 extracts the attributes of the event objects (EOA1 and EOB1) of the first frame and the event objects (EOA2 and EOB2) of the second frame. Because event object EOA1 of the first frame and event object EOA2 of the second frame are in fact the same event object, identical or nearly identical attributes will be extracted for them. Likewise, because event object EOB1 of the first frame and event object EOB2 of the second frame are in fact the same event object, identical or nearly identical attributes will be extracted for them.

The identification unit 230 compares the attribute tables of the event objects (EOA1 and EOB1) of the first frame with the attribute tables of the event objects (EOA2 and EOB2) of the second frame and calculates the differences between them.

When event object EOA1 of the first frame is compared with event object EOB2 of the second frame, the two are in fact different event objects, so a very large difference value will be calculated. Because the difference value exceeds the predetermined threshold, the identification unit 230 identifies event object EOA1 of the first frame and event object EOB2 of the second frame as different event objects. The same applies when event object EOB1 of the first frame is compared with event object EOA2 of the second frame.

In contrast, when the event objects EOA1 and EOA2 of the first and second frames are compared, the two are in fact the same event object, so a difference value of zero or a very small value will be calculated. Because the difference is below the predetermined threshold, the identification unit 230 determines that event objects EOA1 and EOA2 of the first and second frames are the same event object. The same applies when the event objects EOB1 and EOB2 of the first and second frames are compared.

This event object identification process is performed over all frames to identify the event objects included in each video, and unique identification information (e.g., an identifier) is then assigned to each identified event object. For example, a first identifier is assigned to EOA1 and EOA2, and a second identifier is assigned to EOB1 and EOB2. That is, the event objects included in each frame are each identified through their identifiers.
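The identifier assignment can be sketched as a greedy nearest-match procedure. This is an illustrative assumption rather than the patented algorithm: attributes are numeric vectors, the distance is Euclidean, each identifier remembers the most recent attributes seen for it, and the threshold of 0.5 is arbitrary.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_identifiers(frames, threshold=0.5):
    """For every detection, reuse the identifier of the closest known event
    object if it is closer than the threshold; otherwise issue a new one."""
    known = {}    # identifier -> most recent attribute vector
    labels = []   # identifiers per frame, in detection order
    for detections in frames:
        frame_labels = []
        for attrs in detections:
            best = min(known, key=lambda i: distance(known[i], attrs), default=None)
            if best is not None and distance(known[best], attrs) < threshold:
                identifier = best
            else:
                identifier = len(known) + 1
            known[identifier] = attrs
            frame_labels.append(identifier)
        labels.append(frame_labels)
    return labels


# Frame 1 contains EOA1 and EOB1; frame 2 contains EOB2 and EOA2 (other order).
frame1 = [[0.90, 0.10], [0.10, 0.80]]
frame2 = [[0.12, 0.79], [0.88, 0.12]]
labels = assign_identifiers([frame1, frame2])
```

In this toy run EOA1 and EOA2 share the first identifier and EOB1 and EOB2 share the second. The greedy rule does not prevent two detections in the same frame from receiving the same identifier; a fuller implementation would add a one-to-one assignment step.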

Through this identification by the identification unit 230, an event object with discontinuous movement can also be identified as a single object. For example, when the same event object moves in and out of the frame of one imaging device, the event object captured before it left the frame and the event object captured after it re-entered the frame can be identified as the same event object.

In an embodiment, the identification unit 230 may compare the attributes of event objects between two temporally adjacent frames. This allows the identification unit 230 to determine more quickly whether the event objects in each frame are the same event object.

In this embodiment, after the event object attributes are compared between temporally adjacent frames, a further attribute comparison may be performed on the event objects that were identified as the same event object.

In the example above, suppose event object EOA leaves the video frame after the second frame and re-enters it from a third frame. The identification unit 230, having compared the attributes of event objects between temporally adjacent frames, identifies event object EOA from the first frame through the second frame as one object, and identifies event object EOA from the third frame onward as one object. It then compares event object EOA of the first through second frames with event object EOA of the third frame onward, and can thereby identify that event object EOA is present in the first and second frames and again from the third frame onward.
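The second comparison pass can be pictured as merging track fragments. The sketch below assumes that adjacent-frame matching has already produced fragments, each with a representative attribute vector and the list of frames it covers; the dictionary layout, the threshold, and the function names are hypothetical.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_fragments(fragments, threshold=0.5):
    """Merge track fragments whose representative attributes are closer than
    the threshold, so an object that left and re-entered becomes one track."""
    merged = []
    for fragment in fragments:
        for track in merged:
            if distance(track["attrs"], fragment["attrs"]) < threshold:
                track["frames"].extend(fragment["frames"])
                break
        else:
            merged.append({"attrs": list(fragment["attrs"]),
                           "frames": list(fragment["frames"])})
    return merged


fragments = [
    {"attrs": [0.90, 0.10], "frames": [1, 2]},     # EOA before leaving the view
    {"attrs": [0.10, 0.80], "frames": [1, 2, 3]},  # EOB
    {"attrs": [0.88, 0.11], "frames": [5, 6]},     # EOA after re-entering
]
tracks = merge_fragments(fragments)
```

Only fragments are compared in this pass, not every pair of frames, which is why the two-stage approach can be faster than comparing all frames against each other.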

As a result, event objects can be identified more quickly while event objects with discontinuous movement are also identified.

In yet another embodiment, when the attribute extraction unit 220 has extracted the attributes of an event object using LBP (local binary pattern), HOG (histogram of oriented gradients), SIFT (scale-invariant feature transform), or the like (e.g., when the attributes were extracted by HOG), the identification unit 230 may use a Kalman filter based on a kernelized correlation filter (KCF) to track the object of interest for as long as possible and identify the event objects included in each frame.

In particular, when occlusion occurs on an object, the KCF-based Kalman filter can predict the location of the event object before and after the occlusion by means of prediction parameters. The prediction parameters may be updated adaptively.
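The prediction behavior during occlusion can be illustrated with a textbook constant-velocity Kalman filter in one dimension. This sketch covers only the Kalman prediction and update steps; the correlation-filter part is omitted, and the class name, noise values, and measurements are assumptions for illustration.

```python
class ConstantVelocityKalman1D:
    """Textbook Kalman filter with state (position, velocity) and
    position-only measurements."""

    def __init__(self, position, velocity=0.0, process_var=0.01, meas_var=1.0):
        self.p, self.v = position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q, self.r = process_var, meas_var

    def predict(self, dt=1.0):
        """Advance the state: P = F P F^T + Q with F = [[1, dt], [0, 1]]."""
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, measured_position):
        """Correct the state with a position measurement (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                     # innovation variance
        k0, k1 = a / s, c / s              # Kalman gain
        residual = measured_position - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


# The object is observed moving right, then hidden for two frames.
kf = ConstantVelocityKalman1D(position=0.0)
for z in [1.0, 2.0, 3.0, 4.0]:
    kf.predict()
    kf.update(z)
last_filtered = kf.p

# During occlusion there is no measurement, so only predict() is called.
occluded = [kf.predict(), kf.predict()]
```

While the object is hidden, the filter keeps extrapolating along the estimated velocity, which gives a search location for re-acquiring the object when it reappears.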

Because the video summary generation system 1 receives a plurality of videos, the event object detection unit 210, the attribute extraction unit 220, and the identification unit 230 each perform their functions on each of the plurality of videos.

Additionally, the identification unit 230 may perform the event object identification process across videos to identify event objects over the entire set of videos. In an embodiment, the identification unit 230 may identify an event object between at least two of the videos. An example of this is similar to the above example for event objects EOA and EOB with the first and second frames replaced by a first video and a second video, so a detailed description is omitted.

In yet another embodiment, based on the motion-related attributes of an event object and the location of the imaging device 100 that captured the video, the identification unit 230 may perform event object identification against other videos captured by imaging devices 100 located in the direction in which the event object captured in one video is likely to be captured.

The clustering unit 240 clusters the identified event objects by attribute and generates metadata for each attribute. Here, the metadata for an attribute includes the identifiers of the event objects that have that attribute. Accordingly, when the video summary generation system 1 receives a user input for a specific attribute, it looks up the metadata for that attribute and then generates a video summary for the event objects included in that metadata.
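The per-attribute metadata behaves like an inverted index from attribute values to identifiers. A minimal sketch, assuming attributes are stored as name/value pairs per identifier (the data layout and function name are illustrative):

```python
from collections import defaultdict


def cluster_by_attribute(event_objects):
    """Build metadata mapping each (attribute name, value) pair to the set
    of identifiers of the event objects that have it."""
    metadata = defaultdict(set)
    for identifier, attributes in event_objects.items():
        for name, value in attributes.items():
            metadata[(name, value)].add(identifier)
    return dict(metadata)


# Hypothetical identified event objects and their extracted attributes.
event_objects = {
    1: {"type": "vehicle", "color": "red"},
    2: {"type": "person", "color": "red"},
    3: {"type": "vehicle", "color": "blue"},
}
metadata = cluster_by_attribute(event_objects)
vehicles = metadata[("type", "vehicle")]
```

Looking up a single attribute then returns the identifiers of all matching event objects without rescanning the videos.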

When a user input including attributes of an object of interest is received, the filtering unit 250 filters the cluster corresponding to the attributes of the object of interest, based on the attributes of the event objects and the attributes of the object of interest. Here, the object of interest is the object the user wants to monitor. The user input including the attributes of the object of interest includes information that indicates those attributes directly, or information that indicates them indirectly (e.g., an image of the shape of the object of interest, or a map on which the movement of the object of interest is marked).

In one example, when the user wants to monitor event objects of the car type, the user inputs to the video summary generation system 1 a signal requesting that cars be output. The video summary generation system 1 then filters the cluster corresponding to cars and generates a video summary using the event objects included in the filtered cluster.
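Filtering by a user query can be sketched as intersecting the clusters for each requested attribute. The metadata layout (attribute/value pairs mapped to identifier sets), the query format, and the function name are assumptions for illustration.

```python
def filter_clusters(metadata, query):
    """Return the identifiers of event objects that match every attribute
    in the user's query by intersecting the corresponding clusters."""
    matches = None
    for item in query.items():
        identifiers = metadata.get(item, set())
        matches = set(identifiers) if matches is None else matches & identifiers
    return matches or set()


# Hypothetical per-attribute metadata: (name, value) -> identifiers.
metadata = {
    ("type", "vehicle"): {1, 3},
    ("type", "person"): {2},
    ("color", "red"): {1, 2},
    ("color", "blue"): {3},
}

cars = filter_clusters(metadata, {"type": "vehicle"})
red_cars = filter_clusters(metadata, {"type": "vehicle", "color": "red"})
```

The resulting identifier set selects which event objects are passed on to summary generation.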

The tube generation unit 260 is a means for generating tubes of event objects.

In an embodiment, the event objects included in a cluster are converted into tubes to generate tubes for the cluster. Here, a "tube" is a series of object sets that gathers the movements of each object, and refers to a concatenation of image portions representing objects or activities across the frames of a video. Preferably, a tube refers to a concatenation of image portions appearing across consecutive frames of the video. Because an event object is thus represented by a tube as a spatio-temporal volume, the terms "event object" and "tube" are used interchangeably below where appropriate.

Additionally, a tube generated by the tube generation unit 260 may be labeled with additional information. In an embodiment, a tube may be labeled with information associated with the event object used to generate it (attribute information, video-related information).

In another embodiment, additional attributes may be extracted based on a tube, and the tube may be labeled with those additional attributes. This is described in more detail below with reference to FIGS. 11 to 14.

The video summary unit 270 is a means for generating a video summary. In an embodiment, it performs an optimization that minimizes motion collisions between the summarizable objects in the video while preserving temporal consistency as far as possible, so that as many behavior patterns as possible can be presented.

FIG. 7 is a conceptual diagram of a video summary generated by the video summary generation system according to an embodiment of the present invention. For clarity, the process of generating a video summary is described using the first video captured by the first imaging device 100A.

In an embodiment, the video summary unit 270 shifts the tube of each event object (tube shifting), while maintaining temporal consistency, until the collision cost falls below a predetermined threshold (Φ). Here, the collision cost is expressed as the amount of temporal overlap for every pair of tubes and for all relative time shifts between them. The predetermined threshold (Φ) may take different values depending on the case. For example, when the cluster contains few event objects, there is relatively more spatio-temporal room in the summary, so the threshold may be set to a relatively small value. At the same time, the length (L) of the video summary is minimized.
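Tube shifting can be illustrated with a greedy sketch. It is not the optimization of the embodiment: it assumes tubes are described only by a start frame, a length, and a coarse spatial region label, that only tubes in the same region can collide, and that temporal consistency means keeping the original order of start times. All field and function names are hypothetical.

```python
def temporal_overlap(start_a, len_a, start_b, len_b):
    """Number of frames during which two tubes are both visible."""
    return max(0, min(start_a + len_a, start_b + len_b) - max(start_a, start_b))


def shift_tubes(tubes, threshold=1):
    """Greedily give each tube the earliest new start time that keeps its
    collision cost below the threshold, without reordering start times.
    Expects a non-empty list of tubes and a positive threshold."""
    placed = []
    previous_start = 0
    for tube in sorted(tubes, key=lambda t: t["start"]):
        new_start = previous_start
        while True:
            cost = sum(
                temporal_overlap(new_start, tube["length"], p["new_start"], p["length"])
                for p in placed if p["region"] == tube["region"])
            if cost < threshold:
                break
            new_start += 1
        placed.append({**tube, "new_start": new_start})
        previous_start = new_start
    synopsis_length = max(p["new_start"] + p["length"] for p in placed)
    return placed, synopsis_length


tubes = [
    {"id": "A", "start": 0, "length": 5, "region": "left"},
    {"id": "B", "start": 100, "length": 4, "region": "right"},
    {"id": "C", "start": 250, "length": 3, "region": "left"},
]
placed, synopsis_length = shift_tubes(tubes)
new_starts = [p["new_start"] for p in placed]
```

In this toy input, tubes A and B are shown simultaneously because they occupy different regions, while C is delayed until A has finished, so 253 frames of source video shrink to a synopsis of 8 frames.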

This optimization is performed by grouping the objects of interest temporally and spatially. Referring to FIG. 7, a non-chronological video summary is generated as in result (a) of FIG. 7, and a spatio-temporal group-based video synopsis is generated as in result (b) of FIG. 7. When a user input associated with an object of interest is received, the video summary is generated so that the object of interest is output, as in result (c) of FIG. 7.

In addition, the video summary unit 270 uses a time-lapse background to generate the video summary. Here, the time-lapse background serves as the background of the tubes while representing background changes over time (e.g., the transition between day and night).

As a result, instead of summarizing the tubes of individual objects as conventional methods do, the summary is performed on group tubes in which objects are grouped.

In one embodiment, the video summary unit 270 may shorten the total length of the video summary so that the summary fits within the summary time requested by the user. Here, the reduction ratio is determined by the ratio between the summary time requested by the user and the video summary length L.
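As a small worked example of the reduction ratio, the sketch below assumes the ratio is simply the requested time divided by the summary length L, capped at 1, and that the shortening is done by uniform frame subsampling. Both choices, and the function names, are assumptions made for illustration; the text does not specify how the shortening is carried out.

```python
def reduction_ratio(requested_seconds, summary_seconds):
    """Fraction of the summary to keep so that it fits the requested time."""
    if requested_seconds <= 0 or summary_seconds <= 0:
        raise ValueError("durations must be positive")
    return min(1.0, requested_seconds / summary_seconds)


def select_frames(num_frames, ratio):
    """Indices of the frames kept after uniform temporal subsampling."""
    if num_frames == 0:
        return []
    keep = max(1, round(num_frames * ratio))
    step = num_frames / keep
    return [int(i * step) for i in range(keep)]
```

For instance, a 120-second summary and a 30-second request give a ratio of 0.25, so roughly one frame in four is kept.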

In addition, the video summary generation system 1 may generate a video summary from a plurality of videos. After identifying event objects in the plurality of videos, the video summary generation system 1 generates a video summary of the object of interest using those videos in which an event object corresponding to the object of interest was captured. Because this process is similar to generating a video summary from a single video as described above, a detailed description is omitted.

The path tracking unit 280 may provide additional information to the user based on location information of the imaging devices that captured the videos used for the video summary. In one embodiment, when a video summary for a specific object of interest has been generated, the path tracking unit 280 may generate a movement path of the object of interest based on location information of the imaging devices that captured the event object corresponding to the object of interest, and provide it to the user. The path tracking unit 280 may present the movement path of the object of interest as graphics, text, or a combination thereof.
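One way to picture the movement path is as the time-ordered sequence of camera locations at which the object of interest was seen. The sketch below assumes sightings are available as `(timestamp, camera_id)` pairs and that camera locations are stored in a dictionary; these data shapes and the function names are hypothetical and are not taken from the path tracking unit 280.

```python
def movement_path(sightings, camera_locations):
    """Order sightings by time and collapse consecutive hits on one camera.

    sightings: iterable of (timestamp, camera_id)
    camera_locations: {camera_id: (latitude, longitude)}
    Returns a list of (camera_id, location, first_timestamp).
    """
    path = []
    for timestamp, camera_id in sorted(sightings):
        if not path or path[-1][0] != camera_id:
            path.append((camera_id, camera_locations[camera_id], timestamp))
    return path


def path_as_text(path):
    """Render the path as a short text description."""
    return " -> ".join(f"{camera}@{t}" for camera, _, t in path)
```

The same list could equally be drawn on a map, which corresponds to the graphical presentation mentioned above.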

The video summary generation system 1 may store various data and information produced while generating the video summary in the database 290. For example, the database 290 may store the attribute-related information generated by the attribute extraction unit 220 and the identifier information generated by the identification unit 230.

It will be apparent to those skilled in the art that the video processing unit 200 may include other components not described herein. The video processing unit 200 may also include other hardware elements needed for the operations described herein, including one or more general-purpose or special-purpose processors, memory, storage, networking components (either wireless or wired), an input device for data entry, and an output device for display, printing, or other presentation of data.

FIG. 8 is a flowchart of a method of generating a video summary by identifying an object of interest, according to an embodiment of the present invention. The method is performed by the video summary generation system 1, which includes a plurality of imaging devices 100A, 100B, ..., 100N. As a result, as shown in FIG. 3, a video summary can be generated for the object of interest that the user wants to monitor.

Referring to FIG. 8, an event object is detected in a video captured by an imaging device (S810). An event object is a candidate for the object of interest to be monitored.

For more efficient detection of event objects, background extraction is performed together with detection. In step S810, various separation methods may be used to separate the background from the objects. For example, an enhanced mixture-of-Gaussians (MoG) method based on a Gaussian mixture model (GMM) may be used.
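To give a feel for background/foreground separation in the detection step, the sketch below uses a deliberately simpler stand-in than the GMM-based MoG method named above: a single running Gaussian per pixel rather than a mixture. The function name and the parameters `alpha`, `k`, and `min_var` are assumptions chosen for illustration, and frames are plain nested lists of grayscale values.

```python
def update_background(mean, var, frame, alpha=0.05, k=2.5, min_var=4.0):
    """Update a per-pixel Gaussian background model with one grayscale frame.

    mean, var and frame are lists of rows with the same shape; mean and var
    are updated in place. A pixel is foreground when it lies more than k
    standard deviations from the background mean. Returns the boolean mask.
    """
    mask = []
    for r, row in enumerate(frame):
        mask_row = []
        for c, pixel in enumerate(row):
            diff = pixel - mean[r][c]
            is_foreground = diff * diff > k * k * var[r][c]
            if not is_foreground:
                # Only background pixels adapt the model.
                mean[r][c] += alpha * diff
                var[r][c] = max(min_var, (1 - alpha) * var[r][c] + alpha * diff * diff)
            mask_row.append(is_foreground)
        mask.append(mask_row)
    return mask
```

A true mixture model keeps several such Gaussians per pixel, which lets it handle backgrounds that alternate between a few states, such as swaying branches.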

After step S810, the attributes of the event object are extracted frame by frame (S820). Here, the attributes include the type, position, size, color, and texture (numbers, characters) of the event object.

In one embodiment, the attributes of the event object are extracted using an attribute extraction model that includes at least one of a convolution layer, a pooling layer, and a fully connected layer. In one example, an attribute extraction model having the structure shown in FIG. 5 may be used.

In another embodiment, after the attributes of the event object are extracted using any one of LBP (Local Binary Pattern), HOG (Histogram of Oriented Gradients), and SIFT (Scale-Invariant Feature Transform), the event object may be identified using a Kalman filter based on a kernelized correlation filter (KCF).

After step S820, the event object is identified by comparing, between frames, the attributes of the event objects included in each frame (S830).

In one embodiment, step S830 includes comparing the attributes of the event object with the attributes of the object of interest and identifying them as the same event object when the difference between the attributes is less than a predetermined range. The attribute difference is calculated using at least one of Euclidean distance, cosine distance, Mahalanobis distance, and Joint Bayesian.
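Two of the distances named above are easy to write down directly. The sketch below assumes attributes are represented as equal-length numeric feature vectors and that "less than a predetermined range" means a strict comparison against a single threshold; the function names are illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity; 0 for parallel vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def is_same_object(a, b, threshold, metric=euclidean_distance):
    """Identify two detections as the same object when they are close enough."""
    return metric(a, b) < threshold
```

Mahalanobis distance and Joint Bayesian both need statistics learned from data (a covariance matrix or a trained model), so they are left out of this sketch.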

When event objects are identified as the same event object in this way, a unique identifier is assigned to that event object. The identifier is used for the clustering described below.

In another embodiment, step S830 compares the attributes of event objects between two temporally adjacent frames. In this case, for the event objects that have each been identified as the same event object, the attributes are further compared among those event objects. As a result, event objects that move discontinuously within a single video are identified as one and the same event object.
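The two-stage comparison above (adjacent frames first, then already identified objects) can be sketched as follows. The sketch assumes each detection is a numeric feature vector, uses Euclidean distance with a single threshold, and keeps only the most recent feature vector per identifier; all of these are simplifying assumptions, and the function names are hypothetical.

```python
import math


def _nearest(feature, candidates, threshold, exclude):
    """Identifier of the closest candidate within the threshold, or None."""
    best, best_distance = None, threshold
    for identifier, reference in candidates.items():
        if identifier in exclude:
            continue
        distance = math.dist(feature, reference)
        if distance < best_distance:
            best, best_distance = identifier, distance
    return best


def assign_identifiers(frames, threshold):
    """Assign an identifier to every detection in a sequence of frames.

    frames: list of frames, each a list of feature vectors.
    Returns a list of identifier lists with the same shape.
    """
    known = {}       # identifier -> most recent feature vector
    previous = {}    # identifiers present in the adjacent (previous) frame
    next_id = 0
    result = []
    for detections in frames:
        current = {}
        identifiers = []
        for feature in detections:
            # Stage 1: compare against the temporally adjacent frame.
            match = _nearest(feature, previous, threshold, exclude=current)
            # Stage 2: compare against every object identified so far,
            # which reconnects objects whose movement was interrupted.
            if match is None:
                match = _nearest(feature, known, threshold, exclude=current)
            if match is None:
                match = next_id
                next_id += 1
            current[match] = feature
            known[match] = feature
            identifiers.append(match)
        previous = current
        result.append(identifiers)
    return result
```

In the usage below, the object disappears for one frame and is still given its original identifier when it reappears.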

When the position of the event object has been extracted frame by frame in step S820 as one of its attributes, step S830 may further extract at least one of a movement direction and a movement pattern of the event object based on its frame-by-frame position. This is described in more detail below with reference to FIGS. 11 to 14.

The video summary generation method uses a plurality of videos, each captured by one of a plurality of imaging devices 100A, 100B, ..., 100N located at different points. Accordingly, steps S810 to S830 are performed separately for each video. In one embodiment, the event objects included in each video are identified by comparing the attributes of the event objects included in the plurality of videos.

The identified event objects are then clustered according to their attributes (S840). In one embodiment, in step S840 the event objects are clustered by attribute using their identifiers, and metadata including the identifiers is generated.

After step S840, when a user input including attributes of the object of interest is received, the clusters corresponding to the attributes of the object of interest are filtered based on the attributes of the event objects and the attributes of the object of interest (S850). Because the event objects have already been clustered by attribute, filtering the clusters that correspond to the attributes of the object of interest yields object-of-interest candidates, that is, objects that either are the object of interest or are similar to it.
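Clustering by attribute followed by filtering on the user input can be sketched with plain dictionaries. The sketch assumes each identified event object is a dictionary with an `id` and categorical attributes such as `type` and `color`, and that the user input is a dictionary of required attribute values; these shapes and names are assumptions made for illustration.

```python
from collections import defaultdict


def cluster_objects(objects, keys):
    """Group object identifiers by the tuple of the chosen attribute values."""
    clusters = defaultdict(list)
    for obj in objects:
        clusters[tuple(obj[k] for k in keys)].append(obj["id"])
    return dict(clusters)


def filter_clusters(clusters, keys, query):
    """Keep the clusters whose attributes match every attribute in the query."""
    selected = {}
    for values, identifiers in clusters.items():
        attributes = dict(zip(keys, values))
        if all(attributes.get(k) == v for k, v in query.items()):
            selected[values] = identifiers
    return selected
```

Because the grouping is done once in advance, answering a new user input only requires scanning the cluster keys rather than every detection.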

After step S850, tubes are generated for the filtered clusters corresponding to the object of interest (S860), and a video summary of the object of interest is generated based on the tubes of the filtered clusters (S870). In one embodiment, the video summary is generated by minimizing spatial collisions while maintaining temporal consistency between the tubes for the object of interest (S870).

In one embodiment, the method of generating a video summary by identifying an object of interest may further include providing the user with a movement path of the object of interest, based on location information of the imaging devices that captured the event object corresponding to the object of interest.

FIG. 9 is a diagram showing video summaries generated for objects of interest having a specific size, according to an embodiment of the present invention, and FIG. 10 is a diagram showing video summaries generated for objects of interest of a specific type, according to an embodiment of the present invention.

When the user input includes information about a specific size, the video summary generated in step S870 includes event objects whose size corresponds to the specified size. Referring to FIG. 9, when the user input includes first size information, the video summary includes event objects of the corresponding size (cars). When the user input instead includes second size information, smaller than the first size, the video summary includes event objects smaller than a car (people, motorcycles, etc.).

When the user input includes information about a specific type, the video summary generated in step S870 includes event objects of the corresponding type. Referring to FIG. 10, when the user input includes a first type (person), the video summary includes people. When the user input includes a second type (bicycle), the video summary includes bicycles. Likewise, when the user input includes a third type (car), the video summary includes cars.

Additionally, when the position of the event object has been extracted as one of its attributes, the attribute extraction unit 220 may extract further attributes. In one embodiment, when the position of the event object has been extracted frame by frame, then after the same event object has been identified in each frame, the attribute extraction unit 220 computes the frame-by-frame change in position of that event object to obtain its motion trajectory, and may further extract motion-related attributes such as movement direction and movement pattern based on the trajectory.

FIG. 11 is a flowchart of a process for extracting motion-related attributes.

In one embodiment, the motion-related attributes are extracted by: calculating the density of the motion trajectories of event objects (S1121); determining whether the density exceeds a predetermined threshold (S1122); determining a region in which the density exceeds the predetermined threshold to be a key region (S1123); and extracting at least one of a movement direction and a movement pattern of the event object based on the key region (S1125).
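The density, threshold, key-region, and direction steps listed above can be sketched on a coarse grid. The sketch assumes trajectories are lists of `(x, y)` points, measures density as the number of trajectory points falling in each grid cell, and labels connected groups of dense cells as key regions; the grid representation, the 4-connectivity, and the function names are assumptions made for illustration.

```python
from collections import Counter


def density_map(trajectories, cell):
    """Count trajectory points per square grid cell of side `cell`."""
    counts = Counter()
    for track in trajectories:
        for x, y in track:
            counts[(int(x // cell), int(y // cell))] += 1
    return counts


def key_regions(counts, threshold):
    """Label 4-connected groups of cells whose density exceeds the threshold.

    Returns {cell: region_label}.
    """
    dense = {c for c, n in counts.items() if n > threshold}
    labels = {}
    next_label = 0
    for seed in sorted(dense):
        if seed in labels:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            cx, cy = stack.pop()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in labels:
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1
    return labels


def movement_direction(track, labels, cell):
    """(first key region, last key region) visited by a track, or None."""
    visited = []
    for x, y in track:
        c = (int(x // cell), int(y // cell))
        if c in labels:
            visited.append(labels[c])
    if not visited:
        return None
    return (visited[0], visited[-1])
```

A direction such as `(0, 1)` can then be stored as an additional attribute label of the event object and matched against a direction given in the user input.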

Accordingly, objects of interest can additionally be filtered by movement direction, movement pattern, and the like, and a video summary can be generated for them.

FIG. 12 is a diagram showing extracted key regions according to an embodiment of the present invention. FIG. 12 shows a video frame of a three-way intersection. When the motion trajectories of many event objects are computed, the regions where trajectory density is high appear as three regions (key regions): a red region, a blue region, and a green region. Because an event object moves between at least two of the three regions, its movement direction or movement pattern can be expressed using the three key regions, either singly or in combination.

The attribute extraction unit 220 determines the key regions in the frame and extracts the motion-related attributes of the event object based on them.

FIG. 13 is a conceptual diagram of a process in which a specific direction is extracted as an attribute of an event object, according to an embodiment of the present invention.

In FIG. 13, the motion trajectory of the event object (e.g., a van) starts in the red region and ends in the green region, as indicated by the arrow. As a result, the attribute extraction unit 220 can extract the motion-related attribute of the van in FIG. 13 using the two key regions (the red region and the green region).

The direction of the yellow arrow is extracted from the van of FIG. 13 as a motion-related attribute, and the extracted attribute is additionally attached to the corresponding event object as a label. Accordingly, when a user input is received that includes object-of-interest information specifying the direction of FIG. 13, event objects having that direction are included in the video summary.

FIG. 14 is a diagram showing a video summary generated for objects of interest having a different specific direction, according to an embodiment of the present invention.

Referring to FIG. 14, a video summary may be generated that includes event objects moving from the right-hand path toward the lower path (that is, from the blue key region to the green key region of FIG. 12).

In another embodiment, when extracting motion-related attributes, the method of generating a video summary by identifying an object of interest does not compute the movement of the event object directly from its position, but instead extracts the motion-related attributes from the tube of the event object.

Such a method of generating a video summary by identifying an object of interest may include: detecting event objects in a plurality of videos captured by imaging devices; extracting a first attribute of the event object for each frame, wherein the first attribute includes a type; identifying the event object by comparing, between frames, the attributes of the event objects included in each frame; generating a tube of the identified event object; extracting a second attribute of the event object based on the tube of the event object, wherein the second attribute includes a motion-related attribute; clustering the identified event objects according to their attributes; when a user input is received, filtering the clusters corresponding to the attributes of the object of interest based on the attributes of the event objects and the attributes of the object of interest; and generating a video summary of the object of interest based on the filtered tubes.

In yet another embodiment, there may be a plurality of user inputs. For example, a first user input relates to the first attribute and a second user input relates to the second attribute. In this case, the corresponding attribute of an event object may be extracted in response to a user input, when such an input is received.

The details of these embodiments are largely similar to the embodiment of FIG. 8, so a detailed description is omitted.

The method of generating a video summary by identifying an object of interest according to the embodiments described above, and the operations of the video summary generation system that performs it, may be implemented at least in part as a computer program and recorded on a computer-readable recording medium. For example, they may be implemented together with a program product consisting of a computer-readable medium containing program code, which can be executed by a processor to perform any or all of the steps, operations, or processes described.

The computer-readable recording medium includes every kind of recording device that stores data readable by a computer. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed across computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner. In addition, functional programs, code, and code segments for implementing the present embodiments will be readily understood by those of ordinary skill in the art to which the embodiments belong.

The present invention has been described above with reference to the embodiments shown in the drawings, but these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiments are possible. Such modifications, however, should be regarded as falling within the technical scope of protection of the present invention. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

Video surveillance equipment such as intelligent CCTV systems has recently become steadily more advanced and more widely used. The present invention summarizes video covering the full recording period from a plurality of video surveillance devices into a short duration, so that the object of interest to be monitored can be reviewed with maximum efficiency and convenience from a minimum of information. It is therefore expected to gain easy access to the relevant market and to have a large ripple effect.

In addition, machine learning, one of the technologies of the fourth industrial revolution, makes it possible to identify specific objects of interest automatically and generate video summaries for them. In particular, because objects of interest can be identified from various aspects depending on the purpose, the invention can be applied in a variety of fields.

For example, a video summary can be generated from a plurality of video surveillance devices for the captured objects identified as a specific object of interest, so the invention can be used to track the object of interest (e.g., tracking a suspect's vehicle). It can therefore be especially useful for tracking criminals.

In addition, a video summary can be generated for objects of interest moving in a specific direction, so the invention can be used to analyze a specific situation (for example, when accidents frequently occur in a particular direction at an intersection).

Claims

A method of generating a video summary by identifying an object of interest, the method comprising:
detecting an event object in a video captured by an imaging device;
extracting attributes of the event object for each frame;
identifying the event object by comparing, between frames, the attributes of the event object included in each frame;
clustering the identified event objects according to the attributes of the event objects;
when a user input including an attribute of the object of interest is received, filtering a cluster corresponding to the attribute of the object of interest based on the attributes of the event objects and the attribute of the object of interest;
generating a tube of the filtered cluster, wherein the tube of the filtered cluster is generated by forming tubes from the event objects included in the filtered cluster; and
generating a video summary of the object of interest based on the tube of the filtered cluster,
wherein generating the video summary of the object of interest comprises:
shifting the tubes of the event objects included in the cluster corresponding to the attribute of the object of interest, wherein the tubes of the event objects are shifted, while maintaining temporal consistency, until spatial collision falls below a predetermined threshold.

The method of claim 1, wherein identifying the event object comprises:
comparing the attributes of the event object with the attributes of the object of interest, and identifying them as the same event object when the difference between the attributes is less than or equal to a predetermined first threshold.

The method of claim 2, wherein identifying the event object comprises:
comparing the attributes of event objects between two temporally adjacent frames.

The method of claim 3, wherein identifying the event object further comprises:
for one or more event objects each identified as the same event object, further comparing the attributes among the one or more event objects.

The method of claim 1, wherein the video
is captured by each of a plurality of imaging devices located at different points.

The method of claim 5,
further comprising identifying an event object between at least two of the videos.

The method of claim 1, wherein extracting the attributes of the event object comprises:
using an attribute extraction model comprising at least one of a convolution layer, a pooling layer, and a fully connected layer.

The method of claim 1, wherein extracting the attributes of the event object comprises:
using any one of LBP (Local Binary Pattern), HOG (Histogram of Oriented Gradients), and SIFT (Scale-Invariant Feature Transform).

The method of claim 8, wherein identifying the event object comprises:
using a Kalman filter based on a kernelized correlation filter (KCF).

The method of claim 1, wherein extracting the attributes of the event object comprises:
when the position of the event object is extracted for each frame, further extracting, after the event object is identified, at least one of a movement direction and a movement pattern of the event object based on the frame-by-frame position of the event object.

The method of claim 1, wherein generating the video summary of the object of interest comprises:
generating the video summary by minimizing spatial collisions while maintaining temporal consistency between the tubes corresponding to the object of interest.

The method of claim 1,
further comprising providing a user with a movement path of the object of interest based on location information of the imaging device that captured the event object corresponding to the object of interest.

A computer-readable recording medium storing program instructions that are readable and executable by a computer, wherein the program instructions, when executed by a processor of the computer, cause the processor to perform the method of generating a video summary by identifying an object of interest according to any one of claims 1 to 12.

A method of generating a video summary by identifying an object of interest, the method comprising:
detecting an event object in a plurality of videos captured by imaging devices;
extracting a first attribute of the event object for each frame;
identifying the event object by comparing, between frames, the attributes of the event object included in each frame;
generating a tube of the identified event object;
extracting a second attribute of the event object based on the tube of the event object;
clustering the identified event objects according to the attributes of the event objects;
when a user input is received, filtering a cluster corresponding to the attribute of the object of interest based on the attributes of the event objects and the attribute of the object of interest;
generating a tube of the filtered cluster, wherein the tube of the filtered cluster is generated by forming tubes from the event objects included in the filtered cluster; and
generating a video summary of the object of interest based on the tube of the filtered cluster,
wherein generating the video summary of the object of interest comprises:
shifting the tubes of the event objects included in the cluster corresponding to the attribute of the object of interest, wherein the tubes of the event objects are shifted, while maintaining temporal consistency, until spatial collision falls below a predetermined threshold.

The method of claim 14, wherein extracting the second attribute of the event object comprises:
calculating a density of the tubes of the event objects;
determining whether the density exceeds a predetermined second threshold;
determining a region in which the density exceeds the second threshold as a key region; and
extracting at least one of a movement direction and a movement pattern of the object of interest based on the key region.

The method of claim 14, wherein
the first attribute includes a type of the event object, and
the second attribute includes at least one of a movement direction and a movement pattern of the event object.

A video summary generation system that performs a method of generating a video summary by identifying an object of interest, the system comprising:
a plurality of imaging devices; and a video processing unit, wherein the video processing unit comprises:
an event object detection unit that detects an event object in videos captured by the plurality of imaging devices;
an attribute extraction unit that extracts attributes of the event object;
an identification unit that identifies the event object by comparing the attributes of the event object between frames;
a clustering unit that clusters the identified event objects according to the attributes of the event objects;
a filtering unit that, when a user input including an attribute of an object of interest is received, filters a cluster corresponding to the attribute of the object of interest based on the attributes of the event objects and the attribute of the object of interest;
a tube generation unit that generates a tube of the cluster corresponding to the object of interest; and
a video summary unit that generates a video summary of the object of interest based on the tube of the cluster,
wherein the tube generation unit
is further configured to obtain the cluster filtered in response to the attribute of the object of interest as the cluster corresponding to the object of interest, and to generate the tube of the cluster by forming tubes from the event objects included in the filtered cluster, and
wherein the video summary unit
is further configured to shift the tubes of the event objects included in the cluster corresponding to the attribute of the object of interest, the tubes of the event objects being shifted, while maintaining temporal consistency, until spatial collision falls below a predetermined threshold.