KR20130110417A

KR20130110417A - Method for analyzing video stream data using multi-channel analysis

Info

Publication number: KR20130110417A
Application number: KR1020120032383A
Authority: KR
Inventors: 이바도; 석호식; 장병탁
Original assignee: 서울대학교산학협력단
Priority date: 2012-03-29
Filing date: 2012-03-29
Publication date: 2013-10-10
Also published as: WO2013147374A1; KR101369270B1

Abstract

PURPOSE: A video stream analysis method using multi-channel analysis is provided that separates a video stream into an image channel and a sound channel, learns each channel, and analyzes the semantic boundaries of the video stream by referring to likelihood changes based on the learned data. CONSTITUTION: The video stream analysis method comprises: separating video stream data into an image channel and a sound channel and constructing a learning model for each of the separated channels (S1010, S1020); estimating, using the constructed image-channel and sound-channel learning models, the likelihood of the current frame set with respect to the following frame set (S1030); and recording the trend of the estimated likelihood and determining a story transition point of the video stream using a peak point that exceeds a predetermined threshold in the recorded likelihood trend (S1040). [Reference numerals] (AA) Start; (BB) End; (S1010) Separate video stream data into an image channel and a sound channel; (S1020) Construct a learning model for each of the separated image and sound channels; (S1030) Estimate the likelihood of the current frame set with respect to the following frame set using each learning model; (S1040) Determine a story transition point using a peak point exceeding a predetermined threshold

Description

Video stream analysis method using multi channel analysis {METHOD FOR ANALYZING VIDEO STREAM DATA USING MULTI-CHANNEL ANALYSIS}

The present invention relates to a technique for analyzing video streams and, more specifically, to a video stream analysis method using multi-channel analysis that separates a video stream into an image channel and a sound channel, learns each channel, and identifies and analyzes the semantic boundaries of the video stream by referring to likelihood changes based on the learned data.

With advances in video technology, video data is being analyzed in various ways. In particular, various attempts have recently been made to segment video streams semantically.

In the past, however, such analysis went no further than detecting changes in the visual composition of the screen in order to segment or compare images. For example, it merely analyzed screen transitions using changes in the RGB (Red-Green-Blue) values of an image, and was therefore limited in that it could not achieve genuine semantic segmentation.

The present invention aims to provide a video stream analysis method using multi-channel analysis that separates a video stream into an image channel and a sound channel, learns each channel, and analyzes the dependency relationships of the semantic boundaries of the video stream by referring to likelihood changes based on the learned data.

The present invention also aims to provide a video stream analysis method using multi-channel analysis that learns the image channel more flexibly and accurately by performing hierarchical learning rather than restricting the learning to specific features.

In some embodiments, the video stream analysis method is performed in a video stream analysis apparatus that receives a video stream and analyzes the story transitions of the received video stream. The video stream analysis method includes (a) separating video stream data into an image channel and a sound channel and constructing a learning model for each of the separated image and sound channels, (b) estimating, using the constructed learning models of the image channel and the sound channel, the likelihood of the current frame set with respect to the following frame set, and (c) recording the trend of the estimated likelihood and determining a story transition point of the video stream using a peak point that exceeds a predetermined threshold in the recorded likelihood trend.

In one embodiment, step (a) may include separating the video stream data into the image channel and the sound channel, and constructing the learning model of the image channel using visual words for the separated image channel.

In one embodiment, constructing the learning model of the image channel may include extracting the visual words by applying the Scale-Invariant Feature Transform (SIFT) to the image channel.
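The visual-word step can be illustrated with a minimal sketch. It assumes that local descriptors (for example from SIFT) have already been extracted as sequences of floats and that a codebook of centroids is available. The names `nearest_word` and `bag_of_words`, the codebook, and the toy 2-D descriptors are hypothetical and are not taken from the patent, which does not state how descriptors are quantized.

```python
from collections import Counter
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor: Sequence[float],
                 codebook: List[Sequence[float]]) -> int:
    """Index of the codebook centroid closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def bag_of_words(descriptors: List[Sequence[float]],
                 codebook: List[Sequence[float]]) -> List[int]:
    """Histogram of visual-word counts for one frame."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(codebook))]


# Toy 2-D "descriptors" (assumption); real SIFT descriptors have 128 dimensions.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frame = [(0.1, 0.2), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
histogram = bag_of_words(frame, codebook)
```

Each frame is then represented by the list of word indices or by the histogram, which is the kind of discrete input a topic-style model such as an HDP consumes.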

In one embodiment, step (a) may include extracting features from the separated sound channel using the Mel Frequency Cepstral Coefficients (MFCC) algorithm.
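The patent gives no MFCC parameters, so the following is only a sketch of one ingredient of MFCC extraction: the mel-scale mapping used to place the filterbank centres. It assumes the common convention mel = 2595 * log10(1 + f / 700); the frequency range and the filter count of 26 are illustrative assumptions.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (common 2595/700 convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(f_min: float, f_max: float,
                           n_filters: int) -> List[float]:
    """Centre frequencies (Hz) of filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Illustrative values only: 0 to 8000 Hz with 26 filters.
centers = mel_center_frequencies(0.0, 8000.0, 26)
```

A full MFCC pipeline would additionally frame the signal, take a power spectrum, apply triangular filters at these centres, take logarithms, and apply a discrete cosine transform.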

In one embodiment, the learning model of the sound channel may be built on the continuity of the dialogue reflected in the sound channel.

In one embodiment, step (c) may include determining a first peak point of the likelihood estimated by the learning model of the image channel as the story transition point if a second peak point of the likelihood estimated by the learning model of the sound channel exists within a predetermined time range centered on the first peak point.
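The peak-matching rule of this embodiment can be sketched as follows. The sketch assumes each channel yields a per-frame score whose peaks mark likely changes; `find_peaks`, `story_transition_points`, the threshold, the window, and the toy trends are all hypothetical, and the patent does not specify how peaks are detected.

```python
from typing import List


def find_peaks(trend: List[float], threshold: float) -> List[int]:
    """Indices of local maxima whose value exceeds the threshold."""
    peaks = []
    for i in range(1, len(trend) - 1):
        if (trend[i] > threshold
                and trend[i] > trend[i - 1]
                and trend[i] >= trend[i + 1]):
            peaks.append(i)
    return peaks


def story_transition_points(image_peaks: List[int],
                            sound_peaks: List[int],
                            window: int) -> List[int]:
    """Keep an image peak only if a sound peak lies within +/- window frames."""
    return [p for p in image_peaks
            if any(abs(p - s) <= window for s in sound_peaks)]


# Toy per-frame scores (assumed data, one value per frame set).
image_trend = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.1, 0.7, 0.2, 0.1]
sound_trend = [0.0, 0.1, 0.9, 0.1, 0.0, 0.0, 0.1, 0.2, 0.8, 0.0]

image_peaks = find_peaks(image_trend, threshold=0.5)
sound_peaks = find_peaks(sound_trend, threshold=0.5)
transitions = story_transition_points(image_peaks, sound_peaks, window=1)
```

In this toy run the image peak at frame 4 is discarded because no sound peak falls within one frame of it, while the peaks at frames 1 and 7 are kept.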

In one embodiment, step (c) may include determining the transition point using a sequential hierarchical Dirichlet process (sHDP, serial Hierarchical Dirichlet Process), which divides the image channel and the sound channel into sub-streams and processes each separately.

In some embodiments, the video stream analysis apparatus receives a video stream and analyzes the story transitions of the received video stream. The video stream analysis apparatus includes a separation module, an image learning module, a sound learning module, and a control module. The separation module separates the video stream into an image stream and a sound stream. The image learning module extracts visual words from the separated image stream and uses them to estimate the likelihood of the current frame set with respect to the following frame set. The sound learning module models the separated sound stream on the basis of the continuity of dialogue and uses the resulting model to estimate the likelihood of the current frame set with respect to the following frame set. The control module identifies peak points that exceed a predetermined threshold in the likelihood trends estimated by the image learning module and the sound learning module, and determines the story transition points of the video stream by associating the peak points of the image stream with those of the sound stream.

In one embodiment, the control module may determine the transition point using a sequential hierarchical Dirichlet process (sHDP, serial Hierarchical Dirichlet Process), which divides the image channel and the sound channel into sub-streams and processes each separately.

In one embodiment, the control module may determine a first peak point of the likelihood estimated by the image learning module as the story transition point if a second peak point of the likelihood estimated by the sound learning module exists within a predetermined time range centered on the first peak point.

In some embodiments, a recording medium stores a program for executing the video stream analysis method. The program can run on a video stream analysis apparatus that receives a video stream and analyzes the story transitions of the received video stream, and includes (a) a function of separating video stream data into an image channel and a sound channel and constructing a learning model for each of the separated image and sound channels, (b) a function of estimating, using the constructed learning models of the image channel and the sound channel, the likelihood of the current frame set with respect to the following frame set, and (c) a function of recording the trend of the estimated likelihood and determining a story transition point of the video stream using a peak point that exceeds a predetermined threshold in the recorded likelihood trend.

According to the present invention, a video stream is separated into an image channel and a sound channel and each channel is learned, so that the dependency relationships of the semantic boundaries of the video stream can be analyzed by referring to likelihood changes based on the learned data.

In addition, according to the present invention, the image channel can be learned more flexibly and accurately by performing hierarchical learning rather than restricting the learning to specific features.

FIG. 1 is a reference diagram schematically illustrating the data processing scheme according to the present invention.
FIG. 2 is a reference diagram illustrating the algorithm for sHDP modeling according to the present invention.
FIG. 3 is a reference diagram for explaining the sHDP model.
FIGS. 4 to 6 are graphs showing the trends of recall, precision, and correct answers for episodes 1 to 3, respectively.
FIG. 7 is a graph showing the F1 measure for episodes 1 to 3.
FIG. 8 is a graph showing the effect of the channel merging method according to the present invention.
FIG. 9 is a block diagram illustrating an embodiment of the video stream analysis apparatus according to the present invention.
FIG. 10 is a flowchart illustrating an embodiment of the video stream analysis method according to the present invention.

The description of the present invention is merely an embodiment given for structural or functional explanation, so the scope of the present invention should not be construed as limited by the embodiments described herein. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the present invention should be understood to include equivalents capable of realizing its technical idea.

The terms used in the present invention should be understood as follows.

The terms "first", "second", and the like are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" to another component, it may be connected to that component directly, but other components may also exist in between. When a component is said to be "directly connected" to another component, by contrast, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "directly between", or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

Singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The identifiers attached to the steps (for example, a, b, c) are used for convenience of explanation and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one written. That is, the steps may occur in the stated order, may be performed substantially at the same time, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, which includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

Unless defined otherwise, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as consistent with their meaning in the context of the related art and, unless explicitly defined in the present invention, should not be interpreted as having an idealized or excessively formal meaning.

1. Sequential HDP

An episode of a TV drama consists of several story segments (scenes). Each story segment consists of a sequence of consecutive frames (the image channel) and, at the same time, a set of dialogues (the sound channel). Each frame consists of several consecutive image patches. The sound channel may likewise consist of dialogues.

The present invention performs its processing on the basis of this hierarchical structure using a sequential hierarchical Dirichlet process (sHDP, serial Hierarchical Dirichlet Process). That is, the sHDP divides the hierarchical structure of the video stream into sub-streams (image and sound) and processes each separately.

FIG. 1 is a reference diagram schematically illustrating the data processing scheme according to the present invention. Referring to FIG. 1, the present invention may perform estimation on the images using a hidden variable model and may process the sound separately from the images. Unlike general structure-learning approaches, the sHDP of the present invention can estimate story segments by judging, for each modality (image and sound), the likelihood that the learned model generates the data that follows, or the continuity of the data. The present invention also considers estimating transition points following a Bayesian methodology. In TV dramas, however, abrupt changes in the image are very frequent, so judging change points from the likelihood alone yields far too many estimated change points. To compensate, the present invention may apply Dynamic Channel Merging, which dynamically adjusts each channel's contribution to the transition-point decision.

1.1 Sequential HDP Structure

In the following, the present invention is described assuming a drama episode as the video stream. The image channel of a drama episode can be expressed as a set of story segments. Theta(ij) is a visual word and can be likened to a customer in the Chinese Restaurant Franchise (CRF). Each restaurant corresponds to a frame, which can be generated by the measure G(j). The video stream can share a global set of conceptual objects (the menu, Phi(k)). One frame may consist of several conceptual objects. The role of the table (t(ij)) is to link the menu with the customers (visual words) seated at that table.
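The restaurant metaphor can be made concrete with a small simulation. This sketch shows only the seating step inside a single restaurant (one frame), using the ordinary Chinese restaurant process: a customer joins an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter. The franchise-level sharing of menu items across restaurants is omitted, and `crp_seating`, `alpha`, and the seed are illustrative assumptions.

```python
import random
from typing import List


def crp_seating(n_customers: int, alpha: float,
                rng: random.Random) -> List[int]:
    """Seat customers by the Chinese restaurant process; return table sizes."""
    tables: List[int] = []
    for n_seated in range(n_customers):
        # Existing table t has weight tables[t]; a new table has weight alpha.
        r = rng.uniform(0.0, n_seated + alpha)
        cumulative = 0.0
        for t, size in enumerate(tables):
            cumulative += size
            if r < cumulative:
                tables[t] += 1
                break
        else:
            # No existing table was chosen: open a new one.
            tables.append(1)
    return tables


# 100 visual words in one frame, concentration 1.0 (assumed values).
table_sizes = crp_seating(100, alpha=1.0, rng=random.Random(0))
```

Larger values of `alpha` tend to produce more tables, which in the metaphor corresponds to a frame being explained by more distinct groups of visual words.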

FIG. 2 is a reference diagram illustrating the algorithm for sHDP modeling according to the present invention, and FIG. 3 is a reference diagram for explaining the sHDP model.

As FIG. 2 suggests, the hierarchical structure of the HDP makes it a good model for describing an image stream. However, the HDP is vulnerable to sudden fluctuations in the image channel, such as camera movement or screen distortion. The Sticky HDP-HMM model has been introduced to obtain results that are stable against such sudden changes. Instead of introducing a new transition probability, the present invention discloses a method that exploits the multi-modality of the video stream and combines the modalities dynamically.

An HDP model can be used for the image channel. With it, when a new frame appears, the likelihood that the current model generates the next frame set can be estimated. If this likelihood is lower than the threshold, a new HDP model can be estimated using the algorithm of FIG. 2.
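The threshold rule can be sketched with a deliberately simple stand-in for the HDP: a smoothed unigram model over visual-word ids. Only the control flow (score the next frame under the current model and start a new model when the score falls below a threshold) follows the text; `UnigramModel`, `segment`, the threshold, and the toy frames are assumptions, and the algorithm of FIG. 2 is not reproduced here.

```python
import math
from collections import Counter
from typing import List


class UnigramModel:
    """Smoothed unigram model over visual-word ids (a stand-in for the HDP)."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.counts: Counter = Counter()
        self.total = 0

    def update(self, frame: List[int]) -> None:
        self.counts.update(frame)
        self.total += len(frame)

    def avg_log_likelihood(self, frame: List[int]) -> float:
        # Add-one smoothing keeps unseen words from producing log(0).
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[w] + 1) / denom)
                   for w in frame) / len(frame)


def segment(frames: List[List[int]], vocab_size: int,
            threshold: float) -> List[int]:
    """Return the indices of frames at which a new model is started."""
    model = UnigramModel(vocab_size)
    model.update(frames[0])
    boundaries = []
    for i in range(1, len(frames)):
        if model.avg_log_likelihood(frames[i]) < threshold:
            boundaries.append(i)
            model = UnigramModel(vocab_size)  # start learning afresh
        model.update(frames[i])
    return boundaries


# Toy frames: words 0 and 1 dominate at first, then words 2 and 3 take over.
frames = [[0, 0, 1], [0, 1, 0], [0, 0, 1], [2, 3, 3], [3, 2, 3], [3, 3, 2]]
boundaries = segment(frames, vocab_size=4, threshold=math.log(0.2))
```

With these toy frames the only restart happens at frame 3, where the vocabulary in use changes completely.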

An HDP model of this kind, however, can only detect changes within a semantic unit and cannot detect changes across whole story segments. In the paper by Ahmed and Xing, the weight of each channel is halved over time, but the half-life is fixed, which is not realistic, because the most realistic evidence of change in a video stream is found in the individual channels. The present invention therefore additionally includes the sound channel in the model, as depicted in panel (c) of FIG. 3.

Equations 1 to 3 describe the HDP model.
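For orientation only, the generative model of a hierarchical Dirichlet process as it is usually written in the HDP literature is given below, using the document's symbols H, Gamma, Alpha, G(j), and Theta(ij). Whether this matches the patent's Equations 1 to 3 term by term is an assumption; the global measure G_0, the observation x_{ij}, and the observation distribution F are names introduced here.

```latex
\begin{align}
  G_0 \mid \gamma, H &\sim \mathrm{DP}(\gamma, H) \\
  G_j \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0) \\
  \theta_{ij} \mid G_j \sim G_j, &\qquad x_{ij} \mid \theta_{ij} \sim F(\theta_{ij})
\end{align}
```

Under this reading, G_0 carries the global set of conceptual objects shared by all frames, each G_j is the frame-specific measure, and each theta_{ij} is the parameter behind the i-th visual word of frame j.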

As described above, the present invention may additionally take changes in the sound channel into account in the image model. Depending on the recognition performance for the sound channel, a hidden variable model need not be used for the sound channel. A characteristic of the sound channel is that it may not coincide exactly with story changes, because a scene regarded as a single segment may contain parts without speech. On the other hand, a scene change is unlikely to occur in the middle of a dialogue. Building on this observation, the present invention discloses a dynamic model that considers changes in both the image channel and the sound channel.

1.2 Dynamic Channel Merging

The proposed method estimates change points by considering […], a measure of the likelihood change, and […], the value of the dialogue-detection measure. This is expressed by Equations 4 and 5.

Here, […] expresses the likelihood difference in the image channel, and […] determines whether a dialogue is continuing. […] is a function giving the likelihood that […] appears given […]. […] takes the value 1 when the likelihood of the current frame is lower than that of the previous frame. […] reflects the insight that a sharp rise following a sharp drop in likelihood signals the appearance of a new scene. The sound model captures the continuity of dialogue, so Equation 4 ignores a likelihood change that occurs while a dialogue is in progress and treats a likelihood change that falls between dialogues as a candidate change point.
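The gating idea can be sketched as follows. This is not a reconstruction of Equations 4 and 5: it assumes a per-frame image likelihood and a per-frame boolean dialogue flag, and it flags a frame as a candidate when the likelihood drops sharply and then rises sharply while no dialogue is in progress. The function name, `min_change`, and the toy data are hypothetical.

```python
from typing import List


def candidate_change_points(likelihood: List[float],
                            in_dialogue: List[bool],
                            min_change: float) -> List[int]:
    """Frames with a sharp drop then rise in likelihood, outside dialogue."""
    candidates = []
    for t in range(1, len(likelihood) - 1):
        drop = likelihood[t - 1] - likelihood[t]   # positive if likelihood fell
        rise = likelihood[t + 1] - likelihood[t]   # positive if it recovers
        if drop >= min_change and rise >= min_change and not in_dialogue[t]:
            candidates.append(t)
    return candidates


# Toy data: two sharp dips, the first of which falls inside a dialogue.
likelihood = [0.9, 0.2, 0.8, 0.85, 0.1, 0.9, 0.9]
in_dialogue = [False, True, True, False, False, False, False]
candidates = candidate_change_points(likelihood, in_dialogue, min_change=0.5)
```

The dip at frame 1 is ignored because a dialogue is in progress there, so only frame 4 is reported.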

1.3 Posterior sampling

The sampling method and the estimation framework may be based on Heinrich's methodology. The hyperparameters in this process may consist of the baseline measure H and the concentration Gamma. Here, Gamma and Alpha represent the popularity of the current identifiers and of the hidden model. The hyperparameters of the DP are estimated as described in the HDP paper. For posterior estimation given x, the changes in (k, t) must be estimated. Here, […] links a group of visual words to its identifier.

1.3.1 Sampling an identifier and a conceptual object

In assigning identifiers, a preferred conceptual object tends to retain its popularity. To keep the preference from becoming biased, it must also be possible to select a new identifier. The present invention therefore makes the following assumptions: (1) a person or a large object is likely to reappear within the same story; (2) a new object is needed to describe a new frame. Following this intuition, identifiers may be sampled according to Equation 6 and conceptual objects according to Equation 7.

Here, […] denotes the number of parameters (visual words) associated with […]. […] denotes the likelihood of regenerating […] given […]. […] denotes the time elapsed from the end of the previous story to the current scene. […] denotes the popularity of the conceptual object t.

2. Experiments

The following describes the results of applying the video stream analysis method according to the present invention to the segmentation of TV drama episodes. A story is generally defined as "a segment of news containing two or more homogeneous independent clauses that describe a single event". In the present invention, for the purpose of segmenting stories by meaning, a story is instead defined as "a segment made up of several dialogues and images that share a homogeneous topic". A segment in the present invention resembles a segment under the general definition in that the stream is divided into several segments, each of which is homogeneous.

2.1 Data and Representation

2.1.1 Data

Nineteen subjects manually recorded the story transition points in the data. The test data is 125 minutes 30 seconds long in total, and 7,530 scenes in all were extracted and used. Because the probability that several people judge exactly the same instant to be a transition point is very small, the interval spanned by their transition points is defined as a transition interval.
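One way to turn several annotators' marks into transition intervals is sketched below. The patent does not say how the marks were grouped, so the rule used here (marks no more than `max_gap` seconds apart belong to the same interval) is an assumption, as are the function name and the toy timestamps.

```python
from typing import List, Tuple


def transition_intervals(marks: List[float],
                         max_gap: float) -> List[Tuple[float, float]]:
    """Group annotator timestamps (seconds) into (start, end) intervals."""
    intervals: List[Tuple[float, float]] = []
    for mark in sorted(marks):
        if intervals and mark - intervals[-1][1] <= max_gap:
            # Close enough to the previous mark: extend the current interval.
            intervals[-1] = (intervals[-1][0], mark)
        else:
            intervals.append((mark, mark))
    return intervals


# Pooled marks from several annotators (assumed data, in seconds).
marks = [41.0, 10.0, 12.0, 11.0, 40.0, 90.0]
intervals = transition_intervals(marks, max_gap=2.0)
```

Under this rule a single isolated mark yields an interval of zero length, and a run of nearby marks yields one interval covering all of them.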

Table 1 shows the statistical characteristics of each episode. In Table 1, a transition interval of 1 second is reasonable, whereas 11 seconds is rather long. This maximum interval corresponds to a scene that cuts rapidly between shots of the sky or a city backdrop. In such a scene it is hard to single out one particular point as the transition point.

2.1.2 Representation

Visual words were extracted using SIFT. A scene of the video stream consists of image and sound data, and the sound channel can be processed by Mel Frequency Cepstral Coefficients (MFCC) extraction. To process multimodal data, other researchers have focused on the momentary co-occurrence of structures across modalities, but that approach is hard to apply to story transition detection because a common structure is difficult to devise. The present invention therefore uses the sHDP to process the images and, for the sound, judges the continuity of dialogue.

2.1.3 Experimental Results

Table 2 lists precision, recall, and the F1 measure. It also gives the human-judged ground truth used for comparison with the performance of the proposed method. The numbers of transition points detected in the three episodes are 156, 186, and 176, respectively, and the numbers of transition intervals are 43, 39, and 39. Because performance depends on the likelihood threshold, only the best results are recorded in the table. FIGS. 4 to 6 are graphs showing the trends of recall, precision, and correct answers for episodes 1 to 3, respectively; they plot recall, precision, and the number of correct answers as functions of the threshold. Equation 8 defines recall and precision, and Equation 9 defines the threshold.
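The three measures can be computed against interval-valued ground truth as in the sketch below. The matching rule is an assumption rather than the patent's Equation 8: a detected point counts as correct if it falls inside any transition interval, and an interval counts as found if at least one detection falls inside it. The function name and toy values are hypothetical.

```python
from typing import List, Tuple


def evaluate(detected: List[float],
             intervals: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for detections against intervals."""
    hits = [p for p in detected
            if any(lo <= p <= hi for lo, hi in intervals)]
    found = [iv for iv in intervals
             if any(iv[0] <= p <= iv[1] for p in detected)]
    precision = len(hits) / len(detected) if detected else 0.0
    recall = len(found) / len(intervals) if intervals else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: five detections against three ground-truth intervals.
truth = [(10.0, 12.0), (40.0, 41.0), (90.0, 95.0)]
detections = [11.0, 30.0, 40.0, 41.0, 70.0]
precision, recall, f1 = evaluate(detections, truth)
```

Here three of the five detections fall inside an interval and two of the three intervals are found, so precision is 0.6 and recall is 2/3.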

FIG. 7 is a graph showing the F1 measure for episodes 1 to 3. As FIG. 7 shows, with a larger likelihood threshold, recall and precision increase while the correct-answer rate decreases. By introducing a model of the sound channel, the present invention enables more accurate transition point determination.

FIG. 8 is a graph showing the effect of the channel merging method according to the present invention.

In FIG. 8, the solid line represents the likelihood of the image channel and the dotted line represents the signal of the sound channel. The vertical lines are the transition points estimated by the proposed method. As FIG. 8 shows, both channels change sharply and frequently, yet the proposed method determines that there are four transition points.

FIG. 9 is a block diagram illustrating an embodiment of a video strip analysis apparatus according to the present invention. The video strip analysis apparatus 100 shown in FIG. 9 is an apparatus configured to perform the operations described above, so description of content identical or corresponding to the foregoing is omitted; those skilled in the art will nevertheless be able to understand the present invention clearly from the foregoing description.

Referring to FIG. 9, the video strip analysis apparatus 100 may include a separation module 110, an image learning module 120, an image database 121, a sound learning module 130, a sound database 131, and a control module 140.

The separation module 110 may separate the video stream into an image stream and a sound stream.

The image learning module 120 may extract visual words from the separated image stream and use them to estimate the likelihood of the current frame set with respect to the following frame set.

The image database 121 may store the likelihood data estimated by the image learning module 120.

The sound learning module 130 may model the separated sound stream based on the continuity of dialogue and use the generated model to estimate the likelihood of the current frame set with respect to the following frame set.

The sound database 131 may store the likelihood data estimated by the sound learning module 130.

The control module 140 may identify peak points exceeding a preset threshold based on the trends of the likelihoods estimated by the image learning module 120 and the sound learning module 130, and may determine the story transition points of the video stream by associating the peak points of the image stream with those of the sound stream.

In one embodiment, the control module 140 may use a sequential hierarchical Dirichlet process (sHDP, serial Hierarchical Dirichlet Process) that processes the image channel and the sound channel by dividing each into sub-streams.

In one embodiment, if a second peak point of the likelihood estimated by the sound learning module 130 exists within a predetermined time range centered on a first peak point of the likelihood estimated by the image learning module 120, the control module 140 may determine the first peak point as the story transition point.

FIG. 10 is a flowchart illustrating an embodiment of a video strip analysis method according to the present invention. Because the embodiment of the video strip analysis method shown in FIG. 10 is performed by the video strip analysis apparatus of FIG. 9, it is supported by the description given above with reference to FIGS. 1 to 9.

Referring to FIG. 10, the video strip analysis apparatus 100 may divide video stream data into an image channel and a sound channel (step S1010) and construct learning models for each of the divided image channel and sound channel (step S1020).

The video strip analysis apparatus 100 may use the constructed learning models of the image channel and the sound channel to estimate the likelihood of the current frame set with respect to the following frame set (step S1030).

The video strip analysis apparatus 100 may record the trend of the estimated likelihood and determine the story transition points of the video stream using peak points that exceed a preset threshold in the recorded likelihood trend (step S1040).
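
The thresholded peak selection of step S1040 can be sketched as follows. This is a minimal illustration, assuming the recorded trend is a plain list with one score per frame set and that a peak is a local maximum; the function name `peaks_above`, the local-maximum rule, and the sample values are assumptions for the example, not details given in the specification.

```python
def peaks_above(trend, threshold):
    """Return indices of local maxima in `trend` that exceed `threshold`.

    Assumption: a point is a peak when it is strictly higher than its left
    neighbour and at least as high as its right neighbour, so a flat top
    is reported once, at its first index.
    """
    peaks = []
    for i in range(1, len(trend) - 1):
        is_local_max = trend[i] > trend[i - 1] and trend[i] >= trend[i + 1]
        if is_local_max and trend[i] > threshold:
            peaks.append(i)
    return peaks


# Invented trend: the small bump at index 3 stays below the threshold.
trend = [0.1, 0.9, 0.2, 0.4, 0.3, 1.2, 1.2, 0.1]
print(peaks_above(trend, threshold=0.5))
```

Raising `threshold` keeps fewer candidate points, which is the knob varied in the experimental results.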

In one embodiment of steps S1010 to S1020, the video strip analysis apparatus 100 may separate the video stream data into an image channel (image stream) and a sound channel (sound stream), and construct the learning model of the image channel using visual words for the separated image channel.

Here, the video strip analysis apparatus 100 may extract the visual words by applying SIFT (Scale-Invariant Feature Transform) to the image channel, and thereby construct the learning model of the image channel.
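
One common way to turn local descriptors into visual words is to assign each descriptor to its nearest entry in a codebook and count the assignments. The sketch below illustrates that idea only; this passage does not state how the visual words are formed from the SIFT output, so the codebook, the nearest-centre rule, and the names `nearest_word` and `visual_word_histogram` are all assumptions. Real SIFT descriptors are 128-dimensional; two-dimensional toy vectors are used here to keep the example readable, and SIFT extraction itself is not shown.

```python
from collections import Counter


def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest to `descriptor` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(descriptor, codebook[k])),
    )


def visual_word_histogram(descriptors, codebook):
    """Count how many descriptors of one frame set fall on each visual word."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(codebook))]


# Hypothetical two-word codebook and four toy descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (9.0, 11.0), (0.0, 2.0), (2.0, 1.0)]
print(visual_word_histogram(descriptors, codebook))
```

A histogram of this kind gives a frame set a discrete, document-like representation, which is the form of input that topic-style models generally expect.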

In one embodiment of steps S1010 to S1020, the video strip analysis apparatus 100 may extract features from the separated sound channel using the MFCC (Mel Frequency Cepstral Coefficients) algorithm.

In one embodiment, the learning model of the sound channel may be built by modeling the continuity of the dialogue reflected in the sound channel.

In one embodiment of step S1040, if a second peak point of the likelihood estimated by the learning model of the sound channel exists within a predetermined time range centered on a first peak point of the likelihood estimated by the learning model of the image channel, the video strip analysis apparatus 100 may determine the first peak point as the story transition point.
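
The peak association rule of this embodiment can be written in a few lines. This is a minimal sketch, assuming both peak lists are expressed in the same time unit and that "within a predetermined time range centered on the first peak point" means an absolute difference of at most `window`; the function name `story_transitions`, the parameter `window`, and the sample values are hypothetical.

```python
def story_transitions(image_peaks, sound_peaks, window):
    """Keep each image-channel peak that has a sound-channel peak nearby.

    An image peak p is accepted when some sound peak s satisfies
    |p - s| <= window; the accepted image peak itself is reported as the
    story transition point, as in the embodiment described above.
    """
    return [
        p for p in image_peaks
        if any(abs(p - s) <= window for s in sound_peaks)
    ]


# Invented peak positions: only 12 and 77 have a sound peak within 3 units.
print(story_transitions([12, 40, 77, 120], [14, 80, 200], window=3))
```

Requiring agreement between channels discards image peaks caused by purely visual changes, which is consistent with the few transition points marked in FIG. 8 despite frequent fluctuations in each channel.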

In one embodiment of step S1040, the video strip analysis apparatus 100 may determine the transition points using a sequential hierarchical Dirichlet process (sHDP, serial Hierarchical Dirichlet Process) that processes the image channel and the sound channel by dividing each into sub-streams.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be variously modified and changed without departing from the spirit and scope of the invention as set forth in the claims below.

Video strip analysis apparatus (100)
Separation module (110)
Image learning module (120)
Image database (121)
Sound learning module (130)
Sound database (131)
Control module (140)

Claims

A video stream analysis method performed in a video strip analysis apparatus capable of receiving a video stream and analyzing story transitions of the received video stream, the method comprising:
(a) dividing video stream data into an image channel and a sound channel, and constructing learning models for each of the divided image channel and sound channel;
(b) estimating the likelihood of a current frame set with respect to a following frame set using the constructed learning models of the image channel and the sound channel; and
(c) recording a trend of the estimated likelihood and determining a story transition point of the video stream using a peak point that exceeds a preset threshold in the recorded likelihood trend.

The method of claim 1, wherein step (a) comprises:
separating the video stream data into the image channel and the sound channel; and
constructing the learning model of the image channel using visual words for the separated image channel.

The method of claim 2, wherein constructing the learning model of the image channel comprises:
extracting the visual words by applying a Scale-Invariant Feature Transform (SIFT) to the image channel.

The method of claim 2, wherein step (a) further comprises:
extracting features from the separated sound channel using a Mel Frequency Cepstral Coefficients (MFCC) algorithm.

The method of claim 1, wherein the learning model of the sound channel is built by modeling the continuity of dialogue reflected in the sound channel.

The method of claim 1, wherein step (c) comprises:
determining a first peak point as the story transition point if a second peak point of the likelihood estimated by the learning model of the sound channel exists within a predetermined time range centered on the first peak point of the likelihood estimated by the learning model of the image channel.

The method of claim 1, wherein step (c) comprises:
determining the transition point using a sequential hierarchical Dirichlet process (sHDP) that processes the image channel and the sound channel by dividing each into sub-streams.

A video strip analysis apparatus capable of receiving a video stream and analyzing story transitions of the received video stream, the apparatus comprising:
a separation module that separates the video stream into an image stream and a sound stream;
an image learning module that extracts visual words from the separated image stream and uses them to estimate the likelihood of a current frame set with respect to a following frame set;
a sound learning module that models the separated sound stream based on continuity of dialogue and uses the generated model to estimate the likelihood of a current frame set with respect to a following frame set; and
a control module that identifies peak points exceeding a preset threshold based on trends of the likelihoods estimated by the image learning module and the sound learning module, and determines a story transition point of the video stream by associating the peak points of the image stream and the sound stream.

The apparatus of claim 8, wherein the control module determines the transition point using a sequential hierarchical Dirichlet process (sHDP) that processes the image channel and the sound channel by dividing each into sub-streams.

The apparatus of claim 8, wherein the control module determines a first peak point as the story transition point if a second peak point of the likelihood estimated by the sound learning module exists within a predetermined time range centered on the first peak point of the likelihood estimated by the image learning module.

A recording medium having recorded thereon a program for executing a video stream analysis method,
wherein the program is executable in a video strip analysis apparatus capable of receiving a video stream and analyzing story transitions of the received video stream, and the program performs:
(a) dividing video stream data into an image channel and a sound channel, and constructing learning models for each of the divided image channel and sound channel;
(b) estimating the likelihood of a current frame set with respect to a following frame set using the constructed learning models of the image channel and the sound channel; and
(c) recording a trend of the estimated likelihood and determining a story transition point of the video stream using a peak point that exceeds a preset threshold in the recorded likelihood trend.