KR100404322B1

KR100404322B1 - A Method of Summarizing News Video Based on Multimodal Features

Info

Publication number: KR100404322B1
Application number: KR10-2001-0002306A
Authority: KR
Inventors: 장현성; 김재곤; 강경옥; 김문철; 김진웅
Original assignee: 한국전자통신연구원
Priority date: 2001-01-16
Filing date: 2001-01-16
Publication date: 2003-11-01
Also published as: KR20020061318A

Abstract

The present invention relates to a method of summarizing news video based on multimodal features.

The invention extracts multimodal features from a news video that carries a closed caption signal and, based on these features, automatically detects the main sections of the news video in order to summarize it. To this end, the method includes: a digital content acquisition step of converting the signal of the input target video into a digital signal and storing it; a multimodal feature extraction step of extracting features from each multimodal element of the acquired digital content; a main section detection step of detecting the main sections of the news video based on the extracted multimodal features; and a video summary description step of structurally describing the summary information of the news video based on the main section detection results.

As a result, stable performance can be obtained by reducing the complexity of speech recognition while improving its reliability.

Description

A method of summarizing news video based on multimodal features

The present invention relates to a method of summarizing news video based on multimodal features and, more specifically, to a method that divides a news video containing closed caption information into individual news articles and summarizes the news video article by article.

Prior art on summarizing news video, or video in general, includes the paper by M. A. Smith and T. Kanade [Title: Video Skimming and Characterization through the Combination of Image and Language Understanding Technique, Published in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp 775-781, Year: 1997]. That paper proposes a summarization scheme that uses multimodal video features such as image analysis and language recognition. The scheme detects camera motion, face regions, and caption regions from the video signal, indexes the transcript obtained from the audio signal through speech recognition to extract key words, and then merges these results to select the main sections. However, because this method performs no separate analysis of the article structure of the news video, it is difficult for it to provide a structured summary. Moreover, because the main audio sections are extracted on a word basis, the speech is frequently cut off when the summary is played back, and the meaning is not conveyed properly. In addition, because the transcript is generated automatically by speech recognition, the reliability is lower and the complexity is higher than when the closed caption signal carried in the video is used.

Another piece of prior art is the 1999 paper by Q. Huang et al. [Title: Automated Generation of News Content Hierarchy by Integrating Audio, Video, and Text Information, Published in: Proc. IEEE Conference on Acoustics, Speech, and Signal Processing, pp 3025-3028]. It proposes a method of hierarchically structuring the content of news video by using all of the multimodal features of the video. The method can make use of closed caption data when it is supplied with the video, and it includes an analysis of the article structure of the news video, for example separating commercials and detecting the anchor's speech segments by analyzing the audio signal. However, this method restricts the video summary to anchor shots only, so the summary conveys little visual information. Furthermore, each anchor shot paragraph is included in the summary in its entirety, so the summary is bulky compared with one based on key sentences. The method also assumes in advance that the closed caption data is temporally synchronized with the audio signal. In practice, most closed caption data provided with news video is not synchronized, and how the method should handle that case is very unclear.

To solve the above problems of the prior art, an object of the present invention is to provide a method of summarizing news video based on multimodal features that divides a news video supplied with closed captions into news article units and summarizes the news video for each article unit. The resulting summary can be used for content-based video browsing, and the index of the closed caption data generated along the way can be used to build a database for text-based video retrieval.

FIG. 1 is a flowchart illustrating a method of summarizing news video based on multimodal features according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating in detail the digital content acquisition process of the method shown in FIG. 1; and

FIG. 3 is a diagram illustrating in detail the multimodal feature extraction process of the method shown in FIG. 1.

※ Description of reference numerals for the main parts of the drawings ※

210: video compression encoder    220: closed caption decoder

310: shot boundary detector    320: audio-closed caption synchronizer

330: closed caption text analyzer

To achieve the above object, the method of summarizing news video based on multimodal features according to the present invention, applied to news video that carries closed caption data, includes: a digital content acquisition step of converting the broadcast signal of the news video into a digital signal and storing it; a multimodal feature extraction step of extracting features from each multimodal element of the acquired digital content; a main section detection step of detecting the main sections of the news video based on the extracted multimodal features; and a video summary description step of structurally describing the summary information of the news video based on the detected main sections.

Preferably, there is provided a computer-readable recording medium storing a computer-executable program for performing the method of summarizing news video that carries closed caption data based on multimodal features, the program including: a digital content acquisition step of converting the broadcast signal of the news video into a digital signal and storing it; a multimodal feature extraction step of extracting features from each multimodal element of the acquired digital content; a main section detection step of detecting the main sections of the news video based on the extracted multimodal features; and a video summary description step of structurally describing the summary information of the news video based on the detected main sections.

Hereinafter, a method of summarizing news video based on multimodal features according to an embodiment of the present invention is described in more detail with reference to the accompanying drawings. FIG. 1 is a flowchart giving an overall view of the method according to an embodiment of the present invention.

As shown, the method of summarizing the information of a news video characterized by multimodal signals (multimodal: signals of several senses, including audio, video, and text information) includes a digital content acquisition step (S110), a multimodal feature extraction step (S120), a main section detection step (S130), and a video summary description step (S140).

First, the digital content acquisition step (S110) receives an analog news video broadcast signal that carries a closed caption signal, separates it into an audiovisual signal and a closed caption signal, converts each into digital form, and outputs them. The audiovisual signal is converted into a compressed digital signal using one of the known compression coding schemes. The closed caption signal is decoded and stored in text form, together with time information indicating when each caption appears.

Next, the multimodal feature extraction step (S120) extracts, from each multimodal element of the digital content obtained in the digital content acquisition step (S110) (video, audio, and closed caption text data), the set of features needed for main section detection. Specifically, shot boundaries are detected from the video signal, and a speech recognition technique guided by the closed caption data is applied to the audio track to synchronize the audio signal and the closed caption data in time. The closed caption data is also divided by speaker and by news article unit and then indexed. At this point, an index importance for each extracted index term, such as the frequency with which the term appears in each segment unit, is also calculated.

Next, the main section detection step (S130) detects several key sentences for each article unit, based on the index of the closed caption data obtained in the multimodal feature extraction step (S120), and designates them as the main sections to be included in the summary video. To detect the key sentences, an importance is calculated for every sentence in each paragraph, and the required number of sentences with the highest importance is extracted. The importance of a sentence is calculated as a weighted sum of the importances of the index terms it contains, based on the index importance of each word extracted while indexing the closed caption data. The interval that each extracted key sentence occupies in the actual video is then designated as a main section to be included in the summary video. This step also uses the video shot boundary information obtained in the multimodal feature extraction step (S120) to correct both ends of each main section so that they are not visually jarring.
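The key-sentence selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the representation of a sentence as a list of index terms, and the choice to count a term once per occurrence in the weighted sum are all assumptions made for the example.

```python
from typing import Dict, List


def sentence_importance(terms: List[str], term_weight: Dict[str, float]) -> float:
    """Sum the index importance of every index term in the sentence.

    Assumption: a term that occurs twice contributes twice; terms that are
    not in the index contribute nothing.
    """
    return sum(term_weight.get(term, 0.0) for term in terms)


def select_key_sentences(
    sentences: List[List[str]], term_weight: Dict[str, float], k: int
) -> List[int]:
    """Return the positions of the k highest-scoring sentences, in original order."""
    scores = [sentence_importance(s, term_weight) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    weights = {"fire": 2.0, "city": 1.0, "rescue": 1.5, "weather": 0.5}
    paragraph = [["fire", "city"], ["weather"], ["fire", "fire", "rescue"]]
    print(select_key_sentences(paragraph, weights, k=2))
```

Returning the selected positions in their original order keeps the summary in broadcast order; the time interval of each selected sentence would then be looked up from the synchronized caption timestamps.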

The main sections extracted from the paragraphs of the introduction part and of the detailed part are classified into different levels, and the next step, video summary description (S140), uses these levels to build the video summary in hierarchical form.

Finally, the video summary description step (S140) receives the video shot boundary information extracted in the multimodal feature extraction step (S120) and the main section information detected in the main section detection step (S130), and describes the video summary using a description definition language defined by one of the standardization bodies. Each main section detected in the main section detection step (S130) is described hierarchically according to its level. Event and topic information is extracted for each main section automatically, semi-automatically, or manually, and is included in the description data.
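As a rough sketch of what a hierarchical description might look like, the example below builds a two-level XML tree with Python's standard `xml.etree.ElementTree`. The source does not name a specific description definition language, so the element and attribute names (`NewsSummary`, `Article`, `Level`, `Segment`, `topic`) and the input dictionary layout are purely hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def describe_summary(articles: List[Dict]) -> ET.Element:
    """Build a hierarchical summary description.

    Each article dict is assumed to hold a 'topic' string plus 'intro' and
    'detail' lists of (start_seconds, end_seconds) main sections.
    Level 1 holds introduction-part sections, level 2 detailed-part sections.
    """
    root = ET.Element("NewsSummary")  # hypothetical element name
    for article in articles:
        node = ET.SubElement(root, "Article", topic=article["topic"])
        for level, part in ((1, "intro"), (2, "detail")):
            level_node = ET.SubElement(node, "Level", index=str(level))
            for start, end in article[part]:
                ET.SubElement(
                    level_node, "Segment", start=f"{start:.2f}", end=f"{end:.2f}"
                )
    return root


if __name__ == "__main__":
    tree = describe_summary(
        [{"topic": "flood", "intro": [(0.0, 5.0)], "detail": [(12.0, 20.0), (30.0, 33.5)]}]
    )
    print(ET.tostring(tree, encoding="unicode"))
```

A browser built on such a description could play only level 1 for a short overview and add level 2 when the viewer asks for more detail.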

FIG. 2 illustrates in detail the digital content acquisition step of the method shown in FIG. 1. In this process, an analog news video carrying a closed caption signal is received, separated into an audiovisual signal and a closed caption signal, and each is converted into a digital signal and output.

As shown, the audiovisual signal is converted into a compressed digital signal by the video compression encoder (210), and the closed caption signal is output as a text file by the closed caption decoder (220). When the closed captions are decoded and stored, the time information indicating when each caption appears is stored with them.

As an example, when time information is extracted for each word of the closed captions, the field value of the analog input signal at the moment each space character appears in the closed captions can be used as the time information for that word.

Here, the field value is a count that numbers, in order, the fields of the analog broadcast signal delimited by the vertical synchronization signal. It can be initialized to 0 at an arbitrary point in time and is incremented by 1 for every field.
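The word-level timestamping based on the field count can be illustrated with the sketch below. It assumes the caption decoder delivers a stream of (character, field count) pairs and that the field rate is the NTSC value of 60000/1001 fields per second; neither detail is stated in the source, and the function names are invented for the example. A word is stamped with the field count of the space that terminates it.

```python
from fractions import Fraction
from typing import Iterable, List, Tuple

# Assumed field rate (NTSC, about 59.94 fields per second); not given in the source.
NTSC_FIELD_RATE = Fraction(60000, 1001)


def field_to_seconds(field_count: int, field_rate: Fraction = NTSC_FIELD_RATE) -> float:
    """Convert a field count (0 at the reset point) to elapsed seconds."""
    if field_count < 0:
        raise ValueError("field count must be non-negative")
    return float(Fraction(field_count) / field_rate)


def stamp_words(
    events: Iterable[Tuple[str, int]], field_rate: Fraction = NTSC_FIELD_RATE
) -> List[Tuple[str, float]]:
    """Group decoded characters into words and timestamp each word.

    The timestamp is taken from the field count at which the terminating
    space character arrives; a trailing word without a space uses the last
    field count seen.
    """
    words: List[Tuple[str, float]] = []
    current: List[str] = []
    last_field = 0
    for char, field in events:
        last_field = field
        if char.isspace():
            if current:
                words.append(("".join(current), field_to_seconds(field, field_rate)))
                current = []
        else:
            current.append(char)
    if current:
        words.append(("".join(current), field_to_seconds(last_field, field_rate)))
    return words
```

Using `Fraction` for the rate avoids accumulating rounding error when field counts grow large over a long broadcast.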

When the video compression encoder (210) and the closed caption decoder (220) are implemented as separate modules, a separate control signal is needed so that they share a common time reference point. For this purpose, when the field value is initialized in the closed caption decoder (220), a time reference sharing control signal is sent to the video compression encoder (210). The offset between the zero point of the closed caption decoder (220) and the zero point of the video compression encoder (210) is calculated, and the time information subsequently extracted for each word by the closed caption decoder (220) is corrected by that offset.
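A minimal sketch of the offset correction follows. It assumes both zero points can be expressed on one common clock (for example, the moment the control signal is observed by the encoder); the function and parameter names are hypothetical.

```python
from typing import List, Tuple


def correct_word_times(
    word_times: List[Tuple[str, float]],
    decoder_zero: float,
    encoder_zero: float,
) -> List[Tuple[str, float]]:
    """Shift decoder-relative word times onto the encoder's timeline.

    decoder_zero and encoder_zero are the instants, on a shared clock, at
    which the caption decoder's field count and the encoder's timestamps
    were reset to zero (an assumption of this sketch).
    """
    offset = decoder_zero - encoder_zero
    return [(word, time + offset) for word, time in word_times]
```

For example, if the encoder started at 10.0 s on the shared clock and the decoder was reset at 12.5 s, a word stamped 1.0 s by the decoder maps to 3.5 s in the encoded video.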

FIG. 3 illustrates in detail the multimodal feature extraction process of the method shown in FIG. 1. This step extracts each of the multimodal features that will be used in the subsequent main section detection step.

First, the shot boundary detector (310) extracts shot boundaries from the video signal by any suitable method. The extracted shot boundary information is later written into the description data in the video summary description step. It is also used in the main section detection step to correct the two ends of a detected main section when they would be visually jarring.
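The source does not specify how the ends of a main section are corrected. One plausible rule, shown here only as an assumption, is to snap each end to the nearest shot boundary when that boundary lies within a tolerance, and otherwise leave it unchanged. The function names and the `tolerance` parameter are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def nearest_boundary(boundaries: List[float], t: float) -> float:
    """Return the shot boundary closest to time t (boundaries must be sorted)."""
    i = bisect_left(boundaries, t)
    candidates = boundaries[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda b: abs(b - t))


def snap_segment(
    start: float, end: float, boundaries: List[float], tolerance: float = 1.0
) -> Tuple[float, float]:
    """Move each end of a main section to a nearby shot boundary.

    An end is moved only if a boundary lies within `tolerance` seconds.
    If snapping would produce an empty or inverted section, the original
    section is returned unchanged.
    """
    if not boundaries:
        return start, end
    s = nearest_boundary(boundaries, start)
    e = nearest_boundary(boundaries, end)
    new_start = s if abs(s - start) <= tolerance else start
    new_end = e if abs(e - end) <= tolerance else end
    if new_start >= new_end:
        return start, end
    return new_start, new_end
```

The tolerance keeps the audio of the key sentence mostly intact: a section is only stretched or trimmed by a small amount to avoid starting or ending a fraction of a second away from a cut.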

In general, when closed captions are produced in real time by a stenographer who listens to and types the speech of a live news broadcast, the closed caption data lags behind the audio signal. To compensate for this time difference, the audio-closed caption synchronizer (320) takes the closed caption data as input and performs speech recognition with the recognition target already known, and then updates the time information that was extracted for each caption word during digital content acquisition. The closed caption data, in which the time information of each word is now synchronized with the audio signal, is input to the closed caption text analyzer (330).

The closed caption text analyzer (330) uses the speaker tag information contained in the closed caption data to divide the closed caption data by speaker and by article unit, and then indexes it.
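The division by speaker and article can be sketched as below. The source does not describe the speaker tag format or the exact article boundary rule, so this example assumes the captions have already been parsed into (speaker, text) turns, that the anchor is labeled `"ANCHOR"`, and that a new article begins whenever the anchor starts speaking after someone else. All of these are illustrative assumptions.

```python
from typing import Dict, List, Tuple

ANCHOR = "ANCHOR"  # hypothetical speaker label


def split_articles(turns: List[Tuple[str, str]]) -> List[Dict[str, List[str]]]:
    """Group speaker turns into articles with an intro and a detail part.

    Anchor turns go to the article's 'intro' list; every other speaker's
    turns go to its 'detail' list. A new article starts when the anchor
    takes over from a different speaker (or at the very first turn).
    """
    articles: List[Dict[str, List[str]]] = []
    previous = None
    for speaker, text in turns:
        starts_article = speaker == ANCHOR and previous != ANCHOR
        if starts_article or not articles:
            articles.append({"intro": [], "detail": []})
        part = "intro" if speaker == ANCHOR else "detail"
        articles[-1][part].append(text)
        previous = speaker
    return articles
```

The intro and detail lists correspond to the two levels that the later summary description step arranges hierarchically.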

To index the closed caption data, a noun extractor, a dictionary, or a similar resource is prepared in advance and used to extract phrases suitable as index terms from the closed caption data. Two values are extracted along with each index term: the term frequency (TF), which is the frequency with which the term appears in each segment unit, and the inverse document frequency (IDF), which reflects the number of segment units in which the term appears.

The extracted term frequency (TF) and inverse document frequency (IDF) serve as the measure of each index term's importance, which is calculated as term frequency (TF) * inverse document frequency (IDF).
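The TF * IDF importance can be sketched as follows. The source gives only the product form; the logarithmic IDF used here, log(N / df), is the common textbook definition and is an assumption, as are the function name and the representation of each segment unit as a list of index terms.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(segments: List[List[str]]) -> List[Dict[str, float]]:
    """Compute TF * IDF for every index term in every segment unit.

    TF is the raw count of the term in the segment. IDF is assumed to be
    log(N / df), where N is the number of segments and df is the number of
    segments that contain the term.
    """
    n = len(segments)
    document_frequency: Counter = Counter()
    for segment in segments:
        document_frequency.update(set(segment))

    weights: List[Dict[str, float]] = []
    for segment in segments:
        term_frequency = Counter(segment)
        weights.append(
            {
                term: count * math.log(n / document_frequency[term])
                for term, count in term_frequency.items()
            }
        )
    return weights
```

With this definition a term that appears in every segment unit gets weight 0, so words common to the whole broadcast do not influence which sentences are chosen as key sentences.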

Although the invention has been described above on the basis of preferred embodiments, these embodiments are intended to illustrate the invention, not to limit it. It will be apparent to those skilled in the art that various changes, modifications, or adjustments can be made to the above embodiments without departing from the technical spirit of the invention. The scope of protection of the invention is therefore limited only by the appended claims and should be construed as covering all such changes, modifications, and adjustments.

As described above, by applying a speech recognition technique guided by the closed caption data to news video, the present invention detects, as main sections, the video intervals corresponding to the sentences that are semantically most representative of each news article. In doing so, it also reduces the complexity of speech recognition and improves its reliability, so that stable performance is obtained.

Claims

A method of summarizing news video based on multimodal features, comprising:

a first step of processing a news video signal carrying closed caption data into a digital audiovisual signal and a text-format closed caption signal that includes time information indicating when each caption appears;

a second step of applying a speech recognition technique to synchronize in time the audio signal of the audiovisual signal and the closed caption data;

a third step of dividing and indexing the closed caption data by speaker and by article unit;

a fourth step of analyzing the indexing result of the closed caption data and, for each article unit consisting of an introduction part that includes an anchor shot and a detailed part that includes a reporter's detailed report, detecting main sections in units of sentences in each of the introduction part and the detailed part; and

a fifth step of hierarchically constructing summary information of the news video by classifying the main sections detected in the introduction part and the main sections detected in the detailed part into different levels.

The method of claim 1, wherein, when the main sections are assembled into the summary of the news video, both ends of each main section are corrected, using shot boundary information detected from the video signal of the audiovisual signal, so that they are not visually jarring.

(Deleted)

A computer-readable recording medium storing a program capable of executing a fifth step of hierarchically constructing the summary information of the news video by classifying the main sections detected in the introduction part and the main sections detected in the detailed part into different levels.