JPH11266428A

JPH11266428A - Method and device for picture division and recording medium with picture division program recorded

Info

Publication number: JPH11266428A
Application number: JP10068160A
Authority: JP
Inventors: Kenichi Minami; 憲一南; Akito Akutsu; 明人阿久津; Yoshinobu Tonomura; 佳伸外村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-03-18
Filing date: 1998-03-18
Publication date: 1999-09-28
Anticipated expiration: 2018-03-18
Also published as: JP3488626B2

Abstract

PROBLEM TO BE SOLVED: To divide pictures roughly by extracting feature quantities from sections of the pictures' sound information that contain neither music nor voice, calculating the degree of similarity of those feature quantities, and treating sections with a high degree of similarity, together with the sound information interposed between them, as one segment. SOLUTION: Pictures are stored in a picture storage part 102. A music detection part 103 determines whether the sound information is music; if it is not music, a voice detection part 104 determines whether it is voice. If it is not voice either, the sound information in that period is judged to be background sound containing neither music nor voice, and a feature extraction part 105 extracts features of the segments corresponding to background sound. The feature extraction part 105 obtains the long-term average spectrum of the sound information, and a picture division part 106 obtains the correlation between the preceding long-term average spectrum and the current one. If the correlation is high, the two are regarded as the same scene and labeled accordingly, and the label information is stored in the picture storage part 102 together with the time information of the segments.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a video dividing method and apparatus that analyze the background sound in the sound information contained in a video and divide the video based on the similarity of its feature amounts, and to a recording medium on which a video dividing program is recorded.

[0002]

[Prior Art] Methods of dividing video mainly rely on image information. One example detects cut points, where the camera switches, and divides the video into shots.

[0003]

[Problem to Be Solved by the Invention] One application of cut-point detection is a video presentation method in which the first image of each shot is taken as a representative still image (representative image) for that shot, and the representative images are laid out spatially so that the contents of the video can be viewed at a glance. Because cut points occur frequently, however, the number of representative images becomes too large when the video is long. Reducing the number of representative images requires dividing the video more coarsely.

[0004] From the standpoint of video production, a set of shots forms a scene, so the video could instead be divided scene by scene. However, a scene is usually a sequence of shots of the same setting, and dividing video into scenes automatically has been difficult.

[0005] An object of the present invention is to divide video roughly by exploiting the fact that background sounds are likely to be similar within the same scene.

[0006]

[Means for Solving the Problem] To achieve the above object, the present invention inputs a video, stores the input video, detects music and voice from the sound information of the video, extracts feature amounts from the sections of the sound information that contain neither music nor voice, calculates the similarity of the extracted feature amounts, and groups the video of sections with high similarity, together with the sound information sandwiched between those sections, into one segment. The video is thereby divided roughly.

[0007] In addition, the long-term average spectrum of the sound information is used so that the similarity of video segments is determined from the background sound.

[0008]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of a video dividing apparatus according to one embodiment of the present invention. The video dividing apparatus of this embodiment comprises a video input unit 101 that inputs video, a video storage unit 102 that stores video, a music detection unit 103 that detects music, a voice detection unit 104 that detects voice, a feature extraction unit 105 that extracts feature amounts from sections containing neither music nor voice, and a video dividing unit 106 that calculates the similarity of the extracted feature amounts and groups the video of sections with high similarity, together with the sound information sandwiched between those sections, into one segment.

[0009] FIG. 2 is a flowchart showing the processing flow of the video dividing apparatus according to one embodiment of the present invention. The processing flow is the same when the present invention is implemented in software. Each pass through the loop processes a video segment of about one second.

[0010] First, the video is stored in video storage process 201, and music detection process 202 is applied to the sound information of the video. Decision 203 determines whether the sound is music; if it is, processing jumps to decision 208. If it is not music, voice detection process 204 is applied. Decision 205 determines whether the sound is voice; if it is, processing jumps to decision 208. Effective approaches include detecting music from the property that the peaks of the frequency spectrum of the sound information are temporally stable in the frequency direction, and detecting voice with a comb filter (Minami et al., "Video Indexing by Sound Analysis," IEICE General Conference, D-12-64, 1997).
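The branching in processes 202 to 205 can be sketched as a simple per-segment classifier. The Python below is an illustrative sketch only, not the implementation described in the patent: `classify_segments`, the `Segment` type, and the two detector callbacks are hypothetical names introduced here, and the detectors themselves are left to the caller.

```python
from typing import Callable, List, Sequence

# Assumed representation: about one second of audio samples per segment.
Segment = Sequence[float]


def classify_segments(
    segments: List[Segment],
    is_music: Callable[[Segment], bool],
    is_voice: Callable[[Segment], bool],
) -> List[str]:
    """Tag each segment as 'music', 'voice', or 'background'.

    Mirrors decisions 203 and 205: the voice detector runs only when the
    music detector says no, and a segment that is neither music nor voice
    is treated as background sound.
    """
    kinds = []
    for seg in segments:
        if is_music(seg):
            kinds.append("music")
        elif is_voice(seg):
            kinds.append("voice")
        else:
            kinds.append("background")
    return kinds
```

Only the segments tagged as background would go on to feature extraction; music and voice segments skip it, as in the flowchart.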

[0011] The cited "Video Indexing by Sound Analysis" automatically detects voice and music from the sound information contained in a video and summarizes the video by extracting only the portions that contain them. It is useful, for example, when a viewer wants to hear only the songs in a music program without listening to the talk. When music is present, the peaks of the frequency spectrum are temporally stable in the frequency direction, so music can be detected by detecting the peaks and calculating their temporal persistence.
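One way to picture "detecting the peaks and calculating their temporal persistence" is to measure how many spectral peaks stay at the same frequency bin from one frame to the next. The sketch below is an assumption-laden illustration, not the cited method: the local-maximum peak rule, the same-bin persistence measure, the `floor` parameter, and the 0.8 threshold are all hypothetical choices made here.

```python
from typing import List, Sequence, Set


def peak_bins(spectrum: Sequence[float], floor: float = 0.0) -> Set[int]:
    """Indices of local maxima above `floor` in one magnitude spectrum."""
    return {
        k
        for k in range(1, len(spectrum) - 1)
        if spectrum[k] > floor
        and spectrum[k] > spectrum[k - 1]
        and spectrum[k] >= spectrum[k + 1]
    }


def peak_persistence(frames: List[Sequence[float]], floor: float = 0.0) -> float:
    """Average fraction of peaks that remain at the same bin in the next frame."""
    ratios = []
    for prev, cur in zip(frames, frames[1:]):
        p, c = peak_bins(prev, floor), peak_bins(cur, floor)
        if p:
            ratios.append(len(p & c) / len(p))
    return sum(ratios) / len(ratios) if ratios else 0.0


def looks_like_music(
    frames: List[Sequence[float]], threshold: float = 0.8, floor: float = 0.0
) -> bool:
    """Assumed decision rule: call it music when most peaks hold their position."""
    return peak_persistence(frames, floor) >= threshold
```

A sustained tone produces peaks that hold their bins across frames and scores near 1.0, while a sound whose peaks drift from frame to frame scores near 0.0.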

[0012] If the sound is not voice either, the period is treated as background sound containing neither music nor voice, that is, as a segment corresponding to background sound, and feature extraction process 206 is applied. Feature extraction process 206 performs frequency analysis on the sound information and obtains the long-term average spectrum. The long-term average spectrum is the time average of the spectral power at each frequency.
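The definition of the long-term average spectrum can be illustrated by splitting a segment into short frames, computing each frame's power spectrum, and averaging the power per frequency bin over time. This is a minimal sketch under stated assumptions: the frame length of 64 samples, the non-overlapping rectangular frames, and the naive DFT (used only to keep the example free of third-party libraries) are choices made here, not details given in the patent.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power at each non-negative frequency bin, via a naive O(N^2) DFT."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        out.append(abs(acc) ** 2 / n)
    return out


def long_term_average_spectrum(
    samples: Sequence[float], frame_len: int = 64
) -> List[float]:
    """Time average of the per-frame power at each frequency bin."""
    frames = [
        samples[i : i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    if not frames:
        raise ValueError("need at least one full frame of samples")
    spectra = [power_spectrum(f) for f in frames]
    # zip(*spectra) groups the values of one frequency bin across all frames.
    return [sum(col) / len(spectra) for col in zip(*spectra)]
```

In practice an FFT routine and a window function would normally replace the naive DFT; the averaging step is the part that corresponds to the definition in the text.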

[0013] Next, video dividing process 207 obtains the correlation between the long-term average spectrum calculated one loop earlier and the current long-term average spectrum. If the correlation is high, the two segments are regarded as the same scene and labeled accordingly. Any music or voice segments lying between the two correlated segments are also labeled as belonging to the same scene. The label information is stored in the video storage unit 102 together with the time information of the segments.
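The labeling step can be sketched as follows. Several details are assumptions made for illustration, since the text does not fix them: Pearson correlation as the correlation measure, the 0.9 threshold, comparing each background segment with the most recent background segment (rather than strictly the previous loop), and leaving music or voice segments with the earlier scene when the backgrounds on either side differ. The names `correlation` and `label_scenes` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length spectra (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def label_scenes(
    kinds: Sequence[str],
    spectra: Sequence[Optional[Sequence[float]]],
    threshold: float = 0.9,
) -> List[int]:
    """Give every segment a scene label.

    kinds[i] is 'music', 'voice', or 'background'; spectra[i] holds the
    long-term average spectrum for background segments and None otherwise.
    """
    labels = []
    scene = 0
    prev = None  # spectrum of the most recent background segment
    for kind, spec in zip(kinds, spectra):
        if kind == "background":
            if prev is not None and correlation(prev, spec) < threshold:
                scene += 1  # background changed, so a new scene starts here
            prev = spec
        # Music and voice segments inherit the current scene label, so they
        # join the scene whenever the backgrounds around them match.
        labels.append(scene)
    return labels
```

With this rule, a run such as background, voice, background whose two background spectra correlate highly receives one label throughout, which matches the grouping described above.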

[0014] Although the division of video has been described above, the division procedure can be held in the form of a program executable by a data processing device, and the present invention also covers a recording medium on which that program is recorded.

[0015]

[Effects of the Invention] (1) The inventions of claims 1, 3 and 5 input a video, store the input video, detect music and voice from the sound information of the video, extract feature amounts from the sections of the sound information that contain neither music nor voice, calculate the similarity of the extracted feature amounts, and make it possible to group the video of sections with high similarity, together with the sound information sandwiched between those sections, into one segment. This makes it possible to divide video roughly.

[0016] (2) The inventions of claims 2, 4 and 6 use the long-term average spectrum of the sound information, which makes it possible to determine the similarity of video segments from the background sound.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of a video dividing apparatus according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the processing flow of the video dividing apparatus according to one embodiment of the present invention, which is also the processing flow when the present invention is implemented in software.

[Explanation of symbols]

101 video input unit
102 video storage unit
103 music detection unit
104 voice detection unit
105 feature extraction unit
106 video dividing unit
201 video storage process
202 music detection process
203 music determination process
204 voice detection process
205 voice determination process
206 feature extraction process
207 video dividing process
208 video end determination process

Claims

1. A method for dividing a given video according to scenes, comprising: a video input step of inputting a video; a video storage step of storing the video; a music detection step of detecting music from sound information in the video; a voice detection step of detecting voice from the sound information; a feature extraction step of extracting a feature amount from sections of the sound information that contain neither music nor voice; and a video dividing step of calculating the similarity of the extracted feature amounts and dividing the video into a plurality of segments by grouping the video of sections having high similarity, together with the video containing the sound information sandwiched between those sections, into one segment.

2. The video segmentation method according to claim 1, wherein in the feature extraction step, a long-term average spectrum of sound information is extracted as a feature amount.

3. A device for dividing a given video according to scenes, comprising: a video input unit that inputs a video; a video storage unit that stores the video; a music detection unit that detects music from sound information in the video; a voice detection unit that detects voice from the sound information; a feature extraction unit that extracts a feature amount from sections of the sound information that contain neither music nor voice; and a video dividing unit that calculates the similarity of the extracted feature amounts and divides the video into a plurality of segments by grouping the video of sections having high similarity, together with the video containing the sound information sandwiched between those sections, into one segment.

4. The video dividing apparatus according to claim 3, wherein the feature extracting unit extracts a long-term average spectrum of the sound information as a feature amount.

5. A recording medium on which a program for dividing a given video according to scenes is recorded, the recording medium being characterized by recording a video dividing program that causes a computer to execute: a video input process of inputting a video; a video storage process of storing the video; a music detection process of detecting music from sound information in the video; a voice detection process of detecting voice from the sound information; a feature extraction process of extracting a feature amount from sections of the sound information that contain neither music nor voice; and a video dividing process of calculating the similarity of the extracted feature amounts and dividing the video into a plurality of segments by grouping the video of sections having high similarity, together with the video containing the sound information sandwiched between those sections, into one segment.

6. The recording medium according to claim 5, wherein said feature extraction process extracts a long-term average spectrum of sound information as a feature amount.