JP3565228B2 - Sound visualization method and device

Info

Publication number: JP3565228B2
Application number: JP31812294A
Authority: JP
Inventors: 憲一南; 明人阿久津; 洋浜田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1994-12-21
Filing date: 1994-12-21
Publication date: 2004-09-15
Anticipated expiration: 2019-09-15
Also published as: JPH08179791A

Description

[0001]
[Field of Industrial Application]
The present invention relates to a sound visualization method and device that visualize sound information using image information and are suitable for grasping the content of a video.
[0002]
[Prior art]
Techniques for easily handling video and easily grasping its content are called video handling techniques. They use the amount of change in image luminance, the amount of object motion, the amount of camera motion, and the like as indices, and have been applied to grasping the outline of a video and to creating panoramic images (Tonomura et al., "Structured Video Computing," IEEE Multimedia, Fall 1994). However, because most of them depend on image information alone, there are almost no handling techniques that use an index directly reflecting what happened in the video. Sound information reflects well what happened in the video, and its features represent the content directly, but the analysis of sound information has been carried out only in fields such as speech recognition and sound field analysis. As for work relating sound analysis results to video, there is an analysis of the sound used in movies (Michael J. Hawley, "Structure out of Sound," doctoral dissertation, MIT, 1993), but it does not describe concrete ways of using the results.
[0003]
For visualizing sound, graphical displays such as waveforms, sound spectrograms, and icons exist, but they do not let a viewer grasp intuitively what kind of sound is present. For sound retrieval, hunting techniques such as word spotting exist, but browsing techniques, which are needed when the specific sound is unknown or hard to describe, do not exist, because sound is difficult to survey at a glance and there is no effective visualization method.
[0004]
[Problems to be solved by the invention]
When the content of a video must be grasped in a short time, conventional methods based on image information often make this difficult because they do not reflect the content of the video well. For example, when cut points of the image are used, the video before and after each cut point is played back, or the images immediately after the cut points are arranged in chronological order on a display or printed on paper. A cut point, however, is merely where the picture switches; it does not represent the semantic content of the video. Moreover, because it is difficult to pick out the important parts from the countless cut points, further shortening the time required to grasp the content is not easy.
[0005]
An object of the present invention is to provide a sound visualization method and device that make it possible to grasp intuitively what kinds of sound are present, and that combine the strength of image information, which is easy to survey at a glance, with the strength of sound information, which reflects the content of the video.
[0006]
[Means for Solving the Problems]
To achieve the above object, the sound visualization method of the present invention inputs a video consisting of image information and sound information in real time, stores the video input in real time, determines the presence or absence of sound, detects various sounds from the features of the spectrum of the sound information, and classifies the sounds by grouping those with the same features. When detecting music information, it performs spectral peak stability detection, which checks whether the peaks of the spectrum of the sound information remain stable in the frequency direction over time. It then selects images corresponding to the classified sounds as representative screens, together with the temporal change in the type of sound, and displays them.
[0007]
In an embodiment of the present invention, the presence or absence of sound is determined by setting a threshold on the sum of squares of the amplitude of the sound information over a certain period of time.
[0010]
In an embodiment of the present invention, when human speech or animal cries are detected from the sound information, spectral peak stability detection is performed first, and then a harmonic structure is detected, in which multiple peaks exist in the frequency direction of the spectrum at integer or near-integer multiples.
[0011]
In an embodiment of the present invention, an image corresponding to the start of a classified sound, a point a certain time after the start, the end, or a combination of these is selected as a representative screen.
[0012]
In an embodiment of the present invention, a cut point at which the scene changes is selected as a representative screen from among the images corresponding to a classified sound.
[0013]
In an embodiment of the present invention, the frames of the selected representative screens are color-coded according to the type of sound, or different figures or patterns are displayed according to the type of sound, or these are combined, and the screens are displayed as a list on a display or on paper in order of elapsed time.
[0014]
The sound visualization device of the present invention includes:
a video input unit that inputs a video consisting of image information and sound information in real time;
a video storage unit that stores the video input in real time and outputs the stored video;
a video management unit that receives the video and has a sound discrimination unit that determines the presence or absence of sound, a sound detection/classification unit that detects various sounds from the features of the frequency spectrum of the sound information in the portions where sound is present, classifies the sounds by grouping those with the same features, and, when detecting a song or music from the sound information, performs spectral peak stability detection, which checks whether the peaks of the frequency spectrum of the sound information remain stable in the frequency direction over time, and a representative screen selection unit that selects images corresponding to the classified sounds as representative screens together with the temporal change in the type of sound; and
an interface unit that displays the representative screens.
[0015]
In an embodiment of the present invention, the sound discrimination unit determines the presence or absence of sound by setting a threshold on the sum of squares of the amplitude of the sound information over a certain period of time.
[0018]
In an embodiment of the present invention, when detecting human speech or animal cries from the sound information, the sound detection/classification unit detects a harmonic structure in which multiple peaks exist in the frequency direction of the spectrum at integer or near-integer multiples.
[0019]
In an embodiment of the present invention, the representative screen selection unit selects, as a representative screen, an image corresponding to the start of a classified sound, a point a certain time after the start, the end, or a combination of these.
[0020]
In an embodiment of the present invention, the representative screen selection unit selects, as a representative screen, a cut point at which the scene changes from among the images corresponding to a classified sound.
[0021]
In an embodiment of the present invention, the interface unit color-codes the frames of the selected representative screens according to the type of sound, or displays different figures or patterns according to the type of sound, or combines these, and displays the screens as a list on a display or on paper in order of elapsed time.
[0022]
In an embodiment of the present invention, the interface unit plays back the corresponding video when a displayed representative screen is designated on the display with a pointing device such as a mouse.
[0023]
[Operation]
According to the present invention, it is possible to grasp intuitively what kinds of sound are present in a video and to detect and classify various kinds of sound in the portions where sound is present. In particular, by using the spectral peak stability detection method, which checks whether the peaks of the frequency spectrum remain stable in the frequency direction over time, songs and music can be detected from the sound information.
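As a rough illustration of spectral peak stability detection, the following Python sketch measures how long a spectral peak stays at the same frequency bin and compares the mean duration with a threshold. It is a minimal sketch under stated assumptions, not the patented implementation: it takes one dominant peak bin per frame as input (the embodiment traces cepstrum peaks), and the names `stable_run_lengths`, `is_music`, `tolerance`, `frame_shift_s`, and `min_mean_duration_s` are hypothetical.

```python
from typing import List, Sequence


def stable_run_lengths(peak_bins: Sequence[int], tolerance: int = 0) -> List[int]:
    """Return the lengths, in frames, of runs in which the peak bin stays put.

    peak_bins[i] is the frequency-bin index of the dominant peak in frame i
    (assumption: one peak per frame). A run continues while consecutive
    frames differ by at most `tolerance` bins.
    """
    if not peak_bins:
        return []
    runs = []
    length = 1
    for prev, cur in zip(peak_bins, peak_bins[1:]):
        if abs(cur - prev) <= tolerance:
            length += 1
        else:
            runs.append(length)
            length = 1
    runs.append(length)
    return runs


def is_music(
    peak_bins: Sequence[int],
    frame_shift_s: float,
    min_mean_duration_s: float,
    tolerance: int = 0,
) -> bool:
    """Classify as music when the mean stable-run duration reaches the threshold."""
    runs = stable_run_lengths(peak_bins, tolerance)
    if not runs:
        return False
    mean_duration_s = frame_shift_s * sum(runs) / len(runs)
    return mean_duration_s >= min_mean_duration_s
```

A sustained note yields one long run and therefore a long mean duration, whereas a peak that jumps from frame to frame yields many runs of length 1. The threshold value itself is not given in the source and would have to be tuned.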
[0024]
By using a sound discrimination method that sets a threshold on the sum of squares of the amplitude of the sound information over a certain period of time, the presence or absence of sound can be determined.
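The energy test described here can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the use of non-overlapping windows, and the sample values are assumptions, and the source does not specify a threshold value.

```python
from typing import List, Sequence


def frame_energy(samples: Sequence[float]) -> float:
    """Sum of squares of the amplitudes in one analysis window."""
    return sum(s * s for s in samples)


def sound_present(samples: Sequence[float], threshold: float) -> bool:
    """Sound is judged present when the energy is at or above the threshold."""
    return frame_energy(samples) >= threshold


def sound_flags(signal: Sequence[float], window: int, threshold: float) -> List[bool]:
    """Apply the test to consecutive non-overlapping windows of `window` samples.

    Non-overlapping windows are an assumption made for brevity; a trailing
    partial window is ignored.
    """
    return [
        sound_present(signal[i:i + window], threshold)
        for i in range(0, len(signal) - window + 1, window)
    ]
```

For example, with a window of 4 samples and a threshold of 0.5, the signal `[0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.5, -0.5]` gives one silent window followed by one window with sound.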
[0027]
By using a harmonics detection method that detects a harmonic structure, in which multiple peaks exist in the frequency direction of the spectrum at integer or near-integer multiples, human speech and animal cries can be detected from the sound information.
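The embodiment later describes this harmonics detection as a comb laid over the power spectrum: power is summed at the teeth of the comb, and the power in the valleys between the teeth is subtracted to cancel broadband noise. The Python sketch below follows that description under several assumptions that are not in the source: the spectrum is a plain list of per-bin power values, teeth sit at integer multiples of `spacing` bins (shifted by `offset`), valleys sit midway between teeth, and all function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def comb_score(power: Sequence[float], spacing: int, offset: int = 0) -> float:
    """Power summed at the comb teeth minus power summed at the valleys."""
    if spacing < 2:
        raise ValueError("spacing must be at least 2 bins")
    peak_sum = 0.0
    valley_sum = 0.0
    tooth = offset + spacing
    while tooth < len(power):
        peak_sum += power[tooth]
        # Valley assumed to lie halfway between this tooth and the previous one.
        valley_sum += power[tooth - spacing // 2]
        tooth += spacing
    return peak_sum - valley_sum


def best_comb(
    power: Sequence[float],
    spacings: Iterable[int],
    offsets: Iterable[int] = (0,),
) -> Tuple[float, int, int]:
    """Try every spacing and offset; return (score, spacing, offset) of the best."""
    offsets = tuple(offsets)
    return max((comb_score(power, s, o), s, o) for s in spacings for o in offsets)


def is_voice(power: Sequence[float], spacings: Iterable[int], threshold: float) -> bool:
    """Voice is assumed present when the best comb score exceeds the threshold."""
    score, _, _ = best_comb(power, spacings)
    return score > threshold
```

With a spectrum that has unit power at bins 8, 16, ..., 56 and zero elsewhere, a comb with a spacing of 8 bins collects all seven peaks and no valley power. A flat spectrum scores zero for every spacing, because the teeth and valleys cancel, which is the noise handling the text describes.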
[0028]
By selecting, as a representative screen, an image corresponding to the start of a classified sound, a point a certain time after the start, the end, or a combination of these, the sound is visualized.
[0029]
By selecting, as a representative screen, a cut point at which the scene changes from among the images corresponding to a classified sound, the sound is visualized.
[0030]
By color-coding the frames of the selected representative screens according to the type of sound, or displaying different figures or patterns according to the type of sound, or combining these, and displaying the screens as a list on a display or on paper in order of elapsed time, the sound is visualized.
[0032]
By using a video playback means that plays back the corresponding video when a displayed representative screen is designated on the display with a pointing device such as a mouse, the video corresponding to a desired sound can be viewed.
[0033]
[Embodiment]
Next, an embodiment of the present invention will be described with reference to the drawings.
[0034]
FIG. 1 is a block diagram showing the schematic configuration of a sound visualization device according to one embodiment of the present invention. The sound visualization device of this embodiment comprises a video input unit 101 that inputs video, a video/feature storage unit 102 that stores the video input in real time together with the features, a video management unit 103 that manages the video and the features, and an interface unit 107 that controls the device and presents the visualized sound and the played-back video. The video management unit 103 consists of a sound discrimination unit 104 that determines the presence or absence of sound, a sound detection/classification unit 105 that extracts the features of the sound and classifies the sound according to the type of feature, and a representative screen selection unit 106 that selects the images corresponding to the classified sounds. The sound discrimination unit 104, the sound detection/classification unit 105, and the representative screen selection unit 106 each operate in parallel or in a time-division manner, so that sound can be visualized while features are extracted in real time. Buses are also provided from the video input unit 101 and the video/feature storage unit 102 to send the time code or elapsed time of the video to the representative screen selection unit 106.
[0035]
FIG. 2 shows the flow of the sound detection/classification processing 201 performed in the sound detection/classification unit 105: spectral peak stability detection processing 202 is performed first, followed by harmonics detection processing 203.
[0036]
FIG. 3 is a flowchart showing the processing when the sound discrimination unit 104, the sound detection/classification unit 105, and the representative screen selection unit 106 of the video management unit 103 are implemented in software on a computer or the like. In this case, sound discrimination processing 301 is performed first, and if sound is judged to be present, sound detection/classification processing 302 is performed. In sound detection/classification processing 302, spectral peak stability detection processing 303 is performed to judge whether the sound is music. If it is not music, harmonics detection processing 304 is further performed to judge whether it is a human voice. Next, representative screen selection processing 305 selects the image corresponding to the classified sound. If sound discrimination processing 301 judges that no sound is present, sound detection/classification processing 302 is skipped and representative screen selection processing 305 is performed.
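The decision order in the flowchart can be summarized as a short cascade. The Python sketch below is only an illustration of that order: the three detectors are passed in as predicate functions, and the names `SoundType` and `classify_segment` and the label set are assumptions introduced for the example.

```python
from enum import Enum
from typing import Callable, Sequence

Detector = Callable[[Sequence[float]], bool]


class SoundType(Enum):
    SILENCE = "silence"
    MUSIC = "music"
    VOICE = "voice"
    OTHER = "other"


def classify_segment(
    segment: Sequence[float],
    has_sound: Detector,
    peaks_stable: Detector,
    has_harmonics: Detector,
) -> SoundType:
    """Apply the detectors in the order of the flowchart.

    Later detectors are not called once an earlier one has decided the label,
    so no classification work is done on silent segments.
    """
    if not has_sound(segment):       # sound discrimination (301)
        return SoundType.SILENCE
    if peaks_stable(segment):        # spectral peak stability detection (303)
        return SoundType.MUSIC
    if has_harmonics(segment):       # harmonics detection (304)
        return SoundType.VOICE
    return SoundType.OTHER
```

Whatever label results, the segment is then handed to representative screen selection, which is why the silent branch in the flowchart still leads to processing 305.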
[0037]
FIG. 4 shows the interface unit 107 as realized on a display. The played-back video is shown in a playback screen display window 401. Reference numeral 402 denotes a control panel for selecting the type of video to be played back. The visualized video is arranged in chronological order, as in a representative screen display window 403. Reference numeral 404 denotes an icon indicating the type of sound; for example, when the sound is music, an icon such as the one in the figure is displayed. A time code display window 405 shows the time of the image: when a time code is attached to the video, the time code is displayed, and when no time code is attached, the elapsed time from the beginning of the video is displayed.
[0038]
Next, the operation of this embodiment will be described.
[0039]
Video is input through the video input unit 101 and, when it is input in real time, is stored sequentially in the video/feature storage unit 102. The sound information in the input video is analyzed by the sound discrimination unit 104, which computes the sum of squares of the amplitude of the sound information over a window of several milliseconds to several tens of milliseconds and judges that sound is present if the value is at or above a set threshold. When sound is present, the sound detection/classification unit 105 uses sound detection/classification processing 201 to detect what kind of sound it is. First, in spectral peak stability detection processing 202, the frequency spectrum of the sound information is computed with a frame length of about 512 points while the frame is shifted by about several tens of milliseconds. Next, the cepstrum of the frequency spectrum is computed up to coefficients of about order 128. The trajectories of the cepstrum peaks are then traced over intervals of about 5 seconds, and the average duration of the trajectories that show no variation in the frequency direction is calculated. If the average duration is at or above a certain threshold, the sound information is classified as music. The cepstrum is normally used for peak detection, but the spectral waveform could also be used directly. If the sound information is not music, harmonics detection processing 203 is performed. When human speech or animal cries are present, band-like spectral components can be observed in the frequency direction at integer or near-integer multiples. A comb filter with a suitable spacing in the frequency direction is therefore prepared, and the sum of the spectral power at the teeth of the comb is computed while the spacing of the comb is varied or the comb is shifted in the frequency direction. When harmonics are present, this sum becomes large, so the presence of a voice can be detected. Because the value also becomes large when broadband noise is present, noise is handled by subtracting the sum of the spectral power in the valleys of the comb from the sum at the teeth. If the resulting sum of spectral power exceeds a certain threshold, the sound is classified as human speech or animal cries; if it is at or below the threshold, it is classified as other sound. Alternatively, harmonics can be detected by setting a threshold on the spectral power, binarizing the spectrum so that intensities at or above the threshold become 1 and those below it become 0, and counting how many comb teeth coincide with a 1; if the count exceeds a certain threshold, the sound is classified as human speech or animal cries, and otherwise as other sound. The classified sound information is then associated with images by the representative screen selection unit 106. The association is based on the time code sent from the video input unit 101 or the video/feature storage unit 102.
[0040]
Which images are shown as representative screens (the start or end of a classified sound, a point a certain time after the start, cut points, and so on) is selected on the control panel 402 of the interface unit 107. The selected representative screens are displayed in chronological order in the representative screen display window 403 of the interface unit 107, an icon 404 is displayed according to the type of sound, and the time code display window 405 shows the time code or the elapsed time from the beginning of the video. When a representative screen is designated with a pointing device such as a mouse, the corresponding video is read from the video/feature storage unit 102 and played back. The representative screens selected on the control panel 402 can also be printed on paper, arranged in chronological order together with their icons and time codes, by designating an external output device connected to the device on the control panel 402.
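The selection of representative screens from classified sound segments can be sketched as follows. This Python example is an assumption-laden illustration, not the device's actual logic: segments are given as a label with start and end times in seconds, the selectable modes are named `"start"`, `"after"`, and `"end"`, and a delayed pick that would fall past the end of a segment is clamped to the segment end. Cut-point selection is omitted because it needs image analysis.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    label: str      # sound type, for example "music" or "voice"
    start_s: float  # segment start, seconds from the beginning of the video
    end_s: float    # segment end, seconds from the beginning of the video


def representative_times(
    segment: Segment,
    modes: Sequence[str] = ("start",),
    delay_s: float = 1.0,
) -> List[float]:
    """Times within one segment at which a representative image is taken."""
    times = []
    if "start" in modes:
        times.append(segment.start_s)
    if "after" in modes:
        # Assumption: clamp to the segment end if the segment is too short.
        times.append(min(segment.start_s + delay_s, segment.end_s))
    if "end" in modes:
        times.append(segment.end_s)
    return times


def select_representatives(
    segments: Sequence[Segment],
    modes: Sequence[str] = ("start",),
    delay_s: float = 1.0,
) -> List[Tuple[float, str]]:
    """Return (time, label) pairs in chronological order for display."""
    picks = [
        (t, seg.label)
        for seg in segments
        for t in representative_times(seg, modes, delay_s)
    ]
    return sorted(picks)
```

Each returned pair gives the time at which to grab the frame and the sound type that determines the icon or frame color shown alongside it.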
[0041]
Note that the video/feature storage unit 102 may be configured to store only the video.
[0042]
[Effects of the Invention]
As described above, the present invention has the following effects.
[0043]
(1) The inventions of claims 1 and 7 make it possible to grasp intuitively what kinds of sound are present in a video and, in particular, to detect songs and music from sound information.
[0044]
(2) The inventions of claims 2, 3, 8, and 9 make it possible to determine the presence or absence of sound and, by using the harmonics detection method, to detect human speech and animal cries.
[0045]
(3) The inventions of claims 4 and 10 visualize sound by selecting, as a representative screen, an image corresponding to the start of a classified sound, a point a certain time after the start, the end, or a combination of these.
[0046]
(4) The inventions of claims 5 and 11 visualize sound by selecting, as a representative screen, a cut point at which the scene changes from among the images corresponding to a classified sound.
[0047]
(5) The inventions of claims 6 and 12 visualize sound by using a representative screen display method in which the frames of the selected representative screens are color-coded according to the type of sound, or different figures or patterns are added according to the type of sound, or these are combined, and the screens are displayed as a list on a display or on paper in order of elapsed time.
[0048]
(6) The invention of claim 13 makes it possible to view the video corresponding to a desired sound by designating a displayed representative screen on the display with a mouse or the like.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a schematic configuration of a sound visualization device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the flow of operation of the sound detection/classification unit 105.
FIG. 3 is a flowchart showing the flow of processing when the operation of the video management unit 103 is realized by software using a computer or the like.
FIG. 4 is a diagram showing a configuration of an interface unit 107.
[Explanation of symbols]
101 Video input unit
102 Video/feature storage unit
103 Video management unit
104 Sound discrimination unit
105 Sound detection/classification unit
106 Representative screen selection unit
107 Interface unit
201 Sound detection/classification processing
202 Spectral peak stability detection processing
203 Harmonics detection processing
301 Sound discrimination processing
302 Sound detection/classification processing
303 Spectral peak stability detection processing
304 Harmonics detection processing
305 Representative screen selection processing
401 Playback screen display window
402 Control panel
403 Representative screen display window
404 Icon
405 Time code display window

Claims

1. A sound visualization method comprising: inputting a video consisting of image information and sound information in real time and storing the video input in real time;
determining the presence or absence of sound, detecting various sounds from the features of the frequency spectrum of the sound information, classifying the sounds by grouping those with the same features, and, when detecting a song or music from the sound information, performing frequency spectrum peak stability detection, which checks whether the peaks of the frequency spectrum of the sound information remain stable in the frequency direction over time; and
selecting images corresponding to the classified sounds as representative screens together with the temporal change in the type of sound, and displaying them.

2. The sound visualization method according to claim 1, wherein the presence or absence of sound is determined by setting a threshold on the sum of squares of the amplitude of the sound information over a certain period of time.

3. The sound visualization method according to claim 1, wherein, when a human voice or an animal cry is detected from the sound information, the frequency spectrum peak stability detection is performed and then a harmonic structure is detected, in which multiple peaks exist in the frequency direction of the frequency spectrum at integer or near-integer multiples.

4. The sound visualization method according to any one of claims 1 to 3, wherein an image corresponding to the start of a classified sound, a point a certain time after the start, the end, or a combination of these is selected as a representative screen.

5. The sound visualization method according to claim 1, wherein a cut point at which the scene changes is selected as a representative screen from among the images corresponding to a classified sound.

6. The sound visualization method described in the section, wherein the frames of the selected representative screens are color-coded according to the type of sound, or different figures or patterns are added according to the type of sound, or these are combined, and the screens are displayed as a list on a display or on paper in order of elapsed time.

7. A sound visualization device comprising: a video input unit that inputs a video consisting of image information and sound information in real time; a video storage unit that stores the video input in real time and outputs the stored video;
a video management unit that receives the video and has a sound discrimination unit that determines the presence or absence of sound, a sound detection/classification unit that detects various sounds from the features of the frequency spectrum of the sound information in the portions where sound is present, classifies the sounds by grouping those with the same features, and, when detecting a song or music from the sound information, performs frequency spectrum peak stability detection, which checks whether the peaks of the frequency spectrum of the sound information remain stable in the frequency direction over time, and a representative screen selection unit that selects images corresponding to the classified sounds as representative screens together with the temporal change in the type of sound; and
an interface unit that displays the representative screens.

8. The sound visualization device according to claim 7, wherein the sound discrimination unit determines the presence or absence of sound by setting a threshold on the sum of squares of the amplitude of the sound information over a certain period of time.

9. The sound visualization device according to claim 7, wherein, when detecting a human voice or an animal cry from the sound information, the sound detection/classification unit performs the frequency spectrum peak stability detection and then detects a harmonic structure in which multiple peaks exist in the frequency direction of the frequency spectrum at integer or near-integer multiples.

10. The sound visualization device according to any one of claims 7 to 9, wherein the representative screen selection unit selects, as a representative screen, an image corresponding to the start of a classified sound, a point a certain time after the start, the end, or a combination of these.

11. The sound visualization device according to any one of claims 7 to 9, wherein the representative screen selection unit selects, as a representative screen, a cut point at which the scene changes from among the images corresponding to a classified sound.

12. The sound visualization device according to any one of claims 7 to 11, wherein the interface unit color-codes the frames of the selected representative screens according to the type of sound, or displays different figures or patterns according to the type of sound, or combines these, and displays the screens as a list on a display or on paper in order of elapsed time.

13. The sound visualization device according to any one of claims 7 to 11, wherein the interface unit plays back the corresponding video when a displayed representative screen is designated on the display with a pointing device such as a mouse.