JP4099576B2

JP4099576B2 - Information identification apparatus and method, program, and recording medium

Info

Publication number: JP4099576B2
Application number: JP2002286836A
Authority: JP
Inventors: 康裕戸栗
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-09-30
Filing date: 2002-09-30
Publication date: 2008-06-11
Anticipated expiration: 2022-09-30
Also published as: JP2004125944A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声、音楽、音響を含む音声信号、又はその音声信号を含む情報源から特徴量を抽出することにより、音声や音楽を識別して検出又は検索する情報識別装置及びその方法、並びにプログラム及び記録媒体に関する。
【０００２】
【従来の技術】
放送システムやマルチメディアシステム等において、映像や音声の大量のコンテンツを効率よく管理、分類し、容易に検索可能とすることは重要であるが、これにはコンテンツ中のどの部分がどのような情報をもっているかを知ることが不可欠である。
【０００３】
ここで、多くのマルチメディアコンテンツ、放送コンテンツは、映像信号と共に音声信号を含んでおり、これはコンテンツの分類やシーンの検出において、非常に有用な情報である。特に、情報に含まれる音声信号の音声部分と音楽部分とを識別して検出することで、効率的な情報検索や情報管理が行える。
【０００４】
ところで、音声と音楽とを識別するための技術は、従来から数多く研究されており、零交差数、パワーの変動、スペクトルの変動などを特徴量として用いて識別する手法が提案されている。
【０００５】
例えば、下記の非特許文献１では、零交差数を用いて音声・音楽の識別を行っている。
【０００６】
また、下記の非特許文献２では、４Ｈｚ変調エネルギー、低エネルギーフレーム率、スペクトルロールオフ点、スペクトルセントロイド、スペクトル変動（Flux)、零交差率などを含めた１３個の特徴量を用いて音声・音楽を識別し、それぞれの性能を比較評価している。
【０００７】
さらに、下記の非特許文献３では、ケプストラム係数、デルタケプストラム係数、振幅、デルタ振幅、ピッチ、デルタピッチ、零交差数、デルタゼロ交差数を特徴量とし、それぞれの特徴量に混合正規分布モデルを用いることで、音声・音楽を識別している。
【０００８】
この他、音楽のスペクトルピークが特定周波数に安定したまま時間方向に持続するという特徴に基づいた検出手法も研究されている。ここで、スペクトルピークの安定性は、スペクトログラムにおける時間方向の直線成分の有無としても表現される。スペクトログラムとは、縦軸を周波数、横軸を時間とし、スペクトルを時間方向に並べて画像情報として表現したものである。この特徴を用いた発明としては、例えば下記の非特許文献４及び特許文献１が挙げられる。
【０００９】
ここで、特許文献１では、全帯域のエッジ強度を求め、これを閾値と比較することで音楽成分が存在するか否かを判定している。さらに、音楽成分を除去したスペクトルにくし型フィルタを適用し、音声の調波構造（ハーモニック構造）を検出することで音声成分も検出している。
【００１０】
すなわち、先ず周波数帯域ｊにおける時間方向のエッジ強度ｅｄ（ｊ）を以下の式（１）に従って求める。ここで、式（１）においてｆ（ｉ，ｊ）は、スペクトログラム上の画素（ｉ，ｊ）における輝度を示す。
【００１１】
【数１】

【００１２】
次に、全帯域のエッジ強度ＥＤを以下の式（２）に従って求める。
【００１３】
【数２】

【００１４】
そして、このエッジ強度ＥＤの値が閾値ＴＨ以上である場合には、検出範囲に音楽が存在すると判定している。
【００１５】
【非特許文献１】
Ｊ．サウンダース（J.Saunders），「放送された音声／音楽のリアルタイム識別（Real-time discrimination of broadcast speech/music）」，（米国），電気電子技術者学会報、音響・音声・信号処理に関する国際会議（Proc.IEEE Int.Conf. on Acoustics, Speech, Signal Processing），１９９６年，ｐ．９９３−９９６
【非特許文献２】
Ｅ．シェイアー（E.Scheire）及びＭ．スラニー（M.Slaney），「ロバストな多特性音声／音楽識別器の作製及び評価（Construction and evaluation of a robust multifeature speech/music discriminator）」，（米国），電気電子技術者学会報、音響・音声・信号処理に関する国際会議（Proc.IEEE Int.Conf. on Acoustics, Speech, Signal Processing），１９９７年，ｐ．１３３１−１３３４
【非特許文献３】
Ｍ．Ｊ．ケア（M.J.Care）、Ｅ．Ｓ．パリス（E.S.Parris）及びＨ．ロイド・トーマス（H.Lloyd-Thomas），「音声，音楽を識別するための特徴比較（A comparison of features for speech,music discrimination）」，（米国），電気電子技術者学会報、音響・音声・信号処理に関する国際会議（Proc.IEEE Int.Conf. on Acoustics, Speech, Signal Processing），１９９９年３月，ｐ．１４９−１５２
【非特許文献４】
南、阿久津、浜田及び外村，「音情報を用いた映像インデクシングとその応用」，電子情報通信学会論文誌Ｄ−ＩＩ，１９９８年，第Ｊ８１−Ｄ−ＩＩ巻，第３号，ｐ．５２９−５３７
【特許文献１】
特開平１０−１８７１８２号公報
【００１６】
【発明が解決しようとする課題】
しかしながら、上述した従来の技術において、零交差数、パワー変動、スペクトルセントロイドなどを特徴量として用いた識別手法は、どれも単独では識別に十分な特徴量ではなかった。
【００１７】
また、スペクトルのピークの安定性に着目した識別手法は、打撃音などを除き効果的な特徴量であるものの、エッジ強度の時間方向及び周波数方向における単純な総和を識別に用いていたため、特定周波数における時間方向のピーク安定性を十分に表現できない場合があった。つまり、単に全時刻・全帯域での総和をとると、スペクトルピークが周波数方向に揺らいでいる場合やピークが断続している場合であっても、スペクトルが特定周波数に安定して持続している場合、すなわちスペクトログラムにおける時間方向の直線成分が存在する場合との区別がつかないことがあり、これにより識別誤りを起こす可能性があった。
【００１８】
本発明は、このような従来の実情に鑑みて提案されたものであり、上述した従来技術の問題点を解決し、より高精度に音声・音楽を識別して検出する情報識別装置及びその方法、並びに情報識別処理をコンピュータに実行させるプログラム及びそのプログラムが記録されたコンピュータ読み取り可能な記録媒体を提供することを目的とする。
【００１９】
【課題を解決するための手段】
上述した目的を達成するために、本発明に係る情報識別装置は、音声信号を含む情報源から所定の時間区間毎に音声か音楽かを識別する情報識別装置において、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求めるスペクトログラム計算手段と、上記スペクトログラムを画像と見なしたときのスペクトログラム画像を２次元スペクトルに変換して得られる２次元周波数領域における水平周波数が０近傍の成分としての水平直流成分を抽出する成分抽出手段と、上記成分抽出手段により抽出された上記水平直流成分のパワーが上記スペクトログラムの全領域のパワーに占める割合を求めるパワー比計算手段と、上記パワー比計算手段によって求められたパワー比に基づいて、音声か音楽かを識別する識別手段とを備える。
【００２０】
ここで、本情報識別装置は、上記スペクトログラムの一部を複数の小ブロックに分割するスペクトログラム分割手段を備えていてもよく、この場合、上記水平直流成分抽出手段は、上記小ブロック毎に上記水平直流成分を抽出し、上記パワー比計算手段は、上記小ブロック毎に上記パワー比を求める。また、上記小ブロック毎に求められた上記パワー比に基づいて、全小ブロックにおける総合的なパワー比を求める総合パワー比計算手段を備えることもでき、この場合、上記識別手段は、上記総合パワー比計算手段によって求められた総合パワー比に基づいて、音声か音楽かを識別する。
【００２１】
このような情報識別装置は、音楽のスペクトルピークが時間方向に安定して持続するという特徴に基づいて音声と音楽とを識別する際に、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求め、このスペクトログラムを画像と見なしたときの水平直流成分のパワーが当該スペクトログラムの全領域のパワーに占める割合を特徴量として用いる。
【００２２】
また、上述した目的を達成するために、本発明に係る情報識別方法は、音声信号を含む情報源から所定の時間区間毎に音声か音楽かを識別する情報識別方法において、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求めるスペクトログラム計算工程と、上記スペクトログラムを画像と見なしたときのスペクトログラム画像を２次元スペクトルに変換して得られる２次元周波数領域における水平周波数が０近傍の成分としての水平直流成分を抽出する成分抽出工程と、上記成分抽出工程にて抽出された上記水平直流成分のパワーが上記スペクトログラムの全領域のパワーに占める割合を求めるパワー比計算工程と、上記パワー比計算工程にて求められたパワー比に基づいて、音声か音楽かを識別する識別工程とを有する。
【００２３】
ここで、本情報識別方法は、上記スペクトログラムの一部を複数の小ブロックに分割するスペクトログラム分割工程を有していてもよく、この場合、上記成分抽出工程では、上記小ブロック毎に上記水平直流成分が抽出され、上記パワー比計算工程では、上記小ブロック毎に上記パワー比が求められる。また、上記小ブロック毎に求められた上記パワー比に基づいて、全小ブロックにおける総合的なパワー比を求める総合パワー比計算工程を有してもよく、この場合、上記識別工程では、上記総合パワー比計算工程にて求められた総合パワー比に基づいて、音声か音楽かが識別される。
【００２４】
このような情報識別方法は、音楽のスペクトルピークが時間方向に安定して持続するという特徴に基づいて音声と音楽とを識別する際に、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求め、このスペクトログラムを画像と見なしたときの水平直流成分のパワーが当該スペクトログラムの全領域のパワーに占める割合を特徴量として用いる。
【００２５】
また、本発明に係るプログラムは、上述した情報識別処理をコンピュータに実行させるものであり、本発明に係る記録媒体は、そのようなプログラムが記録されたコンピュータ読み取り可能なものである。
【００２６】
【発明の実施の形態】
以下、本発明を適用した具体的な実施の形態について、図面を参照しながら詳細に説明する。この実施の形態は、本発明を、音楽のスペクトルピークが時間方向に安定して持続するという特徴に基づいて、音声信号の所定時間区間毎に音声と音楽とを識別して検出する情報識別装置に適用したものである。
【００２７】
以下では、本実施の形態における情報識別装置の構成及び動作を説明する前に、この情報識別装置における音声・音楽の識別手法の原理について説明する。
【００２８】
先ず音楽の典型的なスペクトログラムの様子を図１（Ａ）に示す。この図は抽象化して示しているが、実際のスペクトログラムは、スペクトルの大きさによって画素の輝度が異なる濃淡画像として得られる。打楽器のみの場合などに例外はあるが、多くの一般的な音楽では、図１（Ａ）に示すように、スペクトログラムに水平方向、すなわち時間方向の直線成分が観察される。これは、音楽ではある周波数帯域のスペクトルピークが時間方向に安定して持続するためである。
【００２９】
一方、音声の典型的なスペクトログラムの様子を図１（Ｂ）に示す。音楽の場合と異なり、音声ではスペクトログラムに水平直線成分が見られず、周波数方向に揺らいで波打っているのが観察される。これは、音声には調波構造（ハーモニクス構造）が見られるものの、周波数ピークが時間とともに揺らいで変動することを示している。また、音声では有声音と無声音とが交互に繰り返されるために、曲線の明確に現れる部分とそうでない部分とが存在する。
【００３０】
したがって、スペクトログラムを画像と見なし、そのスペクトログラム画像における水平直線成分の有無、或いはその程度によって音声と音楽とを識別することができる。
【００３１】
ここで、スペクトログラムを画像と見なした場合の２次元周波数領域を図２に示す。スペクトログラム画像を２次元スペクトルに変換した２次元周波数領域において、スペクトログラム画像における水平直線成分は、図２に斜線で示す領域ＬＵ付近、すなわち水平周波数ｕが０近傍である水平直流成分に集中する。
【００３２】
なお、垂直直流成分（ｖ＝０）付近にも水平直線のスペクトル成分は存在するが、垂直方向に殆ど変化がない成分、すなわち直線とはいえない成分も含まれるため、ｖ＝０付近は領域ＬＵから除いている。
【００３３】
この領域ＬＵ内のスペクトルパワーがスペクトログラムにおける水平直線成分であることから、全領域のスペクトルパワーに対する領域ＬＵ内のスペクトルパワーの比が、スペクトログラムの水平直線成分の程度、すなわちスペクトルの時間方向のピーク持続性を表すことになり、これを特徴量として音声と音楽とを識別することができる。
【００３４】
実際には、図３に示すように、スペクトログラム全体のうち、音声・音楽の識別に大きく寄与する領域ＡＲを複数の小ブロック（小領域）ＳＢに分割し、小ブロックＳＢ毎に上述したパワー比を求めてから、全ての小ブロックＳＢにおける総合的なパワー比を求めるのが好ましい。このように小ブロックＳＢに分割して処理を行うことで、水平直線成分の検出精度が向上する。また、スペクトログラム全体のうち、音声・音楽の識別に大きく寄与する領域ＡＲのみを処理の対象とすることで、識別の精度も向上する。
【００３５】
以上説明した識別手法により音声・音楽を識別する本実施の形態における情報識別装置の概略構成を図４に示す。図４に示すように、情報識別装置１は、音声信号入力部１０と、入力された音声信号のスペクトログラムを求めるスペクトログラム計算部１１と、スペクトログラムを複数の小ブロックＳＢに分割するスペクトログラム分割部１２と、分割されたスペクトログラムの小ブロックＳＢにおける水平直線周波数成分を抽出する水平直線周波数成分抽出部１３と、小ブロックＳＢの水平直線成分の全成分に対するパワー比を求める水平直線パワー比計算部１４と、全小ブロックＳＢの水平直線パワー比から総合水平直線成分パワー比を求める総合パワー比計算部１５と、求めた総合水平直線成分パワー比を特徴量とし、入力音声信号の所定時間区間毎に音声か音楽かを識別する音声・音楽識別部１６と、その識別結果を出力する識別結果出力部１７とを備える。
【００３６】
この情報識別装置１において、音声信号入力部１０は、音声信号を入力し、これをスペクトログラム計算部１１に供給する。スペクトログラム計算部１１は、入力音声信号を所定のブロック毎に周波数分析して周波数スペクトルを計算し、さらに所定の識別時間毎に入力音声信号のスペクトログラムを求めて、スペクトログラムをスペクトログラム分割部１２に供給する。そして、スペクトログラム分割部１２は、スペクトログラム計算部１１から供給されたスペクトログラムを後述するように複数の小ブロックＳＢに分割し、小ブロックＳＢ毎のスペクトログラムを水平直線周波数成分抽出部１３に供給する。
【００３７】
水平直線周波数成分抽出部１３は、スペクトログラムの小ブロックＳＢ毎に、その小ブロックＳＢの水平直線成分に相当する周波数成分を取り出して、水平直線パワー比計算部１４に供給する。そして、水平直線パワー比計算部１４は、全周波数帯域成分に対する水平直線成分のパワー比を計算し、総合パワー比計算部１５は、全ての小ブロックＳＢでの水平直線成分パワー比を評価して、総合水平直線成分パワー比を計算する。
【００３８】
音声・音楽識別部１６は、求められた総合水平直線成分パワー比を特徴量として用いて、閾値判定法やその他の統計的判別手法により入力音声信号の識別区間が音声であるか音楽であるかを識別し、識別結果を識別結果出力部１７に供給する。そして、識別結果出力部１７は、音声・音楽識別部１６から供給された識別結果を出力する。
【００３９】
この情報識別装置１の処理を図５のフローチャートを用いてさらに詳細に説明する。先ずステップＳ１において、入力音声信号の所定の識別時間内におけるスペクトログラムを求める。ここで、識別時間とは、入力音声信号において音声と音楽とを識別するための識別ブロック長であり、数秒程度以上が望ましい。具体的には、音声信号ｘ（ｔ）を入力し、所定の時間毎（例えば６４ミリ秒）にブロック化して周波数分析を行い、スペクトルを求める。
【００４０】
なお、周波数分析ブロックは、隣接ブロックと重複していてもよい。例えば、２０ミリ秒ずつ重複させることができる。また、周波数スケールは、対数スケールやメルスケールなどであってもよい。
【００４１】
そして、ｉ番目の周波数分析ブロックにおける周波数帯域ｋのスペクトルをｆ（ｉ，ｋ）とする。横軸にｉ（時間方向）、縦軸にｋ（周波数方向）をとり、求めたスペクトルｆ（ｉ，ｋ）を２次元画像の輝度として表現したものがスペクトログラムである。
【００４２】
次にステップＳ２において、スペクトログラムを図３に示したようにＭ個の小ブロックＳＢに分割する。この際、識別に寄与すると思われる部分のみを小ブロック化すればよい。本実施の形態では、時間方向にはスペクトログラム全体の時間幅（すなわち識別時間長）に亘って小ブロック化されているが、周波数方向には識別に重要な帯域（例えば、５０Ｈｚ〜４ｋＨｚ）のみが小ブロック化されており、それ以外の帯域を用いない。このように、識別に寄与すると思われる部分のみを小ブロック化することで、識別精度が向上する。ここで、小ブロックＳＢの大きさは、周波数方向にも時間方向にも適当な分解能となるように、例えば３２×３２とする。
【００４３】
なお、小ブロックＳＢは、隣接ブロックと重複していてもよい。本実施の形態では、小ブロックＳＢは半分ずつ重複しているとする。
【００４４】
このように分割した小ブロックＳＢ毎に、後段で水平直線成分のパワー比が求められる。
【００４５】
続いてステップＳ３において、ある小ブロックＳＢ_ｍについて、２次元画像スペクトル上の領域ＬＵの水平直線成分パワー比Ｒ（ｍ）を求める。すなわち、上述のよう分割された小ブロックＳＢ毎に、その小ブロックＳＢ内のスペクトログラムを画像と見なし、２次元フーリエ変換や２次元フィルタなどによりスペクトログラム画像の２次元周波数における領域ＬＵの成分を取り出し、全領域に対するパワー比を求める。
【００４６】
ここで、領域ＬＵの水平直線成分パワー比Ｒ（ｍ）を求める方法には、２次元フーリエ変換によって該当領域のスペクトルから求める方法と、２次元デジタルフィルタを用いて領域ＬＵの帯域成分のみを取り出す方法がある。フーリエ変換による方法では、先ず小ブロックＳＢ_ｍにおけるスペクトログラム画像を２次元フーリエ変換し、得られた２次元パワースペクトルをＦ_ｍ（ｕ，ｖ）とする。そして、領域ＬＵ内のスペクトルパワーの全帯域に対するパワー比を求める。すなわち、小ブロックＳＢ_ｍにおける水平直線成分パワー比Ｒ（ｍ）は、以下の式（３）により求められる。
【００４７】
【数３】
$$R(m) = \frac{\sum_{(u,v) \in LU} F_m(u,v)}{\sum_{(u,v)} F_m(u,v)} \qquad (3)$$
【００４８】
一方、２次元フィルタを用いた場合は、小ブロックＳＢ_ｍにおけるスペクトログラム画像に、領域ＬＵのみ通過させるような２次元帯域通過フィルタを適用する。そして、フィルタ処理された信号のパワーと、フィルタ処理しない原信号のパワーとの比を求めれば水平直線成分パワー比Ｒ（ｍ）が得られる。
【００４９】
ステップＳ４では、全ての小ブロックＳＢの処理が終了したか否かが判別される。全ての小ブロックＳＢについて水平直線成分パワー比Ｒ（ｍ）を求めた場合（Yes）にはステップＳ５に進み、そうでない場合（No）には、次の小ブロックＳＢについて同様にして水平直線成分パワー比Ｒ（ｍ）を求める。
【００５０】
ステップＳ５では、全ての小ブロックＳＢについての総合水平直線パワー比Ｒを求める。例えば、以下の式（４）に示すように、各小ブロックＳＢの水平直線成分パワー比Ｒ（ｍ）の平均を総合水平直線パワー比Ｒとすることができる。ここで、式（４）において、ｍは小ブロックの番号を示し、Ｍは小ブロック数を示す。
【００５１】
【数４】
$$R = \frac{1}{M} \sum_{m} R(m) \qquad (4)$$
【００５２】
なお、各小ブロックＳＢの水平直線成分パワー比Ｒ（ｍ）の平均に限定されるものではなく、以下の式（５）に示すように、単純に各小ブロックＳＢの水平直線成分パワー比Ｒ（ｍ）の総和を総合水平直線パワー比Ｒとしてもよい。
【００５３】
【数５】
$$R = \sum_{m} R(m) \qquad (5)$$
【００５４】
そしてステップＳ６では、総合水平直線成分パワー比Ｒを特徴量として用いて、音声・音楽の識別を行う。一般に、典型的な音楽ではスペクトルピークが持続するため、この総合水平直線成分パワー比Ｒは大きな値となり、音声では小さい値になる。識別の手法は本発明では限定しないが、最も単純な方法としては、総合水平直線成分パワー比Ｒを閾値Ｔｈと比較し、総合水平直線成分パワー比Ｒが閾値Ｔｈ以上であれば音楽と判別し、閾値Ｔｈ未満であれば音声と判別することが挙げられる。
【００５５】
また、音声、音楽それぞれに対して総合水平直線成分パワー比Ｒの分布を正規分布モデルによって表現し、事後確率の大きい方に判別するといったベイズ決定則などの統計的判別法を用いてもよい。また、この総合水平直線成分パワー比Ｒを他の特徴量と組み合わせて総合的に判別してもよい。
【００５６】
以上説明したように、本実施の形態における情報識別装置１によれば、入力音声信号のスペクトログラムにおける水平直線成分に相当する周波数成分を取り出し、その全体に対するパワー比を特徴量として用いているため、スペクトルにおける特定帯域のピーク持続性を効果的に表現することができ、音声・音楽を高精度に識別することができる。
【００５７】
また、スペクトログラムを予め小ブロックＳＢに分割し、小ブロックＳＢ毎に上述の水平直線成分パワー比Ｒ（ｍ）を求めてから、全小ブロックＳＢにおける総合水平直線成分パワー比Ｒを求めているため、ピーク持続性の分析性能が向上する。
【００５８】
なお、本発明は上述した実施の形態のみに限定されるものではなく、本発明の要旨を逸脱しない範囲において種々の変更が可能であることは勿論である。
【００５９】
例えば、上述の実施の形態では、ハードウェアの構成として説明したが、これに限定されるものではなく、任意の処理を、ＣＰＵ（Central Processing Unit）にコンピュータプログラムを実行させることにより実現することも可能である。この場合、コンピュータプログラムは、記録媒体に記録して提供することも可能であり、また、インターネットその他の伝送媒体を介して伝送することにより提供することも可能である。
【００６０】
【発明の効果】
以上詳細に説明したように本発明に係る情報識別装置は、音声信号を含む情報源から所定の時間区間毎に音声か音楽かを識別する情報識別装置において、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求めるスペクトログラム計算手段と、上記スペクトログラムを画像と見なしたときの水平直流成分を抽出する水平直流成分抽出手段と、上記水平直流成分のパワーが上記スペクトログラムの全領域のパワーに占める割合を求めるパワー比計算手段と、上記パワー比計算手段によって求められたパワー比に基づいて、音声か音楽かを識別する識別手段とを備える。
【００６１】
ここで、本情報識別装置は、上記スペクトログラムの一部を複数の小ブロックに分割するスペクトログラム分割手段を備えていてもよく、この場合、上記水平直流成分抽出手段は、上記小ブロック毎に上記水平直流成分を抽出し、上記パワー比計算手段は、上記小ブロック毎に上記パワー比を求める。また、上記小ブロック毎に求められた上記パワー比に基づいて、全小ブロックにおける総合的なパワー比を求める総合パワー比計算手段を備えることもでき、この場合、上記識別手段は、上記総合パワー比計算手段によって求められた総合パワー比に基づいて、音声か音楽かを識別する。
【００６２】
このような情報識別装置によれば、音楽のスペクトルピークが時間方向に安定して持続するという特徴に基づいて音声と音楽とを識別する際に、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求め、このスペクトログラムを画像と見なしたときの水平直流成分のパワーが当該スペクトログラムの全領域のパワーに占める割合を特徴量として用いることで、スペクトルにおける特定帯域のピーク持続性を効果的に表現することができ、音声・音楽を高精度に識別することができる。
【００６３】
また、本発明に係る情報識別方法は、音声信号を含む情報源から所定の時間区間毎に音声か音楽かを識別する情報識別方法において、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求めるスペクトログラム計算工程と、上記スペクトログラムを画像と見なしたときの水平直流成分を抽出する水平直流成分抽出工程と、上記水平直流成分のパワーが上記スペクトログラムの全領域のパワーに占める割合を求めるパワー比計算工程と、上記パワー比計算工程にて求められたパワー比に基づいて、音声か音楽かを識別する識別工程とを有する。
【００６４】
ここで、本情報識別方法は、上記スペクトログラムの一部を複数の小ブロックに分割するスペクトログラム分割工程を有していてもよく、この場合、上記成分抽出工程では、上記小ブロック毎に上記水平直流成分が抽出され、上記パワー比計算工程では、上記小ブロック毎に上記パワー比が求められる。また、上記小ブロック毎に求められた上記パワー比に基づいて、全小ブロックにおける総合的なパワー比を求める総合パワー比計算工程を有してもよく、この場合、上記識別工程では、上記総合パワー比計算工程にて求められた総合パワー比に基づいて、音声か音楽かが識別される。
【００６５】
このような情報識別方法によれば、音楽のスペクトルピークが時間方向に安定して持続するという特徴に基づいて音声と音楽とを識別する際に、入力音声信号を所定のブロック単位で周波数分析し、所定の識別区間毎に、縦軸及び横軸がそれぞれ周波数及び時間であるスペクトログラムを求め、このスペクトログラムを画像と見なしたときの水平直流成分のパワーが当該スペクトログラムの全領域のパワーに占める割合を特徴量として用いることで、スペクトルにおける特定帯域のピーク持続性を効果的に表現することができ、音声・音楽を高精度に識別することができる。
【００６６】
また、本発明に係るプログラムは、上述した情報識別処理をコンピュータに実行させるものであり、本発明に係る記録媒体は、そのようなプログラムが記録されたコンピュータ読み取り可能なものである。
【００６７】
このようなプログラム及び記録媒体によれば、上述した情報識別処理をソフトウェアにより実現することができる。
【図面の簡単な説明】
【図１】スペクトログラムの典型例を概念的に説明する図であり、同図（Ａ）は、音楽のスペクトログラムを示し、同図（Ｂ）は、音声のスペクトログラムを示す。
【図２】スペクトログラム画像を２次元スペクトルに変換した２次元周波数領域を示す図である。
【図３】スペクトログラム画像を複数の小ブロックに分割した様子を示す図である。
【図４】本実施の形態における情報識別装置の概略構成を説明する図である。
【図５】同情報識別装置の動作を説明するフローチャートである。
【符号の説明】
１情報識別装置、１０音声信号入力部、１１スペクトログラム計算部、１２スペクトログラム分割部、１３水平直線周波数成分抽出部、１４水平直線パワー比計算部、１５総合パワー比計算部、１６音声・音楽識別部、１７識別結果出力部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an information identification apparatus and method that identify speech and music, for detection or retrieval, by extracting features from an audio signal containing speech, music, or other sound, or from an information source containing such an audio signal, and to a program and a recording medium.
[0002]
[Prior art]
In broadcast systems, multimedia systems, and the like, it is important to manage and classify large amounts of video and audio content efficiently and to make that content easy to search. To do so, it is essential to know which parts of the content carry what kind of information.
[0003]
Here, many multimedia contents and broadcast contents include an audio signal together with a video signal, which is very useful information for content classification and scene detection. In particular, efficient information retrieval and information management can be performed by identifying and detecting a voice portion and a music portion of an audio signal included in information.
[0004]
By the way, many techniques for discriminating speech and music have been studied conventionally, and a method for discriminating using the number of zero crossings, power fluctuation, spectrum fluctuation, etc. as a feature quantity has been proposed.
[0005]
For example, in the following Non-Patent Document 1, speech / music identification is performed using the number of zero crossings.
[0006]
In Non-Patent Document 2 below, speech and music are discriminated using 13 features, including 4 Hz modulation energy, low-energy frame rate, spectral roll-off point, spectral centroid, spectral flux, and zero-crossing rate, and the performance of each feature is compared and evaluated.
[0007]
Further, in Non-Patent Document 3 below, cepstral coefficients, delta cepstral coefficients, amplitude, delta amplitude, pitch, delta pitch, zero-crossing count, and delta zero-crossing count are used as features, and speech and music are discriminated by applying a Gaussian mixture model to each feature.
[0008]
In addition, detection methods have been studied that rely on the fact that the spectral peaks of music persist in the time direction while remaining stable at specific frequencies. The stability of a spectral peak can also be expressed as the presence or absence of a linear component in the time direction in the spectrogram. A spectrogram represents spectra arranged along the time direction as image information, with frequency on the vertical axis and time on the horizontal axis. Examples that use this feature include Non-Patent Document 4 and Patent Document 1 below.
[0009]
Here, in Patent Document 1, it is determined whether or not a music component exists by obtaining the edge strength of the entire band and comparing it with a threshold value. Further, a comb filter is applied to the spectrum from which the music component is removed, and the speech component is also detected by detecting the harmonic structure of the speech.
[0010]
That is, first, the edge strength ed (j) in the time direction in the frequency band j is obtained according to the following equation (1). Here, f (i, j) in equation (1) indicates the luminance at the pixel (i, j) on the spectrogram.
[0011]
[Expression 1]

[0012]
Next, the edge intensity ED of the entire band is obtained according to the following equation (2).
[0013]
[Expression 2]

[0014]
When the value of the edge strength ED is equal to or greater than the threshold value TH, it is determined that music exists in the detection range.
[0015]
[Non-Patent Document 1]
J. Saunders, "Real-time discrimination of broadcast speech/music", (USA), Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, 1996, p. 993-996
[Non-Patent Document 2]
E. Scheire and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", (USA), Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, 1997, p. 1331-1334
[Non-Patent Document 3]
M. J. Care, E. S. Parris and H. Lloyd-Thomas, "A comparison of features for speech, music discrimination", (USA), Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, March 1999, p. 149-152
[Non-Patent Document 4]
Minami, Akutsu, Hamada and Tonomura, "Video Indexing Using Sound Information and Its Applications", IEICE Transactions D-II, 1998, Vol. J81-D-II, No. 3, p. 529-537
[Patent Document 1]
Japanese Patent Laid-Open No. 10-187182
[0016]
[Problems to be solved by the invention]
However, among the conventional techniques described above, none of the features used for identification, such as the zero-crossing count, power fluctuation, or spectral centroid, is sufficient for identification on its own.
[0017]
Identification methods that focus on the stability of spectral peaks use an effective feature, except for percussive sounds and the like. However, because they use a simple sum of edge strength over the time and frequency directions, they sometimes fail to express adequately the time-direction stability of a peak at a specific frequency. That is, when the sum is simply taken over all times and all bands, a case in which the spectral peak fluctuates in the frequency direction or is intermittent may be indistinguishable from a case in which the spectrum persists stably at a specific frequency, that is, a case in which a time-direction linear component exists in the spectrogram. This can cause identification errors.
[0018]
The present invention has been proposed in view of these circumstances. Its object is to solve the problems of the prior art described above and to provide an information identification apparatus and method that identify and detect speech and music with higher accuracy, a program that causes a computer to execute the information identification processing, and a computer-readable recording medium on which the program is recorded.
[0019]
[Means for Solving the Problems]
In order to achieve the above-described object, an information identification apparatus according to the present invention identifies, for each predetermined time interval, whether an information source including an audio signal is speech or music, and comprises: spectrogram calculation means for frequency-analyzing the input audio signal in units of predetermined blocks and obtaining, for each predetermined identification interval, a spectrogram whose vertical and horizontal axes are frequency and time, respectively; component extraction means for extracting a horizontal DC component, that is, the component whose horizontal frequency is near zero in the two-dimensional frequency domain obtained by regarding the spectrogram as an image and converting that spectrogram image into a two-dimensional spectrum; power ratio calculation means for obtaining the ratio of the power of the horizontal DC component extracted by the component extraction means to the power of the entire region of the spectrogram; and identification means for identifying speech or music based on the power ratio obtained by the power ratio calculation means.
[0020]
Here, the information identification apparatus may include spectrogram dividing means for dividing a part of the spectrogram into a plurality of small blocks. In this case, the horizontal DC component extraction means extracts the horizontal DC component for each small block, and the power ratio calculation means obtains the power ratio for each small block. The apparatus may also include overall power ratio calculation means for obtaining an overall power ratio over all the small blocks based on the power ratios obtained for the individual small blocks. In this case, the identification means identifies speech or music based on the overall power ratio obtained by the overall power ratio calculation means.
[0021]
When identifying speech and music based on the fact that the spectral peaks of music persist stably in the time direction, such an information identification apparatus frequency-analyzes the input audio signal in units of predetermined blocks, obtains, for each predetermined identification interval, a spectrogram whose vertical and horizontal axes are frequency and time, respectively, and uses as a feature the ratio of the power of the horizontal DC component, obtained by regarding the spectrogram as an image, to the power of the entire region of the spectrogram.
[0022]
In order to achieve the above-described object, an information identification method according to the present invention identifies, for each predetermined time interval, whether an information source including an audio signal is speech or music, and comprises: a spectrogram calculation step of frequency-analyzing the input audio signal in units of predetermined blocks and obtaining, for each predetermined identification interval, a spectrogram whose vertical and horizontal axes are frequency and time, respectively; a component extraction step of extracting a horizontal DC component, that is, the component whose horizontal frequency is near zero in the two-dimensional frequency domain obtained by regarding the spectrogram as an image and converting that spectrogram image into a two-dimensional spectrum; a power ratio calculation step of obtaining the ratio of the power of the horizontal DC component extracted in the component extraction step to the power of the entire region of the spectrogram; and an identification step of identifying speech or music based on the power ratio obtained in the power ratio calculation step.
[0023]
Here, the information identification method may include a spectrogram dividing step of dividing a part of the spectrogram into a plurality of small blocks. In this case, the horizontal DC component is extracted for each small block in the component extraction step, and the power ratio is obtained for each small block in the power ratio calculation step. The method may also include an overall power ratio calculation step of obtaining an overall power ratio over all the small blocks based on the power ratios obtained for the individual small blocks. In this case, speech or music is identified in the identification step based on the overall power ratio obtained in the overall power ratio calculation step.
[0024]
When identifying speech and music based on the fact that the spectral peaks of music persist stably in the time direction, such an information identification method frequency-analyzes the input audio signal in units of predetermined blocks, obtains, for each predetermined identification interval, a spectrogram whose vertical and horizontal axes are frequency and time, respectively, and uses as a feature the ratio of the power of the horizontal DC component, obtained by regarding the spectrogram as an image, to the power of the entire region of the spectrogram.
[0025]
A program according to the present invention causes a computer to execute the above-described information identification processing, and a recording medium according to the present invention is a computer-readable medium on which such a program is recorded.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings. In this embodiment, the present invention is applied to an information identification apparatus that identifies and detects speech and music for each predetermined time interval of an audio signal, based on the fact that the spectral peaks of music persist stably in the time direction.
[0027]
In the following, before explaining the configuration and operation of the information identification device according to the present embodiment, the principle of the voice / music identification method in this information identification device will be described.
[0028]
First, FIG. 1(A) shows a typical spectrogram of music. The figure is abstracted; an actual spectrogram is obtained as a grayscale image in which pixel luminance varies with spectral magnitude. Although there are exceptions, such as music consisting only of percussion, in most ordinary music linear components in the horizontal direction, that is, the time direction, are observed in the spectrogram, as shown in FIG. 1(A). This is because, in music, spectral peaks in certain frequency bands persist stably in the time direction.
[0029]
FIG. 1(B), on the other hand, shows a typical spectrogram of speech. Unlike music, speech shows no horizontal linear components in the spectrogram; instead, the components are observed to fluctuate and undulate in the frequency direction. This indicates that, although speech has a harmonic structure, its frequency peaks fluctuate over time. Also, because voiced and unvoiced sounds alternate in speech, there are portions where the curves appear clearly and portions where they do not.
[0030]
Therefore, the spectrogram can be regarded as an image, and voice and music can be identified based on the presence or absence of the horizontal linear component in the spectrogram image or the degree thereof.
[0031]
FIG. 2 shows the two-dimensional frequency domain obtained when the spectrogram is regarded as an image. In the two-dimensional frequency domain obtained by converting the spectrogram image into a two-dimensional spectrum, the horizontal linear components of the spectrogram image are concentrated near the region LU hatched in FIG. 2, that is, in the horizontal DC component, where the horizontal frequency u is near zero.
[0032]
Note that spectral components of horizontal lines also exist near the vertical DC component (v = 0). However, that neighborhood also contains components that hardly change in the vertical direction, that is, components that cannot be called lines, so the neighborhood of v = 0 is excluded from the region LU.
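The behavior described in paragraphs [0031] and [0032] can be checked on toy images. The following is a minimal sketch, assuming NumPy and a 32 × 32 image whose rows are frequency bands and whose columns are time blocks; the function name `power_spectrum_2d` and the single-column stand-in for the region LU (u = 0, v ≠ 0) are illustrative assumptions, not definitions taken from the patent.

```python
import numpy as np

def power_spectrum_2d(image):
    """2-D power spectrum: axis 0 is vertical frequency v, axis 1 is horizontal frequency u."""
    return np.abs(np.fft.fft2(image)) ** 2

# Toy "spectrogram" images: rows are frequency bands, columns are time blocks.
lines = np.zeros((32, 32))
lines[[5, 12], :] = 1.0        # two perfectly horizontal (time-direction) lines
flat = np.ones((32, 32))       # no vertical variation at all, so not a line

P_lines = power_spectrum_2d(lines)
P_flat = power_spectrum_2d(flat)

# Share of power at u = 0 with v != 0 (a one-column stand-in for region LU).
lu_share_lines = P_lines[1:, 0].sum() / P_lines.sum()
lu_share_flat = P_flat[1:, 0].sum() / P_flat.sum()

# Share of power away from the u = 0 column.
off_axis_share_lines = P_lines[:, 1:].sum() / P_lines.sum()
```

For the line image, essentially no power falls outside the u = 0 column, and 0.9375 of the total lies at u = 0 with v ≠ 0; the remainder is the (u, v) = (0, 0) term. For the flat image, all power sits at (0, 0), which illustrates why the neighborhood of v = 0 is left out of the region LU.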
[0033]
Since the spectral power within the region LU corresponds to the horizontal linear components of the spectrogram, the ratio of the spectral power within the region LU to the spectral power of the entire region represents the degree of horizontal linear components in the spectrogram, that is, the persistence of spectral peaks in the time direction. Speech and music can be identified using this ratio as a feature.
[0034]
In practice, as shown in FIG. 3, it is preferable to divide the region AR of the spectrogram that contributes most to speech/music identification into a plurality of small blocks (small regions) SB, obtain the power ratio described above for each small block SB, and then obtain an overall power ratio over all the small blocks SB. Processing in small blocks SB in this way improves the detection accuracy of horizontal linear components. In addition, processing only the region AR that contributes most to speech/music identification also improves the identification accuracy.
[0035]
FIG. 4 shows a schematic configuration of the information identification apparatus according to the present embodiment, which identifies speech and music by the identification method described above. As shown in FIG. 4, the information identification apparatus 1 includes an audio signal input unit 10; a spectrogram calculation unit 11 that obtains a spectrogram of the input audio signal; a spectrogram dividing unit 12 that divides the spectrogram into a plurality of small blocks SB; a horizontal linear frequency component extraction unit 13 that extracts the horizontal linear frequency component in each small block SB of the divided spectrogram; a horizontal linear power ratio calculation unit 14 that obtains the ratio of the power of the horizontal linear component of each small block SB to the power of all components; an overall power ratio calculation unit 15 that obtains an overall horizontal linear component power ratio from the horizontal linear power ratios of all the small blocks SB; a speech/music identification unit 16 that uses the obtained overall horizontal linear component power ratio as a feature to identify, for each predetermined time interval of the input audio signal, whether it is speech or music; and an identification result output unit 17 that outputs the identification result.
[0036]
In the information identification apparatus 1, the audio signal input unit 10 receives an audio signal and supplies it to the spectrogram calculation unit 11. The spectrogram calculation unit 11 frequency-analyzes the input audio signal for each predetermined block to calculate a frequency spectrum, obtains a spectrogram of the input audio signal for each predetermined identification time, and supplies the spectrogram to the spectrogram dividing unit 12. The spectrogram dividing unit 12 divides the spectrogram supplied from the spectrogram calculation unit 11 into a plurality of small blocks SB, as described later, and supplies the spectrogram of each small block SB to the horizontal linear frequency component extraction unit 13.
[0037]
For each small block SB of the spectrogram, the horizontal linear frequency component extraction unit 13 extracts the frequency component corresponding to the horizontal linear component of that small block SB and supplies it to the horizontal linear power ratio calculation unit 14. The horizontal linear power ratio calculation unit 14 calculates the ratio of the power of the horizontal linear component to the power of all frequency band components, and the overall power ratio calculation unit 15 evaluates the horizontal linear component power ratios of all the small blocks SB to calculate the overall horizontal linear component power ratio.
[0038]
The speech/music identification unit 16 uses the obtained overall horizontal linear component power ratio as a feature to identify, by threshold determination or another statistical discrimination method, whether the identification interval of the input audio signal is speech or music, and supplies the identification result to the identification result output unit 17. The identification result output unit 17 then outputs the identification result supplied from the speech/music identification unit 16.
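The last two stages of FIG. 4 (units 15 and 16) can be sketched as follows. This is only an illustration under stated assumptions: the averaging and summing options follow equations (4) and (5) of the original description, and the threshold rule follows its paragraph 0054 (music when the overall ratio is at or above the threshold, speech otherwise). The function names and the threshold value in the usage lines are hypothetical; the patent gives no numeric threshold.

```python
import numpy as np

def overall_power_ratio(block_ratios, use_mean=True):
    """Combine per-block ratios R(m) into an overall ratio R (mean or plain sum)."""
    r = np.asarray(block_ratios, dtype=float)
    return float(r.mean() if use_mean else r.sum())

def classify(overall_ratio, threshold):
    """Threshold rule: 'music' if R >= threshold, otherwise 'speech'."""
    return "music" if overall_ratio >= threshold else "speech"

# Usage with made-up per-block ratios and a hypothetical threshold of 0.3.
R = overall_power_ratio([0.2, 0.4, 0.6])
label = classify(R, threshold=0.3)
```

A fixed threshold is the simplest choice; in practice the value would have to be tuned on labeled data, and the description also allows other statistical discrimination methods.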
[0039]
The processing of the information identification apparatus 1 is described in more detail with reference to the flowchart of FIG. 5. First, in step S1, a spectrogram of the input audio signal within a predetermined identification time is obtained. Here, the identification time is the identification block length used to distinguish speech from music in the input audio signal, and is preferably several seconds or longer. Specifically, an audio signal x(t) is input, divided into blocks at predetermined intervals (for example, 64 milliseconds), and frequency-analyzed to obtain spectra.
[0040]
Note that the frequency analysis block may overlap with an adjacent block. For example, it can be overlapped by 20 milliseconds. The frequency scale may be a logarithmic scale or a mel scale.
[0041]
The spectrum of the frequency band k in the i-th frequency analysis block is assumed to be f (i, k). A spectrogram is obtained by expressing the obtained spectrum f (i, k) as the luminance of a two-dimensional image, with i (time direction) on the horizontal axis and k (frequency direction) on the vertical axis.
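Step S1 can be illustrated by the following minimal sketch. It assumes a power spectrum computed with a Hann window and a linear frequency scale; the function name `spectrogram`, its parameters, and the choice of window are illustrative assumptions that do not appear in the embodiment, which fixes only the example block length (64 milliseconds) and overlap (20 milliseconds).

```python
import numpy as np

def spectrogram(x, sample_rate, block_ms=64.0, overlap_ms=20.0):
    """Return f[k, i]: power of frequency band k in the i-th analysis block.

    Rows are frequency (vertical axis), columns are time (horizontal axis).
    """
    block_len = int(round(sample_rate * block_ms / 1000.0))
    hop = block_len - int(round(sample_rate * overlap_ms / 1000.0))
    if hop <= 0:
        raise ValueError("overlap must be shorter than the block length")
    if len(x) < block_len:
        raise ValueError("signal is shorter than one analysis block")
    window = np.hanning(block_len)  # assumption: the source names no window
    n_blocks = 1 + (len(x) - block_len) // hop
    frames = np.stack([x[i * hop:i * hop + block_len] * window
                       for i in range(n_blocks)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power.T

# Usage: two seconds of a 440 Hz tone sampled at 16 kHz (hypothetical input).
sr = 16000
tone = np.sin(2 * np.pi * 440.0 * np.arange(2 * sr) / sr)
f = spectrogram(tone, sr)
print(f.shape)  # (frequency bands, analysis blocks)
```

A steady tone such as this one produces a single bright horizontal line in the resulting image, which is the structure the subsequent steps are designed to measure.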
[0042]
Next, in step S2, the spectrogram is divided into M small blocks SB as shown in FIG. 3. At this time, small blocks need only be formed for the portion that is expected to contribute to the identification. In the present embodiment, small blocks are formed over the entire time width of the spectrogram (that is, the identification time length) in the time direction, but in the frequency direction only a band that is important for identification (for example, 50 Hz to 4 kHz) is divided into small blocks, and the other bands are not used. Restricting the small blocks to the portion expected to contribute to the identification in this way improves the identification accuracy. The size of each small block SB is set to, for example, 32 × 32 so as to provide appropriate resolution in both the frequency direction and the time direction.
[0043]
Note that the small block SB may overlap with an adjacent block. In the present embodiment, it is assumed that the small blocks SB overlap by half.
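The division of step S2 might be sketched as follows, assuming the spectrogram is held as an array with frequency along the rows and time along the columns. The function name `split_into_small_blocks` and its parameters are hypothetical; the 32 × 32 size, the 50 Hz to 4 kHz band, and the half overlap are the example values given above. Rows or columns left over at the edges that cannot fill a whole block are simply dropped in this sketch, which is a simplification not stated in the embodiment.

```python
import numpy as np

def split_into_small_blocks(spec, freqs_hz, size=32, band_hz=(50.0, 4000.0)):
    """Return a list of size x size small blocks SB, overlapped by half."""
    keep = (freqs_hz >= band_hz[0]) & (freqs_hz <= band_hz[1])
    band = spec[keep, :]          # only the band used for identification
    step = size // 2              # half-block overlap with the neighbour
    blocks = []
    for k0 in range(0, band.shape[0] - size + 1, step):      # frequency
        for i0 in range(0, band.shape[1] - size + 1, step):  # time
            blocks.append(band[k0:k0 + size, i0:i0 + size])
    return blocks

# Usage with a hypothetical 513-band, 64-block spectrogram at 16 kHz.
freqs = np.arange(513) * (16000 / 1024)   # 15.625 Hz per band
spec = np.arange(513 * 64, dtype=float).reshape(513, 64)
blocks = split_into_small_blocks(spec, freqs)
print(len(blocks), blocks[0].shape)
```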
[0044]
For each small block SB divided in this way, the power ratio of the horizontal linear component is obtained in the subsequent stage.
[0045]
Subsequently, in step S3, the horizontal linear component power ratio R(m) of the region LU on the two-dimensional image spectrum is obtained for a given small block SB_m. That is, for each small block SB divided as described above, the spectrogram in the small block SB is regarded as an image, the component of the region LU in the two-dimensional frequency domain of the spectrogram image is extracted by a two-dimensional Fourier transform or a two-dimensional filter, and its power ratio with respect to the entire region is obtained.
[0046]
There are two methods for obtaining the horizontal linear component power ratio R(m) of the region LU: obtaining it from the spectrum of the region computed by a two-dimensional Fourier transform, and extracting only the band component of the region LU with a two-dimensional digital filter. In the method using the Fourier transform, the spectrogram image in the small block SB_m is first subjected to a two-dimensional Fourier transform, and the resulting two-dimensional power spectrum is denoted F_m(u, v). The ratio of the spectral power in the region LU to that of the entire band is then obtained. That is, the horizontal linear component power ratio R(m) of the small block SB_m is obtained by the following equation (3).
[0047]
[Equation 3]
$$R(m) = \frac{\sum_{(u,v) \in LU} F_m(u,v)}{\sum_{(u,v)} F_m(u,v)} \qquad (3)$$
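The Fourier-transform method can be sketched as below. The exact extent of the region LU is defined by the drawings and is not restated here, so this sketch assumes LU to be the band of horizontal frequencies |u| ≤ `u_max` over all vertical frequencies v; `u_max` and the function name `horizontal_power_ratio_fft` are illustrative assumptions.

```python
import numpy as np

def horizontal_power_ratio_fft(block, u_max=1):
    """R(m): share of the 2-D power spectrum lying in the assumed region LU."""
    # F_m(u, v): axis 0 is the vertical frequency v, axis 1 the horizontal u.
    power = np.abs(np.fft.fft2(block)) ** 2
    total = power.sum()
    if total == 0.0:
        return 0.0
    # Integer horizontal frequency index of each column of the spectrum.
    u = np.fft.fftfreq(block.shape[1]) * block.shape[1]
    in_lu = np.abs(u) <= u_max    # assumption: LU = horizontal frequency near 0
    return float(power[:, in_lu].sum() / total)

# Usage: horizontal lines (constant in time) versus rapid change in time.
horizontal = np.tile(np.sin(np.arange(32.0))[:, None], (1, 32))
vertical = np.tile((-1.0) ** np.arange(32), (32, 1))
r_h = horizontal_power_ratio_fft(horizontal)
r_v = horizontal_power_ratio_fft(vertical)
print(r_h, r_v)
```

An image made of horizontal lines concentrates its power at horizontal frequency 0 and therefore yields a ratio close to 1, whereas an image that changes rapidly along the time axis yields a ratio close to 0.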
[0048]
On the other hand, when a two-dimensional filter is used, a two-dimensional bandpass filter that passes only the region LU is applied to the spectrogram image in the small block SB_m. The horizontal linear component power ratio R(m) is then obtained as the ratio between the power of the filtered signal and the power of the original, unfiltered signal.
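The filter-based method might be sketched as follows. The embodiment does not specify the filter design, so a moving average along the time axis (a 1 × `taps` kernel, which passes horizontal frequencies near 0 at every vertical frequency) stands in here for the two-dimensional bandpass filter; the kernel, the edge padding, and the function name `horizontal_power_ratio_filter` are all assumptions made for illustration.

```python
import numpy as np

def horizontal_power_ratio_filter(block, taps=9):
    """R(m): power of the time-smoothed block over the power of the original."""
    if taps % 2 == 0:
        raise ValueError("taps must be odd so the output keeps the block size")
    kernel = np.ones(taps) / taps
    pad = taps // 2
    # Repeat the edge columns so the filtered block has the original width.
    padded = np.pad(block, ((0, 0), (pad, pad)), mode="edge")
    filtered = np.stack([np.convolve(row, kernel, mode="valid")
                         for row in padded])
    total = np.sum(block ** 2)
    if total == 0.0:
        return 0.0
    return float(np.sum(filtered ** 2) / total)

# Usage: the same two test images as a sanity check.
horizontal = np.tile(np.sin(np.arange(32.0))[:, None], (1, 32))
vertical = np.tile((-1.0) ** np.arange(32), (32, 1))
r_h = horizontal_power_ratio_filter(horizontal)
r_v = horizontal_power_ratio_filter(vertical)
print(r_h, r_v)
```

Because a moving average is not an ideal bandpass filter, the ratio for a rapidly varying image is small but not exactly zero; a sharper filter would bring this sketch closer to the Fourier-transform method.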
[0049]
In step S4, it is determined whether all the small blocks SB have been processed. If the horizontal linear component power ratio R(m) has been obtained for all the small blocks SB (Yes), the process proceeds to step S5. If not (No), the horizontal linear component power ratio R(m) is obtained in the same way for the next small block SB.
[0050]
In step S5, the total horizontal linear component power ratio R for all the small blocks SB is obtained. For example, as shown in the following equation (4), the average of the horizontal linear component power ratios R(m) of the small blocks SB can be used as the total horizontal linear component power ratio R. Here, in equation (4), m indicates the number of a small block, and M indicates the number of small blocks.
[0051]
[Equation 4]
$$R = \frac{1}{M} \sum_{m=1}^{M} R(m) \qquad (4)$$
[0052]
Note that the total horizontal linear component power ratio R is not limited to the average; as shown in the following equation (5), the simple sum of the horizontal linear component power ratios R(m) of the small blocks SB may instead be used as the total horizontal linear component power ratio R.
[0053]
[Equation 5]
$$R = \sum_{m=1}^{M} R(m) \qquad (5)$$
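The two ways of combining the per-block ratios in step S5 can be sketched in a few lines. The function name `total_power_ratio` and the `mode` argument are hypothetical.

```python
import numpy as np

def total_power_ratio(ratios, mode="mean"):
    """Combine the per-block ratios R(m) into the total ratio R."""
    r = np.asarray(ratios, dtype=float)
    if r.size == 0:
        raise ValueError("at least one small block is required")
    if mode == "mean":   # average over the M small blocks, as in equation (4)
        return float(r.mean())
    if mode == "sum":    # simple sum, as in equation (5)
        return float(r.sum())
    raise ValueError("mode must be 'mean' or 'sum'")

# Usage with three hypothetical per-block ratios.
r_mean = total_power_ratio([0.2, 0.4, 0.6])
r_sum = total_power_ratio([0.2, 0.4, 0.6], mode="sum")
print(r_mean, r_sum)
```

The sum differs from the average only by the factor M, so the two are interchangeable as long as the number of small blocks is the same for every identification section; any threshold simply has to be scaled to match.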
[0054]
In step S6, speech and music are identified using the total horizontal linear component power ratio R as a feature amount. In general, because spectral peaks persist in typical music, the total horizontal linear component power ratio R takes a large value for music and a small value for speech. The identification method is not limited in the present invention; the simplest method is to compare the total horizontal linear component power ratio R with a threshold Th, determining the section to be music if R is equal to or greater than Th and speech if R is less than Th.
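The threshold judgment of step S6 reduces to a single comparison, sketched below. The function name `classify_by_threshold` and the threshold value used in the usage lines are hypothetical; the embodiment gives no numerical value for Th.

```python
def classify_by_threshold(total_ratio, threshold):
    """Return 'music' if R >= Th, otherwise 'speech'."""
    return "music" if total_ratio >= threshold else "speech"

# Usage with a hypothetical threshold Th = 0.5.
print(classify_by_threshold(0.62, 0.5))
print(classify_by_threshold(0.31, 0.5))
```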
[0055]
Alternatively, a statistical discrimination method such as the Bayes decision rule may be used, in which the distribution of the total horizontal linear component power ratio R is modeled by a normal distribution for each of speech and music and the class with the larger posterior probability is chosen. The total horizontal linear component power ratio R may also be combined with other feature amounts to make a comprehensive determination.
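A decision by the larger posterior probability under one normal distribution per class might look like the following sketch. The model parameters in the usage lines, the prior, and the function names are hypothetical; in practice the mean and standard deviation of each class would be estimated from labelled training sections.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def classify_bayes(r, speech_model, music_model, prior_music=0.5):
    """Choose the class with the larger posterior probability.

    Each model is a (mean, std) pair for the total ratio R of that class.
    The shared evidence term cancels, so likelihood times prior is compared.
    """
    p_music = gaussian_pdf(r, *music_model) * prior_music
    p_speech = gaussian_pdf(r, *speech_model) * (1.0 - prior_music)
    return "music" if p_music >= p_speech else "speech"

# Usage with hypothetical class models (mean, std).
speech_model = (0.2, 0.05)
music_model = (0.5, 0.1)
print(classify_bayes(0.45, speech_model, music_model))
print(classify_bayes(0.22, speech_model, music_model))
```

Unlike a fixed threshold, this rule accounts for the two classes having different spreads, and the prior can be adjusted when one class is known to be more frequent in the material.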
[0056]
As described above, the information identification apparatus 1 of the present embodiment extracts the frequency component corresponding to the horizontal linear component in the spectrogram of the input audio signal and uses its power ratio with respect to the whole as a feature amount. This effectively expresses the peak persistence of specific bands in the spectrum, so that speech and music can be identified with high accuracy.
[0057]
Further, since the spectrogram is divided into small blocks SB in advance, the horizontal linear component power ratio R(m) is obtained for each small block SB, and the total horizontal linear component power ratio R over all the small blocks SB is then obtained, the performance of the peak persistence analysis is improved.
[0058]
It should be noted that the present invention is not limited to the above-described embodiments, and various modifications can be made without departing from the scope of the present invention.
[0059]
For example, although the above embodiment has been described as a hardware configuration, the present invention is not limited to this, and any of the processing can also be realized by causing a CPU (Central Processing Unit) to execute a computer program. In this case, the computer program can be provided by being recorded on a recording medium, or can be provided by being transmitted via the Internet or another transmission medium.
[0060]
[Effect of the invention]
As described above in detail, the information identification apparatus according to the present invention identifies, for each predetermined time interval, whether audio from an information source including an audio signal is speech or music. The apparatus comprises: spectrogram calculation means for performing frequency analysis on the input audio signal in units of predetermined blocks and obtaining, for each predetermined identification section, a spectrogram whose vertical and horizontal axes are frequency and time, respectively; horizontal DC component extraction means for extracting a horizontal DC component when the spectrogram is regarded as an image; power ratio calculation means for obtaining the ratio of the power of the horizontal DC component to the power of the entire region of the spectrogram; and identification means for identifying speech or music based on the power ratio obtained by the power ratio calculation means.
[0061]
Here, the information identification apparatus may include spectrogram dividing means for dividing a part of the spectrogram into a plurality of small blocks. In this case, the horizontal DC component extraction means extracts the horizontal DC component for each small block, and the power ratio calculation means obtains the power ratio for each small block. The apparatus may further include total power ratio calculation means for obtaining a total power ratio over all the small blocks based on the power ratios obtained for the individual small blocks. In this case, the identification means identifies speech or music based on the total power ratio obtained by the total power ratio calculation means.
[0062]
According to such an information identification apparatus, speech and music are identified based on the feature that the spectral peaks of music are stably maintained in the time direction. The input audio signal is subjected to frequency analysis in units of predetermined blocks, a spectrogram whose vertical and horizontal axes are frequency and time, respectively, is obtained for each predetermined identification section, and the ratio of the power of the horizontal DC component, when the spectrogram is regarded as an image, to the power of the entire region of the spectrogram is used as a feature amount. This effectively expresses the peak persistence of specific bands in the spectrum, so that speech and music can be identified with high accuracy.
[0063]
The information identification method according to the present invention identifies, for each predetermined time interval, whether audio from an information source including an audio signal is speech or music. The method comprises: a spectrogram calculation step of performing frequency analysis on the input audio signal in units of predetermined blocks and obtaining, for each predetermined identification section, a spectrogram whose vertical and horizontal axes are frequency and time, respectively; a horizontal DC component extraction step of extracting a horizontal DC component when the spectrogram is regarded as an image; a power ratio calculation step of obtaining the ratio of the power of the horizontal DC component to the power of the entire region of the spectrogram; and an identification step of identifying speech or music based on the power ratio obtained in the power ratio calculation step.
[0064]
Here, the information identification method may include a spectrogram dividing step of dividing a part of the spectrogram into a plurality of small blocks. In this case, the horizontal DC component is extracted for each small block in the component extraction step, and the power ratio is obtained for each small block in the power ratio calculation step. The method may further include a total power ratio calculation step of obtaining a total power ratio over all the small blocks based on the power ratios obtained for the individual small blocks. In this case, speech or music is identified in the identification step based on the total power ratio obtained in the total power ratio calculation step.
[0065]
According to such an information identification method, speech and music are identified based on the feature that the spectral peaks of music are stably maintained in the time direction. The input audio signal is subjected to frequency analysis in units of predetermined blocks, a spectrogram whose vertical and horizontal axes are frequency and time, respectively, is obtained for each predetermined identification section, and the ratio of the power of the horizontal DC component, when the spectrogram is regarded as an image, to the power of the entire region of the spectrogram is used as a feature amount. This effectively expresses the peak persistence of specific bands in the spectrum, so that speech and music can be identified with high accuracy.
[0066]
A program according to the present invention causes a computer to execute the above-described information identification processing, and a recording medium according to the present invention is a computer-readable medium on which such a program is recorded.
[0067]
According to such a program and recording medium, the above-described information identification processing can be realized by software.
[Brief description of the drawings]
FIG. 1 is a diagram conceptually illustrating a typical example of a spectrogram, where FIG. 1A shows a spectrogram of music and FIG. 1B shows a spectrogram of speech.
FIG. 2 is a diagram showing a two-dimensional frequency region obtained by converting a spectrogram image into a two-dimensional spectrum.
FIG. 3 is a diagram illustrating a state in which a spectrogram image is divided into a plurality of small blocks.
FIG. 4 is a diagram illustrating a schematic configuration of an information identification device according to the present embodiment.
FIG. 5 is a flowchart for explaining the operation of the information identification apparatus.
[Explanation of symbols]
1 Information identification apparatus, 10 Audio signal input unit, 11 Spectrogram calculation unit, 12 Spectrogram division unit, 13 Horizontal linear frequency component extraction unit, 14 Horizontal linear power ratio calculation unit, 15 Total power ratio calculation unit, 16 Speech/music identification unit, 17 Identification result output unit

Claims

In an information identification apparatus for identifying, for each predetermined time interval, whether audio from an information source including an audio signal is speech or music,
Spectrogram calculation means for performing frequency analysis of an input audio signal in units of predetermined blocks and obtaining, for each predetermined identification section, a spectrogram whose vertical and horizontal axes are frequency and time, respectively;
Component extraction means for extracting a horizontal DC component as the component whose horizontal frequency is near 0 in the two-dimensional frequency domain obtained by converting the spectrogram image, when the spectrogram is regarded as an image, into a two-dimensional spectrum;
Power ratio calculation means for obtaining the ratio of the power of the horizontal DC component extracted by the component extraction means to the power of the entire region of the spectrogram;
An information identification apparatus comprising: identification means for identifying speech or music based on the power ratio obtained by the power ratio calculation means.

Spectrogram dividing means for dividing a part of the spectrogram image when the spectrogram is regarded as an image into a plurality of small blocks;
The component extraction means extracts the horizontal DC component for each of the small blocks,
The information identification apparatus according to claim 1, wherein the power ratio calculation means obtains the power ratio for each of the small blocks.

Total power ratio calculation means for obtaining a total power ratio over all the small blocks based on the power ratio obtained for each of the small blocks,
The information identification apparatus according to claim 2, wherein the identification means identifies speech or music based on the total power ratio obtained by the total power ratio calculation means.

The component extraction means obtains a two-dimensional spectrum by performing a two-dimensional Fourier transform on the spectrogram image when the spectrogram is regarded as an image, and extracts the horizontal DC component in the spectral domain,
The information identification apparatus according to claim 1, wherein the power ratio calculation means calculates the power ratio by obtaining the ratio of the power of the horizontal DC component to the power of the entire region of the spectrogram.

The component extraction means extracts the horizontal DC component by applying, to the spectrogram, a two-dimensional filter that passes the components whose horizontal frequency is near 0 in the two-dimensional frequency domain obtained by converting the spectrogram image, when the spectrogram is regarded as an image, into a two-dimensional spectrum,
The power ratio calculation means calculates the power ratio by obtaining the ratio of the power of the horizontal DC component to the time-domain power of the original signal not subjected to filtering. Information identification apparatus.

In an information identification method for identifying, for each predetermined time interval, whether audio from an information source including an audio signal is speech or music,
A spectrogram calculation step of performing frequency analysis of the input audio signal in units of predetermined blocks and obtaining, for each predetermined identification section, a spectrogram whose vertical and horizontal axes are frequency and time, respectively;
A component extraction step of extracting a horizontal DC component as the component whose horizontal frequency is near 0 in the two-dimensional frequency domain obtained by converting the spectrogram image, when the spectrogram is regarded as an image, into a two-dimensional spectrum;
A power ratio calculation step of obtaining the ratio of the power of the horizontal DC component extracted in the component extraction step to the power of the entire region of the spectrogram;
And an identification step of identifying speech or music based on the power ratio obtained in the power ratio calculation step.

A spectrogram dividing step of dividing a part of a spectrogram image when the spectrogram is regarded as an image into a plurality of small blocks;
In the component extraction step, the horizontal DC component is extracted for each small block,
The information identification method according to claim 6, wherein in the power ratio calculation step, the power ratio is obtained for each of the small blocks.

Based on the power ratio obtained for each of the small blocks, a total power ratio calculation step for obtaining a total power ratio in all the small blocks,
8. The information identification method according to claim 7, wherein in the identification step, voice or music is identified based on the total power ratio obtained in the total power ratio calculation step.

In a program for causing a computer to execute information identification processing for identifying, for each predetermined time interval, whether audio from an information source including an audio signal is speech or music,
A spectrogram calculation step of performing frequency analysis of the input audio signal in units of predetermined blocks and obtaining, for each predetermined identification section, a spectrogram whose vertical and horizontal axes are frequency and time, respectively;
A component extraction step of extracting a horizontal DC component as the component whose horizontal frequency is near 0 in the two-dimensional frequency domain obtained by converting the spectrogram image, when the spectrogram is regarded as an image, into a two-dimensional spectrum;
A power ratio calculation step of obtaining the ratio of the power of the horizontal DC component extracted in the component extraction step to the power of the entire region of the spectrogram;
An identification step of identifying speech or music based on the power ratio obtained in the power ratio calculation step.

In a computer-readable recording medium on which is recorded a program for causing a computer to execute information identification processing for identifying, for each predetermined time interval, whether audio from an information source including an audio signal is speech or music,
A spectrogram calculation step of performing frequency analysis of the input audio signal in units of predetermined blocks and obtaining, for each predetermined identification section, a spectrogram whose vertical and horizontal axes are frequency and time, respectively;
A component extraction step of extracting a horizontal DC component as the component whose horizontal frequency is near 0 in the two-dimensional frequency domain obtained by converting the spectrogram image, when the spectrogram is regarded as an image, into a two-dimensional spectrum;
A power ratio calculation step of obtaining the ratio of the power of the horizontal DC component extracted in the component extraction step to the power of the entire region of the spectrogram;
And an identification step of identifying speech or music based on the power ratio obtained in the power ratio calculation step.