JP3798991B2

JP3798991B2 - Audio signal search method, audio signal search apparatus, program thereof, and recording medium for the program

Info

Publication number: JP3798991B2
Application number: JP2002047806A
Authority: JP
Inventors: 啓敏須賀; 純司寺本; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-02-25
Filing date: 2002-02-25
Publication date: 2006-07-19
Anticipated expiration: 2022-02-25
Also published as: JP2003248494A

Description

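[0001]
BACKGROUND OF THE INVENTION
The present invention relates to audio search systems, and in particular to an audio signal search method, an audio signal search apparatus, a program thereof, and a recording medium for the program, for searching an audio database for audio signal sections similar to an input or designated audio signal.
[0002]
[Prior art]
A conventional technique for searching an audio database for audio signals similar to an input audio signal is described in Reference 1 below.
[Reference 1] Takashi Endo, Masayuki Nakazawa, Hironobu Takahashi, Ryuichi Oka: "Data Representation and Spotting Mutual Search by Self-Organizing Network of Voice and Video", 1998 Annual Conference of the Japanese Society for Artificial Intelligence (12th), S5-04, pp. 122-125.
This method uses an IPM (Incremental Path Method) network to compute the similarity between audio signals by dynamic matching similar to DP matching, and retrieves from the audio database audio signals similar to the input audio signal. Because it uses dynamic matching, the method can compute similarity even for audio signals that are stretched or compressed nonlinearly along the time axis.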
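[0003]
[Problems to be solved by the invention]
Searching audio signals with an audio signal requires computing the similarity between audio signals. In general, an audio signal stretches and compresses nonlinearly along the time axis even when the same character string is uttered. To handle this nonlinear time warping, similarity computation has therefore required dynamic matching such as DP matching or hidden Markov models (HMMs).
[0004]
However, not all audio signals warp nonlinearly along the time axis. One example is the singing voice. A singing voice has an underlying tempo, so within a short section the tempo of the whole section may be slower or faster, but the tempo does not fluctuate within the section. In other words, over a short section a singing voice may stretch or compress linearly as a whole, but it does not warp nonlinearly within the section.
[0005]
Another example is an announcer's speech. Announcers are very skilled at uttering the same words repeatedly in the same manner, so the same words do not warp nonlinearly. Furthermore, even for an ordinary speaker, an utterance as short as a single word may stretch or compress linearly as a whole but can be regarded as having almost no nonlinear warping within the section.
[0006]
The prior art uses dynamic matching to compute the similarity between audio signals. Consequently, dynamic matching, which can also handle nonlinear warping, is applied even to audio that undergoes only linear stretching or compression. Compared with static matching, which computes a distance such as the Euclidean or Manhattan distance between fixed-length vectors, dynamic matching requires more computation and therefore lengthens the search time.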
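[0007]
[Means for Solving the Problems]
To solve the above problem, the present invention enables features cut out with time windows to be compared at high speed by static matching rather than dynamic matching when searching audio data. To this end, the audio time-series feature data are linearly stretched or compressed into fixed-length data of a given time-section length. The invention is characterized in that audio data similar to the search key can then be retrieved at high speed by simple matching between fixed-length data.
[0008]
Specifically, the audio signal search apparatus of the present invention comprises audio time-series feature extraction means, time window extraction means, partial audio time-series feature fixed-length conversion means, search key audio time-series feature extraction means, audio storage means, search condition input means, feature information comparison means, and display format generation means. Audio time-series feature linear compression means and search key audio time-series feature linear compression means may additionally be provided.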
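[0009]
The audio time-series feature extraction means extracts audio time-series features from an audio signal, which is a time-series signal.
[0010]
The time window extraction means first prepares a reference time window, that is, a time window of reference length. It then cuts out partial audio time-series features of the reference time window length from the audio time-series features extracted by the audio time-series feature extraction means, shifting the reference time window little by little. Next, it prepares time windows of several different lengths centered on the reference time window length, and cuts out partial audio time-series features with these windows in the same way.
[0011]
The partial audio time-series feature fixed-length conversion means linearly stretches or compresses the partial audio time-series features cut out with the time windows of several lengths so that they match the reference time window length, and extracts each result as an audio time-series feature vector.
[0012]
The audio time-series feature linear compression means linearly compresses the reference-window-length audio time-series features generated by the partial audio time-series feature fixed-length conversion means to a given length, and extracts audio time-series feature vectors.
[0013]
The search key audio time-series feature extraction means extracts search key audio time-series features of the reference time window length from an audio signal of reference time window length input as the search key, and extracts them as the search key audio time-series feature vector.
[0014]
The search key audio time-series feature linear compression means linearly compresses the reference-window-length audio time-series features extracted by the search key audio time-series feature extraction means to the same length as the audio time-series feature vectors, and extracts the search key audio time-series feature vector.
[0015]
The audio storage means stores the input audio signal to be searched. It also creates an index from the audio time-series feature vectors extracted by the audio time-series feature linear compression means and stores it together with the vectors, and it stores time window extraction information, which associates each audio time-series feature vector with the audio signal section from which it was extracted.
[0016]
The search condition input means inputs conditions for designating the search key when an audio signal section stored in the audio storage means is used as the search key.
[0017]
The feature information comparison means computes, by static matching, the similarity distance between the search key audio time-series feature vector extracted by the search key audio time-series feature linear compression means and the audio time-series feature vectors stored in the audio storage means, and sets the similarity accordingly. This makes similarity computation faster than with dynamic matching. The means then outputs the audio time-series feature vectors in descending order of similarity to the search key audio time-series feature vector.
[0018]
The display format generation means associates the ranked audio time-series feature vectors output from the feature information comparison means with their source audio signal sections, using the time window extraction information stored in the audio storage means, and outputs them to a display device.
Furthermore, the search condition input means has means for letting the user input designation information that selects one of the search results output by the display format generation means as a search key, and for performing a re-search, by the above similarity computation, using the audio data obtained from the designated search result as the search key. The display format generation means has means for outputting that search result.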
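[0019]
DETAILED DESCRIPTION OF THE INVENTION
[Embodiment 1]
FIG. 1 is a configuration diagram for explaining Embodiment 1 of the present invention. In Embodiment 1, audio input from outside is used as the search key.
[0020]
The operation of the present invention consists of an audio storage phase P1, an audio time-series feature vector extraction phase P2 called from P1, an audio search phase P3, and a search key audio time-series feature vector extraction phase P4 called from P3. The operation of each phase is described below.
[0021]
[A] Audio storage phase P1 and audio time-series feature vector extraction phase P2
FIG. 2 is a flowchart explaining the operation of the audio storage phase P1 and the audio time-series feature vector extraction phase P2.
[0022]
First, the audio signal to be searched is input from the search target audio signal input device 20, and the audio storage unit 14 stores this audio signal in the storage device 15 (step S1).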
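[0023]
Next, the audio time-series feature extraction unit 11 extracts audio time-series features from the input audio signal (step S2). As the audio time-series features, one can use, for example, multidimensional vectors representing the low-order mel-frequency cepstral coefficients, their first-order and second-order differences, the audio power, or the per-band audio power obtained by filter bank analysis, arranged in time order. Examples of audio time-series features are described in Reference 2 below.
[Reference 2] Kiyohiro Shikano et al.: "IT Text: Speech Recognition System", Ohmsha, 2001.
Next, as shown in FIG. 3, the time window extraction unit 12 first sets a time window of reference length as the reference time window, and cuts out partial audio time-series features of the reference time window length while shifting the reference time window little by little (step S3).
[0024]
As shown in FIG. 4, time windows of several lengths centered on the reference time window length are then set and, as with the reference time window, partial audio time-series features of each window length are cut out while shifting the window little by little (step S4). In the experiments described later, the reference time window is 150 frames, with each frame about 26 milliseconds long. The lower limit of the time window length was 118 frames and the upper limit was 182 frames. Once the partial audio time-series features have been cut out with these windows, the audio storage unit 14 stores information about the windowing in the storage device 15 as time window extraction information.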
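The windowing in steps S3 and S4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the 118 and 182 frame limits and the count of nine window lengths come from the text and the first experiment, while the even spacing between lengths, the hop of 10 frames, and the function names are assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple


def window_lengths(lo: int = 118, hi: int = 182, count: int = 9) -> List[int]:
    """Evenly spaced window lengths in frames, from lo to hi inclusive.

    Even spacing is an assumption; count must be at least 2.
    """
    step = (hi - lo) / (count - 1)
    return [round(lo + k * step) for k in range(count)]


def sliding_windows(
    num_frames: int, lengths: Sequence[int], hop: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start_frame, length) for every window that fits in the signal."""
    for length in lengths:
        for start in range(0, num_frames - length + 1, hop):
            yield start, length


# Each (start, length) pair can also serve as the time window extraction
# information that maps a feature vector back to its source section.
lengths = window_lengths()
windows = list(sliding_windows(num_frames=400, lengths=lengths, hop=10))
```

With these defaults the lengths step by 8 frames and include the 150-frame reference window.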
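[0025]
Then, as shown in FIG. 5, the partial audio time-series feature fixed-length conversion unit 13 linearly stretches or compresses, along the time axis, each partial audio time-series feature cut out with the windows of several lengths so that it matches the reference time window length (step S5), and takes the resulting reference-window-length partial audio time-series feature as an audio time-series feature vector (step S6).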
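A minimal sketch of the fixed-length conversion in steps S5 and S6 is given below. The text only says that each partial feature sequence is stretched or compressed linearly along the time axis; the use of linear interpolation between neighboring frames, and the names `resample_linear` and `flatten`, are assumptions made for illustration.

```python
from typing import List, Sequence


def resample_linear(
    frames: Sequence[Sequence[float]], target_len: int
) -> List[List[float]]:
    """Linearly rescale a frame sequence along the time axis to target_len frames."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input frame and one output frame")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)  # fractional source position
        i = int(pos)
        frac = pos - i
        nxt = min(i + 1, n - 1)
        out.append(
            [(1 - frac) * a + frac * b for a, b in zip(frames[i], frames[nxt])]
        )
    return out


def flatten(frames: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the frames in time order into one fixed-length vector."""
    return [value for frame in frames for value in frame]
```

For example, a 118-frame window and a 182-frame window would both become 150-frame sequences, so the flattened vectors have the same dimensionality and can be compared directly.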
[0026]
An index is then built from the resulting audio time-series feature vectors and stored, together with the vectors, in the storage device 15 of the audio storage unit 14 (step S7). As the index structure for multidimensional vectors, for example, the SR-tree described in Reference 3 or the A-tree described in Reference 4 can be used.
[Reference 3] Norio Katayama and Shin'ichi Satoh: "The SR-Tree: An Index Structure for High-Dimensional Nearest Neighbor Queries", In Proc. ACM SIGMOD International Conference on Management of Data, pp. 368-380, May 1997.
[Reference 4] Yasushi Sakurai, Masatoshi Yoshikawa, Shunsuke Uemura, and Haruhiko Kojima: "The A-Tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation", In Proc. of the 26th International Conference on Very Large Data Bases (VLDB), pp. 516-526, Cairo, September 2000.
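[B] Audio search phase P3 and search key audio time-series feature vector extraction phase P4
FIG. 6 is a flowchart explaining the operation of the audio search phase P3 and the search key audio time-series feature vector extraction phase P4.
[0027]
First, an audio signal of reference time window length to serve as the search key is input using the search key audio signal input device 22 (step S10).
[0028]
Next, the search key audio time-series feature extraction unit 18 extracts search key audio time-series features of the reference time window length from the input audio signal of reference time window length (step S11). The same features as the audio time-series features of the audio storage phase P1 described above are used. These search key audio time-series features are taken as the search key audio time-series feature vector (step S12).
[0029]
The feature information comparison unit 16 then computes the similarity distance between the resulting search key audio time-series feature vector and the audio time-series feature vectors stored in the storage device 15 by the audio storage unit 14 (step S13). The distance is computed by static matching, such as the Euclidean or Manhattan distance between fixed-length vectors, using the index stored in the storage device 15 of the audio storage unit 14. This makes the distance computation faster than with dynamic matching. The audio signal sections of the audio time-series feature vectors are then ranked in ascending order of distance (step S14).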
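Steps S13 and S14 can be sketched as a distance computation followed by a sort. The sketch below scans every stored vector, which is a simplification: the text performs the lookup through a multidimensional index such as an SR-tree or A-tree. The dictionary layout, the `top_k` cutoff, and the function names are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_distance(
    query: Vector,
    stored: Dict[str, Vector],
    distance: Callable[[Vector, Vector], float] = euclidean,
    top_k: int = 20,
) -> List[Tuple[float, str]]:
    """Return (distance, segment_id) pairs, nearest first (brute-force scan)."""
    scored = [(distance(query, vec), seg_id) for seg_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0])
    return scored[:top_k]
```

Because every vector has the same length, each comparison costs time proportional to the vector dimension, whereas DP matching between two sequences typically costs time proportional to the product of their lengths.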
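[0030]
Finally, the display format generation unit 17 uses the time window extraction information stored in the storage device 15 by the audio storage unit 14 to associate each audio time-series feature vector with its source audio signal section, and outputs the result to the display device 21 (step S15). For example, when displaying the parts of a one-hour music program that match the search key, the display lists, in descending rank order, information indicating each audio signal section, such as how many minutes and seconds from the start of the program it begins, together with a playback button for that part. Pressing the playback button outputs the audio of that part. This saves the searcher the effort of locating the audio signal sections that suit the search purpose. If the database holds information such as song titles for the search targets, the titles of the retrieved songs can be displayed as well.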
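[0031]
[Embodiment 2]
FIG. 7 is a configuration diagram for explaining Embodiment 2 of the present invention. Embodiment 2 compresses the features cut out with the time windows to reduce their size, so that a smaller database can be built and searched.
[0032]
In Embodiment 2, the partial audio time-series features that have been linearly stretched or compressed to the reference time window length are further linearly compressed along the time axis and used as the audio time-series feature vectors. Correspondingly, the search key audio time-series features of reference time window length are linearly compressed along the time axis to the same length as the audio time-series feature vectors and used as the search key audio time-series feature vector. Compared with Embodiment 1, therefore, the audio time-series feature vector extraction phase P2 called from the audio storage phase P1 and the search key audio time-series feature vector extraction phase P4 called from the audio search phase P3 differ, as follows.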
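[0033]
[A] Audio storage phase P1 and audio time-series feature vector extraction phase P2
FIG. 8 is a flowchart explaining the operation of the audio storage phase P1 and the audio time-series feature vector extraction phase P2.
[0034]
As in Embodiment 1, an audio signal is input from the search target audio signal input device 20 (step S20). The audio time-series feature extraction unit 11 extracts audio time-series features from the input audio signal (step S21), and the time window extraction unit 12 cuts out partial audio time-series features of the reference time window length while shifting the reference time window little by little (step S22). Partial audio time-series features are also cut out with time windows of several lengths (step S23), and the partial audio time-series feature fixed-length conversion unit 13 linearly stretches or compresses these along the time axis to generate partial audio time-series features matched to the reference time window length (step S24).
[0035]
Next, as shown in FIG. 9, the audio time-series feature linear compression unit 30 linearly compresses each reference-window-length partial audio time-series feature along the time axis and extracts the audio time-series feature vectors (step S25). This lowers the dimensionality of the audio time-series feature vectors, which reduces the cost of similarity computation and also the capacity required of the storage device 15.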
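The linear compression of step S25 can be illustrated with a simple scheme that averages consecutive frames in equal-sized groups. How FIG. 9 actually performs the compression is not stated in the text, so the group-averaging rule, the 30-frame target, and the name `compress_linear` are all assumptions of this sketch.

```python
from typing import List, Sequence


def compress_linear(
    frames: Sequence[Sequence[float]], target_len: int
) -> List[List[float]]:
    """Shorten a frame sequence by averaging consecutive groups of frames."""
    n = len(frames)
    if not 0 < target_len <= n:
        raise ValueError("target_len must be between 1 and len(frames)")
    dims = len(frames[0])
    out = []
    for j in range(target_len):
        lo = j * n // target_len          # first frame of group j
        hi = (j + 1) * n // target_len    # one past the last frame of group j
        group = frames[lo:hi]
        out.append([sum(f[d] for f in group) / len(group) for d in range(dims)])
    return out


# 150 frames of 5 coefficients (750 values) shrink to 30 frames (150 values).
reference_window = [[float(t)] * 5 for t in range(150)]
compressed = compress_linear(reference_window, 30)
```

The search key must be compressed with the same rule and the same target length so that the stored vectors and the key stay comparable.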
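[0036]
Then, as in Embodiment 1, an index is built from the resulting audio time-series feature vectors and stored, together with the vectors, in the storage device 15 of the audio storage unit 14.
[0037]
[B] Audio search phase P3 and search key audio time-series feature vector extraction phase P4
FIG. 10 is a flowchart explaining the operation of the audio search phase P3 and the search key audio time-series feature vector extraction phase P4.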
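[0038]
As in Embodiment 1, an audio signal of reference time window length to serve as the search key is input using the search key audio signal input device 22 (step S30), and the search key audio time-series feature extraction unit 18 extracts search key audio time-series features of the reference time window length from it (step S31).
[0039]
Then, as shown in FIG. 9, the search key audio time-series feature linear compression unit 31 linearly compresses the reference-window-length audio time-series features along the time axis and extracts the search key audio time-series feature vector (step S32). The search key audio time-series feature vector has the same length as the audio time-series feature vectors generated in the audio storage phase P1.
[0040]
Further, as in Embodiment 1, the feature information comparison unit 16 computes the similarity distance between the resulting search key audio time-series feature vector and the audio time-series feature vectors stored in the storage device 15 by the audio storage unit 14 (step S33), and ranks the audio time-series feature vectors in ascending order of distance (step S34).
[0041]
Finally, as in Embodiment 1, the display format generation unit 17 uses the time window extraction information stored in the storage device 15 of the audio storage unit 14 to associate each audio time-series feature vector with its source audio signal section, and outputs the result to the display device 21 (step S35). For example, when displaying the parts of a one-hour music program that match the search key, the results are shown in a display format that lists, in descending rank order, information indicating each audio signal section, such as how many minutes and seconds from the start of the program it begins, together with a playback button for that part, so that pressing the playback button outputs the audio of that part.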
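[0042]
[Embodiment 3]
FIGS. 11 and 12 are configuration diagrams for explaining Embodiment 3 of the present invention. In Embodiment 3, rather than inputting the search key from outside, once a search has been performed and results obtained, a new search key can be designated from among the results to search for audio data similar to it.
[0043]
Multiple search results obtained with audio data as the search key are displayed on screen as a list in order of similarity, and audio data obtained as a search result is used as a new search key to search for other similar audio data.
[0044]
In music search, for example, touching the "song title" part of the results screen selects that song as a search result, and it is played back and output as audio from the part corresponding to the search key or from the beginning of the song. Touching the "rank" part of the screen runs a further search for similar songs, using that song's search result (for example, the part of the song, 3 to 4 seconds long, that is similar to the audio data of the original search key) as a new search key.
[0045]
As described above, in Embodiment 3 the result closest to the search purpose is selected from the search results as the search key and the search is run again. The audio storage phase P1 of the audio signal search apparatus 10 shown in FIG. 11 is the same as that of Embodiment 1 shown in FIG. 1, and the audio storage phase P1 of the audio signal search apparatus 10 shown in FIG. 12 is the same as that of Embodiment 2 shown in FIG. 7. The audio search phase P3 differs from Embodiments 1 and 2, as follows.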
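[0046]
[A] Audio search phase P3
FIG. 13 is a flowchart explaining the operation of the audio search phase P3.
[0047]
As a preliminary step, a search is performed as in Embodiments 1 and 2, and the results are ranked by similarity and displayed on the display device 21 (step S40). If, according to the user's instruction, the displayed results sufficiently match the search purpose, the search ends (step S41).
[0048]
If they do not, the search condition input unit 23 has the user designate, from among the ranked results already displayed, the result closest to the search purpose, and selects it as the new search key (step S42). For this purpose, when the ranked results are displayed on the display device 21 in step S40, two input buttons are displayed per result: one to designate the result as the search key, and one to play back the audio of that result. The user designates a search key from the results by pressing the former button on the display device 21.
[0049]
Once a search key is designated, the feature information comparison unit 16 searches the stored audio sections for those highly similar to the search key and ranks them in descending order of similarity (step S43). The display format generation unit 17 then generates a display format from which a search key can be selected and displays it on the display device 21 (step S40).
[0050]
If the displayed results sufficiently match the search purpose, the search ends (step S41). If they do not, a search key is selected again and the search is repeated until results that sufficiently match the search purpose are obtained (steps S40 to S43).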
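The loop over steps S40 to S43 can be sketched as a function that keeps searching until the user stops picking a new key. The callbacks `search` and `choose_next_key` are hypothetical stand-ins for the feature comparison unit and the on-screen button, and the `max_rounds` safety limit is an assumption that does not appear in the text.

```python
from typing import Callable, List, Optional, TypeVar

Key = TypeVar("Key")


def refine_search(
    initial_key: Key,
    search: Callable[[Key], List[Key]],
    choose_next_key: Callable[[List[Key]], Optional[Key]],
    max_rounds: int = 10,
) -> List[Key]:
    """Repeat search and key reselection until the user is satisfied."""
    key = initial_key
    results: List[Key] = []
    for _ in range(max_rounds):
        results = search(key)                 # S43, then display (S40)
        next_key = choose_next_key(results)   # S41: None means satisfied
        if next_key is None:
            break
        key = next_key                        # S42: result becomes the new key
    return results
```

In this sketch a search result is itself usable as a key, mirroring the way a retrieved audio section is fed back in as the next search key.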
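[0051]
[Embodiment 4]
In Embodiments 1 to 3 above, partial audio time-series features of different lengths are extracted from the search target audio signal with time windows of several lengths, linearly stretched or compressed to the reference time window length, and stored by the audio storage unit 14 as the search target audio time-series feature vectors.
[0052]
In Embodiment 4, the linear stretching or compression of the partial audio time-series features cut out with the time windows is applied not to the search target but to the input search key. That is, in Embodiment 4, partial audio time-series features are extracted from the search target audio signal using only the reference time window, rather than time windows of several lengths, and are stored in the audio storage unit 14 as the search target audio time-series feature vectors. For the audio signal input as the search key, on the other hand, partial audio time-series features of different lengths are extracted with time windows of several lengths and linearly stretched or compressed to the reference time window length.
[0053]
Applying the linear stretching or compression to the search key yields search results similar to those of Embodiments 1 to 3. After matching to the reference time window length, the features may also be linearly compressed to a given length as in Embodiment 2, where necessary, to reduce the amount of data compared at search time.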
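[0054]
[Experimental results]
To confirm the effectiveness of the present invention, two experiments were conducted on Embodiment 2 using singing-voice data created for the experiments.
[0055]
The first experiment confirmed that matching that handles linear stretching, as in the present invention, is more effective for singing-voice search than matching that does not. The second experiment confirmed that the present invention is sufficiently effective for singing-voice search even compared with a conventional system that implements matching for nonlinear warping.
[0056]
[A] Experimental conditions common to both experiments
Fifteen subjects were divided into three groups (female, male, and mixed), and each group was given a list of 62 song titles. For each song on the list, a subject who knew it sang part of a phrase (about 10 seconds), and the resulting 62 × 3 singing voices (about 30 minutes in total) were stored in the database.
[0057]
Twelve songs were chosen arbitrarily from the singing voices of one subject group, and about one phrase (reference time window length: 150 frames, about 4 seconds) was taken from each as a search key. The same phrase in the singing voices of the other two groups was then searched for as the relevant result.
[0058]
In this experiment, singing voices are searched in audio data in wave file format with a sampling frequency of 44100 Hz, 16-bit quantization, and one channel. The five low-order mel-frequency cepstral coefficients are used as the audio features; these are arranged along the time axis, and time-series features are extracted from them with time windows.
[0059]
The mean of the average search lengths is used as the evaluation measure for the search results. It represents the effort of finding, among the search results, those relevant to the search purpose. Relevance was judged only for the top 20 ranked results; relevant results ranked lower were treated as not retrieved.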
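[0060]
[B] Explanation of the mean of the average search lengths
The mean of the average search lengths used as the evaluation measure is explained here. The average search length is described in Reference 5 below.
[Reference 5] Takenobu Tokunaga: "Language and Computation 5: Information Retrieval and Language Processing", University of Tokyo Press, 1999.
The average search length is a measure for evaluating a set of ranked search results. When ranked results are returned, the searcher in practice has to judge their relevance one by one from the top. The average search length takes this judging process into account and measures the user's effort, namely how many results must be judged to obtain the required number of relevant results.
[0061]
Suppose, for example, that the search results can be divided into ordered sets S₁, S₂, S₃ as follows, where the sets are ordered S₁, S₂, S₃ and ○ and × denote relevant and non-relevant results respectively.
[0062]
S₁: {○, ×, ×, ×}
S₂: {○, ○, ○, ○, ×, ×}
S₃: {○, ○, ×, ×}
Suppose the searcher wants one relevant result. Set S₁ is examined first. Since there is no order within this set, the expected number of results that must be examined before a relevant one is found is
1 × 1/4 + 2 × 1/4 + 3 × 1/4 + 4 × 1/4 = 2.5
[0063]
This shows that, to find one relevant result, the searcher must judge the relevance of 2.5 results on average. In other words, the average search length needed to find one relevant result in these search results is 2.5.
[0064]
To find two relevant results, all of set S₁ is examined and then one relevant result is found in set S₂, so the expected number of results to examine is
(4 + 1) × 4/6 + (4 + 2) × 4/15 + (4 + 3) × 1/15 = 5.4
In other words, the average search length needed to find two relevant results is 5.4.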
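The two expected values above can be reproduced with a short calculation. The sketch relies on a standard identity rather than on anything stated in the text: in a random ordering of n items of which r are relevant, the expected position of the s-th relevant item is s(n + 1)/(r + 1). The function name and the (relevant, non-relevant) tuple layout are assumptions for illustration.

```python
from fractions import Fraction
from typing import Sequence, Tuple


def average_search_length(
    sets: Sequence[Tuple[int, int]], needed: int
) -> Fraction:
    """Expected number of results examined to collect `needed` relevant ones.

    `sets` lists (relevant, non_relevant) counts for each ordered set;
    items inside a set are assumed to be in random order.
    """
    examined = 0
    for relevant, non_relevant in sets:
        size = relevant + non_relevant
        if needed <= relevant:
            # Expected position of the needed-th relevant item in this set.
            return examined + Fraction(needed * (size + 1), relevant + 1)
        needed -= relevant
        examined += size  # the whole set has to be examined
    raise ValueError("not enough relevant results")


example = [(1, 3), (4, 2), (2, 2)]  # S1, S2, S3 from the text
one = average_search_length(example, 1)   # Fraction(5, 2), i.e. 2.5
two = average_search_length(example, 2)   # Fraction(27, 5), i.e. 5.4
```

Exact fractions are used so that the results can be compared with the hand calculation without rounding error.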
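[0065]
In the example above the search results are given as ranked sets, but when the individual results are totally ordered, the average search length can still be computed by treating each set as having a single element.
[0066]
As the above shows, the average search length is not a single figure but depends on the number of relevant results required. The mean of the average search lengths is therefore computed as the average search length per required relevant result.
[0067]
Let i be the number of relevant results required, M the total number of relevant results, and x(i) the average search length needed to find i relevant results. The mean of the average search lengths x_av is then
$$x_{av} = \frac{1}{M}\sum_{i=1}^{M}\frac{x(i)}{i}$$
[0068]
For example, suppose the relevant results are retrieved at ranks 2 and 6. When one relevant result is required the average search length is 2, and when two relevant results are required the average search length is 6. The mean of these average search lengths is (2/1 + 6/2)/2 = 2.5.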
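For a fully ranked list, x(i) is simply the rank of the i-th relevant result, so the formula reduces to a few lines. The function name is an assumption; evaluating the expression (2/1 + 6/2)/2 for relevant results at ranks 2 and 6 gives (2 + 3)/2 = 2.5.

```python
from typing import Iterable


def mean_average_search_length(relevant_ranks: Iterable[int]) -> float:
    """x_av = (1/M) * sum over i of x(i)/i, with x(i) the rank of the i-th hit."""
    ranks = sorted(relevant_ranks)
    if not ranks:
        raise ValueError("at least one relevant result is required")
    return sum(rank / i for i, rank in enumerate(ranks, start=1)) / len(ranks)


value = mean_average_search_length([2, 6])  # (2/1 + 6/2) / 2
```

A perfect ranking, with the relevant results at ranks 1, 2, 3 and so on, yields exactly 1.0, which is the lower bound of the measure.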
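[0069]
[C] First experiment
As the method that does not handle linear stretching, time-series features are extracted using a time window of only one length. As the method that handles linear stretching, the method of the present invention is run with time windows of nine lengths. The means of the average search lengths of the two methods are compared.
[0070]
[D] Results of the first experiment
FIG. 14 shows the comparison of the means of the average search lengths in the first experiment. An × in the figure indicates that no relevant result could be retrieved. The mean of the average search lengths is the same for songs B, E, and H, but for all other songs the method of the present invention performs better. That is, when searching the target audio data, comparing the conventional approach of a fixed-length time window without linear stretching against the present invention with linear stretching shows that the average search length is shorter and that retrieval accuracy improves by between 25% and roughly a factor of two. The method of the present invention, which handles linear stretching, can therefore be said to be effective for singing-voice search.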
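[0071]
[E] Second experiment
As a conventional method that can handle nonlinear warping, the voice search function of "CrossMediator for Video V2.0(R1)" by Media Drive Corporation is used, and the means of the average search lengths of the conventional method and the method of the present invention are compared. The experiment also checks whether simple matching reduces the search time.
[0072]
The search time is the time from pressing the button on the display that starts the search until the search results are displayed, measured by hand 10 times and averaged. The experiment used a personal computer with an Intel Pentium 4 CPU (1.7 GHz) and 654,812 KB of main memory.
[0073]
[F] Results of the second experiment
FIG. 15 shows the comparison of the means of the average search lengths in the second experiment. An × in the figure indicates that no relevant result could be retrieved. FIG. 15 shows that the method of the present invention is slightly worse for songs C and G but equal or better in all other cases. The method of the present invention, which does not use nonlinear matching, thus retrieves as well as or better than the conventional method.
[0074]
Next, song B in FIG. 15 was used as the search key, the search was repeated 10 times, and the mean was taken as the search time for comparison. Where the conventional method took 4.29 seconds, the method of the present invention was about 2.42 seconds faster, confirming that simple matching shortens the search time. That is, comparing the conventional method using nonlinear warping with the method of the present invention using linear stretching, the method of the present invention has comparable retrieval accuracy while improving processing speed (CPU load) by about 56%.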
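The figure of about 56% is consistent with reading it as the time saved relative to the conventional search time. This reading is an assumption, since the text does not spell out the calculation:

```latex
\frac{2.42\ \text{s}}{4.29\ \text{s}} \approx 0.564,
\qquad
4.29\ \text{s} - 2.42\ \text{s} = 1.87\ \text{s}
```

Under this reading, one search with the method of the present invention takes roughly 1.87 seconds.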
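[0075]
It is clear that the effectiveness described above basically holds for Embodiments 1 and 4 as well.
[0076]
The processing of each embodiment described above can be realized by a computer and a software program. The program can be stored on a suitable recording medium such as a computer-readable portable memory medium, a semiconductor memory, or a hard disk, and read from there for execution by the computer. The program can also be downloaded from another computer via a communication line, installed, and executed.
[0077]
[Effects of the invention]
Conventional methods that use a search key audio signal to find highly similar audio signal sections in the search target audio signals compute the distance representing similarity by dynamic matching, which can handle nonlinear warping of audio signals. For cases where the audio signals undergo mainly linear stretching or compression only, the present invention computes the distance by static matching that handles only linear stretching, which reduces the amount of computation and shortens the search time compared with using dynamic matching.
[0078]
In addition, the present invention can use an audio signal section obtained as a search result as the search key. Even if the first search does not return all the audio signal sections relevant to the search purpose, a narrowed search can be performed by reselecting a relevant audio signal section from the results as the search key, which increases the likelihood of obtaining audio signals that match the search purpose.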
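[Brief description of the drawings]
[FIG. 1] A configuration diagram for explaining Embodiment 1 of the present invention.
[FIG. 2] A flowchart explaining the operation of the audio storage phase and the audio time-series feature vector extraction phase.
[FIG. 3] A diagram explaining the processing of the time window extraction unit.
[FIG. 4] A diagram explaining an example of how time windows are cut out.
[FIG. 5] A diagram explaining the method of linearly stretching or compressing partial audio time-series features.
[FIG. 6] A flowchart explaining the operation of the audio search phase and the search key audio time-series feature vector extraction phase.
[FIG. 7] A configuration diagram for explaining Embodiment 2 of the present invention.
[FIG. 8] A flowchart explaining the operation of the audio storage phase and the audio time-series feature vector extraction phase.
[FIG. 9] A diagram explaining the method of linearly compressing partial audio time-series features of reference time window length along the time axis.
[FIG. 10] A flowchart explaining the operation of the audio search phase and the search key audio time-series feature vector extraction phase.
[FIG. 11] A configuration diagram for explaining Embodiment 3 of the present invention.
[FIG. 12] A configuration diagram for explaining Embodiment 3 of the present invention.
[FIG. 13] A flowchart explaining the operation of the audio search phase.
[FIG. 14] A diagram showing the results of the first experiment.
[FIG. 15] A diagram showing the results of the second experiment.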
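[Explanation of symbols]
P1 Audio storage phase
P2 Audio time-series feature vector extraction phase
P3 Audio search phase
P4 Search key audio time-series feature vector extraction phase
10 Audio signal search apparatus
11 Audio time-series feature extraction unit
12 Time window extraction unit
13 Partial audio time-series feature fixed-length conversion unit
14 Audio storage unit
15 Storage device
16 Feature information comparison unit
17 Display format generation unit
18 Search key audio time-series feature extraction unit
20 Search target audio signal input device
21 Display device
22 Search key audio signal input device
23 Search condition input device
30 Audio time-series feature linear compression unit
31 Search key audio time-series feature linear compression unit
40 Search condition input unit
[0001]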
BACKGROUND OF THE INVENTION
The present invention relates to audio search systems, and in particular to an audio signal search method, an audio signal search apparatus, a program thereof, and a recording medium for the program, for searching an audio database for audio signal sections similar to an input or designated audio signal.
[0002]
[Prior art]
A conventional technique for searching an audio database for audio signals similar to an input audio signal is described in Reference 1 below.
[Reference 1] Takashi Endo, Masayuki Nakazawa, Hironobu Takahashi, Ryuichi Oka: "Data Representation and Spotting Mutual Search by Self-Organizing Network of Voice and Video", 1998 Annual Conference of the Japanese Society for Artificial Intelligence (12th), S5-04, pp. 122-125.
This method uses an IPM (Incremental Path Method) network to compute the similarity between audio signals by dynamic matching similar to DP matching, and retrieves from the audio database audio signals similar to the input audio signal. Because it uses dynamic matching, the method can compute similarity even for audio signals that are stretched or compressed nonlinearly along the time axis.
[0003]
[Problems to be solved by the invention]
Searching audio signals with an audio signal requires computing the similarity between audio signals. In general, an audio signal stretches and compresses nonlinearly along the time axis even when the same character string is uttered. To handle this nonlinear time warping, similarity computation has therefore required dynamic matching such as DP matching or hidden Markov models (HMMs).
[0004]
However, not all audio signals warp nonlinearly along the time axis. One example is the singing voice. A singing voice has an underlying tempo, so within a short section the tempo of the whole section may be slower or faster, but the tempo does not fluctuate within the section. In other words, over a short section a singing voice may stretch or compress linearly as a whole, but it does not warp nonlinearly within the section.
[0005]
Another example is an announcer's speech. Announcers are very skilled at uttering the same words repeatedly in the same manner, so the same words do not warp nonlinearly. Furthermore, even for an ordinary speaker, an utterance as short as a single word may stretch or compress linearly as a whole but can be regarded as having almost no nonlinear warping within the section.
[0006]
The prior art uses dynamic matching to compute the similarity between audio signals. Consequently, dynamic matching, which can also handle nonlinear warping, is applied even to audio that undergoes only linear stretching or compression. Compared with static matching, which computes a distance such as the Euclidean or Manhattan distance between fixed-length vectors, dynamic matching requires more computation and therefore lengthens the search time.
[0007]
[Means for Solving the Problems]
To solve the above problem, the present invention enables features cut out with time windows to be compared at high speed by static matching rather than dynamic matching when searching audio data. To this end, the audio time-series feature data are linearly stretched or compressed into fixed-length data of a given time-section length. The invention is characterized in that audio data similar to the search key can then be retrieved at high speed by simple matching between fixed-length data.
[0008]
Specifically, the audio signal search apparatus of the present invention comprises audio time-series feature extraction means, time window extraction means, partial audio time-series feature fixed-length conversion means, search key audio time-series feature extraction means, audio storage means, search condition input means, feature information comparison means, and display format generation means. Audio time-series feature linear compression means and search key audio time-series feature linear compression means may additionally be provided.
[0009]
The audio time-series feature extraction means extracts audio time-series features from an audio signal, which is a time-series signal.
[0010]
The time window extraction means first prepares a reference time window, that is, a time window of reference length. It then cuts out partial audio time-series features of the reference time window length from the audio time-series features extracted by the audio time-series feature extraction means, shifting the reference time window little by little. Next, it prepares time windows of several different lengths centered on the reference time window length, and cuts out partial audio time-series features with these windows in the same way.
[0011]
The partial audio time-series feature fixed-length conversion means linearly stretches or compresses the partial audio time-series features cut out with the time windows of several lengths so that they match the reference time window length, and extracts each result as an audio time-series feature vector.
[0012]
The audio time-series feature linear compression means linearly compresses the reference-window-length audio time-series features generated by the partial audio time-series feature fixed-length conversion means to a given length, and extracts audio time-series feature vectors.
[0013]
The search key audio time-series feature extraction means extracts search key audio time-series features of the reference time window length from an audio signal of reference time window length input as the search key, and extracts them as the search key audio time-series feature vector.
[0014]
The search key audio time-series feature linear compression means linearly compresses the reference-window-length audio time-series features extracted by the search key audio time-series feature extraction means to the same length as the audio time-series feature vectors, and extracts the search key audio time-series feature vector.
[0015]
The audio storage means stores the input audio signal to be searched. It also creates an index from the audio time-series feature vectors extracted by the audio time-series feature linear compression means and stores it together with the vectors, and it stores time window extraction information, which associates each audio time-series feature vector with the audio signal section from which it was extracted.
[0016]
The search condition input means inputs conditions for designating the search key when an audio signal section stored in the audio storage means is used as the search key.
[0017]
The feature information comparison means computes, by static matching, the similarity distance between the search key audio time-series feature vector extracted by the search key audio time-series feature linear compression means and the audio time-series feature vectors stored in the audio storage means, and sets the similarity accordingly. This makes similarity computation faster than with dynamic matching. The means then outputs the audio time-series feature vectors in descending order of similarity to the search key audio time-series feature vector.
[0018]
The display format generation means associates the ranked audio time-series feature vectors output from the feature information comparison means with their source audio signal sections, using the time window extraction information stored in the audio storage means, and outputs them to a display device.
Furthermore, the search condition input means has means for letting the user input designation information that selects one of the search results output by the display format generation means as a search key, and for performing a re-search, by the above similarity computation, using the audio data obtained from the designated search result as the search key. The display format generation means has means for outputting that search result.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
[Embodiment 1]
FIG. 1 is a configuration diagram for explaining Embodiment 1 of the present invention. In the first embodiment, an externally input voice is used as a search key.
[0020]
The operation of the present invention consists of an audio storage phase P1, an audio time-series feature vector extraction phase P2 called from P1, an audio search phase P3, and a search key audio time-series feature vector extraction phase P4 called from P3. The operation of each phase is described below.
[0021]
[A] Audio storage phase P1 and audio time-series feature vector extraction phase P2
FIG. 2 is a flowchart explaining the operation of the audio storage phase P1 and the audio time-series feature vector extraction phase P2.
[0022]
First, the audio signal to be searched is input from the search target audio signal input device 20, and the audio storage unit 14 stores this audio signal in the storage device 15 (step S1).
[0023]
Next, the audio time-series feature extraction unit 11 extracts audio time-series features from the input audio signal (step S2). As the audio time-series features, one can use, for example, multidimensional vectors representing the low-order mel-frequency cepstral coefficients, their first-order and second-order differences, the audio power, or the per-band audio power obtained by filter bank analysis, arranged in time order. Examples of audio time-series features are described in Reference 2 below.
[Reference 2] Kiyohiro Shikano et al.: "IT Text: Speech Recognition System", Ohmsha, 2001.
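As an illustration of the first-order and second-order differences mentioned above, the sketch below appends them to each frame of base coefficients. Computing the differences as plain frame-to-frame subtraction with a zero first frame is an assumption, as are the function names; the text does not say how the differences are obtained.

```python
from typing import List, Sequence


def first_difference(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """d[t] = x[t] - x[t-1] per dimension, with d[0] set to zeros."""
    if not frames:
        return []
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def with_differences(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Append first-order and second-order differences to every frame."""
    d1 = first_difference(frames)
    d2 = first_difference(d1)
    return [list(x) + a + b for x, a, b in zip(frames, d1, d2)]
```

Each output frame is three times as wide as the input frame, so five base coefficients would give a 15-dimensional frame.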
Next, as shown in FIG. 3, the time window extraction unit 12 first sets a time window of reference length as the reference time window, and cuts out partial audio time-series features of the reference time window length while shifting the reference time window little by little (step S3).
[0024]
As shown in FIG. 4, time windows of several lengths centered on the reference time window length are then set and, as with the reference time window, partial audio time-series features of each window length are cut out while shifting the window little by little (step S4). In the experiments described later, the reference time window is 150 frames, with each frame about 26 milliseconds long. The lower limit of the time window length was 118 frames and the upper limit was 182 frames. Once the partial audio time-series features have been cut out with these windows, the audio storage unit 14 stores information about the windowing in the storage device 15 as time window extraction information.
[0025]
Then, as shown in FIG. 5, the partial audio time-series feature fixed-length conversion unit 13 linearly stretches or compresses, along the time axis, each partial audio time-series feature cut out with the windows of several lengths so that it matches the reference time window length (step S5), and takes the resulting reference-window-length partial audio time-series feature as an audio time-series feature vector (step S6).
[0026]
An index is then built from the resulting audio time-series feature vectors and stored, together with the vectors, in the storage device 15 of the audio storage unit 14 (step S7). As the index structure for multidimensional vectors, for example, the SR-tree described in Reference 3 or the A-tree described in Reference 4 can be used.
[Reference 3] Norio Katayama and Shin'ichi Satoh: "The SR-Tree: An Index Structure for High-Dimensional Nearest Neighbor Queries", In Proc. ACM SIGMOD International Conference on Management of Data, pp. 368-380, May 1997.
[Reference 4] Yasushi Sakurai, Masatoshi Yoshikawa, Shunsuke Uemura, and Haruhiko Kojima: “The A-Tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation”, In Proc. Of the 26th International Conference on Very Large Data Bases (VLDB), pp.516-526, Cairo, September 2000.
[B] Audio search phase P3 and search key audio time-series feature vector extraction phase P4
FIG. 6 is a flowchart explaining the operation of the audio search phase P3 and the search key audio time-series feature vector extraction phase P4.
[0027]
First, an audio signal of reference time window length to serve as the search key is input using the search key audio signal input device 22 (step S10).
[0028]
Next, the search key audio time-series feature extraction unit 18 extracts search key audio time-series features of the reference time window length from the input audio signal of reference time window length (step S11). The same features as the audio time-series features of the audio storage phase P1 described above are used. These search key audio time-series features are taken as the search key audio time-series feature vector (step S12).
[0029]
The feature information comparison unit 16 then computes the similarity distance between the resulting search key audio time-series feature vector and the audio time-series feature vectors stored in the storage device 15 by the audio storage unit 14 (step S13). The distance is computed by static matching, such as the Euclidean or Manhattan distance between fixed-length vectors, using the index stored in the storage device 15 of the audio storage unit 14. This makes the distance computation faster than with dynamic matching. The audio signal sections of the audio time-series feature vectors are then ranked in ascending order of distance (step S14).
[0030]
Finally, the display format generation unit 17 uses the time window extraction information stored in the storage device 15 by the audio storage unit 14 to associate each audio time-series feature vector with its source audio signal section, and outputs the result to the display device 21 (step S15). For example, when displaying the parts of a one-hour music program that match the search key, the display lists, in descending rank order, information indicating each audio signal section, such as how many minutes and seconds from the start of the program it begins, together with a playback button for that part. Pressing the playback button outputs the audio of that part. This saves the searcher the effort of locating the audio signal sections that suit the search purpose. If the database holds information such as song titles for the search targets, the titles of the retrieved songs can be displayed as well.
[0031]
[Embodiment 2]
FIG. 7 is a configuration diagram for explaining Embodiment 2 of the present invention. Embodiment 2 compresses the features cut out with the time windows to reduce their size, so that a smaller database can be built and searched.
[0032]
In Embodiment 2, the partial audio time-series features that have been linearly stretched or compressed to the reference time window length are further linearly compressed along the time axis and used as the audio time-series feature vectors. Correspondingly, the search key audio time-series features of reference time window length are linearly compressed along the time axis to the same length as the audio time-series feature vectors and used as the search key audio time-series feature vector. Compared with Embodiment 1, therefore, the audio time-series feature vector extraction phase P2 called from the audio storage phase P1 and the search key audio time-series feature vector extraction phase P4 called from the audio search phase P3 differ, as follows.
[0033]
[A] Speech accumulation phase P1 and speech time series feature vector extraction phase P2
FIG. 8 is a flowchart for explaining the operations of the speech accumulation phase P1 and the speech time series feature vector extraction phase P2.
[0034]
As in the first embodiment, an audio signal is input from the search target audio signal input device 20 (step S20). The voice time-series feature quantity extraction unit 11 extracts a voice time-series feature quantity from the input audio signal (step S21), and the time window cutout unit 12 cuts out partial voice time-series feature quantities while shifting a time window of the reference time window length little by little (step S22). Partial voice time-series feature quantities are likewise cut out using time windows of a plurality of other lengths (step S23), and the partial speech time-series feature fixed length unit 13 linearly expands or contracts these partial voice time-series feature quantities of different lengths on the time axis to generate partial voice time-series feature quantities aligned with the reference time window length (step S24).
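Steps S22 to S24 can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names `linear_rescale` and `cut_and_align`, the use of linear interpolation between neighboring frames, and the `shift` parameter are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # one feature vector per frame (hypothetical representation)


def linear_rescale(segment: Sequence[Frame], target_len: int) -> List[Frame]:
    """Resample a frame sequence to target_len frames by linear interpolation."""
    n = len(segment)
    if n == 0 or target_len <= 0:
        raise ValueError("segment must be non-empty and target_len positive")
    if n == 1 or target_len == 1:
        return [list(segment[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(segment[lo], segment[hi])])
    return out


def cut_and_align(
    features: Sequence[Frame], ref_len: int, window_lens: Sequence[int], shift: int
) -> List[Tuple[int, int, List[Frame]]]:
    """Slide every window length over the features and align each cut to ref_len.

    Returns (start_frame, window_length, aligned_frames) so that a result can
    later be mapped back to the audio section it came from.
    """
    aligned = []
    for win in window_lens:
        for start in range(0, len(features) - win + 1, shift):
            segment = features[start:start + win]
            aligned.append((start, win, linear_rescale(segment, ref_len)))
    return aligned
```

Keeping the start frame and window length next to each aligned segment corresponds to the time window cutout information that the later display step uses to locate the source section.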
[0035]
Next, as shown in FIG. 9, the speech time-series feature amount linear compression unit 30 linearly compresses the partial speech time-series feature quantities aligned with the length of the reference time window in the time axis direction to extract the speech time-series feature vectors (step S25). This reduces the number of dimensions of each speech time-series feature vector, which reduces both the amount of similarity calculation and the required capacity of the storage device 15.
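As a rough illustration of step S25, the sketch below compresses a sequence of frames along the time axis. FIG. 9 is not reproduced here, so averaging over equal-sized groups of consecutive frames is an assumption, as are the names `linear_compress` and `flatten` and the output length chosen by the caller.

```python
from typing import List, Sequence

Frame = List[float]


def linear_compress(segment: Sequence[Frame], out_len: int) -> List[Frame]:
    """Compress a frame sequence to out_len frames by averaging consecutive blocks.

    Assumption: each output frame is the mean of one block of input frames.
    """
    n = len(segment)
    if not 0 < out_len <= n:
        raise ValueError("out_len must be between 1 and the number of frames")
    dims = len(segment[0])
    out = []
    for k in range(out_len):
        start = k * n // out_len
        end = (k + 1) * n // out_len  # end > start because out_len <= n
        block = segment[start:end]
        out.append([sum(f[d] for f in block) / len(block) for d in range(dims)])
    return out


def flatten(frames: Sequence[Frame]) -> List[float]:
    """Concatenate the frames into one fixed-length feature vector."""
    return [value for frame in frames for value in frame]
```

For instance, compressing 150 frames of 5 coefficients to a hypothetical 30 frames turns a 750-dimensional vector into a 150-dimensional one, which is where the savings in distance computation and storage come from.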
[0036]
Further, in the same manner as in the first embodiment, an index is constructed from the obtained speech time series feature vector, and is stored in the storage device 15 of the speech storage unit 14 together with the speech time series feature vector.
[0037]
[B] Speech retrieval phase P3 and retrieval key speech time-series feature vector extraction phase P4
FIG. 10 is a flowchart for explaining the operations of the speech search phase P3 and the search key speech time-series feature vector extraction phase P4.
[0038]
As in the first embodiment, the search key voice signal input device 22 is used to input a voice signal having a reference time window length as a search key (step S30). A search key voice time-series feature quantity having a reference time window length is extracted from the input voice signal having the reference time window length (step S31).
[0039]
Then, as shown in FIG. 9, the search key speech time-series feature amount linear compression unit 31 linearly compresses the search key speech time-series feature quantity of the reference time window length in the time axis direction to extract the search key speech time-series feature vector (step S32). The length of the search key speech time-series feature vector is the same as that of the speech time-series feature vectors generated in the speech accumulation phase P1.
[0040]
Further, as in the first embodiment, the feature amount information comparison unit 16 calculates the distance between the obtained search key speech time-series feature vector and each speech time-series feature vector stored in the storage device 15 by the speech storage unit 14 (step S33), and ranks the speech time-series feature vectors in ascending order of distance (step S34).
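Steps S33 and S34 amount to static matching between fixed-length vectors. The sketch below uses Euclidean distance, one of the static distances named in the description, and a plain linear scan rather than the index; the names `euclidean` and `rank_by_distance` and the default cutoff of 20 results are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(
    query: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],
    top_k: int = 20,
) -> List[Tuple[float, str]]:
    """Return (distance, label) pairs sorted by ascending distance to the query."""
    scored = [(euclidean(query, vec), label) for label, vec in stored]
    scored.sort(key=lambda pair: pair[0])
    return scored[:top_k]
```

Because every stored vector has the same length as the search key vector, each comparison is a single pass over the vector elements, with no alignment path to compute as in dynamic matching.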
[0041]
Finally, as in the first embodiment, the display format generation unit 17 uses the time window cutout information stored in the storage device 15 of the audio storage unit 14 to associate each audio time-series feature vector with the audio signal section from which it was extracted, and outputs the result to the display device 21 (step S35). For example, when displaying the parts of a one-hour music program that correspond to the search key, the results are shown in a display format that lists, in descending order of rank, the audio signal sections (how many minutes and seconds from the start of the program) together with a playback button for each section, so that pressing a playback button outputs the audio for that section.
[0042]
[Embodiment 3]
FIGS. 11 and 12 are configuration diagrams for explaining the third embodiment of the present invention. In the third embodiment, instead of inputting a search key from the outside, once a search has been performed and search results have been obtained, a new search key is designated from among those results, and voice data similar to that search key can be searched for.
[0043]
A plurality of search results obtained using voice data as a search key are displayed on the screen as a list in order of similarity, and other similar voice data is then searched for using voice data obtained as a search result as a new search key.
[0044]
For example, in the case of music search, when the "music title" portion of the search result display screen is touched, that piece of music is selected as the search result and is played back, either from the portion corresponding to the search key or from the beginning of the piece. When the "rank" portion on the screen is touched, that search result (for example, the part of the piece that is similar to the 3 to 4 seconds of voice data of the original search key) is used as a new search key, and a further search for similar music is performed.
[0045]
As described above, in the third embodiment, a search key is selected again from among the search results and the search is repeated, so that the user can select the result closest to the search purpose. The speech accumulation phase P1 of the speech signal retrieval apparatus 10 shown in FIG. 11 is the same as the speech accumulation phase P1 of the first embodiment shown in FIG. 1, and the speech accumulation phase P1 of the speech signal retrieval apparatus 10 shown in FIG. 12 is the same as the speech accumulation phase P1 of the second embodiment shown in FIG. 7. The voice search phase P3 differs from that of the first and second embodiments and is described below.
[0046]
[A] Voice search phase P3
FIG. 13 is a flowchart for explaining the operation of the voice search phase P3.
[0047]
As a preliminary step, a search is performed in the same manner as in the first and second embodiments, and the search results are ranked in order of similarity and displayed on the display device 21 (step S40). If, according to an instruction from the user, the displayed search results sufficiently match the search purpose, the search is terminated (step S41).
[0048]
If the search purpose is not met, the search condition input unit 23 has the user designate one of the already displayed search results as the search key, namely the result closest to the search purpose (step S42). For this purpose, when the results ranked in step S40 are displayed on the display device 21, two input buttons are displayed for each result: one is pressed to designate the result as a search key, and the other is pressed to play back the audio of the result. The user can designate a search key from among the search results by pressing the former button on the display device 21.
[0049]
When the search key is designated, the feature amount information comparison unit 16 searches the stored speech sections for those having high similarity to the search key and ranks them in descending order of similarity (step S43). The display format generation unit 17 then generates a display format in which a search key can again be selected from the search results, and displays it on the display device 21 (step S40).
[0050]
If the displayed search result sufficiently matches the search purpose, the search is terminated (step S41). If it still does not match sufficiently, the search key is selected again, the search is performed again, and the process is repeated until a result that sufficiently matches the search purpose is obtained (steps S40 to S43).
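The loop over steps S40 to S43 can be sketched as follows. This is only an outline under stated assumptions: `search`, `is_satisfied`, and `choose_new_key` are hypothetical callbacks standing in for the comparison unit, the user's judgment, and the user's button press, and `max_rounds` is an added safety limit that the description does not mention.

```python
from typing import Callable, List, Tuple, TypeVar

Key = TypeVar("Key")
Result = TypeVar("Result")


def interactive_search(
    initial_key: Key,
    search: Callable[[Key], List[Result]],
    is_satisfied: Callable[[List[Result]], bool],
    choose_new_key: Callable[[List[Result]], Key],
    max_rounds: int = 10,
) -> Tuple[Key, List[Result]]:
    """Repeat search with a key re-selected from the results until satisfied."""
    key = initial_key
    results = search(key)              # ranked results are displayed (S40)
    for _ in range(max_rounds):
        if is_satisfied(results):      # user judges the results (S41)
            break
        key = choose_new_key(results)  # user designates a result as new key (S42)
        results = search(key)          # search again and rank (S43), back to S40
    return key, results
```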
[0051]
[Embodiment 4]
In the first to third embodiments above, partial speech time-series feature quantities of different lengths are extracted from the speech signal to be searched using time windows of a plurality of lengths, and these partial speech time-series feature quantities are linearly expanded or contracted to the length of the reference time window and stored in the speech accumulating unit 14 as the speech time-series feature vectors to be searched.
[0052]
In the fourth embodiment, the linear expansion and contraction of the partial speech time-series feature quantities cut out by the time windows is performed not on the search target but on the input search key. That is, partial speech time-series feature quantities are not extracted from the search target speech signal using time windows of a plurality of lengths; instead, they are extracted using only the reference time window and stored in the speech storage unit 14 as the speech time-series feature vectors to be searched. For the speech signal input as a search key, on the other hand, partial speech time-series feature quantities of different lengths are extracted using time windows of different lengths, and these are linearly expanded or contracted to the length of the reference time window.
[0053]
Even when the linear expansion and contraction is applied to the speech input as a search key in this way, the same search results as in the first to third embodiments can be obtained. In addition, after alignment with the length of the reference time window, the amount of data to be collated at search time can be reduced, as necessary, by linear compression to a fixed length as in the second embodiment.
[0054]
〔Experimental result〕
In order to confirm the effectiveness of the present invention, two types of experiments were performed on the singing voice data created for the experiment in the second embodiment.
[0055]
In the first experiment, it was confirmed that matching that handles linear expansion and contraction according to the present invention is more effective for singing voice search than matching that does not. In the second experiment, it was confirmed that the present invention is sufficiently effective for singing voice search even when compared with a conventional method that implements matching capable of handling nonlinear expansion and contraction.
[0056]
[A] Experimental conditions common to both experiments
Fifteen subjects are divided into three groups, female, male and mixed, and a list of 62 song names is given to each group. A person who knows the song in the song name list sings a part of the phrase (about 10 seconds) and stores 62 × 3 singing voices (about 30 minutes in total) in the database.
[0057]
Twelve songs are arbitrarily selected from the singing voices of one subject group, and about one phrase of each (reference time window length: 150 frames, about 4 seconds) is taken out as a search key. The same phrase portions of the singing voices of the subjects in the other two groups are then searched for as matching results.
[0058]
In this experiment, singing voices are searched for in audio data in wave file format with a sampling frequency of 44100 Hz, 16-bit quantization, and one channel. The five lowest-order terms of the mel-frequency cepstrum coefficients are used as the speech feature; they are arranged along the time axis, and time-series feature quantities are cut out from them using time windows.
[0059]
The average of the average search lengths is used as the evaluation criterion for the search results. This criterion represents the effort required to find results that suit the search purpose among the search results. It is assumed that the searcher judges the suitability of only the top 20 ranked results: ranking stops at the 20th result, and a matching result ranked below that is treated as not retrieved.
[0060]
[B] Explanation of average search length
Here, the average of the average search length used as the evaluation criterion of the search result will be described. The average search length is described in Reference 5 below.
[Reference 5] Takenobu Tokunaga: “Language and Calculation 5 Information Retrieval and Language Processing”, University of Tokyo Press, 1999.
The average search length is a measure for evaluating a ranked set returned as a search result. When ranked results are returned, the searcher must actually judge the suitability of the results one by one, starting from the top. The average search length takes this judgment process into account: it measures the searcher's effort, that is, how many results the searcher must judge in order to obtain the required number of matching results.
[0061]
For example, suppose the search results can be divided into the ordered sets S₁, S₂, S₃. The order between the sets is S₁, S₂, S₃, and ○ and × represent a matching result and a non-matching result, respectively.
[0062]
S₁: {○, ×, ×, ×}
S₂: {○, ○, ○, ○, ×, ×}
S₃: {○, ○, ×, ×}
Now suppose the searcher wants to obtain one matching result. The set S₁ is inspected first. Since the elements within this set are unordered, the expected number of results that must be examined before a matching result is found is
1 × 1/4 + 2 × 1/4 + 3 × 1/4 + 4 × 1/4 = 2.5.
[0063]
This indicates that, in order to find one match result from the search results, the searcher must determine the suitability of 2.5 search results on average. That is, the average search length required to find one matching result from this search result is 2.5.
[0064]
To find two matching results, all of the set S₁ is examined and then the set S₂ is examined until a matching result is found. The expected number of results to be examined is
(4 + 1) × 4/6 + (4 + 2) × 4/15 + (4 + 3) × 1/15 = 5.4.
That is, the average search length required to find two matching results is 5.4.
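The two expected values above can be checked with a short Python sketch. The function name `expected_search_length` and the representation of each set as a pair (number of matching results, number of non-matching results) are assumptions for illustration; the probabilities are those of drawing the elements of an unordered set in a uniformly random order.

```python
from fractions import Fraction
from math import comb
from typing import Sequence, Tuple


def expected_search_length(sets: Sequence[Tuple[int, int]], needed: int) -> Fraction:
    """Expected number of results examined to obtain `needed` matching results.

    `sets` lists (matching, non_matching) counts in ranked order; elements
    inside one set are assumed to be examined in a uniformly random order.
    """
    if needed < 1:
        raise ValueError("needed must be at least 1")
    examined = 0
    remaining = needed
    for matching, non_matching in sets:
        size = matching + non_matching
        if remaining > matching:
            # The whole set must be examined before moving to the next one.
            examined += size
            remaining -= matching
            continue
        # Probability that the remaining-th match sits at position p in this set.
        total = comb(size, matching)
        expected = sum(
            Fraction(p * comb(p - 1, remaining - 1) * comb(size - p, matching - remaining), total)
            for p in range(remaining, size + 1)
        )
        return examined + expected
    raise ValueError("the sets do not contain enough matching results")


ranked_sets = [(1, 3), (4, 2), (2, 2)]  # S1, S2, S3 from the example
one = expected_search_length(ranked_sets, 1)  # Fraction(5, 2), i.e. 2.5
two = expected_search_length(ranked_sets, 2)  # Fraction(27, 5), i.e. 5.4
```

Exact fractions are used so that the results can be compared with the hand calculation without rounding error.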
[0065]
In the above example the search results were given as ranked sets, but even when the search results are totally ordered, the average search length can be calculated by regarding each set as having exactly one element.
[0066]
As can be seen from the above, the average search length is not a single measure but a value depending on the number of necessary matching results. Therefore, as the average of the average search length, a value of the average search length per necessary matching result is calculated.
[0067]
Let i be the required number of matching results, M the total number of matching results, and x(i) the average search length needed to find i matching results. The average of the average search lengths, $x_{av}$, is then given by
$x_{av} = \frac{1}{M} \sum_{i=1}^{M} \frac{x(i)}{i}$
[0068]
For example, consider a case where matching results are retrieved at the second and sixth positions. When the required number of matching results is 1, the average search length is 2, and when the required number of matching results is 2, the average search length is 6. The average of these average search lengths is (2/1 + 6/2) / 2 = 2.5.
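For a totally ordered result list, x(i) is simply the rank of the i-th matching result, so the formula reduces to a few lines of Python. The function name `mean_average_search_length` is an assumption for illustration; evaluating the expression (2/1 + 6/2) / 2 for matches at ranks 2 and 6 gives 2.5.

```python
from fractions import Fraction
from typing import Iterable


def mean_average_search_length(match_ranks: Iterable[int]) -> Fraction:
    """Average of x(i) / i over all matching results in a totally ordered list.

    `match_ranks` holds the 1-based ranks at which matching results appear,
    so x(i) is the rank of the i-th matching result.
    """
    ranks = sorted(match_ranks)
    if not ranks:
        raise ValueError("at least one matching result is required")
    total = sum(Fraction(x, i) for i, x in enumerate(ranks, start=1))
    return total / len(ranks)


example = mean_average_search_length([2, 6])  # (2/1 + 6/2) / 2
```

A lower value means that less effort is needed per matching result; the ideal case, with all matching results ranked first, gives a value of 1.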
[0069]
[C] First experiment
As the method that does not handle linear expansion and contraction, time-series features are extracted using only one type of time window. As the method that handles linear expansion and contraction, the method according to the present invention is run with nine types of time windows. The averages of the average search lengths of the two methods are compared.
[0070]
[D] Results of the first experiment
FIG. 14 shows the result of comparing the averages of the average search lengths in the first experiment. An x in the figure indicates that no matching result could be retrieved. The average search length is the same for songs B, E, and H, but the method according to the present invention gives better results for all other songs. That is, compared with the conventional approach of searching the speech data through a fixed-length time window without linear expansion and contraction, the approach with linear expansion and contraction according to the present invention yields a shorter average search length, improving search accuracy by between 25% and about a factor of two. The method that handles linear expansion and contraction according to the present invention can therefore be said to be effective for singing voice search.
[0071]
[E] Second experiment
As a conventional method capable of handling nonlinear expansion and contraction, the voice search function of "CrossMediator for Video V2.0 (R1)" of Media Drive Co., Ltd. is used, and the averages of the average search lengths of the conventional method and the method according to the present invention are compared. It is also checked whether simple matching reduces the search time.
[0072]
The search time is manually measured 10 times from when the button for starting the search on the display unit is pressed until the search result is displayed, and the average value is used as the search time. In this experiment, a personal computer with a CPU of Pentium 4 (1.7 GHz) manufactured by Intel Corporation and a main storage capacity of 654,812 KB was used.
[0073]
[F] Results of the second experiment
FIG. 15 shows the result of comparing the averages of the average search lengths in the second experiment. An x in the figure indicates that no matching result could be retrieved. FIG. 15 shows that, although the results of the method according to the present invention are slightly worse for songs C and G, they are equivalent or better for all other songs. The method according to the present invention, which does not use nonlinear matching, thus searches at least as well as the conventional method.
[0074]
Next, using song B in FIG. 15 as the search key, the search was repeated and the average measured time was compared as the search time. The conventional method took 4.29 seconds, whereas the method according to the present invention was about 2.42 seconds faster, confirming that simple matching reduces the search time. That is, comparing the conventional method, which handles nonlinear expansion and contraction, with the method using linear expansion and contraction according to the present invention, the method according to the present invention achieved comparable search accuracy while improving the processing speed (CPU load) by about 56%.
[0075]
Obviously, the effectiveness of the present invention is basically the same in the first and fourth embodiments.
[0076]
The processing of each embodiment described above can be realized by a computer and a software program. The program can be stored on an appropriate computer-readable recording medium, such as a portable memory medium, a semiconductor memory, or a hard disk, and read from there for execution by the computer. The program can also be downloaded from another computer via a communication line, installed, and executed.
[0077]
[Effects of the invention]
The conventional method of using a search key speech signal to find highly similar speech signal sections in the speech signals to be searched uses dynamic matching, which can handle nonlinear expansion and contraction of the speech signal, for the distance calculation that represents similarity. In the present invention, when the speech signal undergoes mainly linear expansion and contraction, static matching that handles only linear expansion and contraction is used for that distance calculation, which reduces the amount of calculation and shortens the search time compared with dynamic matching.
[0078]
Furthermore, in the present invention a speech signal section obtained as a search result can be used as a search key. Even if the first search does not retrieve all the speech signal sections in the search target that suit the search purpose, a narrowed search can be performed by reselecting, from among the speech signal sections in the search results, one that suits the search purpose as the search key. This increases the likelihood of obtaining speech signals that suit the search purpose.
[Brief description of the drawings]
FIG. 1 is a configuration diagram for explaining Embodiment 1 of the present invention;
FIG. 2 is a flowchart for explaining operations in a voice accumulation phase and a voice time-series feature vector extraction phase.
FIG. 3 is a diagram illustrating processing of a time window cutout unit.
FIG. 4 is a diagram for explaining an example of a time window cutout method;
FIG. 5 is a diagram for explaining a linear expansion / contraction method for partial audio time-series feature amounts;
FIG. 6 is a flowchart for explaining operations in a voice search phase and a search key voice time-series feature vector extraction phase.
FIG. 7 is a configuration diagram for explaining a second embodiment of the present invention;
FIG. 8 is a flowchart for explaining operations in a speech accumulation phase and a speech time-series feature vector extraction phase.
FIG. 9 is a diagram illustrating a method of linearly compressing a partial speech time-series feature quantity having a reference time window length in the time axis direction.
FIG. 10 is a flowchart for explaining operations in a voice search phase and a search key voice time-series feature vector extraction phase.
FIG. 11 is a configuration diagram for explaining a third embodiment of the present invention;
FIG. 12 is a configuration diagram for explaining a third embodiment of the present invention;
FIG. 13 is a flowchart illustrating an operation in a voice search phase.
FIG. 14 is a diagram showing the results of a first experiment.
FIG. 15 is a diagram showing the results of a second experiment.
[Explanation of symbols]
P1 voice storage phase
P2 Speech time series feature vector extraction phase
P3 voice search phase
P4 Search key voice time-series feature vector extraction phase
10 Voice signal search device
11 Voice time-series feature extraction unit
12 Time window cutout unit
13 Partial speech time-series feature fixed length unit
14 Voice storage unit
15 Storage device
16 Feature information comparison part
17 Display format generator
18 Search key voice time series feature extraction unit
20 Search target audio signal input device
21 Display device
22 Search key voice signal input device
23 Search condition input device
30 Speech time-series feature linear compression unit
31 Search Key Voice Time Series Feature Quantity Linear Compression Unit
40 Search condition input part

Claims

A speech signal search method comprising:
a process of extracting partial speech time-series feature quantities of different lengths from a speech signal to be searched, using time windows of a plurality of lengths;
a process of linearly expanding or contracting the extracted partial speech time-series feature quantities of the plurality of lengths to align them with the length of a reference time window, which serves as the reference for calculating the similarity distance at search time;
a process of accumulating the partial speech time-series feature quantities aligned with the length of the reference time window as search target speech time-series feature vectors to be compared with a speech time-series feature vector obtained from a speech signal input as a search key;
a process of extracting a search key speech time-series feature vector of the length of the reference time window from a speech signal input or designated as a search key;
a process of calculating a similarity distance between the speech time-series feature vectors accumulated as the search target and the search key speech time-series feature vector, thereby calculating the similarity between the speech signal of the search key and the speech signal of each speech signal section to be searched;
a process of ranking the search results in order of similarity based on the similarity calculation results and outputting them; and
a process of having the user input designation information that designates one of the output search results as a search key, performing the search again by calculating the similarity using the voice data obtained from the designated search result as the search key, and outputting the result.

A speech signal search method comprising:
a process of extracting partial speech time-series feature quantities from a speech signal to be searched, using a reference time window whose length serves as the reference for calculating the similarity distance at search time;
a process of accumulating the extracted partial speech time-series feature quantities as search target speech time-series feature vectors to be compared with a speech time-series feature vector obtained from a speech signal input as a search key;
a process of extracting partial speech time-series feature quantities of different lengths from a speech signal input or designated as a search key, using time windows of a plurality of lengths;
a process of linearly expanding or contracting the extracted partial speech time-series feature quantities of the plurality of lengths to align them with the length of the reference time window;
a process of taking the partial speech time-series feature quantities aligned with the length of the reference time window as search key speech time-series feature vectors, calculating a similarity distance between the speech time-series feature vectors accumulated as the search target and the search key speech time-series feature vectors, and thereby calculating the similarity between the speech signal of the search key and the speech signal of each speech signal section to be searched;
a process of ranking the search results in order of similarity based on the similarity calculation results and outputting them; and
a process of having the user input designation information that designates one of the output search results as a search key, performing the search again by calculating the similarity using the voice data obtained from the designated search result as the search key, and outputting the result.

The speech signal search method according to claim 1 or 2, wherein, in the process of extracting partial speech time-series feature quantities of different lengths from the speech signal using time windows of a plurality of lengths,
a speech time-series feature quantity is extracted from the speech signal, partial speech time-series feature quantities are cut out while a reference time window of the reference length is shifted little by little, and partial speech time-series feature quantities are cut out in the same manner using time windows of a plurality of lengths centered on the length of the reference time window.

The speech signal search method according to claim 1, wherein the speech time-series feature vectors accumulated as the search target and the search key speech time-series feature vector are obtained by linearly compressing partial speech time-series feature quantities of the reference time window length to a predetermined length.

A speech signal search apparatus comprising:
means for extracting partial speech time-series feature quantities of different lengths from a speech signal to be searched, using time windows of a plurality of lengths;
means for linearly expanding or contracting the extracted partial speech time-series feature quantities of the plurality of lengths to align them with the length of a reference time window, which serves as the reference for calculating the similarity distance at search time;
means for accumulating the partial speech time-series feature quantities aligned with the length of the reference time window as search target speech time-series feature vectors to be compared with a speech time-series feature vector obtained from a speech signal input as a search key;
means for extracting a search key speech time-series feature vector of the length of the reference time window from a speech signal input or designated as a search key;
means for calculating a similarity distance between the speech time-series feature vectors accumulated as the search target and the search key speech time-series feature vector, thereby calculating the similarity between the speech signal of the search key and the speech signal of each speech signal section to be searched;
means for ranking the search results in order of similarity based on the similarity calculation results and outputting them; and
means for having the user input designation information that designates one of the output search results as a search key, performing the search again by calculating the similarity using the voice data obtained from the designated search result as the search key, and outputting the result.

A speech signal search apparatus comprising:
means for extracting partial speech time-series feature quantities from a speech signal to be searched, using a reference time window whose length serves as the reference for calculating the similarity distance at search time;
means for accumulating the extracted partial speech time-series feature quantities as search target speech time-series feature vectors to be compared with a speech time-series feature vector obtained from a speech signal input as a search key;
means for extracting partial speech time-series feature quantities of different lengths from a speech signal input or designated as a search key, using time windows of a plurality of lengths;
means for linearly expanding or contracting the extracted partial speech time-series feature quantities of the plurality of lengths to align them with the length of the reference time window;
means for taking the partial speech time-series feature quantities aligned with the length of the reference time window as search key speech time-series feature vectors, calculating a similarity distance between the speech time-series feature vectors accumulated as the search target and the search key speech time-series feature vectors, and thereby calculating the similarity between the speech signal of the search key and the speech signal of each speech signal section to be searched;
means for ranking the search results in order of similarity based on the similarity calculation results and outputting them; and
means for having the user input designation information that designates one of the output search results as a search key, performing the search again by calculating the similarity using the voice data obtained from the designated search result as the search key, and outputting the result.

An audio signal search program for causing a computer to execute the audio signal search method according to any one of claims 1 to 4.

A computer-readable recording medium for an audio signal search program, on which is recorded a program for causing a computer to execute the audio signal search method according to any one of claims 1 to 4.