JP2004145161A

JP2004145161A - Speech database registration processing method, speech generation source recognizing method, speech generation section retrieving method, speech database registration processing device, speech generation source recognizing device, speech generation section retrieving device, program therefor, and recording medium for same program

Info

Publication number: JP2004145161A
Application number: JP2002312074A
Authority: JP
Inventors: Hidenobu Osada; 長田　秀信; Naoko Kosugi; 小杉　尚子
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-10-28
Filing date: 2002-10-28
Publication date: 2004-05-20
Anticipated expiration: 2022-10-28
Also published as: JP3980988B2

Abstract

PROBLEM TO BE SOLVED: To provide a means making it possible to precisely retrieve a speaking section of a desired speaker even when video and audio include a part wherein a plurality of speakers speak at the same time.

SOLUTION: In a speaker speech registration phase, not only feature quantities of the voice of a speaker himself/herself, but also feature quantities of a voice composed of speech signals of a plurality of speakers are extracted and registered in a speech database 1. In a speaker retrieval phase, an input speech signal to be retrieved is segmented into short sections and feature quantities of the respective short sections are collated with feature quantities in the speech database 1 to recognize speakers. In a speaking section determination phase, retrieval results of speakers of the respective short sections are totalized in every fixed number of short sections and speaking sections of the speakers are found according to appearance frequencies of the speakers. In a speaker information display phase, the retrieval results of the speaking section are displayed.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は，話者認識等に用いる音声データベース登録，音声発生源認識，音声発生区間検索の技術に関し，特に，例えば放送用の番組などの撮影ならびに収録された映像（以下，映像音声という）に対し，その映像音声中の話者情報を，収録開始からの時間情報とともに，自動的に記録媒体へと記録し，その記録された話者情報をもとに，映像中における特定話者の発話した時間帯を検索するような場合に用いられる技術に関するものである。
【０００２】
【従来の技術】
映像音声中における特定話者の発話区間を検索（以下，発話区間検索という）する場合，一般に，話者の判断に検索しようとする話者の音声を事前登録したもの（以下，登録話者音声という）を用いる方法がある（例えば，非特許文献１参照）。
【０００３】
通常，登録話者音声には所望の話者が単独で発話する音声（以下，これを単独話者音声という）を３０〜１２０秒ほど用い，それから符号帳を作成する。発話区間検索の際にはこの符号帳を用い，番組の先頭から逐次音声特徴量を抽出して符号帳と照らし合わせるなどの処理により，所望の話者の発話区間を検出する。これによって，例えば，話者Ａが発話した時間は番組開始から数えてＴ_０秒からＴ_１秒まで，話者Ｂの発話した時間はＴ_２秒からＴ_３秒まで，というような結果を得ることができる。
【０００４】
一般に，テレビ番組などでは同時に発話する話者がいて，話者が必ずしも一人とは限らず，上記の例でＴ_０＜Ｔ_２＜Ｔ_１＜Ｔ_３となる場合がある。このとき，Ｔ_２〜Ｔ_１区間では複数の話者（この場合は話者Ａと話者Ｂ）が同時に発話していることになり，この部分を単独話者音声から作成された符号帳を用いて正しく検索することが難しいのが現状である。
【０００５】
例えば，前述のケース（Ｔ_０＜Ｔ_２＜Ｔ_１＜Ｔ_３となるケース）では，本来なら話者Ａの発話時間はＴ_０〜Ｔ_１であるにもかかわらず，Ｔ_２〜Ｔ_１の区間が正確に検出できず，発話区間検索による結果において，話者Ａの発話区間がＴ_０〜Ｔ_２（Ｔ_２＜Ｔ_１）と検出されるなどの誤った結果を得ることがある。また，話者Ａの会話と同時に音楽が挿入されている場合にも，話者Ａの発話区間を誤って検出することがある。
【０００６】
さらに，発話は有音部だけではない（文章の切れ目がある）ため，符号帳には無音部分の特徴が反映され，このことが原因で，微小時間単位でのベクトル同士の距離のみを用いて話者を判断すると，誤りが生じる場合がある。
【０００７】
例えば，ある区間の特徴ベクトルが，話者Ｂの符号帳における特徴ベクトルのうちの一つに最も距離が近かったとする。しかし，実際にはこの区間の話者はＡであり，たまたま検索キーとなったベクトルが，話者Ｂの符号帳を作る際に登録されていた一部の無音区間の特徴を反映した特徴ベクトルとの距離が最も近かった，というようなケースである。このような場合には，単純にある時間における映像音声の特徴ベクトルをキーとしてデータベース検索を行うだけでは不十分である。
【０００８】
このように，多くのバリエーションを持つ一般的なテレビ番組等の映像音声を対象にして従来方式に基づいて話者の発話区間検索を行う場合には，一般に，その精度が著しく低下してしまう。
【０００９】
【非特許文献１】
Ｆ．Ｋ．　Ｓｏｏｎｇ　ｅｔ　ａｌ．，”Ａ　ｖｅｃｔｏｒ　ｑｕａｎｔｉｚａｔｉｏｎ　ａｐｐｒｏａｃｈ　ｔｏ　ｓｐｅａｋｅｒ　ｒｅｃｏｇｎｉｔｉｏｎ，”Ｐｒｏｃ．ＩＣＡＳＳＰ，ｐｐ．３８７−３９０
【００１０】
【発明が解決しようとする課題】
以上のように，従来の単独話者音声のみから符号帳を作成する方法では，複数の話者が同時に発話する部分を含むような一般的な映像音声から正確に所望の話者の発話区間を得ることは難しい。また，長時間の映像中から複数の話者が同時に発話する部位を手作業で探し出して，それを登録話者音声として用いる方法も考えられるが，この方法は極めて非効率的であり実用化が困難である。
【００１１】
本発明は，このような問題点の解決を図り，映像音声中に複数話者が同時に発話する部分があっても，精度良く話者検索を行うことができるようにすることを目的とする。
【００１２】
【課題を解決するための手段】
本発明は，人間の発声に限らず，鳥，虫などの動物の鳴き声や，機械音についての音声発生源の認識，音声発生区間の検索に用いることができるが，以下の説明では，主として人間の話者認識，話者の発話区間の検索を例に説明する。
【００１３】
図１は，本発明の概要を説明するための図である。
【００１４】
通常，音声信号の特徴量を学習データとして登録する際には，図１（Ａ）に示すように，各話者の音声Ａ，Ｂの特徴を個別に音声データベース１に格納するのが普通である。検索段階では，入力音声がＡ，Ｂに対してどのくらい類似するのか，またその時間変化はどうかなどの計算を行い，最終的に入力音声が登録音声のどれに合致するのかを決定する。
【００１５】
しかし，検索対象の入力音声に，音声Ａと音声Ｂとが混じっているようなケースでは，音声データベース１に適切な学習データがないため，うまく検索結果を得ることが難しい。音声Ａ，Ｂが混ざったＡ＋Ｂというような音声が入力された場合，通常では入力音声に対する検索結果を時間的，確率的に処理して，ＡかＢか，あるいはそうでないかを判断する。したがって，精度の良い検索はできない。一方，音声Ａと音声Ｂとが混じっているものを予め学習データとして録音することは，手間がかかるし，不可能な場合がある。
【００１６】
そこで，本発明では，図１（Ｂ）に示すように，予め用意された学習音声信号を任意の組合せで合算し，その特徴も音声データベース１に再帰的に登録する。すなわち，音声Ａ，音声Ｂの特徴を音声データベース１に登録するだけでなく，仮想的にＡ＋Ｂという音声を一時的に作り，その特徴も音声データベース１に登録する。これによって，音声Ａと音声Ｂとが混じっている場合にも，音声Ａ，音声Ｂについて検索することが可能になる。
【００１７】
図２に，本発明に係る装置の構成図を示す。発話区間検索装置１０は，話者情報登録手段１１と，音声信号組合せ手段１２と，音声特徴量抽出手段１３と，特徴量格納手段１４と，話者検索手段１５と，話者検索結果処理手段１６と，発話区間情報表示手段１７とを備える。
【００１８】
ここで，テレビ番組などの映像を映像音声と呼び，映像音声を任意の時間ごとに時系列的に区切ったものの一つを短区間映像音声と呼び，時系列的に連続した短区間映像音声を複数個ずつひとまとまりにしたものの一つを中短区間映像音声と呼ぶことにする。
【００１９】
話者情報登録手段１１は，映像音声から任意の箇所を切り出して，登録話者音声の候補として利用者に提示する。例えば，番組から不特定の話者音声を自動的に一定時間，複数個切り出して利用者に提示する。利用者は，提示された複数の候補から登録話者音声として用いるものを判断し，用いると判断されたものに関しては話者名などの付加的な情報を書き加える。
【００２０】
音声信号組合せ手段１２は，利用者が登録話者音声として選択した複数個の単独話者音声の音声信号を任意の組合せで足し合わせたものを作成し，それを登録話者音声に加える。すなわち，複数の話者が同時に発話している音声や，音声のバックに音楽が流れているような音声を仮想的に生成し，登録話者音声に加える。
【００２１】
音声特徴量抽出手段１３は，すべての登録話者音声から，個別に音声特徴量を抽出する。音声特徴量の抽出では，音声信号から線形予測法などに代表される一般的な信号処理方法を用いることができる。
【００２２】
特徴量格納手段１４は，音声特徴量抽出手段１３によって抽出された話者の音声特徴量を，話者名などの話者情報とともに音声データベース１に格納する。
【００２３】
話者検索手段１５は，発話区間検索の検索対象とする映像音声を入力し，映像音声を短区間に区切り，そのそれぞの短区間映像音声から抽出された音声特徴量と音声データベース１に格納された音声特徴量とを時間順に比較し，それらの類似度を算出し，最も類似度の高い結果を返す。
【００２４】
話者検索結果処理手段１６は，上記の類似度計算によって得られた短区間映像音声の話者検索結果を時系列的に連続した複数個ごとに集計し，検索結果名ごとに現れる回数をリストにしたものを出力する。すなわち，映像音声の全時間領域にわたって，話者検索手段１５により得られた結果を，中短区間映像音声ごとに集計し，出現した回数をもとに，所望の話者の発話区間を割り出す。
【００２５】
発話区間情報表示手段１７は，上記の出力に基づき，所望の話者が発話した時間帯情報を番組の先頭からの時間とともに端末画面に表示する。または，ある指定された時間における発話者情報を，その時間の映像とともに端末画面に表示する。
【００２６】
図２に示す発話区間検索装置１０の動作は，以下のとおりである。話者音声登録フェーズでは，話者情報登録手段１１によって入力された話者の音声を音声信号組合せ手段１２によって任意に組合せ，組み合わせた音声と組み合わされる前の音声との両方から，音声特徴量抽出手段１３によって特徴量を抽出し，それらを特徴量格納手段１４によって音声データベース１に格納する。次に，話者検索フェーズでは，検索対象となる映像音声から音声特徴量抽出手段１３によって特徴量を抽出し，抽出された特徴量を検索キーとして，話者検索手段１５により類似度に基づいて検索を行う。次に，発話区間決定フェーズでは，得られた検索結果から話者検索結果処理手段１６により，検索結果として得られた回数をもとに所望の話者の発話区間を割り出す。話者情報表示フェーズでは，発話区間情報表示手段１７によりあらかじめ登録された付加的な話者情報とともに端末画面に表示する。
【００２７】
以上のような手段により，本発明では，利用者があらかじめ発話区間検索装置１０によって提示されたいくつかの単独話者音声を選んで登録話者音声とするだけで，自動的に複数話者の同時発話の音声や音楽挿入部分の発話音声が作られ，これらが登録話者音声として追加される。
【００２８】
また，利用者が登録話者音声を選ぶ際に，人物名やその他の付加情報を入力することができ，利用者の入力した情報と登録話者音声との関連付けが自動的になされる。また，本発明の発話区間検索装置１０によって，自動的に番組の全区間に渡る音声特徴量が逐次算出され，あらかじめ格納された登録話者音声の音声特徴量との類似度が算出され，ある閾値以上を示した音声について登録話者の音声であると判断する。また，番組全時間に渡って得られた短区間映像音声についての話者検索結果を中短区間ごとに集計し，所望の話者の発話区間を決定する。
【００２９】
図２に示す話者情報登録手段１１と，音声信号組合せ手段１２と，音声特徴量抽出手段１３と，特徴量格納手段１４とによって，本発明に係る音声データベース登録処理装置を構成することができる。
【００３０】
また，図２に示す話者情報登録手段１１と，音声信号組合せ手段１２と，音声特徴量抽出手段１３と，特徴量格納手段１４と，話者検索手段１５とによって，本発明に係る音声発生源認識装置を構成することができる。
【００３１】
以上の各手段は，ＣＰＵおよびメモリなどからなるコンピュータとソフトウェアプログラムとによって実現することができ，そのプログラムは，コンピュータが読みとり可能な可搬媒体メモリ，半導体メモリ，ハードディスク等の適当な記録媒体に格納することができる。
【００３２】
本発明と従来技術との違いは，以下のとおりである。従来の話者認識技術では，音声情報をデータベースに登録する際に，通常，検索したい音声（人の音声，あるいは機械音など）を単独登録し，入力音声がそれと一致するかどうかを判断していた。これに対し，本発明は，音声を単独登録するだけでなく，複数の音声を信号レベルで任意の重みで合成し，その特徴量を再帰的にデータベースに登録する。この点が従来技術と異なる点である。
【００３３】
また，音声発生区間を検索する場合，従来技術では，一般的に検索結果の類似度を集計するなどして尤度を求めるが，本発明では，候補として並ぶ検索結果を一定時間集計し，頻出する結果を抽出する。このように出現回数で判断することにより，周囲環境音などが不規則に混入するような場合においても，目的とする音声発生区間を正しく検索することが可能になる。
【００３４】
【発明の実施の形態】
以下，本発明の実施の形態を図を用いて説明する。
【００３５】
〔実施の形態１〕
図３は，本発明の実施の形態１における発話区間検索装置の構成例を示す図である。本実施の形態１における発話区間検索装置１０は，ＣＰＵおよびメモリ等からなるコンピュータであり，ソフトウェアプログラムおよび記憶装置等によって構成される入力部１０１，候補映像音声提示部１０２，登録用音声合成部１０３，特徴量抽出部１０４，特徴量格納部１０５，映像音声切り出し部１０６，検索部１０７，検索結果処理部１０８，話者情報格納部１０９，表示部１１０を備えている。また，本実施の形態１における発話区間検索装置１０には，端末表示装置２０が接続されている。
【００３６】
発話区間検索装置１０の動作は，〈話者音声登録フェーズ〉，〈話者検索フェーズ〉，〈発話区間決定フェーズ〉，〈話者情報表示フェーズ〉に分けることができる。以下，発話区間検索装置１０の各フェーズの動作について，フローチャートを用いて説明する。
【００３７】
〈話者音声登録フェーズ〉
図４は，本実施の形態１における話者音声登録フェーズの動作を説明するフローチャートである。はじめに，入力部１０１は，検索対象となる映像音声を入力し（ステップＳ１０），候補映像音声提示部１０２は，入力した映像音声の中から任意の部分を切り出し，これを登録話者音声の候補（以下，登録用話者候補映像という）として利用者に提示する（ステップＳ１１）。
【００３８】
ここで，候補映像音声提示部１０２では，例えば，一般的な方法によって放送番組中から「人の声であり，一人が連続して一定時間（２０〜６０秒）話している部分」を検出し，登録用話者候補映像として利用者に複数個提示する。
【００３９】
登録用話者候補映像が提示されると，利用者は，提示された登録用話者候補映像をそれぞれ視聴するなどし，登録話者音声として採用するか否かを決定する。候補映像音声提示部１０２は，利用者からの登録用話者候補映像（登録話者音声）の選択を受け（ステップＳ１２），利用者からその登録話者に関する話者名などの情報（以下，登録話者情報という）を入力し（ステップＳ１３），その登録話者情報と登録話者音声とを一時記録する（ステップＳ１４）。利用者の指示により，検出が必要な話者数分だけ上記ステップＳ１２〜Ｓ１４（もしくはＳ１１〜Ｓ１４）を繰り返す（ステップＳ１５）。
【００４０】
上記処理では，例えば利用者が，提示された登録用話者侯捕映像の中からある一つを人物Ａの音声として採用する場合，端末画面で提示されたその登録用話者侯捕映像を登録話者音声として選択し，選択した登録用話者侯捕映像の下に用意されたテキストボックスなどに「話者Ａ」と入力することで，選択した登録用話者侯捕映像と「話者Ａ」という登録話者情報との関連付けが行われる。登録話者情報として話者名を入力するだけではなく，性別，年齢，職業，所属会社などの情報も付加的に入力することができる。利用者は，検出が必要な人数分だけ上記作業を行う。
【００４１】
複数人数分の登録話者音声が選択されると，次に，登録用音声合成部１０３は，選択された複数人数分の登録話者音声から任意の組合わせについて音声を合成して複数の登録話者音声が組み合わされた音声を生成し（ステップＳ１６），候補映像音声提示部１０２で選択された登録話者音声に加える。
【００４２】
ここで，例えば２名の話者が同時に発話している音声や，音楽や効果音を背景として発話が行われている音声などが作成され，それらが登録話者音声に加えられる。さらに具体的に説明すると，利用者が話者Ａと話者Ｂの音声を登録話者音声として選択すると，登録用音声合成部１０３により自動的に話者Ａ＋Ｂの音声が生成され，話者Ａ，話者Ｂ，話者Ａ＋Ｂの音声が登録話者音声となる。
【００４３】
特徴量抽出部１０４は，すべての登録話者音声について，音声信号から線形予測法などに代表される一般的な特徴量を抽出する方法に従って音声特徴量を抽出し（ステップＳ１７），それらの音声特徴量を特徴ベクトルとして特徴量格納部１０５の音声データベース１に記録する（ステップＳ１８）。
【００４４】
〈話者検索フェーズ〉
図５は，本実施の形態１における話者検索フェーズの動作を説明するフローチャートである。まず，入力部１０１によって検索対象となる番組の映像（映像音声）を入力し，映像音声切り出し部１０６によって，その入力映像音声を短時間ごとに区切って短区間映像音声を切り出す（ステップＳ２０）。特徴量抽出部１０４は，切り出された短区間映像音声から音声特徴量（特徴ベクトル）を抽出する（ステップＳ２１）。短区間の長さは，例えば１０ｍｓから１００ｍｓ程度の予め定めれた長さであるが，本発明の実施は，この長さに限られるわけではない。
【００４５】
検索部１０７は，ステップＳ２１で抽出された短区間映像音声の特徴ベクトルと，〈話者音声登録フェーズ〉において特徴量格納部１０５に格納されたすべての登録話者音声の特徴ベクトルとの類似度計算を行い（ステップＳ２２），最も類似度が高かった登録話者音声の登録話者情報を検索結果とする（ステップＳ２３）。ステップＳ２０〜Ｓ２３の処理を番組開始時間から番組終了時間まで繰り返し実行し（ステップＳ２４），すべての短区間映像音声に対して検索結果を得る。
【００４６】
次に，各短区間映像音声の検索結果を時系列的に連続な複数個（例えば１００個）ごとにまとめ（以下，このまとまりを中短区間という），その中短区間ごとに検索結果を集計し（ステップＳ２５），その集計結果を出現回数順にソートして結果リストを生成する（ステップＳ２６）。この結果リストを，一つの中短区間映像音声に対する登録話者音声の検索結果として出力する（ステップＳ２７）。すべての短区間映像音声の検索結果に対して，ステップＳ２５〜Ｓ２７の処理を繰り返して実行する（ステップＳ２８）。
【００４７】
ここで，中短区間映像音声が例えば１００個の短区間映像音声の集合であるとすると，その中短区間内における１００個の検索結果の内訳は，「話者名：出現回数」の形式で表すと，例えば［話者Ａ：５０，話者Ａ＋Ｂ：２０，話者Ｂ＋Ｃ：１０，話者Ａ＋Ｃ：３，話者Ｄ：２，話者Ｂ＋Ｄ：０，．．．　］のようになる。この出現回数のリストを出現回数順にソートしたものが結果リストであり，これを一つの中短区間映像音声に対する話者の検索結果候補とする。
【００４８】
図６は，本実施の形態１における検索結果から結果リストを作成する例を示す図である。図中（ａ）は，各短区間映像音声に対する検索結果の例であり，各短区間の検索結果は話者名で記載されている。これらの検索結果を中短区間ごとに集計する。図６の例では，６つの短区間で一つの中短区間としている。検索結果を集計したものを出現回数ごとにソートしたものが結果リストである。図中（ｂ）は，中短区間ごとの結果リストの例を示している。例えば，１番目の中短区間の結果リストでは，話者Ａが出現する回数が３回，話者Ａ＋Ｂが２回，話者Ａ＋Ｃが１回という集計結果が示されている。
【００４９】
なお，本実施の形態では，上記ステップＳ２５〜Ｓ２８を話者検索フェーズとしているが，この部分を下記の発話区間決定フェーズとして実行してもよく，全体の実質的な動作が変わるわけではない。
【００５０】
〈発話区間決定フェーズ〉
図７は，本実施の形態１における発話区間決定フェーズの動作を説明するフローチャートである。本実施の形態１における〈発話区間決定フェーズ〉では，結果リストを下記に示す流れに従って処理することにより，複数話者の同時発話を含む映像音声から，特定の話者の発話区間を正確に割り出す。
【００５１】
まず，検索結果処理部１０８は，一つの中短区間の結果リストを入力する（ステップＳ３０）。入力した結果リストの上位ｎ件以内に単独話者名があるかどうかを判断し（ステップＳ３１），なければステップＳ３８に進む。ｎは，あらかじめ設定された値である。
【００５２】
ここで，例えば，ｎ＝５とし，ある結果リストが［話者Ａ：５０，話者Ａ＋Ｂ：２０，話者Ｂ＋Ｃ：１０，話者Ａ＋Ｃ：３，話者Ｄ：２，話者Ｂ＋Ｄ：０，……］となっている場合，「話者Ａ」，「話者Ｄ」が上位５件以内に含まれている単独話者名であると判断する。
【００５３】
結果リストの上位ｎ件以内に単独話者名が一つでも含まれている場合，単独話者名のうち最も上位にある話者名をＰ_ａとし（ステップＳ３２），Ｐ_ａの単独の出現回数を総出現回数とする（ステップＳ３３）。結果リストの上位ｎ件以内にある複数話者の同時発話の結果でＰ_ａを含んでいるものがあれば（ステップＳ３４），それらすべてのＰ_ａを含む複数話者の同時発話の出現回数をＰ_ａの単独の出現回数に加え，Ｐ_ａの総出現回数とする（ステップＳ３５）。
【００５４】
ここで，上記の例のように，ｎ＝５とし，ある結果リストが［話者Ａ：５０，話者Ａ＋Ｂ：２０，話者Ｂ＋Ｃ：１０，話者Ａ＋Ｃ：３，話者Ｄ：２，話者Ｂ＋Ｄ：０，．．．　］となっている場合，最も上位にある単独話者名である「話者Ａ」をＰ_ａとすると，複数話者の同時発話のうちＰ_ａを含むのは「話者Ａ＋Ｂ」，「話者Ａ＋Ｃ」である。Ｐ_ａの単独の出現回数に，「話者Ａ＋Ｂ」，「話者Ａ＋Ｃ」の出現回数を加えたＰ_ａの総出現回数は，
５０＋２０＋３＝７３
となる。
【００５５】
図８は，本実施の形態１における総出現回数の算出方法を説明する図である。図８の例の中短区間の結果リストにおいて，上位ｎ＝５以内の単独話者名には話者Ａがあるので，話者ＡがＰ_ａとなる。上位ｎ＝５以内の複数話者の同時発話のうち話者Ａを含むものは，図８の例の場合，「話者Ａ＋Ｂ」，「話者Ａ＋Ｃ」，「話者Ａ＋Ｄ」である。話者Ａの単独の出現回数に，「話者Ａ＋Ｂ」，「話者Ａ＋Ｃ」，「話者Ａ＋Ｄ」の出現回数を加えた話者Ａの総出現回数は，
１０＋９＋７＋２＝２８
となる。
【００５６】
中短区間におけるＰ_ａの総出現回数があらかじめ定められた閾値Ｔを超えた場合（ステップＳ３６），そのＰ_ａをその中短区間映像音声の話者名であるとする（ステップＳ３７）。
【００５７】
ステップＳ３０〜Ｓ３７の処理を，すべての中短区間の結果リストについて実行する（ステップＳ３８）。映像音声中のすべての中短区間映像音声の話者名と時間情報との組合せを，発話区間の話者情報として話者情報格納部１０９に格納する（ステップＳ３９）。
【００５８】
〈話者情報表示フェーズ〉
図９は，本実施の形態１における話者情報表示フェーズの動作を説明するフローチャートである。このフェーズでは，利用者からの要求に従って，端末表示装置２０に話者情報を表示する。
【００５９】
まず，表示部１１０は，利用者からの要求の入力を受ける（ステップＳ４０）。利用者の入力が話者名か時間かを判定し（ステップＳ４１），利用者の入力が話者名であれば，話者情報格納部１０９の話者情報をその話者名で検索し（ステップＳ４２），その話者が発話したすべての中短区間の時間情報を視覚的に端末表示装置２０に表示する（ステップＳ４３）。ステップＳ４１において利用者の入力が時間であれば，話者情報格納部１０９の話者情報をその時間で検索し（ステップＳ４４），その時間に発話している話者の話者名を端末表示装置２０に表示する（ステップＳ４５）。
【００６０】
図１０は，上記ステップＳ４３で表示される話者情報表示画面の例を示している。ここでは，画面左側に映像音声の再生画面とともに話者の名前が表示され，また，画面右側に人物の情報として話者の名前と，その話者が発話している時間帯の情報が表示されている。これによって，特定の話者がいつ発話しているかがすぐに分かる。
【００６１】
また，図１１は，上記ステップＳ４５で表示される話者情報表示画面の例を示している。ここでは，画面左側に映像音声の再生画面とともにその再生画面の時刻が表示され，また，画面右側に人物の情報として指定された時間の話者に関する名前，所属等の話者情報が表示されている。これによって，ある時間にどのような人物が発話しているかがすぐに分かる。
【００６２】
〔実施の形態２〕
図１２は，本発明の実施の形態２における発話区間検索装置の構成例を示す図である。本実施の形態２における発話区間検索装置１０’は，ＣＰＵおよびメモリ等からなるコンピュータであり，ソフトウェアプログラムおよび記憶装置等によって構成される入力部１０１，候補映像音声提示部１０２，登録用音声合成部１０３，特徴量抽出部１０４，特徴量格納部１０５，映像音声切り出し部１０６，検索部１０７，検索結果処理部１０８，話者情報格納部１０９，表示部１１０，映像音声再選択部１１１を備えている。また，本実施の形態２における発話区間検索装置１０’には，端末表示装置２０が接続されている。
【００６３】
本実施の形態２は，映像音声再選択部１１１を有し，上記〈話者情報表示フェーズ〉で表示された話者情報をもとに，特徴量格納部１０５に格納されている登録話者音声の音声特徴量を再設定する機能を持つ点が，前述した実施の形態１と異なる。
【００６４】
本実施の形態２における発話区間検索装置１０’は，実施の形態１の動作の後に，〈話者音声再登録フェーズ〉の動作を行う。以下，発話区間検索装置１０’における〈話者音声再登録フェーズ〉について，フローチャートを用いてその動作を説明する。
【００６５】
〈話者音声再登録フェーズ〉
図１３は，本実施の形態２における話者音声再登録フェーズの動作を説明するフローチャートである。本実施の形態２では，利用者が所望の話者の登録話者音声を，発話区間の検索結果を用いて修正することができる。例えば，実施の形態１の動作によって，所望の話者（話者Ｐ_ａとする）の発話区間がＴ_０〜Ｔ_１およびＴ_２〜Ｔ_３であるという結果が得られたとする。しかし，利用者が実際に端末表示装置２０で結果を確認すると，Ｔ_０〜Ｔ_１は所望の話者でなく，Ｔ_２〜Ｔ_３およびＴ_４〜Ｔ_５が正しい結果であり，これを登録話者音声として再登録したい場合に，利用者は，Ｔ_２〜Ｔ_３およびＴ_４〜Ｔ_５の映像音声から登録話者音声の再登録を行うことができる。
【００６６】
まず，映像音声再選択部１１１は，話者Ｐ_ａの登録話者音声として再登録したい映像音声の選択を利用者から受けると（ステップＳ５０），その映像音声を話者Ｐ_ａの登録話者音声として登録用音声合成部１０３に送る。登録用音声合成部１０３は，利用者が選択した話者Ｐ_ａの登録話者音声と他の登録話者音声とから任意の組合わせについて音声を合成し，利用者が選択した登録話者音声を含む複数の登録話者音声が組み合わされた音声を生成する（ステップＳ５１）。
【００６７】
特徴量抽出部１０４は，利用者が選択した話者Ｐ_ａの登録話者音声と，利用者が選択した登録話者音声を含む複数の登録話者音声が組み合わされた音声とからそれぞれ音声特徴量（特徴ベクトル）を抽出し（ステップＳ５２），それらの抽出された音声特徴量で，特徴量格納部１０５にそれまでに格納されていた音声特徴量を上書きする（ステップＳ５３）。
【００６８】
以上のような一連の動作によって，利用者は，例えばＴ_２〜Ｔ_３の映像音声を新たに話者Ｐ_ａの登録用話者音声として置き換え，さらに話者Ｐ_ａを含む複数の登録話者音声の合成により生成された登録話者音声も新たに置き換えることができる。
【００６９】
以上，本実施の形態１および２について説明したが，本発明では，もちろん検索対象の番組だけではなく，検索対象以外の番組からも登録話者音声を作成することができる。また，登録話者音声としてＢＧＭを登録し，ＢＧＭの登録話者音声と他の登録話者音声との任意の組合せについて音声を合成し，それらの音声特徴量を登録することにより，発話区間の検索において，背景に効果音がある場合の発話区間の検索も行うことが可能になる。
【００７０】
また，登録用音声合成部１０３において音声を合成する際に，各登録話者音声に音の大きさや音の高さなどについて任意に重みを設定してから，各登録話者音声を合成する実施も可能である。
【００７１】
以上，番組映像における人間の発話区間について検索する例を説明したが，本発明が人間の音声以外の一般音声にも適用できることは言うまでもない。
【００７２】
本発明の利用例として，以下のような例が考えられる。
（１）ストリーミング映像，ビデオ，テレビ番組などの映像音声から話者の発話区間を検出する場合に使用する。
（２）単一の集音マイクで録音された電話会議などから議事録を起こす作業の支援に使用する。番組などの音声は必ずしも登録音声だけが音声信号として放送されるわけではなく，実際には周囲環境音，雑音を含み多様である。このような場合に，本発明を用いた音声発生区間の検索は有効である。
（３）一般的な周囲環境音の中でクラクションが鳴らされた回数をカウントするのに使用する。クラクションの音声を単独で登録することは容易であるが，環境音声と混合した状態で正確に検出することは一般的には難しい。環境音にバリエーションがあることと，クラクションの音もドップラ効果などにより歪むからである。本発明を適用することにより，このような場合にも正確に検出することが可能になる。
（４）森の中で動物の鳴き声を判断するのに使用する。
（５）定常状態で動作する機械に，通常ではあり得ない音が発生したことを検出する場合に使用する。
【００７３】
【発明の効果】
以上説明したように，本発明によって，テレビ番組などの複数の話者が同時に発話する音声や，背景に効果音を含む映像中において，利用者が提示された候補の中から所望の話者の単独音声を登録するだけで，複数話者の同時発話部分も含んだ映像中においても，所望の話者の発話区間検索を精度よく行うことができるようになる。また，利用者が，発話区間検索の結果を用いて登録話者音声を再作成することができる。また，人の音声に限らず，自然音についても音声発生源の認識，音声発生区間の検索に利用することができる。
【図面の簡単な説明】
【図１】本発明の概要を説明するための図である。
【図２】本発明に係る装置の構成図である。
【図３】本発明の実施の形態１における発話区間検索装置の構成例を示す図である。
【図４】本実施の形態１における話者音声登録フェーズの動作を説明するフローチャートである。
【図５】本実施の形態１における話者検索フェーズの動作を説明するフローチャートである。
【図６】本実施の形態１における検索結果から結果リストを作成する例を示す図である。
【図７】本実施の形態１における発話区間決定フェーズの動作を説明するフローチャートである。
【図８】本実施の形態１における総出現回数の算出方法を説明する図である。
【図９】本実施の形態１における話者情報表示フェーズの動作を説明するフローチャートである。
【図１０】本実施の形態１における話者情報表示画面の例を示す図である。
【図１１】本実施の形態１における話者情報表示画面の例を示す図である。
【図１２】本発明の実施の形態２における発話区間検索装置の構成例を示す図である。
【図１３】本実施の形態２における話者音声再登録フェーズの動作を説明するフローチャートである。
【符号の説明】
１　　　音声データベース
１０，１０’　発話区間検索装置
１１　　話者情報登録手段
１２　　音声信号組合せ手段
１３　　音声特徴量抽出手段
１４　　特徴量格納手段
１５　　話者検索手段
１６　　話者検索結果処理手段
１７　　発話区間情報表示手段
１０１　入力部
１０２　候補映像音声提示部
１０３　登録用音声合成部
１０４　特徴量抽出部
１０５　特徴量格納部
１０６　映像音声切り出し部
１０７　検索部
１０８　検索結果処理部
１０９　話者情報格納部
１１０　表示部
１１１　映像音声再選択部
２０　　端末表示装置
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to techniques for speech database registration, speech source recognition, and speech generation section search used for speaker recognition and the like. In particular, it relates to techniques used when, for video that has been shot and recorded, such as a broadcast program (hereinafter referred to as video/audio), speaker information in the video/audio is automatically recorded on a recording medium together with time information measured from the start of recording, and the time periods in which a specific speaker spoke in the video are searched for on the basis of the recorded speaker information.
[0002]
[Prior art]
When searching for the utterance sections of a specific speaker in video/audio (hereinafter referred to as utterance section search), a common method identifies the speaker by using a pre-registered voice of the speaker to be searched for (hereinafter referred to as registered speaker voice) (see, for example, Non-Patent Document 1).
[0003]
Normally, about 30 to 120 seconds of speech uttered by the desired speaker alone (hereinafter referred to as single-speaker voice) is used as the registered speaker voice, and a codebook is created from it. In an utterance section search, this codebook is used to detect the utterance sections of the desired speaker, for example by successively extracting speech features from the beginning of the program and comparing them with the codebook. This yields results such as: speaker A spoke from T₀ seconds to T₁ seconds, counted from the start of the program, and speaker B spoke from T₂ seconds to T₃ seconds.
[0004]
In general, in television programs and the like, speakers may speak at the same time, so there is not necessarily only one speaker, and in the above example it may be that T₀ < T₂ < T₁ < T₃. In that case, multiple speakers (here, speaker A and speaker B) are speaking simultaneously in the section T₂ to T₁, and at present it is difficult to search this portion correctly using a codebook created from single-speaker voice.
[0005]
For example, in the above case (T₀ < T₂ < T₁ < T₃), although speaker A's actual utterance time is T₀ to T₁, the section T₂ to T₁ cannot be detected accurately, and the utterance section search may produce an erroneous result, such as detecting speaker A's utterance section as T₀ to T₂ (T₂ < T₁). Speaker A's utterance section may also be detected incorrectly when music is inserted while speaker A is talking.
[0006]
Furthermore, because speech does not consist only of voiced portions (there are breaks between sentences), the features of silent portions are reflected in the codebook. As a result, errors may occur if the speaker is identified using only the distances between vectors in very short time units.
[0007]
For example, suppose the feature vector of a certain section is closest to one of the feature vectors in speaker B's codebook. In reality, however, the speaker in this section is A, and the vector used as the search key merely happened to be closest to a feature vector reflecting a silent section that was registered when speaker B's codebook was created. In such a case, simply performing a database search keyed on the feature vector of the video/audio at a given time is insufficient.
[0008]
As described above, when a speaker's utterance sections are searched for with the conventional method in video/audio such as general television programs, which vary widely, the accuracy generally drops significantly.
[0009]
[Non-patent document 1]
F. K. Soong et al., "A vector quantization approach to speaker recognition," Proc. ICASSP, pp. 387-390
[0010]
[Problems to be solved by the invention]
As described above, with the conventional method of creating a codebook only from single-speaker voice, it is difficult to obtain the desired speaker's utterance sections accurately from general video/audio that includes portions where multiple speakers speak simultaneously. One could also manually search a long video for portions where multiple speakers speak simultaneously and use them as registered speaker voice, but this is extremely inefficient and difficult to put into practical use.
[0011]
An object of the present invention is to solve these problems and to enable accurate speaker search even when the video/audio contains portions where multiple speakers speak simultaneously.
[0012]
[Means for Solving the Problems]
The present invention is not limited to human speech; it can also be used to recognize the sound source, and to search for the sound generation sections, of animal calls such as those of birds and insects and of machine sounds. The following description, however, mainly uses human speaker recognition and the search for speakers' utterance sections as an example.
[0013]
FIG. 1 is a diagram for explaining the outline of the present invention.
[0014]
Normally, when the features of speech signals are registered as training data, the features of the voices A and B of each speaker are stored individually in the speech database 1, as shown in FIG. 1(A). In the search stage, calculations are made of how similar the input speech is to A and B and how that similarity changes over time, and it is finally determined which registered voice the input speech matches.
[0015]
However, when voice A and voice B are mixed in the input speech to be searched, the speech database 1 contains no suitable training data, so it is difficult to obtain good search results. When speech such as A+B, in which voices A and B are mixed, is input, the search results for the input speech are normally processed temporally and probabilistically to decide whether it is A, B, or neither. An accurate search is therefore not possible. On the other hand, recording a mixture of voice A and voice B in advance as training data is laborious and sometimes impossible.
[0016]
Therefore, in the present invention, as shown in FIG. 1(B), training speech signals prepared in advance are summed in arbitrary combinations, and their features are also registered recursively in the speech database 1. That is, in addition to registering the features of voice A and voice B in the speech database 1, a voice A+B is created temporarily and virtually, and its features are registered in the speech database 1 as well. This makes it possible to search for voice A and voice B even when the two are mixed.
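The registration idea in this paragraph can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the `max_speakers` cap, the `"A+B"` label format, and the plain sample-wise sum with equal weights are all assumptions made for the example.

```python
from itertools import combinations


def mix(signals):
    """Sum equal-length sample sequences element-wise (signal-level mixing)."""
    return [sum(samples) for samples in zip(*signals)]


def build_registered_voices(single_voices, max_speakers=2):
    """Return {label: signal} for every single voice and every mixed combination.

    single_voices: dict mapping a speaker label to a list of samples.
    max_speakers:  hypothetical cap on how many voices are mixed at once.
    """
    registered = dict(single_voices)  # keep the single-speaker voices
    labels = sorted(single_voices)
    for k in range(2, max_speakers + 1):
        for combo in combinations(labels, k):
            # Label a virtual mixture such as "A+B" (label format is an assumption).
            registered["+".join(combo)] = mix([single_voices[name] for name in combo])
    return registered


voices = {"A": [0.1, 0.2, -0.1], "B": [0.0, -0.2, 0.3]}
registered = build_registered_voices(voices)
print(sorted(registered))  # ['A', 'A+B', 'B']
```

Features would then be extracted from every entry of `registered`, mixtures included, before storage in the database.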
[0017]
FIG. 2 shows a configuration diagram of the device according to the present invention. The utterance section search device 10 includes speaker information registration means 11, speech signal combination means 12, speech feature extraction means 13, feature storage means 14, speaker search means 15, speaker search result processing means 16, and utterance section information display means 17.
[0018]
Here, video such as a television program is called video/audio; each of the pieces obtained by dividing the video/audio in time series at arbitrary intervals is called a short-section video/audio; and each group formed by collecting several short-section video/audio pieces that are consecutive in time is called a medium-short-section video/audio.
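The two levels of segmentation defined here can be illustrated with a short sketch. The function names and the choice to drop a trailing remainder shorter than one short section are assumptions for illustration; the description does not specify how a remainder is handled.

```python
def split_short_sections(samples, section_len):
    """Cut a sample sequence into consecutive short sections of section_len samples.

    A trailing remainder shorter than section_len is dropped (an assumption).
    """
    return [samples[i:i + section_len]
            for i in range(0, len(samples) - section_len + 1, section_len)]


def group_medium_sections(items, group_size):
    """Group consecutive short-section items into medium-short sections."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


samples = list(range(10))                  # stand-in for audio samples
short = split_short_sections(samples, 2)   # 5 short sections
medium = group_medium_sections(short, 3)   # groups of up to 3 short sections
print(len(short), len(medium))  # 5 2
```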
[0019]
The speaker information registration means 11 cuts out arbitrary portions of the video/audio and presents them to the user as candidates for registered speaker voice. For example, it automatically cuts out several fixed-length pieces of unspecified speaker voice from a program and presents them to the user. The user decides which of the presented candidates to use as registered speaker voice and, for those to be used, adds supplementary information such as the speaker's name.
[0020]
The speech signal combination means 12 sums, in arbitrary combinations, the speech signals of the single-speaker voices that the user selected as registered speaker voice, and adds the results to the registered speaker voice. That is, it virtually generates voices in which multiple speakers speak simultaneously, or in which music plays behind the voice, and adds them to the registered speaker voice.
[0021]
The speech feature extraction means 13 extracts speech features individually from all registered speaker voices. A general signal processing method, typified by linear prediction, can be used to extract the speech features from the speech signal.
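As one concrete instance of the linear prediction approach mentioned here, the sketch below computes linear prediction coefficients for a single frame with the autocorrelation method and the Levinson-Durbin recursion. The description does not say which variant or prediction order is used, so the function names, the order, and the handling of silent frames are assumptions.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion.

    Returns a[1..order] such that x[t] is predicted as sum_k a[k] * x[t - k].
    A silent (all-zero) frame yields an all-zero vector (an assumption).
    """
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        return [0.0] * order
    a = [0.0] * (order + 1)
    err = r[0]                      # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:              # numerically degenerate frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err               # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]


# A decaying exponential x[t] = 0.5 ** t is predicted well by x[t] ~ 0.5 * x[t - 1].
frame = [0.5 ** t for t in range(30)]
features = lpc_coefficients(frame, 1)
```

The resulting coefficient vector (or a representation derived from it) would serve as the feature vector of the frame.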
[0022]
The feature storage means 14 stores the speech features extracted by the speech feature extraction means 13 in the speech database 1 together with speaker information such as the speaker's name.
[0023]
The speaker search means 15 receives the video/audio to be searched, divides it into short sections, compares the speech features extracted from each short-section video/audio with the speech features stored in the speech database 1 in time order, calculates their similarity, and returns the result with the highest similarity.
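A minimal sketch of this per-section search is shown below. It assumes that similarity is measured by Euclidean distance (smaller distance meaning higher similarity) and that the database is a simple list of labeled feature vectors; the description does not fix the similarity measure, and the names and toy vectors are illustrative only.

```python
import math


def nearest_label(query, database):
    """Return the label of the database entry closest to the query vector.

    database: list of (label, feature_vector) pairs.
    Euclidean distance stands in for the similarity measure (an assumption).
    """
    best_label, _ = min(database, key=lambda entry: math.dist(query, entry[1]))
    return best_label


# Toy database containing single voices and a virtual mixture.
database = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("A+B", [0.7, 0.7])]
queries = [[0.9, 0.1], [0.6, 0.8], [0.1, 0.9]]  # one vector per short section
results = [nearest_label(q, database) for q in queries]
print(results)  # ['A', 'A+B', 'B']
```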
[0024]
The speaker search result processing means 16 tallies the speaker search results obtained for the short-section video/audio by the above similarity calculation, in groups of several results that are consecutive in time, and outputs a list of how many times each search result name appears. In other words, over the entire duration of the video/audio, it tallies the results obtained by the speaker search means 15 for each medium-short-section video/audio and determines the desired speaker's utterance sections from the number of appearances.
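The tallying step can be sketched with a counter per medium-short section. The input labels below reproduce the counts given for the first medium-short section in the FIG. 6 example of Embodiment 1 (six short sections per group; speaker A three times, A+B twice, A+C once); the order of the labels within the group, the second group, and the function name are made up for illustration.

```python
from collections import Counter


def result_lists(labels, group_size):
    """Tally per-short-section search results in consecutive groups.

    Returns one list per medium-short section, each sorted by descending count.
    """
    lists = []
    for start in range(0, len(labels), group_size):
        counts = Counter(labels[start:start + group_size])
        lists.append(counts.most_common())
    return lists


labels = ["A", "A", "A+B", "A", "A+B", "A+C",   # first medium-short section
          "B", "B", "A+B", "B", "B", "B"]       # second medium-short section
lists = result_lists(labels, 6)
print(lists[0])  # [('A', 3), ('A+B', 2), ('A+C', 1)]
```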
[0025]
Based on this output, the utterance section information display means 17 displays on the terminal screen the time periods in which the desired speaker spoke, together with the time from the beginning of the program. Alternatively, it displays the speaker information for a specified time on the terminal screen together with the video at that time.
[0026]
The utterance section search device 10 shown in FIG. 2 operates as follows. In the speaker voice registration phase, the speaker voices input through the speaker information registration means 11 are combined arbitrarily by the speech signal combination means 12; the speech feature extraction means 13 extracts features from both the combined voices and the voices before combination; and the feature storage means 14 stores them in the speech database 1. Next, in the speaker search phase, the speech feature extraction means 13 extracts features from the video/audio to be searched, and the speaker search means 15 performs a similarity-based search using the extracted features as the search key. Next, in the utterance section determination phase, the speaker search result processing means 16 determines the desired speaker's utterance sections from the search results, based on the number of times each result was obtained. In the speaker information display phase, the utterance section information display means 17 displays the result on the terminal screen together with the supplementary speaker information registered in advance.
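The count-based determination can be sketched as below, following the rule given for the utterance section determination phase of Embodiment 1 (steps S31 to S37): take the highest-ranked single-speaker name P_a within the top n entries, add the counts of the top-n mixtures that contain P_a, and accept P_a if the total exceeds a threshold T. The function names, the `"A+B"` label encoding, and the threshold value used in the example are assumptions; the result list is the worked example from the description, whose total is 50 + 20 + 3 = 73.

```python
def total_occurrences(result_list, n):
    """Return (P_a, total count) for one medium-short section, or None.

    result_list: [(label, count), ...] sorted by descending count, where a
    mixture of speakers is labeled like "A+B" (label format is an assumption).
    """
    top = result_list[:n]
    singles = [label for label, _ in top if "+" not in label]
    if not singles:
        return None                  # no single-speaker name in the top n
    p_a = singles[0]                 # highest-ranked single-speaker name
    total = sum(count for label, count in top if p_a in label.split("+"))
    return p_a, total


def decide_speaker(result_list, n, threshold):
    """Return P_a if its total count exceeds the threshold, otherwise None."""
    found = total_occurrences(result_list, n)
    if found is None:
        return None
    p_a, total = found
    return p_a if total > threshold else None


example = [("A", 50), ("A+B", 20), ("B+C", 10), ("A+C", 3), ("D", 2), ("B+D", 0)]
print(total_occurrences(example, 5))  # ('A', 73)
print(decide_speaker(example, 5, 60))  # A
```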
[0027]
With the above means, in the present invention the user only has to select some of the single-speaker voices presented in advance by the utterance section search device 10 as registered speaker voice; voices of multiple speakers speaking simultaneously and voices of speech with music inserted are then created automatically and added as registered speaker voice.
[0028]
When selecting a registered speaker voice, the user can also enter a person's name and other supplementary information, and the entered information is automatically associated with the registered speaker voice. In addition, the utterance section search device 10 of the present invention automatically and successively calculates speech features over the entire program, calculates their similarity to the stored speech features of the registered speaker voices, and judges speech whose similarity is at or above a certain threshold to be the voice of a registered speaker. The speaker search results for the short-section video/audio obtained over the whole program are then tallied for each medium-short section to determine the desired speaker's utterance sections.
[0029]
The speaker information registration means 11, the speech signal combination means 12, the speech feature extraction means 13, and the feature storage means 14 shown in FIG. 2 can constitute a speech database registration processing device according to the present invention.
[0030]
Likewise, the speaker information registration means 11, the speech signal combination means 12, the speech feature extraction means 13, the feature storage means 14, and the speaker search means 15 shown in FIG. 2 can constitute a speech generation source recognizing device according to the present invention.
[0031]
Each of the above means can be realized by a computer comprising a CPU, memory, and the like, together with a software program, and the program can be stored on a suitable recording medium such as a computer-readable portable medium memory, a semiconductor memory, or a hard disk.
[0032]
The differences between the present invention and the prior art are as follows. In conventional speaker recognition technology, when speech information is registered in a database, the sound to be searched for (a human voice, a machine sound, or the like) is usually registered on its own, and it is then judged whether the input speech matches it. In contrast, the present invention not only registers sounds on their own but also combines multiple sounds at the signal level with arbitrary weights and recursively registers the features of the result in the database. This is where it differs from the prior art.
[0033]
When searching for speech generation sections, the prior art generally obtains a likelihood, for example by aggregating the similarities of the search results. In the present invention, by contrast, the search results lined up as candidates are tallied over a fixed period and the frequently occurring results are extracted. Judging by the number of appearances in this way makes it possible to find the target speech generation sections correctly even when ambient sounds and the like are mixed in irregularly.
[0034]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0035]
[Embodiment 1]
FIG. 3 is a diagram showing a configuration example of the utterance section search device according to Embodiment 1 of the present invention. The utterance section search device 10 according to Embodiment 1 is a computer comprising a CPU, memory, and the like, and includes the following units, each implemented by a software program, a storage device, and the like: an input unit 101, a candidate video/audio presentation unit 102, a registration speech synthesis unit 103, a feature extraction unit 104, a feature storage unit 105, a video/audio segmentation unit 106, a search unit 107, a search result processing unit 108, a speaker information storage unit 109, and a display unit 110. A terminal display device 20 is connected to the utterance section search device 10 according to Embodiment 1.
[0036]
The operation of the utterance section search device 10 can be divided into a <speaker voice registration phase>, a <speaker search phase>, an <utterance section determination phase>, and a <speaker information display phase>. The operation of each phase of the utterance section search device 10 is described below using flowcharts.
[0037]
<Speaker voice registration phase>
FIG. 4 is a flowchart illustrating the operation of the speaker voice registration phase in Embodiment 1. First, the input unit 101 receives the video/audio to be searched (step S10), and the candidate video/audio presentation unit 102 cuts out arbitrary portions of the input video/audio and presents them to the user as candidates for registered speaker voice (hereinafter referred to as registration speaker candidate videos) (step S11).
[0038]
Here, the candidate video/audio presentation unit 102 uses a general method to detect, for example from a broadcast program, portions that consist of human voice and in which one person speaks continuously for a certain period of time (20 to 60 seconds), and presents them to the user as registration speaker candidate videos.
[0039]
When the registration speaker candidate videos are presented, the user views each of them and decides whether to adopt it as a registered speaker voice. The candidate video/audio presentation unit 102 receives the user's selection of a registration speaker candidate video (registered speaker voice) (step S12), receives from the user information about the registered speaker such as a speaker name (hereinafter referred to as registered speaker information) (step S13), and temporarily records the registered speaker information and the registered speaker voice (step S14). Steps S12 to S14 (or S11 to S14) are repeated, in accordance with the user's instructions, for the number of speakers that need to be detected (step S15).
[0040]
In the above process, for example, when the user adopts one of the presented registration speaker candidate videos as the voice of person A, the user selects that video on the terminal screen as the registered speaker voice and enters "person A" in the text box provided below it. The selected registration speaker candidate video is thereby associated with the registered speaker information "person A". In addition to a speaker name, information such as gender, age, occupation, and affiliated company can also be entered as registered speaker information. The user performs the above operation for each person who needs to be detected.
[0041]
When a plurality of registered speaker voices have been selected, the registration voice synthesis unit 103 synthesizes voices for arbitrary combinations of the selected registered speaker voices, generating voices in which a plurality of registered speaker voices are combined (step S16), and adds them to the registered speaker voices selected via the candidate video/audio presentation unit 102.
[0042]
Here, for example, a voice in which two speakers speak at the same time, a voice in which an utterance is made over background music or sound effects, and the like are created and added to the registered speaker voices. More specifically, when the user selects the voices of speakers A and B as registered speaker voices, the registration voice synthesis unit 103 automatically generates the voice for speaker A+B, so that the voices of speaker A, speaker B, and speaker A+B become the registered speaker voices.
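The combination step (S16) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each registered voice is a list of samples, mixes by plain sample-wise addition, and enumerates every combination of two or more speakers. The function names `mix` and `build_registered_voices` are hypothetical.

```python
from itertools import combinations


def mix(signals):
    """Add sample sequences element by element (assumed mixing rule).

    The result is truncated to the shortest input.
    """
    length = min(len(s) for s in signals)
    return [sum(s[i] for s in signals) for i in range(length)]


def build_registered_voices(voices):
    """Return the single voices plus one mixed voice per speaker combination.

    voices: dict mapping a single speaker name to its list of samples.
    Combined entries are keyed as "A+B", following the document's notation.
    """
    registered = dict(voices)
    names = sorted(voices)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            registered["+".join(combo)] = mix([voices[n] for n in combo])
    return registered


voices = {"A": [1, 2, 3], "B": [10, 20, 30]}
registered = build_registered_voices(voices)
print(sorted(registered))  # names of all registered voices
```

Enumerating every combination grows quickly with the number of speakers, so a practical system would likely restrict the combinations; the text only says "arbitrary combinations".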
[0043]
The feature amount extraction unit 104 extracts a voice feature amount from every registered speaker voice using a general feature extraction method for voice signals, typified by linear prediction (step S17), and records these voice feature amounts as feature vectors in the audio database 1 of the feature amount storage unit 105 (step S18).
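The text names linear prediction only as one representative feature, without giving details. As a rough sketch under that assumption, the following computes linear prediction coefficients with the autocorrelation method and the Levinson-Durbin recursion. The function names and the chosen order are illustrative, not taken from the patent.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sample sequence x."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Coefficients a1..a_order of the prediction-error filter
    A(z) = 1 + a1 z^-1 + ... + a_order z^-order (Levinson-Durbin)."""
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:
            break  # silent or perfectly predicted input; stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1 - k * k
    return a[1:]


# A decaying sequence x[n] = 0.5**n is predicted by x[n] = 0.5 * x[n-1].
signal = [0.5 ** n for n in range(20)]
feature_vector = lpc(signal, 2)
```

For the decaying test signal the first coefficient comes out close to -0.5 and the second close to 0, which matches the one-tap predictor that generated it.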
[0044]
<Speaker search phase>
FIG. 5 is a flowchart illustrating the operation of the speaker search phase according to the first embodiment. First, the video (video/audio) of the program to be searched is input via the input unit 101, and the video/audio cutout unit 106 divides the input video/audio at short time intervals to cut out short section video/audio (step S20). The feature amount extraction unit 104 extracts a voice feature amount (feature vector) from each cut-out short section video/audio (step S21). The length of a short section is a predetermined value of, for example, about 10 ms to 100 ms, but embodiments of the present invention are not limited to this length.
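The cut-out of step S20 can be sketched as below, assuming the audio track is already available as a list of samples. The function name, the 50 ms default, and the choice to drop a trailing partial section are assumptions made for the illustration.

```python
def split_into_short_sections(samples, sample_rate, section_ms=50):
    """Split samples into consecutive, non-overlapping short sections.

    section_ms is the section length in milliseconds (the text suggests
    roughly 10 ms to 100 ms). A trailing partial section is dropped.
    """
    section_len = int(sample_rate * section_ms / 1000)
    if section_len <= 0:
        raise ValueError("section length must be at least one sample")
    last_start = len(samples) - section_len
    return [samples[i:i + section_len]
            for i in range(0, last_start + 1, section_len)]


# 1000 samples at 8 kHz with 50 ms sections gives 400-sample sections.
sections = split_into_short_sections(list(range(1000)), 8000, 50)
```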
[0045]
The search unit 107 calculates the similarity between the feature vector of the short section video/audio extracted in step S21 and the feature vectors of all the registered speaker voices stored in the feature amount storage unit 105 during the <speaker voice registration phase> (step S22), and takes the registered speaker information of the registered speaker voice with the highest similarity as the search result (step S23). Steps S20 to S23 are repeated from the program start time to the program end time (step S24), yielding a search result for every short section video/audio.
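Steps S22 and S23 amount to a nearest-neighbour lookup over the registered feature vectors. The text does not specify the similarity measure, so this sketch assumes negative Euclidean distance; `similarity`, `search_speaker`, and the toy two-dimensional vectors are hypothetical.

```python
import math


def similarity(u, v):
    """Assumed similarity: negative Euclidean distance (larger is closer)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def search_speaker(feature, database):
    """Return the registered name whose feature vector is most similar."""
    return max(database, key=lambda name: similarity(feature, database[name]))


# Toy database containing single voices and one combined voice.
database = {"A": [1.0, 0.0], "B": [0.0, 1.0], "A+B": [0.7, 0.7]}
short_section_features = [[0.9, 0.1], [0.6, 0.8], [0.1, 0.9]]
results = [search_speaker(f, database) for f in short_section_features]
print(results)
```

Because combined voices such as "A+B" are registered alongside the single voices, a section with overlapping speech can match the combined entry instead of being forced onto one speaker.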
[0046]
Next, the search results of the short section video/audio are grouped into runs of consecutive sections in time-series order (for example, 100 at a time; hereinafter, this unit is referred to as a medium/short section), and the search results are totaled for each medium/short section (step S25). The totals are sorted in descending order of the number of appearances to generate a result list (step S26), which is output as the registered speaker voice search result for that one medium/short section video/audio (step S27). Steps S25 to S27 are repeated over the search results of all the short section video/audio (step S28).
[0047]
Here, assuming that a medium/short section video/audio is a set of, for example, 100 short section video/audio, the breakdown of the 100 search results in the medium/short section, written in the form "speaker name: number of appearances", looks like, for example, [Speaker A: 50, Speaker A+B: 20, Speaker B+C: 10, Speaker A+C: 3, Speaker D: 2, Speaker B+D: 0, ...]. Sorting this list in descending order of the number of appearances gives the result list, which serves as the speaker search result candidates for one medium/short section video/audio.
[0048]
FIG. 6 is a diagram illustrating an example of creating a result list from search results according to the first embodiment. (A) in the figure is an example of the search result for each short section video/audio, where the search result for each short section is given as a speaker name. These search results are totaled for each medium/short section. In the example of FIG. 6, six short sections constitute one medium/short section. The result list is obtained by totaling the search results and sorting them by the number of appearances. (B) in the figure shows an example of the result list for each medium/short section. For example, the result list of the first medium/short section shows that speaker A appears three times, speaker A+B twice, and speaker A+C once.
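The totaling and sorting of steps S25 and S26 can be sketched with a counter. The group size of six follows the FIG. 6 example; the order of the six per-section results below is invented for illustration, and `build_result_lists` is a hypothetical name.

```python
from collections import Counter


def build_result_lists(short_results, group_size):
    """Group per-section results and count appearances in each group.

    Returns one result list per medium/short section, each a list of
    (speaker name, appearances) sorted by appearances, highest first.
    """
    result_lists = []
    for start in range(0, len(short_results), group_size):
        group = short_results[start:start + group_size]
        result_lists.append(Counter(group).most_common())
    return result_lists


# Six short sections forming one medium/short section.
short_results = ["A", "A+B", "A", "A+C", "A+B", "A"]
result_lists = build_result_lists(short_results, 6)
print(result_lists[0])
```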
[0049]
In the present embodiment, steps S25 to S28 are defined as the speaker search phase. However, this part may be executed as the following utterance section determination phase, and the overall operation does not change.
[0050]
<Utterance section determination phase>
FIG. 7 is a flowchart illustrating the operation of the utterance section determination phase according to the first embodiment. In the <utterance section determination phase> according to the first embodiment, the utterance section of a specific speaker is accurately determined from video and audio that include simultaneous utterances by a plurality of speakers, by processing the result list according to the following flow.
[0051]
First, the search result processing unit 108 inputs a result list of one medium / short section (step S30). It is determined whether there is a single speaker name within the top n items in the input result list (step S31), and if not, the process proceeds to step S38. n is a preset value.
[0052]
Here, for example, when n = 5 and a certain result list is [Speaker A: 50, Speaker A+B: 20, Speaker B+C: 10, Speaker A+C: 3, Speaker D: 2, Speaker B+D: 0, ...], "speaker A" and "speaker D" are determined to be the single speaker names included in the top five entries.
[0053]
If at least one single speaker name is included in the top n entries of the result list, the highest-ranked of the single speaker names is set as P_a (step S32), and the number of appearances of P_a is set as the total number of appearances (step S33). It is then determined whether the top n entries of the result list contain results for simultaneous utterances of a plurality of speakers that include P_a (step S34). If so, the numbers of appearances of all the simultaneous utterances that include P_a are added to the number of appearances of P_a alone, and the sum is set as the total number of appearances of P_a (step S35).
[0054]
Here, as in the above example, when n = 5 and a certain result list is [Speaker A: 50, Speaker A+B: 20, Speaker B+C: 10, Speaker A+C: 3, Speaker D: 2, Speaker B+D: 0, ...], the highest-ranked single speaker name, "speaker A", becomes P_a. Among the simultaneous utterances of a plurality of speakers, those that include P_a are "speaker A+B" and "speaker A+C". Adding the numbers of appearances of "speaker A+B" and "speaker A+C" to the number of appearances of P_a alone, the total number of appearances of P_a is
50 + 20 + 3 = 73.
[0055]
FIG. 8 is a diagram illustrating a method of calculating the total number of appearances according to the first embodiment. In the result list of the medium/short section in the example of FIG. 8, speaker A is a single speaker name within the top n = 5, so speaker A becomes P_a. Among the simultaneous utterances of a plurality of speakers within the top n = 5, those that include speaker A are "speaker A+B", "speaker A+C", and "speaker A+D" in the example of FIG. 8. The total number of appearances of speaker A, that is, the number of appearances of speaker A alone plus those of "speaker A+B", "speaker A+C", and "speaker A+D", is
10 + 9 + 7 + 2 = 28.
[0056]
If the total number of appearances of P_a in the medium/short section exceeds a predetermined threshold T (step S36), P_a is taken as the speaker name of that medium/short section video/audio (step S37).
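Steps S31 to S37 can be sketched as one function over a single result list. The sketch assumes names of simultaneous utterances are written with "+" as in the examples above; the function name `decide_speaker` and the threshold value 60 are illustrative only.

```python
def decide_speaker(result_list, n, threshold):
    """Decide the speaker of one medium/short section.

    result_list: (name, appearances) pairs sorted by appearances, highest
    first; combined names use "+", e.g. "A+B".
    Returns (speaker or None, total appearances of P_a).
    """
    top = result_list[:n]
    singles = [name for name, _ in top if "+" not in name]
    if not singles:
        return None, 0  # no single speaker name in the top n (S31: no)
    p_a = singles[0]  # highest-ranked single speaker name (S32)
    # P_a alone plus every combined entry that includes P_a (S33 to S35).
    total = sum(count for name, count in top if p_a in name.split("+"))
    return (p_a if total > threshold else None), total  # S36, S37


result_list = [("A", 50), ("A+B", 20), ("B+C", 10),
               ("A+C", 3), ("D", 2), ("B+D", 0)]
speaker, total = decide_speaker(result_list, n=5, threshold=60)
print(speaker, total)
```

With the example list from the text, the total for speaker A is 50 + 20 + 3 = 73, so any threshold below 73 assigns the section to speaker A.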
[0057]
The processing of steps S30 to S37 is executed for the result lists of all the medium/short sections (step S38). The combinations of speaker name and time information for all the medium/short section video/audio in the video and audio are stored in the speaker information storage unit 109 as the speaker information of the utterance sections (step S39).
[0058]
<Speaker information display phase>
FIG. 9 is a flowchart illustrating the operation of the speaker information display phase according to the first embodiment. In this phase, speaker information is displayed on the terminal display device 20 in accordance with a request from the user.
[0059]
First, the display unit 110 receives a request input from the user (step S40) and determines whether the input is a speaker name or a time (step S41). If the input is a speaker name, the speaker information in the speaker information storage unit 109 is searched by that speaker name (step S42), and the time information of all the medium/short sections in which that speaker speaks is displayed visually on the terminal display device 20 (step S43). If the input is a time in step S41, the speaker information in the speaker information storage unit 109 is searched by that time (step S44), and the name of the speaker who is speaking at that time is displayed on the terminal display device 20 (step S45).
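The two lookups of steps S41 to S45 can be sketched as follows. The sketch assumes the stored speaker information is a list of (start, end, speaker name) tuples in seconds and that a string request means a speaker name while a number means a time; these representations and all function names are assumptions.

```python
def sections_for_speaker(speaker_info, name):
    """All (start, end) sections in which the named speaker speaks (S42)."""
    return [(start, end) for start, end, speaker in speaker_info
            if speaker == name]


def speaker_at(speaker_info, t):
    """Name of the speaker speaking at time t, or None (S44)."""
    for start, end, speaker in speaker_info:
        if start <= t < end:
            return speaker
    return None


def handle_request(speaker_info, request):
    """Dispatch on the kind of request (S41): name string or time value."""
    if isinstance(request, str):
        return sections_for_speaker(speaker_info, request)
    return speaker_at(speaker_info, request)


speaker_info = [(0.0, 5.0, "A"), (5.0, 10.0, "B"), (10.0, 15.0, "A")]
print(handle_request(speaker_info, "A"))
print(handle_request(speaker_info, 7.5))
```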
[0060]
FIG. 10 shows an example of the speaker information display screen displayed in step S43. Here, the name of the speaker is displayed on the left side of the screen together with the video/audio playback screen, and on the right side of the screen the speaker's name is displayed as person information together with information on the time periods during which that speaker is speaking. This makes it easy to see when a particular speaker is speaking.
[0061]
FIG. 11 shows an example of the speaker information display screen displayed in step S45. Here, the time of the playback screen is displayed on the left side of the screen together with the video/audio playback screen, and speaker information such as the name and affiliation of the speaker at the designated time is displayed as person information on the right side of the screen. This makes it easy to see who is speaking at a given time.
[0062]
[Embodiment 2]
FIG. 12 is a diagram showing a configuration example of an utterance section search device according to Embodiment 2 of the present invention. The utterance section search device 10' according to the second embodiment is a computer including a CPU, a memory, and the like, and comprises an input unit 101, a candidate video/audio presentation unit 102, a registration voice synthesis unit 103, a feature amount extraction unit 104, a feature amount storage unit 105, a video/audio cutout unit 106, a search unit 107, a search result processing unit 108, a speaker information storage unit 109, a display unit 110, and a video/audio reselection unit 111, each implemented by a software program and a storage device. Further, a terminal display device 20 is connected to the utterance section search device 10' according to the second embodiment.
[0063]
The second embodiment differs from the first embodiment in that it has a video/audio reselection unit 111 and thereby provides a function of resetting the voice feature amounts of the registered speaker voices stored in the feature amount storage unit 105, based on the speaker information displayed in the <speaker information display phase>.
[0064]
The utterance section search device 10 'according to the second embodiment performs the operation of the <speaker voice re-registration phase> after the operation of the first embodiment. Hereinafter, the operation of the <speaker voice re-registration phase> in the utterance section search device 10 'will be described with reference to a flowchart.
[0065]
<Speaker voice re-registration phase>
FIG. 13 is a flowchart for explaining the operation of the speaker voice re-registration phase in the second embodiment. In the second embodiment, the user can correct the registered speaker voice of a desired speaker using the result of the utterance section search. For example, suppose that sections such as T₀ to T₁ and T₂ to T₃ are obtained as the utterance sections of the desired speaker (speaker P_a). If, on actually checking the result on the terminal display device 20, the user finds that T₀ to T₁ is not the desired speaker whereas T₂ to T₃ and T₄ to T₅ are correct results, and wishes to re-register these as the registered speaker voice, the registered speaker voice can be re-registered using T₂ to T₃ and T₄ to T₅.
[0066]
First, the video/audio reselection unit 111 receives the user's selection of the video/audio to be re-registered as the registered speaker voice of speaker P_a (step S50) and sends that video/audio to the registration voice synthesis unit 103 as the registered speaker voice of P_a. The registration voice synthesis unit 103 synthesizes voices for arbitrary combinations of the user-selected registered speaker voice of speaker P_a and the other registered speaker voices, generating voices in which a plurality of registered speaker voices, including the one selected by the user, are combined (step S51).
[0067]
The feature amount extraction unit 104 extracts voice feature amounts from the user-selected registered speaker voice of speaker P_a and from the voices in which a plurality of registered speaker voices, including the one selected by the user, are combined (step S52). The voice feature amounts stored until then in the feature amount storage unit 105 are overwritten with the newly obtained voice feature amounts (step S53).
[0068]
Through the above series of operations, the user can, for example, newly set the video/audio of T₂ to T₃ as the registered speaker voice of speaker P_a, and the registered speaker voices generated by combining a plurality of registered speaker voices including that of speaker P_a are likewise replaced with new ones.
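The overwrite of steps S50 to S53 can be sketched as follows. This is a toy illustration: the feature extractor is a stand-in (mean and range of the samples) rather than a real voice feature, mixing is plain addition, and the names `reregister`, `toy_feature`, and `mix` are hypothetical.

```python
def mix(signals):
    """Add sample sequences element by element (assumed mixing rule)."""
    length = min(len(s) for s in signals)
    return [sum(s[i] for s in signals) for i in range(length)]


def toy_feature(samples):
    """Stand-in feature: mean and range. Not a real voice feature."""
    return [sum(samples) / len(samples), max(samples) - min(samples)]


def reregister(voices, features, p_a, new_samples, extract=toy_feature):
    """Replace P_a's voice and overwrite every feature entry that includes P_a.

    voices: single speaker name -> samples.
    features: registered name ("A", "A+B", ...) -> feature vector.
    """
    voices[p_a] = new_samples  # S50: user-selected replacement voice
    for key in list(features):
        members = key.split("+")
        if p_a in members:
            combined = mix([voices[m] for m in members])  # S51
            features[key] = extract(combined)  # S52, S53: overwrite


voices = {"A": [1, 2, 3], "B": [10, 20, 30]}
features = {"A": toy_feature([1, 2, 3]),
            "B": toy_feature([10, 20, 30]),
            "A+B": toy_feature([11, 22, 33])}
reregister(voices, features, "A", [4, 5, 6])
```

Entries that do not include the re-registered speaker, such as "B" above, are left untouched.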
[0069]
Although Embodiments 1 and 2 have been described above, in the present invention the registered speaker voice can be created not only from the program to be searched but also from a program other than the program to be searched. Also, by registering BGM as a registered speaker voice, synthesizing voices for arbitrary combinations of the BGM registered voice and the other registered speaker voices, and registering their voice feature amounts, it is also possible to search for utterance sections when there is a sound effect in the background.
[0070]
In addition, when the registration voice synthesis unit 103 synthesizes voices, it is also possible to set arbitrary weights for the loudness and pitch of each registered speaker voice before combining them.
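Loudness weighting can be sketched as a weighted sum of the registered voices. The sketch covers loudness only, since pitch weighting would need resampling or a similar technique that the text does not describe. The function name `weighted_mix` and the weight values are assumptions.

```python
def weighted_mix(signals, weights):
    """Mix sample sequences with one loudness weight per sequence.

    The result is truncated to the shortest input.
    """
    if len(signals) != len(weights):
        raise ValueError("one weight is required per signal")
    length = min(len(s) for s in signals)
    return [sum(w * s[i] for s, w in zip(signals, weights))
            for i in range(length)]


# Speaker A at full loudness, speaker B at half loudness.
mixed = weighted_mix([[1, 2], [10, 20]], [1.0, 0.5])
print(mixed)
```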
[0071]
The example of searching for a human speech section in a program video has been described above, but it goes without saying that the present invention can be applied to general voices other than human voices.
[0072]
The following examples can be considered as application examples of the present invention.
(1) Used to detect a speaker's utterance section from video / audio such as streaming video, video, and television programs.
(2) Used to support the work of producing minutes from a conference call or the like recorded with a single sound-collecting microphone. In the audio of a program or the like, the registered sound is not always present on its own in the audio signal; in such cases, searching for the sound generation section using the present invention is effective.
(3) Used to count the number of times a horn is sounded in general ambient sound. Although it is easy to register horn sound alone, it is generally difficult to accurately detect horn sound mixed with environmental sound. This is because there are variations in the environmental sound, and the sound of the horn is also distorted due to the Doppler effect. By applying the present invention, it is possible to accurately detect even such a case.
(4) Used to judge the animal's cry in the forest.
(5) Used to detect the occurrence of unusual sound in a machine operating in a steady state.
[0073]
[Effects of the Invention]
As described above, according to the present invention, for audio such as that of a television program in which a plurality of speakers speak simultaneously, or video that includes sound effects in the background, the user merely registers the single voice of a desired speaker from among the presented candidates, and the utterance sections of that speaker can then be found accurately even in video containing portions where a plurality of speakers speak at once. Further, the user can re-create the registered speaker voice using the result of the utterance section search. The invention can be used to recognize voice sources and search for voice generation sections not only for human voices but also for natural sounds.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining an outline of the present invention.
FIG. 2 is a configuration diagram of an apparatus according to the present invention.
FIG. 3 is a diagram illustrating a configuration example of an utterance section search device according to the first embodiment of the present invention.
FIG. 4 is a flowchart illustrating an operation of a speaker voice registration phase according to the first embodiment.
FIG. 5 is a flowchart illustrating an operation in a speaker search phase according to the first embodiment.
FIG. 6 is a diagram showing an example of creating a result list from search results according to the first embodiment.
FIG. 7 is a flowchart illustrating an operation of an utterance section determination phase according to the first embodiment.
FIG. 8 is a diagram illustrating a method of calculating the total number of appearances according to the first embodiment.
FIG. 9 is a flowchart illustrating an operation of a speaker information display phase according to the first embodiment.
FIG. 10 is a diagram showing an example of a speaker information display screen according to the first embodiment.
FIG. 11 is a diagram showing an example of a speaker information display screen according to the first embodiment.
FIG. 12 is a diagram showing a configuration example of an utterance section search device according to a second embodiment of the present invention.
FIG. 13 is a flowchart illustrating an operation of a speaker voice re-registration phase according to the second embodiment.
[Explanation of symbols]
1 Audio database
10, 10' Utterance section search device
11 Speaker information registration means
12 Voice signal combination means
13 Voice feature extraction means
14 Feature storage means
15 Speaker search means
16 Speaker search result processing means
17 Utterance section information display means
101 Input unit
102 Candidate video / audio presentation unit
103 Registration voice synthesis unit
104 Feature Extraction Unit
105 Feature storage
106 Video / Audio Extraction Unit
107 Search unit
108 Search result processing unit
109 Speaker information storage
110 Display
111 Video / Audio Reselection Unit
20 Terminal display device

Claims

1. A speech database registration processing method for registering learning data in a speech database that stores speech source information and feature amounts of the speech emitted by each speech source, for use in recognizing the speech source of a speech signal whose speech source is unknown, wherein:
a speech signal emitted by each speech source to be recognized is input and its speech feature amount is extracted; the input speech signals of a plurality of speech sources are combined to synthesize a speech signal of the plurality of speech sources, and the speech feature amount of the synthesized speech signal is extracted; and correspondence information between each item of speech source information and its speech feature amount, and correspondence information between the combined plurality of items of speech source information and the corresponding speech feature amount, are registered in the speech database.

2. A speech source recognition method for recognizing the speech source of a speech signal whose speech source is unknown by collation with speech feature amounts of speech sources registered in advance in a speech database, comprising:
a speech database registration step of inputting a speech signal emitted by each speech source to be recognized and extracting its speech feature amount, combining the input speech signals of a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering, in the speech database, correspondence information between each item of speech source information and its speech feature amount and correspondence information between the combined plurality of items of speech source information and the corresponding speech feature amount; and
a speech source search step of inputting a speech signal whose speech source is unknown, extracting a speech feature amount from the input speech signal, and recognizing the speech source, which may comprise a plurality of speech sources, by collation with the speech feature amounts of the speech sources registered in the speech database.

3. A voice generation section search method for searching a speech signal for the sections in which a specific speech source emits voice, comprising:
a speech database registration step of inputting a speech signal emitted by each speech source to be recognized and extracting its speech feature amount, combining the input speech signals of a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering, in the speech database, correspondence information between each item of speech source information and its speech feature amount and correspondence information between the combined plurality of items of speech source information and the corresponding speech feature amount;
a speech source search step of inputting a speech signal to be searched for voice generation sections, dividing the input speech signal into sections of a predetermined time unit, extracting a speech feature amount from each section, and recognizing the speech source of each section, which may comprise a plurality of speech sources, by collation with the speech feature amounts of the speech sources registered in the speech database;
a speech source search result processing step of totaling, for each group of a predetermined plurality of sections, the speech source search results obtained for the individual sections in the speech source search step, and determining the voice generation section of the specific speech source based on the number of appearances of each speech source; and
a voice generation section information output step of outputting the voice generation section information obtained in the speech source search result processing step.

4. The voice generation section search method according to claim 3, wherein,
in the voice generation section information output step, information on all time periods of the speech signal in which a designated speech source emits voice is displayed, or information on the speech source emitting voice at a designated time in the speech signal is displayed.

5. The voice generation section search method according to claim 3 or 4,
further comprising a speech database re-registration step of re-registering, in the speech database, speech source information and a speech feature amount re-designated by the user on the basis of the specific speech source information presented to the user in the voice generation section information output step and the corresponding speech signal.

6. A speech database registration processing device for registering learning data in a speech database that stores speech source information and feature amounts of the speech emitted by each speech source, for use in recognizing the speech source of a speech signal whose speech source is unknown, comprising:
means for inputting a speech signal emitted by each speech source to be recognized;
means for combining the input speech signals of a plurality of speech sources to synthesize a speech signal of the plurality of speech sources;
means for extracting a speech feature amount from each of the input speech signals and the synthesized speech signal; and
means for registering, in the speech database, correspondence information between each item of speech source information and its speech feature amount, and correspondence information between the combined plurality of items of speech source information and the corresponding speech feature amount.

7. A speech source recognition device that recognizes the speech source of a speech signal whose speech source is unknown by collation with speech feature amounts of speech sources registered in advance in a speech database, comprising:
speech database registration means for inputting a speech signal emitted by each speech source to be recognized and extracting its speech feature amount, combining the input speech signals of a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering, in the speech database, correspondence information between each item of speech source information and its speech feature amount and correspondence information between the combined plurality of items of speech source information and the corresponding speech feature amount; and
speech source search means for inputting a speech signal whose speech source is unknown, extracting a speech feature amount from the input speech signal, and recognizing the speech source, which may comprise a plurality of speech sources, by collation with the speech feature amounts of the speech sources registered in the speech database.

8. A voice generation section search device that searches a speech signal for the sections in which a specific speech source emits voice, comprising:
speech database registration means for inputting a speech signal emitted by each speech source to be recognized and extracting its speech feature amount, combining the input speech signals of a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering, in the speech database, correspondence information between each item of speech source information and its speech feature amount and correspondence information between the combined plurality of items of speech source information and the corresponding speech feature amount;
speech source search means for inputting a speech signal to be searched for voice generation sections, dividing the input speech signal into sections of a predetermined time unit, extracting a speech feature amount from each section, and recognizing the speech source of each section, which may comprise a plurality of speech sources, by collation with the speech feature amounts of the speech sources registered in the speech database;
speech source search result processing means for totaling, for each group of a predetermined plurality of sections, the speech source search results obtained for the individual sections by the speech source search means, and determining the voice generation section of the specific speech source based on the number of appearances of each speech source; and
voice generation section information output means for outputting the voice generation section information obtained by the speech source search result processing means.

9. The voice generation section search device according to claim 8, wherein
the voice generation section information output means displays information on all time periods of the speech signal in which a designated speech source emits voice, or displays information on the speech source emitting voice at a designated time in the speech signal.

10. The voice generation section search device according to claim 8 or 9,
further comprising speech database re-registration means for re-registering, in the speech database, speech source information and a speech feature amount re-designated by the user on the basis of the specific speech source information presented to the user by the voice generation section information output means and the corresponding speech signal.

11. A speech database registration processing program for causing a computer to execute a method of registering learning data in a speech database that stores speech source information and feature amounts of the speech emitted by each speech source, for use in recognizing the speech source of a speech signal whose speech source is unknown,
the program causing the computer to execute processing of inputting a speech signal emitted by each speech source to be recognized and extracting its speech feature amount, combining the input speech signals of a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering, in the speech database, correspondence information between each item of speech source information and its speech feature amount and correspondence information between the combined plurality of items of speech source information and the corresponding speech feature amount.

A speech source recognition program for causing a computer to execute a speech source recognition method that recognizes the speech source of a speech signal whose speech source is unknown by collating it with speech feature amounts of speech sources registered in advance in a speech database, the program causing the computer to execute:
a speech database registration process of inputting a speech signal from each speech source to be recognized and extracting its speech feature amount, combining the speech signals from a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering in the speech database correspondence information between each item of speech source information and its speech feature amount, and correspondence information between each combination of a plurality of items of speech source information and the speech feature amount of the corresponding synthesized speech signal; and
a speech source search process of inputting a speech signal whose speech source is unknown, extracting a speech feature amount from the input speech signal, and recognizing the speech source, which may comprise a plurality of speech sources, by collating the extracted speech feature amount with the speech feature amounts of the speech sources registered in the speech database.
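
As an illustration only, the collation in the speech source search process can be sketched as a nearest-neighbour lookup. This is a sketch under stated assumptions: the claim does not fix a distance measure, so Euclidean distance is used here, and the database layout (a frozenset of source labels mapped to a feature tuple), the feature values, and the name `recognize` are hypothetical.

```python
import math


def recognize(db, feature):
    """Return the registered label set whose feature is nearest to `feature`.

    The result may name several sources, because the database also holds
    features of signals synthesized from combinations of sources.
    """
    return min(db, key=lambda labels: math.dist(db[labels], feature))


# Hypothetical database with made-up two-dimensional features.
db = {
    frozenset({"A"}): (1.0, 0.9),
    frozenset({"B"}): (0.5, 0.3),
    frozenset({"A", "B"}): (1.4, 0.6),
}
result = recognize(db, (1.3, 0.65))
```

Here `result` is the entry for A and B together, so the unknown signal is recognized as a section in which both sources sound at once.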

A speech generation section search program for causing a computer to execute a speech generation section search method that searches a speech signal for the sections in which a specific speech source produces speech, the program causing the computer to execute:
a speech database registration process of inputting a speech signal from each speech source to be recognized and extracting its speech feature amount, combining the speech signals from a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering in the speech database correspondence information between each item of speech source information and its speech feature amount, and correspondence information between each combination of a plurality of items of speech source information and the speech feature amount of the corresponding synthesized speech signal;
a speech source search process of inputting a speech signal to be searched for speech generation sections, dividing the input speech signal into sections of a predetermined time unit, extracting a speech feature amount from each section, and recognizing the speech source, which may comprise a plurality of speech sources, by collating the extracted speech feature amount with the speech feature amounts of the speech sources registered in the speech database;
a speech source search result process of totaling the speech source search results of the individual sections, obtained by the speech source search process, over each group of a predetermined number of sections, and determining the speech generation sections of the specific speech source based on the number of appearances of that speech source; and
a speech generation section information output process of outputting the speech generation section information obtained by the speech source search result process.
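
As an illustration only, the speech source search result process can be sketched as a count over fixed-size groups of short sections. This is a sketch under stated assumptions: the claim states only that the decision is based on the number of appearances, so the minimum-count threshold, the merging of adjacent groups, and the names `speaking_sections`, `window`, `min_count`, and `unit_sec` are hypothetical.

```python
def speaking_sections(results, target, window=5, min_count=3, unit_sec=1.0):
    """Determine the time intervals in which `target` is judged to be speaking.

    results  : per-short-section recognition results, each a set of labels
    window   : number of short sections totaled together
    min_count: appearances needed within a group (an assumed decision rule)
    unit_sec : duration of one short section in seconds
    """
    spans = []  # (start, end) in units of short sections
    for start in range(0, len(results), window):
        block = results[start:start + window]
        count = sum(1 for labels in block if target in labels)
        if count >= min_count:
            end = start + len(block)
            if spans and spans[-1][1] == start:
                spans[-1] = (spans[-1][0], end)  # merge with the adjacent group
            else:
                spans.append((start, end))
    return [(b * unit_sec, e * unit_sec) for b, e in spans]


# Hypothetical per-section results for a 15-second signal.
results = [{"A"}] * 3 + [{"A", "B"}] * 2 + [{"B"}] * 5 + [{"A"}] * 4 + [{"B"}]
sections_a = speaking_sections(results, "A")
sections_b = speaking_sections(results, "B")
```

With these inputs, A is reported for 0 to 5 s and 10 to 15 s, and B for 5 to 10 s. B's two appearances in the first group fall below the assumed threshold, and the single stray B at the end of the last group is outvoted by A.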

A recording medium for a speech database registration processing program, on which is recorded a program for causing a computer to execute a method of registering learning data in a speech database that stores speech source information and feature amounts of the speech produced by each speech source, in order to recognize the speech source of a speech signal whose speech source is unknown,
the program causing the computer to execute a process of inputting a speech signal from each speech source to be recognized and extracting its speech feature amount, combining the speech signals from a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering in the speech database correspondence information between each item of speech source information and its speech feature amount, and correspondence information between each combination of a plurality of items of speech source information and the speech feature amount of the corresponding synthesized speech signal.

A recording medium for a speech source recognition program, on which is recorded a program for causing a computer to execute a speech source recognition method that recognizes the speech source of a speech signal whose speech source is unknown by collating it with speech feature amounts of speech sources registered in advance in a speech database, the program causing the computer to execute:
a speech database registration process of inputting a speech signal from each speech source to be recognized and extracting its speech feature amount, combining the speech signals from a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering in the speech database correspondence information between each item of speech source information and its speech feature amount, and correspondence information between each combination of a plurality of items of speech source information and the speech feature amount of the corresponding synthesized speech signal; and
a speech source search process of inputting a speech signal whose speech source is unknown, extracting a speech feature amount from the input speech signal, and recognizing the speech source, which may comprise a plurality of speech sources, by collating the extracted speech feature amount with the speech feature amounts of the speech sources registered in the speech database.

A recording medium for a speech generation section search program, on which is recorded a program for causing a computer to execute a speech generation section search method that searches a speech signal for the sections in which a specific speech source produces speech, the program causing the computer to execute:
a speech database registration process of inputting a speech signal from each speech source to be recognized and extracting its speech feature amount, combining the speech signals from a plurality of speech sources to synthesize a speech signal of the plurality of speech sources, extracting the speech feature amount of the synthesized speech signal, and registering in the speech database correspondence information between each item of speech source information and its speech feature amount, and correspondence information between each combination of a plurality of items of speech source information and the speech feature amount of the corresponding synthesized speech signal;
a speech source search process of inputting a speech signal to be searched for speech generation sections, dividing the input speech signal into sections of a predetermined time unit, extracting a speech feature amount from each section, and recognizing the speech source, which may comprise a plurality of speech sources, by collating the extracted speech feature amount with the speech feature amounts of the speech sources registered in the speech database;
a speech source search result process of totaling the speech source search results of the individual sections, obtained by the speech source search process, over each group of a predetermined number of sections, and determining the speech generation sections of the specific speech source based on the number of appearances of that speech source; and
a speech generation section information output process of outputting the speech generation section information obtained by the speech source search result process.