JP2004302175A

JP2004302175A - System, method, and program for speech recognition

Info

Publication number: JP2004302175A
Application number: JP2003095410A
Authority: JP
Inventors: Yasumasa Nakada (中田安優); Takeshi Osawa (大澤岳史); Tetsuji Osaka (大坂哲司); Isao Sato (佐藤功); Hironobu Takahashi (高橋裕信); Hiroo Yamashita (山下浩生); Kenshin Cho (張建新)
Original assignee: FUJIMIKKU KK; Fuji Television Network Inc
Current assignee: FUJIMIKKU KK; Fuji Television Network Inc
Priority date: 2003-03-31
Filing date: 2003-03-31
Publication date: 2004-10-28

Abstract

PROBLEM TO BE SOLVED: To accurately detect speech uttered during a broadcast by using existing speech recognition technology.

SOLUTION: The system is equipped with a speech input part 601 that inputs a speech signal; a document/scenario input part 604 that inputs document data including text data; a speech phoneme conversion part 603 that converts the speech input from the speech input part 601 into a speech phoneme series; a text phoneme conversion part 606 that converts the text data input from the document/scenario input part 604 into a text phoneme series; a collation part (a first detecting collation part 608 and a second detecting collation part 610) that collates the speech phoneme series with the text phoneme series to decide whether they match; and a collation result output part 611 that, when the speech phoneme series and the text phoneme series match, outputs the text data corresponding to the matching phoneme series as a detection result.

COPYRIGHT: (C)2005, JPO & NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ビデオストリームや音声ストリームなどからなるマルチメディアコンテンツに含まれる音声信号を認識する音声認識システム、音声認識方法及び音声認識プログラムに関する。
【０００２】
【従来の技術】
従来、マルチメディアコンテンツは、ビデオストリームと音声ストリームから構成されるのが一般的である。近年にあっては、このビデオストリームに関する応用方法が進み、その一つとして、ビデオストリームにインデックスを付与するいわゆるインデキシング技術がある。このインデキシングとしては、例えば、ビデオストリームに対して、ビデオストリームの検出情報と同期したタイムコードを付与し、このタイムコードに基づいて映像の頭出しができ、このタイムコードをサムネイル表示等のインターフェースと連携させることにより、シーンチェンジ検出やハイライトシーンなど映像上の特徴を、簡単なユーザー操作で検索することが可能となる。
【０００３】
近年、このインデキシングの解析方法は盛んに研究されており、この技術を応用して、「このＣＭ」、「こんなイメージのシーン」等の抽象的なキーワードを用いて、希望する映像が写っているシーンの再生するなどの検索要求に答えられるものとなっている。
【０００４】
一方、音声ストリームに対しても同様に、音声認識などの技術を利用したインデキシング技術の開発もなされている。この音声ストリームに対するインデキシングとしては、例えば、事前に作成された電子化原稿を解析し、実際に放送されたテレビ番組のナレーションの音声認識を行うなど、テレビ放送の分野において良好な結果を得ている。このような音声認識によるインデキシングを応用することにより、特定発話語が認識された段階で警告を鳴らしたり、電子化原稿に対してその文を字幕として表示するなどのサービスを実行することが可能となる（例えば、特許文献１参照）。
【０００５】
【特許文献１】
特開２００２−２４４６９４号公報
【０００６】
【発明が解決しようとする課題】
しかしながら、上述したテレビ放送の分野におけるインデキシング技術は、例えばドキュメンタリー番組など予め放送内容が決定され、発話者も発話訓練を受けたアナウンサーやレポーターであり、良好な録音環境など、音声認識にとって好適に管理された環境に限定されて使用されている。
【０００７】
ところが、一般に連続発話に対する音声認識は、不特定話者対応、不特定内容対応、発話者の発声不完全性（例えば、「東京」を「とーきょー」と発話することが多い）、発話の多様性（「１１０番」は「いちいちぜろばん」、「ひゃくじゅうばん」、「ひゃくとうばん」）、背景音や発話の重畳、環境ノイズなどより、正確に認識することが困難であり、実用には至っておらず、まだ研究段階にある。
【０００８】
このため、例えばニュース報道の現場は、ドキュメンタリーのナレーションなどの理想的な環境と異なり、背景ノイズが多かったり、放送時間に追われ早口で話したりする場合があり、インタビューなどにおいては発話訓練を受けていない者を対象とする場合も多く、このような場合にまで上述した音声認識を適用するのは困難であるのが現状である。
【０００９】
また、ビデオストリームは早送りによって見る時間が短縮できるのに対して、音声ストリームでは、早送りなど時間を短縮した場合、人間による認識が困難となり、画像認識の技術をそのまま応用することができないという問題がある。
【００１０】
そこで、本発明は、以上の点に鑑みてなされたもので、既存の音声認識技術を利用し、放送中に発話される音声をリアルタイムに且つ高精度で、検出することのできる音声認識システム、音声認識方法及び音声認識プログラムを提供することをその目的とする。
【００１１】
【課題を解決するための手段】
上記課題を解決するために、本発明は、音声信号を入力するとともに、テキストデータを含む原稿データを入力し、入力された音声を音声音素列に変換するとともに、入力されたテキストデータをテキスト音素列に変換し、音声音素列とテキスト音素列との一致不一致を照合し、音声音素列とテキスト音素列とが一致する場合に、一致する音素列に対応するテキストデータを検出結果として出力する。
【００１２】
本発明によれば、音声情報をセンシングし、事前に準備した特定発話語若しくは電子化原稿に基づき、放送中の発話に一致する発話語若しくは発話文を検出・照合することができる。すなわち、本発明は、原稿や台本などの原稿データに基づいて発話される音声に対して、その電子化原稿の文と発話音声との照合処理を行い、その発話タイミングで、原稿の文をリアルタイムで検出する。
【００１３】
なお、本発明では、不特定話者、不特定内容並びにリアルタイムでの処理を行うために、照合処理に際し、音素処理を採用する。これにより、発話の淀み、言い直し、未知語に対応することができ、発話内容が決められないジャンルに対しても、本発明を適用することができる。
【００１４】
また、本発明では、検出照合処理にあたり、電子化原稿はテキスト−音素変換処理によってテキスト音素列に変換し、音声は音声−音素変換処理により音声音素列に変換する。そして、この両者の音素列を、例えば、連続動的計画法（連続ＤＰ：ＣｏｎｔｉｎｕｏｕｓＤｙｎａｍｉｃＰｒｏｇｒａｍｍｉｎｇ）により比較し、音声音素列と適合するテキスト音素列を検出する。
【００１５】
上記発明において、原稿データは、原稿の内容に応じて項目分けがされ、項目に応じてテキストデータを分割し、分割された各テキストデータの先頭文字列の範囲を決定し、範囲内の文字列を照合対象テキストとして抽出することが好ましい。
【００１６】
この場合には、電子化原稿は、項節若しくは章節のように項目分けし、構造化文書形態を採ることにより、文書の順番と発話の順番を保証することができ、これの特徴を利用し、全文を照合対象とすることなく、効率の良くしかも高速な照合処理が可能となる。
【００１７】
また、構造化文書で節（分割されたテキストデータ）にあたるテキストを一区切り単位（一息で発話できる文書量若しくは曖昧さを防ぐために設けられる間：ポーズで区切られる文書。以下、適宜「区切りテキスト」と称する。）で管理し、その文の先頭からの音節片（例えば、８音節程度とした）を、照合対象テキストとし、この照合対象テキストの音素列を検査音素列として照合処理を行うことにより、処理の高速化を図ることができ、発話に対してリアルタイムでテキストデータの検出を行うことが可能となる。
【００１８】
上記発明において、前記分割された各テキストデータには、優先度に応じた重み係数を付与し、重み係数に応じた順序で、照合対象テキストと音声音素との照合を行うことが好ましい。なお、上記発明においては、照合処理の進捗に応じて、照合済みの照合対象テキストを削除するとともに、未だ照合されていない照合対象テキストに付与された重み係数を逐次変動させることが好ましい。
【００１９】
この場合には、精度を低下させるいくつかの要因の内、総当りのテキスト音素照合における誤検出を防止することができる。すなわち、前者において、同じような内容が多く含まれている文は誤検出を生じ易い。照合精度を高めるため、前述の照合処理において、区切りテキストに対して原稿の順番に沿った優先順位を与えて誤検出に対応した。これにより、例えば原稿が「内閣は今日・・・」、「総理は今日・・・」の順番で用意されている場合、早く出現するテキストは後に現れるテキストよりも優先順位を高くすることにより、誤検出を回避することができる。
【００２０】
上記発明においては、照合対象テキストと、音声音素列との一致不一致を照合し、所定数の該当する照合対象テキストを検出候補として出力し、この出力された検出候補と音声音素列との一致不一致を照合し、検出結果を出力することが好ましい。
【００２１】
この場合には、一次照合で検出した照合候補に対して、文全体の照合を行う２段階で処理を行うことにより、処理の高速化を図ることができ、リアルタイムに電子化原稿文と音声の同期タイミングを図ることができる。
【００２２】
上記発明においては、音素列同士の一致度を比較するための閾値を保持し、この閾値を変動させることにより照合精度を調整することが好ましい。
【００２３】
例えばニュース報道の現場は、ドキュメンタリーのナレーションなどの理想的な環境と異なり、背景ノイズが多い場合であっても、連続ＤＰの閾値調整により、状況に応じた精度で認識を行うことができる。
【００２４】
上記発明においては、原稿データには、テキストデータの発話状況に関する発話状況情報が含まれ、発話状況情報に基づいて、音声の継続長を変化させることにより、変換速度を調整することが好ましい。
【００２５】
この場合には、例えば、テキストから音素列を生成するに際し、標準となるＡＴＲ５０３文の発話データから求められた音素継続長に対して、母音の継続長を早さに合わせ短くすることが可能となり、放送時間に追われ早口で話したりするようなときであっても、検出漏れを防止することができ、高い照合精度を得ることができる。
【００２６】
上記発明においては、出力されるテキストデータが所定の文字列に該当する場合に、警告処理を行うことが好ましい。これにより、特定の発話に対して警告を行うことができるため、不適切な発話が放送されるのを未然に防止することができる。
【００２７】
また、上記発明においては、検出結果を照合ログとして蓄積するとともに、音声信号が含まれる素材データを蓄積し、蓄積されたテキストデータと、素材データ中における当該テキストデータの位置とに基づいて、当該素材データを所望する位置から出力することが好ましい。さらに、上記発明においては、原稿データとして、ユーザーが任意に設定した文字列であるキーワードを入力し、検出結果を照合ログとして蓄積するとともに、音声信号が含まれる素材データを蓄積し、照合ログに含まれるキーワードと、素材データ中における当該キーワードの位置とに基づいて、当該素材データを所望する位置から出力することが好ましい。
【００２８】
このようなユーザーインターフェースを設けることにより、例えば、放送される映像に対して原稿データに基づいた字幕付与したり、映像にインデックスを付与しつつリアルタイムにＭＰＥＧ２エンコードを行い、装置内に素材データ（ビデオファイル）として蓄積することができる。また、検出したタイミングは、即ち照合ログ（発話テキスト）は、例えば、映像と同期してＭＰＥＧ７などのメタ情報としてファイル保存することが可能であり、このメタファイルとビデオファイルに基づいて、ユーザーが希望するシーンを表示することができる。
【００２９】
この結果、再生映像に合わせ、字幕のようにテキストを表示する機能、そのテキストが発話されている映像を表示する機能、検索によって希望する映像シーンを表示する機能などの機能が可能となる。
【００３０】
【発明の実施の形態】
［第１実施形態］
（システムの構成）
以下に、本発明の実施形態に係る音声認識システムについて詳細に説明する。図１は、本実施形態に係る音声認識システムの概略構成を示すブロック図である。
【００３１】
本実施形態に係る音声認識システムは、図１に示すように、蓄積ＰＣ１と、照合ＰＣ２と、時計サーバー３とがネットワーク４により接続されて構成される。
【００３２】
蓄積ＰＣ１は、映像信号と音声信号をＭＰＥＧ２エンコーダーに入力し、ＭＰＥＧ２フォーマットのデジタルビデオとしてファイル化し、蓄積する機能を有するとともに、照合用の電子化原稿、照合ログファイルなどシステムに関連するファイルを保持するサーバーとしての役割も果たす。照合ＰＣ２は、音声信号をＰＣのマイク入力から取り込み、デジタル化して音声処理を行う機能を有する。
【００３３】
時計サーバー３は、２台のＰＣ１及び２の時間を一致させるサーバー装置であり、基準時計サーバー装置や標準時計サーバーを用いることができる。なお、絶対時間を一致させる必要がない場合、時計サーバーを設けず、２台のＰＣ１，２間で時計同期を取る機能で代用することができる。
【００３４】
（蓄積ＰＣ１の構成）
蓄積ＰＣ１は、図２に示すように、ビデオ保存・音声照合結果保存プログラム８、照合結果再現プログラム１０を実行する。ビデオ保存・音声照合結果保存プログラム８は、照合処理の対象となる原稿データを原稿データベース９ａに蓄積する機能と、音声検出照合プログラム６と連動して、映像音声をデジタル化しデジタルビデオファイルとしてビデオファイルデータベース９ｃに保存する機能とを有するとともに、音声検出照合プログラム６による照合結果を照合ログファイルとして照合ログデータベース９ｂに保存する機能を有する。照合ログファイル並びにビデオファイルのファイル名は年月日時分を組み入れユニークな名前を自動的に発生して管理している。
【００３５】
照合結果再現プログラム１０は、照合ログファイルを用いてその発話があった時間を確認したり（精度確認のデバッグとして利用）、ビデオを再生しながら字幕を表示したりするプログラムである。
【００３６】
この照合ログファイルの内容は、連動するビデオファイル名などの設定情報と、発話テキスト、発話された標準時刻、音声検出照合プログラムのスタートを開始時間とする経過時間などの発話情報から構成される。標準時刻は、何時何分何秒にその発話があったかの確認を行う基準となるものである。また経過時間は、ビデオファイルと同期し、この時間を用いてタイムコードが示す時間のビデオ頭出しができる。
【００３７】
（照合ＰＣ２の構成）
照合ＰＣ２は、図２に示すように、音声検出照合プログラム６と照合結果出力プログラム７を実行する。音声検出照合プログラム６は、原稿データに基づいて音声を処理し、照合結果である照合ログを出力する機能を有するプログラムである。
【００３８】
照合結果出力プログラム７は、発話と同期して、その発話内容を業務に適した形で出力するプログラムである。本実施形態では、照合する原稿データが特定発話語若しくは特定発話文であった場合、それらの言葉が発せられたことを知らしめるため、アラームを鳴らす、パトランプを回す、音声ガイダンスを流すなどの警告処理を行う。また、照合結果出力プログラム７は、照合する原稿がアナウンサー原稿や台本の場合、発話に合わせ発話文を字幕として表示をする字幕放送に適応できる機能を有する。
【００３９】
ここで、照合ＰＣ２上で実行される音声検出照合プログラム６による音声検出照合処理機能について説明する。図３は、音声検出照合処理の機能を示すブロック図である。
【００４０】
同図に示すように、音声検出照合プログラム６は、照合ＰＣ２上で実行されることにより、照合ＰＣ２上に、音声入力部６０１と、音声分析部６０２と、音声音素変換部６０３と、原稿／台本入力部６０４と、照合範囲決定部６０５と、テキスト音素変換部６０６と、発話速度調整処理部６０７と、第１検出照合部６０８と、感度調整制御処理部６０９と、第２検出照合部６１０と、照合結果出力部６１１とを仮想的に構築する。各部の構成及び機能について、処理毎に説明する。
【００４１】
（音声入力〜音声音素変換）
音声入力部６０１は、生放送などの送出信号に含まれる音声５ａや、ＶＴＲ、ＬＤあるいはＤＶＤなどの記録媒体５ｂなどから取得され、音声を含んだ映像番組データからアナウンサー、ナレータ、出演者の音声信号を照合ＰＣ２において、１６ＫＨｚ（サンプリングレート）、１６ビット（量子化）で抽出するモジュールである。この音声入力部６０１に開始指令が入力されると同時に、蓄積ＰＣ１のＭＰＥＧ２エンコーダーが起動され、ビデオファイルの作成及び蓄積が始まる。
【００４２】
音声分析部６０２は、音声中から認識に有効な特徴量を抽出する部分である。音声信号が１次元配列の信号列として取得された場合、その分析方法としては、図４に示すような、取得された音声信号の時間的な変化を、音声波形としてサンプリングし、そのままデジタル化する方法と、図５に示すような、音声信号に含まれている周波数成分を分離抽出し、個々の成分についてデジタル化する方法である。
【００４３】
この図５に示すような、周波数成分を用いて音声信号の分析を行う方法を一般にスペクトル分析と呼んでおり、現在の音声分析法の主流となっている。スペクトル分析の効果として、時間領域の波形は外部環境の変化に対して、変動しやすいが、スペクトル波形は変動が比較的少なく、また、スペクトル分析により、その音声を特徴づける情報が容易に得られる。本実施形態では、音声分析部６０２において、図５に示すスペクトル分析方法により音声分析を行い、認識に必要な特徴量を抽出している。ただし、本実施形態は例示であり、本発明の実施においては、上述した図４に示す方法の他、種々の音声分析方法を採用することができる。
【００４４】
前記音声音素変換部６０３は、音声から音素を抽出し、抽出した音素を出力するモジュールであり、本実施形態では、ベイズ識別関数によるフレーム音素認識を用い、音声分析部６０２から入力された音声特徴量と、音素モデル辞書６０３ａから取得される音素モデルとから、フレーム単位（１フレームは８ｍｓｅｃ）で第Ｎ位まで（Ｎ≦音素数）の音素認識結果を出力するモジュールである。なお、この音声音素変換における音素継続長は、表１に示す、発音記号・継続長対応表から取得される。
【００４５】
【表１】

なお、表１に示す音素継続長は、ＡＴＲ音素バランス文の発話データを分析して求めたものである。このＡＴＲが提供する研究用日本語音声データベースセットＢ（文音声データベース）は、ＡＴＲ音素バランス文（５０３文）を１０話者（男女のアナウンサー及びナレータ）が読み上げた発話データとラベル付けしたデータから構成され、音声処理基本データとなっている。本実施形態では、このデータを音素モデル辞書として利用する。
【００４６】
（原稿／台本入力〜テキスト音素変換）
原稿／台本入力部６０４は、文字列を含むテキストデータを入力するテキストデータ入力部であり、本実施形態では、放送番組の原稿や台本が電子化されたテキストデータを入力する。なお、このテキストデータが電子化されていない場合は、テキスト入力支援システムにおいてその電子化を行う。
【００４７】
原稿／台本入力部６０４は、蓄積ＰＣ１上の原稿データベース９ａ内にある原稿／台本フォルダにある所定の原稿ファイルを読み込む。この原稿ファイルは、発話スピードレベル、背景音レベル、環境ノイズの状況など、放送番組の種類に応じた発話状況情報と、テキストデータである発話台本情報から構成される。
【００４８】
発話状況情報は、音声照合のレベル設定に用いられるデータであり、このうち、発話スピードレベルは番組の内容に応じて記述され、例えばニュース番組やバラエティ番組にあっては、一般に早口で話され、ドキュメンタリー番組などではゆっくり話され、ドラマ番組にあっては、早口で話すシーン、ゆっくり話すシーンである旨が記述される。また、背景音レベル情報には、例えば、ニュースやドキュメンタリー番組にあっては、屋外の撮影である場合や、ドラマや映画番組にあっては背景音楽が多いシーンなどが記述される。
【００４９】
発話速度調整処理部６０７は、原稿ファイルに含まれた発話状況情報に応じて、テキスト音素変換部６０６における発話スピードを調整するモジュールである。この発話速度調整処理部６０７により、発話状況並びに発話環境に応じた音声照合を行い、音声認識の精度を向上させることができる。
【００５０】
照合範囲決定部６０５は、原稿／台本入力部６０４で読み込んだ原稿に基づき、これから発話されようとする項目（章）のテキストデータを、テキスト音素変換部６０６に出力するモジュールである。この際、照合範囲決定部６０５は、これから発話されようとする項目（章）の内容、後続の項目の先頭文字列の範囲を決定し、この範囲内に含まれるテキスト情報（文字列）をテキスト音素変換部６０６に出力する。通常、放送番組では、これから発話される項目は事前に定められた順序に従い、状況に応じて、項目の入れ替えも生じるが、放送前において予測される範囲であり、照合範囲決定部６０５は、この範囲に関する情報を保持しており、この情報に基づいて項目の先頭情報を決定する。
【００５１】
なお、本実施形態に係る照合範囲決定部６０５での照合範囲決定についてさらに詳述する。原稿データは、通常の文書と同じように一定の文書構造を有するという特徴を有している。この文書構造は、大きな括りとしていくつかの大項目があり、その一つの大項目にはいくつかの中項目があり、その一つの中項目にはいくつかの小項目があるというような階層構造を有している。
【００５２】
照合範囲決定部６０５は、この文書構造に注目し、発話単位毎に文を細分化した文節毎に原稿データを管理する。ここで、原稿データの例として、ニュース原稿の構造、ニュース原稿の制作から送出までについて述べる。
【００５３】
（１）ニュース原稿の構成
ここで、原稿の構造について説明する。図６は、原稿データとして、ニュース番組の報道用原稿を例示する説明図である。この原稿において、ニュースは、階層Ｌ１において、いくつかの項目に分けられ、制作管理されている。階層Ｌ１の下層には、階層Ｌ２、Ｌ３が関連付けられて階層構造をなしている。
【００５４】
例えば、放送されるニュースの項目には、政治情報、国際情勢、経済情報、事件・事故などの社会情報、ローカルニュース、気象情報などがある。これらの項目を基にしてニュースが送出され、その順番は、階層Ｌ１中の項目１〜ｎのようにヘッドラインや挨拶（「こんばんは、７月７日、夜７時のニュースです。」と簡単な挨拶等）、ニュース項目中で最も話題性の高い項目がトップニュースとなり、その後政治情報、国際情勢、経済情報、社会情報、ローカルニュース、気象情報へと続く（話題性、祭事、節目などの事情により順番が異なる）。また、現在の項目から次の項目に移る場合、次の項目の案内を入れることがある。例えば、「今夜は先ず、内閣誕生のニュースからお伝えいたします。」、「次は地震のニュースです。」、「続いて環境に関するニュースです。」などがある。これらの項目案内は、時間の都合により省略されることもある。
【００５５】
本実施形態において、階層Ｌ１内の各情報の一括りとなるニュース単位を、ニュース項目と呼ぶ。また、放送当日のニュースの状況により、各項目の中が、いくつかに分かれていることもあり、これらを子項目と呼んでいる。このように派生した項目（子項目）は、上位階層Ｌ１の親項目と関連付けされ、下層階層Ｌ２以下で管理されている。
【００５６】
階層Ｌ１に含まれる一つのニュース項目は、通常４００字程度のテキスト（気象情報など長いものでは８００文字程度）からなり、２５区切り程度（長いもので５０区切り程度、区切りとは一息で発話されるテキスト量）程度の量である。本実施形態において、この区切られたテキストを区切りテキストと呼ぶものとする。
【００５７】
なお、ここではニュースを取り上げたが、ドラマやドキュメンタリーなどにおいても、その原稿若しくは台本はニュースの項目構造と同じで、章節で示されるようにいくつかの括りから階層構造をなす。
【００５８】
（２）ニュース原稿の制作から送出までの処理
ニュース原稿制作は、先ず、ニュース項目担当部門の担当記者が取材した内容に基づいて、期日までに原稿を作成する。出来上がった記者原稿は担当デスクによって校正が行われる。担当デスクで印刷された印刷物がアナウンサー原稿となり、報道制作関係部門に配布される。
【００５９】
ドラマやドキュメンタリーなどの番組は事前に作成された原稿若しくは台本に従い、時間と共に進行して収録される。しかしニュースは生放送でしかも時間枠が定められている。ニュース番組の進行状況によっては番組内での時間調整が必要となることもある。このような状況において、制作担当者は、アナウンサー原稿に対して部分削除や追加などの編集を手作業で行うことがある。従って、実際の放送ではこのように、アナウンサー発話が事前に電子化された原稿と必ずしも一致しないことがあり得る。またニュース放送では、できるだけ鮮度の高い情報を提供するため、取材並びに原稿の準備など理由により、当初予定の項目順番が入れ替わることもよくある。この項目順番変更は、アナウンサーがその原稿を読む前に原稿を管理するコンピュータシステムに反映されるため、音声検出処理に影響を与えない。
【００６０】
（３）照合範囲決定と優先順位付与
本実施形態において、原稿データは、原稿の内容に応じて項目分けがされており、これらの項目に応じてテキストデータが分割され、分割されたテキストデータには、優先度に応じた重み係数が付与されている。すなわち、図７に示すように、上位階層Ｌ１において、ｎ個の項目Ｆｉ（ｉ＝１，ｎ）があり、各項目は複数の区切りテキストにより構成される。これらの区切りテキストは音素変換処理によって音素列が生成される。ここでｉ番目の項目全体に対応する音素列をＦｉとし、その中の区切りテキストに対応する音素列をＦｉｊ（ｉ＝１，ｎｊ＝１，ｍｉ）とする。
【００６１】
現在、ｉ番目の項目が発話されようとする時点において、照合範囲決定部６０５の処理は次のようになる。この範囲決定処理において、項目Ｆｉ中の区切りテキストが、最優先の候補となり、放送時間の都合などにより、この項目発話途中で別の項目に移ることも考えられるため、この項目以降の各項目の先頭区切りテキストＦｋ１（ｋ＝ｉ＋１，ｎ）が次の候補となる。
【００６２】
項目ＦｉにおいてＦｉｊ（ｊ＝１，ｍ）の区切りテキストがあり、これからｊ＝１の区切りテキストが発話されようとしているとすると、この候補ｊの優先順位が最も高く、ｊ＋１、ｊ＋２と優先度が低くなる。優先度は数値（ウェイト：ｗ１、ｗ２、ｗ３、・・・）で示され、第２検出照合部６１０での判定閾値レベルに反映される。
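The range determination described in paragraphs [0061] and [0062] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `collation_range` and the list-of-lists manuscript layout are assumptions. It returns the remaining delimited texts of the current item first (highest priority), followed by the leading delimited text of every later item.

```python
def collation_range(items, i, j=0):
    """Return candidate delimited texts in priority order.

    items: list of items, each a list of delimited texts (hypothetical layout).
    i: index of the item about to be uttered.
    j: index of the next delimited text inside that item.
    """
    # Remaining delimited texts of the current item come first.
    candidates = list(items[i][j:])
    # Then the leading delimited text of each following item, in case the
    # speaker moves on to another item part-way through.
    candidates += [item[0] for item in items[i + 1:] if item]
    return candidates


manuscript = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
print(collation_range(manuscript, 0, 1))  # ['a2', 'a3', 'b1', 'c1']
```

The position of a text in the returned list can then be mapped to the weights w1, w2, w3, ... mentioned in the text.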
【００６３】
図３に示した前記テキスト音素変換部６０６は、図８のステップＳ１０１〜Ｓ１０３に示すように、テキスト中に混在する漢字、かな、カタカナ、数字、数値を、先ずカタカナに変換し、このカタカナ文から発音記号を求め、音素列へと変換するモジュールである。
【００６４】
このテキスト音素変換部６０６では、照合範囲決定部６０５で決められた区切りテキスト全文を音素列に変換する。また第１検出照合部６０８の処理を高速に行うための検査音素列（区切りテキストの先頭からの音節片：本実施形態では８音節とする）を生成する。図９に、テキストと音素列の具体的なサンプルを示す。同図に示すように、発話の多様性対応のため、数値などはひらがなで表記することが必要となる。
【００６５】
このテキスト音素変換部６０６における漢字−カタカナ変換処理では、漢字かな混じりのテキストを形態素解析（文を品詞毎に分割する技術）して品詞毎に分割し、さらにすべてカタカナからなる文字列に変換する。
【００６６】
（例）私は太郎です―――＞ワタシワタローデス
また、このテキスト音素変換部６０６におけるカタカナ−発音記号変換処理では、カタカナからなる文字列を、表２の「カタカナ・発音記号対応表」を用いて、発音記号列に変換する。
【００６７】
【表２】

（例）ワタシワ―――＞ｗａｔａｓｈｉｗａ
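As a minimal sketch of the katakana-to-phonetic-symbol step, the following Python maps each katakana character to a list of phonetic symbols. Table 2 is not reproduced in this text, so the `KANA_TO_SYMBOLS` entries are illustrative assumptions that cover only the example above, and the function name is hypothetical.

```python
# Hypothetical excerpt of a katakana-to-phonetic-symbol table. Table 2 is not
# reproduced in this text, so these entries are assumptions for illustration.
KANA_TO_SYMBOLS = {
    "ワ": ["w", "a"],
    "タ": ["t", "a"],
    "シ": ["sh", "i"],
}


def katakana_to_symbols(text):
    """Convert a katakana string into a flat list of phonetic symbols."""
    symbols = []
    for ch in text:
        if ch not in KANA_TO_SYMBOLS:
            raise KeyError(f"no table entry for {ch!r}")
        symbols.extend(KANA_TO_SYMBOLS[ch])
    return symbols


print("".join(katakana_to_symbols("ワタシワ")))  # watashiwa
```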
また、このテキスト音素変換部６０６における発音記号−音素列変換処理では、前述した表１の発音記号・継続長対応表を用いて各発音記号を継続長分連続させ、音素列を生成する。ここで、継続長とは、発音記号の継続する長さで単位はフレームである。フレームとは、サンプリングされた音声信号（例えば１６ｋＨｚでサンプリングすると１秒間に１６０００個のデータとなる）を等間隔に切り出した単位で、８ミリ秒おきに切り出している場合は１フレームの時間長は８ミリ秒となる。
【００６８】

なお、表１中の数値は、フレーム数を示す。
【００６９】
この例において「ｗａｔａｓｈｉｗａ」の発話の継続長は、ｗが７フレーム、以下ａ（１０）、ｔ（２）、ａ（１０）、ｓｈ（１５）、ｉ（９）、ｗ（７）、ａ（１０）を累積した７０フレームとなり、７０フレーム×８ｍｓｅｃ＝０．５６ｓｅｃとなる。即ち標準発話において「わたしは」は０．５６秒で発話されることになる。
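The duration arithmetic in paragraph [0069] can be checked with a short script. The frame counts below are the ones quoted in the text for the symbols of "watashiwa"; the rest of Table 1 is not reproduced here, and the function name `expand_to_frames` is an assumption.

```python
FRAME_MS = 8  # one frame is 8 ms in this embodiment

# Frame counts quoted in the text for the symbols in "watashiwa".
DURATION_FRAMES = {"w": 7, "a": 10, "t": 2, "sh": 15, "i": 9}


def expand_to_frames(symbols):
    """Repeat each phonetic symbol for its duration: one entry per frame."""
    frames = []
    for s in symbols:
        frames.extend([s] * DURATION_FRAMES[s])
    return frames


symbols = ["w", "a", "t", "a", "sh", "i", "w", "a"]
frames = expand_to_frames(symbols)
duration_sec = len(frames) * FRAME_MS / 1000
print(len(frames), duration_sec)  # 70 0.56
```

The expanded list is the frame-level text phoneme string that is later compared against the frame-level speech phoneme string.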
【００７０】
発話速度調整処理部６０７は、アナウンサーが最適な環境の下、標準発話口調で発話しているため、民放各社の報道アナウンサーの発話に比べ、ゆっくりした口調で原稿を読み上げている。その発話速度は約１．５倍の違いとなる。また、発話速度調整処理部６０７は、第１検出照合部６０８の精度を向上させるため、発話速度の変化は主として母音の長さに反映されるという音響的な特徴（早口発話において母音の長さが短くなる）を着目し、原稿から音素に変換する段階で母音の継続長を調整する処理が設けられている。
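Paragraph [0070] says that a change in speaking rate shows up mainly in vowel length, so vowel durations are adjusted when the manuscript is converted to phonemes. A minimal sketch of such an adjustment follows. The exact scaling rule is not given in the text, so dividing vowel frame counts by the rate factor, rounding, and keeping at least one frame are assumptions, as is the function name.

```python
VOWELS = {"a", "i", "u", "e", "o"}


def adjust_durations(durations, rate):
    """Scale vowel durations by 1/rate and leave consonants unchanged.

    durations: dict mapping phonetic symbol -> frame count at standard speed.
    rate: speaking-rate factor relative to standard speech (1.5 = 1.5x faster).
    The scaling and rounding rule here is an assumption for illustration.
    """
    adjusted = {}
    for symbol, frames in durations.items():
        if symbol in VOWELS:
            adjusted[symbol] = max(1, round(frames / rate))
        else:
            adjusted[symbol] = frames
    return adjusted


standard = {"w": 7, "a": 10, "t": 2, "sh": 15, "i": 9}
print(adjust_durations(standard, 1.5))
# {'w': 7, 'a': 7, 't': 2, 'sh': 15, 'i': 6}
```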
【００７１】
（検出照合〜照合結果出力）
第１検出照合部６０８は、音声音素変換部６０３で得た入力音声の音素列に対して、テキスト音素変換部６０６から得た照合範囲にあるテキスト音素列群を連続ＤＰで比較し、累積距離の小さな第４位までの候補を求める。
【００７２】
原稿にある全文を照合対象とすると計算量が多くなりリアルタイムでの処理が不可能となるため、照合範囲決定部で求められた対象項目のテキスト並びに後続項目の先頭文を対象とし、それらの文から求めた検査音素列と入力音声音素列との照合を行う。
【００７３】
本実施形態におけるＤＰマッチングと連続ＤＰについて、図１０を用いて、以下に説明する。ＤＰマッチングは２つのデータ列の類似度を測るアルゴリズムである。ここに２つのデータ列Ｒ、Ｑがあるとする。データ列Ｒはデータｒ１，ｒ２，ｒ３，，，，，，，ｒｍからなり、データ列Ｑはデータｑ１，ｑ２，ｑ３，，，，ｑｎからなる。同図において、横軸にデータ列Ｒを、縦軸にデータ列Ｑをとる。先ず全格子点上で、各データ間の距離値（近さの逆）を求める。例えば格子点Ｐはデータｒ２とデータｑ３との距離値を持つ。次に始点Ｓから終点Ｅを格子点を通るようにつなげ（これをパスと言う）、通る格子点の距離値を全部足し合わせ、パスの累積距離を求める。すべてのパスの中で最小の累積距離を持つパスを選択する（このパスを最適パスと言う）。さらにこの累積距離を正規化する（パスの長さ又は縦軸の長さで累積距離を割る）。この正規化した累積距離（以下、累積距離と言う）が小さいほどデータ列間の類似度が大きいと言える。
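The DP matching described in paragraph [0073] can be sketched as below. The text does not specify the local distance or the allowed path steps, so this sketch assumes a 0/1 distance between items and the three usual steps (horizontal, vertical, diagonal). It normalizes by the length of the vertical axis, which is one of the two options the text names.

```python
def dp_matching(r, q):
    """Normalized cumulative distance of the optimal path between r and q.

    Local distance is 0 for identical items and 1 otherwise (an assumed
    metric). The path runs from the first pair to the last pair, and its
    cumulative distance is divided by len(q), the vertical-axis length.
    """
    m, n = len(r), len(q)
    inf = float("inf")
    # d[j][i]: smallest cumulative distance of any path from (0, 0) to (i, j)
    d = [[inf] * m for _ in range(n)]
    for j in range(n):
        for i in range(m):
            cost = 0 if r[i] == q[j] else 1
            if i == 0 and j == 0:
                best_prev = 0
            else:
                best_prev = min(
                    d[j][i - 1] if i > 0 else inf,            # horizontal step
                    d[j - 1][i] if j > 0 else inf,            # vertical step
                    d[j - 1][i - 1] if i > 0 and j > 0 else inf,  # diagonal step
                )
            d[j][i] = cost + best_prev
    return d[n - 1][m - 1] / n


print(dp_matching("aab", "ab"))  # 0.0
```

A smaller return value means the two sequences are more similar; a repeated item such as the doubled "a" above is absorbed by the path at no cost.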
【００７４】
連続ＤＰは、ＤＰマッチングを拡張し、検索対象とするデータ列の中に入力データ列に類似する区間があるかを調べるアルゴリズムである。
【００７５】
検索対象データ列Ｒはデータｒ１，ｒ２，ｒ３，，，，，，，ｒｍからなり、入力データ列Ｑはデータｑ１，ｑ２，ｑ３，，，，ｑｎからなるとする。図１１において横軸にデータ列Ｒを、縦軸にデータ列Ｑをとる。次のようにして類似区間を求める。ある時点での最適パスを求める（下図では始点がＳ１、終点がＥ１のパス）。このパスの累積距離Ｄ１を求める。次に終点を右に１単位（データ１個分）ずらし（終点Ｅ２）、最適パスとその累積距離Ｄ２を求める。これを最後まで繰り返す。累積距離が最も小さいパスの区間が、入力データ列に最も類似している区間である。例えば下図でパスＳ−Ｅが最も累積距離が小さいとすると、区間Ｋが、入力データ列に最も類似している区間である。
【００７６】
また、横軸を終点位置、縦軸を累積距離とすると図１２のようなグラフになる。なお、本実施形態では、このグラフを累積距離曲線と称する。この累積距離曲線において、閾値を設定し、累積距離が閾値以下で極小となる点が類似区間候補の終点である。図１２の場合、終点Ｅ１とＥがこれに相当するので、これらの２終点で終わる２区間が類似区間の候補となる。Ｅ１よりＥにおける累積距離が小さいので、Ｅで終わる区間（図１１で区間Ｋ）が類似区間として検出される。
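Continuous DP and the cumulative distance curve of paragraphs [0074] to [0076] can be sketched as follows. As in the DP matching description, the 0/1 local distance and the three path steps are assumptions, and both function names are hypothetical. `continuous_dp_curve` slides the end point along the search-target sequence and lets the path start anywhere; `candidate_ends` picks end points where the curve is at or below the threshold and at a local minimum.

```python
def continuous_dp_curve(r, q):
    """For each end position in r, return the best normalized cumulative
    distance of aligning all of q to some stretch of r ending there."""
    n = len(q)
    inf = float("inf")
    prev = [inf] * n  # DP column for the previous position in r
    curve = []
    for x in r:
        cur = [inf] * n
        for j in range(n):
            cost = 0 if x == q[j] else 1
            if j == 0:
                best_prev = 0  # free start: a path may begin at any position
            else:
                best_prev = min(prev[j], cur[j - 1], prev[j - 1])
            cur[j] = cost + best_prev
        curve.append(cur[n - 1] / n)
        prev = cur
    return curve


def candidate_ends(curve, threshold):
    """End positions at or below the threshold that are local minima."""
    ends = []
    for i, d in enumerate(curve):
        left_ok = i == 0 or curve[i - 1] > d
        right_ok = i == len(curve) - 1 or curve[i + 1] >= d
        if d <= threshold and left_ok and right_ok:
            ends.append(i)
    return ends


curve = continuous_dp_curve("xxabcxxabdxx", "abc")
ends = candidate_ends(curve, threshold=0.4)
print(ends)                                  # [4, 8]
print(min(ends, key=lambda i: curve[i]))     # 4
```

In this toy input the exact occurrence "abc" ends at position 4 and the near match "abd" produces a second, weaker candidate, mirroring the two end points E1 and E in the description.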
【００７７】
感度調整制御処理部６０９は、誤検出や検出漏れに対処するもので、連続ＤＰの判定閾値を調整するものである。感度はウェイトとして与えられ、全体若しくは部分的に判定の閾値（図１２中）を調整するものである。ウェイトが小さいほど累積距離は閾値に近寄り、検出し易くなる。
【００７８】
第２検出照合部６１０は、前段の第１検出照合部６０８で候補となった対象テキスト４候補について、引き続き連続ＤＰによる照合を行うもので、音声音素列と対象テキストの音素列を用いる。ここで行う連続ＤＰは対象テキストが４つあるため、同時に４つの連続ＤＰを行うことになる。４つの連続ＤＰのいくつかで類似区間が検出されたとき、連続ＤＰ累積距離が最小のテキストを検出テキストとする。４つのテキストは原稿の出現順番を考慮して、その順にｗ１，ｗ２，ｗ３，ｗ４の重み係数を持つ（１．０＝ｗ１＜ｗ２＜ｗ３＜ｗ４）。但しこの重み係数はテキストの出現順位を強固に保持させるような値を選択すると、発話内容の変更などに追従できなくなるため、緩やかな重み付けを行う。また図７においてウェイトがゼロのテキストは照合範囲決定部６０５において範囲対象外として扱う。累積距離に重み係数を掛けることにより、順番が早いテキストほど検出し易くしている。
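The effect of the order-dependent weights in paragraph [0078] can be illustrated with a small sketch. The text only states that 1.0 = w1 < w2 < w3 < w4 and that the weighting should be gentle, so the concrete weight values, the threshold, and the function name below are assumptions.

```python
def pick_detected_text(distances, weights, threshold):
    """Return the index of the candidate with the smallest weighted distance.

    distances and weights are parallel lists in manuscript order. Candidates
    whose weighted distance exceeds the threshold are ignored. Returns None
    when no candidate qualifies.
    """
    best_index, best_value = None, None
    for k, (d, w) in enumerate(zip(distances, weights)):
        weighted = d * w
        if weighted <= threshold and (best_value is None or weighted < best_value):
            best_index, best_value = k, weighted
    return best_index


distances = [0.30, 0.28, 0.90, 0.50]   # cumulative distances of the 4 texts
weights = [1.0, 1.1, 1.2, 1.3]         # assumed gentle weights, w1 = 1.0
print(pick_detected_text(distances, weights, threshold=0.4))  # 0
```

Without weighting the second text would win on raw distance; multiplying by the weights tips a near tie toward the text that appears earlier in the manuscript.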
【００７９】
この第２検出照合部６１０における処理の具体例を以下に示す。照合開始時点では４つのテキストの累積距離は、図１３に示すように、閾値以上である。そして、時間を進め、ある時点でテキスト１の累積距離が閾値以下になったとすると、図１４に示すように、テキスト１を検出テキスト候補とし、この時点Ａから検出テキストとその類似区間を求める処理が始まる。
【００８０】
さらに、時間を進め、テキスト１の類似区間候補が見つかった（累積距離曲線が極小になった）場合、図１５に示すように、この時点をＢ１点とする。
【００８１】
時間を進め、テキスト１の新しい類似区間候補が見つかり、Ｂ１点より累積距離が小さい場合、図１６に示すように、この点を新しいＢ２点とする。
【００８２】
他のテキストについても類似区間候補が見つかり、Ｂ１点、Ｂ２点より累積距離が小さい場合新しいＢ３点とし、このテキストを検出テキスト候補とする。図１７ではテキスト３が検出テキスト候補となっている。
【００８３】
そして、Ｂ３点から一定時間Ｌ（遅延時間、例えば１秒）新しいＢ点が見つからない場合、図１８に示すように、現在の最小の累積距離を有するテキスト候補を検出テキスト（ここではテキスト３）とし、Ｂ３点を類似区間の終点とする。
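The B-point bookkeeping of paragraphs [0079] to [0083] can be sketched as a small decision loop. The input format (a time-ordered list of candidate end points) and the function name are assumptions. A candidate replaces the current B point only if its cumulative distance is smaller, and the current B point is finalized once the delay time L passes without a better one.

```python
def detect_with_delay(minima, delay):
    """Decide the detected text from candidate end points.

    minima: list of (time_sec, text_id, distance) in time order, one entry
            per similar-section candidate (hypothetical input format).
    delay:  the delay time L in seconds.
    Returns (text_id, end_time) of the finalized B point, or None if there
    were no candidates. End of input is treated as the delay having elapsed.
    """
    best = None  # (time, text_id, distance) of the current B point
    for t, text_id, dist in minima:
        if best is not None and t - best[0] >= delay:
            break  # no better candidate arrived within the delay time
        if best is None or dist < best[2]:
            best = (t, text_id, dist)
    return None if best is None else (best[1], best[0])


events = [
    (10.0, 1, 0.35),  # B1: text 1
    (10.3, 1, 0.30),  # B2: text 1, smaller distance
    (10.6, 3, 0.22),  # B3: text 3 becomes the candidate
    (12.0, 2, 0.10),  # arrives more than 1 s after B3, so it is not considered
]
print(detect_with_delay(events, delay=1.0))  # (3, 10.6)
```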
【００８４】
照合結果出力部６１１は、第２検出照合部６１０による検出結果を、照合結果出力プログラム７や、ビデオ保存・音声照合結果保存プログラム８などの他のプログラムに出力する外部出力インターフェースである。
【００８５】
（照合処理）
本実施形態に係る照合処理は、第１検出照合部６０８と第２検出照合部６１０の２段階において実行される。図１９は、本実施形態に係る照合処理を示すフローチャート図である。
【００８６】
先ず、音声入力部６０１により音声の入力が行われ（Ｓ２０６）、この入力された音声は、音声分析部６０２による音声分析の後（Ｓ２０７）、音声音素変換部６０３により音声音素に変換され（Ｓ２０８）、音声音素バッファに格納される（Ｓ２０９）。なお、本実施形態における音声音素バッファへの書き込みは、フレーム単位（８ｍｓｅｃ）で行われる。
【００８７】
一方、照合する原稿や台本は、原稿／台本入力部６０４から電子化されたデータとして入力され（Ｓ２０１）、照合範囲決定部６０５において、原稿の構造に基づいて区切りテキストが抽出され（Ｓ２０２）、これから放送において発話されようとしているニュース項目の全テキスト（項目中の区切りテキスト）並びに後続の項目の先頭文を、テキスト音素変換部６０６においてテキスト音素変換し（Ｓ２０４）、テキスト音素バッファに格納される（Ｓ２０５）。このステップＳ２０４でのテキスト音素変換においては、早口発話に対応するため、適宜、発話速度調整処理を行う（Ｓ２０３）。テキスト音素バッファに格納される情報は、区切りテキスト、その音素列、並びに高速に検出を行うための検査音素列（区切りテキスト音素列の先頭からの音節片：本装置では８音節とした）から構成される。
【００８８】
このように音声音素バッファに格納された音声音素に対して、テキスト音素バッファに格納されたテキスト音素群を、第１検出照合部６０８において検出照合処理を行う（Ｓ２１０）。具体的には、連続ＤＰによりＤＰの累積距離が小さい、即ち類似度の高いテキストを検出する。本実施形態では、ステップＳ２１０及びステップＳ２１３に示すように、連続ＤＰ照合は２段階で構成され、１段目が第１検出照合部６０８に、２段目が第２検出照合部６１０に対応する。
【００８９】
先ず、１段目の第１検出照合部６０８では、比較する対象となる区切りテキストが約５０個になり、連続ＤＰがこの個数分作動することとなる。またリアルタイムで照合処理を実現するためには、これらのテキスト音素を８ｍｓｅｃ以内で処理しなければならないことから、この第１検出照合部６０８における処理は、上述の検査音素列により高速に行われる。
【００９０】
放送音源には背景音楽などが含まれるため、音声区間、非音声区間を正確に判別することが難しい。また、音声区間で発話される内容が、事前に作成された原稿に含まれていないこともある。また、中継などの情報は事前原稿に含まれない内容である。このような音声音素列は検査音素列と類似しないため、この１段目の連続ＤＰでは、それら類似しない照合をスキップし、音声音素バッファから次の音声音素列を取り込む。
【００９１】
なお、一段目の照合は８音節程度と短いため、例えば「総理大臣は」と言う文が４箇所存在する場合、これらがすべて候補となる。ただし、ステップＳ２０２における照合範囲決定時の優先順位により、これら４候補は等確率ではなく、項目順番を考慮したウェイトが掛けられ、「総理大臣は」に続く後続のテキスト検出の誤検出を防止している。
【００９２】
これら検査音素列との照合結果に基づいて、候補が４つとなるまで、ループ処理を繰り返す（Ｓ２１２）。すなわち、ステップＳ２１２において、検査音素列と入力音素列とが一致する場合は、ｉに１を加算し、次の検査音素列をテキスト音素バッファから取得し、ステップＳ２１０を実行する。一方、ステップＳ２１２において、検査音素列と入力音素列とが一致しない場合には、音声音素バッファから音声音素を取得し、ステップＳ２１０において現在の検査音素列との照合を繰り返す。この処理を、ｉが４となるまで繰り返す。
【００９３】
そして、これらの検査音素列で音声音素と類似度の高い４候補を求め、次段の第２検出照合部６１０の処理に進む（Ｓ２１３）。この２段目の処理は、第２検出照合部６１０において、１段目で候補となった検査音素列に対応する区切りテキスト音素列と音声音素列との連続ＤＰ処理を行う。区切りテキスト音素列の一部は既に連続ＤＰが作動しているため、この情報を引き継いで連続ＤＰが作動する。
【００９４】
この処理はフレーム（８ｍｓｅｃ）毎に処理され、その時点時点での累積距離が求められ、累積距離曲線が得られる。この曲線から極小値を求める。この極小値がローカルミニマかグローバルミニマであるかを判定するため、一定時間（例えば１秒）新しい極小値が見つからなければ、最も小さい極小値（最も一致している）を持つ区切りテキストが検出したテキストとなる（Ｓ２１４）。
【００９５】
検出したテキストについて、表示処理（Ｓ２１５）を行う。例えば、検出したテキストデータを、照合結果再現プログラム１０等の別のアプリケーションに出力し、例えば、字幕装置においては字幕放送ができ、またＭＰＥＧ７形式の蓄積装置においては新しい形態のビデオコンテンツを形成することができる。
【００９６】
次いで、項目内の次の区切りテキストに進む（Ｓ２１６）。このとき、次項目若しくは以後の項目の先頭区切りテキストが存在するか否かについて判断を行い、新たな項目に遷移するような場合（ステップＳ２１６における”Ｙｅｓ”）には、ステップＳ２０２に戻り、照合範囲の決定〜テキスト音素バッファへの蓄積（Ｓ２０２〜Ｓ２０５）の処理を実行する。
【００９７】
一方、ステップＳ２１６において、次項目への遷移ではないと判断した場合（ステップＳ２１６における”Ｎｏ”）には、テキスト音素バッファから適合テキストの削除処理を行い（Ｓ２１７）、テキスト音素バッファが空になっているか否かについて判断を行い（Ｓ２１８）、空になっている場合（ステップＳ２１８における”Ｙｅｓ”）には、ステップＳ２０２に戻り、照合範囲の決定〜テキスト音素バッファへの蓄積（Ｓ２０２〜Ｓ２０５）の処理を実行し、空になっていない場合には（ステップＳ２１８における”Ｎｏ”）、上記ステップＳ２１０〜Ｓ２１６の処理を実行する。
【００９８】
［第２実施形態］
次いで、本発明の第２実施形態について説明する。本実施形態では、上述した音声認識システムを、特定発話検知アーカイブシステムに応用した例である。図２０は、本実施形態に係る特定発話検知アーカイブシステムの構成を示すブロック図である。
【００９９】
本実施形態に係る特定発話検知アーカイブシステムは、図２０に示すように、照合ＰＣ２で実行される特定発話検知システム２１と、検出結果出力システム２２とを備えるとともに、蓄積ＰＣ１で実行される特定発話検知用アーカイブシステム１１と、特定キーワードデータベース９ｄと、照合ログデータベース９ｂと、ＭＰＥＧ２データベース９ｅと、音声処理再生システム１２とから構成される。
【０１００】
検出結果出力システム２２は、検出結果を、逐次表示するシステムである。音声処理再生システム１２は、照合ログファイルから対応するＭＰＥＧ２ファイルの再生を行うと共に、再生時間に合わせ照合したテキストを画面に表示したり、このテキストからそのシーンを表示したりするシステムである。特定発話検知システム２１は、上述した第１実施形態で説明した音声検出照合プログラム６を検索エンジンとして内蔵しており、前述した原稿ファイルに替えて、ユーザーが指定したキーワードを、ビデオファイルから検索する機能を有する。
【０１０１】
そして、このようなアーカイブシステムに対する操作は、照合ＰＣ２の画面に表示されるインターフェースを介して行うことができる。図２１は、このアーカイブシステムのユーザーインターフェースである操作画面を示す構成図である。
【０１０２】
先ず、特定発話検知用アーカイブシステムを起動する。次に、照合させるテキストデータを読み込み、アーカイブシステムのＭＰＥＧ２ファイル作成を行う。
【０１０３】
次いで、操作画面のテキストボックスＴＢ１において、検索するキーワードを入力する。キーワードは１ページあたり２０個の言葉を入力できる。このテキストボックスＴＢ１では、直接キーワードを入力することもでき、また、特定キーワードデータベース９ｄからキーワード群を読み込むことも可能であり、読み込んだキーワードの編集も行うこともできる。なお、本実施形態では、各テキストボックスＴＢ１に対応してチェックボックスＣＢ１が設けられており、入力したキーワードのうち、任意のキーワードを選択して検出対象とすることができる。
また、本実施形態では、各テキストボックスＴＢ１に対応させて、トラックバーＴＢＲ１が設けられており、各トラックバーＴＢＲ１を操作することにより、各キーワードに対する感度を設定する。感度は検出時のマッチング距離の閾値であり、０．０から５．０の範囲で、標準の閾値は２．５である。
【０１０４】
さらに、本実施形態では、各テキストボックスＴＢ１に対応させて、トラックバーＴＢＲ２が設けられており、このトラックバーＴＢＲ２を操作することによってキーワードの発話速度を調整することができる。０．５倍から２．０倍の範囲で、大変ゆっくりした発話から相当な早口発話に対応することができる。１倍は標準発話に対応する。
【０１０５】
また、本実施形態では、キーワードを検出する最小間隔（単位秒）を設定するテキストボックスＴＢ２、発話リストファイルをＰＣから読み込むためのボタンＢ１、入力・編集したキーワードや、各キーワードの感度、発話速度などの条件を発話リストファイルに書き込むためのボタンＢ２、キーワードをソートするためのボタンＢ３、検知したキーワードに対応した発話出力を実行するチェックボックスＣＢ２、処理を開始するためのボタンＢ４、処理を終了するためのボタンＢ５が設けられている。
【０１０６】
さらに、この操作画面には、全体の感度を調整するトラックバーＴＢＲ３が設けられている。本実施形態において、この感度調整の範囲は−２．５から２．５である。全体の感度の効果は各キーワードの感度に加算として表れ、各キーワードの感度の最大範囲は−２．５から７．５となる。また、全体の発話速度を調整するトラックバーＴＢＲ４も設けられている。本実施形態において、調整範囲は０．５倍から２．０倍である。全体のスピードの効果は各キーワードのスピードに乗算として表れ、各キーワードの速度範囲は０．２５倍から４．０倍になる。
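The way the overall and per-keyword controls combine in paragraphs [0103] to [0106] can be checked with a short sketch: the overall sensitivity is added to each keyword's sensitivity, and the overall speed multiplies each keyword's speed. The ranges are the ones stated in the text; the function name and the choice to raise an error on out-of-range input are assumptions.

```python
def effective_settings(kw_sensitivity, kw_speed, all_sensitivity=0.0, all_speed=1.0):
    """Combine per-keyword and overall settings.

    Sensitivity (the matching-distance threshold) is additive; speaking
    speed is multiplicative. Input ranges follow the text.
    """
    if not 0.0 <= kw_sensitivity <= 5.0:
        raise ValueError("keyword sensitivity must be in [0.0, 5.0]")
    if not -2.5 <= all_sensitivity <= 2.5:
        raise ValueError("overall sensitivity must be in [-2.5, 2.5]")
    if not 0.5 <= kw_speed <= 2.0 or not 0.5 <= all_speed <= 2.0:
        raise ValueError("speeds must be in [0.5, 2.0]")
    return kw_sensitivity + all_sensitivity, kw_speed * all_speed


print(effective_settings(5.0, 2.0, 2.5, 2.0))    # (7.5, 4.0)
print(effective_settings(0.0, 0.5, -2.5, 0.5))   # (-2.5, 0.25)
```

The two calls reproduce the extreme values given in the text: a sensitivity range of -2.5 to 7.5 and a speed range of 0.25x to 4.0x.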
【０１０７】
そして、検出結果は、リストボックスＬＢ１に表示される。図において、左から、「検出絶対時刻」、「処理を開始してからの時間（時：分：秒）」、「キーワードの発話時間（単位秒）」、それに検出されたキーワード文字列である。このリストボックスＬＢ１に表示されるデータは、ログファイルとして、照合ログデータベース９ｂに蓄積される。
【０１０８】
このようにして生成された照合ログは、ログファイルとして、検出結果出力システム２２において読み込まれる。このとき、検出結果出力システムでは、併せて、ログファイルに対応するＭＰＥＧファイルを読み込む。この検出結果出力システム２２は、ログファイルの印刷、インデックスに基づく頭出し再生、ログデータのソート（時刻、類似度、キーワード順）等を行う。
【０１０９】
［第３実施形態］
次いで、本発明の第３実施形態について説明する。本実施形態では、上述した音声認識システムを原稿に基づく音声インデキシングシステムに応用した例である。図２２は、本実施形態に係る音声インデキシングシステムの構成を示すブロック図である。
【０１１０】
本実施形態に係るインデキシングシステムは、図２２に示すように、照合ＰＣ２で実行される音声インデキシングシステム２３と、検出結果出力システム２２とを備えるとともに、蓄積ＰＣ１で実行される音声インデキシング用アーカイブシステム１３と、原稿データベース９ａと、照合ログデータベース９ｂと、ＭＰＥＧ２データベース９ｅと、音声処理再生システム１２とから構成される。
【０１１１】
検出結果出力システム２２は、検出結果を、逐次表示するシステムである。音声処理再生システム１２は、照合ログファイルから対応するＭＰＥＧ２ファイルの再生を行うと共に、再生時間に合わせ照合したテキストを画面に表示したり、このテキストからそのシーンを表示したりするシステムである。
【０１１２】
音声インデキシングシステム２３は、上述した第１実施形態で説明した音声検出照合プログラム６を検索エンジンとして内蔵しており、前述した原稿ファイルに基づいて、原稿ファイル内のテキストを、ビデオファイルから検索する機能を有する。
【０１１３】
そして、このようなインデキシングシステムに対する操作は、照合ＰＣ２の画面に表示されるインターフェースを介して行うことができる。図２３は、このインデキシングシステムのユーザーインターフェースである操作画面を示す構成図である。
【０１１４】
同図に示すように、この操作画面上には、入力した原稿を表示するリストボックスＬＢ２が備えられている。本実施形態では、このリストボックスＬＢ２において検出したテキストは赤色で表示される。
【０１１５】
また、この操作画面には、検出時に一度に処理する文の数を指定するテキストボックスＴＢ３と、検出する文に対する重みを設定するテキストボックスＴＢ４と、検出遅延時間を設定するテキストボックスＴＢ５が設けられている。
【０１１６】
テキストボックスＴＢ４では、例えば、重み係数が０．４の場合、最初の文の重みは１．０、次の文の重みは１．４、その次の文の重みは１．４４となる。重みが大きいほど検出感度が低くなる。また、テキストボックスＴＢ５では、新たに文を検出する際、直前（検出遅延時間以内）に検出した文と類似度を比較し類似度がより大きい場合、出力候補とする。検出遅延時間内に新たな検出文がない場合、前の検出文をログに出力する。
【０１１７】
そして、検出結果のログは、リストボックスＬＢ１に表示される。このリストボックスＬＢ１において、左から、「検出絶対時刻」、「処理を開始してからの時間（時：分：秒）」、「区切りテキストの発話時間（単位秒）」、それに検出された区切りテキストである。
【０１１８】
そして、このようなインデキシングシステムによれば、原稿ファイルから抽出された区切りテキストをキーワードとして、該当するキーワードが発話された時刻等を照合ログとしてリストボックスＬＢ１に表示し、このリストは、照合ログファイルとして、照合ログデータベース９ｂに蓄積される。
【０１１９】
このようにして生成された照合ログファイルは、検出結果出力システム２２において読み込まれる。このとき、検出結果出力システムでは、併せて、ログファイルに対応するＭＰＥＧファイルを読み込む。そして、この検出結果出力システム２２は、ログファイルの印刷、インデックスに基づく頭出し再生、ログデータのソート（時刻、類似度、キーワード順）等を行う。
【０１２０】
［第４実施形態］
なお、上述した実施形態及びその応用例に係る音声認識システム及び方法は、所定のコンピュータ言語で記述されたプログラムとすることができる。すなわち、このプログラムを、ユーザー端末やＷｅｂサーバ等のコンピュータやＩＣチップにインストールすることにより、上述した各機能を有する音声検出照合プログラムや照合結果出力プログラム等を容易に構築することができる。このプログラムは、例えば、通信回線を通じて配布することが可能であり、またスタンドアローンの計算機上で動作するパッケージアプリケーションとして譲渡することができる。
【０１２１】
そして、このようなプログラムは、図２４に示すような、汎用コンピュータ１２０で読み取り可能な記録媒体１１６〜１１９に記録することができる。具体的には、同図に示すような、フレキシブルディスク１１６やカセットテープ１１９等の磁気記録媒体、若しくはＣＤ−ＲＯＭやＤＶＤ−ＲＯＭ１１７等の光ディスクの他、ＲＡＭカード１１８など、種々の記録媒体に記録することができる。本実施形態は書き込み不可のＣＤ−ＲＯＭやＤＶＤ−ＲＯＭ１１７中にあるコンテンツに対してリンクを設けることができる特徴を有する。
【０１２２】
そして、このプログラムを記録したコンピュータ読み取り可能な記録媒体によれば、汎用のコンピュータや専用コンピュータを用いて、上述した音声認識システムや方法を実施することが可能となるとともに、プログラムの保存、運搬及びインストールを容易に行うことができる。
【０１２３】
【発明の効果】
以上述べたように、この発明によれば、既存の音声認識技術を利用し、放送中に発話される音声を、リアルタイムで且つ精度良く検出することができる。この検出結果を利用することにより、放送される映像に対して原稿に基づいた字幕付与したり、発話されている原稿に応じた映像を表示したり、キーワードによる検索によって希望する映像シーンを表示させたりなど、多様なサービスが可能となり、万人に対する様々なユニバーサルサービスを実現することが可能となる。
【図面の簡単な説明】
【図１】第１実施形態に係る音声認識システムの概略構成を示すブロック図である。
【図２】第１実施形態に係る照合ＰＣ及び蓄積ＰＣの内部構造及び関係を示すブロック図である。
【図３】第１実施形態に係る音声検出照合プログラムの機能を示すブロック図である。
【図４】第１実施形態に係る音声信号の時間波形を示すグラフ図である。
【図５】第１実施形態に係る音声信号のスペクトル波形を示すグラフ図である。
【図６】第１実施形態に係るニュース原稿の構造を示す説明図である。
【図７】第１実施形態に係る原稿内部の項目の記述を示す説明図である。
【図８】第１実施形態に係るテキスト音素変換部における処理を示すフローチャート図である。
【図９】第１実施形態に係るテキストと音素列の説明図である。
【図１０】第１実施形態に係るＤＰマッチングにおけるＤＰパスを示すパス図である。
【図１１】第１実施形態に係る連続ＤＰマッチングにおけるＤＰパスを示すパス図である。
【図１２】第１実施形態に係る連続ＤＰマッチングにおける累積距離曲線図である。
【図１３】第１実施形態に係る連続ＤＰマッチングにおける累積距離曲線図である。
【図１４】第１実施形態に係る連続ＤＰマッチングにおける累積距離曲線図である。
【図１５】第１実施形態に係る連続ＤＰマッチングにおける累積距離曲線図である。
【図１６】第１実施形態に係る連続ＤＰマッチングにおける累積距離曲線図である。
【図１７】第１実施形態に係る連続ＤＰマッチングにおける累積距離曲線図である。
【図１８】第１実施形態に係る連続ＤＰマッチングにおける累積距離曲線図である。
【図１９】第１実施形態に係る照合処理を示すフローチャート図である。
【図２０】第２実施形態に係る特定発話検知システムの構成を示すブロック図である。
【図２１】第２実施形態に係るインターフェースの操作画面を示す構成図である。
【図２２】第３実施形態に係る音声インデキシングシステムの構成を示すブロック図である。
【図２３】第３実施形態に係るインターフェースの操作画面を示す構成図である。
【図２４】第４実施形態に係るプログラムを記録したコンピュータ読み取り可能な記録媒体を示す斜視図である。
【符号の説明】
１…蓄積ＰＣ
２…照合ＰＣ
３…時計サーバー
４…ネットワーク
５ａ…音声
５ｂ…記録媒体
６…音声検出照合プログラム
７…照合結果出力プログラム
８…ビデオ保存・音声照合結果保存プログラム
９ａ…原稿データベース
９ｂ…照合ログデータベース
９ｃ…ビデオファイルデータベース
９ｄ…特定キーワードデータベース
９ｅ…ＭＰＥＧ２データベース
１０…照合結果再現プログラム
１１…特定発話検知用アーカイブシステム
１２…音声処理再生システム
１３…音声インデキシング用アーカイブシステム
２１…特定発話検知システム
２２…検出結果出力システム
２３…音声インデキシングシステム
１１６…フレキシブルディスク
１１７…ＲＯＭ
１１８…ＲＡＭカード
１１９…カセットテープ
１２０…汎用コンピュータ
６０１…音声入力部
６０２…音声分析部
６０３…音声音素変換部
６０３ａ…音素モデル辞書
６０４…原稿／台本入力部
６０５…照合範囲決定部
６０６…テキスト音素変換部
６０７…発話速度調整処理部
６０８…第１検出照合部
６０９…感度調整制御処理部
６１０…第２検出照合部
６１１…照合結果出力部

[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition system, a speech recognition method, and a speech recognition program for recognizing a speech signal included in multimedia content composed of a video stream, an audio stream, and the like.
[0002]
[Prior art]
Conventionally, multimedia content is generally composed of a video stream and an audio stream. In recent years, applications of the video stream have advanced, one of which is so-called indexing technology, which attaches an index to the video stream. In such indexing, for example, a time code synchronized with detection information of the video stream is attached to the video stream, so that the video can be cued based on the time code. By linking this time code with an interface such as a thumbnail display, video features such as detected scene changes and highlight scenes can be searched for with simple user operations.
[0003]
In recent years, analysis methods for such indexing have been actively studied. By applying this technology, it has become possible to answer search requests, such as playing back a scene that shows the desired video, using abstract keywords such as "this commercial" or "a scene with this kind of image".
[0004]
Meanwhile, indexing technology that uses techniques such as speech recognition has likewise been developed for audio streams. Such indexing of audio streams has produced good results in the field of television broadcasting, for example by analyzing a digitized manuscript created in advance and performing speech recognition on the narration of the television program actually broadcast. By applying such speech-recognition-based indexing, it becomes possible to provide services such as sounding a warning when a specific uttered word is recognized, or displaying the corresponding sentence of the digitized manuscript as a subtitle (see, for example, Patent Document 1).
[0005]
[Patent Document 1]
JP-A-2002-244694
[0006]
[Problems to be solved by the invention]
However, the indexing technology in the field of television broadcasting described above is used only in environments that are well controlled for speech recognition: the broadcast content is fixed in advance, as in a documentary program; the speakers are announcers or reporters trained in speaking; and the recording conditions are good.
[0007]
In general, however, speech recognition of continuous utterances is difficult to perform accurately because it must handle unspecified speakers and unspecified content, imperfect articulation by the speaker (for example, 東京 "Tokyo" is often pronounced "to-kyo-" (とーきょー), with lengthened vowels), the diversity of utterances (110番 may be read "ichi ichi zero ban", "hyaku jū ban", or "hyaku tō ban"), overlapping background sound and speech, environmental noise, and so on. It has not yet reached practical use and remains at the research stage.
[0008]
For this reason, at news reporting sites, for example, unlike ideal environments such as documentary narration, there is often heavy background noise, speakers may talk fast under broadcast time pressure, and interviews often involve people who have no speech training. At present it is difficult to apply the speech recognition described above in such cases.
[0009]
In addition, whereas the viewing time of a video stream can be shortened by fast-forwarding, an audio stream becomes difficult for humans to recognize when its duration is shortened by fast-forwarding or the like, so image recognition techniques cannot be applied to it as they are.
[0010]
The present invention has been made in view of the above points, and its object is to provide a speech recognition system, a speech recognition method, and a speech recognition program that use existing speech recognition technology to detect speech uttered during a broadcast in real time and with high accuracy.
[0011]
[Means for Solving the Problems]
In order to solve the above problems, the present invention inputs a speech signal and manuscript data including text data, converts the input speech into a speech phoneme string and the input text data into a text phoneme string, collates the speech phoneme string with the text phoneme string to determine whether they match, and, when they match, outputs the text data corresponding to the matching phoneme string as the detection result.
[0012]
According to the present invention, speech information is sensed and, based on specific utterance words or a digitized manuscript prepared in advance, the utterance words or utterance sentences that match the speech being broadcast can be detected and collated. That is, for speech uttered on the basis of manuscript data such as a manuscript or script, the present invention collates the sentences of the digitized manuscript against the uttered speech and detects each manuscript sentence in real time at the moment it is uttered.
[0013]
In the present invention, phoneme processing is adopted for the collation in order to handle unspecified speakers and unspecified content and to process in real time. This makes it possible to cope with hesitations, restatements, and unknown words, so the present invention can also be applied to genres in which the utterance content cannot be fixed in advance.
[0014]
Further, in the detection and collation processing of the present invention, the digitized manuscript is converted into a text phoneme string by text-to-phoneme conversion, and the speech is converted into a speech phoneme string by speech-to-phoneme conversion. The two phoneme strings are then compared by, for example, continuous dynamic programming (continuous DP), and the text phoneme string that matches the speech phoneme string is detected.
[0015]
In the above invention, it is preferable that the manuscript data be divided into items according to its content, that the text data be divided according to these items, that the range of the leading character string of each divided piece of text data be determined, and that the character string within that range be extracted as the collation target text.
[0016]
In this case, the digitized manuscript is divided into items such as sections or chapters and takes the form of a structured document, which guarantees that the order of the document corresponds to the order of utterance. Exploiting this property enables efficient, high-speed collation without making the entire text a collation target.
[0017]
In addition, the text corresponding to a section of the structured document (the divided text data) is managed in delimited units (the amount of text that can be uttered in one breath, or text bounded by pauses inserted to prevent ambiguity; hereinafter referred to as "delimited text" where appropriate). A syllable fragment from the beginning of each such sentence (for example, about eight syllables) is used as the collation target text, and collation is performed using the phoneme string of this collation target text as a test phoneme string. This speeds up the processing and makes it possible to detect text data in real time as it is uttered.
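Extracting the leading fragment that becomes the test phoneme string can be sketched as below. The text does not say how syllables are counted, so this sketch makes an assumption: each katakana character counts as one unit, except that a small kana (such as ョ) is merged with the preceding character. The function names and the sample sentence are illustrative only.

```python
SMALL_KANA = set("ャュョァィゥェォ")


def split_units(katakana):
    """Split a katakana string into units, merging small kana with the
    preceding character (an assumed approximation of syllable counting)."""
    units = []
    for ch in katakana:
        if ch in SMALL_KANA and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def leading_fragment(katakana, n_units=8):
    """Return the first n_units units of the text as the collation target."""
    return "".join(split_units(katakana)[:n_units])


print(leading_fragment("ソーリダイジンワキョーコッカイデ"))  # ソーリダイジンワ
```

The fragment would then go through the same katakana-to-phoneme conversion as the full delimited text to produce the test phoneme string.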
[0018]
In the above invention, it is preferable that a weighting factor is assigned to each of the divided text data in accordance with its priority, and that the collation target texts are collated with the speech phoneme string in an order according to the weighting factors. It is also preferable that a collation target text that has already been matched is deleted, and that the weighting factors given to the collation target texts not yet matched are updated sequentially as the collation processing progresses.
[0019]
In this case, among the several factors that reduce accuracy, erroneous detection caused by brute-force text phoneme collation can be prevented. That is, in brute-force collation, a manuscript containing many similar passages is likely to cause erroneous detection. To increase the collation accuracy, the collation processing described above gives priority to the delimited texts in the order in which they appear in the manuscript. For example, if the manuscript is prepared in the order "The Cabinet today ..." and "The Prime Minister today ...", the text that appears earlier is given a higher priority than the text that appears later, so that false detection can be avoided.
[0020]
In the above invention, it is preferable that the collation target texts are collated with the speech phoneme string to determine whether or not they match, a predetermined number of matching collation target texts are output as detection candidates, the output detection candidates are then collated with the speech phoneme string again, and a detection result is output.
[0021]
In this case, the processing can be speeded up by the two-stage processing in which whole-sentence matching is performed only on the candidates detected in the primary matching, and the synchronization timing between the digitized manuscript text and the voice can be obtained in real time.
[0022]
In the above invention, it is preferable to hold a threshold value for comparing the degree of coincidence between phoneme strings, and adjust the matching accuracy by changing the threshold value.
[0023]
For example, at a news reporting site, unlike an ideal environment such as documentary narration, there is often much background noise; even in such a case, recognition can be performed with accuracy suited to the situation by adjusting the threshold value of the continuous DP.
[0024]
In the above invention, it is preferable that the manuscript data includes utterance status information relating to the utterance status of the text data, and the conversion speed is adjusted by changing the duration of the voice based on the utterance status information.
[0025]
In this case, for example, when a phoneme sequence is generated from text, the vowel duration can be shortened in accordance with the speaking speed, relative to the phoneme duration obtained from the utterance data of the standard ATR 503 sentences. As a result, even when the speaker is pressed for broadcasting time and speaks quickly, omission of detection can be prevented and high matching accuracy can be obtained.
[0026]
In the above invention, it is preferable to perform a warning process when the output text data corresponds to a predetermined character string. Thus, a warning can be issued for a specific utterance, so that an inappropriate utterance can be prevented from being broadcast.
[0027]
In the above invention, it is preferable that the detection result is stored as a collation log, material data including the audio signal is stored, and the material data is output from a desired position based on the text data included in the collation log and the position of that text data in the material data. Further, in the above invention, it is preferable that a keyword, which is a character string arbitrarily set by the user, is input as the document data, the detection result is accumulated as a collation log, the material data including the audio signal is accumulated, and the material data is output from a desired position based on the keyword included in the collation log and the position of the keyword in the material data.
[0028]
By providing such a user interface, for example, subtitles can be added to a broadcast video based on the manuscript data, or MPEG2 encoding can be performed in real time while an index is added to the video, so that the material data (video file) is stored. In addition, the detected timing, that is, the collation log (utterance text), can be saved in synchronization with the video as a file of meta information such as MPEG7, so that a desired scene can be displayed.
[0029]
As a result, functions such as a function of displaying text such as subtitles in accordance with a reproduced video, a function of displaying a video in which the text is uttered, and a function of displaying a desired video scene by searching are possible.
[0030]
BEST MODE FOR CARRYING OUT THE INVENTION
[First Embodiment]
(System configuration)
Hereinafter, a speech recognition system according to an embodiment of the present invention will be described in detail. FIG. 1 is a block diagram illustrating a schematic configuration of the speech recognition system according to the present embodiment.
[0031]
As shown in FIG. 1, the voice recognition system according to the present embodiment includes a storage PC 1, a verification PC 2, and a clock server 3 connected by a network 4.
[0032]
The storage PC 1 has a function of inputting a video signal and an audio signal to an MPEG2 encoder, converting them into a file as MPEG2 format digital video, and storing the file. It also acts as a server that holds files related to the system, such as the digitized manuscript for verification and the verification log file. The verification PC 2 has a function of taking in an audio signal from a microphone input of the PC, digitizing the audio signal, and performing audio processing.
[0033]
The clock server 3 is a server device that matches the times of the two PCs 1 and 2, and can use a reference clock server device or a standard clock server. If there is no need to match the absolute times, a clock server may not be provided and a function of synchronizing clocks between the two PCs 1 and 2 may be used instead.
[0034]
(Configuration of the storage PC 1)
As shown in FIG. 2, the storage PC 1 executes a video storage / audio verification result storage program 8 and a verification result reproduction program 10. The video storage / audio collation result storage program 8 has a function of storing manuscript data to be collated in the original database 9a, a function of working in conjunction with the voice detection / collation program 6 to digitize video and audio and store them as a digital video file in the database 9c, and a function of storing the result of the collation by the voice detection / collation program 6 as a collation log file in the collation log database 9b. The file names of the collation log file and the video file incorporate the year, month, day, hour, and minute, so that unique names are generated and managed automatically.
[0035]
The collation result reproduction program 10 is a program that uses the collation log file to confirm the time at which the utterance was made (used for debugging accuracy confirmation) and to display subtitles while playing back video.
[0036]
The content of the collation log file is composed of setting information such as a video file name to be linked, utterance text, a standard time when the utterance was made, and utterance information such as an elapsed time starting from the start of the voice detection collation program. The standard time serves as a reference for confirming what time, minute, and second the utterance occurred. The elapsed time is synchronized with the video file, and the time indicated by the time code can be searched for using the time.
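As an illustration only, the content of such a collation log could be modelled as follows. The class names, field names, and the `seek_offset` helper are hypothetical and do not come from the embodiment; the sketch merely shows how the elapsed time of a logged utterance could be used to cue the linked video file.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Utterance:
    text: str                # utterance text detected by the collation
    standard_time: datetime  # standard (wall-clock) time of the utterance
    elapsed: timedelta       # time since the voice detection collation program started


@dataclass
class CollationLog:
    video_file: str                                 # name of the linked video file (setting information)
    utterances: list = field(default_factory=list)  # utterance information, in order of detection

    def seek_offset(self, keyword):
        """Return the elapsed time of the first utterance containing keyword, or None."""
        for u in self.utterances:
            if keyword in u.text:
                return u.elapsed
        return None
```

Because the elapsed time is synchronized with the video file, the value returned by `seek_offset` could be passed directly to a player as the position from which playback starts.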
[0037]
(Configuration of verification PC2)
The verification PC 2 executes a voice detection verification program 6 and a verification result output program 7, as shown in FIG. The voice detection / collation program 6 is a program having a function of processing a voice based on document data and outputting a collation log as a collation result.
[0038]
The collation result output program 7 is a program that outputs the content of the utterance, in synchronization with the utterance, in a form suitable for the task. In the present embodiment, when the document data to be collated is a specific utterance word or a specific utterance sentence, warning processing such as sounding an alarm, turning on a patrol lamp, or giving voice guidance is performed to indicate that the word has been uttered. Further, the collation result output program 7 has a function adaptable to caption broadcasting: when the collation document is an announcer manuscript or script, the uttered sentence is displayed as a caption in accordance with the utterance.
[0039]
Here, the function of the voice detection / collation processing by the voice detection / collation program 6 executed on the collation PC 2 will be described. FIG. 3 is a block diagram illustrating the function of the voice detection and collation processing.
[0040]
As shown in the figure, when the voice detection / collation program 6 is executed on the collation PC 2, the voice input unit 601, the voice analysis unit 602, the voice phoneme conversion unit 603, the original / script input unit 604, the collation range determination unit 605, the text phoneme conversion unit 606, the utterance speed adjustment processing unit 607, the first detection / collation unit 608, the sensitivity adjustment control processing unit 609, the second detection / collation unit 610, and the collation result output unit 611 are virtually constructed. The configuration and function of each unit will be described for each process.
[0041]
(Voice input to voice phoneme conversion)
The voice input unit 601 is a module that captures, in the collation PC 2, the audio signal of an announcer, narrator, or performer obtained from audio 5a included in a transmission signal such as a live broadcast, or from a recording medium 5b such as a VTR, an LD, or a DVD, at 16 kHz (sampling rate) and 16 bits (quantization). At the same time as the start command is input to the voice input unit 601, the MPEG2 encoder of the storage PC 1 is activated, and the creation and storage of the video file start.
[0042]
The voice analysis unit 602 is a part that extracts feature amounts effective for recognition from the voice. When an audio signal is acquired as a one-dimensional signal sequence, there are two analysis methods: a method in which the temporal change of the acquired audio signal is sampled as an audio waveform and digitized as it is, as shown in FIG. 4, and a method in which the frequency components included in the audio signal are separated and extracted and each component is digitized, as shown in FIG. 5.
[0043]
The method of analyzing a speech signal using frequency components as shown in FIG. 5 is generally called spectrum analysis, and is the mainstream of current speech analysis methods. An advantage of spectrum analysis is that, whereas the time-domain waveform tends to fluctuate in response to changes in the external environment, the spectrum fluctuates relatively little, so that information characterizing the sound can be obtained easily. In the present embodiment, the voice analysis unit 602 performs voice analysis by the spectrum analysis method shown in FIG. 5 and extracts the feature amounts required for recognition. However, the present embodiment is only an example, and various voice analysis methods other than this one may be adopted.
[0044]
The speech phoneme conversion unit 603 is a module that extracts phonemes from speech and outputs the extracted phonemes. In the present embodiment, frame-level phoneme recognition based on a Bayesian discriminant function is used: from the speech feature amount input from the voice analysis unit 602 and the phoneme models acquired from the phoneme model dictionary 603a, this module outputs phoneme recognition results up to the N-th rank (N ≦ the number of phonemes) in frame units (one frame is 8 msec). The phoneme duration in the speech phoneme conversion is obtained from the phonetic symbol / continuation length correspondence table shown in Table 1.
[0045]
[Table 1]

The phoneme durations shown in Table 1 were obtained by analyzing the utterance data of the ATR phoneme-balanced sentences. The Japanese speech database set B for research provided by ATR (a sentence speech database) consists of labeled utterance data of the ATR phoneme-balanced sentences (503 sentences) read aloud by 10 speakers (male and female announcers and narrators), and serves as basic data for speech processing. In the present embodiment, this data is used as the phoneme model dictionary.
[0046]
(Original / script input to text phoneme conversion)
The original / script input unit 604 is a text data input unit for inputting text data including a character string. In the present embodiment, the original / script of the broadcast program is input as text data. If the text data is not digitized, the text data is digitized in the text input support system.
[0047]
The document / script input unit 604 reads a predetermined document file in a document / script folder in the document database 9a on the storage PC 1. The document file includes utterance status information corresponding to the type of broadcast program, such as the utterance speed level, the background sound level, and the environmental noise status, and utterance script information, which is text data.
[0048]
The utterance status information is data used for setting the level of the voice collation. Among these, the utterance speed level is described according to the content of the program: for example, a documentary program is marked as being spoken slowly, while in a drama program, scenes spoken quickly and scenes spoken slowly are each marked. The background sound level information describes, for example, outdoor shooting in the case of a news or documentary program, or scenes with a lot of background music in the case of a drama or movie program.
[0049]
The utterance speed adjustment processing unit 607 is a module that adjusts the utterance speed in the text phoneme conversion unit 606 according to the utterance status information included in the document file. The utterance speed adjustment processing unit 607 performs voice matching according to the utterance situation and the utterance environment, and can improve the accuracy of voice recognition.
[0050]
The collation range determination unit 605 is a module that outputs, to the text phoneme conversion unit 606, the text data of the item (chapter) to be uttered from the manuscript read by the original / script input unit 604. At this time, the collation range determination unit 605 determines a range consisting of the content of the item (chapter) about to be uttered and the leading character strings of the subsequent items, and outputs the text information (character strings) included in this range to the text phoneme conversion unit 606. In a broadcast program, the items to be uttered normally follow a predetermined order but may be rearranged according to the situation; the range is therefore a range predicted before the broadcast. Information on this range is held, and the range for each item is determined based on that information.
[0051]
The collation range determination by the collation range determination unit 605 according to the present embodiment will be described in more detail. Like an ordinary document, the manuscript data has a certain document structure. This structure is hierarchical: there are several large items, each large item has several medium items, and each medium item has several small items.
[0052]
The collation range determination unit 605 pays attention to the document structure and manages the document data for each segment obtained by segmenting the sentence for each utterance unit. Here, as an example of manuscript data, the structure of a news manuscript and the process from production to transmission of a news manuscript will be described.
[0053]
(1) Composition of news manuscript
Here, the structure of the document will be described. FIG. 6 is an explanatory diagram exemplifying a news manuscript of a news program as manuscript data. In this manuscript, the news is divided into several items in the hierarchy L1, and production management is performed. The layers L2 and L3 are associated with each other below the layer L1 to form a hierarchical structure.
[0054]
For example, broadcast news items include political information, international affairs, economic information, social information such as incidents and accidents, local news, weather information, and the like. News is sent out based on these items. The order begins with brief headlines and greetings (for example, "Good evening. This is the 7:00 pm news for July 7."); then the most topical of the news items becomes the top news, followed by political information, international affairs, economic information, social information, local news, and weather information (the order varies depending on circumstances such as topical events, festivals, and milestones). Also, when moving from the current item to the next item, guidance for the next item may be inserted, for example, "Tonight we will start with the news of the formation of the Cabinet." or "Next is the news of the earthquake." These item guides may be omitted depending on the time available.
[0055]
In the present embodiment, a news unit that is a group of pieces of information in the hierarchy L1 is called a news item. Also, depending on the news situation on the day of the broadcast, each item may be divided into several items, and these are called child items. The items (child items) derived in this manner are associated with the parent items of the upper hierarchy L1, and are managed in the lower hierarchy L2 and lower.
[0056]
One news item included in the hierarchy L1 is usually composed of about 400 characters of text (about 800 characters for long ones such as weather information), and is divided into about 25 segments (about 50 segments for a long one), each segment being the amount of text uttered in one breath. In the present embodiment, each such segment is referred to as a delimited text.
[0057]
Although news is taken up here, a manuscript or script for a drama or documentary has the same structure as the news items, with a hierarchy made up of several groups as described in this section.
[0058]
(2) Processing from production of news manuscript to transmission
In the production of a news manuscript, first, a manuscript is created by a due date based on the content collected by a reporter in charge of a news item section. The completed reporter's manuscript will be proofread by the desk in charge. The printed matter printed at the desk in charge will be the announcer manuscript and distributed to the news production department.
[0059]
Programs such as dramas and documentaries are recorded over time according to manuscripts or scripts prepared in advance. However, the news is live and timed. Depending on the progress of the news program, time adjustment within the program may be required. In such a situation, the production staff may manually edit the announcer manuscript such as deleting or adding a part. Therefore, in an actual broadcast, the announcer's utterance may not always coincide with a document digitized in advance. In a news broadcast, in order to provide information that is as fresh as possible, the order of initially scheduled items is often changed for reasons such as coverage and preparation of manuscripts. This change in the item order is reflected on the computer system that manages the document before the announcer reads the document, and thus does not affect the sound detection processing.
[0060]
(3) Determine collation range and assign priority
In the present embodiment, the manuscript data is divided into items according to the contents of the manuscript, the text data is divided according to these items, and a weighting factor corresponding to the priority is assigned to each divided text data. That is, as shown in FIG. 7, the upper hierarchy L1 contains n items Fi (i = 1, ..., n), and each item is composed of a plurality of delimited texts. A phoneme string is generated from each delimited text by the phoneme conversion processing. Here, the phoneme string corresponding to the entire i-th item is denoted Fi, and the phoneme string corresponding to the j-th delimited text in it is denoted Fij (i = 1, ..., n; j = 1, ..., mi).
[0061]
When the i-th item is about to be uttered, the collation range determination unit 605 operates as follows. In this range determination processing, the delimited texts in the item Fi are the highest-priority candidates. Since the broadcast may move on to another item during the utterance of the item because of broadcasting time or the like, the first delimited texts Fk1 (k = i + 1, ..., n) of the subsequent items are the next candidates.
[0062]
Assuming that the item Fi contains delimited texts Fij (j = 1, ..., m) and that the j-th delimited text is about to be uttered, candidate j has the highest priority, and the priority becomes lower in the order j + 1, j + 2, and so on. The priority is expressed as a numerical value (weight: w1, w2, w3, ...) and is reflected in the determination threshold level in the second detection / collation unit 610.
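The range determination described above can be sketched roughly as follows. The function name, the nested-list representation of the manuscript, and the 0-based indices are assumptions made for illustration; the embodiment itself only specifies which delimited texts become candidates and in what priority order.

```python
def collation_range(items, i, j):
    """Return the collation candidates in priority order (highest first).

    items: list of items, each a list of delimited texts (hypothetical layout)
    i, j:  0-based indices of the item and the delimited text about to be uttered
    """
    # Remaining delimited texts of the current item: j first, then j + 1, j + 2, ...
    candidates = list(items[i][j:])
    # The broadcast may jump to a later item, so the first delimited text
    # of every subsequent item is added as a lower-priority candidate.
    candidates += [item[0] for item in items[i + 1:] if item]
    return candidates


manuscript = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
print(collation_range(manuscript, 0, 1))  # ['a2', 'a3', 'b1', 'c1']
```

Weights such as w1, w2, w3, ... would then be assigned along this list, so that texts nearer the front are favoured in the later collation.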
[0063]
The text phoneme conversion unit 606 shown in FIG. 3 is a module that, as shown in steps S101 to S103 of the corresponding figure, first converts the kanji, kana, katakana, numbers, and numerical values mixed in the text into katakana, then obtains phonetic symbols from the katakana, and finally converts them into phoneme strings.
[0064]
The text phoneme conversion unit 606 converts the entire delimited text determined by the collation range determination unit 605 into a phoneme string. Further, a test phoneme sequence (a syllable segment from the head of the delimited text: eight syllables in the present embodiment) for performing the processing of the first detection / collation unit 608 at high speed is generated. FIG. 9 shows a specific sample of a text and a phoneme sequence. As shown in the figure, numerical values and the like need to be written in hiragana in order to cope with the variety of utterances.
[0065]
In the kanji-katakana conversion processing in the text phoneme conversion unit 606, text in which kanji and kana are mixed is subjected to morphological analysis (a technique of dividing a sentence into parts of speech), divided into parts of speech, and then converted into a character string composed entirely of katakana.
[0066]
(Example) 私は太郎です (I am Taro) ---> ワタシワタロウデス
In the katakana-phonetic symbol conversion processing in the text phoneme conversion unit 606, a character string composed of katakana is converted into a phonetic symbol string using the “Katakana-phonetic symbol correspondence table” in Table 2.
[0067]
[Table 2]

(Example) ワタシワ ----> watashiwa
In the phonetic symbol-phoneme string conversion processing in the text phoneme conversion unit 606, each phonetic symbol is repeated for its duration, using the phonetic symbol / duration correspondence table of Table 1 described above, to generate a phoneme sequence. Here, the duration is the continuation length of the phonetic symbol, and its unit is the frame. A frame is a unit obtained by cutting out the sampled audio signal (for example, sampling at 16 kHz yields 16000 data points per second) at equal intervals; when the signal is cut out every 8 milliseconds, the time length of one frame is 8 milliseconds.
[0068]

The numerical values in Table 1 indicate the number of frames.
[0069]
In this example, the durations of the symbols in the utterance "watashiwa" are w (7), a (10), t (2), a (10), sh (15), i (9), w (7), and a (10) frames, which accumulate to 70 frames, and 70 frames × 8 msec = 0.56 sec. That is, in standard utterance, "watashiwa" ("I am") is uttered in 0.56 seconds.
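The arithmetic of this example can be reproduced with the following minimal sketch. Only the five durations quoted in the text are included; the dictionary and function names are hypothetical, and the remaining entries of Table 1 are not reproduced here.

```python
# Durations in frames for the phonetic symbols of the "watashiwa" example.
# These five values are the ones quoted in the description.
DURATION_FRAMES = {"w": 7, "a": 10, "t": 2, "sh": 15, "i": 9}
FRAME_MSEC = 8  # one frame is 8 msec


def to_phoneme_sequence(symbols):
    """Repeat each phonetic symbol for its duration (one entry per frame)."""
    sequence = []
    for s in symbols:
        sequence.extend([s] * DURATION_FRAMES[s])
    return sequence


symbols = ["w", "a", "t", "a", "sh", "i", "w", "a"]
frames = to_phoneme_sequence(symbols)
print(len(frames), len(frames) * FRAME_MSEC / 1000)  # 70 0.56
```

The resulting list has one phoneme label per 8 msec frame, which is the same granularity as the frame-level output of the speech phoneme conversion, so the two sequences can be compared frame by frame.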
[0070]
The standard utterances were read by announcers under optimal conditions, in a slower tone than that of the news announcers of the commercial broadcasters; the utterance speeds differ by a factor of about 1.5. Therefore, in order to improve the accuracy of the first detection / collation unit 608, the utterance speed adjustment processing unit 607 makes use of the acoustic feature that a change in the utterance speed is mainly reflected in the length of the vowels (for example, vowels become shorter in fast utterance), and performs a process of adjusting the vowel durations at the stage of converting the manuscript into phonemes.
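One conceivable way of adjusting the vowel durations is sketched below. The linear scaling by a single rate factor, the rounding, and the one-frame minimum are assumptions made for illustration; the description states only that the vowel duration is adjusted according to the utterance speed.

```python
VOWELS = {"a", "i", "u", "e", "o"}


def adjust_durations(symbols, durations, rate):
    """Shorten vowel durations for faster speech; consonants are left unchanged.

    rate: assumed speed factor relative to the standard utterance (1.5 = 1.5 times faster)
    """
    adjusted = []
    for symbol, frames in zip(symbols, durations):
        if symbol in VOWELS:
            # Assumed linear scaling, never shorter than one frame.
            frames = max(1, round(frames / rate))
        adjusted.append(frames)
    return adjusted


symbols = ["w", "a", "t", "a", "sh", "i", "w", "a"]
standard = [7, 10, 2, 10, 15, 9, 7, 10]
fast = adjust_durations(symbols, standard, 1.5)
print(fast, sum(fast))  # [7, 7, 2, 7, 15, 6, 7, 7] 58
```

With a factor of 1.5, the 70-frame standard "watashiwa" shrinks to 58 frames in this sketch, because only the four vowels are shortened.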
[0071]
(Detection collation to collation result output)
The first detection / collation unit 608 compares, by continuous DP, the phoneme sequence of the input speech obtained by the speech phoneme conversion unit 603 with the group of text phoneme sequences in the collation range obtained from the text phoneme conversion unit 606, and obtains up to the top four candidates in ascending order of cumulative distance.
[0072]
Matching against all sentences in the manuscript would require so much computation that real-time processing would become impossible. Therefore, only the text of the target item found by the collation range determination unit 605 and the first sentences of the succeeding items are targeted, and the test phoneme strings obtained from those sentences are collated with the input speech phoneme string.
[0073]
The DP matching and the continuous DP in the present embodiment will be described below with reference to the figures. DP matching is an algorithm for measuring the similarity between two data strings. Here, it is assumed that there are two data strings R and Q. The data string R is composed of data r1, r2, r3, ..., rm, and the data string Q is composed of data q1, q2, q3, .... In the figure, the horizontal axis represents the data string R, and the vertical axis represents the data string Q. First, a distance value (the inverse of closeness) between the data is obtained at every grid point. For example, the grid point P holds the distance value between the data r2 and the data q3. Next, the start point S and the end point E are connected by a line passing through grid points (this is called a path), and the distance values of the grid points on the path are added up to obtain the cumulative distance of the path. Among all paths, the path having the smallest cumulative distance is selected (this path is called the optimal path). Further, the cumulative distance is normalized (divided by the length of the path or by the length of the vertical axis). The smaller the normalized cumulative distance (hereinafter simply called the cumulative distance), the greater the similarity between the data strings.
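A minimal sketch of DP matching is given below. The 0/1 distance between symbols, the three permitted path steps (right, up, diagonal), and normalization by the length of Q (the vertical axis) are assumptions chosen for simplicity; the embodiment does not specify the local path constraints.

```python
def dp_matching(r, q, dist=lambda a, b: 0.0 if a == b else 1.0):
    """Normalized cumulative distance of the optimal path between r and q (q non-empty)."""
    m, n = len(r), len(q)
    inf = float("inf")
    # d[i][j]: smallest cumulative distance of a path from the start to grid point (i, j)
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = dist(r[i - 1], q[j - 1]) + min(
                d[i - 1][j],      # step to the right along R
                d[i][j - 1],      # step upward along Q
                d[i - 1][j - 1],  # diagonal step
            )
    return d[m][n] / n  # normalize by the length of the vertical axis


print(dp_matching("aab", "ab"))  # 0.0
```

In the first call the repeated "a" is absorbed by a horizontal step, so the strings are judged identical despite their different lengths, which is the property that makes the method tolerant of differences in phoneme duration.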
[0074]
Continuous DP is an algorithm that extends DP matching and checks whether there is a section similar to the input data string in the data string to be searched.
[0075]
It is assumed that the search target data string R is composed of data r1, r2, r3, ..., rm, and the input data string Q is composed of data q1, q2, q3, .... In FIG. 11, the data string R is plotted on the horizontal axis, and the data string Q on the vertical axis. A similar section is obtained as follows. First, the optimal path ending at a certain point is obtained (in the figure, the start point is S1 and the end point is E1), and the cumulative distance D1 of this path is obtained. Next, the end point is shifted to the right by one unit (one data item) to the end point E2, and the optimal path and its cumulative distance D2 are obtained. This is repeated until the end of the data string. The section of the path with the smallest cumulative distance is the section most similar to the input data string. For example, if the path SE has the smallest cumulative distance in the figure, the section K is the section most similar to the input data string.
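The end-point sweep described above can be sketched as follows. As in ordinary DP matching, the 0/1 symbol distance, the permitted path steps, and the normalization by the length of Q are assumptions; the essential difference is that a path may start at any position of R at no cost, and one cumulative distance is emitted per end point.

```python
def continuous_dp(r, q, dist=lambda a, b: 0.0 if a == b else 1.0):
    """Return, for every end point in r, the normalized cumulative distance
    of the best path that matches all of q and ends at that point."""
    n = len(q)
    inf = float("inf")
    prev = [0.0] + [inf] * n  # column before any data of r has arrived
    curve = []
    for x in r:
        cur = [0.0] + [inf] * n  # cur[0] = 0: a path may start at this position for free
        for j in range(1, n + 1):
            cur[j] = dist(x, q[j - 1]) + min(prev[j], cur[j - 1], prev[j - 1])
        curve.append(cur[n] / n)
        prev = cur
    return curve


curve = continuous_dp("xxabcxx", "abc")
print(curve.index(min(curve)), min(curve))  # 4 0.0
```

Only the previous column is kept, so the computation can proceed one input datum at a time, which suits processing of a continuous stream.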
[0076]
If the horizontal axis is the end point position and the vertical axis is the cumulative distance, a graph as shown in FIG. 12 is obtained. In the present embodiment, this graph is referred to as a cumulative distance curve. In this cumulative distance curve, a threshold value is set, and the point where the cumulative distance is equal to or less than the threshold value and is minimal is the end point of the similar section candidate. In the case of FIG. 12, the end points E1 and E correspond to this, and two sections ending with these two end points are candidates for similar sections. Since the cumulative distance at E is smaller than E1, the section ending at E (section K in FIG. 11) is detected as a similar section.
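The selection of similar section candidates from a cumulative distance curve can be sketched as below. The function names and the sample curve are hypothetical, and a local minimum is taken here to mean a point strictly lower than its left neighbour and no higher than its right neighbour, which is one possible reading of "minimal".

```python
def similar_section_candidates(curve, threshold):
    """End points where the cumulative distance is a local minimum and <= threshold."""
    candidates = []
    for t in range(1, len(curve) - 1):
        is_minimum = curve[t] < curve[t - 1] and curve[t] <= curve[t + 1]
        if is_minimum and curve[t] <= threshold:
            candidates.append(t)
    return candidates


def detect_similar_section(curve, threshold):
    """End point of the candidate with the smallest cumulative distance, or None."""
    candidates = similar_section_candidates(curve, threshold)
    return min(candidates, key=lambda t: curve[t]) if candidates else None


sample = [0.9, 0.6, 0.4, 0.5, 0.8, 0.3, 0.1, 0.4]  # hypothetical curve
print(similar_section_candidates(sample, 0.5))  # [2, 6]
print(detect_similar_section(sample, 0.5))      # 6
```

In this sample the two candidates play the roles of the end points E1 and E: both lie below the threshold, and the one with the smaller cumulative distance is selected.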
[0077]
The sensitivity adjustment control processing unit 609 adjusts the determination threshold of the continuous DP to deal with erroneous detection or omission of detection. The sensitivity is given as a weight, and adjusts the determination threshold (in FIG. 12) in whole or in part. The smaller the weight is, the closer the accumulated distance is to the threshold, and the easier it is to detect.
[0078]
The second detection / collation unit 610 then performs collation by continuous DP on the four target text candidates obtained by the first detection / collation unit 608 at the preceding stage, using the phoneme sequence of the entire target text and the phoneme sequence of the input speech. Since there are four target texts, four continuous DPs are performed simultaneously. When a similar section is detected in any of the four continuous DPs, the text having the minimum continuous-DP cumulative distance is taken as the detected text. The four texts are given weighting factors w1, w2, w3, and w4 in that order in consideration of their order of appearance in the manuscript (1.0 = w1 < w2 < w3 < w4). However, if the weighting factors are chosen so as to enforce the order of appearance too strictly, it becomes impossible to follow changes in the utterance content and the like. In FIG. 7, a text having a weight of zero is treated as outside the collation range by the collation range determination unit 605. Because the cumulative distance is multiplied by the weighting factor, the earlier a text appears, the more easily it is detected.
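The effect of the weighting factors can be illustrated with the following sketch. The concrete weight values, distances, and threshold are hypothetical; the only property taken from the description is 1.0 = w1 < w2 < w3 < w4 and the multiplication of the cumulative distance by the weight.

```python
def pick_detected_text(cumulative, weights, threshold):
    """Choose the text whose weighted cumulative distance is smallest and <= threshold.

    cumulative: dict mapping text id -> normalized cumulative distance of its best section
    weights:    dict mapping text id -> weighting factor (1.0 for the earliest text)
    """
    weighted = {k: cumulative[k] * weights[k] for k in cumulative}
    hits = {k: v for k, v in weighted.items() if v <= threshold}
    return min(hits, key=hits.get) if hits else None


distances = {1: 0.30, 2: 0.28, 3: 0.60, 4: 0.90}  # hypothetical values
weights = {1: 1.0, 2: 1.2, 3: 1.4, 4: 1.6}        # hypothetical, w1 < w2 < w3 < w4
print(pick_detected_text(distances, weights, 0.4))  # 1
```

Text 2 has the smaller raw distance here, but after weighting text 1 is preferred because it appears earlier in the manuscript. Weights closer to 1.0 would let the collation follow out-of-order utterances more readily.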
[0079]
A specific example of the processing in the second detection / collation unit 610 will be described below. At the start of collation, the cumulative distances of the four texts are all equal to or larger than the threshold, as shown in the figure. As time advances, suppose that the cumulative distance of text 1 becomes equal to or less than the threshold at a certain point in time; from that point, as shown in the figure, the search for a similar section candidate begins.
[0080]
When the time is further advanced and a similar section candidate of text 1 is found (the cumulative distance curve has become minimal), this point is set to point B1 as shown in FIG.
[0081]
When the time is advanced and a new similar section candidate for text 1 is found and the cumulative distance is smaller than point B1, this point is set as a new point B2 as shown in FIG.
[0082]
A similar section candidate is found for other texts, and if the cumulative distance is smaller than points B1 and B2, it is set as a new B3 point and this text is set as a detected text candidate. In FIG. 17, text 3 is a detected text candidate.
[0083]
Then, if no new point B is found within a fixed time L (a delay time of, for example, 1 second) after the point B3, the text candidate having the minimum cumulative distance at that time is taken as the detected text, and the point B3 is set as the end point of the similar section, as shown in the figure.
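The updating of point B and the finalization after the delay time L can be sketched as follows. The input format (a time-ordered list of similar section candidates that have already passed the threshold) and the function name are assumptions made for illustration.

```python
def finalize_detection(candidates, delay=1.0):
    """Track point B and return (text_id, end_time) of the detected text, or None.

    candidates: (time_sec, text_id, cumulative_distance) tuples in ascending time order,
                one per similar section candidate found below the threshold
    delay:      fixed time L to wait for a better candidate (1 second in the example)
    """
    best = None  # current point B as (time, text_id, distance)
    for t, text_id, d in candidates:
        if best is not None and t - best[0] > delay:
            break  # no new point B appeared within L, so the current one stands
        if best is None or d < best[2]:
            best = (t, text_id, d)  # new point B with a smaller cumulative distance
    return (best[1], best[0]) if best else None


events = [(0.2, 1, 0.35), (0.5, 1, 0.30), (0.9, 3, 0.22), (2.5, 2, 0.10)]
print(finalize_detection(events))  # (3, 0.9)
```

Here points B1 and B2 belong to text 1 and B3 to text 3; the candidate at 2.5 seconds arrives more than one second after B3 and therefore does not affect the result. A real-time implementation would finalize on a timer rather than on the arrival of the next candidate.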
[0084]
The collation result output unit 611 is an external output interface that outputs the detection result of the second detection / collation unit 610 to another program such as the collation result output program 7 and the video storage / audio collation result storage program 8.
[0085]
(Collation processing)
The collation processing according to the present embodiment is executed in two stages of a first detection / collation unit 608 and a second detection / collation unit 610. FIG. 19 is a flowchart illustrating the matching process according to the present embodiment.
[0086]
First, a voice is input by the voice input unit 601 (S206). The input voice is analyzed by the voice analysis unit 602 (S207), converted into speech phonemes by the voice phoneme conversion unit 603 (S208), and stored in the speech phoneme buffer (S209). Note that writing to the speech phoneme buffer in the present embodiment is performed in frame units (8 msec).
[0087]
Meanwhile, the manuscript or script to be collated is input as digitized data from the manuscript / script input unit 604 (S201), and the collation range determination unit 605 extracts delimited texts based on the structure of the manuscript (S202). The text of the news item to be uttered in the broadcast (the delimited texts in the item) and the head sentence of the following item are converted into text phonemes by the text phoneme conversion unit 606 (S204) and stored in the text phoneme buffer (S205). During the text phoneme conversion in step S204, utterance speed adjustment processing is applied as appropriate to cope with rapid speech (S203). Each entry stored in the text phoneme buffer consists of a delimited text, its phoneme sequence, and a test phoneme sequence for high-speed detection (a fragment of syllables from the beginning of the delimited text phoneme sequence: eight syllables in this device).
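One way to picture an entry of the text phoneme buffer is the sketch below. The class and function names, and the representation of a phoneme sequence as a list of romanized syllables, are assumptions for illustration; only the three fields and the eight-syllable length come from the description.

```python
from dataclasses import dataclass
from typing import List, Sequence

TEST_SYLLABLES = 8  # length of the test phoneme sequence in this device


@dataclass
class TextPhonemeEntry:
    text: str                  # the delimited text
    syllables: List[str]       # its full phoneme (syllable) sequence
    test_syllables: List[str]  # leading fragment used for high-speed detection


def make_entry(text: str, syllables: Sequence[str]) -> TextPhonemeEntry:
    """Build a buffer entry; the test sequence is the first eight syllables."""
    return TextPhonemeEntry(
        text=text,
        syllables=list(syllables),
        test_syllables=list(syllables[:TEST_SYLLABLES]),
    )
```

Texts shorter than eight syllables simply yield a shorter test sequence in this sketch; how the device handles that case is not stated.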
[0088]
The text phonemes stored in the text phoneme buffer are then subjected to detection / collation processing in the first detection / collation unit 608 against the speech phonemes stored in the speech phoneme buffer (S210). Specifically, continuous DP detects texts with a small DP cumulative distance, that is, texts with a high degree of similarity. In the present embodiment, as shown in steps S210 and S213, the continuous DP matching is organized in two stages: the first stage corresponds to the first detection / collation unit 608 and the second stage to the second detection / collation unit 610.
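As a rough illustration of how a cumulative distance curve arises, the sketch below implements a generic start-free (spotting) dynamic programming match of a reference phoneme sequence against an input stream. It is not the DP path of FIG. 11: the three-way local path, the 0/1 local distance between phoneme symbols, and the normalization by reference length are all simplifying assumptions, and the name `continuous_dp` is hypothetical.

```python
from typing import List, Sequence


def continuous_dp(reference: Sequence[str], stream: Sequence[str]) -> List[float]:
    """Return the normalized cumulative distance after each input frame.

    A match may start at any frame (free start), so a low value at frame t
    means the reference appears to end at t somewhere in the stream.
    """
    inf = float("inf")
    n = len(reference)
    prev = [inf] * n
    curve = []
    for symbol in stream:
        cur = [0.0] * n
        for j, ref_symbol in enumerate(reference):
            local = 0.0 if symbol == ref_symbol else 1.0
            if j == 0:
                # Free start: beginning a new match here costs nothing extra.
                cur[j] = local
            else:
                cur[j] = local + min(prev[j], prev[j - 1], cur[j - 1])
        curve.append(cur[-1] / n)
        prev = cur
    return curve
```

For the reference `a b c` and the stream `x a b c x`, the curve falls to zero at the frame where `c` arrives and rises again afterwards, which is the kind of minimum the detection stages look for.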
[0089]
First, in the first detection / collation unit 608 of the first stage, about 50 delimited texts are compared, and one continuous DP is run for each of them. To perform the matching in real time, all of these text phonemes must be processed within 8 msec. The first detection / collation unit 608 therefore performs its processing at high speed using the test phoneme sequences described above.
[0090]
Since the broadcast sound source includes background music and the like, it is difficult to distinguish voice sections from non-voice sections accurately. Moreover, content uttered in a voice section may not appear in the manuscript prepared in advance; content such as live relays, for example, is not included in it. Because such a speech phoneme sequence is not similar to any test phoneme sequence, the first-stage continuous DP skips the dissimilar portion and fetches the next speech phoneme sequence from the speech phoneme buffer.
[0091]
Note that the first-stage collation covers only about eight syllables, so if, for example, four sentences begin with "Prime Minister", all of them become candidates. However, based on the priority assigned when the collation range was determined in step S202, these four candidates are weighted according to item order rather than treated as equally probable, which prevents erroneous detection of the text that follows "Prime Minister".
[0092]
The loop process is repeated until the number of candidates becomes four based on the result of collation with these test phoneme strings (S212). That is, if the test phoneme sequence and the input phoneme sequence match in step S212, 1 is added to i, the next test phoneme sequence is obtained from the text phoneme buffer, and step S210 is executed. On the other hand, if the test phoneme sequence does not match the input phoneme sequence in step S212, a voice phoneme is obtained from the voice phoneme buffer, and the collation with the current test phoneme sequence is repeated in step S210. This process is repeated until i becomes 4.
[0093]
Then, four candidates having a high degree of similarity to the voice phoneme are obtained from these test phoneme strings, and the process proceeds to the next stage of the second detection / collation unit 610 (S213). In the process of the second stage, the second detection / collation unit 610 performs a continuous DP process of the delimited text phoneme sequence corresponding to the test phoneme sequence candidate in the first stage and the speech phoneme sequence. Since the continuous DP has already been activated for a part of the delimited text phoneme sequence, the continuous DP is activated by taking over this information.
[0094]
This processing is performed for each frame (8 msec); the cumulative distance at each frame is obtained, yielding a cumulative distance curve, and the minimum of this curve is found. To determine whether a minimum is merely local or is the global minimum, the system waits for a fixed period (for example, 1 second); if no new minimum is found in that period, the delimited text with the smallest minimum (the best match) becomes the detected text (S214).
[0095]
Display processing (S215) is then performed on the detected text. For example, the detected text data is output to another application such as the collation result reproduction program 10. Subtitles can thereby be broadcast by a subtitle device, and a new type of video content can be created in an MPEG7 format storage device.
[0096]
Next, the process proceeds to the next delimited text in the item (S216). At this time, it is determined whether or not there is a head delimited text of the next item or a subsequent item, and in a case where a transition is made to a new item (“Yes” in step S216), the process returns to step S202 and the collation is performed. The processing from the determination of the range to the accumulation in the text phoneme buffer (S202 to S205) is executed.
[0097]
On the other hand, if it is determined in step S216 that there is no transition to the next item ("No" in step S216), the matched text is deleted from the text phoneme buffer (S217), and it is determined whether the text phoneme buffer has become empty (S218). If it is empty ("Yes" in step S218), the process returns to step S202, and the processing from determining the collation range to storing in the text phoneme buffer (S202 to S205) is executed. If it is not empty ("No" in step S218), the processing in steps S210 to S216 is performed.
[0098]
[Second embodiment]
Next, a second embodiment of the present invention will be described. The present embodiment is an example in which the above-described speech recognition system is applied to a specific utterance detection archive system. FIG. 20 is a block diagram illustrating a configuration of the specific utterance detection archive system according to the present embodiment.
[0099]
As shown in FIG. 20, the specific utterance detection archive system according to the present embodiment includes a specific utterance detection system 21 executed by the verification PC 2, a detection result output system 22, a specific utterance detection archive system 11 executed by the storage PC 1, a specific keyword database 9d, a collation log database 9b, an MPEG2 database 9e, and an audio processing and playback system 12.
[0100]
The detection result output system 22 is a system that sequentially displays the detection results. The audio processing / reproduction system 12 is a system that reproduces the corresponding MPEG2 file from a collation log file, displays the collated text on a screen in step with the reproduction time, and displays the scene corresponding to the text. The specific utterance detection system 21 incorporates the voice detection / collation program 6 described in the first embodiment as a search engine, and has a function of searching a video file for keywords specified by the user in place of the manuscript file described above.
[0101]
Such an operation for the archive system can be performed via an interface displayed on the screen of the verification PC 2. FIG. 21 is a configuration diagram showing an operation screen as a user interface of the archive system.
[0102]
First, the specific utterance detection archive system is activated. Next, the text data to be collated is read, and an MPEG2 file of the archive system is created.
[0103]
Next, a keyword to be searched is input in a text box TB1 on the operation screen. Twenty keywords can be entered per page. In the text box TB1, a keyword can be entered directly, or a keyword group can be read from the specific keyword database 9d and the loaded keywords edited. In this embodiment, a check box CB1 is provided for each text box TB1, so that any of the entered keywords can be selected as detection targets.
In this embodiment, a track bar TBR1 is provided in correspondence with each text box TB1, and the sensitivity to each keyword is set by operating each track bar TBR1. The sensitivity is a threshold value of the matching distance at the time of detection, and ranges from 0.0 to 5.0, and the standard threshold value is 2.5.
[0104]
Further, in the present embodiment, a track bar TBR2 is provided for each text box TB1, and operating this track bar TBR2 adjusts the utterance speed of the keyword. The range of 0.5 to 2.0 times covers speech from very slow to considerably fast. A factor of 1.0 corresponds to standard speech.
[0105]
In this embodiment, the operation screen also provides a text box TB2 for setting the minimum interval (unit: second) for detecting a keyword, a button B1 for reading an utterance list file from the PC, a button B2 for writing the entered or edited keywords and conditions such as the sensitivity and utterance speed of each keyword to the utterance list file, a button B3 for sorting keywords, a check box CB2 for enabling utterance output corresponding to the detected keyword, a button B4 for starting processing, and a button B5 for ending processing.
[0106]
Further, the operation screen is provided with a track bar TBR3 for adjusting the overall sensitivity. In the present embodiment, its adjustment range is -2.5 to 2.5. The overall sensitivity is added to the sensitivity of each keyword, so the effective sensitivity of a keyword can range from -2.5 to 7.5. A track bar TBR4 for adjusting the overall utterance speed is also provided, with an adjustment range of 0.5 to 2.0 times. The overall speed is multiplied by the speed of each keyword, so the effective speed of a keyword can range from 0.25 to 4.0 times.
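The combination of per-keyword and overall settings amounts to one addition and one multiplication. The sketch below checks the stated ranges and returns the effective values; the function name and the decision to reject out-of-range inputs with `ValueError` are assumptions for illustration.

```python
from typing import Tuple


def effective_settings(
    keyword_sensitivity: float,
    keyword_speed: float,
    overall_sensitivity: float = 0.0,
    overall_speed: float = 1.0,
) -> Tuple[float, float]:
    """Combine per-keyword and overall track bar values.

    Sensitivities are added (0.0..5.0 plus -2.5..2.5 gives -2.5..7.5);
    speeds are multiplied (0.5..2.0 times 0.5..2.0 gives 0.25..4.0).
    """
    if not 0.0 <= keyword_sensitivity <= 5.0:
        raise ValueError("keyword sensitivity must be in 0.0..5.0")
    if not -2.5 <= overall_sensitivity <= 2.5:
        raise ValueError("overall sensitivity must be in -2.5..2.5")
    if not 0.5 <= keyword_speed <= 2.0:
        raise ValueError("keyword speed must be in 0.5..2.0")
    if not 0.5 <= overall_speed <= 2.0:
        raise ValueError("overall speed must be in 0.5..2.0")
    return (
        keyword_sensitivity + overall_sensitivity,
        keyword_speed * overall_speed,
    )
```

The extremes of the inputs reproduce the ranges given in the text: the largest settings yield (7.5, 4.0) and the smallest yield (-2.5, 0.25).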
[0107]
Then, the detection result is displayed in the list box LB1. In the figure, the columns show, from the left, "absolute detection time", "time since processing started (hour: minute: second)", "keyword utterance time (unit: second)", and the detected keyword character string. The data displayed in the list box LB1 is stored as a log file in the collation log database 9b.
[0108]
The collation log generated in this manner is read by the detection result output system 22 as a log file. At this time, the detection result output system also reads the MPEG file corresponding to the log file. The detection result output system 22 performs printing of a log file, cue reproduction based on an index, sorting of log data (time, similarity, keyword order), and the like.
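Sorting the log data by time, similarity, or keyword could look like the sketch below. The record layout is hypothetical: the `distance` field (used here as the similarity measure, smaller meaning more similar) is not one of the four list box columns, and "keyword order" is assumed to mean alphabetical order of the keyword string.

```python
from dataclasses import dataclass
from operator import attrgetter
from typing import List


@dataclass
class LogEntry:
    detected_at: str      # absolute detection time (string format assumed)
    elapsed_sec: float    # time since processing started, in seconds
    utterance_sec: float  # utterance time of the keyword, in seconds
    keyword: str          # detected keyword character string
    distance: float       # matching distance (assumed field)


SORT_KEYS = {
    "time": attrgetter("elapsed_sec"),
    "similarity": attrgetter("distance"),
    "keyword": attrgetter("keyword"),
}


def sort_log(entries: List[LogEntry], key: str) -> List[LogEntry]:
    """Return a new list of log entries sorted by the chosen criterion."""
    return sorted(entries, key=SORT_KEYS[key])
```

Because `sorted` is stable, entries that tie on the chosen key keep their original (detection) order.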
[0109]
[Third embodiment]
Next, a third embodiment of the present invention will be described. The present embodiment is an example in which the above-described voice recognition system is applied to a voice indexing system based on a document. FIG. 22 is a block diagram showing a configuration of the audio indexing system according to the present embodiment.
[0110]
As shown in FIG. 22, the indexing system according to the present embodiment includes a voice indexing system 23 executed by the verification PC 2, a detection result output system 22, an audio indexing archive system 13 executed by the storage PC 1, a manuscript database 9a, a collation log database 9b, an MPEG2 database 9e, and an audio processing / reproduction system 12.
[0111]
The detection result output system 22 is a system that sequentially displays the detection results. The audio processing / reproduction system 12 is a system that reproduces a corresponding MPEG2 file from a collation log file, displays text collated according to the reproduction time on a screen, and displays a scene from the text.
[0112]
The voice indexing system 23 incorporates the voice detection / collation program 6 described in the first embodiment as a search engine, and has a function of searching a video file for the text in a manuscript file, based on the manuscript file described above.
[0113]
The operation for such an indexing system can be performed through an interface displayed on the screen of the verification PC 2. FIG. 23 is a configuration diagram showing an operation screen which is a user interface of the indexing system.
[0114]
As shown in the figure, a list box LB2 for displaying the input document is provided on the operation screen. In the present embodiment, the text detected in the list box LB2 is displayed in red.
[0115]
The operation screen is provided with a text box TB3 for specifying the number of sentences to be processed at one time during detection, a text box TB4 for setting a weight for a detected sentence, and a text box TB5 for setting a detection delay time.
[0116]
In the text box TB4, for example, when the weight coefficient is 0.4, the weight of the first sentence is 1.0, the weight of the next sentence is 1.4, and the weight of the next sentence is 1.44. The higher the weight, the lower the detection sensitivity. In the text box TB5, when a new sentence is detected, the similarity is compared with the sentence detected immediately before (within the detection delay time). If the similarity is larger, the sentence is determined as an output candidate. If there is no new detection sentence within the detection delay time, the previous detection sentence is output to the log.
[0117]
Then, the log of the detection result is displayed in the list box LB1. The columns of the list box LB1 show, from the left, "absolute detection time", "time since the start of processing (hour: minute: second)", "utterance time (unit: second) of the delimited text", and the detected delimited text.
[0118]
According to such an indexing system, the delimited text extracted from the manuscript file is used as a keyword, the time at which the corresponding keyword is uttered is displayed as a collation log in the list box LB1, and this list is stored as a collation log file in the collation log database 9b.
[0119]
The collation log file generated in this way is read by the detection result output system 22. At this time, the detection result output system also reads the MPEG file corresponding to the log file. The detection result output system 22 performs printing of a log file, cue reproduction based on an index, sorting of log data (time, similarity, keyword order), and the like.
[0120]
[Fourth embodiment]
In addition, the speech recognition system and method according to the above-described embodiment and its application example can be a program described in a predetermined computer language. That is, by installing this program on a computer such as a user terminal or a Web server, or on an IC chip, a voice detection matching program or a matching result output program having the above-described functions can be easily constructed. This program can be distributed through a communication line, for example, and can be transferred as a package application that runs on a stand-alone computer.
[0121]
Such a program can be recorded on recording media 116 to 119 readable by the general-purpose computer 120, as shown in FIG. More specifically, as shown in the figure, various recording media can be used, including magnetic recording media such as a flexible disk 116 and a cassette tape 119, optical disks such as a CD-ROM and a DVD-ROM 117, and a recording medium such as a RAM card 118. This embodiment has the feature that a link can be provided to content on a non-writable CD-ROM or DVD-ROM 117.
[0122]
With a computer-readable recording medium on which the program is recorded, the speech recognition system and method described above can be implemented on a general-purpose or special-purpose computer, and installation can be performed easily.
[0123]
[Effects of the invention]
As described above, according to the present invention, speech uttered during a broadcast can be detected accurately and in real time using existing speech recognition technology. Using the detection results, various services can be realized, such as adding subtitles to the broadcast video based on the manuscript, displaying an image corresponding to the manuscript passage being spoken, and displaying a desired video scene through a keyword search, so that universal services for everyone become possible.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a schematic configuration of a speech recognition system according to a first embodiment.
FIG. 2 is a block diagram illustrating an internal structure and a relationship of a verification PC and a storage PC according to the first embodiment.
FIG. 3 is a block diagram illustrating functions of a voice detection / collation program according to the first embodiment.
FIG. 4 is a graph showing a time waveform of an audio signal according to the first embodiment.
FIG. 5 is a graph showing a spectrum waveform of an audio signal according to the first embodiment.
FIG. 6 is an explanatory diagram showing a structure of a news manuscript according to the first embodiment.
FIG. 7 is an explanatory diagram showing descriptions of items inside a document according to the first embodiment.
FIG. 8 is a flowchart illustrating a process in a text phoneme conversion unit according to the first embodiment.
FIG. 9 is an explanatory diagram of a text and a phoneme sequence according to the first embodiment.
FIG. 10 is a path diagram showing a DP path in DP matching according to the first embodiment.
FIG. 11 is a path diagram showing a DP path in the continuous DP matching according to the first embodiment.
FIG. 12 is a cumulative distance curve diagram in continuous DP matching according to the first embodiment.
FIG. 13 is a cumulative distance curve diagram in continuous DP matching according to the first embodiment.
FIG. 14 is a cumulative distance curve diagram in continuous DP matching according to the first embodiment.
FIG. 15 is a cumulative distance curve diagram in continuous DP matching according to the first embodiment.
FIG. 16 is a cumulative distance curve diagram in continuous DP matching according to the first embodiment.
FIG. 17 is a cumulative distance curve diagram in continuous DP matching according to the first embodiment.
FIG. 18 is a cumulative distance curve diagram in continuous DP matching according to the first embodiment.
FIG. 19 is a flowchart illustrating a matching process according to the first embodiment.
FIG. 20 is a block diagram illustrating a configuration of a specific utterance detection system according to a second embodiment.
FIG. 21 is a configuration diagram illustrating an operation screen of an interface according to the second embodiment.
FIG. 22 is a block diagram illustrating a configuration of a voice indexing system according to a third embodiment.
FIG. 23 is a configuration diagram illustrating an operation screen of an interface according to the third embodiment.
FIG. 24 is a perspective view showing a computer-readable recording medium on which a program according to a fourth embodiment is recorded.
[Explanation of symbols]
1 ... Storage PC
2 ... Verification PC
3 ... Clock server
4 ... Network
5a ... Voice
5b ... Recording medium
6 ... Sound detection collation program
7 ... Verification result output program
8 ... Voice collation result storage program
9a ... Manuscript database
9b ... Collation log database
9c ... Video file database
9d ... Specific keyword database
9e ... MPEG2 database
10 ... Verification result reproduction program
11 ... Archive system for specific utterance detection
12 ... Voice processing and playback system
13 ... Archive system for voice indexing
21 ... Specific utterance detection system
22 ... Detection result output system
23 ... Sound indexing system
116 ... Flexible disk
117 ... ROM
118 ... RAM card
119 ... Cassette tape
120 ... General purpose computer
601 ... Voice input unit
602 ... Voice analysis unit
603 ... Voice phoneme conversion unit
603a ... Phoneme model dictionary
604 ... Manuscript / script input unit
605 ... Collation range determination unit
606 ... Text phoneme conversion unit
607 ... Utterance speed adjustment processing unit
608 ... First detection / collation unit
609 ... Sensitivity adjustment control processing unit
610 ... Second detection / collation unit
611 ... Collation result output unit

Claims

A speech recognition system comprising:
an audio input unit for inputting an audio signal;
a manuscript data input unit for inputting manuscript data including text data;
a voice phoneme conversion unit that converts the voice input from the audio input unit into a voice phoneme sequence;
a text phoneme conversion unit that converts the text data input from the manuscript data input unit into a text phoneme sequence;
a collation unit that collates the voice phoneme sequence with the text phoneme sequence to determine whether they match; and
a collation result output unit that outputs, as a detection result, the text data corresponding to the matching phoneme sequence when the voice phoneme sequence matches the text phoneme sequence.

The speech recognition system according to claim 1, wherein the manuscript data is divided into items according to the contents of the manuscript, and
the system further comprises a collation range determining unit that divides the text data according to the items, determines a range of a leading character string of each divided text data, and extracts the character string in that range as a collation target text.

The speech recognition system according to claim 2, wherein each of the divided text data is given a weight coefficient according to its priority, and
the collation range determination unit outputs the collation target texts to the text phoneme conversion unit in an order according to the weight coefficients, and the collation unit collates the phonemes of the collation target texts with the voice phonemes.

The speech recognition system according to claim 3, wherein the collation unit, as the collation processing progresses, deletes collation target texts that have been collated and sequentially varies the weighting factors assigned to collation target texts that have not yet been collated.

The speech recognition system according to claim 1, wherein the collation unit comprises:
a first detection / collation unit that collates the collation target texts with the voice phoneme sequence to determine whether they match, and outputs a predetermined number of the matching collation target texts as detection candidates; and
a second detection / collation unit that collates the detection candidates output from the first detection / collation unit with the voice phoneme sequence to determine whether they match, and outputs the detection result.

The speech recognition system according to claim 1, further comprising a sensitivity adjustment control processing unit that holds a threshold value for comparing the degree of coincidence between phoneme sequences and adjusts the collation accuracy of the collation unit by changing the threshold value.

The speech recognition system according to claim 1, wherein the manuscript data includes utterance status information on the utterance status of the text data, and
the system further comprises an utterance speed adjustment processing unit that adjusts the conversion speed in the text phoneme conversion unit by changing the duration of speech based on the utterance status information.

The speech recognition system according to claim 1, wherein the collation result output unit has a function of performing a warning process when the output text data corresponds to a predetermined character string.

The speech recognition system according to claim 1, further comprising:
a collation log database that accumulates the detection results output from the collation result output unit as a collation log;
a material data storage unit that stores material data including the audio signal; and
a collation result reproducing unit that outputs the material data from a desired position based on the text data included in the collation log and the position of the text data in the material data.

The speech recognition system according to claim 1, wherein a keyword that is a character string arbitrarily set by the user is input as the manuscript data, the system further comprising:
a collation log database that accumulates the detection results output from the collation result output unit as a collation log;
a material data storage unit that stores material data including the audio signal; and
a collation result reproducing unit that outputs the material data from a desired position based on the keyword included in the collation log and the position of the keyword in the material data.

A speech recognition method comprising:
(1) inputting an audio signal and inputting manuscript data including text data;
(2) converting the input speech into a voice phoneme sequence and converting the input text data into a text phoneme sequence; and
(3) collating the voice phoneme sequence with the text phoneme sequence to determine whether they match, and outputting the text data corresponding to the matching phoneme sequence as a detection result when the voice phoneme sequence matches the text phoneme sequence.

The speech recognition method according to claim 11, wherein the manuscript data is divided into items according to the contents of the manuscript, and
in the step (3), the text data is divided according to the items, a range of a leading character string of each divided text data is determined, and the character string in that range is extracted as a collation target text.

The speech recognition method according to claim 12, wherein each of the divided text data is given a weight coefficient according to its priority, and
in the step (3), the collation target texts are collated with the voice phonemes in an order according to the weight coefficients.

The speech recognition method according to claim 13, wherein, in the step (3), collation target texts that have been collated are deleted as the collation processing progresses, and the weighting factors assigned to collation target texts that have not yet been collated are sequentially varied.

The speech recognition method according to claim 11, wherein, in the step (3),
the collation target texts are collated with the voice phoneme sequence to determine whether they match, and a predetermined number of the matching collation target texts are output as detection candidates, and
the output detection candidates are collated with the voice phoneme sequence to determine whether they match, and the detection result is output.

The speech recognition method according to claim 11, wherein a threshold value for comparing the degree of coincidence between phoneme sequences is held, and the collation accuracy in the step (3) is adjusted by changing the threshold value.

The speech recognition method according to claim 11, wherein the manuscript data includes utterance status information on the utterance status of the text data, and
the conversion speed in the step (2) is adjusted by changing the duration of speech based on the utterance status information.

The voice recognition method according to claim 11, further comprising a step of performing a warning process when the output text data corresponds to a predetermined character string.

The speech recognition method according to claim 11, further comprising a step of accumulating the detection results as a collation log while accumulating material data including the audio signal, and outputting the material data from a desired position based on the accumulated text data and the position of the text data in the material data.

The speech recognition method according to claim 11, wherein a keyword that is a character string arbitrarily set by the user is input as the manuscript data, the method further comprising a step of accumulating the detection results as a collation log while accumulating material data including the audio signal, and outputting the material data from a desired position based on the keyword included in the collation log and the position of the keyword in the material data.

A speech recognition program for causing a computer to execute a process comprising:
(1) inputting an audio signal and inputting manuscript data including text data;
(2) converting the input speech into a voice phoneme sequence and converting the input text data into a text phoneme sequence; and
(3) collating the voice phoneme sequence with the text phoneme sequence to determine whether they match, and outputting the text data corresponding to the matching phoneme sequence as a detection result when the voice phoneme sequence matches the text phoneme sequence.

The speech recognition program according to claim 21, wherein the manuscript data is divided into items according to the contents of the manuscript, and
in the step (3), the text data is divided according to the items, a range of a leading character string of each divided text data is determined, and the character string in that range is extracted as a collation target text.

The speech recognition program according to claim 22, wherein each of the divided text data is given a weight coefficient according to its priority, and
in the step (3), the collation target texts are collated with the voice phonemes in an order according to the weight coefficients.

The speech recognition program according to claim 23, wherein, in the step (3), collation target texts that have been collated are deleted as the collation processing progresses, and the weighting factors assigned to collation target texts that have not yet been collated are sequentially varied.

The speech recognition program according to claim 21, wherein, in the step (3),
the collation target texts are collated with the voice phoneme sequence to determine whether they match, and a predetermined number of the matching collation target texts are output as detection candidates, and
the output detection candidates are collated with the voice phoneme sequence to determine whether they match, and the detection result is output.

The speech recognition program according to claim 21, wherein a threshold value for comparing the degree of coincidence between phoneme sequences is held, and the collation accuracy in the step (3) is adjusted by changing the threshold value.

The speech recognition program according to claim 21, wherein the manuscript data includes utterance status information on the utterance status of the text data, and
the conversion speed in the step (2) is adjusted by changing the duration of speech based on the utterance status information.

The speech recognition program according to claim 21, further comprising a step of performing a warning process when the output text data corresponds to a predetermined character string.

The speech recognition program according to claim 21, further comprising a step of accumulating the detection results as a collation log while accumulating material data including the audio signal, and outputting the material data from a desired position based on the accumulated text data and the position of the text data in the material data.

The speech recognition program according to claim 21, wherein a keyword that is a character string arbitrarily set by the user is input as the manuscript data, the program further comprising a step of accumulating the detection results as a collation log while accumulating material data including the audio signal, and outputting the material data from a desired position based on the keyword included in the collation log and the position of the keyword in the material data.