JP2004347732A - Automatic language identification method and system

Info

Publication number: JP2004347732A
Application number: JP2003142736A
Authority: JP
Inventors: Minoru Hayashi (林 実); Yoshihiko Hayashi (林 良彦)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-20
Filing date: 2003-05-20
Publication date: 2004-12-09

Abstract

PROBLEM TO BE SOLVED: To automatically identify the language of speech for the purpose of indexing multilingual content.

SOLUTION: The system comprises a means 131 for analyzing the input natural-language speech signal; means 133 and 136 for searching the analyzed speech in parallel, each using a pronunciation dictionary 132 or 135 that combines an acoustic model and a language model for one of a plurality of natural languages; means 134 and 137 for calculating the likelihood of each search result; and a means 138 for identifying the language of the input speech by comparing the calculated likelihoods.

COPYRIGHT: (C)2005, JPO & NCIPI

Description

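[0001]
[Technical Field of the Invention]
The present invention relates to an automatic language identification method and apparatus for indexing the speech of video content and the like. It is applicable to multilingual speech indexing, multilingual dictation, multilingual search, multilingual translation, automatic translation telephony, natural language processing, and similar uses.
[0002]
[Prior Art]
Recent advances in speech recognition algorithms and dramatic improvements in computer performance have led to speech recognition engines suitable for a wide range of applications (see, for example, Non-Patent Document 1).
Meanwhile, with the globalization of the Internet and the spread of broadband networks in recent years, the distribution of video content containing speech has flourished, and demand for cross-lingual search and automatic-translation communication has grown. In conventional multilingual communication methods, however, the transmitted content is translated only after the sender specifies the language or the receiver judges which language it is. When the sender or receiver specifies the language, this is done by swapping the function card for each language, flipping a switch, selecting on a screen, selecting with a mouse button, and so on.
[0003]
[Non-Patent Document 1]
Yoshiaki Noda et al., "Speech Recognition Technology Supporting the Multimedia Era", NTT R&D, Vol. 49, No. 3, 2000, pp. 142-147
[0004]
[Problems to Be Solved by the Invention]
Conventional language identification has to rely on human effort, so identifying the language of multilingual content is inconvenient. It also takes time and money, and it depends on experts who are proficient in multiple languages.
Accordingly, to solve these problems of the prior art, an object of the present invention is to provide a method and apparatus that automatically identify the language for the purpose of indexing the speech of video content and the like.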
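[0005]
[Means for Solving the Problems]
To achieve the above object, the automatic language identification method of the present invention is characterized in that, when indexing the speech of multilingual content, it uses for each of a plurality of languages a pronunciation dictionary that combines that language's acoustic model and language model, calculates the likelihood of the input natural language either in parallel or in a single batch, and automatically identifies the input language in real time, either by comparing the likelihoods or by finding the dictionary that yields the maximum likelihood.
[0006]
The automatic language identification apparatus of the present invention is characterized by comprising: speech analysis means for analyzing the speech signal of the input content; a pronunciation dictionary for natural language A, combining an acoustic model for natural language A used in speech recognition of natural language A with a language model for natural language A, which is a statistical model of natural language A; search means for natural language A, which searches the pronunciation dictionary for natural language A for the analyzed speech; likelihood calculation means for natural language A, which calculates the likelihood of language A for the search result; a pronunciation dictionary for natural language B, combining an acoustic model for natural language B used in speech recognition of natural language B with a language model for natural language B, which is a statistical model of natural language B; search means for natural language B, which searches the pronunciation dictionary for natural language B for the analyzed speech; likelihood calculation means for natural language B, which calculates the likelihood of language B for the search result; and likelihood comparison and determination means, which compares the likelihoods of natural language A and natural language B calculated in parallel and determines the language.
[0007]
In another aspect, the automatic language identification apparatus of the present invention is characterized by comprising: speech analysis means for analyzing the speech signal of the input content; a pronunciation dictionary for natural language A, combining an acoustic model for natural language A with a language model for natural language A, which is a statistical model of natural language A; a pronunciation dictionary for natural language B, combining an acoustic model for natural language B with a language model for natural language B, which is a statistical model of natural language B; and GMM likelihood calculation means, which uses both pronunciation dictionaries to calculate the likelihoods of the analyzed speech in a single batch by a GMM method or the like, and determines the language from the dictionary that yields the maximum likelihood.
[0008]
Furthermore, the automatic language identification apparatus of the present invention is characterized by comprising utterance section determination means, which determines utterance sections so that the apparatus can cope with the natural language changing from one utterance to the next.
[0009]
Furthermore, the automatic language identification apparatus of the present invention comprises acoustic model offline processing means. This consists of acoustic model generation means that supports multiple languages: it can generate the acoustic model for natural language A from training speech data for natural language A, and the acoustic model for natural language B from training speech data for natural language B.
[0010]
Furthermore, the automatic language identification apparatus of the present invention comprises language model offline processing means. This consists of language model generation means that supports multiple languages: it can generate the statistical language model for natural language A from training text data for natural language A, and the statistical language model for natural language B from training text data for natural language B.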
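[0011]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is an overall configuration diagram of a first embodiment of the present invention. In FIG. 1, 10 is content, 11 is a content input unit, 12 is an acoustic signal identification unit, 13 is an automatic language identification apparatus, 17 is a large-vocabulary continuous speech recognition processing unit, 18 is a language processing unit, 19 is an information integration unit, 20 is a metadata output unit, and 21 is metadata. The automatic language identification apparatus 13 consists of a speech analysis unit 131; a pronunciation dictionary 132 for natural language A, combining an acoustic model 144 and a language model 154 for natural language A; a search unit 133 for natural language A; a likelihood calculation unit 134 for natural language A; a pronunciation dictionary 135 for natural language B, combining an acoustic model 145 and a language model 155 for natural language B; a search unit 136 for natural language B; a likelihood calculation unit 137 for natural language B; and a likelihood comparison and determination unit 138. The automatic language identification apparatus 13 may also include some or all of units 11 to 12 and 17 to 20.
[0012]
The acoustic model 144 and language model 154 of the pronunciation dictionary 132 only need to be good enough to identify natural language A; they need not be as large as the models provided in the large-vocabulary continuous speech recognition processing unit 17. The same holds for the acoustic model 145 and language model 155 of the pronunciation dictionary 135 for natural language B. In what follows, natural language A is Japanese and natural language B is English.
[0013]
FIG. 2 shows the processing flowchart of the embodiment of FIG. 1. For example, news content 10 containing Japanese and English is input to the content input unit 11 over a broadband network or the like (step 201). The content input unit 11 A/D-converts the sound signal of the news content 10, and the acoustic signal identification unit 12 segments it, separating the speech sections that should undergo spoken-language processing from sections of other acoustic signals such as BGM (background music) and noise (step 202).
[0014]
The speech analysis unit 131 analyzes the separated speech signal. For example, it analyzes the signal at a sampling frequency of 16 kHz with a 20 ms Hamming window and a 10 ms frame period, and extracts feature parameters of, for example, 34 dimensions comprising the 16th-order LPC cepstrum, the Δ cepstrum, the log power, and the Δ log power (step 203).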
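The framing parameters in step 203 can be made concrete with a small sketch. The following is only an illustration under stated assumptions: it covers the windowing, the log power, and a simple first-difference delta, and it does not implement the LPC cepstrum. All function and constant names are hypothetical, and the delta formula is a stand-in because the text does not specify one.

```python
import math

SAMPLE_RATE_HZ = 16_000  # sampling frequency given in the text
WINDOW_MS = 20           # Hamming window length given in the text
HOP_MS = 10              # frame period given in the text

FRAME_LEN = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # 320 samples per frame
HOP_LEN = SAMPLE_RATE_HZ * HOP_MS // 1000       # 160 samples between frame starts


def hamming(n: int) -> list[float]:
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def log_power_per_frame(samples: list[float]) -> list[float]:
    """Apply the window to each frame and return the log of its energy."""
    window = hamming(FRAME_LEN)
    powers = []
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP_LEN):
        frame = [s * w for s, w in zip(samples[start:start + FRAME_LEN], window)]
        energy = sum(x * x for x in frame)
        powers.append(math.log(energy + 1e-10))  # small floor avoids log(0) on silence
    return powers


def deltas(values: list[float]) -> list[float]:
    """First difference, used here as an assumed stand-in for the delta features."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


# 16 cepstral coefficients + 16 delta cepstra + log power + delta log power
FEATURE_DIM = 16 + 16 + 1 + 1
```

With these settings consecutive frames overlap by half a window, and the four feature groups named in the text add up to the stated 34 dimensions.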
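[0015]
The speech feature parameters extracted by the speech analysis unit 131 are supplied to both the search unit 133 for natural language A and the search unit 136 for natural language B. Suppose, for example, that the speech is 「では、シアトルTNN放送局のセイムさんに現地の声を聞いてみたいと思います。」 ("Now, let us hear local reactions from Mr. Seim of the Seattle TNN station."). The search unit for natural language A (Japanese search unit) 133 uses the Japanese pronunciation dictionary 132, with the Japanese acoustic model 144 and Japanese language model 154 suitably prepared in advance for Japanese recognition, and performs beam-search pruning that keeps the subtrees with high likelihood scores (step 204). The likelihood calculation unit for natural language A (Japanese likelihood calculation unit) 134 calculates the likelihood and sends the high-scoring 「では」 (dewa, "now then") together with its score to the likelihood comparison and determination unit 138 (step 205).
[0016]
At the same time, the search unit for natural language B (English search unit) 136 uses the English pronunciation dictionary 135, with the English acoustic model 145 and English language model 155 suitably prepared in advance for English, and likewise performs beam-search pruning that keeps the subtrees with high likelihood scores (step 206). The English likelihood calculation unit 137 calculates the likelihood and sends a high-scoring candidate, for example "dewater", together with its score to the likelihood comparison and determination unit 138 (step 207).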
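The pruning idea in steps 204 and 206 can be sketched as follows. This is a minimal illustration, not the recognizer described here: a real search walks a lexical tree over HMM states, whereas this sketch extends string prefixes with per-step unit scores. The function names, the `beam_width` parameter, and the score tables are all hypothetical.

```python
def prune(hypotheses: dict[str, float], beam_width: float) -> dict[str, float]:
    """Keep only hypotheses whose log-likelihood is within beam_width of the best."""
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam_width}


def beam_search(step_scores: list[dict[str, float]], beam_width: float) -> tuple[str, float]:
    """step_scores[t][unit] is an assumed log-likelihood of `unit` at step t."""
    beam = {"": 0.0}
    for step in step_scores:
        # Extend every surviving prefix by every unit available at this step.
        expanded = {
            prefix + unit: score + unit_score
            for prefix, score in beam.items()
            for unit, unit_score in step.items()
        }
        # Drop low-scoring branches before moving on.
        beam = prune(expanded, beam_width)
    best = max(beam, key=beam.get)
    return best, beam[best]
```

A narrower beam discards more branches at each step, which trades some accuracy for speed; that trade-off is why the identification models can be smaller than those of the full recognizer.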
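[0017]
The processing in the search unit 133 and likelihood calculation unit 134 for natural language A, and in the search unit 136 and likelihood calculation unit 137 for natural language B, is basically the same as in ordinary speech recognition.
[0018]
The likelihood score of the Japanese 「では」 and that of the English "dewater", calculated in parallel by units 133 and 134 and by units 136 and 137, are compared in the likelihood comparison and determination unit 138 (step 208). The higher-scoring 「では」 is selected, that is, the language is determined to be Japanese, and the result is sent to the large-vocabulary continuous speech recognition processing unit 17.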
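The parallel scoring and comparison of step 208 can be outlined as below. This is a sketch under assumptions: the two recognizers are stubs that return fixed placeholder scores (not measured values), and `Candidate`, `identify_language`, `japanese_stub`, and `english_stub` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    language: str
    word: str
    log_likelihood: float


# A recognizer takes feature frames and returns its best candidate with a score.
Recognizer = Callable[[list[list[float]]], Candidate]


def identify_language(features: list[list[float]], recognizers: list[Recognizer]) -> Candidate:
    """Run every per-language recognizer in parallel and keep the best-scoring result."""
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        candidates = list(pool.map(lambda recognize: recognize(features), recognizers))
    return max(candidates, key=lambda c: c.log_likelihood)


def japanese_stub(features: list[list[float]]) -> Candidate:
    return Candidate("ja", "では", -42.0)      # placeholder score for illustration


def english_stub(features: list[list[float]]) -> Candidate:
    return Candidate("en", "dewater", -57.5)  # placeholder score for illustration
```

Calling `identify_language([], [japanese_stub, english_stub])` returns the Japanese candidate, because its placeholder score is the higher of the two.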
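[0019]
The large-vocabulary continuous speech recognition processing unit 17 consists of speech recognition engines for multiple languages (here, Japanese and English). It recognizes the run of speech beginning with 「では」 at high speed and with high accuracy using the Japanese speech recognition engine (step 209), obtains, for example, a mixed kanji-kana character string, attaches readings, part-of-speech information, and the like, and sends the result to the language processing unit 18. The language processing unit 18 performs topic segmentation on the content words of the recognized text, using for example a method based on concept vectors acquired in advance from co-occurrence information in a corpus, adds related information such as named entities that is useful for search access, and sends the result to the information integration unit 19 (step 210). The information integration unit 19 integrates the speech recognition results, the language processing results, and the information on the other acoustic signals and sends them to the metadata output unit 20 (step 211), and the metadata 21 is saved as a file in the easy-to-use XML format (step 212).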
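As an illustration of step 212, the following sketch serializes indexed segments to XML with Python's standard `xml.etree.ElementTree` module. The text does not define a schema, so the element and attribute names (`metadata`, `segment`, `text`, `entity`, `start`, `end`, `lang`) and the shape of the input dictionaries are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET


def build_metadata(segments: list[dict]) -> ET.Element:
    """Build an XML tree with one <segment> per recognized stretch of speech."""
    root = ET.Element("metadata")
    for seg in segments:
        node = ET.SubElement(root, "segment", {
            "start": f"{seg['start']:.2f}",  # seconds from the start of the content
            "end": f"{seg['end']:.2f}",
            "lang": seg["lang"],             # language chosen by the identifier
        })
        ET.SubElement(node, "text").text = seg["text"]
        for entity in seg.get("entities", []):
            ET.SubElement(node, "entity").text = entity
    return root


def to_xml(root: ET.Element) -> str:
    """Serialize the tree to a Unicode string."""
    return ET.tostring(root, encoding="unicode")
```

Keeping the identified language as an attribute on each segment lets a search front end filter by language without re-running identification.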
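[0020]
FIG. 3 is a configuration diagram of the main part of a second embodiment of the present invention. This embodiment adds an utterance section determination unit 16 between the likelihood comparison and determination unit 138 and the large-vocabulary continuous speech recognition processing unit 17; the rest of the configuration is the same as in FIG. 1.
[0021]
Suppose, for example, that content 10 is input in which the speech changes from Japanese to English and then from English back to Japanese:
「では、シアトルTNN放送局のセイムさんに現地の声を聞いてみたいと思います。」 ("Now, let us hear local reactions from Mr. Seim of the Seattle TNN station.")
[Mr. Seim, How do they appreciate Ichiro's achievement in Seattle?] [Congratulations Ichiro.]
…
…
[Thank you, Mr. Seim.]
「シアトルTNN放送局のセイムさんにお話しを伺いました。」 ("We spoke with Mr. Seim of the Seattle TNN station.")
[0022]
In this case, the language of the speech that begins with the sentence-initial 「では、…」 is identified as Japanese and processed as Japanese, as described for the first embodiment. The utterance section determination unit 16 then judges, from the silent interval after the final phrase and from the language model, that 「…と思います。」 is the end of a sentence. On feedback from the utterance section determination unit 16, the search unit for natural language A (Japanese search unit) 133 and the search unit for natural language B (English search unit) 136 start processing the speech again from the speech analysis result for "Mr" supplied by the speech analysis unit 131.
[0023]
Using the pronunciation dictionary for natural language A (Japanese pronunciation dictionary) 132, with the acoustic model for natural language A (Japanese acoustic model) 144 and the language model for natural language A (Japanese language model) 154, the Japanese search unit 133 performs beam-search pruning that keeps the subtrees with high likelihood scores, the Japanese likelihood calculation unit 134 calculates the likelihood, and the high-scoring 「ミスター」 (misutā) is sent together with its score to the likelihood comparison and determination unit 138. At the same time, using the pronunciation dictionary for natural language B (English pronunciation dictionary) 135, with the acoustic model for natural language B (English acoustic model) 145 and the language model for natural language B (English language model) 155, the English search unit 136 performs the same beam-search pruning, the English likelihood calculation unit 137 calculates the likelihood, and the high-scoring "Mr" is sent together with its score to the likelihood comparison and determination unit 138. The likelihood score of the Japanese 「ミスター」 and that of the English "Mr", calculated in parallel, are compared in the likelihood comparison and determination unit 138, and the higher-scoring "Mr" is selected. The large-vocabulary continuous speech recognition processing unit 17 therefore processes the speech beginning with "Mr" using the English speech recognition engine.
[0024]
Likewise, when the utterance section determination unit 16 next judges that "…Seattle?" is the end of a sentence, processing starts again from the speech analysis result for "Congratulations" supplied by the speech analysis unit 131. From then on, processing proceeds sentence by sentence, with each utterance handled in its own language.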
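The per-utterance re-identification loop of the second embodiment can be sketched as follows. This simplified version splits on silence alone, whereas the text also consults the language model to confirm a sentence end. The names `split_utterances`, `label_utterances`, `min_silence_frames`, and the `identify` callback are hypothetical.

```python
from typing import Callable


def split_utterances(is_speech: list[bool], min_silence_frames: int) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, separated by long silences."""
    utterances = []
    start = None      # first frame of the utterance currently being collected
    silence_run = 0   # number of consecutive silent frames seen so far
    for i, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The utterance ended where this silence began.
                utterances.append((start, i - silence_run + 1))
                start = None
                silence_run = 0
    if start is not None:
        utterances.append((start, len(is_speech) - silence_run))
    return utterances


def label_utterances(
    is_speech: list[bool],
    min_silence_frames: int,
    identify: Callable[[int, int], str],
) -> list[tuple[int, int, str]]:
    """Identify the language again at the start of every utterance."""
    return [
        (start, end, identify(start, end))
        for start, end in split_utterances(is_speech, min_silence_frames)
    ]
```

Because identification restarts at each boundary, a Japanese sentence followed by an English one is routed to a different recognition engine without manual switching.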
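[0025]
According to this second embodiment, the natural language input for indexing the speech of video content can be identified automatically even when the language changes from one utterance to the next.
[0026]
FIG. 4 shows the configuration of an embodiment that generates the acoustic model 144 for natural language A and the acoustic model 145 for natural language B. In FIG. 4, 14 is an acoustic model offline processing unit, which consists of an acoustic model generation unit 141, a training speech database 142 for natural language A, and a training speech database 143 for natural language B.
[0027]
The acoustic model offline processing unit 14 clusters the speech data in the training speech databases 142 and 143 to form phoneme clusters that take the syllable structure of each language into account. The acoustic model generation unit 141 then creates garbage models according to each language, adapts the acoustic models, and creates the acoustic model 144 for natural language A and the acoustic model 145 for natural language B, each offline.
[0028]
By using the acoustic model offline processing unit 14, acoustic models that support multiple languages can be generated for indexing the speech of video content containing multiple natural languages, and the input natural language can be identified automatically even when the language changes from one utterance to the next.
[0029]
FIG. 5 shows the configuration of an embodiment that generates the language model 154 for natural language A and the language model 155 for natural language B. In FIG. 5, 15 is a language model offline processing unit, which consists of a language model generation unit 151, a training text database 152 for natural language A, and a training text database 153 for natural language B.
[0030]
In the language model offline processing unit 15, the language model generation unit 151 performs statistical N-gram training on the language corpora in the training text databases 152 and 153, and creates the language model 154 for natural language A and the language model 155 for natural language B, each offline.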
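The statistical N-gram training described here can be sketched for the bigram case (N = 2). The text does not specify the order of the model or a smoothing method, so the bigram order, the add-one smoothing, and the `BigramModel` class are assumptions; the one-sentence corpora in the usage note are toy data.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one smoothing, trained on tokenized sentences."""

    def __init__(self, sentences: list[list[str]]):
        self.history = Counter()  # how often each token appears as a bigram history
        self.bigrams = Counter()  # how often each (previous, current) pair appears
        for words in sentences:
            tokens = ["<s>"] + words + ["</s>"]
            self.history.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = {w for s in sentences for w in s} | {"<s>", "</s>"}

    def log_prob(self, words: list[str]) -> float:
        """Log-probability of a sentence, including its start and end markers."""
        tokens = ["<s>"] + words + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.history[prev] + len(self.vocab)
            total += math.log(numerator / denominator)
        return total
```

Training one such model per language, for example `BigramModel([["では", "現地", "の", "声"]])` for Japanese and `BigramModel([["thank", "you", "mr", "seim"]])` for English, gives each language a score for the same word sequence, and a Japanese sequence scores higher under the Japanese model.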
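[0031]
By using the language model offline processing unit 15, statistical language models that support multiple languages can be generated for indexing the speech of video content containing multiple natural languages, and the input natural language can be identified automatically even when the language changes from one utterance to the next.
[0032]
FIG. 6 is an overall configuration diagram of a third embodiment of the present invention. In FIG. 6, 10 is content, 11 is a content input unit, 12 is an acoustic signal identification unit, 13 is an automatic language identification apparatus, 17 is a large-vocabulary continuous speech recognition processing unit, 18 is a language processing unit, 19 is an information integration unit, 20 is a metadata output unit, and 21 is metadata. The automatic language identification apparatus 13 consists of a speech analysis unit 131; a pronunciation dictionary 132 for natural language A, combining an acoustic model 144 and a language model 154 for natural language A; a pronunciation dictionary 135 for natural language B, combining an acoustic model 145 and a language model 155 for natural language B; and a GMM (Gaussian Mixture Model) likelihood calculation unit 139.
[0033]
In this automatic language identification apparatus 13, the parallel processing by the multiple search units 133 and 136 and the multiple likelihood calculation units 134 and 137 of the apparatus 13 in FIG. 1 is replaced by the GMM likelihood calculation unit 139. This unit uses the pronunciation dictionary 132 for natural language A and the pronunciation dictionary 135 for natural language B to calculate the likelihoods in a single batch with a Gaussian-mixture hidden Markov model, and identifies the language according to which dictionary yielded the highest likelihood.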
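A maximum-likelihood decision over per-language Gaussian mixtures can be sketched as follows. Note the substitution: the text speaks of a Gaussian-mixture hidden Markov model, while this sketch uses a plain diagonal-covariance GMM per language (in effect a single-state model) and treats frames as independent. All function names and the model layout `(weights, means, variances)` are hypothetical.

```python
import math

# One model per language: (mixture weights, component means, component variances).
Model = tuple[list[float], list[list[float]], list[list[float]]]


def log_gaussian(x: list[float], mean: list[float], var: list[float]) -> float:
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames: list[list[float]], model: Model) -> float:
    """Sum over frames of the log of the weighted mixture density."""
    weights, means, variances = model
    total = 0.0
    for x in frames:
        comps = [
            math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(comps)
        # log-sum-exp keeps the sum numerically stable
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total


def identify(frames: list[list[float]], models: dict[str, Model]) -> str:
    """Return the language whose model gives the highest likelihood."""
    return max(models, key=lambda lang: gmm_log_likelihood(frames, models[lang]))
```

Scoring every language model over the same frames and taking the maximum avoids running a separate search per language, which is the reason the text gives for the speed of this embodiment.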
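[0034]
A method of identifying speakers with a Gaussian-mixture hidden Markov model is known (for example, Douglas et al., "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models", IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 1, January 1995); here it is applied to language identification.
[0035]
FIG. 7 is the processing flowchart of this third embodiment. In FIG. 7, step 304 is the processing in the GMM likelihood calculation unit 139; the other steps are the same as in the flowchart of FIG. 2, so a detailed description is omitted.
[0036]
According to the third embodiment, when indexing the speech of video content containing multiple natural languages, a pronunciation dictionary with an acoustic model and a language model is prepared for each natural language and the GMM likelihood calculation is performed in a single batch, so the language can be identified at high speed.
[0037]
FIG. 8 is a configuration diagram of the main part of a fourth embodiment of the present invention. This embodiment adds an utterance section determination unit 16 between the GMM likelihood calculation unit 139 and the large-vocabulary continuous speech recognition processing unit 17; the rest of the configuration is the same as in FIG. 6. The function of the utterance section determination unit 16 is the same as in the second embodiment.
[0038]
Needless to say, the acoustic model offline processing unit 14 of FIG. 4 and the language model offline processing unit 15 of FIG. 5 can also be added to the configurations of FIG. 6 and FIG. 8.
[0039]
Needless to say, some or all of the processing functions of the units in the apparatus configurations shown in FIG. 1, FIG. 6, and elsewhere can be implemented as a computer program and the present invention realized by running that program on a computer; likewise, the processing procedures shown in FIG. 2, FIG. 7, and elsewhere can be implemented as a computer program and executed by a computer. A program for realizing the processing functions on a computer, or for causing a computer to execute the processing procedures, can be recorded on a computer-readable recording medium such as an FD, MO, ROM, memory card, CD, DVD, or removable disk for storage or supply, and the program can also be distributed over a network such as the Internet.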
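[0040]
[Effects of the Invention]
As is clear from the above description, the present invention provides the following effects.
(1) When indexing the speech of video content containing multiple natural languages, a pronunciation dictionary with an acoustic model and a language model is prepared for each natural language, search and likelihood calculation are performed in parallel, and the likelihoods are compared. Language identification of the input natural language, which conventionally relied on human effort, can thus be performed automatically.
[0041]
(2) When indexing the speech of video content containing multiple natural languages, a pronunciation dictionary with an acoustic model and a language model is prepared for each natural language, search and likelihood calculation are performed in parallel, the likelihoods are compared, and utterance sections are additionally determined. Language identification of the input natural language, which conventionally relied on human effort, can thus be performed automatically even when the language changes from one utterance to the next.
[0042]
(3) When indexing the speech of video content containing multiple natural languages, acoustic models that support multiple languages can be generated, and the input natural language can be identified automatically even when the language changes from one utterance to the next.
[0043]
(4) When indexing the speech of video content containing multiple natural languages, statistical language models that support multiple languages can be generated, and the input natural language can be identified automatically even when the language changes from one utterance to the next.
[0044]
(5) When indexing the speech of video content containing multiple natural languages, a pronunciation dictionary with an acoustic model and a language model is prepared for each natural language, likelihoods are calculated in a single batch by a GMM method or the like, and the likelihoods are compared. Language identification of the input natural language, which conventionally relied on human effort, can thus be performed automatically and at high speed.
[0045]
(6) When indexing the speech of video content containing multiple natural languages, an acoustic model, a language model, and a pronunciation dictionary are prepared for each natural language, likelihoods are calculated in a single batch by a GMM method or the like, the language is determined, and utterance sections are additionally determined. Language identification of the input natural language, which conventionally relied on human effort, can thus be performed automatically and at high speed even when the language changes from one utterance to the next.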
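[Brief Description of the Drawings]
[FIG. 1] An overall configuration diagram of the first embodiment of the present invention.
[FIG. 2] The processing flowchart of FIG. 1.
[FIG. 3] A configuration diagram of the main part of the second embodiment of the present invention.
[FIG. 4] A configuration diagram of an embodiment of acoustic model generation.
[FIG. 5] A configuration diagram of an embodiment of language model generation.
[FIG. 6] An overall configuration diagram of the third embodiment of the present invention.
[FIG. 7] The processing flowchart of FIG. 6.
[FIG. 8] A configuration diagram of the main part of the fourth embodiment of the present invention.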
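[Explanation of Reference Numerals]
11 Content input unit
12 Acoustic signal identification unit
13 Central part of the automatic language identification apparatus
14 Acoustic model offline processing unit
15 Language model offline processing unit
16 Utterance section determination unit
17 Large-vocabulary continuous speech recognition processing unit
18 Language processing unit
19 Information integration unit
20 Metadata output unit
131 Speech analysis unit
132 Pronunciation dictionary for natural language A
133 Search unit for natural language A
134 Likelihood calculation unit for natural language A
135 Pronunciation dictionary for natural language B
136 Search unit for natural language B
137 Likelihood calculation unit for natural language B
138 Likelihood comparison and determination unit
139 GMM likelihood calculation unit
141 Acoustic model generation unit
142 Training speech database for natural language A
143 Training speech database for natural language B
144 Acoustic model for natural language A
145 Acoustic model for natural language B
151 Language model generation unit
152 Training text database for natural language A
153 Training text database for natural language B
154 Language model for natural language A
155 Language model for natural language B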
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an automatic language identification method and apparatus for indexing audio such as video content, and is used for multilingual audio indexing, multilingual dictation, multilingual search, multilingual translation, automatic translation telephone, natural language processing, and the like. Applicable.
[0002]
[Prior art]
Recent advances in speech recognition algorithms and dramatic improvements in the performance of computers have led to the development of speech recognition engines that can be used in various applications (see, for example, Non-Patent Document 1).
On the other hand, with the globalization of the Internet and the development of broadband networks in recent years, the distribution of video content including audio has become active and the demand for multilingual search and automatic translation communication has increased. However, in the conventional multilingual communication method, the sender specifies the language, or determines the language of the receiver, and translates the transmission content. When a sender or a recipient specifies a language, a method of replacing a function card corresponding to each language, switching a switch, selecting a screen, selecting a mouse button, or the like is used.
[0003]
[Non-patent document 1]
Yoshiaki Noda et al., "Speech Recognition Technology Supporting the Multimedia Era", NTT R & D Vol. 49, no. 3, 2000, p. 142-147
[0004]
[Problems to be solved by the invention]
In the conventional language identification method, it is inconvenient to identify the language of the multilingual content because it has to rely on humans, and it takes time and money to rely on humans. There is a problem that you have to rely on the house.
SUMMARY OF THE INVENTION Accordingly, it is an object of the present invention to provide a method and apparatus for automatically identifying a language for indexing audio or the like of video content or the like in order to solve the above-mentioned problems of the conventional technology.
[0005]
[Means for Solving the Problems]
In order to achieve the above object, the language automatic identification method of the present invention uses a natural language pronunciation dictionary of a natural language acoustic model and a natural language language model in a plurality of languages when indexing speech of multilingual content. And the likelihood of the input natural language is calculated in parallel or collectively, and the input language is automatically identified in real time by a likelihood comparison or an applied dictionary that has the maximum likelihood. It is.
[0006]
Further, in the automatic language identification device of the present invention, a speech analyzing means for analyzing and analyzing a speech signal of the input content, an A natural language acoustic model used for A natural language speech recognition, and an A natural language An A natural language pronunciation dictionary with an A natural language model, which is a statistical model of A; a natural language search means for searching the A natural language pronunciation dictionary for the analyzed and analyzed speech; A natural language likelihood calculating means for calculating the likelihood of the natural language A as a result of the processing, a natural natural language acoustic model used for speech recognition of natural natural language B, and a natural natural language statistical model B natural language A B natural language pronunciation dictionary with a language model; a B natural language search means for searching the B natural language pronunciation dictionary for the analyzed and analyzed speech; and a B language likelihood of the searched result. B natural language to calculate And use likelihood calculating means compares the likelihoods and the likelihood of B natural language parallelism calculated A natural language, characterized in that it comprises a likelihood comparison determination means for determining the language.
[0007]
Also, in the automatic language identification apparatus of the present invention, a speech analysis means for analyzing and analyzing a speech signal of the input content, an acoustic model for A natural language and an A natural language acoustic model used for speech recognition of A natural language. A natural language pronunciation dictionary with a natural language model A as a statistical model, a B natural language acoustic model used for speech recognition of a B natural language, and a B natural language language as a statistical model of B natural language The likelihood is collectively calculated by a GMM method or the like for the B natural language pronunciation dictionary with the model and the analyzed and analyzed speech using the A natural language pronunciation dictionary and the B natural language pronunciation dictionary, GMM likelihood calculating means for determining a language from an applied dictionary that takes the maximum likelihood.
[0008]
Furthermore, the automatic language identification apparatus of the present invention is characterized by comprising an utterance section determining means for determining an utterance section for coping with a natural language being replaced for each utterance.
[0009]
Further, the automatic language identification device of the present invention processes generation of an acoustic model for natural language A from audio data for learning natural language A, and generation of an acoustic model for natural language B from speech data for natural language learning B. Acoustic model off-line processing means comprising acoustic model generating means capable of coping with a plurality of languages that can be used.
[0010]
Further, in the automatic language identification device of the present invention, a statistical language model for A natural language is generated from text data for A natural language learning, and a statistical language model for B natural language is generated from text data for B natural language learning. There is provided a language model off-line processing means comprising a language model generating means capable of supporting a plurality of languages capable of processing the generation.
[0011]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is an overall configuration diagram of a first embodiment according to the present invention. In FIG. 1, 10 is a content, 11 is a content input unit, 12 is an audio signal identification unit, 13 is an automatic language identification device, 17 is a large vocabulary continuous speech recognition processing unit, 18 is a language processing unit, 19 is an information integration unit, Reference numeral 20 denotes a metadata output unit, and reference numeral 21 denotes metadata. The automatic language identification device 13 includes a speech analysis unit 131, an A natural language pronunciation dictionary 132 of an A natural language acoustic model 144 and an A natural language model 154, an A natural language search unit 133, and an A natural language likelihood. Degree calculation group 134, B natural language pronunciation dictionary 135 of B natural language acoustic model 145 and B natural language language model 155, B natural language search unit 136, B natural language likelihood calculation unit 137, likelihood It comprises a comparison / determination unit 138. The language automatic identification device 13 may include some or all of the units 11 to 12 and 17 to 20.
[0012]
The A natural language acoustic model 144 and the A natural language language model 154 of the A natural language pronunciation dictionary 132 need only be able to identify the A natural language, and the larger the model prepared in the large vocabulary continuous speech recognition processing unit 17, the larger the scale. No need to be. The same applies to the B natural language acoustic model 145 and the B natural language language model 155 of the B natural language pronunciation dictionary 135. In the following, it is assumed that the natural language A is Japanese and the natural language B is English.
[0013]
FIG. 2 shows a processing flowchart of the embodiment of FIG. For example, news content 10 including Japanese and English is input to the content input unit 11 via a broadband network or the like (step 201). The sound signal of the news content 10 is A / D-converted by the content input unit 11, the utterance section is segmented by the acoustic signal identification unit 12, and the speech section to be subjected to speech language processing and BGM (background music), noise, etc. It is divided into other sound signal sections (step 202).
[0014]
The audio analysis unit 131 analyzes and analyzes the above separated audio signal, for example, analyzes at a sampling frequency of 16 KHz, a hamming window of 20 ms, and a frame period of 10 ms, and obtains a 16th-order LPC cepstrum, a Δcepstrum, and a logarithm. A parameter of 34 dimensions or the like including power and logarithmic power is extracted (step 203).
[0015]
The feature parameters of the speech extracted by the speech analysis unit 131 are provided to the A natural language search unit 133 and the B natural language search unit 136, respectively. For example, if the sound is "I would like to hear the local voice from Mr. Sean of Seattle TNN Broadcasting Station", the search unit for natural language A (search unit for Japanese language) 133 appropriately preliminarily matches Japan. Using the Japanese acoustic model 144 prepared for word recognition and the Japanese pronunciation dictionary 132 of the Japanese language model 154, pruning is performed to leave a subtree with a high likelihood score by a beam search method ( Step 204), the likelihood calculation unit 134 for natural language A (likelihood calculation unit for Japanese) 134 calculates the likelihood, and sends “では” having a high score to the likelihood comparison determination unit 138 together with the score (step 205). ).
[0016]
At the same time, the B natural language search unit (English search unit) 136 similarly uses the English acoustic model 145 and the English pronunciation dictionary 135 of the English language model 155 appropriately prepared for English to perform beamforming. Pruning is performed to leave a partial tree having a high likelihood score by a search method (step 206), and the likelihood calculation unit 137 calculates the likelihood, and outputs a high score, for example, “dewater” together with the score along with the likelihood. This is sent to the comparison / determination unit 138 (step 207).
[0017]
The processing in the A natural language search unit 133 and the A natural language likelihood calculation unit 134, and the processing in the B natural language search unit 136 and the B natural language likelihood calculation unit 137 are basically the same as those of ordinary speech recognition. The same is true.
[0018]
The likelihood of Japanese "wa" calculated by parallel processing by the A natural language search unit 133, the A natural language likelihood calculation unit 134, and the B natural language search unit 136 and the B natural language likelihood calculation unit 137. The score and the likelihood score of English “dewater” are compared and calculated by the likelihood comparison / determination unit 138 (step 208), and “is” having a high score is determined, that is, Japanese is determined, and large vocabulary continuous speech recognition It is sent to the processing unit 17.
[0019]
The large vocabulary continuous speech recognition processing unit 17 is composed of a speech recognition engine of a plurality of languages (here, Japanese and English). (Step 209), for example, a kanji-kana mixed character string is obtained, read, and sent to the language processing unit 18 with the part of speech information attached. The linguistic processing unit 18 performs topic segmentation on the content words of the speech-recognized text using a method based on concept vectors obtained in advance from co-occurrence information in the corpus, and obtains relevant information such as named entities useful for search access. And sends it to the information integration unit 19 (step 210). The information integration unit 19 integrates the speech recognition result, the linguistic processing result, and other audio signal information and sends it to the metadata output unit 20 (step 211), and the metadata 21 is stored as an easy-to-use XML format file. (Step 212).
[0020]
FIG. 3 is a configuration diagram of a main part of the second embodiment of the present invention. In the present embodiment, an utterance section determination unit 16 is added between a likelihood comparison determination unit 138 and a large vocabulary continuous speech recognition processing unit 17, and the other configuration is the same as that of FIG.
[0021]
For example, the voice is "I would like to hear Mr. Same from Seattle TNN Broadcasting Station for the local voice.",
[Mr. Seim, How do you appreciate Ichiro's achievement in Seattle? ] [Congraations Ichiro. ]
…
…
[Thankyo, Mr .; Seim. ]
"We spoke to Seattle TNN broadcaster Mr. Same."
And content 10 that changes from Japanese to English and from English to Japanese.
[0022]
In this case, as described in the first embodiment, the language of the voice starting with the beginning of the sentence “,...” Is identified as Japanese and processed as Japanese. Here, the utterance section determination unit 16 determines that "I think ..." is the end of the sentence based on the silent section at the end and the language model, and provides feedback from the utterance section determination unit 16 to the A natural language search unit (Japan). The word search unit 133 and the B natural language search unit (English search unit) 136 process the speech again from the speech analysis result of "Mr" from the speech analysis unit 131.
[0023]
The A natural language pronunciation dictionary (Japanese pronunciation dictionary) 132 of the A natural language acoustic model (Japanese acoustic model) 144 and the A natural language language model (Japanese language model) 154 is used for the A natural language. The language search unit (Japanese search unit) 133 performs pruning to leave a partial tree with a high likelihood score by a beam search method, and a natural language likelihood calculation unit (Japanese likelihood calculation unit) The likelihood is calculated at 134, and “Mr” having a high score is sent to the likelihood comparison and determination section 138 together with the score. At the same time, using the B natural language pronunciation dictionary (English pronunciation dictionary) 135 of the B natural language acoustic model (English acoustic model) 145 and the B natural language language model (English language model) 155, the B natural language The search unit for English (search unit for English) 136 performs pruning to leave a partial tree with a high likelihood score by a beam search method, and the likelihood calculation unit for natural language B (likelihood calculation unit for English) 137 Is calculated, and [Mr] having a high score is sent to the likelihood comparison / determination unit 138 together with the score. The likelihood score of Japanese “Mr” and the likelihood score of English “Mr” calculated by the parallel processing are compared and calculated by the likelihood comparison / determination unit 138, and “Mr” having a higher score is determined. Therefore, the large vocabulary continuous speech recognition processing unit 17 processes the speech starting with "Mr" using the English speech recognition engine.
[0024]
Similarly, when the utterance section determination unit 16 next determines that “... Seattle?” Is the end of the sentence, the speech analysis unit 131 processes again from the voice analysis result of “Congratulations”. Hereinafter, the processing in the language for each sentence in each sentence proceeds.
[0025]
According to the second embodiment, it is possible to automatically identify a natural language input for indexing audio of video content in response to the natural language being replaced for each utterance.
[0026]
FIG. 4 shows a configuration diagram of an embodiment for generating the A natural language acoustic model 144 and the B natural language acoustic model 145. In FIG. 4, reference numeral 14 denotes an acoustic model offline processing unit, which includes an acoustic model generation unit 141, an A natural language learning audio database 142, and a B natural language learning audio database 143.
[0027]
The acoustic model offline processing unit 14 classifies the speech data in the A natural language learning speech database 142 and the B natural language learning speech database 143 to form phoneme clusters in consideration of the syllable structure of each language, and generates an acoustic model. The unit 141 creates a garbage according to each language and adapts the acoustic model, and creates an acoustic model for natural language A 144 and an acoustic model for natural language B 145 offline.
[0028]
By using the acoustic model off-line processing unit 14, it is possible to generate an acoustic model capable of coping with a plurality of languages when indexing audio of video content including a plurality of natural languages. The natural language can be automatically identified in response to the natural language being replaced for each utterance.
[0029]
FIG. 5 shows a configuration diagram of an embodiment for generating the language model for natural language A 154 and the language model for natural language B 155. In FIG. 5, reference numeral 15 denotes a language model offline processing unit, which includes a language model generation unit 151, an A natural language learning text database 152, and a B natural language learning text database 153.
[0030]
In the language model offline processing unit 15, using the language corpus in the natural language learning text database 152 and the natural language learning text database 153, the language model generation unit 151 statistically learns N-grams using the language corpus. The language model for natural language 154 and the language model for natural language B 155 are created off-line.
[0031]
By using the language model offline processing unit 15, when indexing audio of video content including a plurality of natural languages, it is possible to generate a statistical language model capable of coping with the plurality of languages. The natural language that has been set can be automatically identified in response to the natural language being replaced for each utterance.
[0032]
FIG. 6 is an overall configuration diagram of the third embodiment of the present invention. In FIG. 6, 10 is a content, 11 is a content input unit, 12 is an audio signal identification unit, 13 is an automatic language identification device, 17 is a large vocabulary continuous speech recognition processing unit, 18 is a language processing unit, 19 is an information integration unit, Reference numeral 20 denotes a metadata output unit, and reference numeral 21 denotes metadata. The automatic language identification device 12 includes a speech analysis unit 131, an A natural language pronunciation dictionary 132 of an A natural language acoustic model 144 and an A natural language model 154, a B natural language acoustic model 145, and a B natural language It comprises a B-natural language pronunciation dictionary 135 with the model 155, and a GMM (Gaussian Mixture Model) likelihood calculator 139.
[0033]
In the automatic language identification device 13, a GMM likelihood calculation unit is used instead of the parallel processing by the plurality of natural language search units 133 and 136 and the plurality of natural language likelihood calculation units 134 and 137 as in the automatic language identification device 13 of FIG. 1. At 139, using the pronunciation dictionary for natural language 132 and the pronunciation dictionary for B natural language 135, the likelihood calculation is collectively performed by the hidden Markov model of the Gaussian mixture distribution, and the maximum likelihood is determined by which dictionary was obtained. , Identify the language.
[0034]
In addition, a method of identifying a speaker using a hidden Markov model of a Gaussian mixture distribution is known (for example, Douglas et al., “Robust Text-Independent Speaker”).
Identification Usage Gaussian Mixture Speaker Models ", IEEE TRANSACTIONS ON SPECH AND AUDIO PROCESSING, Vol. 3, No. 1, JANUALY 1995).
[0035]
FIG. 7 is a processing flowchart of the third embodiment. In FIG. 7, step 304 is a process in the GMM likelihood calculation unit 139, and the other steps are the same as the processing flowchart shown in FIG.
[0036]
According to the third embodiment, when indexing audio of video content including a plurality of natural languages, a pronunciation dictionary of an acoustic model and a language model corresponding to each natural language is prepared, and the GMM likelihood is calculated in a batch process. By performing the calculation, the language can be identified at high speed.
[0037]
FIG. 8 is a configuration diagram of a main part of the fourth embodiment of the present invention. In the present embodiment, an utterance section determination unit 16 is added between the GMM likelihood calculation unit 139 and the large vocabulary continuous speech recognition processing unit 17, and the other configuration is the same as that of FIG. The function of the utterance section determination unit 16 is the same as that of the second embodiment.
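The combination of utterance section determination with language identification can be sketched as follows. This passage does not state how the utterance section determination unit 16 finds section boundaries, so the energy threshold, the `min_silence` parameter, and the function names below are assumptions chosen only for illustration. The language identifier is passed in as a callable and applied to each section separately.

```python
def utterance_sections(energies, threshold, min_silence=3):
    """Return (start, end) frame index pairs, end exclusive, for each utterance.

    An utterance ends once `min_silence` consecutive frames fall below `threshold`.
    """
    sections = []
    start = None      # first frame of the utterance in progress
    silent_run = 0    # consecutive low-energy frames seen so far
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                # Close the utterance at the first silent frame.
                sections.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        sections.append((start, len(energies) - silent_run))
    return sections

def identify_per_utterance(frames, energies, threshold, identify):
    """Apply a language identifier to each utterance section separately."""
    return [(start, end, identify(frames[start:end]))
            for start, end in utterance_sections(energies, threshold)]

# Toy data: scalar "features" and a stand-in identifier (both hypothetical).
frames = [0, -1, -2, -1, 0, 0, 0, 3, 4, 0]
energies = [abs(x) for x in frames]
toy_identify = lambda segment: "A" if sum(segment) < 0 else "B"
result = identify_per_utterance(frames, energies, 0.5, toy_identify)
```

Because each section is identified independently, a change of language between one utterance and the next produces a different label for the following section.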
[0038]
It goes without saying that the configurations of FIGS. 6 and 8 can additionally include the acoustic model offline processing unit 14 of FIG. 4 and the language model offline processing unit 15 of FIG. 5.
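To give a feel for what offline generation of a statistical language model per language might involve, the sketch below trains a character bigram model with add-one smoothing from a small text collection for each language. This is only an illustrative stand-in: the character bigram form, the smoothing method, the function names, and the sample corpora are assumptions, not the model used by the language model offline processing unit 15.

```python
import math
from collections import Counter

def train_bigram_model(texts):
    """Count character bigrams and their left-context characters in training text."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for text in texts:
        vocab.update(text)
        for left, right in zip(text, text[1:]):
            bigrams[(left, right)] += 1
            contexts[left] += 1
    return bigrams, contexts, vocab

def log_probability(text, model, vocab_size):
    """Add-one smoothed bigram log-probability of text under the model."""
    bigrams, contexts, _ = model
    return sum(
        math.log((bigrams[(left, right)] + 1) / (contexts[left] + vocab_size))
        for left, right in zip(text, text[1:])
    )

# Hypothetical learning text databases for two languages.
corpora = {
    "A": ["the cat sat", "the dog ran"],
    "B": ["der hund lief", "die katze sass"],
}
models = {lang: train_bigram_model(texts) for lang, texts in corpora.items()}
# Use one shared vocabulary size so the scores are comparable across languages.
vocab_size = len(set().union(*(model[2] for model in models.values())))
scores = {lang: log_probability("the cat", model, vocab_size)
          for lang, model in models.items()}
```

A model trained on text of one language assigns a higher probability to new text in that language, which is the property the likelihood comparison relies on.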
[0039]
Note that some or all of the processing functions of the units in the device configurations shown in FIGS. 1, 6, and others can be implemented as a computer program, and the present invention can be realized by executing that program on a computer. Likewise, it goes without saying that the processing procedure shown in FIG. 2 or FIG. 7 can be implemented as a computer program and executed by a computer. Further, a program that realizes the processing functions on a computer, or a program that causes a computer to execute the processing procedure, can be recorded on a computer-readable recording medium such as an FD, an MO, a ROM, a memory card, a CD, a DVD, or a removable disk, and stored or provided in that form. The program can also be distributed through a network such as the Internet.
[0040]
[Effects of the invention]
As apparent from the above description, according to the present invention, the following effects can be obtained.
(1) When indexing the audio of video content that includes a plurality of natural languages, a pronunciation dictionary based on an acoustic model and a language model is prepared for each natural language, search and likelihood calculation are performed in parallel, and the resulting likelihoods are compared and judged. This makes it possible to identify automatically the language of the input natural language, a task that has conventionally relied on humans.
[0041]
(2) When indexing the audio of video content that includes a plurality of natural languages, a pronunciation dictionary based on an acoustic model and a language model is prepared for each natural language, search and likelihood calculation are performed in parallel, and likelihood comparison and utterance section determination are performed. This makes it possible to identify automatically the language of the input natural language, a task that has conventionally relied on humans, even when the natural language changes from utterance to utterance.
[0042]
(3) When indexing the audio of video content that includes a plurality of natural languages, acoustic models supporting a plurality of languages can be generated, so that the language of the input natural language can be identified automatically even when the natural language changes from utterance to utterance.
[0043]
(4) When indexing the audio of video content that includes a plurality of natural languages, statistical language models supporting a plurality of languages can be generated, so that the language of the input natural language can be identified automatically even when the natural language changes from utterance to utterance.
[0044]
(5) When indexing the audio of video content that includes a plurality of natural languages, a pronunciation dictionary based on an acoustic model and a language model is prepared for each natural language, likelihood calculation by a method such as GMM is performed in a single batch, and the likelihoods are compared and judged. This makes it possible to identify automatically and at high speed the language of the input natural language, a task that has conventionally relied on humans.
[0045]
(6) When indexing the audio of video content that includes a plurality of natural languages, a pronunciation dictionary based on an acoustic model and a language model is prepared for each natural language, likelihood calculation by a method such as GMM is performed in a single batch, and language determination and utterance section determination are performed. This makes it possible to identify automatically and at high speed the language of the input natural language, a task that has conventionally relied on humans, even when the natural language changes from utterance to utterance.
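The parallel search and likelihood calculation followed by comparison, as in effect (1), could be organized along the following lines. This is a minimal sketch, assuming each language is represented by a callable that returns a log-likelihood for the analyzed features. The names `identify_in_parallel` and `make_scorer` and the toy scoring rule are hypothetical and stand in for the search units and likelihood calculation units.

```python
from concurrent.futures import ThreadPoolExecutor

def identify_in_parallel(features, scorers):
    """Run one search/likelihood scorer per language concurrently and compare.

    scorers maps a language label to a callable returning a log-likelihood.
    """
    with ThreadPoolExecutor(max_workers=len(scorers)) as pool:
        futures = {lang: pool.submit(score, features)
                   for lang, score in scorers.items()}
        likelihoods = {lang: future.result() for lang, future in futures.items()}
    # Likelihood comparison and determination: keep the best-scoring language.
    best = max(likelihoods, key=likelihoods.get)
    return best, likelihoods

def make_scorer(center):
    """Toy scorer (hypothetical): higher when the features lie near `center`."""
    return lambda features: -sum((x - center) ** 2 for x in features)

scorers = {"A": make_scorer(0.0), "B": make_scorer(5.0)}
best, likelihoods = identify_in_parallel([4.5, 5.2, 4.9], scorers)
```

Threads keep the sketch short. For CPU-bound scoring in CPython, a process pool such as `concurrent.futures.ProcessPoolExecutor` with picklable scorers would be the more likely choice.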
[Brief description of the drawings]
FIG. 1 is an overall configuration diagram of a first embodiment according to the present invention.
FIG. 2 is a processing flowchart of FIG. 1.
FIG. 3 is a configuration diagram of a main part of a second embodiment according to the present invention.
FIG. 4 is a configuration diagram of an embodiment of acoustic model generation.
FIG. 5 is a configuration diagram of an embodiment of language model generation.
FIG. 6 is an overall configuration diagram of a third embodiment according to the present invention.
FIG. 7 is a processing flowchart of FIG. 6.
FIG. 8 is a configuration diagram of a main part of a fourth embodiment according to the present invention.
[Explanation of symbols]
11 Content input unit
12 Audio signal identification unit
13 Central part of automatic language identification device
14 Acoustic model offline processing unit
15 Language model offline processing unit
16 Utterance section determination unit
17 Large vocabulary continuous speech recognition processing unit
18 Language processing unit
19 Information integration unit
20 Metadata output unit
131 Speech analysis unit
132 A natural language pronunciation dictionary
133 A natural language search unit
134 A natural language likelihood calculation unit
135 B natural language pronunciation dictionary
136 B natural language search unit
137 B natural language likelihood calculation unit
138 Likelihood comparison / determination unit
139 GMM likelihood calculation unit
141 Acoustic model generation unit
142 A natural language learning speech database
143 B natural language learning speech database
144 A natural language acoustic model
145 B natural language acoustic model
151 Language model generation unit
152 A natural language learning text database
153 B natural language learning text database
154 A natural language language model
155 B natural language language model

Claims

1. An automatic language identification method comprising:
a step of analyzing the speech signal of an input natural language;
a step of searching the analyzed speech in parallel for each of a plurality of languages, using a natural language pronunciation dictionary based on a natural language acoustic model and a natural language language model of that language, and calculating the likelihood of each search result; and
a step of comparing the calculated likelihoods to identify the language of the input natural language.

2. An automatic language identification method comprising:
a step of analyzing the speech signal of an input natural language; and
a step of collectively calculating likelihoods for the analyzed speech, using natural language pronunciation dictionaries based on natural language acoustic models and natural language language models of a plurality of languages, and identifying the language of the input natural language from the applied dictionary that yields the maximum likelihood.

3. An automatic language identification device that automatically identifies the language of input content in order to index the audio of the content, comprising:
an audio analysis unit that analyzes the audio signal of the input content;
an A natural language pronunciation dictionary based on an A natural language acoustic model used for speech recognition of A natural language and an A natural language language model that is a statistical model of A natural language;
an A natural language search unit that searches the analyzed speech using the A natural language pronunciation dictionary;
an A natural language likelihood calculation unit that calculates the A natural language likelihood of the search result;
a B natural language pronunciation dictionary based on a B natural language acoustic model used for speech recognition of B natural language and a B natural language language model that is a statistical model of B natural language;
a B natural language search unit that searches the analyzed speech using the B natural language pronunciation dictionary;
a B natural language likelihood calculation unit that calculates the B natural language likelihood of the search result; and
a likelihood comparison / determination unit that determines the language by comparing the calculated A natural language likelihood and B natural language likelihood.

4. The automatic language identification device according to claim 3, further comprising an utterance section determination unit that determines utterance sections in order to cope with the natural language changing from utterance to utterance.

5. The automatic language identification device according to claim 3, further comprising:
an audio signal identification unit that separates the speech of the input content from other audio signals;
a large vocabulary continuous speech recognition processing unit that recognizes the speech of the determined natural language;
a language processing unit that performs topic segmentation of the speech-recognized text and provides related information;
an information integration unit that integrates the speech recognition results, the language processing results, and information on the other audio signals; and
a metadata output unit that outputs the information-integrated metadata.

6. An automatic language identification device that automatically identifies the language of input content in order to index the audio of the content, comprising:
an audio analysis unit that analyzes the audio signal of the input content;
an A natural language pronunciation dictionary based on an A natural language acoustic model used for speech recognition of A natural language and an A natural language language model that is a statistical model of A natural language;
a B natural language pronunciation dictionary based on a B natural language acoustic model used for speech recognition of B natural language and a B natural language language model that is a statistical model of B natural language; and
a likelihood calculation unit that collectively calculates likelihoods for the analyzed speech using the A natural language pronunciation dictionary and the B natural language pronunciation dictionary, and determines the language from the applied dictionary that yields the maximum likelihood.

7. The automatic language identification apparatus according to claim 6, further comprising an utterance section determination unit that determines an utterance section for responding to a change of a natural language for each utterance.

8. The automatic language identification device according to claim 6 or 7, further comprising:
an audio signal identification unit that separates the speech of the input content from other audio signals;
a large vocabulary continuous speech recognition processing unit that recognizes the speech of the determined natural language;
a language processing unit that performs topic segmentation of the speech-recognized text and provides related information;
an information integration unit that integrates the speech recognition results, the language processing results, and information on the other audio signals; and
a metadata output unit that outputs the information-integrated metadata.

9. The automatic language identification apparatus according to claim 1, further comprising an acoustic model offline processing unit including an acoustic model generation unit that supports a plurality of languages and performs model generation processing in which an A natural language acoustic model is generated from A natural language learning speech data and a B natural language acoustic model is generated from B natural language learning speech data.

10. The automatic language identification apparatus according to claim 1, further comprising a language model offline processing unit including a language model generation unit that supports a plurality of languages and performs generation processing in which a statistical language model for A natural language is generated from A natural language learning text data and a statistical language model for B natural language is generated from B natural language learning text data.