JP4861912B2 - Probability calculation apparatus and computer program for incorporating knowledge sources

Info

Publication number: JP4861912B2
Application number: JP2007162864A
Authority: JP
Inventors: Sakriani Watiasri Sakti; Konstantin Markov; Satoshi Nakamura
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2007-06-20
Filing date: 2007-06-20
Publication date: 2012-01-25
Anticipated expiration: 2027-06-20
Also published as: JP2009003110A

Abstract

PROBLEM TO BE SOLVED: To provide a probability calculation apparatus capable of robustly calculating the probability of a phoneme of a speech signal by using available training data.

SOLUTION: The probability calculation apparatus 516 calculates the probability of each phoneme in a speech signal by using a statistical acoustic model and knowledge sources. The statistical acoustic model and the knowledge sources have causal dependencies represented by a Bayesian network (BN). The BN corresponds to a junction tree including cluster nodes and separator nodes. The apparatus 516 includes: a storage device 520 for local acoustic models R3, C1, and L3; a module for calculating observation data for each frame; right, center, and left context calculating devices 570, 572, and 574 for calculating, by using the local acoustic models R3, C1, and L3, the local probability that each phoneme generates the observation data; and a PDF calculating device 576 for calculating the probability of each phoneme as a function of the local probabilities.

COPYRIGHT: (C) 2009, JPO & INPIT

Description

The present invention relates to probability calculation in speech recognition, and more particularly to probability calculation in speech recognition that incorporates one or more knowledge sources.

Information technology continues to grow and is having an ever greater influence on many aspects of everyday life. Speech-based communication between humans and information processing devices such as interactive systems is also becoming increasingly important. One of the basic technologies for realizing speech-oriented interfaces is automatic speech recognition (ASR). For nearly 40 years, many researchers have worked in the field of ASR. The goal is to develop intelligent information processing devices that can automatically recognize natural spoken language uttered by humans. However, extracting the underlying linguistic message from a complex acoustic signal is not an easy task, because the signal contains many sources of variation.

Several approaches have been developed to address this problem. These approaches to ASR are generally classified into two types: "knowledge-based" and "corpus-based".

The former relies mainly on the human ability to interpret spectrograms or other visual representations of speech signals, and uses knowledge-based rules. However, because it is difficult to foresee every case in which these rules depend on one another, a rule inevitably ends up conflicting with other rules, for example by flatly contradicting another rule that explains the same phenomenon.

In contrast, the latter approach is usually based on modeling the speech signal with well-defined statistical algorithms that can extract knowledge from data automatically. This modeling approach has given promising results and outperforms the former knowledge-based approach. This is why many current ASR systems use statistical data-driven methods based on hidden Markov models (HMMs). State-of-the-art ASR systems have reached very high performance under controlled conditions.

Despite the remarkable progress in this field, many challenges remain to be overcome before ASR systems are widely used in everyday life and reach their full potential. For example, in the presence of unexpected acoustic variation, ASR systems perform far worse than human listeners. Relying solely on statistical models while largely ignoring the additional knowledge that is available can achieve only a limited level of success. Many researchers are aware of this problem and have made various attempts to integrate the knowledge-based and statistical approaches more explicitly.

So far, Non-Patent Document 1 has proposed a study that enables the incorporation of acoustic-phonetic knowledge sources using neural networks for the purpose of rescoring. The large-vocabulary continuous speech recognition (LVCSR) systems disclosed in Non-Patent Documents 2 and 3 have also succeeded in improving acoustic models by incorporating long-span coarticulation effects such as quinphones/pentaphones. More recently, some researchers have tried using graphical tools such as Bayesian networks (BN). A BN can be regarded as a generalization of an HMM, and it can easily incorporate additional knowledge such as articulatory features, sub-band correlations, or speaking style in addition to the spectral information of speech (Non-Patent Document 4).

Patent Document 1: JP 2007-052166 A
Non-Patent Document 1: J. Li, Y. Tsao, and C.-H. Lee, "A study on knowledge source integration for candidate rescoring in automatic speech recognition," in Proc. ICASSP, Philadelphia, USA, 2005, pp. 837-840.
Non-Patent Document 2: C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, "Audio-visual speech recognition," Tech. Rep., CSLP, Johns Hopkins University, Baltimore, USA, 2000.
Non-Patent Document 3: A. Ljolje, D. Hindle, M. Riley, and R. Sproat, "The AT&T LVCSR-2000 system," in Speech Transcription Workshop, University of Maryland, USA, 2000.
Non-Patent Document 4: K. Daoudi, D. Fohr, and C. Antoine, "A new approach for multi-band speech recognition based on probabilistic graphical models," in Proc. ICSLP, Beijing, China, pp. 329-332, 2000.
Non-Patent Document 5: K. Markov and S. Nakamura, "Forward-backwards training of hybrid HMM/BN acoustic models," in Proc. ICSLP, pp. 621-624, 2006.
Non-Patent Document 6: J. J. Odell, The Use of Context in Large Vocabulary Speech Recognition, Ph.D. thesis, Cambridge University, Cambridge, UK, 1995.
Non-Patent Document 7: Ji. Ming, P. O. Boyle, M. Owens, and F. J. Smith, "A Bayesian approach for building triphone models for continuous speech recognition," IEEE Trans. Speech and Audio Processing, vol. 7, no. 6, pp. 678-684, November 1999.
Non-Patent Document 8: S. Sakti, S. Nakamura, and K. Markov, "Improving acoustic model precision by incorporating a wide phonetic context based on a Bayesian framework," IEICE Trans. Inf. & Syst., vol. E89-D, no. 3, pp. 946-953, 2006.
Non-Patent Document 9: T. Jitsuhiro, T. Matsui, and S. Nakamura, "Automatic generation of non-uniform HMM topologies based on the MDL criterion," IEICE Trans. Inf. & Syst., vol. E87-D, no. 8, pp. 2121-2129, 2004.

However, it has often not been possible to develop such complex models and achieve the best performance. This happens especially when resources such as the amount of training data and the available memory are insufficient to train the model parameters properly. As a result, non-robust estimates and a growing number of unseen patterns cause a loss of resolution in the input space. Furthermore, decoding with large models becomes cumbersome and sometimes even impossible. The best option then is to choose a simple form of model that can be estimated reliably from the available training data.

Therefore, one object of the present invention is to provide a probability calculation apparatus capable of robustly calculating the probability of phonemes of a speech signal using available training data.

Another object of the present invention is to provide a probability calculation apparatus capable of calculating the probability of phonemes of a speech signal robustly and reliably using training data that may be sparse.

A first aspect of the present invention relates to a probability calculation apparatus for calculating, for each of a predefined set of phonemes present in a given segment of a speech signal, a probability using a statistical acoustic model for the speech signal and one or more knowledge sources. The segment includes a plurality of frames of the speech signal. The acoustic model and the one or more knowledge sources have a causal relationship represented by a Bayesian network. The Bayesian network corresponds to a junction tree that includes a plurality of cluster nodes and one or more separator nodes. The apparatus includes: means for storing a plurality of local acoustic models corresponding to the cluster nodes and the one or more separator nodes; means for calculating predefined observation data for each of the frames; local probability calculation means for calculating, using the plurality of local acoustic models, the local probability that each of the phonemes generates the observation data; and probability calculation means for calculating the probability that each of the phonemes generates the observation data as a predetermined function of the local probabilities calculated by the local probability calculation means.

The probability that each phoneme generates the observation data is calculated by a predefined function of the local probabilities. The local probabilities for each phoneme are calculated using the plurality of local acoustic models. Because the local models are smaller than a single model that incorporates the one or more knowledge sources, the amount of computation is reduced, less training data is needed to train the models, and the probability calculation becomes more robust and reliable.

Preferably, the predetermined function is defined as

$$
P(D \mid K_1, \ldots, K_N, M) = \frac{\prod_{i=1}^{N} P(D \mid K_i, M)}{P(D \mid M)^{N-1}}
$$

where D is the observation data, M is the acoustic model, N is a positive integer, K_i (i = 1 to N) are the one or more knowledge sources, and P(D | K_i, M) (i = 1 to N) and P(D | M) are the local probabilities calculated by the local probability calculation means.
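As an illustration of how such a function might be evaluated, the following minimal Python sketch combines local likelihoods in the log domain. The function name `combine_local_log_likelihoods` and the numeric values are hypothetical and are not taken from the specification; working with logarithms is an implementation choice assumed here to avoid numerical underflow.

```python
import math

def combine_local_log_likelihoods(log_p_d_given_ki_m, log_p_d_given_m):
    """Return log P(D | K_1..K_N, M) from local log-likelihoods.

    log_p_d_given_ki_m: list of log P(D | K_i, M), i = 1..N
    log_p_d_given_m:    log P(D | M)
    """
    n = len(log_p_d_given_ki_m)
    if n < 1:
        raise ValueError("at least one knowledge source is required")
    # Product of the N local terms divided by P(D | M)^(N-1), in the log domain.
    return sum(log_p_d_given_ki_m) - (n - 1) * log_p_d_given_m

# Hypothetical local likelihoods for one frame and one phoneme (N = 2).
local = [math.log(0.02), math.log(0.03)]
base = math.log(0.01)
combined = combine_local_log_likelihoods(local, base)
```

With a single knowledge source (N = 1) the function simply returns log P(D | K_1, M), since the denominator has exponent zero.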

More preferably, the model M is a monophone acoustic model, and the one or more knowledge sources include a preceding triphone context unit and a following triphone context unit.

More preferably, the model M is a monophone acoustic model trained using additional knowledge sources, and the one or more knowledge sources include a preceding triphone context unit and a following triphone context unit.

The additional knowledge sources include accent knowledge, gender knowledge, or both accent knowledge and gender knowledge.

A second aspect of the present invention relates to a computer program that, when executed on a computer, causes the computer to function as a probability calculation apparatus for calculating, for each of a predefined set of phonemes present in a given segment of a speech signal, a probability using a statistical acoustic model for the speech signal and one or more knowledge sources. The segment includes a plurality of frames of the speech signal. The acoustic model and the one or more knowledge sources have a causal relationship represented by a Bayesian network. The Bayesian network corresponds to a junction tree that includes a plurality of cluster nodes and one or more separator nodes. The computer program causes the computer to function as: means for storing a plurality of local acoustic models corresponding to the cluster nodes and the one or more separator nodes; means for calculating predefined observation data for each of the frames; local probability calculation means for calculating, using the plurality of local acoustic models, the local probability that each of the phonemes generates the observation data; and probability calculation means for calculating the probability that each of the phonemes generates the observation data as a predetermined function of the local probabilities calculated by the local probability calculation means.

1. Introduction
Here, we discuss applying the framework proposed in this application to the problem of incorporating wide phonetic knowledge information, which often comes with the difficulties of data sparseness and memory constraints. First, we show how additional knowledge sources are incorporated into the distributions of HMM states. Next, we show how additional knowledge sources are incorporated into HMM phoneme modeling. Both approaches have been verified experimentally through large-vocabulary continuous speech recognition experiments using English speech data containing two types of accents.

First, the general framework for incorporating additional knowledge sources is described in the next section. Section 3 then gives an overview of conventional HMM acoustic models. Sections 4 and 5 show how the framework is used to incorporate additional knowledge sources at the HMM state level and at the phoneme model level, including its application to the problem of incorporating wide phonetic context information. Details of the experiments, with results and discussion, are given in Section 6. Finally, conclusions are given in Section 7.

2. General framework for incorporating knowledge sources
In the statistical corpus-based approach, a model M is trained given some observation data D. One important problem of interest is computing the likelihood P(D | M), which predicts the data that can be expected given specific knowledge of the model.

In simple cases, the probability density function P(D | M) can be modeled by conditional probability tables (CPTs) when D is discrete, or by a continuous function such as a Gaussian distribution when D is continuous. In this case, the output probability for given data d and model parameters m is calculated simply as follows.

[Equation (1)]
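The two modeling options mentioned above can be sketched in a few lines of Python. This is only an illustration: the table entries, the Gaussian parameters, and the function names are hypothetical, and a univariate Gaussian is assumed for simplicity.

```python
import math

def discrete_output_probability(cpt, d, m):
    """Look up P(D = d | M = m) in a conditional probability table."""
    return cpt[m][d]

def gaussian_output_density(d, mean, var):
    """Evaluate a univariate Gaussian density p(D = d | M = m)."""
    return math.exp(-0.5 * (d - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical CPT: outer keys are model values m, inner keys are discrete symbols d.
cpt = {"a": {"x": 0.7, "y": 0.3}, "b": {"x": 0.2, "y": 0.8}}
p_discrete = discrete_output_probability(cpt, "y", "a")

# Hypothetical Gaussian parameters (mean, variance) for each model value m.
gaussians = {"a": (0.0, 1.0), "b": (2.0, 0.5)}
mean, var = gaussians["a"]
p_continuous = gaussian_output_density(0.0, mean, var)
```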

Now suppose that additional knowledge sources need to be incorporated into this model. It is then necessary to consider how the additional knowledge sources are to be incorporated. This procedure involves several steps, which are outlined in FIG. 1.

Referring to FIG. 1, the procedure includes: a step (step 50) of defining, using a BN, the causal relationships among the information sources, the model, and the data; a step (step 52) of determining whether direct BN inference is possible; a step (step 54) of performing direct BN inference when it is determined to be possible; a step (step 56) of decomposing the network of relationships into a set of linked clusters using the junction tree algorithm described later, when direct BN inference is determined to be impossible; and a step 58 of performing inference on the junction tree obtained in step 56.

The procedure is described in further detail below.

A. Defining the causal relationships between information sources
We start with a simple case in which the causal relationship between D and M is described using a BN. One example is BN 70, outlined in FIG. 2(A), which includes node 72 and node 74. Here, node M 72 is a discrete variable, drawn as a square node, and node D 74 is a continuous variable, drawn as an elliptical node.

The joint probability function of a BN factorizes as follows.

$$
P(Z_1, \ldots, Z_n) = \prod_{k=1}^{n} P\bigl(Z_k \mid \mathrm{Pa}(Z_k)\bigr) \tag{2}
$$

Here, Pa(Z_k) denotes the parents of the BN variable Z_k. From this, the following equation is obtained from FIG. 2(A).

$$
P(D, M) = P(D \mid M)\, P(M)
$$
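The factorization over parents can be illustrated with a small Python sketch for a fully discrete network. The helper `joint_probability`, the data layout, and all probabilities are hypothetical; D is continuous in FIG. 2(A) and is discretized here only to keep the example short.

```python
def joint_probability(assignment, parents, cpts):
    """Multiply P(Z_k | Pa(Z_k)) over all variables of a discrete BN.

    assignment: dict variable -> value
    parents:    dict variable -> tuple of parent variables
    cpts:       dict variable -> {(parent values...): {value: probability}}
    """
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpts[var][pa_values][assignment[var]]
    return p

# Two-node network M -> D, as in FIG. 2(A), with D discretized for illustration.
parents = {"M": (), "D": ("M",)}
cpts = {
    "M": {(): {"a": 0.6, "b": 0.4}},
    "D": {("a",): {"x": 0.7, "y": 0.3}, ("b",): {"x": 0.2, "y": 0.8}},
}
# P(D = y, M = a) = P(D = y | M = a) * P(M = a)
p = joint_probability({"M": "a", "D": "y"}, parents, cpts)
```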

Therefore, based on knowledge about the data, we simply define the conditional dependencies among D, M, and K, incorporate the additional knowledge K into P(D, M), and express the joint probability model in the same way. For example, the conditional dependencies among D, M, and K can be represented by the BN outlined in FIG. 2(B). In FIG. 2(B), BN 80 includes nodes 72 and 74 and an additional node K 76. The BN joint probability function is then as follows.

[Equation]
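The structure of FIG. 2(B) is not described in the text. Assuming that it is the single-knowledge-source case of the structure described for FIG. 3(A), that is, that K has parent M and child D and that D also depends directly on M, the factorization over parents would give the following. This is an illustration under that assumption, not a reproduction of the original equation.

```latex
P(D, K, M) = P(D \mid K, M)\, P(K \mid M)\, P(M)
```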

Consider a more detailed example. Suppose that there are knowledge sources K_1, K_2, …, K_N, and assume here that none of them depends conditionally on any other. FIG. 3 outlines two examples of conditional dependency structures among D, M, and K_1, K_2, …, K_N.

Referring to FIG. 3(A), network 90 includes nodes 72 and 74 and, in addition, nodes 92, 94, …, 96 (nodes K_1 to K_N). Each of nodes K_1 to K_N has parent node 72 and child node 74. Network 100 shown in FIG. 3(B) includes nodes 72 and 74 and nodes 92, …, 96 (nodes K_1 to K_N). Among nodes K_1 to K_N, nodes 92 and 96 have only child node 74, while the other nodes have parent node 72 and child node 74.

Therefore, by equation (2), the joint probability density function for the BN shown in FIG. 3(A) is as follows.

$$
P(D, K_1, \ldots, K_N, M) = P(D \mid K_1, \ldots, K_N, M)\, P(K_1 \mid M) \cdots P(K_N \mid M)\, P(M) \tag{5}
$$

If, as shown in FIG. 3(B) (see K_1 and K_N), some K_i receive no causal influence from M, the joint probability density function becomes the following.

$$
P(D, K_1, \ldots, K_N, M) = P(D \mid K_1, \ldots, K_N, M)\, P(K_1)\, P(K_2 \mid M) \cdots P(K_{N-1} \mid M)\, P(K_N)\, P(M) \tag{6}
$$

As can be seen, different conditional independence assumptions lead to different factorizations of the probability function (see equations (5) and (6)).

B. Direct inference in Bayesian networks
The primary concern in inference is computing the global conditional probability P(D | K_1, …, K_N, M). When the form of this probability density function permits direct BN inference, the following two cases arise.

1) All variables are observable.

In this case, the probability density function is computed simply by equation (1).

2) Some variables, such as the additional knowledge sources K_1, …, K_N, are unobservable or hidden.

In this case, the probability density function is computed from equation (5) by marginalizing, for every K_i, over all of its possible values k_i1, k_i2, …, k_iM.

[Equation]

Here, for simplicity, d, m, and k_ij are written in place of the full expressions D = d, M = m, and K_i = k_ij.
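The marginalization equation itself is not reproduced in the text. Assuming the structure of FIG. 3(A), one form consistent with equation (5) would be p(d | m) = Σ over all (k_1, …, k_N) of p(d | k_1, …, k_N, m) P(k_1 | m) ⋯ P(k_N | m). The Python sketch below evaluates that sum; the function names, data layout, and parameter values are hypothetical, and a univariate Gaussian is assumed for the conditional density.

```python
import itertools
import math

def gaussian(d, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (d - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_likelihood(d, m, p_k_given_m, density_params):
    """Compute p(d | m) by summing over all values of the hidden K_1..K_N.

    p_k_given_m:    list over i of {m: {k_i: P(k_i | m)}}
    density_params: {(m, k_1, ..., k_N): (mean, var)} for p(d | k_1..k_N, m)
    """
    value_sets = [sorted(table[m]) for table in p_k_given_m]
    total = 0.0
    for ks in itertools.product(*value_sets):
        weight = 1.0
        for table, k in zip(p_k_given_m, ks):
            weight *= table[m][k]                     # P(k_i | m)
        mean, var = density_params[(m, *ks)]
        total += weight * gaussian(d, mean, var)      # p(d | k_1..k_N, m)
    return total

# Hypothetical example: N = 2 hidden binary knowledge sources, one model value "a".
p_k_given_m = [{"a": {0: 0.5, 1: 0.5}}, {"a": {0: 0.25, 1: 0.75}}]
density_params = {("a", k1, k2): (float(k1 + k2), 1.0) for k1 in (0, 1) for k2 in (0, 1)}
likelihood = marginal_likelihood(0.5, "a", p_k_given_m, density_params)
```

The number of terms in the sum grows exponentially with N, which is one way to see why the direct computation can become impractical.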

However, computing the global conditional probability P(D | K_1, …, K_N, M) may not be straightforward, because there are too many variables and/or because of the computational complexity. In such cases, it is necessary to decompose the directed graph into clusters of variables on which the appropriate computations can be carried out. This can be done with the junction tree algorithm described in the next subsection.

C. Junction tree decomposition
Consider the simple case of incorporating only two additional knowledge sources, K_1 and K_2. The causal relationships among D, M, K_1, and K_2 are represented by BN 110 shown in FIG. 4(A). BN 110 includes nodes 112, 114, 116, and 118, denoted M, D, K_1, and K_2, respectively. Here, nodes M, K_1, and K_2 are discrete variables, drawn as square nodes, and node D is a continuous variable, drawn as an elliptical node.

To obtain a junction tree, the following graph transformations are performed.

1) Construct an undirected graph from BN 110 by marrying the parents (adding a link between every pair of variables that share a common child) and dropping the directions of the links. In the case of FIG. 4(A), a link is added between nodes 116 and 118. The resulting graph is called the "moral graph".

2) Selectively add arcs to the moral graph to form a triangulated graph. A graph is said to be triangulated if it contains no chordless cycle. A chord is an edge that connects two non-consecutive vertices in a cycle of length greater than 3.

3) In the triangulated graph, for every variable A with Pa(A) ≠ ∅, form the subset containing Pa(A) ∪ A. Such a subset is called a cluster or clique.

4) Build a junction tree with the clusters/cliques as nodes. Each link between two cliques is labeled with a separator, which is the non-empty intersection of those cliques.

FIG. 4(B) outlines the moral and triangulated graph 130 corresponding to BN 110 shown in FIG. 4(A). Graph 130 includes an additional link 120 between nodes 116 and 118. However, this triangulated graph yields only a single cluster/clique consisting of the entire set of variables D, M, K_1, and K_2, so it cannot be decomposed any further. Fortunately, because K_1 and K_2 are assumed to be independent, some of the arrows can be reversed to obtain BN 140, shown in FIG. 4(C), which is equivalent to BN 110. This is possible because P(X, Y) can be factored either as P(X | Y) P(Y) or as P(Y | X) P(X), and the two factorizations are equivalent.

FIG. 4(D) outlines the moral and triangulated graph 150 corresponding to BN 140. From it the clusters/cliques can be identified, and the junction tree 160 outlined in FIG. 4(E) is obtained. Here, the clusters are represented by elliptical nodes 164 and 166, and the separator is represented by square node 162.
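The moralization step can be sketched in Python as follows. The edge lists are assumptions: the figures are not available here, so BN 110 is taken to have M as parent of K_1, K_2, and D, with K_1 and K_2 also parents of D, and BN 140 is taken to be the same network with the K_i-to-D arrows reversed. These choices are consistent with the description above but are not confirmed by it, and the function names are hypothetical.

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph of a DAG.

    parents: dict node -> list of parent nodes
    """
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))   # drop the edge direction
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))       # link parents of a common child
    return edges

def undirected_edges(parents):
    """Return the original edges of the DAG with directions dropped."""
    return {frozenset((p, c)) for c, pa in parents.items() for p in pa}

# Assumed structure of BN 110: M -> K1, M -> K2, and M, K1, K2 -> D.
bn110 = {"M": [], "K1": ["M"], "K2": ["M"], "D": ["M", "K1", "K2"]}
# Assumed structure of BN 140: M -> D, and M, D -> K1, M, D -> K2.
bn140 = {"M": [], "D": ["M"], "K1": ["D", "M"], "K2": ["D", "M"]}

moral_110 = moralize(bn110)
added_110 = moral_110 - undirected_edges(bn110)   # links introduced by moralization
moral_140 = moralize(bn140)
```

Under these assumptions, moralizing BN 110 adds exactly the K_1 to K_2 link and yields a complete graph on the four variables, whereas the moral graph of BN 140 has no K_1 to K_2 link, so it splits into the two clusters {D, K_1, M} and {D, K_2, M} sharing the separator {D, M}.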

以上から，ＢＮ同時確率分布は，全てのクラスタのポテンシャル（確率）の積を，セパレータのポテンシャルの積で除算することにより以下のように定義される． From the above, the BN joint probability distribution is defined as follows by dividing the product of all cluster potentials (probabilities) by the product of the separator potentials.

ただし，Ｕはグラフにおける全ての変数を示す「世界」を，φ_Ｃｉはクラスタポテンシャル（クラスタＣｉにおける確率）を，φ_Ｓｉはセパレータポテンシャル（セパレータＳｉにおける確率）を示す．このため，同時確率関数，Ｐ（Ｄ，Ｋ_１，Ｋ_２，Ｍ）は図４（Ｅ）によれば以下のようになる．

However, U represents the “world” indicating all variables in the graph, φ _Ci represents the cluster potential (probability in cluster Ci), and φ _Si represents the separator potential (probability in separator Si). Therefore, the joint probability function, P (D, K ₁ , K ₂ , M) is as follows according to FIG.

Here, P(D, K_1, M) and P(D, K_2, M) are the cluster potentials, and P(D, M) is the separator potential.
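The two formulas referred to above were figures in the original publication and do not survive in this text. The following is a reconstruction from the surrounding definitions, offered as a sketch and not as the original typesetting. The equation numbers (9) and (10) are an assumption inferred from the later references to Equations (8) and (11).

```latex
% Reconstruction (assumed numbering): junction tree factorization
\begin{align}
P(U) &= \frac{\prod_{i} \phi_{C_i}}{\prod_{j} \phi_{S_j}} \tag{9} \\
P(D, K_1, K_2, M) &= \frac{P(D, K_1, M)\, P(D, K_2, M)}{P(D, M)} \tag{10}
\end{align}
```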

Under similar assumptions and considerations, a BN topology like BN 90 shown in FIG. 3(A) can be redrawn as in FIG. 5(A). FIG. 5(B) shows the corresponding junction tree 180. In FIG. 5(B) there are N clusters 164, 166, ..., 170 over the variables {(D, K_1, M), (D, K_2, M), ..., (D, K_N, M)} and N−1 separators {D, M} (nodes 162, 168, and so on). The joint probability function given by Equation (5) can therefore be decomposed as follows.

This is a new notation that expresses the joint probability function P(D, K_1, ..., K_N, M) as a composition of several local joint probability functions P(D, K_1, M), ..., P(D, K_N, M), each corresponding to the probability of the observed data D given one particular piece of additional knowledge K_1, K_2, ..., K_N.

D. Junction Tree Inference
Applying the chain rule to every P(D, K_i, M) gives the following.

Equation (11) therefore becomes:

Comparing this form of Equation (11) with Equation (5) shows that P(D | K_1, ..., K_N, M) can be decomposed into separate terms, each corresponding to the probability of the observed data D given one particular piece of additional knowledge K_1, K_2, ..., K_N.
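The equations of this derivation were figures in the original publication and are missing from this text. A reconstruction consistent with the surrounding prose is sketched below. The equation numbers are inferred from the cross-references, and the form of Equation (5) used in the comparison, P(D, K_1, ..., K_N, M) = P(D | K_1, ..., K_N, M) P(K_1 | M) ... P(K_N | M) P(M), is an assumption based on how Equation (5) is applied later in the description.

```latex
% Reconstruction (assumed numbering and assumed form of Eq. (5))
\begin{align}
P(D, K_1, \ldots, K_N, M)
  &= \frac{\prod_{i=1}^{N} P(D, K_i, M)}{P(D, M)^{N-1}} \tag{11} \\
P(D, K_i, M)
  &= P(D \mid K_i, M)\, P(K_i \mid M)\, P(M) \tag{12} \\
P(D, K_1, \ldots, K_N, M)
  &= \frac{\prod_{i=1}^{N} P(D \mid K_i, M)}{P(D \mid M)^{N-1}}
     \left[ \prod_{i=1}^{N} P(K_i \mid M) \right] P(M) \tag{13} \\
P(D \mid K_1, \ldots, K_N, M)
  &= \frac{\prod_{i=1}^{N} P(D \mid K_i, M)}{P(D \mid M)^{N-1}} \tag{14}
\end{align}
```

Equation (13) follows from substituting (12) into (11) together with P(D, M) = P(D | M) P(M): the N factors of P(M) in the numerator cancel against the N−1 factors in the denominator, leaving a single P(M).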

Defining, estimating, and observing several simple terms P(D | K_i, M) is much easier than doing so for a single but complex P(D | K_1, ..., K_N, M).

Consequently, given data d, model parameters m, and additional knowledge sources k_1j, ..., k_Nj, the output probability used in inference is calculated as follows.
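As an illustration of this inference step, the following minimal Python sketch combines local likelihoods in the log domain. It assumes the output probability takes the product-over-quotient form implied by the decomposition above, that is, the product of the local terms P(d | k_i, m) divided by P(d | m) raised to the power N−1. The function name `combined_log_likelihood` and the numeric values are hypothetical and do not come from the patent.

```python
import math


def combined_log_likelihood(local_log_likes, base_log_like):
    """Combine N local log-likelihoods log P(d | k_i, m) into
    log P(d | k_1, ..., k_N, m), assuming the factorized form
    prod_i P(d | k_i, m) / P(d | m)^(N-1).

    local_log_likes: list of log P(d | k_i, m), one per knowledge source.
    base_log_like:   log P(d | m), the likelihood without extra knowledge.
    """
    n = len(local_log_likes)
    if n == 0:
        raise ValueError("at least one knowledge source is required")
    # Division by P(d | m)^(N-1) becomes a subtraction in the log domain.
    return sum(local_log_likes) - (n - 1) * base_log_like


# Hypothetical toy values: two knowledge sources.
local = [math.log(0.2), math.log(0.3)]
combined = combined_log_likelihood(local, math.log(0.5))
print(math.exp(combined))  # approximately 0.2 * 0.3 / 0.5 = 0.12
```

Working with logarithms is a design choice of this sketch: it avoids numerical underflow when many small probabilities are multiplied.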

3. Conventional HMM Acoustic Model
Some notation is first defined for the conventional HMM. The HMM speech model of a triphone context /a⁻, a, a⁺/ is denoted by λ, and the HMM state variable by Q. X is the observation variable, and X_s = X_t, ..., X_t+m is an observation data segment of length m. FIG. 6 outlines the structure of a standard HMM 190, in which:
1) Short-time spectral characteristics are modeled by mixtures of Gaussian distributions 210, 212, and 214.

2) Temporal speech characteristics are governed by HMM state transitions 216, 218, 220, 222, and 224 among states 200, 202, and 204.

ＨＭＭ状態出力確率ｐ（ｘ_ｔ｜ｑ_ｉ）は，通常，状態確率密度関数（ＰｒｏｂａｂｉｌｉｔｙＤｅｎｓｉｔｙＦｕｎｃｔｉｏｎ：ＰＤＦ）Ｐ（Ｘ｜Ｑ）から以下の式により計算される． The HMM state output probability p (x _t | q _i ) is normally calculated from the state probability density function (PDF) P (X | Q) by the following equation.

Here, b_m is the mixture weight of the m-th mixture component of state q_i, and N(·) is a Gaussian function with mean vector μ_m and covariance matrix Σ_m. The likelihood P(X_s | λ) of an HMM segment is computed from the joint probability of the observations and the state sequence, taken either over all state sequences (total likelihood) or over only the most likely state sequence (Viterbi path).
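The state output probability described here is a weighted sum of Gaussian densities. The sketch below shows one way to evaluate it in plain Python. It assumes diagonal covariance matrices for simplicity, whereas the text allows a general covariance matrix Σ_m, and the names `diag_gaussian_pdf` and `state_output_prob` are hypothetical.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (an assumption of
    this sketch) at vector x, given per-dimension means and variances."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def state_output_prob(x, weights, means, variances):
    """p(x | q) = sum_m b_m * N(x; mu_m, Sigma_m) for one HMM state."""
    return sum(
        b * diag_gaussian_pdf(x, mu, var)
        for b, mu, var in zip(weights, means, variances)
    )


# Hypothetical two-component mixture over 1-dimensional features.
p = state_output_prob(
    [0.0],
    weights=[0.3, 0.7],
    means=[[0.0], [0.0]],
    variances=[[1.0], [1.0]],
)
print(p)  # both components are standard normal, so p is about 0.3989
```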

4. Incorporating Knowledge Sources at the HMM State Level
A. General Considerations
Following the theoretical framework described in Section 2, the model M is here a given triphone HMM state Q, and D is the observation variable X.

1) Defining the causal relationships
The structure of this topology is the same as that shown in FIG. 2(A), and the triphone HMM state PDF is now expressed by a BN joint probability function of the same form as Equation (3).

Simply following Equation (5) then gives:

This incorporates the additional knowledge sources K_1, K_2, ..., K_N into the HMM state classification P(X, Q) (all of K_1, K_2, ..., K_N are assumed to be independent given Q).

2) Inference
The main concern is computing the HMM state output probability P(X | K_1, ..., K_N, Q), which can easily be modeled by a Gaussian function, so the state output can be obtained directly. Assuming that all the additional knowledge sources K_1, ..., K_N are hidden, as described in Section 2-B, the state output probability is obtained in the same way as Equation (8), by marginalizing over all possible values k_i1, k_i2, ..., k_iM of K_i for every 1 ≤ i ≤ N.

Here, if the terms p(k_i1 | q_t) ... p(k_Nj | q_t) are treated as the mixture weight coefficients of the Gaussian components p(x_t | k_i1, ..., k_Nj, q_t), Equation (19) is also equivalent to the state output probability of the conventional HMM in Equation (16). Because Equation (19) represents a Gaussian mixture distribution, an existing HMM-based decoder can be used for recognition without any modification. Furthermore, because the BN is used only to infer the state output likelihood, the topology of the HMM-based triphone acoustic model is preserved as is, and the temporal speech characteristics remain governed by the HMM state transitions. This approach is also known as the hybrid HMM/BN modeling framework and is described in Non-Patent Document 5. Hereinafter, a model obtained by incorporating additional knowledge at the state level is called an HMM/BN model.
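To make the equivalence with a Gaussian mixture concrete, the following sketch marginalizes over hidden knowledge variables for a single state. It assumes one-dimensional observations, one conditional probability table per knowledge variable, and one Gaussian per combination of values. All names (`mixture_weights`, `state_output_prob`, `cpt_left`, `cpt_right`) and all numbers are hypothetical.

```python
import math
from itertools import product


def gaussian_pdf(x, mean, var):
    """Density of a 1-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_weights(cpts):
    """For each combination (k_1, ..., k_N) return the product
    p(k_1 | q) * ... * p(k_N | q), which acts as a mixture weight.

    cpts: list of dicts, cpts[i][k] = p(K_i = k | q) for one state q.
    """
    weights = {}
    for combo in product(*(cpt.keys() for cpt in cpts)):
        w = 1.0
        for cpt, value in zip(cpts, combo):
            w *= cpt[value]
        weights[combo] = w
    return weights


def state_output_prob(x, cpts, gaussians):
    """p(x | q) obtained by summing over all hidden combinations.

    gaussians: dict mapping each combination to a (mean, variance) pair.
    """
    return sum(
        w * gaussian_pdf(x, *gaussians[combo])
        for combo, w in mixture_weights(cpts).items()
    )


# Hypothetical toy tables for two hidden context variables.
cpt_left = {"b": 0.4, "p": 0.6}
cpt_right = {"b": 0.5, "z": 0.5}
gaussians = {combo: (0.0, 1.0) for combo in product(cpt_left, cpt_right)}
print(state_output_prob(0.0, [cpt_left, cpt_right], gaussians))
```

Because each table sums to one, the products of table entries also sum to one, which is why they can serve directly as mixture weights for an unmodified HMM decoder.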

このモデルのパラメータ学習は，非特許文献５に記載のＨＭＭ／ＢＮモデルの通常のトレーニングから採用できる．これはバックワード・フォワードアルゴリズムを基にしている．このアルゴリズムでは，各トレーニングの繰返しは，ＢＮのトレーニングと，ＨＭＭ遷移確率の更新とからなる．ＢＮのトレーニングは標準的な統計的方法を用いてなされる．トレーニングの間に全ての変数が観測可能であれば最大尤度（ＭＬ）パラメータ推定が適用され，いくつかの変数が隠れている場合，パラメータは標準的なエクスペクテーション・マキシマイゼーション（ＥＭ）アルゴリズムにより推定される． Parameter learning of this model can be adopted from normal training of the HMM / BN model described in Non-Patent Document 5. This is based on a backward-forward algorithm. In this algorithm, each training iteration consists of BN training and updating of the HMM transition probability. BN training is done using standard statistical methods. Maximum likelihood (ML) parameter estimation is applied if all variables are observable during training, and if some variables are hidden, the parameters are standard expectation-maximization (EM) algorithms Is estimated.

B. Incorporating Wide-Area Phoneme Context Information
The most widely used acoustic unit in ASR systems is still the triphone, which includes the immediately preceding and following phoneme contexts. Although triphones have proved to be an effective choice, a wide-area phoneme context is considered more appropriate for capturing coarticulation effects that extend over longer spans. Wide-area phoneme contexts, however, suffer from data sparseness and memory constraints.

ここで，前のセクションに記載したフレームワークを，広域音素知識情報を組込むという問題にどのように適用するかを説明する． Here we explain how to apply the framework described in the previous section to the problem of incorporating wide-area phoneme knowledge information.

Suppose that a conventional HMM λ with triphone context /a⁻, a, a⁺/ needs to be extended to a pentaphone context such as /a⁻⁻, a⁻, a, a⁺, a⁺⁺/. Following this approach, the second preceding and second following contexts, C_L (/a⁻⁻/) and C_R (/a⁺⁺/), are incorporated into the triphone state PDF by inserting two variables into the BN.

The conditional dependencies among the triphone HMM state Q, the observed data X, and the two additional variables C_L and C_R are described by the BN topology outlined in FIG. 7. This is called the BN-C topology.

Referring to FIG. 7, Bayesian network 240 includes nodes 250, 252, 254, and 256, denoted by Q, X, C_L, and C_R, respectively. Node C_L represents the second preceding context (/a⁻⁻/), and node C_R represents the second following context (/a⁺⁺/).

ＨＭＭ状態ＰＤＦは，現在のところ，ＢＮ同時確率により示される．これは式（１８）によると，以下のように分解される． The HMM state PDF is currently indicated by the BN joint probability. This is decomposed as follows according to equation (18).

Here, X depends on both the second preceding context C_L and the second following context C_R. Since X is a continuous variable and C_L, C_R, and Q are discrete variables, P(X | C_L, C_R, Q) is modeled by a Gaussian function, and each of P(C_L | Q) and P(C_R | Q) is represented by a conditional probability table (CPT).

The state output probability can be obtained from P(X | C_L, C_R, Q). Assuming, as in Equation (19), that the additional context variables C_L and C_R are not available during recognition (that is, they are hidden), the state output probability is given by the following.

If the term p(c_l | q_i) p(c_r | q_i) is treated as the mixture weight coefficient of the Gaussian component p(x_t | c_l, c_r, q_i), Equation (19) is equivalent to the state output probability of the conventional HMM in Equation (16). A Gaussian PDF is therefore trained for every combination of c_l, c_r, and q_i.
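The joint probability and the state output probability for the BN-C topology were given as figures in the original publication. A reconstruction from the surrounding text is sketched below. The numbering (20) and (21) is inferred from the later reference to Equations (21) and (23), and the exact typesetting is an assumption.

```latex
% Reconstruction (assumed numbering): BN-C joint probability and state output
\begin{align}
P(X, C_L, C_R, Q)
  &= P(X \mid C_L, C_R, Q)\, P(C_L \mid Q)\, P(C_R \mid Q)\, P(Q) \tag{20} \\
p(x_t \mid q_i)
  &= \sum_{c_l} \sum_{c_r} p(c_l \mid q_i)\, p(c_r \mid q_i)\,
     p(x_t \mid c_l, c_r, q_i) \tag{21}
\end{align}
```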

This pentaphone BN can also be extended within the same framework with other additional variables, such as gender or accent information. FIG. 8 shows several examples of conditional dependency structures among the triphone HMM state Q, the observed data X, the two additional variables C_L and C_R, and a gender variable G and/or an accent variable A.

ＢＮトポロジは，ノード２７２により示される，性別に関する付加的変数Ｇを用いて，ＢＮ−Ｃを拡張することで，図８（Ａ）の参照番号２７０により示されるものになる．これをＢＮ−ＣＧと呼ぶ．ノード２９２により示される追加のアクセント変数Ａを用いてＢＮ−Ｃを拡張する場合は，ＢＮトポロジは図８（Ｂ）の参照番号２９０が示すものになり，これをＢＮ−ＣＡと呼ぶ．図８（Ｃ）のＢＮトポロジ３１０は，ノード２９２及び２７２によりそれぞれ示される，アクセント及び性別に関する変数の両方を用いて拡張されたものであり，ＢＮ−ＣＧＡと呼ぶ． The BN topology is indicated by reference numeral 270 in FIG. 8A by extending BN-C with an additional variable G related to gender indicated by node 272. This is called BN-CG. When the BN-C is extended using the additional accent variable A indicated by the node 292, the BN topology is indicated by the reference number 290 in FIG. 8B, which is called BN-CA. The BN topology 310 of FIG. 8C is extended using both accent and gender variables indicated by nodes 292 and 272, respectively, and is called BN-CGA.

ＢＮ−ＣＧＡの例（図８（Ｃ）参照）に対するＨＭＭ状態ＰＤＦは以下のように表される． The HMM state PDF for the BN-CGA example (see FIG. 8C) is expressed as follows.

Here, X depends on the accent A, the gender G, the second preceding context C_L, and the second following context C_R. This state output probability can likewise be obtained from P(X | C_L, C_R, Q, A, G) in the same manner as Equation (21).

Here, if the term p(a) p(g) p(c_l | q_i) p(c_r | q_i) is treated as the mixture weight coefficient of the Gaussian component p(x | c_l, c_r, q_i, a, g), each Gaussian PDF is trained for each combination of c_l, c_r, q_i, a, and g.

両方の表記（式（２１）及び（２３））は，標準トライフォンＨＭＭ音響モデルにおいて用いられるガウス分布の混合を示す．このため，既存のＨＭＭを基にしたデコーダを，何らかの修正を行なうことなく用いて認識を行なうことができる．提供モデルのパラメータ学習は前のセクションにおいて述べたようにして実行される．トライフォン状態Ｑ，アクセントＡ，性別Ｇ，２つ前のコンテキスト（Ｃ_Ｌ），２つ後のコンテキスト（Ｃ_Ｒ），及び変数Ｘを含む全ての変数が，トレーニングで観測可能であるから，ＭＬパラメータ推定が利用される． Both notations (Equations (21) and (23)) show a mixture of Gaussian distributions used in the standard triphone HMM acoustic model. Therefore, recognition can be performed using an existing HMM-based decoder without any modification. Parameter learning of the provided model is performed as described in the previous section. Since all variables including the triphone state Q, the accent A, the sex G, the second previous context (C _L ), the second subsequent context (C _R ), and the variable X are observable in training, ML Parameter estimation is used.

If the amount of training data is insufficient to estimate all the model parameters reliably, the number of parameters can be reduced by clustering techniques such as knowledge-based or data-driven clustering. For example, by Equations (21) and (23), there is a corresponding Gaussian component for each value c_l / c_r of the second preceding / following phoneme context C_L / C_R.

FIG. 9 outlines the observation space 344 for BN 330, to which only C_R has been added. In FIG. 9, C_R is represented by node 342 and takes the various values /b/, /p/, ..., /z/ of the second following context. The different values of this variable correspond to different Gaussian distributions 350, 352, ..., 354, respectively. If a 44-phoneme set (including silence) is used for English ASR, the second preceding / following phoneme context C can take 44 values (C = c_1, c_2, ..., c_44). The total number of Gaussian distributions for each state of the BN-C topology (see FIG. 7) can therefore be 44² = 1936, and the BN-CG, BN-CA, and BN-CGA topologies require even more. If the amount of training data is insufficient to estimate this increased number of model parameters reliably, overall performance will degrade significantly. It is therefore preferable to reduce the number of Gaussian distributions. Two methods are available for doing so: one is to use knowledge-based phoneme classes, and the other is data-driven clustering. These methods are applicable to any Bayesian network.
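The growth in the number of Gaussians can be checked with a few lines of arithmetic. The sketch below reproduces the 44² = 1936 figure from the text. The gender and accent cardinalities and the 10-class clustering are hypothetical values chosen only for illustration, and the function name `gaussians_per_state` is likewise an assumption.

```python
def gaussians_per_state(n_context_values, n_context_vars=2, n_gender=1, n_accent=1):
    """Number of Gaussians per HMM state when one Gaussian is trained for
    every combination of context, gender, and accent values."""
    return (n_context_values ** n_context_vars) * n_gender * n_accent


# BN-C with a 44-phoneme set: one Gaussian per (c_l, c_r) pair.
print(gaussians_per_state(44))               # 1936

# BN-CG with a hypothetical two-valued gender variable.
print(gaussians_per_state(44, n_gender=2))   # 3872

# BN-C after clustering contexts into a hypothetical 10 phoneme classes.
print(gaussians_per_state(10))               # 100
```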

ここでは，音素コンテキストを，調音の態様における主な相違に基づき分類し，パラメータのサイズを削減する．テーブル１に，非特許文献６から流用した知識ベースの音素クラスの例を挙げる． Here, phoneme contexts are classified based on the main differences in articulation, and the parameter size is reduced. Table 1 gives examples of knowledge-based phoneme classes taken from Non-Patent Document 6.

Further details and discussion of the potential of pentaphone models based on the HMM/BN approach are given in Patent Document 1.

5. Incorporating Knowledge Sources at the Phoneme Model Level
A. General Considerations
Following the theoretical framework described in Section 2, the model M is now the HMM phoneme model λ, and D is the segment X_s.

1) Defining the causal relationships
The structure of the topology is the same as that shown in FIG. 2(A), and the probability function of the HMM phoneme unit is now expressed by a BN joint probability function of the same form as Equation (3).

To incorporate the additional knowledge sources K_1, K_2, ..., K_N into the HMM phoneme model P(X_s, λ) (assuming that all of K_1, K_2, ..., K_N are independent given λ), we simply follow Equation (5) and obtain the following.

2) Inference
The main concern here is computing P(X_s | K_1, ..., K_N, λ) for a given input segment X_s. It is difficult, however, to obtain a simple functional form for this conditional PDF, because it involves the HMM model λ and a segment X_s of variable duration. P(X_s | K_1, ..., K_N, λ) therefore needs to be decomposed using the junction tree algorithm described in Section 2-C. According to Equation (14), it decomposes as follows.

This equation is a new way of expressing the phoneme HMM likelihood P(X_s | K_1, K_2, ..., K_N, λ) in terms of several less complex dependencies, namely P(X_s | K_1, λ), ..., P(X_s | K_N, λ), each corresponding to the likelihood of the segment observation data X_s given one particular piece of additional knowledge K_1, K_2, ..., K_N.

B. Incorporating Wide-Area Phoneme Context Information
The approach described in the previous section is now applied to the same task as before, namely incorporating wide-area phoneme knowledge. To do so, the triphone context /a⁻, a, a⁺/ is extended to the pentaphone context /a⁻⁻, a⁻, a, a⁺, a⁺⁺/. Structurally, the conventional HMM triphone context unit model is depicted as model 370 in FIG. 10(A), and the pentaphone context unit model as model 372 in FIG. 10(B).

The second preceding context C_L (/a⁻⁻/) and the second following context C_R (/a⁺⁺/) are added to the probability function P(X_s | λ). The conditional dependencies among X_s, λ, C_L, and C_R are described by a BN similar to the one shown in FIG. 4(A), and the junction tree finally obtained by the decomposition is likewise similar to the one shown in FIG. 4(E), where M in FIG. 4(E) is now the HMM phoneme model λ and D is the segment X_s. From this, according to Equation (26), the conditional probability function is defined as follows.

Since λ is associated with the triphone /a⁻, a, a⁺/, C_L with the second preceding context /a⁻⁻/, and C_R with the second following context /a⁺⁺/, this can be written as follows.

Equation (28) then becomes:

This shows that the pentaphone model can be composed from p(X_s | [a⁻⁻, a⁻, a, a⁺]), p(X_s | [a⁻, a, a⁺, a⁺⁺]), and p(X_s | [a⁻, a, a⁺]). These components correspond to the likelihood of the segment X_s given the left/preceding tetraphone context unit, the right/following tetraphone context unit, and the center triphone context unit, respectively.

However, building tetraphone models for [a⁻⁻, a⁻, a, a⁺] and [a⁻, a, a⁺, a⁺⁺] is also difficult because the data are sparse.

Instead, Equation (28) is used with λ denoting the monophone /a/, and with the preceding and following contexts C_L and C_R denoting /a⁻⁻, a⁻/ and /a⁺, a⁺⁺/, respectively. This gives the following equation.

This equation shows that the pentaphone context /a⁻⁻, a⁻, a, a⁺, a⁺⁺/ is composed from p(X_s | [a⁻⁻, a⁻, a]), p(X_s | [a, a⁺, a⁺⁺]), and p(X_s | [a]). These components correspond to the likelihood of the observed data X_s given the left/preceding triphone context unit (L3), the right/following triphone context unit (R3), and the center monophone unit (C1), respectively. This composition is called C1L3R3, and its structure is shown in FIG. 10(C).
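The equations of this derivation were figures in the original publication. The reconstruction below is a sketch built from the components named in the text: Equation (27) is the two-source case of the junction tree decomposition, Equation (29) is the tetraphone-based composition, and Equation (30) is the C1L3R3 composition. The numbering is inferred from the cross-references, and Equation (28) is left out because its exact form cannot be recovered from the text with confidence.

```latex
% Reconstruction (assumed numbering)
\begin{align}
P(X_s \mid C_L, C_R, \lambda)
  &= \frac{P(X_s \mid C_L, \lambda)\, P(X_s \mid C_R, \lambda)}
          {P(X_s \mid \lambda)} \tag{27} \\
p(X_s \mid [a^{--}, a^{-}, a, a^{+}, a^{++}])
  &= \frac{p(X_s \mid [a^{--}, a^{-}, a, a^{+}])\;
           p(X_s \mid [a^{-}, a, a^{+}, a^{++}])}
          {p(X_s \mid [a^{-}, a, a^{+}])} \tag{29} \\
p(X_s \mid [a^{--}, a^{-}, a, a^{+}, a^{++}])
  &= \frac{p(X_s \mid [a^{--}, a^{-}, a])\;
           p(X_s \mid [a, a^{+}, a^{++}])}
          {p(X_s \mid [a])} \tag{30}
\end{align}
```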

図１０（Ｃ）を参照して，ベイズペンタフォンコンテキストユニットＣ１Ｌ３Ｒ３３７４は，左／先行トライフォンコンテキストユニット（Ｌ３）３８０，右／後続トライフォンコンテキストユニット（Ｒ３）３８２，及びモノフォンユニット（Ｃ１）（図示せず）を含む． Referring to FIG. 10C, Bayesian pentaphone context unit C1L3R3 374 includes left / preceding triphone context unit (L3) 380, right / following triphone context unit (R3) 382, and monophone unit (C1). (Not shown).

この図で分かるように，推定すべきコンテキストユニットの数は，コンテキストのカバーする範囲を損なうことなく，Ｎ^５から（２Ｎ^３＋Ｎ）に削減される．ただしＮは音素の数である．英語ＡＳＲに対し４４音素の組を用いるとすれば，ペンタフォンモデルで推定する必要のあるコンテキストの総数は４４^５≒１６５，０００，０００コンテキストユニットである．トライフォンコンテキストユニットを用いた構成では，この複雑さが約１７０，０００ユニットまで削減される． As can be seen in this figure, the number of context units to be estimated is reduced from N ⁵ to (2N ³ + N) without impairing the range covered by the context. N is the number of phonemes. If a set of 44 phonemes is used for the English ASR, the total number of contexts that need to be estimated with the pentaphone model is 44 ⁵ ≈165,000,000 context units. In a configuration using a triphone context unit, this complexity is reduced to about 170,000 units.
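The reduction in the number of context units can be verified directly. The short sketch below evaluates N⁵ and 2N³ + N for N = 44; the function names are hypothetical.

```python
def full_pentaphone_units(n_phonemes):
    """Number of distinct pentaphone contexts: N^5."""
    return n_phonemes ** 5


def c1l3r3_units(n_phonemes):
    """Units needed by the C1L3R3 composition: two triphone-sized
    inventories (L3 and R3) plus one monophone inventory (C1)."""
    return 2 * n_phonemes ** 3 + n_phonemes


n = 44
print(full_pentaphone_units(n))  # 164916224, about 165 million
print(c1l3r3_units(n))           # 170412, about 170 thousand
```

With N = 44 the composition needs roughly one thousandth as many units as a full pentaphone inventory.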

Analysis of Equations (29) and (30) shows that Equation (27) can also serve as a starting point for deriving other compositions of HMM phoneme models. If λ is taken to be the monophone unit /a/, and C_L and C_R to be the preceding and following context units /a⁻/ and /a⁺/, respectively, a factorization similar to the one proposed in Non-Patent Document 7 is obtained. This is known as the Bayesian triphone.

ここでは，トライフォンモデルがモノフォン及びバイフォンモデルから構築されている．以後，同様の方法で構成された全てのモデルも，ベイズモデルと呼ばれる．

Here, the triphone model is built from the monophone and biphone models. Hereinafter, all models constructed in the same way are also called Bayesian models.

ベイズ広域音素コンテキストモデルと呼ばれる，ベイズトライフォンを拡張したものもまた，本願発明者の先の研究論文である非特許文献８に記載されている．このアプローチにより，単にベイスの法則に基づくのみで，コンテキストへの依存度がより少ないモデルから広域の音素コンテキストをモデル化できる．しかし，種々の種類の知識源を組込むことが必要な場合には困難が生ずる． An extension of the Bayesian triphone, called the Bayesian wide-area phoneme context model, is also described in Non-Patent Document 8, which is the previous research paper of the present inventor. With this approach, it is possible to model a wide-range phoneme context from a model with less dependency on the context, simply based on Bayes' law. However, difficulties arise when it is necessary to incorporate various types of knowledge sources.

対照的に，ここでの統一されたフレームワークは，様々な種類の知識源を組込むための，より適切な手段を我々に与える．例えば，性別又はアクセント情報のような他の追加の知識変数で，Ｃ１Ｌ３Ｒ３をさらに拡張することが容易にできる．Ｃ１Ｌ３Ｒ３を，性別情報のみで（Ｃ１Ｌ３Ｒ３−Ｇ），アクセント情報のみで（Ｃ１Ｌ３Ｒ３−Ａ），又は，性別及びアクセントの両方の情報で（Ｃ１Ｌ３Ｒ３−ＡＧ），拡張することができる． In contrast, the unified framework here gives us a better way to incorporate different types of knowledge sources. For example, C1L3R3 can be further extended with other additional knowledge variables such as gender or accent information. C1L3R3 can be extended with gender information alone (C1L3R3-G), accent information alone (C1L3R3-A), or both gender and accent information (C1L3R3-AG).

In the case of C1L3R3-AG, the BN topology, the moral and triangulated graph, and the corresponding junction tree are as shown in FIG. 11. Referring to FIG. 11(A), BN topology 400 includes nodes 410, 412, 414, 416, 418, and 420, denoted by λ, X_s, C_L, C_R, G, and A, respectively. Referring to FIG. 11(B), the moral and triangulated graph 430 corresponding to BN topology 400 includes nodes 410, 412, 414, 416, 418, and 420, together with three additional links 422, 424, and 426 that connect nodes 418 and 420, nodes 410 and 418, and nodes 410 and 420, respectively. Referring to FIG. 11(C), the junction tree 450 corresponding to the graph of FIG. 11(B) includes cluster nodes 460, 464, and 474, labeled "X_s λ A G", "X_s C_L λ", and "X_s C_R λ", respectively, and separator nodes 462 and 472, each labeled "X_s λ".

この場合，条件付確率関数は以下のように求められる． In this case, the conditional probability function is obtained as follows.

Therefore, following the C1L3R3 settings for λ, C_L, and C_R, the pentaphone likelihood of C1L3R3-AG becomes as follows.

This shows that P(X_s | [a⁻⁻, a⁻, a, a⁺, a⁺⁺], A, G) can be simplified by factorizing it into P(X_s | [a], A, G), P(X_s | [a⁻⁻, a⁻, a], A, G), and P(X_s | [a, a⁺, a⁺⁺], A, G).

提案に係るペンタフォンモデルでＡＳＲシステムを実現するためには，いくつかのモデルで動作できる，特別なデコーダを必要とする．これは，提案に係るペンタフォンモデルを，標準的なトライフォンに基づくＨＭＭシステムにより生成されたＮ−ベストリストの再スコアリングに適用する場合には，避けることができる． In order to realize the ASR system with the proposed pentaphone model, a special decoder that can operate with several models is required. This can be avoided if the proposed pentaphone model is applied to rescoring the N-best list generated by a standard triphone-based HMM system.

FIG. 12 shows the overall structure of an ASR system 500 according to the first embodiment of the present invention. Referring to FIG. 12, ASR system 500 includes: a standard decoder 512 that receives speech waveform data 510, decodes the speech, and outputs an N-best list of hypotheses for the input speech; a model storage device 520 that stores the pentaphone models C1L3R3, C1L3R3-A, C1L3R3-G, and C1L3R3-AG, denoted by 530, 532, 534, and 536, respectively; a selector 522 that selects one of models 530, 532, 534, and 536 in response to a human operation; and a hypothesis selection module 516 that rescores the N-best hypotheses from standard decoder 512 using the model selected by selector 522 and outputs the hypothesis with the highest score among them.

FIG. 13 shows hypothesis selection module 516 in detail. Referring to FIG. 13, hypothesis selection module 516 includes: a memory 550 for storing the N-best hypotheses; a read and supply module 552 that reads the hypotheses from memory 550 one at a time and supplies the feature parameters of the segmented phonemes, in left-to-right order, to the functional units that perform the subsequent rescoring; and five shift memories 554, 556, 558, 560, and 562, which receive these feature parameters at shift memory 554. Once the feature parameters have been shifted through shift memories 554, 556, 558, 560, and 562, these memories store the feature parameters for a⁺, a⁺⁺, a, a⁻, and a⁻⁻, respectively.

Hypothesis selection module 516 further includes: a right context calculating device 570 that computes the probability P(X_s | [a, a⁺, a⁺⁺]) using the R3 model and the feature parameters stored in shift memories 554, 556, and 558; a center context calculating device 572 that computes the probability P(X_s | [a]) using the C1 model and the feature vector stored in shift memory 558; a left context calculating device 574 that computes the probability P(X_s | [a⁻⁻, a⁻, a]) using the L3 model and the feature parameters stored in shift memories 558, 560, and 562; and a PDF calculating device 576 that computes, for each segment of a hypothesis read from memory 550 by read and supply module 552, the probability P(X_s | [a⁻⁻, a⁻, a, a⁺, a⁺⁺]) according to Equation (30).

仮説選択モジュール５１６はさらに，各仮説のセグメントの確率を乗算することにより，メモリ５５０に記憶された各仮説を再スコアリングし，スコアを対応する仮説と関連付けてメモリ５５０に記憶するための再スコアリングモジュール５７８と，メモリ５５０内の仮説をスコアの降順にソートし，最も高いスコアを有する仮説を出力するためのソート及び選択モジュール５８０とを含む． The hypothesis selection module 516 further rescores each hypothesis stored in the memory 550 by multiplying the probabilities of the segments of each hypothesis and re-scores for storing the score in the memory 550 in association with the corresponding hypothesis. A ring module 578, and a sort and selection module 580 for sorting hypotheses in memory 550 in descending order of scores and outputting the hypothesis with the highest score.
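The rescoring and selection steps can be sketched as follows. This is a minimal illustration and not the apparatus itself: it assumes that each segment already carries the three local log-likelihoods from the L3, R3, and C1 models, that Equation (30) combines them as L3 times R3 divided by C1, and that scores are accumulated in the log domain instead of by direct multiplication. The names `segment_log_score` and `rescore_nbest` and all numbers are hypothetical.

```python
def segment_log_score(log_l3, log_r3, log_c1):
    """Log pentaphone likelihood of one segment under the assumed
    C1L3R3 composition: log L3 + log R3 - log C1."""
    return log_l3 + log_r3 - log_c1


def rescore_nbest(hypotheses):
    """Rescore an N-best list and sort it by descending score.

    hypotheses: list of (text, segments) pairs, where segments is a list
    of (log_l3, log_r3, log_c1) tuples, one per phoneme segment.
    Returns a list of (score, text) pairs, best hypothesis first.
    """
    scored = []
    for text, segments in hypotheses:
        # Multiplying segment probabilities equals summing their logs.
        score = sum(segment_log_score(*seg) for seg in segments)
        scored.append((score, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Hypothetical two-entry N-best list.
nbest = [
    ("hypothesis B", [(-3.0, -3.0, -1.0)]),
    ("hypothesis A", [(-1.0, -1.0, -0.5), (-2.0, -1.0, -1.0)]),
]
ranked = rescore_nbest(nbest)
print(ranked[0])  # (-3.5, 'hypothesis A')
```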

N-best recognition at the word level is performed by the standard decoder 512 on every utterance in the test data, using a conventional HMM acoustic model and standard Viterbi decoding. Every N-best hypothesis includes the acoustic scores of all its phonemes, the language model (LM) score, and the Viterbi segmentation. The hypothesis selection module 516 then rescores each phoneme segment of each hypothesis using the proposed pentaphone model.

図１３を参照して，メモリ５５０はＮベストの仮説を記憶する．読出及び供給モジュール５５２はメモリ５５０から最初の仮説を読み出し，左から右に（先頭から末尾に），仮説内の音素セグメント（特徴パラメータ）をシフトメモリ５５４へ出力する． Referring to FIG. 13, memory 550 stores N best hypotheses. The read and supply module 552 reads the first hypothesis from the memory 550 and outputs phoneme segments (feature parameters) in the hypothesis to the shift memory 554 from left to right (from the head to the end).

Shift memories 554 to 562 shift the phoneme segments. For each set of phoneme segments stored in shift memories 554, 556, and 558, the right context calculating device 570 calculates the probability P(Xs | [a, a⁺, a⁺⁺]) using the R3 model. For each phoneme segment stored in shift memory 558, the center context calculating device 572 calculates the probability P(Xs | [a]) using the C1 model. For each set of phoneme segments stored in shift memories 558, 560, and 562, the left context calculating device 574 calculates the probability P(Xs | [a⁻⁻, a⁻, a]) using the L3 model. The calculated probabilities are passed to the PDF calculating device 576, which calculates the pentaphone context probability P(Xs | [a⁻⁻, a⁻, a, a⁺, a⁺⁺]) according to Equation (30) and passes it to the rescoring module 578.
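As a rough illustration of how the PDF calculating device 576 might combine the three local probabilities, the sketch below works in the log domain. Equation (30) is not reproduced in this passage, so the combination rule used here (the left and right context likelihoods multiplied together and divided by the center likelihood, the usual junction-tree form of cluster terms over a separator term) is an assumption, and the function name `combine_local_log_probs` is hypothetical.

```python
import math


def combine_local_log_probs(log_p_left, log_p_center, log_p_right):
    """Approximate log P(Xs | [a--, a-, a, a+, a++]) from the local models.

    ASSUMPTION: Equation (30) is taken to be P_L3 * P_R3 / P_C1, that is,
    the product of the two cluster terms divided by the separator term.
    """
    return log_p_left + log_p_right - log_p_center


# Made-up likelihood values for illustration only (not experimental data).
log_p = combine_local_log_probs(math.log(0.02), math.log(0.05), math.log(0.04))
print(math.exp(log_p))
```

Working with log probabilities is a common way to avoid numerical underflow; the text itself does not say which domain the apparatus uses.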

For each phoneme segment stored in shift memory 558, the read and supply module 552 tells the rescoring module 578 when to read the output of the PDF calculating device 576. In response, the rescoring module 578 reads the output of the PDF calculating device 576 and stores the value. At the end of a hypothesis, the read and supply module 552 sends a signal to the rescoring module 578. In response, the rescoring module 578 calculates the score of the hypothesis by multiplying together the probabilities of all phoneme segments in that hypothesis. When the calculation is complete, the rescoring module 578 stores the score (the pentaphone score) in the memory 550 in association with the hypothesis being processed.
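The per-hypothesis score described above is a product of segment probabilities. A minimal sketch follows, assuming the product is accumulated as a sum of logarithms (an implementation choice not stated in the text); the function name `hypothesis_log_score` is hypothetical.

```python
import math


def hypothesis_log_score(segment_probs):
    """Return log(p_1 * p_2 * ... * p_n) for one hypothesis.

    Summing logs is equivalent to multiplying the probabilities but does
    not underflow when a hypothesis contains many phoneme segments.
    """
    probs = list(segment_probs)
    if any(p <= 0.0 for p in probs):
        raise ValueError("segment probabilities must be positive")
    return sum(math.log(p) for p in probs)
```

Exponentiating the result recovers the plain product when it is representable as a float.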

Once pentaphone scores have been calculated for all hypotheses stored in the memory 550, the read and supply module 552 sends a signal to the sort and selection module 580. In response, the sort and selection module 580 reads all hypotheses stored in the memory 550 together with their pentaphone and LM scores, combines the pentaphone and LM scores into a new score, sorts the hypotheses in descending order of the new score, selects the hypothesis with the highest score, and outputs it as the new hypothesis 518.
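The sort and selection step can be sketched as follows. The text does not say how the pentaphone and LM scores are combined, so the weighted sum of log scores, the `lm_weight` parameter, and the `Hypothesis` record are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: str
    pentaphone_log_score: float
    lm_log_score: float


def select_best(hypotheses, lm_weight=1.0):
    """Rank hypotheses by combined score (descending) and return the best.

    ASSUMPTION: the combined score is the pentaphone log score plus a
    weighted LM log score; the source only says the two are "combined".
    """
    def combined(h):
        return h.pentaphone_log_score + lm_weight * h.lm_log_score

    ranked = sorted(hypotheses, key=combined, reverse=True)
    return ranked[0], ranked
```

Returning the full ranking alongside the winner mirrors the description, in which the module sorts all N-best entries before selecting the top one.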

図１４に，仮説の再スコアリングの例を示す． Figure 14 shows an example of hypothesis rescoring.

トレーニングの間に，いくつかの音素コンテキストが出現しなかったかもしれない．このようなコンテキストに対しては，ここで提案したペンタフォンコンテキストモデルは，認識の間に出力確率を作りだすことができない．この問題に対処するため，ここでは，単純に，小さな数値を出力確率として割当る．この再スコアリングには先行，後続，及び中央のモデルからの出力確率が関係するため，全ての要素モデルにフロアリングが適用される． Some phonemic contexts may not have appeared during training. For such contexts, the proposed pentaphone context model cannot produce output probabilities during recognition. To deal with this problem, we simply assign a small number as the output probability. Because this rescoring involves output probabilities from the preceding, following, and central models, flooring is applied to all elemental models.
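Flooring of unseen contexts can be sketched as a lookup with a lower bound. The floor value `1e-10` and the dictionary-based model are hypothetical; the text only says that "a small number" is assigned.

```python
FLOOR_PROB = 1e-10  # hypothetical value; the text does not give one


def output_prob_with_floor(model, context, floor=FLOOR_PROB):
    """Return the model's probability for a context, never below the floor.

    `model` is assumed to be a dict mapping context tuples to probabilities.
    Contexts that never appeared in training are simply missing from it.
    """
    return max(model.get(context, floor), floor)
```

Applying the same function to the left, center, and right models corresponds to flooring every component model, as the paragraph above requires.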

When the amount of training data is insufficient, parameter estimation becomes unreliable even for the proposed pentaphone model, and the state output becomes less reliable as well. To improve the reliability of the model, deleted interpolation is used, which makes it possible to fall back to a more reliable model when the supposedly more precise model is in fact unreliable. The concept involves interpolating between two separately trained models, one of which is trained more reliably than the other. Instead of interpolating two models, however, we applied the approach to combining two phoneme likelihoods: the phoneme likelihood of the proposed Bayesian pentaphone model, P(X_s | λ_bayPenta), is the more precise one, and the triphone likelihood, P(X_s | λ_triphn), is the more reliable one. The phoneme likelihood P(X_s | λ) is therefore given as follows.
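The equation itself is not reproduced in this text. A form consistent with the weights α and (1 − α) defined in the following paragraph, assuming plain linear interpolation of the two likelihoods, would be:

```latex
P(X_s \mid \lambda) = \alpha \, P(X_s \mid \lambda_{bayPenta}) + (1 - \alpha) \, P(X_s \mid \lambda_{triphn})
```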

Here, α is the weight of the HMM phoneme likelihood of the proposed pentaphone model, and (1 − α) is the weight of the HMM phoneme likelihood of the triphone model. When the amount of training data is large enough, P(X_s | λ_bayPenta) becomes more reliable and α approaches 1.0. Otherwise, α approaches 0.0 and the combination falls back to the more reliable model P(X_s | λ_triphn).
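The weighting behaviour described here can be sketched in a few lines. The linear form in the probability domain is an assumption, since the original equation is not shown, and the function name is hypothetical.

```python
def interpolate_likelihood(p_bayes_penta, p_triphone, alpha):
    """Weighted combination of pentaphone and triphone phoneme likelihoods.

    ASSUMPTION: linear interpolation; alpha weights the Bayesian pentaphone
    likelihood and (1 - alpha) weights the triphone likelihood.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_bayes_penta + (1.0 - alpha) * p_triphone
```

With `alpha=1.0` the result is the pentaphone likelihood alone, and with `alpha=0.0` it falls back entirely to the triphone likelihood, matching the two limiting cases in the text.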

At the beginning and end of an utterance, all left and right contexts, respectively, are filled with silence. Because no long silence is assumed to exist between adjacent words, the last phoneme context of the preceding word also affects the first phoneme context of the current word. The rescoring mechanism thus behaves the same way for all segments, both within words and across word boundaries (cross-word model).
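The context handling described above can be sketched as a five-phoneme sliding window over the whole utterance, padded with silence at both ends. The silence symbol `"sil"` and the function name are hypothetical.

```python
def pentaphone_windows(phones, sil="sil"):
    """Return one (a--, a-, a, a+, a++) tuple per phoneme in the utterance.

    The phoneme sequence spans the whole utterance, so contexts cross word
    boundaries; only the utterance edges are padded with silence.
    """
    phones = list(phones)
    padded = [sil, sil] + phones + [sil, sil]
    return [tuple(padded[i:i + 5]) for i in range(len(phones))]
```

For a three-phoneme utterance this yields three windows, the first with two silence symbols on the left and the last with two on the right.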

前述のように計算されたスコアはその後，現在の仮説に対応したＬＭスコアと組合わされる．Ｎベストから，最も高い発声スコアを達成する仮説が新しい認識出力として選択される． The score calculated as described above is then combined with the LM score corresponding to the current hypothesis. From N best, the hypothesis that achieves the highest utterance score is selected as the new recognition output.

6. Experiments
An accented English speech corpus prepared by the applicant, Advanced Telecommunications Research Institute International (ATR), was used in the experiments. The sentence material is based on the basic travel expression domain. The speech database covers American (US) and Australian (AUS) English accents, each comprising about 45,000 utterances (44 hours of speech) by 100 speakers (50 male, 50 female). Ninety percent of this data, that is, 40,000 utterances (20,000 utterances by 40 speakers of each gender), was used as training data. For evaluation, 200 utterances by 20 different speakers (10 male, 10 female) were randomly selected from a mixture of the remaining 10% of the accent data (US and AUS). Bigram and trigram language models were trained on about 150,000 travel-related sentences. The available pronunciation dictionary consisted of 37,000 words and was based on US pronunciation.

The features were 25-dimensional vectors consisting of 12th-order MFCCs (Mel-Frequency Cepstrum Coefficients), ΔMFCCs, and Δ log power, computed at a 16 kHz sampling frequency with a 20 ms frame length and a 10 ms frame shift. Three states were used as the initial HMM for every phoneme. A triphone acoustic model with a state-tied HMnet topology was then obtained using the Successive State Splitting (SSS) training algorithm. The number of tied states is determined automatically by the algorithm, because the SSS algorithm used here is based on the Minimum Description Length (MDL) optimization criterion. Details of MDL-SSS are given elsewhere (Non-Patent Document 9). SSS topology training was carried out on all of the training data. The total number of states was 2,126, and four model versions were obtained, with 5, 10, 15, and 20 Gaussian mixture components per state.
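The front-end figures above translate into sample counts as in the sketch below. The frame-count rule (no padding, no partial final frame) and the split of the 25 dimensions into 12 + 12 + 1 are assumptions consistent with the stated total, not details given in the text.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_LENGTH_MS = 20
FRAME_SHIFT_MS = 10

# Frame length and shift expressed in samples.
frame_length = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000
frame_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000


def num_frames(num_samples):
    """Number of full frames in a signal (ASSUMPTION: no padding)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


# ASSUMPTION: 12 MFCC + 12 delta-MFCC + 1 delta log power.
FEATURE_DIM = 12 + 12 + 1
```

Under these assumptions each frame covers 320 samples and consecutive frames overlap by half.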

Additional knowledge such as gender and accent can also be incorporated into conventional triphone acoustic models (AMs) by training gender-dependent and/or accent-dependent AMs. To ensure that the topology-related structure is the same for all models, only the embedded training procedure was run on the training data for the given accent or gender. In total, this yielded one single triphone AM (no additional knowledge), two accent-dependent triphone AMs (US and AUS), two gender-dependent triphone AMs (male and female), and four accent- and gender-dependent triphone AMs (US male, US female, AUS male, and AUS female).

The performance of these baseline models with five mixture components per state is plotted in FIG. 15. The triphone baseline without additional knowledge achieved a word accuracy of 83.60%. Only the gender-dependent model improved on it, and only slightly; the performance of the other models simply dropped. In particular, the accent- and gender-dependent model fell to a word accuracy of 82.11%, probably because the amount of training data per model was especially small compared with the other baseline models.

A. Performance with knowledge sources incorporated at the HMM state level
The proposed pentaphone models were trained as described in Section 4-B, using the same amount of training data, on all of the accent data labeled with the phoneme class context variables. The state topology, the total number of states, and the transition probabilities are all identical to those of the triphone HMM baseline, so the models have comparable complexity in terms of the number of parameters. The only major difference lies in the state probability distributions, where each Gaussian is explicitly conditioned on C_L or C_R. By contrast, all Gaussian components in the HMM baseline were learned by the EM algorithm without any "meaningful" interpretation of the mixture index. Some phoneme context classes C_L or C_R either do not exist because of grammar rules or did not appear in the training data, so that training produced about 50 Gaussians per state on average. A data-driven clustering technique was used to reduce the size of the pentaphone models to the equivalent of 5, 10, 15, and 20 mixture components per state. This avoids unreliable parameter estimates and, because the total number of Gaussians is then exactly the same, allows performance to be compared directly with the baseline system.

First, the pentaphone models BN-C, BN-CG, BN-CA, and BN-CGA were evaluated on the same test data as the baseline. The results for all four models, each with the same average of five mixture components per state, are plotted in FIG. 16.

これからわかるように，全てのＢＮのタイプを用い，様々なタイプの知識源の組込みを行なうように状態の確率分布を変えただけで，認識が向上した．しかし，性別及びアクセント変数を組込んだものでは，ここで提案したモデルの認識率はそれ以上向上しなかった．この問題も，各々のアクセント又は性別依存モデルに対するトレーニングデータに限りがあることに関係しているのであろう．それが，最高性能がＢＮ−Ｃを用いた場合の単語正解率８５．０３％である理由である． As can be seen, recognition was improved by using all BN types and changing the state probability distribution to incorporate various types of knowledge sources. However, when the gender and accent variables were incorporated, the recognition rate of the proposed model did not improve further. This problem may also be related to the limited training data for each accent or gender-dependent model. That is why the best performance is 85.03% when using BN-C.
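To relate the word accuracy figures to error rates, the sketch below computes a relative WER reduction from two accuracy values. It assumes WER = 100 − word accuracy, which the text does not state explicitly; the function name is hypothetical.

```python
def relative_wer_reduction(baseline_acc, new_acc):
    """Relative WER reduction in percent, from word accuracies in percent.

    ASSUMPTION: WER is taken as 100 minus the word accuracy.
    """
    base_wer = 100.0 - baseline_acc
    new_wer = 100.0 - new_acc
    return 100.0 * (base_wer - new_wer) / base_wer


# Triphone baseline (83.60%) versus BN-C (85.03%) on the mixed-accent test set.
print(round(relative_wer_reduction(83.60, 85.03), 2))
```

Under that assumption, moving from 83.60% to 85.03% word accuracy corresponds to a relative WER reduction of roughly 8.7%.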

We evaluated this model on matched-accent test sets. To examine in more detail the effect of using BN-C, this test data consists of 200 utterances randomly selected from each accent (US and AUS). Table 2 summarizes the results obtained with models having different numbers of mixture components.

As can be seen, the proposed pentaphone models outperform the baseline at the same number of parameters. The best performance of the US pentaphone HMM/BN was obtained with 10 Gaussian mixture components, giving a relative reduction in WER (Word Error Rate) of about 8%, and the best performance of the AUS pentaphone model was obtained with 20 Gaussian mixture components, giving a relative WER reduction of about 11%. The pentaphone models were also evaluated on mismatched-accent test sets: for example, a model trained on US speech was tested on AUS test data, and vice versa. Table 3 summarizes the results obtained with the 15-mixture models. To simplify comparison between the matched and mismatched conditions, Table 3 also includes the results from the matched-accent evaluation. Even on mismatched accents, the pentaphone models still consistently outperform the standard HMM triphone models.

B. Performance with knowledge sources incorporated at the HMM phoneme model level
In Non-Patent Document 8, we investigated several ways of decomposing the pentaphone model and found the C1L3R3 composition to be the best. Only additional experiments using the C1L3R3 model are described here.

All components of all accented pentaphone models were trained separately, using the same amount of training data and the same SSS training algorithm. The total number of states was 3,360 (the sum of 132 states for C1, 1,746 states for L3, and 1,782 states for R3), and four versions were obtained, with 5, 10, 15, and 20 Gaussian mixture components per state. The embedded training procedure was then run on the training data for the specific accent or gender to obtain the pentaphone models C1L3R3-A, C1L3R3-G, and C1L3R3-AG.

First, we evaluated the effect of incorporating additional knowledge sources on test data containing multiple accents. FIG. 17 summarizes the results for the proposed pentaphone models C1L3R3, C1L3R3-A, C1L3R3-G, and C1L3R3-AG with five mixture components. Rescoring was performed using a 10-best list and a weight parameter α of 0.3 for deleted interpolation. As can be seen, the more knowledge sources were incorporated, the better the performance. The proposed pentaphone C1L3R3 model improved on the baseline, and the best performance achieved was a word accuracy of 84.38%, obtained by C1L3R3-AG, which incorporates the additional knowledge of accent A, gender G, preceding context C_L, and following context C_R. When gender and accent were incorporated, there was no performance degradation of the kind seen with the pentaphone HMM/BN, probably owing to the use of deleted interpolation.

Next, we examined in detail how C1L3R3-AG performs on all of the accented test data, using N-best lists (N = 10). The weight parameter α for deleted interpolation was the same (0.3). Both the relative improvement (Rel-Imp), as used in Non-Patent Document 1, and the relative improvement with respect to rescoring (Rel-Resc-Imp) were calculated as follows.

Here, the upper bound of the N-best list is the N-best recognition result.

Table 4 summarizes the results obtained with models having different numbers of mixture components. As can be seen, the proposed pentaphone models consistently improved the performance of the ASR system. The largest Rel-Resc-Imp for both the US and AUS accents was obtained with the 15-mixture models (37.92% for the US model and 38.04% for the AUS model).

We also evaluated how the proposed pentaphone C1L3R3-AG models perform on mismatched-accent test sets. Table 5 summarizes the results obtained with the models having 15 mixture components. To simplify comparison between the matched and mismatched conditions, Table 5 also includes the results from the matched-accent evaluation. The proposed pentaphone C1L3R3-AG models consistently outperform the standard triphone models on mismatched accents.

C. Comparison of various models
Finally, additional experiments were conducted to determine whether the high performance of the proposed models comes mainly from the wide phonetic context. These used a conventional pentaphone HMM with 2,202 states, trained from scratch with MDL-SSS. Gender- and accent-dependent pentaphone models were also obtained by running the embedded training procedure on the training data for the specific accent or gender. As with the Bayesian pentaphone models, these were applied by rescoring the N-best list.

The results for all models with five mixture components per state are plotted in FIG. 18. As can be seen, the proposed pentaphone C1L3R3 model improves on the baseline and also outperforms simple rescoring with the conventional pentaphone HMM. The likely reason is that, for the given amount of training data, training the conventional pentaphone model with the MDL-SSS algorithm yielded a model with a total of 2,202 states, which is not very different from the total number of states in the triphone HMM. Too many different pentaphone contexts appear to share the same Gaussian components, which lowers the context resolution. Approximating the pentaphone model by a composition of several less context-dependent models therefore helped to raise the context resolution and improve performance. The best performance obtained was a word accuracy of 85.03%, with BN-C.

7. Conclusion
We have described a general framework for incorporating additional knowledge sources into HMM-based statistical acoustic models. We demonstrated the framework by incorporating wide phonetic context information into triphone HMMs. This was first done at the HMM state level using a BN. Even when the additional knowledge sources are hidden during recognition, this approach allows a standard decoding system to be used without modification. Next, wide phonetic context was incorporated at the HMM phoneme model level by composing the wide-context acoustic model from several other models with narrower contexts. Because this composition technique reduces the number of context units to be estimated, only less context-dependent models need to be estimated, and the context resolution improved significantly.

These wide-context model compositions were applied in a post-processing stage by N-best rescoring. The experimental results show that wide phonetic context models built with the proposed framework improve word accuracy over the standard triphone model. The additional knowledge of the second preceding context C_L and the second following context C_R is suited to incorporation at the HMM state level, whereas the additional knowledge of accent A and gender G was better suited to incorporation at the HMM phoneme model level.

上述のように，本発明は，付加的な知識源を統一された方法で組み込むための方法及び装置に関するものである．これら方法及び装置はベイズネットワークのフレームワークを利用し，どのようなドメインからのものでも，すべての付加的知識源を簡単に統合する．このグラフによるモデルフレームワークの有利な点は，（１）情報源間の確率論的関係を学習することを可能にすること，及び，（２）同時確率密度関数を，互いにリンクされた局部的条件付確率密度関数の組に分解することを容易にすること，である．モデルが簡素化された形式であるため，このようにして，限定された量のデータを用いてモデルを構築し，信頼性高く推定することが可能である． As mentioned above, the present invention relates to a method and apparatus for incorporating additional knowledge sources in a unified way. These methods and devices use the Bayesian network framework to easily integrate all additional knowledge sources from any domain. The advantages of this graphical model framework are: (1) it enables learning of probabilistic relationships between information sources, and (2) the joint probability density function is linked to each other locally. It is easy to decompose into a set of conditional probability density functions. Since the model is in a simplified form, it is possible in this way to build a model with a limited amount of data and estimate it reliably.

このフレームワークは一般的なアプローチを代表するものである．即ち，このフレームワークは，それぞれモデルに基づく尤度関数を持つ，多くの既存の音響モデルのモデル化の問題に適用できる． This framework represents a general approach. In other words, this framework can be applied to many existing acoustic model modeling problems, each with a model-based likelihood function.

Computer implementation
The embodiments described above can be implemented by a computer system and a computer program executed on it. FIG. 19 shows the external appearance of the computer system 650 used in these embodiments, and FIG. 20 is a block diagram of the computer system 650. The computer system 650 shown here is merely an example, and various other configurations may be used.

図１９を参照して，コンピュータシステム６５０は，コンピュータ６６０と，モニター６６２と，キーボード６６６と，マウス６６８と，スピーカー６９２と，マイクロフォン６９０とを含む．さらに，コンピュータ６６０は，ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）ドライブ６７０及び半導体メモリポート６７２を含む． Referring to FIG. 19, a computer system 650 includes a computer 660, a monitor 662, a keyboard 666, a mouse 668, a speaker 692, and a microphone 690. Further, the computer 660 includes a DVD (Digital Versatile Disc) drive 670 and a semiconductor memory port 672.

Referring to FIG. 20, the computer 660 further includes: a bus 686 connected to the DVD drive 670 and the semiconductor memory port 672; a CPU (Central Processing Unit) 676 for executing a computer program that implements the apparatus described above; a ROM (Read-Only Memory) 678 storing a boot-up program for the computer 660; a RAM (Random Access Memory) 680 providing a work area used by the CPU 676 and a storage area for the programs executed by the CPU 676; a hard disk drive 674 for storing speech data, acoustic data, language models, and the lexicon needed for speech recognition; and a network interface (I/F) 696 for connecting the computer 660 to a network 652. All of these are connected to the bus 686.

The software that implements the system according to the embodiments described above is distributed as object code stored on a storage medium such as a DVD 682 or a semiconductor memory 684, supplied to the computer 660 through a reading device such as the DVD drive 670 or the semiconductor memory port 672, and stored on the hard disk drive 674. When the CPU 676 executes the program, the program is read from the hard disk drive 674 and stored in the RAM 680. An instruction is fetched from the address specified by a program counter (not shown) and executed. The CPU 676 reads the data to be processed from the hard disk drive 674 and stores the processing results on the hard disk drive 674 as well. The speaker 692 and the microphone 690 are used for speech recognition and speech synthesis.

コンピュータシステム６５０の一般的動作は周知であるので，ここでは詳細な説明は行なわない． The general operation of computer system 650 is well known and will not be described in detail here.

ソフトウェアの流通の方法に関して，ソフトウェアは必ずしも記憶媒体上に固定されたものでなくてもよい．例えば，ソフトウェアはネットワーク６５２に接続された別のコンピュータから配布されてもよい．ソフトウェアの一部がハードディスク６７４に記憶され，ソフトウェアの残りの部分をネットワークを介してハードディスク６７４に取込み，実行の際に統合する様にしてもよい． Regarding software distribution methods, software does not necessarily have to be fixed on a storage medium. For example, the software may be distributed from another computer connected to the network 652. A part of the software may be stored in the hard disk 674, and the remaining part of the software may be taken into the hard disk 674 via the network and integrated at the time of execution.

Typically, a modern computer makes use of general-purpose functions provided by its operating system (OS) and executes them in a controlled manner according to the desired purpose. Accordingly, a program that does not itself contain the general-purpose functions available from the OS or a third party, and that specifies only a combination and order of execution of such functions, clearly still falls within the scope of the present invention as long as the program as a whole has a control structure that achieves the desired purpose.

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is defined by the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

A diagram showing the general procedure for incorporating additional knowledge sources into an acoustic model.
A diagram showing various BN topologies.
A diagram showing some examples of various BN topologies.
A diagram showing a BN topology, the corresponding triangulated graphs, and the junction tree obtained from one of the triangulated graphs.
A diagram showing the same BN topology as the BN shown in FIG. 3(A) and the corresponding junction tree.
A diagram showing a conventional HMM acoustic model with Gaussian mixture densities used to model the triphone /a⁺, a, a⁻/.
A diagram showing the BN-C topology for modeling the pentaphone context /a⁻⁻, a⁻, a, a⁺, a⁺⁺/.
A diagram showing the BN-CG, BN-CA, and BN-CGA topologies.
A diagram showing an example of observation space modeling with a BN.
A diagram showing the conventional triphone model, the conventional pentaphone model, and the Bayesian pentaphone model configuration C1L3R3.
A diagram showing a BN topology, the corresponding moral and triangulated graph, and the corresponding junction tree.
A diagram showing the overall structure of ASR system 500 according to an embodiment of the present invention.
A block diagram showing the details of hypothesis selection module 516.
A diagram showing an example of the N-best rescoring mechanism according to the present embodiment.
A diagram showing the recognition word accuracy of the triphone baseline model used in the experiments.
A diagram showing the recognition word accuracy of pentaphone HMM/BN models using various BN topologies.
A diagram showing the recognition word accuracy of various Bayesian pentaphone models.
A diagram showing the recognition word accuracy of various systems: the triphone HMM baseline, the pentaphone HMM baseline, and the pentaphone model according to the embodiment of the present invention.
A diagram showing the external appearance of computer system 650.
A block diagram showing computer system 650.

Explanation of symbols

70, 80, 90, 100, 110, 140, 240, 330 Bayesian network
130, 150 Moral and triangulated graph
160, 180, 450 Junction tree
190 HMM
270, 290, 310, 400, 430 BN topology
164, 166, 170, 460, 464, 474 Cluster set
162, 168, 462, 472 Separator set
500 ASR system
510 Speech waveform data
512 Standard decoder
514 N-best list
516 Hypothesis selection module
530 C1L3R3 pentaphone model
532 C1L3R3-A pentaphone model
534 C1L3R3-G pentaphone model
536 C1L3R3-AG pentaphone model
550 Memory
552 Read and supply module
554, 556, 558, 560, 526 Shift memory
570 Right context calculating device
572 Center context calculating device
574 Left context calculating device
576 Probability density function calculating device
578 Rescoring module
580 Sorting and selection module

Claims

1. A probability calculation apparatus for calculating a probability for each of a predefined set of phonemes present in a given segment of a speech signal, using a statistical acoustic model for the speech signal and one or more knowledge sources, wherein the segment includes a plurality of frames of the speech signal, the acoustic model and the one or more knowledge sources have a causal relationship represented by a Bayesian network, and the Bayesian network corresponds to a junction tree including a plurality of cluster nodes and one or more separator nodes,
the apparatus comprising:
means for storing a plurality of local acoustic models corresponding to the cluster nodes and the one or more separator nodes;
means for calculating predefined observation data for each of the frames;
local probability calculating means for calculating, using the plurality of local acoustic models, a local probability that each of the phonemes generates the observation data; and
probability calculating means for calculating the probability that each of the phonemes generates the observation data as a predetermined function of the local probabilities calculated by the local probability calculating means,
wherein the predetermined function is defined by

[Formula image not reproduced in this text]

where D is the observation data, M is the acoustic model, N is a positive integer, Ki (i = 1 to N) are the one or more knowledge sources, and
P(D | Ki, M) (i = 1 to N) and P(D | M) are the local probabilities calculated by the local probability calculating means.

2. The apparatus of claim 1, wherein the model M is a monophone acoustic model, and
the one or more knowledge sources include a preceding triphone context unit and a succeeding triphone context unit.

3. The apparatus of claim 1, wherein the model M is a monophone acoustic model trained with an additional knowledge source, and
the one or more knowledge sources include a preceding triphone context unit and a succeeding triphone context unit.

4. The apparatus of claim 3 , wherein the additional knowledge source includes accent knowledge, or gender knowledge, or both accent knowledge and gender knowledge.

5. A computer program which, when executed on a computer, causes the computer to function as a probability calculation apparatus for calculating a probability for each of a predefined set of phonemes present in a given segment of a speech signal, using a statistical acoustic model for the speech signal and one or more knowledge sources, wherein the segment includes a plurality of frames of the speech signal, the acoustic model and the one or more knowledge sources have a causal relationship represented by a Bayesian network, and the Bayesian network corresponds to a junction tree including a plurality of cluster nodes and one or more separator nodes,
the computer program causing the computer to function as:
means for storing a plurality of local acoustic models corresponding to the cluster nodes and the one or more separator nodes;
means for calculating predefined observation data for each of the frames;
local probability calculating means for calculating, using the plurality of local acoustic models, a local probability that each of the phonemes generates the observation data; and
probability calculating means for calculating the probability that each of the phonemes generates the observation data as a predetermined function of the local probabilities calculated by the local probability calculating means,
wherein the predetermined function is defined by

[Formula image not reproduced in this text]

where D is the observation data, M is the acoustic model, N is a positive integer, Ki (i = 1 to N) are the one or more knowledge sources, and
P(D | Ki, M) (i = 1 to N) and P(D | M) are the local probabilities calculated by the local probability calculating means.
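The claims describe the phoneme probability as a function of the local probabilities P(D | Ki, M) and P(D | M), but the formula itself is not reproduced in this text. As an illustration only, the following minimal sketch assumes the usual junction-tree factorization, in which the product of the cluster terms is divided by the separator term raised to the power N - 1, and evaluates it in the log domain. The function name `combine_log_probs` and the numeric values are hypothetical and do not come from the patent.

```python
import math
from typing import Sequence


def combine_log_probs(cluster_log_probs: Sequence[float],
                      separator_log_prob: float) -> float:
    """Combine local log-probabilities for one phoneme and one frame.

    ASSUMED form (not taken from the claim text, whose formula is missing):
        log P(D | K1..KN, M) = sum_i log P(D | Ki, M) - (N - 1) * log P(D | M)
    cluster_log_probs  : log P(D | Ki, M) for i = 1..N (cluster models)
    separator_log_prob : log P(D | M) (separator model)
    """
    n = len(cluster_log_probs)
    if n < 1:
        raise ValueError("at least one knowledge source is required")
    return sum(cluster_log_probs) - (n - 1) * separator_log_prob


if __name__ == "__main__":
    # Hypothetical local probabilities for one frame with N = 2:
    # a left-context model, a right-context model, and a center model.
    p_left, p_right, p_center = 0.2, 0.3, 0.5
    log_p = combine_log_probs([math.log(p_left), math.log(p_right)],
                              math.log(p_center))
    # Under the assumed form this equals p_left * p_right / p_center.
    print(math.exp(log_p))
```

Working with log-probabilities rather than raw probabilities avoids numerical underflow when many frames are scored; with N = 2 the assumed form reduces to the two context terms divided by the center term.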