JP5749187B2

JP5749187B2 - Parameter estimation device, parameter estimation method, speech recognition device, speech recognition method and program

Info

Publication number: JP5749187B2
Application number: JP2012024307A
Authority: JP
Inventors: 久保 陽太郎; 堀 貴明; 中村 篤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-02-07
Filing date: 2012-02-07
Publication date: 2015-07-15
Anticipated expiration: 2032-02-07
Also published as: JP2013160998A

Description

The present invention relates to a parameter estimation device and a parameter estimation method that estimate parameters for adjusting a probabilistic finite state model containing the parameters used in speech recognition, to a speech recognition device and a speech recognition method that obtain a speech recognition result for speech data using the estimated parameters, and to a program.

In the following description, symbols used in the text such as "^", "⁻", and "^~" properly belong directly above the preceding character but, owing to the limitations of text notation, are written immediately after that character. In equations these symbols are written in their proper positions. In the drawings used for the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted.

A speech recognition device is generally represented by a probabilistic finite state model. In a finite state model, a state variable taking discrete values moves to a corresponding state each time it receives an input, that is, a portion of the speech (hereinafter called a "frame"). A probabilistic finite state model extends the finite state model with the concept of probability: when a frame is received, the state to which the model transitions is given only probabilistically. A speech recognition device based on a probabilistic finite state model receives the entire frame sequence, estimates the most probable state transitions, and outputs the word string corresponding to those state transitions.

Speech recognition devices based on the WFST (Weighted Finite State Transducer) have been widely used in recent years as speech recognition devices built on a probabilistic finite state model. Such a device obtains the final probabilistic finite state model by combining the modules required for speech recognition with the WFST composition operation. Examples of these modules are a finite state model expressing the probabilistic transitions between words, a WFST defining the probabilistic conversion from a word to its pronunciation (phoneme sequence), a WFST defining the conversion from phonemes to context-dependent phonemes that take into account coarticulation with the preceding and following phonemes, and hidden Markov models (hereinafter also "HMMs") prepared in advance for each context-dependent phoneme. The parameters of each module are generally either defined by hand in advance or learned from training data.

However, even if each module of a speech recognition device obtained in this way is optimal with respect to its own training, the composed probabilistic finite state model is not necessarily optimal. For example, when overall optimality is considered under the class of training criteria called "discriminative training criteria", which aim to lower the expected speech recognition error rate, it is known that training the model in a form decomposed into modules does not necessarily yield the optimal solution (see Non-Patent Document 1).

FIG. 1 shows a functional block diagram of a speech recognition device 91 based on a probabilistic finite state model. A speech recognition device based on a probabilistic finite state model is expressed by a transition cost function π(X⁻, s, s', t, t') and output symbols o(s, s'). Here X⁻ is the input sequence, a time-ordered arrangement of the feature vectors x⁻ extracted from the input speech data, written X⁻ = {x⁻_1, x⁻_2, ..., x⁻_T}, where T is the size of the input speech data (the total number of frames). The input sequence X⁻ is generally computed in advance by the feature vector sequence calculation unit 7 as the input to the speech recognition device 91. The transition cost function π defines how unlikely it is that, in a given state s, the model transitions to the next state s' when the portion of the input sequence X⁻ from time t to time t' is input (where 1 ≤ t < t' ≤ T); the larger the value of π, the harder the transition from state s to state s'. The output symbol o(s, s') is what is obtained as the speech recognition result when the state changes from s to s'; a word, for example, can serve as an output symbol. Some state transitions produce no recognition result; for convenience, these are regarded as outputting the epsilon symbol ε.

Using this transition cost function π, a state sequence s⁻ := {s_1, s_2, ..., s_M}, and segment times t⁻ := {t_1, t_2, ..., t_M} (where t_1 = 1 and t_M = T), the following sequence cost function Π is defined.

This sequence cost function Π is the sum, from m = 1 to m = M-1, of the transition cost function π for the state transition from state s_m to state s_{m+1} (hereinafter also written "state transition s→s'") that occurs between time t_m and time t_{m+1}. It expresses the cost at which the state sequence s⁻ occurs when the entire input sequence X⁻ is input and the segment times t⁻ are fixed. In the following, the outputs of the transition cost function π and of the sequence cost function Π are also simply called costs.
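The equation referred to as Equation (1) is not reproduced in this text. Based solely on the verbal description above, it can be reconstructed as the following sum; the argument order and typesetting are assumptions rather than a copy of the original.

```latex
\Pi(\bar{X}, \bar{s}, \bar{t})
  = \sum_{m=1}^{M-1} \pi\bigl(\bar{X},\, s_m,\, s_{m+1},\, t_m,\, t_{m+1}\bigr)
```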

The shortest path search unit 911 of the speech recognition device 91 uses this sequence cost function Π to output the optimal state sequence s^~ for an input sequence X⁻ as the solution of the following shortest path problem.

The word string that forms the final recognition result is obtained by listing, with ε removed, the output symbols o(s^~_m, s^~_{m+1}) corresponding to the state transitions s^~_m → s^~_{m+1} in this optimal state sequence s^~ = {s^~_1, s^~_2, ...}. The output symbol extraction unit 8 extracts the optimal word string from this optimal state sequence s^~.
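The search and extraction steps can be illustrated with a small sketch. It assumes that a finite list of candidate (state sequence, segment times) pairs is already available and simply picks the cheapest one; a real decoder would search the model itself. The names `sequence_cost`, `best_path`, `extract_words`, and the `EPSILON` marker are hypothetical and do not come from the patent.

```python
EPSILON = "<eps>"  # hypothetical marker standing in for the epsilon symbol


def sequence_cost(pi, X, states, times):
    """Sum the transition costs pi(X, s_m, s_{m+1}, t_m, t_{m+1}) for m = 1..M-1."""
    return sum(
        pi(X, states[m], states[m + 1], times[m], times[m + 1])
        for m in range(len(states) - 1)
    )


def best_path(pi, X, candidates):
    """Return the (states, times) candidate with the lowest sequence cost."""
    return min(candidates, key=lambda c: sequence_cost(pi, X, c[0], c[1]))


def extract_words(states, output_symbol):
    """List the output symbols along a state sequence, skipping epsilon.

    output_symbol maps (s, s') to a word; missing entries count as epsilon.
    """
    words = []
    for s, s_next in zip(states, states[1:]):
        symbol = output_symbol.get((s, s_next), EPSILON)
        if symbol != EPSILON:
            words.append(symbol)
    return words
```

Exhaustive comparison of candidates is used here only to keep the sketch short; it stands in for the shortest path search, not for how the search unit is implemented.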

In the commonly used speech recognition devices based on HMMs and a language model (an N-gram language model or a network grammar), each transition cost function π is expressed in the following form.

Here ω(s, s') is a weight parameter for the state transition s→s'; it is a state transition cost designed independently of the input sequence X⁻. ω(s, s') may be given by hand, or, in WFST-based speech recognition, the cost of each state transition obtained by composing a plurality of WFSTs in advance may be used. I[s, s'] is the index of the output distribution associated with the state transition s→s' and is obtained by, for example, an HMM training device. The output distribution P(x⁻_τ | I[s, s']) is modeled for each I[s, s'] using, for example, a Gaussian mixture distribution. The state transition costs ω(s, s') and log P(x⁻_τ | I[s, s']) that make up the transition cost function π are stored in the finite state model storage unit 912 and used by the shortest path search unit 911 described above.
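Equation (3) itself is not reproduced in this text. The sketch below shows one plausible reading of the quantities just described: the input-independent cost ω(s, s') combined with the frame log-likelihoods under the output distribution indexed by I[s, s']. The sign convention (log-likelihoods subtracted so that likely frames lower the cost) and the frame range covered by a transition are assumptions, and all function and parameter names are hypothetical.

```python
def baseline_transition_cost(frames, s, s_next, t, t_next, omega, dist_index, log_prob):
    """Assumed form: omega(s, s') minus the summed frame log-likelihoods.

    frames      -- list of feature vectors; frame tau is frames[tau - 1]
    omega       -- dict mapping (s, s') to the input-independent cost
    dist_index  -- dict mapping (s, s') to the output-distribution index I[s, s']
    log_prob    -- callable (frame, index) -> log P(frame | index)
    """
    index = dist_index[(s, s_next)]
    # Assumption: the transition accounts for frames t .. t_next - 1 (1-based).
    acoustic = sum(log_prob(frames[tau - 1], index) for tau in range(t, t_next))
    return omega[(s, s_next)] - acoustic
```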

However, with a definition of the transition cost function π such as Equation (3), the relationship between state transitions and the input sequence is expressed only through the output distributions obtained by training the hidden Markov models. That is, in the conventional speech recognition device 91, the part P(x⁻_τ | I[s, s']) that involves the feature vectors x⁻_τ in the input sequence X⁻ and the remaining part ω(s, s') are trained separately, and the relationship between every possible state transition and the input frames has not been defined explicitly. In Patent Document 1, the transition cost function π is extended as follows using an adjustment parameter vector α⁻_{h(s,s')} for each state transition s→s' and a feature vector φ⁻(X⁻, t, t') that includes the input sequence X⁻, which yields a richer representation and allows it to be optimized jointly.

Here "^T" denotes transposition. FIG. 2 shows a functional block diagram of the speech recognition device 92 expressed by Equation (4). h(s, s') denotes the state transition s→s'. With this representation, if h(s, s') is designed appropriately, a different adjustment parameter vector α⁻_{h(s,s')} is introduced for every state transition s→s', and the inner product of that adjustment parameter vector α⁻_{h(s,s')} and the feature vector φ⁻(X⁻, t, t') is reflected in the transition cost function π. Patent Document 1 gives the following form as an example of the feature vector φ⁻(X⁻, t, t').
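Equation (4) is likewise missing from this text. As a rough sketch of the idea described above, the extended cost can be pictured as the conventional cost plus the inner product of α⁻_{h(s,s')} and φ⁻(X⁻, t, t'). Whether the inner product is added or subtracted, and the concrete contents of the feature vector, are assumptions; the names below are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def extended_transition_cost(baseline_cost, alpha, h, phi, s, s_next):
    """Assumed form: alpha_{h(s, s')}^T phi added to the conventional cost.

    baseline_cost -- conventional cost of the transition
    alpha         -- dict mapping an index h(s, s') to its adjustment parameter vector
    h             -- callable (s, s') -> index
    phi           -- feature vector phi(X, t, t') already computed for this segment
    """
    return dot(alpha[h(s, s_next)], phi) + baseline_cost
```

With every adjustment parameter vector set to zero, this reduces to the conventional cost, so under this reading the adjustment parameters act as a learned correction on top of the separately trained modules.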

By using this representation and training the adjustment parameter vectors α⁻_{h(s,s')} appropriately, a cost learned with the whole system taken into account can be added to a conventional speech recognition device. In Patent Document 1, the Perceptron method is used to determine the adjustment parameter vectors α⁻_{h(s,s')}.

The shortest path search unit 921 receives the input sequence X⁻, retrieves the adjustment parameter vectors α⁻_{h(s,s')} stored in the finite state model adjustment parameter storage unit 923 and the state transition costs ω(s, s') and log P(x⁻_τ | I[s, s']) stored in the finite state model storage unit 912, obtains the optimal state sequence s^~ based on Equations (1), (2), and (4), and outputs it to the output symbol extraction unit 8.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2011-164336 (JP 2011-164336 A)

Non-Patent Document 1: Jen-Tzung Chien, Chuang-Hua Chueh, "Joint acoustic and language modeling for speech recognition", Speech Communication, March 2010, Volume 52, Issue 3, pp. 223-235

In general, because it is difficult to eliminate speech recognition errors completely, the accuracy of a speech recognition device is measured by an index such as the word error rate. However, conventional methods such as the Perceptron method of Patent Document 1 and the maximum mutual information (hereinafter also "MMI") method (see Reference 1) train the model by treating every word string other than the correct word string as incorrect.
(Reference 1) S. Kapadia, V. Valtchev, S.J. Young, "MMI training for continuous phoneme recognition on the TIMIT database", Proc. ICASSP, 1993, Vol. 2, pp. 491-494

Empirically, it is said to be better to use fine-grained errors when measuring error. A method that trains by treating every word string other than the correct word string as incorrect uses a coarse unit of error, so the accuracy of its parameter estimation cannot be called sufficient.

An object of the present invention is to provide a parameter estimation technique that enables more robust parameter estimation by computing errors in smaller units based on a fine-grained error criterion and by taking into account in detail the "degree of incorrectness" among incorrect answers.

To solve the above problem, according to a first aspect of the present invention, a parameter estimation device includes: a recording unit that stores a probabilistic finite state model containing the parameters used in speech recognition, training data, correct state transition sequences, which are the sequences of state transitions corresponding to the correct speech recognition results for the training data, and adjustment parameters, which are parameters for adjusting the probabilistic finite state model; a recognition unit that generates recognition state transition sequences, which are the sequences of state transitions corresponding to the speech recognition results obtained by performing speech recognition on the training data using the probabilistic finite state model; a fine-grained error measure calculation unit that calculates an error measure based on the difference between the correct state transition sequence and the recognition state transition sequences; and a parameter estimation unit that corrects the adjustment parameters according to the error measure.

To solve the above problem, according to a second aspect of the present invention, a parameter estimation method includes: a recognition step of generating recognition state transition sequences, which are the sequences of state transitions corresponding to the speech recognition results obtained by performing speech recognition on training data using a probabilistic finite state model containing the parameters used in speech recognition; a fine-grained error measure calculation step of calculating an error measure based on the difference between the recognition state transition sequences and a correct state transition sequence, which is the sequence of state transitions corresponding to the correct speech recognition result for the training data; and a parameter estimation step of correcting, according to the error measure, adjustment parameters, which are parameters for adjusting the probabilistic finite state model.

The present invention has the effect of improving the accuracy of parameter estimation.

Functional block diagram of a conventional speech recognition device.
Functional block diagram of the speech recognition device of Patent Document 1.
Functional block diagram of the parameter estimation device according to the first embodiment.
Diagram showing the processing flow of the parameter estimation device according to the first embodiment.
FIG. 5A shows the correct lattice and FIG. 5B shows the recognition lattice.
Diagram for explaining the processing flow for generating the array c_τ.
Diagram for explaining the array c_τ.
Functional block diagram of the parameter estimation unit.
Diagram showing the processing flow of the parameter estimation unit.
Diagram showing speech recognition simulation results.

First, the key points of the present invention are described.

[Points of the invention]
To obtain a training method more accurate than the Perceptron method used in Patent Document 1, the present invention introduces a fine-grained error criterion. Fine-grained error criteria have so far been used when training the hidden Markov models (log P(x⁻_τ | I[s, s']) in Equations (3) and (4)) and the state transition costs ω(s, s') (see References 2 to 4).
(Reference 2) D. Povey, P.C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training", Proc. ICASSP, 2002, Vol. 1, pp. 105-108
(Reference 3) D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, K. Visweswariah, "Boosted MMI for model and feature-space discriminative training", Proc. ICASSP, 2008, pp. 4057-4060
(Reference 4) E. McDermott, S. Watanabe, A. Nakamura, "Discriminative training based on an integrated view of MPE and MMI in margin and error space", Proc. ICASSP, 2010, pp. 4894-4897

The minimum phone error (hereinafter also "MPE") method (see Reference 2), the boosted MMI (hereinafter also "bMMI") method (see Reference 3), and the differenced MMI (hereinafter also "dMMI") method (see Reference 4), all based on fine-grained error criteria, enable more robust parameter estimation by taking into account in detail the "degree of incorrectness" among incorrect answers. Empirically, it is said to be better to use fine-grained errors when measuring error, and from this viewpoint training is often carried out so as to minimize phoneme errors.

(1) To further improve the accuracy of a speech recognition device composed from separately trained modules, the adjustment parameters must be trained after composition, as in Patent Document 1. (2) Existing techniques used as their training criterion whether the recognition result matches the given correct answer exactly, whereas in actual speech recognition what matters is reducing the word error rate. (3) Furthermore, related methods for training hidden Markov models (see References 3 and 4) have used phoneme errors as a fine-grained error measure, but the finest error in a speech recognition device based on a finite state model is a state transition error, and using this finer error measure is considered important.

The present invention therefore trains and estimates the adjustment parameters after composition and, in doing so, applies a fine-grained error criterion of the kind typified by the MPE method to the training of the parameter vectors defined for each state transition of the speech recognition device. For example, parameter training is performed with a fine-grained error criterion based on how many times a state transition is mistaken.
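The difference between a sequence-level criterion and a fine-grained one can be seen in a toy comparison. The sketch below is not the error measure defined by the invention; it only assumes, for illustration, that each path is expanded into one state transition label per frame and that mismatching frames are counted. All names are hypothetical.

```python
def frame_labels(states, times):
    """Expand a segmented path into one (s, s') label per frame.

    Assumption: transition m covers frames times[m] .. times[m + 1] - 1.
    """
    labels = []
    for m in range(len(states) - 1):
        span = times[m + 1] - times[m]
        labels.extend([(states[m], states[m + 1])] * span)
    return labels


def sequence_level_error(ref_labels, hyp_labels):
    """Coarse criterion: 0 if the paths are identical, otherwise 1."""
    return 0 if ref_labels == hyp_labels else 1


def transition_level_error(ref_labels, hyp_labels):
    """Fine-grained criterion: number of frames whose transition label differs."""
    return sum(r != h for r, h in zip(ref_labels, hyp_labels))
```

For a reference path and two wrong hypotheses, the coarse criterion gives 1 for both, while the frame count separates a hypothesis that is off by one boundary frame from one that takes a different state throughout.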

Embodiments of the present invention are described below. Unless otherwise noted, processing described for an individual element of a vector or matrix is applied to all elements of that vector or matrix.

<First embodiment>
FIG. 3 shows a functional block diagram of the parameter estimation device 100 according to the first embodiment, and FIG. 4 shows its processing flow. The parameter estimation device 100 includes a finite state model storage unit 101, a training data storage unit 103, a recognition unit 105, a fine-grained error measure calculation unit 107, a correct path storage unit 109, a parameter estimation unit 111, and a finite state model adjustment parameter storage unit 113.

The parameter estimation device 100 optimizes the adjustment parameter vectors α⁻_{h(s,s')} using the training input sequences X⁻^{(1)}, X⁻^{(2)}, ..., X⁻^{(N)}. Various ways of realizing the present invention are conceivable; this embodiment describes a method that uses a lattice data structure and a training criterion derived from bMMI or dMMI. The speech recognition device 12 that performs speech recognition using the adjustment parameter vectors α⁻_{h(s,s')} trained by the parameter estimation device 100 has the same functional configuration as that shown in FIG. 2. However, the way in which the adjustment parameter vectors α⁻_{h(s,s')} stored in the finite state model adjustment parameter storage unit 123 are generated differs from Patent Document 1.

The finite state model storage unit 101 stores a probabilistic finite state model containing the state transition costs ω(s, s') and log P(x⁻_τ | I[s, s']) used in speech recognition. It is assumed that training of the state transition costs ω(s, s') and log P(x⁻_τ | I[s, s']) has been completed before the adjustment parameter vectors α⁻_{h(s,s')} are estimated. For example, the finite state model construction unit 65 retrieves the HMMs and the language model obtained with a conventional training method for speech recognition devices from the HMM storage unit 61 and the language model storage unit 63, calculates the state transition costs ω(s, s') and log P(x⁻_τ | I[s, s']), and stores them in the finite state model storage unit 101.

The training data storage unit 103 stores the input sequences X⁻^{(1)}, X⁻^{(2)}, ..., X⁻^{(N)} that constitute the training data.

The correct path storage unit 109 stores the state transition sequences s^^{(1)}, s^^{(2)}, ..., s^^{(N)} (hereinafter "correct state transition sequences") that correspond to the correct speech recognition results for the training input sequences X⁻^{(1)}, X⁻^{(2)}, ..., X⁻^{(N)}, together with the segment times t^^{(1)}, t^^{(2)}, ..., t^^{(N)}. Here "s^" and "t^" denote s and t with a hat, followed by the superscript. The correct state transition sequences and segment times can be obtained easily with an existing speech recognition device by manually supplying, for each input sequence, the word string that is its correct speech recognition result.

<Recognition unit 105>
The recognition unit 105 retrieves the state transition costs ω(s, s') and log P(x⁻_τ | I[s, s']) of the probabilistic finite state model from the finite state model storage unit 101 and uses these values to generate the sequences of state transitions (hereinafter "recognition state transition sequences") corresponding to the speech recognition results obtained by performing speech recognition on the training input sequences X⁻^{(1)}, X⁻^{(2)}, ..., X⁻^{(N)} (s1). It outputs them to the fine-grained error measure calculation unit 107 and the parameter estimation unit 111.

For one input sequence X⁻^{(n)} (where n = 1, 2, ..., N), every conceivable state transition sequence may be generated as a recognition state transition sequence. Alternatively, the sequence cost function Π may be computed from Equations (1) and (3), and the state transition sequences with the R smallest values of Π may be generated as the recognition state transition sequences, where R is an integer of 1 or more. Consequently, the recognition state transition sequences may contain only easily confused state transition sequences and may not contain the correct state transition sequence.
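Selecting the R lowest-cost sequences can be sketched as follows, assuming the candidate sequences and their costs Π have already been computed. The function name and the (cost, path) pair layout are hypothetical; `heapq.nsmallest` is the standard-library call used for the selection.

```python
import heapq


def top_r_paths(scored_paths, r):
    """Return the r lowest-cost (cost, path) pairs, cheapest first.

    scored_paths -- iterable of (cost, path) pairs
    r            -- number of paths to keep; must be an integer of 1 or more
    """
    if r < 1:
        raise ValueError("r must be 1 or more")
    return heapq.nsmallest(r, scored_paths, key=lambda pair: pair[0])
```

If the correct path happens to have a higher cost than the R kept paths, it is simply absent from the result, which is the situation the paragraph above points out.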

Because this embodiment performs training that minimizes state transition errors, two lattices are prepared beforehand: a lattice corresponding to the correct state transition sequence and its segment information, and a lattice recording the recognition state transition sequences obtained by actually performing speech recognition on the training input sequences X⁻^{(1)}, X⁻^{(2)}, ..., X⁻^{(N)}. A lattice is a graph representation of the connections between states (hereinafter also "arcs").

The lattice corresponding to the correct state transition sequence is called the correct lattice and is denoted L^^{(n)} (L with a hat); the lattice corresponding to the recognition state transition sequences is called the recognition lattice and is denoted L^{(n)}. FIG. 5A shows an example of a correct lattice and FIG. 5B an example of a recognition lattice. The correct lattice can be generated uniquely from s^^{(n)} and t^^{(n)}; the lattice in FIG. 5A can be regarded as generated by enumerating s^^{(n)} = {1, 13, 9, 14, 15, 1} and t^^{(n)} = {1, 3, 13, 17, 20, 25}. The correct lattice L^^{(n)} is assumed to have been computed in advance using the correct word string and a speech recognition device.
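As a small illustration of how a correct lattice follows from the state sequence and the segment times, the sketch below pairs consecutive entries into arcs. The `Arc` type and the function name are hypothetical; the numbers are the ones given for FIG. 5A.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    s: int        # state before the transition
    s_next: int   # state after the transition
    t: int        # time before the transition
    t_next: int   # time after the transition


def correct_lattice(states: List[int], times: List[int]) -> List[Arc]:
    """Build the single arc sequence of a correct lattice.

    Arc m joins states[m] -> states[m + 1] over times[m] -> times[m + 1].
    """
    if len(states) != len(times):
        raise ValueError("states and times must have the same length")
    return [
        Arc(states[m], states[m + 1], times[m], times[m + 1])
        for m in range(len(states) - 1)
    ]


arcs = correct_lattice([1, 13, 9, 14, 15, 1], [1, 3, 13, 17, 20, 25])
```

Six states and six segment times give five arcs; the second arc joins state 13 to state 9 between times 3 and 13.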

In this embodiment, the recognition unit 105 generates and outputs this recognition lattice as the recognition state transition sequences.

Mathematically, each lattice can be regarded as a set of arc sequences, and the following description adopts this view. Under this view, a lattice variable L is the set obtained by enumerating, for every possible case, the sequence of arcs that must be traversed from the start to the end. For example, the correct lattice in FIG. 5A is the set containing the single arc sequence {a^_1, a^_2, a^_3, a^_4, a^_5}, and the recognition lattice in FIG. 5B is the set of the following six arc sequences e_1 to e_6.
e_1 = {a_1, a_2, a_3, a_4, a_5}
e_2 = {a_1, a_7, a_10, a_11, a_5}
e_3 = {a_1, a_7, a_10, a_12, a_13}
e_4 = {a_6, a_9, a_3, a_4, a_5}
e_5 = {a_6, a_8, a_10, a_11, a_5}
e_6 = {a_6, a_8, a_10, a_12, a_13}
In FIG. 5B every arc sequence contains the same number of arcs, but the number of arcs may differ from one arc sequence to another. Each arc a_i is associated with a pre-transition state, a post-transition state, a pre-transition time, and a post-transition time, denoted s(a_i), s'(a_i), t(a_i), and t'(a_i), respectively. For example, arc a_2 in FIG. 5B is associated with s(a_2) = 13, s'(a_2) = 9, t(a_2) = 3, and t'(a_2) = 13. The lattice may additionally carry information such as the output symbols o(s, s') corresponding to the recognition result and the costs computed during recognition. The state transition from state s to state s' is written (s(a_i), s'(a_i)).

[Principle of parameter estimation]
Before describing the processing of the parameter estimation unit 111 and the fine-grained error measure calculation unit 107, the principle of parameter estimation is explained. This embodiment targets the adjustment parameter vector α⁻_{h(s, s')} in Equation (4), and the set of these adjustment parameter vectors is written A := {α⁻_i | ∀i}. Here i is the index given by h(s, s'), and the d-th dimension of α⁻_i is written α_{i,d}. If h(s, s') is designed to take a different natural number for every pair s, s', every state transition can be handled. The design of h(s, s') may also be changed to save memory and computation.

(1) MMI method
Before introducing the fine-grained error measure, consider applying the existing MMI method (see Reference 1) to learning the set A. Formulating the MMI method proposed for acoustic model training in the same way over state sequences yields a method that obtains the set A as the solution of the following optimization problem.
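The equation image is not reproduced in this text. As a hedged reconstruction only, a form consistent with the surrounding description of Equation (6), where Π is the sequence cost function and the set A is chosen to maximize the objective, would be:

```latex
\mathcal{F}_{\mathrm{MMI}}(A) = \sum_{n} \log
\frac{\exp\!\left(-\Pi(\hat{s}^{(n)}, \hat{t}^{(n)}, \bar{X}^{(n)})\right)}
     {\sum_{\bar{s},\bar{t}} \exp\!\left(-\Pi(\bar{s}, \bar{t}, \bar{X}^{(n)})\right)}
```

The symbol name F_MMI and the exact argument list of Π are assumptions; the patent's own typeset equation may differ in detail.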

Equation (6) assumes that the probability of a state transition s⁻ occurring according to the given segment times t⁻ is proportional to exp(−Π(s^⁽ⁿ⁾, t^⁽ⁿ⁾, X⁻⁽ⁿ⁾)), and can be described as learning that maximizes the posterior probability of the correct state transition given the observed input sequence X⁻. Running this optimization adjusts the set A so that the correct answer becomes more likely than incorrect ones. However, this objective function treats all state transition patterns other than the correct one equally and does not take into account how far each pattern is from the correct answer.

The summation Σ_{s⁻,t⁻} in the denominator runs over all possible state transitions and all possible segment times, and computing it is generally said to require a large amount of computation. The objective function above is therefore approximated, for example, using the recognition lattice as follows.

Here the summation Σ_{a⁻∈L⁽ⁿ⁾} runs over all arc sequences that can be taken in the recognition lattice L⁽ⁿ⁾ (or over the top R arc sequences with the smallest sequence cost function Π), and the summation Σ_j runs over the arcs a_j contained in the arc sequence a⁻, where a_j denotes the j-th arc of a⁻. The MMI numerator is likewise expressed using the correct lattice.

Here the summation Σ_j runs over the arcs a^_j contained in the arc sequence of the correct lattice, where a^_j denotes the j-th arc of that sequence. Because an exact lattice is generally available for the correct state transition s^, the numerator term is not an approximation.

As shown in FIG. 5, each arc a_j (j = 1, 2, …, J) records the pre-transition state s(a_j), the post-transition state s'(a_j), the pre-transition time t(a_j), and the post-transition time t'(a_j), so that the denominator can be approximated efficiently by following the arc transitions. Such a recognition lattice can be obtained with a conventional speech recognition device. Because the objective function and its derivative are both continuous, the optimization can be carried out with the steepest gradient method. By approximating the objective function with the recognition lattice, the parameter estimation unit 111 reduces the amount of computation and optimizes the adjustment parameter vectors quickly.
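The lattice approximation described above can be sketched in Python for a single utterance. The sketch assumes that the sequence cost of an arc sequence is the sum of per-arc costs, that the numerator uses the single correct arc sequence, and that the denominator sums over the arc sequences of the recognition lattice. The function names, the arc labels `ah1` to `ah5`, and the unit costs are hypothetical; only the six arc sequences come from the text.

```python
import math
from typing import Dict, List, Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_cost(path: Sequence[str], arc_cost: Dict[str, float]) -> float:
    """Sequence cost taken as the sum of per-arc transition costs."""
    return sum(arc_cost[a] for a in path)


def lattice_mmi(correct_path: Sequence[str], correct_cost: Dict[str, float],
                recognition_paths: List[List[str]],
                recognition_cost: Dict[str, float]) -> float:
    """Log posterior of the correct path, with the denominator
    approximated by the arc sequences of the recognition lattice."""
    numerator = -path_cost(correct_path, correct_cost)
    denominator = log_sum_exp(
        [-path_cost(p, recognition_cost) for p in recognition_paths]
    )
    return numerator - denominator


# The six arc sequences listed for FIG. 5B; the unit costs are made up.
recognition_paths = [
    ["a1", "a2", "a3", "a4", "a5"],
    ["a1", "a7", "a10", "a11", "a5"],
    ["a1", "a7", "a10", "a12", "a13"],
    ["a6", "a9", "a3", "a4", "a5"],
    ["a6", "a8", "a10", "a11", "a5"],
    ["a6", "a8", "a10", "a12", "a13"],
]
recognition_cost = {f"a{i}": 1.0 for i in range(1, 14)}
correct_path = ["ah1", "ah2", "ah3", "ah4", "ah5"]
correct_cost = {name: 1.0 for name in correct_path}

objective = lattice_mmi(correct_path, correct_cost,
                        recognition_paths, recognition_cost)
```

With identical unit costs all six recognition paths are equally likely, so the objective equals −log 6. Enumerating paths explicitly is only practical for toy lattices; a real implementation would use a forward-backward pass over the arcs.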

(2) bMMI method
The bMMI method (see Reference 3) is introduced in order to bring in the fine-grained error measure. Rather than simply maximizing the posterior probability of the correct sequence, the bMMI method first modifies the probability distribution so that sequences with a large error measure become more likely, and then attempts to maximize the posterior probability of the correct sequence. This modification causes the parameters to be adjusted so that sequences with a large error measure become less likely to appear.

Specifically, the following modified objective function is used, in which a sequence is treated as having a smaller sequence cost the larger its error measure E(s⁻, t⁻; s^⁽ⁿ⁾, t^⁽ⁿ⁾) is.
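The modified objective is not reproduced in this text. A hedged reconstruction that matches the description (each competing sequence has its cost reduced in proportion to its error measure, scaled by σ) would be:

```latex
\mathcal{F}_{\mathrm{bMMI}}(A;\sigma) = \sum_{n} \log
\frac{\exp\!\left(-\Pi(\hat{s}^{(n)}, \hat{t}^{(n)}, \bar{X}^{(n)})\right)}
     {\sum_{\bar{s},\bar{t}} \exp\!\left(-\Pi(\bar{s}, \bar{t}, \bar{X}^{(n)})
       + \sigma\, E(\bar{s}, \bar{t}; \hat{s}^{(n)}, \hat{t}^{(n)})\right)}
```

The symbol name F_bMMI and the placement of σ are assumptions based on the usual boosted-MMI formulation; the patent's own equation may be written differently.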

Here σ is an adjustable parameter, generally tuned on a tuning data set. Using this objective function is known to yield parameters that are less prone to errors with a large error measure E(s⁻, t⁻; s^⁽ⁿ⁾, t^⁽ⁿ⁾) (see Reference 3).

When state transition errors are used as the error measure E, the number of state transition errors can still be obtained for each arc of each recognition lattice, as the state transition error measure E^~, even after the lattice approximation has been applied.

The bMMI numerator can also be expressed with a lattice, in the same way as the MMI numerator.

The fine-grained error measure is needed to express, at a finer level than the prior art, how far the entire output symbol sequence is from the correct answer. Concretely, one can use the edit distance between the actual output symbols and the correct output symbols, or decompose the output symbols into phonemes and use the edit distance between the actual phoneme sequence and the correct phoneme sequence. In either case, the measure indicates how much the actual behavior pattern differs from the behavior pattern that would result if speech recognition operated exactly as in the correct answer. It has been found empirically that using as fine-grained an error measure as possible, for example phoneme errors rather than output symbol errors, is effective (see References 2 to 4). As an example, this embodiment considers state transitions, the finest-grained operation in speech recognition based on a finite state transition model, and expresses how much the actual state transition pattern differs from the correct one by counting the number of differing state transitions.

Because this embodiment uses a lattice representation, it needs a per-arc state transition error measure (E^~(a_j) in the expression above) indicating how many state transition errors occurred in each arc of the recognition lattice. The per-arc state transition error measure E^~(a_j) is computed using the array representation c⁽ⁿ⁾ of the correct lattice. Each element c⁽ⁿ⁾_τ of the array representation c⁽ⁿ⁾ represents the pair of pre-transition and post-transition states at each time, and can be obtained with the algorithm shown in FIG. 6.

This algorithm stores, in the τ-th element c⁽ⁿ⁾_τ of an array c⁽ⁿ⁾ that has as many elements as the input sequence has frames, the state transition (s(a^_j), s'(a^_j)) that occurred when that frame (the τ-th frame) was processed (s107-3). This storing is performed for every frame in which the state transition (s(a^_j), s'(a^_j)) is taking place (s107-2, s107-4, s107-5). Steps s107-2 to s107-5 are in turn performed for every state transition (s(a^_j), s'(a^_j)) (s107-1, s107-6, s107-7). This yields a simple representation relating each frame number to the corresponding correct state transition. FIG. 7 shows the array representation c⁽ⁿ⁾ obtained by applying this processing to the correct lattice of FIG. 5A.
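The two nested loops described above can be sketched in Python as follows. The sketch assumes that an arc covers the frames from t(a) to t'(a) − 1 and that frames are numbered from 1, which is why a dictionary keyed by frame number is used instead of a zero-based list. The function name is hypothetical, and the comments map each line to the step labels quoted in the text.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, int, int]  # (s, s', t, t')
Transition = Tuple[int, int]     # (s, s')


def build_reference_array(correct_arcs: List[Arc]) -> Dict[int, Transition]:
    """Map every frame number to the correct state transition active in it."""
    c: Dict[int, Transition] = {}
    for s, s_next, t, t_next in correct_arcs:  # loop over transitions (s107-1, -6, -7)
        for tau in range(t, t_next):           # loop over frames (s107-2, -4, -5)
            c[tau] = (s, s_next)               # store the transition (s107-3)
    return c


# Arcs of the correct sequence quoted for FIG. 5A:
# states {1, 13, 9, 14, 15, 1}, times {1, 3, 13, 17, 20, 25}.
correct_arcs = [
    (1, 13, 1, 3), (13, 9, 3, 13), (9, 14, 13, 17),
    (14, 15, 17, 20), (15, 1, 20, 25),
]
c = build_reference_array(correct_arcs)
```

Under these assumptions the array has 24 entries, one per frame from 1 to 24.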

Using this array representation, the per-arc state transition error measure E^~(a_j) can be expressed as follows.
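The expression itself is missing from this text. Based on the explanation of Equation (11) in the surrounding paragraphs (a sum over the frames of the arc, using the Kronecker delta to detect frames whose transition differs from the reference array), a hedged reconstruction would be:

```latex
\tilde{E}(a_j) = \sum_{\tau = t(a_j)}^{t'(a_j) - 1}
\left\{ 1 - \delta\!\left( \bigl(s(a_j),\, s'(a_j)\bigr),\; c^{(n)}_{\tau} \right) \right\}
```

This form is an assumption inferred from the description, not a copy of the patent's typeset equation.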

Here δ(a, b) is the function known as the Kronecker delta, which takes the value 1 if a = b and 0 otherwise.

This formula computes, for each arc a_j, the number of frames in which the state transition (s(a_j), s'(a_j)) of that arc differs from the correct state transition obtained above. Specifically, over the frames from the arc's start time t(a_j) to its end time t'(a_j) − 1, it uses the delta function and a summation to count the frames in which the state transition (s(a_j), s'(a_j)) represented by the arc differs from the array representation c⁽ⁿ⁾ of the correct state transitions. The result can therefore be described either as a count of differing state transitions or as the duration of differing state transitions, that is, (number of differing state transitions) × (duration of one frame).
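The frame-counting rule described above can be illustrated with a short Python sketch. It assumes the reference array is a dictionary keyed by frame number, built from the FIG. 5A state and time sequences; `arc_error` and the second test arc are hypothetical.

```python
from typing import Dict, Tuple

Arc = Tuple[int, int, int, int]  # (s, s', t, t')
Transition = Tuple[int, int]     # (s, s')


def arc_error(arc: Arc, c: Dict[int, Transition]) -> int:
    """Count the frames of `arc` whose transition differs from the reference."""
    s, s_next, t, t_next = arc
    return sum(1 for tau in range(t, t_next) if c.get(tau) != (s, s_next))


# Reference array for the FIG. 5A sequence, keyed by frame number.
correct_arcs = [
    (1, 13, 1, 3), (13, 9, 3, 13), (9, 14, 13, 17),
    (14, 15, 17, 20), (15, 1, 20, 25),
]
c = {tau: (s, s2) for s, s2, t, t2 in correct_arcs for tau in range(t, t2)}

matching = arc_error((13, 9, 3, 13), c)  # arc a2 of FIG. 5B, as quoted in the text
shifted = arc_error((1, 13, 1, 5), c)    # hypothetical arc that ends two frames late
```

Arc a₂ agrees with the reference in every frame, so its error is 0. The hypothetical arc agrees in frames 1 and 2 but not in frames 3 and 4, so its error is 2.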

(3) dMMI method
To reduce the error measure directly, the dMMI method (see Reference 4) attempts to maximize the following objective function.
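The objective function is not reproduced in this text. A hedged reconstruction consistent with the description of Equation (12) that follows (a probability expressed as a fraction whose denominator sums over s'⁻, t'⁻, multiplied by the negative error measure) would be:

```latex
\mathcal{F}(A) = \sum_{n} \sum_{\bar{s},\bar{t}}
\frac{\exp\!\left(-\Pi(\bar{s}, \bar{t}, \bar{X}^{(n)})\right)}
     {\sum_{\bar{s}',\bar{t}'} \exp\!\left(-\Pi(\bar{s}', \bar{t}', \bar{X}^{(n)})\right)}
\left( -E(\bar{s}, \bar{t}; \hat{s}^{(n)}, \hat{t}^{(n)}) \right)
```

The symbol F and the exact layout are assumptions; only the structure (expected negative error under P(s⁻, t⁻)) is taken from the text.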

Here the summation Σ_{s'⁻,t'⁻} in the denominator runs over all possible state transitions and all possible segment times. The fractional part of Equation (12) represents the probability P(s⁻, t⁻) of s⁻, t⁻ occurring, among all possible state transitions and all possible segment times. This objective function is the expectation, with respect to the probability P(s⁻, t⁻), of the negative of the error measure E(s⁻, t⁻; s^⁽ⁿ⁾, t^⁽ⁿ⁾), so maximizing it corresponds to adjusting the parameters so as to reduce the error measure directly. According to Reference 4, the following form, which uses the bMMI objective function, is known to be usable as an effective approximation of this objective function.

Here σ₁ and σ₂ are adjustable parameters with σ₁ ≠ σ₂. In principle, the optimization above can also be solved directly, without the approximation.
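The relationship between the expected-error objective and a bMMI-based approximation can be checked numerically on a toy example. The sketch below assumes the approximation takes the differenced form (F_bMMI(σ₂) − F_bMMI(σ₁)) / (σ₂ − σ₁), which is how the differenced MMI objective is commonly written; the patent's own equation is not reproduced here, so this form, the function names, and the toy costs and errors are all assumptions.

```python
import math
from typing import Sequence


def log_z(costs: Sequence[float], errors: Sequence[float], sigma: float) -> float:
    """log of the boosted denominator: sum over paths of exp(-cost + sigma * error)."""
    terms = [-c + sigma * e for c, e in zip(costs, errors)]
    m = max(terms)
    return m + math.log(sum(math.exp(x - m) for x in terms))


def differenced_objective(costs: Sequence[float], errors: Sequence[float],
                          sigma1: float, sigma2: float) -> float:
    """(F_bMMI(sigma2) - F_bMMI(sigma1)) / (sigma2 - sigma1).
    The numerator terms of the two bMMI objectives cancel, leaving
    only the two boosted denominators."""
    return (log_z(costs, errors, sigma1) - log_z(costs, errors, sigma2)) / (sigma2 - sigma1)


def negative_expected_error(costs: Sequence[float], errors: Sequence[float]) -> float:
    """Expectation of -E under P(path) proportional to exp(-cost)."""
    lz = log_z(costs, errors, 0.0)
    return -sum(math.exp(-c - lz) * e for c, e in zip(costs, errors))


# Toy path costs and error counts (made up for illustration).
costs = [1.0, 2.0, 3.5]
errors = [0.0, 4.0, 10.0]

approx = differenced_objective(costs, errors, -1e-4, 1e-4)
exact = negative_expected_error(costs, errors)
```

For σ₁ and σ₂ close to zero on either side, the differenced value approaches the negative expected error, which is the sense in which the bMMI-based form approximates the expected-error objective under the stated assumption.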

<Fine-grained error measure calculation unit 107>
The fine-grained error measure calculation unit 107 calculates an error measure based on the difference between the correct state transition sequence and the recognized state transition sequence (s3 in FIG. 4) and outputs it to the parameter estimation unit 111. In this embodiment, it counts the number of differing state transitions between the correct and recognized state transition sequences and uses that count (or the duration corresponding to that count) as the error measure. For example, the fine-grained error measure calculation unit 107 retrieves the correct state transition sequence s^⁽ⁿ⁾ and the corresponding segment times t^⁽ⁿ⁾ from the correct path storage unit 109 and generates the correct lattice L^⁽ⁿ⁾. It also receives the recognition lattice L⁽ⁿ⁾ from the recognition unit 105. It generates the array c⁽ⁿ⁾ from the correct lattice L^⁽ⁿ⁾, calculates the error measure E^~ by Equation (11), and outputs it to the parameter estimation unit 111. The generated correct lattice is output to the feature vector generation unit 111c.

<Parameter estimation unit 111>
FIG. 8 is a functional block diagram of the parameter estimation unit 111, and FIG. 9 shows its processing flow.

The parameter estimation unit 111 corrects the adjustment parameter vector α⁻_{h(s, s')} according to the error measure E^~ (s5 in FIG. 4).

The parameter estimation unit 111 includes an adjustment parameter initialization unit 111a, a gradient vector initialization unit 111b, a feature vector generation unit 111c, an arc weight calculation unit 111d, a partial differential coefficient update unit 111e, an adjustment parameter update unit 111f, and a convergence determination unit 111g.
(Adjustment parameter initialization unit 111a)
The adjustment parameter initialization unit 111a initializes each element α_{i,d} (the d-th element of the adjustment parameter vector α⁻_i) of every adjustment parameter vector α⁻_i in the set A := {α⁻_i | ∀i} (s111-1), and stores the set A, with all elements α_{i,d} of all adjustment parameter vectors α⁻_i initialized, in the finite state model adjustment parameter storage unit 113. In this embodiment, initialization simply assigns 0. Other options include initialization based on an equivalent transformation from a Gaussian distribution and initialization based on statistics of the data set.

(Gradient vector initialization unit 111b)
The gradient vector initialization unit 111b initializes the partial differential coefficient Δ_{i,d} corresponding to the d-th element α_{i,d} of the adjustment parameter vector α⁻_i (s111-2) and outputs it to the partial differential coefficient update unit 111e. Because the gradients computed from the individual data items are summed, all elements of the gradient vector are first initialized to 0.

(Feature vector generation unit 111c)
The feature vector generation unit 111c receives the correct lattice L^⁽ⁿ⁾ from the fine-grained error measure calculation unit 107. It also retrieves the learning data X⁻⁽ⁿ⁾ from the learning data storage unit 103, computes the feature vector corresponding to each arc of the correct lattice L^⁽ⁿ⁾ and the recognition lattice L⁽ⁿ⁾ (s111-4), and outputs the feature vectors to the partial differential coefficient update unit 111e. φ^_{n,j,d} is the d-th element of the feature vector for the j-th arc of the n-th correct lattice, and φ⁻_{n,j,d} is the d-th element of the feature vector for the j-th arc of the n-th recognition lattice. The computation depends on the kind of feature vector used; for example, with the feature vector of Equation (5), the d-th element of the vector φ⁻(X⁻⁽ⁿ⁾, t(a_j), t'(a_j)), obtained by substituting the arc's time information into Equation (5), can be used.

(Arc weight calculation unit 111d)
The arc weight calculation unit 111d retrieves the state transition cost ω(s, s') and log P(x⁻_τ | I[s, s']) from the finite state model storage unit 101, retrieves the learning data X⁻⁽ⁿ⁾ from the learning data storage unit 103, receives the recognition lattice L⁽ⁿ⁾ from the recognition unit 105, retrieves the adjustment parameter vectors α⁻_i from the finite state model adjustment parameter storage unit 113, and receives the error measure E^~ from the fine-grained error measure calculation unit 107. It then calculates the arc weights γ_{n,i,j} for the chosen learning criterion (bMMI or dMMI) (s111-5) and outputs them to the partial differential coefficient update unit 111e.

The bMMI arc weights γ_{n,i,j} are obtained with the Forward-Backward algorithm. First, the forward cost α_j of each arc a_j is obtained by the following recursion.

Here Pre(j) is the set of indices of the arcs that precede and connect to the j-th arc a_j, that is, Pre(j) = {j' | s'(a_{j'}) = s(a_j)}, and it can be obtained from the recognition lattice L⁽ⁿ⁾. k is a value also called the lattice smoothing coefficient, which can be tuned to improve the accuracy of methods that use the lattice approximation. The transition cost function π is obtained from Equation (4) using the state transition cost ω(s, s'), log P(x⁻_τ | I[s, s']), the learning data X⁻⁽ⁿ⁾, and the adjustment parameter vectors α⁻_i. The backward cost β_j is obtained similarly by the following recursion.

Here Fol(j) is the set of indices of the arcs that follow and connect to the j-th arc a_j, that is, Fol(j) = {j' | s(a_{j'}) = s'(a_j)}. Using α_j, β_j, and the backward cost at the first arc, B = β₁, γ_{n,j} is expressed as follows.
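The recursions themselves (Equations (14) to (16)) are not reproduced in this text, so the following Python is only a generic sketch of a forward-backward pass over lattice arcs, not the patent's exact formulas. It works with per-arc log weights `score[j]`, assumed to stand in for the negated, scaled transition cost with any error boost already folded in. Connectivity additionally requires matching times, which is an assumption beyond the state-only definitions of Pre(j) and Fol(j) given above. All names and the toy lattice are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Arc = Tuple[int, int, int, int]  # (s, s', t, t') with t < t'


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def arc_posteriors(arcs: List[Arc], score: List[float]) -> List[float]:
    """Posterior of each arc when a path weighs exp(sum of its arc scores)."""
    n = len(arcs)
    # Pre(j): arcs ending where arc j starts; Fol(j): arcs starting where arc j ends.
    pre = [[i for i in range(n)
            if arcs[i][1] == arcs[j][0] and arcs[i][3] == arcs[j][2]]
           for j in range(n)]
    fol = [[i for i in range(n)
            if arcs[i][0] == arcs[j][1] and arcs[i][2] == arcs[j][3]]
           for j in range(n)]
    order = sorted(range(n), key=lambda j: arcs[j][2])  # by start time
    alpha = [0.0] * n
    for j in order:  # forward pass: predecessors start earlier, so they are done first
        alpha[j] = score[j] + (log_sum_exp([alpha[i] for i in pre[j]]) if pre[j] else 0.0)
    beta = [0.0] * n
    for j in reversed(order):  # backward pass: followers start later
        beta[j] = score[j] + (log_sum_exp([beta[i] for i in fol[j]]) if fol[j] else 0.0)
    # Total log weight of all paths, summed over arcs with no predecessor.
    total = log_sum_exp([beta[j] for j in range(n) if not pre[j]])
    return [math.exp(alpha[j] + beta[j] - score[j] - total) for j in range(n)]


# Toy lattice: two branches that merge into one final arc.
arcs = [(1, 2, 0, 2), (1, 3, 0, 2), (2, 4, 2, 5), (3, 4, 2, 5), (4, 5, 5, 7)]
score = [-1.0, -2.0, -0.5, -0.5, -0.3]
gamma = arc_posteriors(arcs, score)
```

In the toy lattice the two branches have total scores −1.8 and −2.8, so the first branch has posterior 1 / (1 + e⁻¹) and the shared final arc has posterior 1.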

The dMMI arc weights γ_{n,j} are obtained as follows by writing the bMMI weight γ_{n,j} as a function of σ, γ^~_{n,j}(σ), and repeating the above calculation (Equations (14) and (15)) twice, for the two different values σ₂ ≠ σ₁.

(Partial differential coefficient update unit 111e)
The partial differential coefficient update unit 111e receives the correct lattice L^⁽ⁿ⁾ and the feature vectors φ^_{n,j,d} and φ⁻_{n,j,d} from the feature vector generation unit 111c, receives the recognition lattice L⁽ⁿ⁾ from the recognition unit 105, and receives the arc weights γ_{n,j} from the arc weight calculation unit 111d. It adds the gradient coefficients corresponding to input n over all arcs, thereby updating the partial differential coefficients Δ_{i,d} (s111-6), and outputs them to the adjustment parameter update unit 111f.

The above processing (s111-4 to s111-6) is performed for all of the learning data (s111-3, s111-7, s111-8).

(Adjustment parameter update unit 111f)
The adjustment parameter update unit 111f receives the partial differential coefficients Δ_{i,d}, uses them to update the adjustment parameter vector α_{i,d} (s111-9), and stores the result in the finite state model adjustment parameter storage unit 113. For example, when the steepest gradient method is used as the training method, the update is performed by the following formula.

Here η is a variable called the learning rate, which must be set appropriately. The update rule depends on the optimization method used.

(Convergence determination unit 111g)
The convergence determination unit 111g retrieves the adjustment parameter vector α_{i,d} from the finite state model adjustment parameter storage unit 113 and performs a convergence check (s111-10); if the parameters have converged, it ends the learning program. Otherwise it sends a control signal to each unit so that steps s111-2 to s111-9 are repeated. Possible criteria include simply counting the number of iterations, continuing as long as the speech recognition rate on validation data keeps improving, and stopping once the change in the evaluated objective function value falls below a threshold.
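The update and convergence steps can be sketched as a small training loop in Python. The sketch assumes that Δ is the gradient of an objective to be maximized, so each parameter moves by +η·Δ, and it combines two of the stopping criteria mentioned above (an iteration cap and a threshold on the change in the objective). The update formula in the patent is not reproduced in this text, so the sign convention, the function names, and the toy objective are assumptions.

```python
from typing import Callable, List


def gradient_ascent(grad: Callable[[List[float]], List[float]],
                    objective: Callable[[List[float]], float],
                    alpha: List[float],
                    eta: float = 0.1,
                    tol: float = 1e-8,
                    max_iters: int = 1000) -> List[float]:
    """Repeat steepest-ascent updates until the objective stops changing."""
    prev = objective(alpha)
    for _ in range(max_iters):                               # cap on the number of loops
        delta = grad(alpha)                                  # partial derivatives
        alpha = [a + eta * d for a, d in zip(alpha, delta)]  # parameter update
        cur = objective(alpha)
        if abs(cur - prev) < tol:                            # threshold-based convergence check
            break
        prev = cur
    return alpha


# Toy concave objective with its maximum at (3, -1); purely illustrative.
target = [3.0, -1.0]


def toy_objective(a: List[float]) -> float:
    return -sum((x - t) ** 2 for x, t in zip(a, target))


def toy_gradient(a: List[float]) -> List[float]:
    return [-2.0 * (x - t) for x, t in zip(a, target)]


# Parameters start at zero, as in the initialization described for unit 111a.
result = gradient_ascent(toy_gradient, toy_objective, [0.0, 0.0])
```

In the actual device the gradient would be accumulated over all learning data before each update; the toy gradient here only stands in for that accumulation.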

The speech recognition device 12 of FIG. 2 can improve its accuracy by performing speech recognition with the adjustment parameter vector α_{i,d} generated by the parameter estimation device 100 and stored in the finite state model adjustment parameter storage unit 123. The finite state model adjustment parameter storage unit 123 (see FIG. 2) holds the same information as the adjustment parameter vector α_{i,d} finally stored in the finite state model adjustment parameter storage unit 113 (see FIG. 3).

<Experimental results>
To confirm the effectiveness of this embodiment, a large-vocabulary speech recognition experiment was conducted. In this experiment, the convergence determination unit 111g used the method based on validation data. The feature vector sequence extraction unit 7 was a device that converts the speech signal into 12-dimensional Mel-frequency cepstral coefficients (MFCC) and log power, and then concatenates the first and second time derivatives of these 13 variables to produce 39-dimensional input vectors. The data set consisted of English lecture speech. The learning data set used 101 hours of the lecture speech data and contains 60,392 sequences and 1,076,647 words. The data set for computing the error rate used 7.8 hours of the lecture speech data; this evaluation data set contains 6,989 sequences and 74,823 words. The results are shown in FIG. 10. A "+" in the table indicates that adjustment parameters were additionally introduced into the speech recognition device 91 individually trained by the dMMI method. Whereas the best score of the speech recognition device 91 was 28.2% (dMMI method), the best score of the speech recognition device 12, which performs speech recognition using the adjustment parameter vectors α⁻ of this embodiment, was 27.1% (bMMI method, σ = 2.0), an accuracy improvement of 1% or more over the prior art. In addition, a further accuracy improvement of 0.7% was confirmed relative to the speech recognition device 92 of Patent Document 1. These results indicate that the parameter estimation device of this embodiment functions effectively.
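The 39-dimensional feature layout described above (13 static values plus their first and second time derivatives) can be sketched in Python. The sketch assumes a simple central difference with edge replication for the time derivatives; the actual differentiation window used in the experiment is not stated, and the function name and input values are hypothetical.

```python
from typing import List


def add_deltas(frames: List[List[float]]) -> List[List[float]]:
    """Append first and second time differences to each static frame."""

    def diff(seq: List[List[float]]) -> List[List[float]]:
        n = len(seq)
        out = []
        for t in range(n):
            prev = seq[max(t - 1, 0)]      # replicate the first frame at the left edge
            nxt = seq[min(t + 1, n - 1)]   # replicate the last frame at the right edge
            out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
        return out

    d1 = diff(frames)  # first time derivative
    d2 = diff(d1)      # second time derivative
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Five frames of 13 static values each (12 MFCCs plus log power); values are made up.
static = [[float(t)] * 13 for t in range(5)]
features = add_deltas(static)
```

Each output frame has 13 + 13 + 13 = 39 elements, matching the input dimensionality described in the text.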

<Effects>
This configuration improves the accuracy of parameter estimation. In a speech recognition device built by individually training each module in this way, this embodiment estimates more accurate parameters by retraining the already-built state transition cost ω(s, s') and log P(x⁻_τ | I[s, s']) while taking the whole system into account. A similar attempt is made in Patent Document 1, and the speech recognition device of this embodiment is the same as the one described there, but the method of obtaining the adjustment parameters used inside it differs from that of Patent Document 1. Using the adjustment parameters estimated by this embodiment further improves speech recognition accuracy.

<Other modifications>
In this embodiment the correct lattice L^⁽ⁿ⁾ is obtained by the fine-grained error measure calculation unit 107, but it may instead be obtained in advance by a correct lattice generation unit (not shown), using the correct word string and a speech recognition device, and stored in a storage unit (not shown).

This embodiment described the case in which one correct lattice contains a single arc sequence, but a correct lattice may contain one or more arc sequences. In that case, the MMI numerator is expressed by the following formula in place of Equation (8).

Here the summation Σ_{a^~∈L^⁽ⁿ⁾} runs over the arc sequences contained in the correct lattice L^⁽ⁿ⁾, and the summation Σ_j runs over the arcs a^_j contained in the arc sequence a^~, where a^_j denotes the j-th arc of a^~.

In this embodiment only the arc weights γ_{n,j} corresponding to the recognition lattice are calculated, but the arc weights γ^_{n,j} corresponding to the correct lattice may also be calculated. In that case, the arc weight calculation unit 111d receives the correct lattice from the feature vector generation unit 111c and calculates the bMMI arc weights γ^_{n,j} using the following formulas in place of Equations (14) to (16).

In this case as well, as in the present embodiment, the error measure E^~ must be calculated using a correct lattice that contains only one correct answer.

The dMMI arc weights γ_{n,j} are obtained by the following equation instead of equation (17).

In this case, the partial differential coefficient update unit 111e calculates the partial differential coefficients Δ_{i,d} using the following equation instead of equation (18).

As in Patent Document 1, the form of the feature vectors φ^ and φ⁻ is not limited to the form expressed by equation (5).

There are many ways to construct a probabilistic finite state model for speech recognition. The present invention is applicable to any "speech recognition device based on a probabilistic finite state model", regardless of how the model was constructed. In other words, the present invention can be applied by recasting a speech recognition device implemented by any of the various conventional methods into the abstract form of a speech recognition device based on a probabilistic finite state model. Almost all conventional speech recognition devices can be abstracted into the form of the prior-art speech recognition devices 91 and 92, and the present invention can be used to improve the accuracy of any speech recognition device that can be abstracted in this way. Therefore, most of the speech recognition devices currently in mainstream use can be extended by the present invention.

In the present embodiment, the convergence determination unit 111g performs a convergence determination, and if convergence has not been reached, the processing of s111-2 to s111-9 is repeated (see FIG. 9). Alternatively, the processing from the recognition processing (s1 in FIG. 4) onward may be repeated. In this case, the recognition unit 105 retrieves the updated adjustment parameter vector α⁻_{h(s,s')} from the finite state model adjustment parameter storage unit 113 (indicated by the broken line in FIG. 3) and generates the recognition lattice L⁽ⁿ⁾ based on equation (4). In the second and subsequent iterations, the adjustment parameter initialization processing (s111-1 in FIG. 8) is omitted.
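
The control flow of this variation can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the function `estimate_parameters`, its callable arguments, the `rerun_recognition` flag, and the iteration cap `max_iters` are all hypothetical names introduced for illustration. The only points taken from the text are that initialization runs once, that recognition may be re-run with the updated adjustment parameters on each repetition, and that the loop ends when convergence is determined.

```python
from typing import Callable, TypeVar

P = TypeVar("P")    # adjustment parameters (hypothetical type)
Lat = TypeVar("Lat")  # recognition lattice (hypothetical type)


def estimate_parameters(
    init_params: Callable[[], P],
    recognize: Callable[[P], Lat],
    update: Callable[[P, Lat], P],
    has_converged: Callable[[P, P], bool],
    rerun_recognition: bool = True,
    max_iters: int = 100,
) -> P:
    # Initialization is performed only once, before the first iteration.
    params = init_params()
    lattice = recognize(params)
    for iteration in range(max_iters):
        # Variation: regenerate the recognition lattice with the updated
        # adjustment parameters in the second and subsequent iterations.
        if rerun_recognition and iteration > 0:
            lattice = recognize(params)
        new_params = update(params, lattice)
        if has_converged(params, new_params):
            return new_params
        params = new_params
    return params


# Toy usage with scalar "parameters" that move halfway toward 1.0 per update.
calls = {"init": 0, "recognize": 0}


def toy_init() -> float:
    calls["init"] += 1
    return 0.0


def toy_recognize(p: float) -> float:
    calls["recognize"] += 1
    return p  # stand-in for a lattice that depends on the parameters


def toy_update(p: float, lattice: float) -> float:
    return p + 0.5 * (1.0 - lattice)


result = estimate_parameters(
    toy_init, toy_recognize, toy_update, lambda old, new: abs(new - old) < 1e-6
)
```

Setting `rerun_recognition=False` corresponds to repeating only the parameter update steps against a fixed recognition lattice.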

For simplicity, the present embodiment uses the number of state transition errors as the error measure, but other error measures may be used. The present invention is applicable to any method that does not merely separate correct answers from incorrect ones but makes fine-grained use of the distance between them.
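
A count-based error measure of this kind can be sketched as follows. The sketch assumes that the correct state transition sequence is available as an array with one state transition per frame and that each arc covers the half-open frame range from its start time up to, but not including, its end time. The frame convention, the `Arc` type, and the function names are assumptions for illustration and may differ from the exact equation of the embodiment.

```python
from typing import List, NamedTuple, Sequence, Tuple

Transition = Tuple[int, int]  # (state before transition, state after transition)


class Arc(NamedTuple):
    s: int       # state before the transition
    s_next: int  # state after the transition
    t: int       # first frame covered by the arc (assumed inclusive)
    t_next: int  # frame at which the arc ends (assumed exclusive)


def arc_error(arc: Arc, correct: Sequence[Transition]) -> int:
    """Count the frames on which the arc's transition differs from the correct one."""
    transition = (arc.s, arc.s_next)
    return sum(1 for tau in range(arc.t, arc.t_next) if correct[tau] != transition)


def path_error(arcs: Sequence[Arc], correct: Sequence[Transition]) -> int:
    """Total error of a recognized arc sequence: the sum of its per-arc errors."""
    return sum(arc_error(arc, correct) for arc in arcs)


# Correct transitions for five frames, and a recognized path whose
# middle arc takes a different transition for two frames.
correct: List[Transition] = [(0, 1), (0, 1), (1, 2), (1, 2), (2, 3)]
recognized = [Arc(0, 1, 0, 2), Arc(1, 4, 2, 4), Arc(2, 3, 4, 5)]
total = path_error(recognized, correct)
```

A sentence-level measure would score this path only as wrong, whereas the frame-level count distinguishes a path that deviates for two frames from one that deviates everywhere.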

The present invention is not limited to the embodiment and modifications described above. For example, the various processes described above need not be executed in time series in the order described. They may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The parameter estimation device and the speech recognition device described above can also be implemented by a computer. In this case, a program for causing the computer to function as the target device (a device having the functional configuration shown in the drawings of the embodiments), or a program for causing the computer to execute each step of the processing procedure (as shown in each embodiment), may be loaded into the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or downloaded via a communication line, and then executed.

Description of symbols
12 Speech recognition device
100 Parameter estimation device
101 Finite state model storage unit
103 Learning data storage unit
105 Recognition unit
107 Fine-grained error measure calculation unit
109 Correct path storage unit
111 Parameter estimation unit
111a Adjustment parameter initialization unit
111b Gradient vector initialization unit
111c Feature vector generation unit
111d Calculation unit
111e Partial differential coefficient update unit
111f Adjustment parameter update unit
111g Convergence determination unit
113 Finite state model adjustment parameter storage unit

Claims

A parameter estimation device comprising:
a storage unit that stores a probabilistic finite state model including parameters used in speech recognition, learning data, a correct state transition sequence that is a sequence of state transitions corresponding to a correct speech recognition result for the learning data, and an adjustment parameter that is a parameter for adjusting the probabilistic finite state model;
a recognition unit that generates a recognized state transition sequence, which is a sequence of state transitions corresponding to the speech recognition result obtained by performing speech recognition on the learning data using the probabilistic finite state model;
a fine-grained error measure calculation unit that calculates an error measure in units of state transition errors, based on the difference between the correct state transition sequence and the recognized state transition sequence; and
a parameter estimation unit that corrects the adjustment parameter according to the error measure.

The parameter estimation device according to claim 1, wherein
the fine-grained error measure calculation unit counts the number of times the correct state transition sequence and the recognized state transition sequence make different state transitions, and uses that count, or the time corresponding to that count, as the error measure.

The parameter estimation device according to claim 2, wherein
the correct state transition sequence and the recognized state transition sequence are each represented by a lattice that represents connections between states, and
where each arc of the lattice is denoted a_j; the pre-transition state, post-transition state, pre-transition time, and post-transition time of arc a_j are denoted s(a_j), s'(a_j), t(a_j), and t'(a_j), respectively; the state transition corresponding to arc a_j is denoted (s(a_j), s'(a_j)); each element of the array representation of the lattice corresponding to the correct state transition sequence is denoted c⁽ⁿ⁾_τ; the Kronecker delta function is denoted δ; and the error measure of arc a_j is denoted E^~(a_j),
the fine-grained error measure calculation unit calculates the error measure by the following equation.

The parameter estimation device according to any one of claims 1 to 3, wherein
the correct state transition sequence and the recognized state transition sequence are each represented by a lattice that represents connections between states,
the recognition unit generates a recognition lattice representing the recognized state transition sequence,
the fine-grained error measure calculation unit calculates the error measure based on the difference between the correct lattice, which corresponds to the correct state transition sequence, and the recognition lattice, and
the parameter estimation unit corrects the adjustment parameter according to the error measure, using the correct lattice and the recognition lattice.

A speech recognition device that obtains a speech recognition result for speech data using the adjustment parameter estimated by the parameter estimation device according to any one of claims 1 to 4.

A parameter estimation method comprising:
a recognition step of generating a recognized state transition sequence, which is a sequence of state transitions corresponding to the speech recognition result obtained by performing speech recognition on learning data using a probabilistic finite state model including parameters used in speech recognition;
a fine-grained error measure calculation step of calculating an error measure in units of state transition errors, based on the difference between the recognized state transition sequence and a correct state transition sequence that is a sequence of state transitions corresponding to a correct speech recognition result for the learning data; and
a parameter estimation step of correcting, according to the error measure, an adjustment parameter that is a parameter for adjusting the probabilistic finite state model.

The parameter estimation method according to claim 6, wherein
in the fine-grained error measure calculation step, the number of times the correct state transition sequence and the recognized state transition sequence make different state transitions is counted, and that count, or the time corresponding to that count, is used as the error measure.

The parameter estimation method according to claim 7, wherein
the correct state transition sequence and the recognized state transition sequence are each represented by a lattice that represents connections between states, and
where each arc of the lattice is denoted a_j; the pre-transition state, post-transition state, pre-transition time, and post-transition time of arc a_j are denoted s(a_j), s'(a_j), t(a_j), and t'(a_j), respectively; the state transition corresponding to arc a_j is denoted (s(a_j), s'(a_j)); each element of the array representation of the lattice corresponding to the correct state transition sequence is denoted c⁽ⁿ⁾_τ; the Kronecker delta function is denoted δ; and the error measure of arc a_j is denoted E^~(a_j),
in the fine-grained error measure calculation step, the error measure is calculated by the following equation.

The parameter estimation method according to any one of claims 6 to 8, wherein
the correct state transition sequence and the recognized state transition sequence are each represented by a lattice that represents connections between states,
in the recognition step, a recognition lattice representing the recognized state transition sequence is generated,
in the fine-grained error measure calculation step, the error measure is calculated based on the difference between the correct lattice, which corresponds to the correct state transition sequence, and the recognition lattice, and
in the parameter estimation step, the adjustment parameter is corrected according to the error measure, using the correct lattice and the recognition lattice.

A speech recognition method for obtaining a speech recognition result for speech data using the adjustment parameter estimated by the parameter estimation method according to any one of claims 6 to 9.

A program for causing a computer to function as the parameter estimation device according to any one of claims 1 to 4 or as the speech recognition device according to claim 5.