JP4528076B2 - Speech recognition apparatus and speech recognition program

Info

Publication number: JP4528076B2
Application number: JP2004271895A
Authority: JP
Inventors: 彰夫小林; 亨今井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2004-09-17
Filing date: 2004-09-17
Publication date: 2010-08-18
Anticipated expiration: 2024-09-17
Also published as: JP2006085012A

Description

The present invention relates to a speech recognition apparatus and a speech recognition program, and more particularly to a speech recognition apparatus and a speech recognition program for improving speech recognition accuracy.

Conventionally, methods have been proposed for improving recognition accuracy in speech recognition technology that use a single feature quantity, such as the posterior probability of a word hypothesis, as a measure of word reliability in order to improve the speech recognition rate (see, for example, Non-Patent Document 1).

Feature quantities other than the posterior probability include acoustic stability and word hypothesis density (see, for example, Non-Patent Documents 2 and 3).
Non-Patent Document 1: F. Wessel et al., "Confidence measures for large vocabulary continuous speech recognition," IEEE Trans. Speech and Audio Processing, Vol. 9, pp. 288-298, March 2001.
Non-Patent Document 2: T. Zeppenfeld, M. Finke, and K. Ries, "Recognition of conversational telephone speech using the Janus speech engine," IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp. 1815-1818, 1997.
Non-Patent Document 3: T. Kemp and T. Schaaf, "Estimating confidence using word lattices," Eurospeech, Rhodes, Greece, pp. 827-830, 1997.

As shown in the non-patent documents cited above, conventional methods obtain the reliability of a word using only the posterior probability of the word hypothesis, or some other single feature quantity, to improve the speech recognition rate.

However, the reliability obtained from a single feature quantity has low accuracy, so the improvement in recognition accuracy is small. To improve the accuracy of classifying a word hypothesis as correct or incorrect, it is preferable to integrate a plurality of measures. In other words, higher classification accuracy can be expected to reduce the word error rate in speech recognition.

The present invention has been made in view of the above problems, and its object is to provide a speech recognition apparatus and a speech recognition program that determine with high accuracy whether each output word is correct, thereby improving speech recognition accuracy.

To solve the above problems, the present invention employs means having the following characteristics.

The invention described in claim 1 is a speech recognition apparatus for recognizing speech, comprising: speech recognition means for generating a word network from input speech; feature quantity calculation means for calculating a plurality of preset feature quantities for a plurality of word hypotheses included in the word network, based on the word network obtained by the speech recognition means, an acoustic model, and a language model; and score calculation means.

The score calculation means uses a plurality of feature quantities preset from among the acoustic score, language score, word posterior probability, acoustic stability, word hypothesis density, number of active HMMs, average number of frames per phoneme, back-off case, and a feature quantity based on the history of correct/incorrect labels of the word hypotheses, all obtained by the feature quantity calculation means. With these it sets reliability measures x, each consisting of one of the plurality of feature quantities or a time series of that feature quantity, and a correct/incorrect label y that indicates whether the word hypothesis is correct and that is varied between the correct and incorrect values for each word hypothesis so as to maximize the value of the reliability model P_ME(y|x) given by equation (1).

The score calculation means calculates a log reliability score, which is the logarithm of the probability that the word hypothesis is correct or incorrect, from feature functions f(x, y) and predetermined weights λ for the feature functions. Each feature function f(x, y) is a binary function that takes the value "1" when the reliability measure x is greater than a threshold preset for that reliability measure and the correct/incorrect label y is correct, and "0" otherwise. The score calculation means adds the calculated log reliability score to the log acoustic score and log language score obtained by the speech recognition means for each sentence hypothesis included in the word network, and outputs, as the speech recognition result, the correct word string of the sentence hypothesis with the highest score, of a preset number of sentence hypotheses in descending order of score, or of the sentence hypotheses whose score is equal to or greater than a preset score.

The log reliability score is calculated by taking the logarithm of the probability that the given word hypothesis is correct, as obtained from the reliability model P_ME(y|x) given by equation (1), which includes the feature functions f(x, y) and the weights λ.

According to the invention of claim 1, using the word network makes it possible to determine with high accuracy whether each output word is correct. This improves the accuracy of the speech recognition result.

The invention described in claim 2 is characterized in that the score calculation means obtains the plurality of feature quantities and the time series of the feature quantities from the word hypotheses included in the word network.

According to the invention of claim 2, the accuracy of the speech recognition result can be improved by calculating the reliability score from the word hypotheses based on a plurality of feature quantities and the time series of those feature quantities.

The invention described in claim 3 is characterized in that the score calculation means classifies each word hypothesis as correct or incorrect based on preset feature quantity thresholds, and assigns the classification result to the word hypothesis as the correct/incorrect label.

According to the invention of claim 3, assigning correct/incorrect labels in advance allows the score calculation to be performed efficiently.

The invention described in claim 4 is characterized in that the score calculation means has a plurality of binary classifiers, each expressed by the feature function f(x, y) and each classifying a word hypothesis as correct or incorrect according to a preset feature quantity threshold; combines the plurality of binary classifiers in correspondence with the time series of the feature quantities; adds the log reliability score obtained from the combined binary classifier to the log acoustic score and the log language score for each sentence hypothesis included in the word network; and outputs the correct word string.

According to the invention of claim 4, whether each output word is correct can be determined with high accuracy in correspondence with the time series, which improves the accuracy of the speech recognition result.

The invention described in claim 5 is a speech recognition program for causing a computer to execute speech recognition processing for recognizing speech, the program causing the computer to function as: speech recognition means for generating a word network from input speech; feature quantity calculation means for calculating a plurality of preset feature quantities for a plurality of word hypotheses included in the word network, based on the word network obtained by the speech recognition means, an acoustic model, and a language model; and score calculation means.

The score calculation means uses a plurality of feature quantities preset from among the acoustic score, language score, word posterior probability, acoustic stability, word hypothesis density, number of active HMMs, average number of frames per phoneme, back-off case, and a feature quantity based on the history of correct/incorrect labels of the word hypotheses, all obtained by the feature quantity calculation means. With these it sets reliability measures x, each consisting of one of the plurality of feature quantities or a time series of that feature quantity, and a correct/incorrect label y that indicates whether the word hypothesis is correct and that is varied between the correct and incorrect values for each word hypothesis so as to maximize the value of the reliability model P_ME(y|x) given by equation (1).

The score calculation means calculates a log reliability score, which is the logarithm of the probability that the word hypothesis is correct or incorrect, from feature functions f(x, y) and predetermined weights λ for the feature functions. Each feature function f(x, y) is a binary function that takes the value "1" when the reliability measure x is greater than a threshold preset for that reliability measure and the correct/incorrect label y is correct, and "0" otherwise. The score calculation means adds the calculated log reliability score to the log acoustic score and log language score obtained by the speech recognition means for each sentence hypothesis included in the word network, and outputs, as the speech recognition result, the correct word string of the sentence hypothesis with the highest score, of a preset number of sentence hypotheses in descending order of score, or of the sentence hypotheses whose score is equal to or greater than a preset score.

The log reliability score is calculated by taking the logarithm of the probability that the given word hypothesis is correct, as obtained from the reliability model P_ME(y|x) given by equation (1), which includes the feature functions f(x, y) and the weights λ.

According to the invention of claim 5, whether each output word is correct can be determined with high accuracy, which improves the accuracy of the speech recognition result. In addition, speech recognition can be realized at low cost without requiring a special device configuration, and it can be realized easily by installing the program.

According to the present invention, speech recognition accuracy can be improved.

<Outline of the present invention>
The present invention generates a word network from input speech such as uttered speech, obtains various feature quantities for each word hypothesis on the word network, and expresses whether recognition is correct or incorrect by binary classification based on the feature quantities and their time series. It then calculates a score (reliability score) from a reliability model or the like that integrates the binary classification results (evaluation of the correct answer rate), and combines this score with the language score obtained from the language model and the acoustic score obtained from the acoustic model to output the correct word string as the speech recognition result.

The binary classification results can be integrated using, for example, a maximum entropy (ME) model, which is widely used in classification problems and has the advantage that a model integrating different knowledge sources can be created easily.

Preferred embodiments of the speech recognition apparatus and the speech recognition program according to the present invention are described in detail below with reference to the drawings.

<Functional configuration>
FIG. 1 is a diagram showing an example of the functional configuration of the speech recognition apparatus according to the present invention. The speech recognition apparatus 10 in FIG. 1 includes speech recognition means 11, feature quantity calculation means 12, score calculation means 13, an acoustic model 14, a language model 15, and a reliability model 16.

The speech recognition means 11 receives uttered speech or the like as input speech and generates a word network by referring to the acoustic model 14, which stores information indicating the likelihood obtained from the speech waveform and the pronunciation of words, and the language model 15, which stores information indicating how readily words connect to one another. The speech recognition means 11 outputs the generated word network to the feature quantity calculation means 12.

The feature quantity calculation means 12 calculates a plurality of preset feature quantities for each word hypothesis in the word network by referring to the acoustic model 14 and the language model 15. It also calculates feature quantities from the input word network itself. The feature quantity calculation means 12 outputs the calculated feature quantities to the score calculation means 13.

The score calculation means 13 generates a correct word string based on the reliability model 16 and outputs it as the speech recognition result. The specific content of each component is described next.

<Speech recognition means 11>
The speech recognition means 11 receives uttered speech or the like as input speech and generates a word network whose vertices are word end times and whose edges are the connections between word hypotheses obtained from the acoustic score of the acoustic model 14 and the language score of the language model 15. The speech recognition means 11 outputs the generated word network to the feature quantity calculation means 12.
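The word network described above can be pictured as a small graph structure. The following is a minimal sketch under stated assumptions: the class names `WordEdge` and `WordNetwork`, the field names, and the frame-based time unit are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WordEdge:
    """One word hypothesis: an edge between two word-end-time vertices."""
    start_time: int       # vertex: end time of the preceding word (frames, assumed unit)
    end_time: int         # vertex: end time of this word
    word: str
    acoustic_logp: float  # log acoustic score from the acoustic model
    language_logp: float  # log language score from the language model


@dataclass
class WordNetwork:
    edges: list = field(default_factory=list)

    def add(self, edge):
        self.edges.append(edge)

    def outgoing(self, time):
        """Edges that leave the vertex at the given time."""
        return [e for e in self.edges if e.start_time == time]

    def paths(self, start, end):
        """Enumerate every edge sequence (sentence hypothesis) from start to end."""
        if start == end:
            yield []
            return
        for e in self.outgoing(start):
            for rest in self.paths(e.end_time, end):
                yield [e] + rest


net = WordNetwork()
net.add(WordEdge(0, 10, "speech", -50.0, -2.0))
net.add(WordEdge(0, 10, "peach", -55.0, -4.0))
net.add(WordEdge(10, 25, "recognition", -80.0, -1.5))
sentences = [[e.word for e in p] for p in net.paths(0, 25)]
```

Exhaustive path enumeration is used here only to keep the sketch short; it assumes every edge ends later than it starts, and a real lattice would be searched rather than enumerated.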

<Feature quantity calculation means 12>
For the word hypothesis on each edge of the input word network, the feature quantity calculation means 12 obtains feature quantities such as (a) acoustic score, (b) language score, (c) word posterior probability, (d) acoustic stability, (e) word hypothesis density, (f) number of active HMMs (Hidden Markov Models), (g) average number of frames per phoneme, (h) back-off case, and (i) history of correct/incorrect labels of the word hypotheses.

Here, (a) the acoustic score can be obtained by referring to the acoustic model 14, and (b) the language score can be obtained by referring to the language model 15. (c) The word posterior probability indicates the probability with which each edge (score) of the word network is used, and can be obtained, for example, by the method described in Non-Patent Document 1.
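As an illustration of a word posterior probability on a word network, the sketch below runs a generic forward-backward pass over the edges. It is not necessarily the exact procedure of Non-Patent Document 1; the tuple layout of an edge and the use of a single combined log score per edge are assumptions made for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def edge_posteriors(edges, start, end):
    """Posterior of each edge in a lattice.

    edges: list of (from_time, to_time, word, log_score) with from_time < to_time.
    Returns {(from_time, to_time, word): posterior probability}.
    """
    # Forward pass: alpha[v] = log of the summed scores of all paths start -> v.
    alpha = {start: 0.0}
    for t in sorted({e[1] for e in edges}):
        incoming = [alpha[e[0]] + e[3] for e in edges if e[1] == t and e[0] in alpha]
        if incoming:
            alpha[t] = logsumexp(incoming)
    # Backward pass: beta[v] = log of the summed scores of all paths v -> end.
    beta = {end: 0.0}
    for t in sorted({e[0] for e in edges}, reverse=True):
        outgoing = [e[3] + beta[e[1]] for e in edges if e[0] == t and e[1] in beta]
        if outgoing:
            beta[t] = logsumexp(outgoing)
    total = alpha[end]
    return {
        (e[0], e[1], e[2]): math.exp(alpha[e[0]] + e[3] + beta[e[1]] - total)
        for e in edges
        if e[0] in alpha and e[1] in beta
    }


edges = [
    (0, 10, "speech", math.log(0.6)),
    (0, 10, "peach", math.log(0.4)),
    (10, 25, "recognition", 0.0),
]
post = edge_posteriors(edges, start=0, end=25)
```

In this toy lattice every path passes through "recognition", so its posterior is 1, while the two competing first words share the probability mass in proportion to their scores.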

(d) The acoustic stability indicates the frequency with which a route in the word network is used, and can be obtained, for example, by the method described in Non-Patent Document 2. (e) The word hypothesis density indicates the frequency of the words used at a given time on the word network, and can be obtained, for example, by the method of Non-Patent Document 3.

(f) The number of active HMMs is the number of HMMs being searched simultaneously when a word is searched for in speech recognition. For example, the larger the number of HMMs in the search for a word, the less likely that word is judged to be correct.

The back-off case is a feature quantity that is set, for example when the ease of connection between certain word hypotheses has not been obtained, by ranking sentence candidates in descending order of the acoustic and language scores obtained when the word network was generated and using a predetermined number of sentence hypotheses.

The feature quantity calculation means 12 outputs the word network, annotated with the feature quantities obtained as described above, to the score calculation means 13.

<Score calculation means 13>
The score calculation means 13 generates the optimal correct word string for the word network annotated with feature quantities and outputs the generated correct word string as the speech recognition result. A specific configuration example of the score calculation means 13 is described below with reference to the drawings.

FIG. 2 is a diagram showing a configuration example of the score calculation means according to the present invention. The score calculation means 13 shown in FIG. 2 includes sentence hypothesis generation means 21, reliability calculation means 22, and rescoring means 23.

The sentence hypothesis generation means 21 receives the feature-annotated word network from the feature quantity calculation means 12 and generates a preset number N of sentence hypotheses (n-best sentence hypotheses) by rescoring or the like. The sentence hypothesis generation means 21 can also obtain feature quantities for the N generated sentence hypotheses, for example the back-off case. In this case, the back-off case is set as a feature quantity that takes the value "1" when the n-gram exists in the language model 15 and "0" otherwise.
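The binary back-off feature described above can be sketched as a simple membership test. This is an illustrative sketch only: the function name `backoff_case`, the representation of the language model as a set of word tuples, and the shortening of the context at the start of a sentence are assumptions.

```python
def backoff_case(words, ngram_set, order=3):
    """For each word, return 1 if its n-gram is present in the language model, else 0.

    ngram_set is a hypothetical stand-in for the language model: a set of word tuples.
    Near the start of the sentence the context is shortened to the words available.
    """
    flags = []
    for i in range(len(words)):
        ngram = tuple(words[max(0, i - order + 1): i + 1])
        flags.append(1 if ngram in ngram_set else 0)
    return flags


lm_ngrams = {("the",), ("the", "speech"), ("the", "speech", "recognizer")}
flags = backoff_case(["the", "speech", "recognizer", "works"], lm_ngrams)
```

Here the last word receives "0" because the trigram ending in "works" is absent from the toy model, which is the situation in which a real language model would back off to a shorter context.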

The sentence hypothesis generation means 21 outputs the obtained feature quantities to the reliability calculation means 22. The reliability calculation means 22 calculates the reliability based on the input feature quantities and the reliability model 16.

Here, the reliability model 16 can be obtained, for example, by the maximum entropy method. (The maximum entropy method is described in, for example, A. Berger, S. D. Pietra, and V. D. Pietra, "A Maximum Entropy Approach to Natural Language Processing," Computational Linguistics, Vol. 22, pp. 39-71, 1996.) With the maximum entropy method, the reliability model is given by equation (1).

Here, x denotes a reliability measure (a feature quantity and its time series) such as the posterior probability of a hypothesis, and y ∈ {-1, 1} denotes the correct/incorrect label of the word hypothesis. f_i(x, y) is a binary function, called a feature function, that returns "0" or "1" for the observed pair (x, y) depending on a condition on the features, and λ_i is the weight of the feature function.
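Equation (1) itself is not reproduced in this text. The sketch below assumes the standard conditional maximum entropy form, P_ME(y|x) = exp(Σ_i λ_i f_i(x, y)) / Σ_y' exp(Σ_i λ_i f_i(x, y')), which is consistent with the definitions of f_i and λ_i given here. The function name `p_me`, the example features, and the example weights are hypothetical.

```python
import math

LABELS = (1, -1)  # 1 = correct, -1 = incorrect, following y in {-1, 1}


def p_me(y, x, features, weights):
    """Conditional maximum entropy probability P_ME(y | x) (assumed standard form)."""
    def score(label):
        return sum(w * f(x, label) for f, w in zip(features, weights))

    z = sum(math.exp(score(label)) for label in LABELS)  # normalizer over both labels
    return math.exp(score(y)) / z


# Hypothetical binary feature functions over a scalar reliability measure x.
features = [
    lambda x, y: 1 if x > 0.5 and y == 1 else 0,
    lambda x, y: 1 if x <= 0.5 and y == -1 else 0,
]
weights = [1.2, 0.8]  # hypothetical lambda_i values

p_correct = p_me(1, 0.9, features, weights)
log_reliability = math.log(p_correct)  # the log reliability score used in rescoring
```

Because only the first feature fires for x = 0.9, the probability of "correct" reduces to a logistic function of the first weight.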

P_ME(y|x) is the reliability model and indicates the probability that the word hypothesis of interest is correct (or incorrect). The weights λ_i can be obtained by the GIS (Generalized Iterative Scaling) algorithm or the like.
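For readers unfamiliar with GIS, the following is a minimal sketch of the textbook algorithm on toy data, assuming the standard conditional maximum entropy form for P_ME(y|x). The function name `train_gis`, the slack feature construction, the iteration count, and the training pairs are assumptions for illustration, not details taken from the patent.

```python
import math


def train_gis(data, features, labels=(1, -1), iterations=100):
    """Fit maximum entropy weights with Generalized Iterative Scaling.

    data: list of (x, y) training pairs; features: list of f(x, y) -> 0 or 1.
    Returns (weights, predict), where predict(x) maps each label to P(y | x).
    """
    # GIS requires the feature sum to equal a constant C for every (x, y),
    # so a slack feature tops each sum up to C.
    c = max(1, max(sum(f(x, y) for f in features) for x, _y in data for y in labels))
    feats = list(features) + [lambda x, y: c - sum(f(x, y) for f in features)]
    weights = [0.0] * len(feats)

    def predict(x):
        scores = [sum(w * f(x, y) for f, w in zip(feats, weights)) for y in labels]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        return {y: e / sum(exps) for y, e in zip(labels, exps)}

    # Empirical feature counts observed in the training data.
    empirical = [sum(f(x, y) for x, y in data) for f in feats]
    for _ in range(iterations):
        # Expected feature counts under the current model.
        expected = [0.0] * len(feats)
        for x, _y in data:
            for y, p in predict(x).items():
                for i, f in enumerate(feats):
                    expected[i] += p * f(x, y)
        for i in range(len(feats)):
            # Simplification: features never observed or never expected are left unchanged.
            if empirical[i] > 0 and expected[i] > 0:
                weights[i] += math.log(empirical[i] / expected[i]) / c
    return weights, predict


data = [(0.9, 1), (0.8, 1), (0.7, -1), (0.2, -1), (0.1, -1), (0.3, 1)]
features = [lambda x, y: 1 if x > 0.5 and y == 1 else 0]
weights, predict = train_gis(data, features)
```

In this toy set two of the three hypotheses with x > 0.5 are correct, so the fitted probability of "correct" for such inputs approaches 2/3, while inputs on which no feature fires stay at 1/2.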

The reliability calculation means 22 serves as a binary classifier that uses each feature quantity obtained from the feature-annotated word network to decide whether a word hypothesis is correct or incorrect according to a feature quantity threshold. The binary classifier is expressed by the feature functions of equation (1).

Because a feature function is a binary function, it can be regarded as a simple binary classifier. When the reliability (feature quantity) is expressed with such functions, two questions matter: how to represent the reliability with feature functions, and how to represent the temporal change (time series) of the reliability. In the present invention, the reliability is described with feature functions by setting a threshold for a given reliability and defining a feature function that is activated on one side of the threshold.

That is, a plurality of binary classifiers, each classifying a word hypothesis as correct or incorrect by a feature quantity threshold, are combined into binary classifiers linked along the time series of the feature quantities, and a reliability model integrating all the binary classifiers is used to obtain the reliability score and correct/incorrect label for each word hypothesis and to output the correct word string. In this way, whether each output word is correct can be determined with high accuracy in correspondence with the time series, and the accuracy of the speech recognition result can be improved.

Here, letting c_t be the reliability and y_t be the correct/incorrect label of the word hypothesis to be predicted, the feature function of equation (1) is defined for a reliability threshold c_thresh1 as in, for example, equation (2).

Because f_i in equation (2) is a binary function, it is insufficient on its own to express a feature quantity. To express the reliability in more detail, a plurality of thresholds c_thresh2, c_thresh3, ... are defined for the same reliability measure, and a feature function is defined for each threshold. For example, when thresholds c_thresh2 and c_thresh3 are used, the feature functions are as in equations (3) and (4).
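Equations (2) to (4) are not reproduced in this text. The sketch below follows the wording of the claims, in which a feature function returns "1" when the reliability exceeds its threshold and the label is "correct", and "0" otherwise. The helper name `make_threshold_feature` and the threshold values are hypothetical.

```python
CORRECT, INCORRECT = 1, -1


def make_threshold_feature(threshold):
    """Build f(c_t, y_t): 1 if c_t > threshold and y_t is CORRECT, else 0."""
    def feature(c_t, y_t):
        return 1 if c_t > threshold and y_t == CORRECT else 0
    return feature


# Hypothetical values standing in for c_thresh1, c_thresh2, c_thresh3.
thresholds = [0.3, 0.5, 0.7]
features = [make_threshold_feature(th) for th in thresholds]

active_correct = [f(0.6, CORRECT) for f in features]
active_incorrect = [f(0.6, INCORRECT) for f in features]
```

With several thresholds on the same measure, the pattern of active features acts as a coarse quantization of the reliability value, which is how a set of binary functions can carry more detail than a single one.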

The temporal change of the reliability is captured by incorporating the sequence of reliabilities obtained for the word hypotheses into the feature functions. To express the temporal change of a feature quantity, the feature is determined, for example, for the reliabilities c_{t-1} and c_t by equation (5).
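Equation (5) is not reproduced in this text, so its exact condition is unknown. The sketch below shows one plausible reading only: a feature that fires when every reliability in a window ending at time t exceeds the threshold and the label is "correct". The helper name `make_window_feature`, the `width` parameter, and the "all values above threshold" condition are assumptions.

```python
def make_window_feature(threshold, width=2):
    """Build a feature over the last `width` reliabilities ending at time t.

    Assumed condition: 1 if all values in the window exceed `threshold` and
    y_t is correct (1); 0 otherwise, including when the window is incomplete.
    """
    def feature(series, t, y_t):
        if t - width + 1 < 0:
            return 0
        window = series[t - width + 1: t + 1]
        return 1 if all(c > threshold for c in window) and y_t == 1 else 0
    return feature


f_seq = make_window_feature(0.5, width=2)
series = [0.4, 0.7, 0.9]          # hypothetical reliabilities c_0, c_1, c_2
values = [f_seq(series, t, 1) for t in range(len(series))]
```

Only the last position is active, because it is the only one whose current and previous reliabilities both exceed the threshold.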

In the feature function definitions above, it is important to determine the thresholds for the reliability. The thresholds are determined by the following procedure. First, one threshold is determined using an arbitrary binary classifier. Next, new thresholds are set at fixed intervals above and below that threshold, features are defined, and the maximum entropy model is trained. Threshold setting is repeated from when the classification error rate of the model begins to decrease until it no longer rises.
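The threshold-setting procedure can be sketched as a loop that widens the set of thresholds around an initial value. This is a sketch under stated assumptions: `error_rate` is a hypothetical callback that would train the maximum entropy model with features built from the given thresholds and return its classification error rate, and the stopping rule used here (stop as soon as the error rate no longer improves) is one interpretation of the text.

```python
def select_thresholds(initial, step, error_rate, max_rounds=10):
    """Grow a symmetric set of thresholds around `initial` while the error improves.

    Returns (thresholds, best_error).
    """
    thresholds = [initial]
    best = error_rate(thresholds)
    for k in range(1, max_rounds + 1):
        # Add one new threshold above and one below, at a fixed interval.
        candidate = sorted(thresholds + [initial - k * step, initial + k * step])
        err = error_rate(candidate)
        if err >= best:  # no further improvement: keep the previous set
            break
        thresholds, best = candidate, err
    return thresholds, best


def fake_error_rate(ths):
    """Toy stand-in for training and evaluation: improves up to 5 thresholds."""
    return max(0.10, 0.30 - 0.05 * len(ths))


chosen, best_err = select_thresholds(0.5, 0.1, fake_error_rate)
```

In practice the callback would be evaluated on held-out data, and the same loop would be run separately for each reliability measure.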

Next, the window width (the number of elements in the reliability measure sequence) is changed and the features are determined again. The above processing is performed for all reliability measures, and integration by the maximum entropy method is performed again.

In this way, the reliability model is obtained by defining thresholds and features and integrating them by the maximum entropy method.

Once the reliability model has been obtained, the reliability calculation means 22 obtains the optimal sequence of correct/incorrect labels according to equation (6).

Here, the argmax operation in equation (6) indicates that the optimal sequence y* of correct/incorrect labels for the word hypotheses is obtained by the Viterbi algorithm (described in, for example, G. D. Forney, "The Viterbi Algorithm," Proc. IEEE, Vol. 61, pp. 268-278, 1973).
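Equation (6) is not reproduced in this text. The sketch below assumes that the label sequence is chosen to maximize the sum of per-word log probabilities, where each probability may depend on the previous label (as it would if the label history is used as a feature). The function name `viterbi_labels`, the callback signature, and the toy probabilities are hypothetical.

```python
import math

LABELS = (1, -1)  # 1 = correct, -1 = incorrect


def viterbi_labels(n_words, log_prob):
    """Find the best label sequence y* by dynamic programming.

    log_prob(t, y_prev, y) -> log P(y_t = y | x_t, y_{t-1} = y_prev);
    y_prev is None at t = 0. Returns (labels, total log probability).
    """
    if n_words == 0:
        return [], 0.0
    best = {y: log_prob(0, None, y) for y in LABELS}
    back = []
    for t in range(1, n_words):
        new_best, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda yp: best[yp] + log_prob(t, yp, y))
            new_best[y] = best[prev] + log_prob(t, prev, y)
            pointers[y] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final label.
    last = max(LABELS, key=lambda y: best[y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


p_correct = [0.9, 0.2, 0.8]  # hypothetical per-word probabilities of "correct"


def toy_log_prob(t, y_prev, y):
    p = p_correct[t]
    if y_prev == 1:  # hypothetical label-history effect
        p = min(0.99, p + 0.2)
    return math.log(p if y == 1 else 1.0 - p)


labels, score = viterbi_labels(len(p_correct), toy_log_prob)
```

With only two labels the search is cheap, so the benefit of Viterbi here is that it handles the dependence on the previous label exactly rather than deciding each word in isolation.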

At this point, each word hypothesis w_t^(n) in the N sentence hypotheses is given a reliability score P_ME(y_t|x_t). The reliability calculation means 22 outputs the reliability scores of the reliability model to the rescoring means 23.

For example, for the n-th sentence hypothesis w^(n) among the N sentence hypotheses generated by the sentence hypothesis generation means 21, the rescoring means 23 corrects the rescoring score S(w^(n)) by adding the weighted output score of the binary classifier, as calculated by equation (7).

Here, ac(w_t^(n)) denotes the log score of the acoustic model for w_t^(n) in the sentence hypothesis, lm(w_t^(n)) denotes the log score of the language model for w_t^(n), and cf(w_t^(n)) denotes the log score of the reliability model. gw denotes the weight for the language score, and cw denotes the weight for the binary classifier score (reliability score).
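Equation (7) is not reproduced in this text. Based on the definitions of ac, lm, cf, gw, and cw, the sketch below assumes that the score of a sentence hypothesis is the sum over its words of ac + gw·lm + cw·cf. The function name `rescoring_score`, the tuple layout, and all numeric values are hypothetical.

```python
def rescoring_score(words, gw, cw):
    """Assumed form of S(w^(n)): sum over words of ac + gw * lm + cw * cf.

    words: list of (ac, lm, cf) log scores, one tuple per word in the hypothesis.
    """
    return sum(ac + gw * lm + cw * cf for ac, lm, cf in words)


# Two hypothetical sentence hypotheses that differ in their first word.
hyp_a = [(-50.0, -2.0, -0.1), (-80.0, -1.5, -0.2)]
hyp_b = [(-49.0, -2.5, -1.6), (-80.0, -1.5, -0.2)]
scores = [rescoring_score(h, gw=10.0, cw=5.0) for h in (hyp_a, hyp_b)]
```

In this toy case the second hypothesis has the better acoustic score for its first word, but its low reliability score moves it below the first hypothesis after rescoring.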

There are two ways to correct the rescoring score S(w^(n)): simply adjusting the weight cw, or correcting the classifier score according to the recognition rate of the sentence hypothesis.

In the latter method, specifically, each n-best sentence hypothesis is labeled with correct/incorrect labels by the maximum entropy model to obtain the labels y_t and the scores cf(w_t^(n)), and the n-best sentence hypotheses are rescored using equation (7).

At this time, the classifier score cw is corrected according to the word error rate of the sentence hypothesis. The correction is given by equation (8) below.

Here, Ca and Cb are constants. Also, e(w_t^(n)) (0 ≤ e(w_t^(n)) ≤ 1) denotes the word error rate, which cannot be obtained in advance. Therefore, equation (9) below is calculated from the correct/incorrect label sequence that the reliability calculation means 22 obtains using equation (6) above, and its value is used in place of the word error rate.
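Equation (9) is not reproduced in this text. As a hedged illustration only, the Python below assumes that the substitute for the word error rate is the fraction of words in a hypothesis that the reliability model labels as incorrect, which satisfies the stated range of 0 to 1. The function name and the label encoding (1 for correct, 0 for incorrect) are assumptions; the form of equation (8) and the constants Ca and Cb are not modeled here.

```python
from typing import Sequence


def estimated_word_error_rate(labels: Sequence[int]) -> float:
    """Estimate e(w^(n)) from a correct/incorrect label sequence.

    labels: one entry per word hypothesis, 1 = correct, 0 = incorrect
    (assumed encoding). Returns a value in [0, 1].
    """
    if not labels:
        return 0.0  # assumption: an empty hypothesis contributes no errors
    incorrect = sum(1 for y in labels if y == 0)
    return incorrect / len(labels)


if __name__ == "__main__":
    print(estimated_word_error_rate([1, 0, 1, 1]))
```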

In this way, by applying a penalty that depends on the word recognition rate to the rescoring score, sentence hypotheses with a high word error rate can be pushed down toward the bottom of the n-best list.

The rescoring means 23 outputs, as the speech recognition result, the correct word string of the sentence hypothesis with the highest reliability score obtained by rescoring, of a preset number of sentence hypotheses taken in descending order of score, or of the sentence hypotheses whose score is equal to or higher than a preset score.
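The three output modes described above can be sketched in Python as follows. The `Hypothesis` tuple layout, the function name, and the parameter names `top_k` and `min_score` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# (rescoring score, word string) for one sentence hypothesis; assumed layout.
Hypothesis = Tuple[float, List[str]]


def select_hypotheses(
    hyps: Sequence[Hypothesis],
    top_k: Optional[int] = None,
    min_score: Optional[float] = None,
) -> List[Hypothesis]:
    """Return hypotheses in descending score order according to one mode.

    - min_score given: every hypothesis whose score is >= min_score
    - top_k given: the top_k highest-scoring hypotheses
    - neither given: only the single highest-scoring hypothesis
    """
    ranked = sorted(hyps, key=lambda h: h[0], reverse=True)
    if min_score is not None:
        return [h for h in ranked if h[0] >= min_score]
    if top_k is not None:
        return ranked[:top_k]
    return ranked[:1]


if __name__ == "__main__":
    n_best = [(-27.0, ["a", "b"]), (-20.0, ["c"]), (-35.0, ["d"])]
    print(select_hypotheses(n_best))
    print(select_hypotheses(n_best, top_k=2))
    print(select_hypotheses(n_best, min_score=-30.0))
```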

As described above, the speech recognition apparatus according to the present invention can determine with high accuracy whether each output word is correct or incorrect. This improves speech recognition accuracy.

Here, the speech recognition apparatus described above can perform the speech recognition of the present invention using the dedicated device configuration described above. Alternatively, an execution program that causes a computer to execute the processing of each component can be generated and installed on, for example, a general-purpose personal computer or workstation to realize the speech recognition of the present invention.

<Hardware configuration>
Here, an example of the hardware configuration of a computer capable of executing the speech recognition of the present invention is described with reference to the drawings. FIG. 3 is a diagram showing an example of a hardware configuration capable of realizing the speech recognition of the present invention.

The computer main body in FIG. 3 includes an input device 31, an output device 32, a drive device 33, an auxiliary storage device 34, a memory device 35, a CPU (Central Processing Unit) 36 that performs various controls, and a network connection device 37, which are connected to one another by a system bus B.

The input device 31 has a keyboard and a pointing device such as a mouse operated by the user, and inputs various operation signals from the user, such as a request to execute a program. The output device 32 has a display that shows the various windows, data, and the like needed to operate the computer main body that performs the processing of the present invention, and can display the progress and results of the speech recognition program under the control program of the CPU 36.

Here, in the present invention, the execution program installed on the computer main body is provided by, for example, a recording medium 38 such as a CD-ROM. The recording medium 38 on which the program is recorded can be set in the drive device 33, and the execution program contained in the recording medium 38 is installed from the recording medium 38 into the auxiliary storage device 34 via the drive device 33.

The auxiliary storage device 34 is a storage means such as a hard disk. It stores the execution program of the present invention, control programs provided on the computer, and the like, and can input and output them as necessary.

Based on a control program such as an OS (Operating System) and the execution program read out and stored by the memory device 35, the CPU 36 controls the processing of the entire computer, such as various computations and data input/output with each hardware component, to realize each process in speech recognition. Various information needed during program execution can be obtained from, and stored in, the auxiliary storage device 34.

By connecting to a communication network or the like, the network connection device 37 can obtain the execution program from another terminal connected to the communication network, and can provide the execution results obtained by running the program, or the execution program of the present invention itself, to other terminals.

With the hardware configuration described above, speech recognition can be realized at low cost without requiring a special device configuration. In addition, speech recognition can be realized easily by installing the program.

Next, the processing procedure of the execution program is described using flowcharts.

<Speech recognition processing>
FIG. 4 is a flowchart showing an example of the speech recognition processing procedure using the speech recognition program of the present invention. Uttered speech or the like is input as input speech (S01), recognition processing is performed on the input speech data, and a word network is generated (S02).

Next, based on the word network obtained in S02, a plurality of preset feature quantities are calculated with reference to the acoustic model and the language model (S03). Then, a reliability score is calculated based on the word network annotated with the feature quantities obtained in S03 and on the reliability model (S04). As a result of the calculation, the correct word string of the sentence hypothesis with the highest reliability score, of a preset number of sentence hypotheses in descending order of score, or of the sentence hypotheses with a score equal to or higher than a preset score is output as the speech recognition result (S05).

FIG. 5 is a flowchart showing an example of the reliability score calculation procedure in S04. In the reliability score calculation of S04, first, a feature-annotated word network to which a plurality of feature quantities have been attached is input (S11), and a preset number N of sentence hypotheses (n-best sentence hypotheses) are generated by rescoring (S12). Next, the reliability (reliability score) is calculated based on the feature quantities of the generated sentence hypotheses (S13). Then, rescoring with the reliability is performed by adjusting the reliability weight in equation (7) above (S14).

With these processes, whether each output word is correct or incorrect can be determined with high accuracy, and speech recognition accuracy can be improved. Furthermore, speech recognition can be realized easily by installing an execution program that performs the above processing on a general-purpose computer or the like.

<Comparison between the conventional method and the present invention>
Next, the results of comparing the speech recognition method described above with the conventional method are described with reference to the drawings. FIG. 6 is a diagram showing an example of comparison data and comparison results. Here, FIG. 6(a) shows an example of the comparison data, FIG. 6(b) shows the classification error rates when correct/incorrect labels are assigned, and FIG. 6(c) shows the comparison results for rescoring.

FIG. 6(a) shows, for the learning data of the maximum entropy model and for the evaluation data of the method of the present invention, the number of sentences, the number of words, the perplexity (a parameter indicating the difficulty of the sentences), the unknown word rate, and the word error rate.

Here, the learning data was used to train the maximum entropy model of the reliability model after the speech recognition apparatus output recognition results and each feature quantity was obtained. Of the learning data, 18,106 words were used to determine the thresholds of the feature functions of the maximum entropy model, the thresholds of the classifiers based on each feature quantity, the parameters of the speech recognition system (language weight, insertion penalty), and the like. For the reliability model, feature functions of 8,210 words were selected from the learning data.

FIG. 6(b) shows the classification error rate for each feature quantity when the reliability calculation means 22 described above assigns correct/incorrect labels. In FIG. 6(b), the word posterior probability, acoustic stability, and word hypothesis density entries show the classification error rates when a binary classifier is built from the respective feature quantity (conventional method), and the reliability model entry shows the classification error rate of the present invention. FIG. 6(b) shows that the reliability model has a lower error rate, and thus higher performance, than classification based on the conventional feature quantities.

FIG. 6(c) shows the word error rate of the speech recognition results. The conventional method in FIG. 6(c) performs speech recognition without adding the reliability score cf of the reliability model in equation (7) above. The word posterior probability entry performs speech recognition using the logarithm of the word posterior probability as the reliability score cf. FIG. 6(c) shows that the reliability model obtained by the method of the present invention improves the word error rate, which indicates that the present invention is effective.

As described above, according to the present invention, whether each output word is correct or incorrect can be determined with high accuracy, which improves the accuracy of the speech recognition result. Specifically, a plurality of feature quantities are integrated to obtain the reliability of whether a word hypothesis in speech recognition is correct or incorrect. Furthermore, by integrating the reliability measures with a maximum entropy model and labeling each word as correct or incorrect, the correctness of output words is determined with high accuracy and speech recognition accuracy is improved.
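The integration of several reliability measures can be illustrated with a small Python sketch. It assumes that equation (1), which is not reproduced in this text, takes the standard conditional maximum entropy form, with P_ME(y | x) proportional to exp(Σ λ_i f_i(x, y)). The binary feature function follows the description in the claims: it is 1 when a reliability measure exceeds its preset threshold and the label is "correct", and 0 otherwise. All names, thresholds, and weights below are hypothetical.

```python
import math
from typing import Sequence


def feature(x_i: float, threshold: float, y: int) -> int:
    """Binary feature: 1 if the measure exceeds its threshold and y is correct (1)."""
    return 1 if (x_i > threshold and y == 1) else 0


def p_me(x: Sequence[float], thresholds: Sequence[float],
         lambdas: Sequence[float], y: int) -> float:
    """Assumed maximum entropy form: exp(sum_i lambda_i * f_i(x, y)) / Z(x)."""
    def unnormalized(label: int) -> float:
        total = sum(lam * feature(xi, th, label)
                    for xi, th, lam in zip(x, thresholds, lambdas))
        return math.exp(total)

    z = unnormalized(0) + unnormalized(1)  # normalize over both labels
    return unnormalized(y) / z


def log_confidence(x: Sequence[float], thresholds: Sequence[float],
                   lambdas: Sequence[float]) -> float:
    """Log probability that the word hypothesis is correct (the score cf)."""
    return math.log(p_me(x, thresholds, lambdas, 1))


if __name__ == "__main__":
    # Hypothetical measures, e.g. word posterior probability and acoustic stability.
    x = [0.9, 0.4]
    print(log_confidence(x, thresholds=[0.5, 0.5], lambdas=[1.2, 0.8]))
```

With this feature definition, every feature is 0 for the "incorrect" label, so a word whose measures all fall below their thresholds receives a probability of 0.5 in this sketch.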

Preferred embodiments of the present invention have been described in detail above. However, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

FIG. 1 is a diagram showing an example of the functional configuration of the speech recognition apparatus of the present invention.
FIG. 2 is a diagram showing a configuration example of the score calculation means of the present invention.
FIG. 3 is a diagram showing an example of a hardware configuration capable of realizing the speech recognition of the present invention.
FIG. 4 is a flowchart showing an example of the speech recognition processing procedure using the speech recognition program of the present invention.
FIG. 5 is a flowchart showing an example of the reliability score calculation procedure.
FIG. 6 is a diagram showing an example of comparison data and comparison results.

Explanation of symbols

10 Speech recognition apparatus
11 Speech recognition means
12 Feature quantity calculation means
13 Score calculation means
14 Acoustic model
15 Language model
16 Reliability model
21 Sentence hypothesis generation means
22 Reliability calculation means
23 Rescoring means
31 Input device
32 Output device
33 Drive device
34 Auxiliary storage device
35 Memory device
36 CPU
37 Network connection device
38 Recording medium

Claims

A speech recognition apparatus that recognizes speech, comprising:
speech recognition means for generating a word network from input speech;
feature quantity calculation means for calculating a plurality of preset feature quantities for a plurality of word hypotheses included in the word network, based on the word network obtained by the speech recognition means, an acoustic model, and a language model; and
score calculation means that uses, as reliability measures x, a plurality of feature quantities preset from among the acoustic score, language score, word posterior probability, acoustic stability, word hypothesis density, number of active HMMs, average number of frames per phoneme, back-off case, and history of correct/incorrect labels of the word hypotheses obtained by the feature quantity calculation means, each reliability measure x consisting of the plurality of feature quantities and a time series of the feature quantities; obtains the correct/incorrect label y, which indicates whether the word hypothesis is correct, that maximizes the value of the reliability model P_ME(y | x) given by equation (1) below, by changing the label to either the correct or the incorrect value for each word hypothesis; calculates a log reliability score, which is the logarithm of the probability that the word hypothesis is correct, from feature functions f(x, y), each being a binary function that takes "1" when the reliability measure x is larger than a threshold preset for that reliability measure x and the correct/incorrect label y is correct and takes "0" otherwise, and from preset weights λ for the feature functions f(x, y); adds the calculated log reliability score to the log acoustic score and log language score obtained by the speech recognition means for each sentence hypothesis included in the word network; and outputs, as the speech recognition result, the correct word string of the sentence hypothesis with the highest score, of a preset number of sentence hypotheses in descending order of score, or of the sentence hypotheses with a score equal to or higher than a preset score,
wherein the log reliability score is calculated by taking the logarithm of the probability, obtained from the reliability model P_ME(y | x) given by equation (1) below comprising the feature functions f(x, y) and the weights λ, that the word hypothesis is correct when the word hypothesis is given.

The speech recognition apparatus according to claim 1, wherein the score calculation means obtains the plurality of feature quantities and the time series of the feature quantities from the word hypotheses included in the word network.

The speech recognition apparatus, wherein the score calculation means classifies each word hypothesis as correct or incorrect based on a preset feature quantity threshold and assigns the classification result to the word hypothesis as the correct/incorrect label.

The speech recognition apparatus according to claim 1, wherein the score calculation means includes a plurality of binary classifiers, expressed by the feature functions f(x, y), that classify a word hypothesis as correct or incorrect according to a preset feature quantity threshold, combines the plurality of binary classifiers in correspondence with the time series of the feature quantities, and outputs the correct word string by adding the log reliability score obtained from the combined binary classifiers, the log acoustic score, and the log language score for each sentence hypothesis included in the word network.

A speech recognition program for causing a computer to execute speech recognition processing for recognizing speech, the program causing the computer to function as:
speech recognition means for generating a word network from input speech;
feature quantity calculation means for calculating a plurality of preset feature quantities for a plurality of word hypotheses included in the word network, based on the word network obtained by the speech recognition means, an acoustic model, and a language model; and
score calculation means that uses, as reliability measures x, a plurality of feature quantities preset from among the acoustic score, language score, word posterior probability, acoustic stability, word hypothesis density, number of active HMMs, average number of frames per phoneme, back-off case, and history of correct/incorrect labels of the word hypotheses obtained by the feature quantity calculation means, each reliability measure x consisting of the plurality of feature quantities and a time series of the feature quantities; obtains the correct/incorrect label y, which indicates whether the word hypothesis is correct, that maximizes the value of the reliability model P_ME(y | x) given by equation (1) below, by changing the label to either the correct or the incorrect value for each word hypothesis; calculates a log reliability score, which is the logarithm of the probability that the word hypothesis is correct, from feature functions f(x, y), each being a binary function that takes "1" when the reliability measure x is larger than a threshold preset for that reliability measure x and the correct/incorrect label y is correct and takes "0" otherwise, and from preset weights λ for the feature functions f(x, y); adds the calculated log reliability score to the log acoustic score and log language score obtained by the speech recognition means for each sentence hypothesis included in the word network; and outputs, as the speech recognition result, the correct word string of the sentence hypothesis with the highest score, of a preset number of sentence hypotheses in descending order of score, or of the sentence hypotheses with a score equal to or higher than a preset score,
wherein the log reliability score is calculated by taking the logarithm of the probability, obtained from the reliability model P_ME(y | x) given by equation (1) below comprising the feature functions f(x, y) and the weights λ, that the word hypothesis is correct when the word hypothesis is given.