JP4829910B2 - Speech recognition error analysis apparatus, method, program, and recording medium therefor


Info

Publication number
JP4829910B2
Authority
JP
Japan
Prior art keywords
word
error
correct
word set
section
Prior art date
Legal status
Expired - Fee Related
Application number
JP2008038468A
Other languages
Japanese (ja)
Other versions
JP2009198646A (en)
Inventor
太一 浅見
喜昭 野田
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2008038468A priority Critical patent/JP4829910B2/en
Publication of JP2009198646A publication Critical patent/JP2009198646A/en
Application granted granted Critical
Publication of JP4829910B2 publication Critical patent/JP4829910B2/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Description

The present invention relates to speech recognition technology, and in particular to a speech recognition error analysis apparatus, method, and program for analyzing the causes of speech recognition errors attributable to a language model, and to a recording medium therefor.

When improving the acoustic model and the language model that make up a speech recognition engine, it is efficient to start with the parts that are prone to recognition errors.

For the acoustic model, which judges which phoneme an input speech segment is closest to, the parts prone to recognition errors can be identified by building a confusion matrix. A confusion matrix tabulates, for every phoneme, which other phonemes it is easily confused with. By building a confusion matrix to identify the easily confused phonemes and then improving the model starting with those phonemes, the acoustic model can be improved efficiently.
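
As an informal illustration (not part of the patent text), a phoneme confusion matrix can be accumulated from aligned reference/hypothesis phoneme pairs; the phoneme-level alignment itself is assumed to be given:

```python
from collections import Counter, defaultdict

def confusion_matrix(aligned_pairs):
    """Tally how often each reference phoneme is recognized as each
    hypothesis phoneme. aligned_pairs: iterable of
    (reference_phoneme, hypothesis_phoneme) tuples obtained from a
    phoneme-level alignment (assumed to be available already)."""
    matrix = defaultdict(Counter)
    for ref, hyp in aligned_pairs:
        matrix[ref][hyp] += 1
    return matrix

cm = confusion_matrix([("s", "s"), ("s", "sh"), ("s", "sh"), ("t", "t")])
print(cm["s"].most_common())  # [('sh', 2), ('s', 1)]
```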

For the language model, on the other hand, evaluation by perplexity is commonly used as a performance analysis method (see, for example, Non-Patent Document 1). In speech recognition, the word chain probabilities computed by the language model are used to narrow down the recognition word candidates. Perplexity is a value indicating the average number of branches from each word in the recognition vocabulary to the next word; the larger the value, the harder it is for the language model to narrow down the recognition word candidates.
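
As an informal illustration (not part of the patent text), the perplexity of a bigram language model over a test word sequence can be computed as follows; bigram_prob is a hypothetical stand-in for the model's word chain probabilities:

```python
import math

def bigram_perplexity(words, bigram_prob):
    """Perplexity of a word sequence under a bigram model.
    bigram_prob(prev, word) must return P(word | prev) > 0."""
    log_prob = sum(math.log(bigram_prob(prev, word))
                   for prev, word in zip(words, words[1:]))
    n = len(words) - 1  # number of predicted words
    return math.exp(-log_prob / n)

# A uniform model over a 1000-word vocabulary gives a perplexity of about 1000.
print(bigram_perplexity(["a", "b", "c"], lambda prev, word: 1 / 1000))
```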

However, the perplexity value does not indicate which specific word sequences are hard to narrow down, so it cannot identify the parts of the language model that are prone to recognition errors.
[Non-Patent Document 1] Lawrence Rabiner and Biing-Hwang Juang (translated by Sadaoki Furui), "Fundamentals of Speech Recognition (Vol. 2)", NTT Advanced Technology Corporation, 1995, pp. 263-265.

As described above, the language model performance analysis method described in Non-Patent Document 1 has the problem that it cannot identify the parts of the language model that are prone to recognition errors.

An object of the present invention is to provide a speech recognition error analysis apparatus, method, and program that identify the parts of a language model prone to recognition errors, and a recording medium therefor.

According to one aspect of the present invention, speech recognition processing is performed on a speech signal using a language model, and the word string that is the speech recognition result (hereinafter, the recognized word string) is assigned to the signal. From the recognized word string, a word string consisting of one word or a run of consecutive words that do not match the corresponding correct word string (hereinafter, the recognition-error word string) is extracted, together with the recognition-error section, which consists of the recognition-error word string and the single word immediately before and after it. A start-part error two-word set is extracted, consisting of the first word of the recognition-error section and the first word of the recognition-error word string. A start-part correct two-word set is extracted, consisting of the first word of the recognition-error section and the first word of the correct word string corresponding to the recognition-error word string. Using the language model, the word chain probability of the start-part error two-word set and the word chain probability of the start-part correct two-word set are computed. The two probabilities are compared, and any start-part correct two-word set whose word chain probability is lower than that of the corresponding start-part error two-word set (hereinafter, a low-start-part correct two-word set) is extracted.

A low-start-part correct two-word set, whose word chain probability is lower than that of the corresponding start-part error two-word set, is a word sequence that can cause a recognition error to occur. Therefore, by extracting the low-start-part correct two-word sets, the parts of the language model that are prone to recognition errors can be identified.

Hereinafter, example embodiments of the present invention will be described with reference to the drawings.

[First Embodiment]
Recognition errors tend to occur in runs spanning several words, and their causes can be divided into two kinds: (1) the cause of the start of a recognition error, and (2) the cause of the spread of a recognition error. The first embodiment identifies, among the causes of recognition errors, the parts that can cause a recognition error to start.

An example of the first embodiment of the present invention will be described with reference to FIGS. 1 and 4. FIG. 1 is a functional block diagram of an example of the speech recognition error analysis apparatus. FIG. 4 is a flowchart illustrating the processing flow of the speech recognition error analysis method.

The speech recognition error analysis apparatus 1 of the first embodiment comprises, for example, the components shown by solid lines in FIG. 1: a speech recognition unit 11, a recognition-error section extraction unit 12, a start-part two-word set extraction unit 21, a start-part word chain probability calculation unit 22, and a low-start-part correct two-word set extraction unit 23.

<Step S1>
The speech recognition unit 11 performs speech recognition processing on a speech signal using an acoustic model, a language model, and a recognition dictionary, and assigns to the speech signal the word string that is the result of the recognition processing. The assigned word string is called the recognized word string. Each word in the recognized word string is given a start time and an end time. The recognized word string is sent to the recognition-error section extraction unit 12.
For an overview of speech recognition processing, see, for example, Reference 1.

[Reference 1] Hirokazu Masataki and five others, "'VoiceRex': Spontaneous Speech Recognition Technology for Recognizing Natural Conversations with Customers", NTT Technical Journal, November 2006, No. 18, vol. 11, pp. 15-18.
For example, the speech recognition unit 11 performs speech recognition processing on a speech signal that contains at least the sentence 「インターネットが繋がらない」 ("the Internet is not connecting") and, as shown by the solid lines in FIG. 5, assigns to that portion of the speech signal a recognized word string containing the word sequence (インターネット)(勝つ)(な)(が)(荒)(ない).

<Step S2>
The recognition-error section extraction unit 12 compares the recognized word string with the correct word string corresponding to it, and extracts the recognition-error word string and the recognition-error section, which consists of the recognition-error word string and the single word immediately before and after it.

A recognition-error word string is a word string within the recognized word string consisting of one word or a run of consecutive words that do not match the corresponding correct word string. The extracted recognition-error word string and recognition-error section are sent to the start-part two-word set extraction unit 21.

In the example shown in FIG. 5, the recognized word string and the corresponding correct word string disagree over the four consecutive words (勝つ)(な)(が)(荒). Those four words therefore form the recognition-error word string. Adding the word before it, (インターネット), and the word after it, (ない), gives the recognition-error section (インターネット)(勝つ)(な)(が)(荒)(ない).

In general, the speech recognition processing of the speech recognition unit 11 gives rise to multiple recognition-error sections, all of which are extracted by the recognition-error section extraction unit 12. The processing below is performed for each recognition-error section.
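
As a rough sketch (not part of the patent text), the extraction of step S2 can be written as an alignment of the recognized and correct word strings followed by one word of padding on each side of every non-matching run. The patent leaves the alignment method open, and FIG. 5 suggests a time-based alignment, so Python's difflib is used here only as a stand-in and may segment an error run differently:

```python
import difflib

def extract_error_sections(recognized, correct):
    """Locate recognition-error word strings and recognition-error sections.

    recognized, correct: lists of words. Returns (error_word_string,
    error_section) pairs, where each error section is the error word string
    padded with one word of context on each side when such a word exists.
    """
    matcher = difflib.SequenceMatcher(a=recognized, b=correct, autojunk=False)
    sections = []
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag == "equal":
            continue
        error_words = recognized[i1:i2]
        if not error_words:  # deletion errors: nothing was recognized here
            continue
        start = max(i1 - 1, 0)
        end = min(i2 + 1, len(recognized))
        sections.append((error_words, recognized[start:end]))
    return sections
```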

<Step S3>
As illustrated in FIG. 2, the start-part two-word set extraction unit 21 includes a start-part error two-word set extraction unit 211 and a start-part correct two-word set extraction unit 212.

The start-part error two-word set extraction unit 211 extracts the start-part error two-word set from the recognition-error word string and the recognition-error section. The extracted start-part error two-word set is sent to the start-part word chain probability calculation unit 22.
The start-part error two-word set is the pair of two words consisting of the first word of the recognition-error section and the first word of the recognition-error word string.

In the example shown in FIG. 5, the two words (インターネット)(勝つ), consisting of the first word of the recognition-error section, (インターネット), and the first word of the recognition-error word string, (勝つ), form the start-part error two-word set.

<Step S4>
The start-part correct two-word set extraction unit 212 of the start-part two-word set extraction unit 21 extracts the start-part correct two-word set from the recognition-error section and the correct word string corresponding to the recognition-error word string. The extracted start-part correct two-word set is sent to the start-part word chain probability calculation unit 22.
The start-part correct two-word set is the pair of two words consisting of the first word of the recognition-error section and the first word of the correct word string corresponding to the recognition-error word string.

In the example shown in FIG. 5, the two words (インターネット)(が), consisting of the first word of the recognition-error section, (インターネット), and the first word of the correct word string corresponding to the recognition-error word string, (が), form the start-part correct two-word set.
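
A minimal sketch of steps S3 and S4 (not part of the patent text); the argument layout and names are assumptions made here for illustration:

```python
def start_part_two_word_sets(error_section, error_words, correct_words):
    """Start-part two-word sets for one recognition-error section.

    error_section: the error word string padded with one context word on each side
    error_words:   the recognition-error word string itself
    correct_words: the correct word string corresponding to the error word string
    Returns (start_part_error_pair, start_part_correct_pair).
    """
    context_word = error_section[0]                  # first word of the error section
    error_pair = (context_word, error_words[0])      # step S3
    correct_pair = (context_word, correct_words[0])  # step S4
    return error_pair, correct_pair

# FIG. 5 example
print(start_part_two_word_sets(
    ["インターネット", "勝つ", "な", "が", "荒", "ない"],
    ["勝つ", "な", "が", "荒"],
    ["が", "繋が", "ら"],
))  # (('インターネット', '勝つ'), ('インターネット', 'が'))
```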

<Step S5>
The start-part word chain probability calculation unit 22 uses the same language model as the speech recognition unit 11 to compute the word chain probability of the start-part error two-word set and the word chain probability of the start-part correct two-word set. Each computed word chain probability is sent to the low-start-part correct two-word set extraction unit 23 together with the start-part error two-word set or start-part correct two-word set from which it was computed.

The word chain probability is the probability, computed using the language model, of chaining from the first word of a two-word set to the second word of that set (see, for example, Reference 2).

[Reference 2] Lawrence Rabiner and Biing-Hwang Juang (translated by Sadaoki Furui), "Fundamentals of Speech Recognition (Vol. 2)", NTT Advanced Technology Corporation, 1995, pp. 262-263.
<Step S6>
The low-start-part correct two-word set extraction unit 23 compares the word chain probability of the start-part error two-word set with the word chain probability of the start-part correct two-word set, and extracts any start-part correct two-word set whose word chain probability is lower than that of the start-part error two-word set. Such a start-part correct two-word set is called a low-start-part correct two-word set.
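
As a rough sketch (not part of the patent text), steps S5 and S6 amount to looking up two bigram probabilities per error section and keeping the correct pair when it scores lower than the error pair; the dictionary-based toy model below is a hypothetical stand-in for the recognizer's language model:

```python
def low_start_correct_pairs(pair_list, bigram_prob):
    """Steps S5 and S6 for a list of (error_pair, correct_pair) tuples.

    bigram_prob(prev, word) must return the word chain probability
    P(word | prev) under the same language model used for recognition.
    Returns the correct pairs whose probability is lower than that of the
    corresponding error pair, i.e. the low-start-part correct two-word sets.
    """
    low_pairs = []
    for error_pair, correct_pair in pair_list:
        p_error = bigram_prob(*error_pair)      # step S5
        p_correct = bigram_prob(*correct_pair)  # step S5
        if p_correct < p_error:                 # step S6
            low_pairs.append(correct_pair)
    return low_pairs

# Hypothetical probabilities for the FIG. 5 example
toy_model = {("インターネット", "勝つ"): 1e-4, ("インターネット", "が"): 1e-6}
pairs = [(("インターネット", "勝つ"), ("インターネット", "が"))]
print(low_start_correct_pairs(pairs, lambda a, b: toy_model.get((a, b), 1e-9)))
# [('インターネット', 'が')]
```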

Because its word chain probability is lower than that of the corresponding start-part error two-word set, a low-start-part correct two-word set can cause a recognition error to start. Therefore, extracting the low-start-part correct two-word sets as described above identifies the parts of the language model that are prone to recognition errors; more specifically, among those parts, it identifies the ones that can cause a recognition error to start.

The reason a low-start-part correct two-word set that causes recognition errors to start is assigned a lower word chain probability than the corresponding start-part error two-word set is thought to be that the correct two-word sequence does not appear, or appears only rarely, in the text used as language model training data, so an appropriate probability could not be learned. The language model can therefore be improved by using text in which the low-start-part correct two-word sequences appear frequently as language model training data.

[Second Embodiment]
An example of the second embodiment will now be described. As mentioned above, recognition errors tend to occur in runs spanning several words, and their causes can be divided into two kinds: (1) the cause of the start of a recognition error, and (2) the cause of the spread of a recognition error. The second embodiment identifies both kinds of causes.

In the description of the second embodiment below, only the parts that differ from the first embodiment are described; duplicate description of the parts that are the same as in the first embodiment is omitted.

In addition to the components of the speech recognition error analysis apparatus 1 of the first embodiment, the speech recognition error analysis apparatus of the second embodiment comprises, for example, the components shown by broken lines in FIG. 1: an intra-section two-word set extraction unit 31, an intra-section word chain probability calculation unit 32, and a high-intra-section error two-word set extraction unit 33. In the speech recognition error analysis method of the second embodiment, the processing of steps S7 to S10, shown by broken lines in FIG. 4, is performed in addition to the processing of the first embodiment.

<Step S2>
The recognition-error section extraction unit 12 sends the extracted recognition-error word string to the intra-section two-word set extraction unit 31. The recognition-error section itself need not be sent to the intra-section two-word set extraction unit 31.

<Step S7>
As illustrated in FIG. 3, the intra-section two-word set extraction unit 31 includes an intra-section error two-word set extraction unit 311 and a correct-return two-word set extraction unit 312.

The intra-section error two-word set extraction unit 311 extracts all intra-section error two-word sets from the recognition-error word string. The extracted intra-section error two-word sets are sent to the correct-return two-word set extraction unit 312 and to the intra-section word chain probability calculation unit 32.
An intra-section error two-word set is a pair of two consecutive words within the recognition-error word string.

In the example shown in FIG. 5, (勝つ)(な), (な)(が), and (が)(荒) are the intra-section error two-word sets.

<Step S8>
The correct-return two-word set extraction unit 312 of the intra-section two-word set extraction unit 31 extracts, for each intra-section error two-word set, a correct-return two-word set from that intra-section error two-word set and the correct word string. The extracted correct-return two-word sets are sent to the intra-section word chain probability calculation unit 32.

A correct-return two-word set is a word sequence consisting of the first word of an intra-section error two-word set and the word in the correct word string whose start is temporally later than the start of that first word and temporally closest to the end of that first word.

In the example shown in FIG. 5, the correct-return two-word set corresponding to the intra-section error two-word set (勝つ)(な) is (勝つ)(繋が). That is, the words in the correct word string whose starts are temporally later than the start of (勝つ), the first word of the intra-section error two-word set (勝つ)(な), are (繋が) and (ら). Of these, (繋が) has the start closest in time to the end of (勝つ), because the time from the end of (勝つ) to the start of (繋が) is shorter than the time from the end of (勝つ) to the start of (ら). The correct-return two-word set corresponding to the intra-section error two-word set (勝つ)(な) is therefore (勝つ)(繋が). Similarly, the correct-return two-word set corresponding to the intra-section error two-word set (な)(が) is (な)(ら), and the one corresponding to (が)(荒) is (が)(ら).
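
As a rough sketch (not part of the patent text), steps S7 and S8 can be written over words that carry start and end times; the (word, start, end) tuple layout and the times in the example are assumptions made here for illustration:

```python
def intra_section_pairs(error_words):
    """Step S7: all pairs of consecutive words in the recognition-error word
    string. error_words: list of (word, start_time, end_time) tuples."""
    return list(zip(error_words, error_words[1:]))

def correct_return_pair(first_word, correct_words):
    """Step S8: pair the first word of an intra-section error two-word set
    with the correct word whose start time is later than first_word's start
    time and closest to first_word's end time."""
    _, start, end = first_word
    candidates = [w for w in correct_words if w[1] > start]
    partner = min(candidates, key=lambda w: abs(w[1] - end))
    return (first_word[0], partner[0])

# FIG. 5 example; the times (in seconds) are invented for illustration
error_words = [("勝つ", 0.5, 0.8), ("な", 0.8, 0.9), ("が", 0.9, 1.0), ("荒", 1.0, 1.3)]
correct_words = [("インターネット", 0.0, 0.5), ("が", 0.5, 0.7), ("繋が", 0.7, 1.0),
                 ("ら", 1.0, 1.3), ("ない", 1.3, 1.5)]
for w1, w2 in intra_section_pairs(error_words):
    print(w1[0], w2[0], "->", correct_return_pair(w1, correct_words))
# 勝つ な -> ('勝つ', '繋が')
# な が -> ('な', 'ら')
# が 荒 -> ('が', 'ら')
```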

<Step S9>
The intra-section word chain probability calculation unit 32 uses the same language model as the speech recognition unit 11 to compute the word chain probability of each intra-section error two-word set and the word chain probability of each correct-return two-word set. Each computed word chain probability is sent to the high-intra-section error two-word set extraction unit 33 together with the intra-section error two-word set or correct-return two-word set from which it was computed.

<Step S10>
The high-intra-section error two-word set extraction unit 33 compares the word chain probability of each intra-section error two-word set with the word chain probability of the corresponding correct-return two-word set, and extracts the intra-section error two-word sets whose word chain probability is higher than that of the correct-return two-word set. Such an intra-section error two-word set is called a high-intra-section error two-word set.

Because its word chain probability is higher than that of the corresponding correct-return two-word set, a high-intra-section error two-word set can cause a recognition error to spread. Therefore, extracting the high-intra-section error two-word sets as described above identifies the parts of the language model that are prone to recognition errors; more specifically, among those parts, it identifies the ones that can cause a recognition error to spread.
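
A minimal sketch of steps S9 and S10 (not part of the patent text), symmetrical to the sketch of steps S5 and S6 above; bigram_prob is again a hypothetical stand-in for the recognizer's language model:

```python
def high_intra_section_pairs(pair_list, bigram_prob):
    """Steps S9 and S10: keep the intra-section error two-word sets whose
    word chain probability exceeds that of their correct-return two-word set.

    pair_list: list of (error_pair, correct_return_pair) tuples, e.g.
               [(("勝つ", "な"), ("勝つ", "繋が")), ...]
    bigram_prob(prev, word) returns P(word | prev) under the language model.
    """
    return [error_pair
            for error_pair, correct_pair in pair_list
            if bigram_prob(*error_pair) > bigram_prob(*correct_pair)]
```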

The reason a high-intra-section error two-word set that causes recognition errors to spread is assigned such a high word chain probability is thought to be that the text used as language model training data contains a disproportionately large number of occurrences of that two-word sequence. The language model can therefore be improved by selecting the text used for language model training so that it is not biased toward these high-intra-section error two-word sequences.

[Modification]
The start-part appearance frequency counting unit 24, shown by a dash-dot line in FIG. 1, may compute the appearance frequency of each low-start-part correct two-word set (step S11, FIG. 4). For example, it counts the occurrences of each low-start-part correct two-word set extracted by the low-start-part correct two-word set extraction unit 23 and assigns that count to the set as its appearance frequency. Alternatively, it may assign each low-start-part correct two-word set a ratio as its appearance frequency, for example: appearance frequency of a low-start-part correct two-word set = (count of that low-start-part correct two-word set) / (sum of the counts of all low-start-part correct two-word sets).
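
As an informal sketch (not part of the patent text), the frequency counting of steps S11 and S12 can be done with a counter over the extracted two-word sets:

```python
from collections import Counter

def appearance_frequencies(two_word_sets, as_ratio=False):
    """Count how often each extracted two-word set occurs (steps S11/S12).

    two_word_sets: list of (word1, word2) tuples collected over all
    recognition-error sections. If as_ratio is True, relative frequencies
    are returned instead of raw counts."""
    counts = Counter(two_word_sets)
    if not as_ratio:
        return counts
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}
```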

Providing the start-part appearance frequency counting unit 24 in this way makes it possible to extract the low-start-part correct two-word sets with high appearance frequencies, and thus to narrow down the low-start-part correct two-word sets that should be improved.

Similarly, the intra-section appearance frequency counting unit 34, shown by a dash-dot line in FIG. 1, may compute the appearance frequency of each high-intra-section error two-word set (step S12, FIG. 4). This makes it possible to extract the high-intra-section error two-word sets with high appearance frequencies, and thus to narrow down the high-intra-section error two-word sets that should be improved.

When the above configuration is implemented by a computer, the processing content of the functions of each unit of the speech recognition error analysis apparatus is described by a program. By executing this program on a computer, the functions of the above units are realized on the computer.

That is, by having a CPU sequentially read and execute the programs, the functions of the speech recognition unit 11, the recognition-error section extraction unit 12, the start-part two-word set extraction unit 21, the start-part word chain probability calculation unit 22, the low-start-part correct two-word set extraction unit 23, the start-part appearance frequency counting unit 24, the intra-section two-word set extraction unit 31, the intra-section word chain probability calculation unit 32, the high-intra-section error two-word set extraction unit 33, and the intra-section appearance frequency counting unit 34 are realized. In this case, the CPU functioning as each unit of the speech recognition error analysis apparatus processes data read from a recording medium such as a memory or a hard disk, and stores the processed data on the recording medium.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

As an execution form different from the embodiment described above, the computer may read the program directly from the portable recording medium and execute processing according to the program, or may successively execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has properties that determine the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing may instead be implemented in hardware.

The various processes described above may be executed not only in time sequence according to the description but also in parallel or individually, depending on the processing capacity of the apparatus executing them or as needed. For example, in FIG. 4, the processing of step S3 and the processing of step S4 may be performed in parallel. Similarly, the processing of step S7 and the processing of step S8 may be performed in parallel. The processing of steps S3 to S6 and the processing of steps S7 to S10 may also be performed in parallel.
Needless to say, other modifications are possible without departing from the spirit of the present invention.

FIG. 1 is a functional block diagram of an example of the speech recognition error analysis apparatus.
FIG. 2 is a functional block diagram of an example of the start-part two-word set extraction unit.
FIG. 3 is a functional block diagram of an example of the intra-section two-word set extraction unit.
FIG. 4 is a flowchart illustrating the processing flow of the speech recognition error analysis method.
FIG. 5 is a diagram showing examples of a recognition-error word string, a recognition-error section, a start-part error two-word set, a start-part correct two-word set, intra-section error two-word sets, and correct-return two-word sets.

Explanation of symbols

1 Speech recognition error analysis apparatus
11 Speech recognition unit
12 Recognition-error section extraction unit
21 Start-part two-word set extraction unit
22 Start-part word chain probability calculation unit
23 Low-start-part correct two-word set extraction unit
24 Start-part appearance frequency counting unit
31 Intra-section two-word set extraction unit
32 Intra-section word chain probability calculation unit
33 High-intra-section error two-word set extraction unit
34 Intra-section appearance frequency counting unit
211 Start-part error two-word set extraction unit
212 Start-part correct two-word set extraction unit
311 Intra-section error two-word set extraction unit
312 Correct-return two-word set extraction unit

Claims (10)

1. A speech recognition error analysis apparatus comprising:
a speech recognition unit that performs speech recognition processing on a speech signal using a language model and assigns the word string that is the speech recognition result (hereinafter, the recognized word string);
a recognition-error section extraction unit that extracts from the recognized word string a word string consisting of one word or a run of consecutive words that do not match the correct word string corresponding to the recognized word string (hereinafter, the recognition-error word string), and a recognition-error section consisting of the recognition-error word string and the single word before and after it;
a start-part error two-word set extraction unit that extracts a start-part error two-word set consisting of the first word of the recognition-error section and the first word of the recognition-error word string;
a start-part correct two-word set extraction unit that extracts a start-part correct two-word set consisting of the first word of the recognition-error section and the first word of the correct word string corresponding to the recognition-error word string;
a start-part word chain probability calculation unit that uses the language model to compute the word chain probability of the start-part error two-word set and the word chain probability of the start-part correct two-word set; and
a low-start-part correct two-word set extraction unit that compares the word chain probability of the start-part error two-word set with the word chain probability of the start-part correct two-word set and extracts a start-part correct two-word set whose word chain probability is lower than that of the start-part error two-word set (hereinafter, a low-start-part correct two-word set).
2. The speech recognition error analysis apparatus according to claim 1, further comprising:
a start-part appearance frequency counting unit that obtains the appearance frequency of the low-start-part correct two-word sets.
3. The speech recognition error analysis apparatus according to claim 1 or 2, further comprising:
an intra-section error two-word set extraction unit that extracts from the recognition-error word string all sets of two consecutive words within the recognition-error word string (hereinafter, intra-section error two-word sets);
a correct-return two-word set extraction unit that extracts, for each intra-section error two-word set, a correct-return two-word set consisting of the first word of the intra-section error two-word set and the word in the correct word string whose start is temporally after the start of that first word and temporally closest to the end of that first word;
an intra-section word chain probability calculation unit that uses the language model to compute the word chain probability of each intra-section error two-word set and the word chain probability of each correct-return two-word set; and
a high-intra-section error two-word set extraction unit that compares the word chain probability of each intra-section error two-word set with the word chain probability of the correct-return two-word set corresponding to that intra-section error two-word set and extracts the intra-section error two-word sets whose word chain probability is higher than that of the correct-return two-word set (hereinafter, high-intra-section error two-word sets).
4. The speech recognition error analysis apparatus according to claim 3, further comprising:
an intra-section appearance frequency counting unit that obtains the appearance frequency of the high-intra-section error two-word sets.
5. A speech recognition error analysis method comprising:
a speech recognition step in which a speech recognition unit performs speech recognition processing on a speech signal using a language model and assigns the word string that is the speech recognition result (hereinafter, the recognized word string);
a recognition-error section extraction step in which a recognition-error section extraction unit extracts from the recognized word string a word string consisting of one word or a run of consecutive words that do not match the correct word string corresponding to the recognized word string (hereinafter, the recognition-error word string), and a recognition-error section consisting of the recognition-error word string and the single word before and after it;
a start-part error two-word set extraction step in which a start-part error two-word set extraction unit extracts a start-part error two-word set consisting of the first word of the recognition-error section and the first word of the recognition-error word string;
a start-part correct two-word set extraction step in which a start-part correct two-word set extraction unit extracts a start-part correct two-word set consisting of the first word of the recognition-error section and the first word of the correct word string corresponding to the recognition-error word string;
a start-part word chain probability calculation step in which a start-part word chain probability calculation unit uses the language model to compute the word chain probability of the start-part error two-word set and the word chain probability of the start-part correct two-word set; and
a low-start-part correct two-word set extraction step in which a low-start-part correct two-word set extraction unit compares the word chain probability of the start-part error two-word set with the word chain probability of the start-part correct two-word set and extracts a start-part correct two-word set whose word chain probability is lower than that of the start-part error two-word set (hereinafter, a low-start-part correct two-word set).
6. The speech recognition error analysis method according to claim 5, further comprising:
a start-part appearance frequency counting step in which a start-part appearance frequency counting unit obtains the appearance frequency of the low-start-part correct two-word sets.
7. The speech recognition error analysis method according to claim 5 or 6, further comprising:
an intra-section error two-word set extraction step in which an intra-section error two-word set extraction unit extracts from the recognition-error word string all sets of two consecutive words within the recognition-error word string (hereinafter, intra-section error two-word sets);
a correct-return two-word set extraction step in which a correct-return two-word set extraction unit extracts, for each intra-section error two-word set, a correct-return two-word set consisting of the first word of the intra-section error two-word set and the word in the correct word string whose start is temporally after the start of that first word and temporally closest to the end of that first word;
an intra-section word chain probability calculation step in which an intra-section word chain probability calculation unit uses the language model to compute the word chain probability of each intra-section error two-word set and the word chain probability of each correct-return two-word set; and
a high-intra-section error two-word set extraction step in which a high-intra-section error two-word set extraction unit compares the word chain probability of each intra-section error two-word set with the word chain probability of the correct-return two-word set corresponding to that intra-section error two-word set and extracts the intra-section error two-word sets whose word chain probability is higher than that of the correct-return two-word set (hereinafter, high-intra-section error two-word sets).
8. The speech recognition error analysis method according to claim 7, further comprising:
an intra-section appearance frequency counting step in which an intra-section appearance frequency counting unit obtains the appearance frequency of the high-intra-section error two-word sets.
9. A speech recognition error analysis program for causing a computer to function as each unit of the speech recognition error analysis apparatus according to any one of claims 1 to 4.
10. A computer-readable recording medium on which the speech recognition error analysis program according to claim 9 is recorded.
JP2008038468A 2008-02-20 2008-02-20 Speech recognition error analysis apparatus, method, program, and recording medium therefor Expired - Fee Related JP4829910B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008038468A JP4829910B2 (en) 2008-02-20 2008-02-20 Speech recognition error analysis apparatus, method, program, and recording medium therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2008038468A JP4829910B2 (en) 2008-02-20 2008-02-20 Speech recognition error analysis apparatus, method, program, and recording medium therefor

Publications (2)

Publication Number Publication Date
JP2009198646A (en) 2009-09-03
JP4829910B2 (en) 2011-12-07

Family

ID=41142221

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2008038468A Expired - Fee Related JP4829910B2 (en) 2008-02-20 2008-02-20 Speech recognition error analysis apparatus, method, program, and recording medium therefor

Country Status (1)

Country Link
JP (1) JP4829910B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8560318B2 (en) * 2010-05-14 2013-10-15 Sony Computer Entertainment Inc. Methods and system for evaluating potential confusion within grammar structure for set of statements to be used in speech recognition during computing event
JP6026224B2 (en) * 2012-10-29 2016-11-16 Kddi株式会社 Pattern recognition method and apparatus, pattern recognition program and recording medium therefor

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3440840B2 (en) * 1998-09-18 2003-08-25 松下電器産業株式会社 Voice recognition method and apparatus
JP4103639B2 (en) * 2003-03-14 2008-06-18 セイコーエプソン株式会社 Acoustic model creation method, acoustic model creation device, and speech recognition device

Also Published As

Publication number Publication date
JP2009198646A (en) 2009-09-03

Legal Events

Date          Code   Title / Description
2010-01-14    A621   Written request for application examination
2011-05-24    A977   Report on retrieval
2011-06-07    A131   Notification of reasons for refusal
2011-07-11    A521   Request for written amendment filed
2011-07-29    RD03   Notification of appointment of power of attorney
              TRDD   Decision of grant or rejection written
2011-09-06    A01    Written decision to grant a patent or to grant a registration (utility model)
2011-09-16    A61    First payment of annual fees (during grant procedure)
              FPAY   Renewal fee payment (payment until 2014-09-22; year of fee payment: 3)
              R150   Certificate of patent or registration of utility model (ref document number: 4829910; country of ref document: JP)
              S531   Written request for registration of change of domicile
              R350   Written notification of registration of transfer
              LAPS   Cancellation because of no payment of annual fees