JP2006084706A

JP2006084706A - Voice recognition performance estimating method, device, and program, recognition trouble word extracting method, device, and program, and recording medium

Info

Publication number: JP2006084706A
Application number: JP2004268590A
Authority: JP
Inventors: Noboru Miyazaki; 昇宮崎; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-09-15
Filing date: 2004-09-15
Publication date: 2006-03-30
Anticipated expiration: 2024-09-15
Also published as: JP4336282B2

Abstract

PROBLEM TO BE SOLVED: To provide a recognition performance estimating device capable of evaluating the performance of a word list used for voice recognition and a recognition trouble word extracting device capable of extracting a word causing recognition trouble from words included in the word list.

SOLUTION: The voice recognition performance estimating device 100 is equipped with a word similarity calculating means 104 of inputting two words which are an object word and an opponent word to calculate the word similarity; a word misrecognition score calculating means 103 of inputting the object word and word list to calculate a word misrecognition score; a mean word misrecognition score calculating means 102 of inputting the word list and calculating the mean word misrecognition score of all the words included in the word list; and a voice recognition rate estimating means 101 of inputting three elements which are the number of the words included in the word list, the mean word misrecognition score, and a candidate number narrowing-down reference value set when a solution candidate is searched for in voice recognition processing and calculating an estimated value of a voice recognition rate.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to the technical field of speech recognition, and in particular to predicting the performance of a word list used for speech recognition and to designing a recognition vocabulary.

In the field of speech recognition performance estimation, which estimates the performance of a speech recognition device before recognition is actually carried out, one conventional technique (Non-Patent Document 1) automatically estimates the misrecognition tendency of phonemes from the parameters of the acoustic model used for recognition, calculates the acoustic similarity between two words, derives a word-to-word similarity from it, and estimates the recognition rate from the similarity to the opposing words.
Conventionally, the recognition vocabulary used for speech recognition has been designed to consist of the words that the user is expected to utter.
Non-Patent Document 1: "A Study on Model-Based Error Rate Estimation for Automatic Speech Recognition", IEEE Transactions on Speech and Audio Processing, Volume 11, No 6, November 2003.

When speech recognition technology is put to practical use, the accuracy of the recognition results must be estimated to some extent in advance in order to assess the quality of the service to be provided. The usual way to do this is to collect, as test data, speech recorded in an environment similar to that of the speech that will be input once the service is running, and to run recognition experiments on it. However, collecting speech in an environment similar to the actual service requires considerable cost and time, so a technique that estimates recognition accuracy without recording speech is needed.
The method of Non-Patent Document 1 described in the background art estimates recognition accuracy using only the recognition vocabulary and the acoustic model, and therefore requires no speech recording.

On the other hand, when speech recognition technology is put to practical use, recognition speed is as important a factor as the accuracy of the recognition result. For example, in a voice response system that accepts speech input and outputs an appropriate response, if the result is not available until several seconds after the input speech has ended, the response drags and gives an unnatural impression, and usability drops markedly.
Speech recognition methods in wide use today divide the input speech waveform into short units of several milliseconds, estimate which phoneme is being uttered in each unit from cues such as the frequency characteristics within that short interval, enumerate the word candidates containing that phoneme as a word list, and take the candidate in the word list that best fits the input as the recognition result.

If every word to be recognized is enumerated when this word list is built, the amount of computation increases and it may become difficult to obtain the recognition result in real time. It is known that if the words to be recognized are screened by a fixed criterion and candidates are searched only within a limited number of words, the amount of computation is reduced and the result can be obtained in real time. However, when the number of candidates is limited, correct recognition is impossible whenever the correct answer is not among the candidates, so recognition accuracy falls. In other words, changing the reference value that determines how many search candidates are retained during recognition can raise recognition speed, but at the cost of lower recognition accuracy.

For this reason, when speech recognition is put to practical use, the number of search candidates retained is generally changed according to whether recognition speed or recognition accuracy is given priority.
Conventional techniques such as that of Non-Patent Document 1 cannot estimate how recognition accuracy changes with the number of retained search candidates. Consequently, when speech recognition is used in a service that requires fast response output, for example, recognition accuracy could not be estimated in advance with sufficient reliability.
In addition, the more accustomed users become to speaking to a computer system, the more they produce fillers and hesitations such as "eeto" (えーと), "anoo" (あのー), and "a" (あっ). That is, the targets of speech recognition include not only important keywords such as personal names and place names, but also unimportant words that can nevertheless appear in the user's utterances.

Meanwhile, the accuracy of speech recognition tends to fall when the recognition vocabulary (word list) contains many words with similar pronunciations. For example, an utterance of the name "Eto" (江藤) is rarely recognized as "anoo" (あのー), but is often misrecognized as "eetto" (えーっと). If the recognition vocabulary is designed to include every unimportant word that can appear in the user's utterances, it will contain words whose pronunciation closely resembles important keywords such as personal names and place names, and the recognition rate for those important keywords may fall markedly.
In conventional recognition vocabulary design, every word expected to appear was placed in the word list. It was therefore impossible to decide whether an unimportant but possible word should be included in the word list on the basis of whether it would adversely affect the recognition of important keywords.

The present invention proposes a speech recognition performance estimation method and device capable of estimating the performance of a word list used for speech recognition. To this end, the invention proposes a speech recognition performance estimation device that executes the following processes: a word similarity calculation process that receives two words, a target word and an opposing word, and calculates a word similarity indicating the degree to which an utterance of the target word is recognized as the opposing word when speech recognition is performed; a word misrecognition score calculation process that receives a target word and an estimated word list whose performance is to be estimated, and calculates a word misrecognition score indicating the degree to which an utterance of the target word is recognized as any word in the estimated word list; an average word misrecognition score calculation process that receives the estimated word list and calculates the average word misrecognition score over all words in the list; and a speech recognition rate estimation process that takes three inputs, namely the number of words in the estimated word list, the average word misrecognition score, and the candidate number narrowing reference value set when searching for solution candidates in speech recognition, and calculates an estimate of the speech recognition rate.

The present invention also proposes a recognition trouble word extraction method that executes a recognition trouble word search process together with the speech recognition performance estimation method described above. For each word in the estimated word list given as input, the search process uses the speech recognition performance estimation method to estimate the speech recognition performance of a trouble-search word list consisting of all the words that remain after that word is removed from the estimated word list. It then selects a fixed number of trouble-search word lists in descending order of speech recognition performance and outputs the removed word corresponding to each selected list as a recognition trouble word.

A recognition trouble word extraction method similar to the one proposed here could also be realized with a prior-art speech recognition performance estimation method. However, extracting recognition trouble words correctly requires that the underlying performance estimation method be highly accurate. Prior-art estimation may fail to estimate recognition accuracy with sufficient reliability, whereas the recognition trouble word extraction method of the present invention uses the more reliable speech recognition performance estimation method of the present invention, and is therefore clearly the superior extraction method.

According to the speech recognition performance estimation method of the present invention, the speech recognition rate is calculated from the number of words, the average word misrecognition score, and the candidate narrowing reference value.
Thus, for example, if the reference value is set as the maximum number of candidates retained during narrowing, and the estimation process is arranged so that the estimated recognition rate is higher when the maximum number of candidates is large and lower when it is small, the speech recognition rate can be estimated according to the candidate narrowing reference value.
Alternatively, the reference value may be defined as a threshold that a candidate's speech recognition score (such as its acoustic score or language score) must equal or exceed, or as an upper limit on the difference between a candidate's speech recognition score and that of the highest-scoring opposing candidate. In either case, as with the maximum number of candidates, the change in the speech recognition rate according to the reference value can be estimated.

Furthermore, according to the recognition trouble word extraction method of the present invention, words that adversely affect speech recognition performance can be extracted from among the words other than the extraction-blocked words designated as important keywords. This makes it possible to design a recognition vocabulary by deleting, from among the unimportant words that may appear, those that harm speech recognition, while taking into account their effect on the recognition performance of the important keywords.

In the best mode, the speech recognition performance estimation device and the recognition trouble word extraction device according to the present invention are realized by installing the speech recognition performance estimation program and the recognition trouble word extraction program proposed by the invention on a computer and having the computer execute them. The computer then functions as a speech recognition performance estimation device that estimates the recognition rate of the estimated word list, and as a recognition trouble word extraction device that extracts from the estimated word list the words that obstruct recognition.
When the computer functions as the speech recognition performance estimation device, it comprises at least a word similarity calculation means, a word misrecognition score calculation means, an average word misrecognition score calculation means, and a speech recognition rate estimation means, and uses these means to calculate an estimate of the speech recognition rate of the estimated word list.

When the computer functions as the recognition trouble word extraction device, it comprises a recognition trouble word search means in addition to the configuration of the speech recognition performance estimation device described above. Using the speech recognition rates estimated by the speech recognition performance estimation device, the recognition trouble word search means selects a fixed number of word lists in descending order of speech recognition rate and outputs the removed word corresponding to each selected word list as a recognition trouble word.

An embodiment of the speech recognition performance estimation device according to the present invention is described in detail below with reference to FIGS. 1 to 3.
FIG. 1 shows the overall configuration of the speech recognition performance estimation device 100. The device 100 comprises a speech recognition rate estimation means 101, an average word misrecognition score calculation means 102, a word misrecognition score calculation means 103, and a word similarity calculation means 104. It takes the candidate number narrowing reference value and the estimated word list as input and outputs the speech recognition rate r.
The speech recognition rate estimation means 101 receives the candidate number narrowing reference value and the estimated word list L_input, passes L_input to the average word misrecognition score calculation means 102, and receives the average word misrecognition score as the result.

In the speech recognition performance estimation device according to the present invention, the candidate number narrowing reference value is the maximum number of solution candidates enumerated at each point of the search. With the average word misrecognition score denoted x, the total number of words in the estimated word list L_input denoted y, and the candidate number narrowing reference value denoted z, the device 100 outputs the value r calculated by equation (1) below as the speech recognition rate.
r = a₁x + a₂y + a₃(y/z) + a₄ … (1)
Here, the value of each coefficient a_n is assumed to have been adjusted in advance according to the sound environment in which speech recognition is performed and to the acoustic model and speech recognition method used.
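As a minimal sketch, equation (1) can be evaluated as follows. The function name and the coefficient values are hypothetical and chosen only for illustration; the description states only that the coefficients are tuned in advance.

```python
def estimate_recognition_rate(x, y, z, coeffs):
    """Evaluate equation (1): r = a1*x + a2*y + a3*(y/z) + a4.

    x: average word misrecognition score
    y: number of words in the estimated word list
    z: candidate number narrowing reference value (maximum candidates)
    coeffs: (a1, a2, a3, a4), assumed to be tuned beforehand
    """
    if z <= 0:
        raise ValueError("z must be positive")
    a1, a2, a3, a4 = coeffs
    return a1 * x + a2 * y + a3 * (y / z) + a4


# Hypothetical coefficients, not taken from the patent.
coeffs = (-0.05, -0.0001, -0.002, 0.98)
r = estimate_recognition_rate(x=1.2, y=1000, z=6, coeffs=coeffs)
```

With a negative a₃, as assumed here, a larger maximum candidate count z yields a higher estimate, which is the behaviour the description expects of the estimation process.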

The average word misrecognition score calculation means 102 receives the estimated word list L_input. For each word in L_input, it takes that word as the target word W_i, creates a new word list L_i containing all the other words, passes both to the word misrecognition score calculation means 103, and receives the word misrecognition score SCA as the result. In the present invention, the mean of the SCA values obtained for the individual words is output as the average word misrecognition score x.
The present invention further proposes, as another way of calculating the average word misrecognition score x, outputting the sum over all words of each word's SCA value multiplied by that word's occurrence probability.
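The two ways of computing x described above can be sketched as follows. The `score_fn` callback stands in for the word misrecognition score calculation means 103; its name and signature, like `occurrence_prob`, are assumptions made for this illustration.

```python
def average_misrecognition_score(words, score_fn, occurrence_prob=None):
    """Return the average word misrecognition score x for a word list.

    words: the estimated word list L_input
    score_fn(target, others): hypothetical stand-in for means 103,
        returning the score SCA of `target` against the list `others`
    occurrence_prob: optional mapping word -> occurrence probability;
        when given, the probability-weighted sum is returned instead
        of the plain mean
    """
    if not words:
        raise ValueError("word list must not be empty")
    scores = []
    for i, target in enumerate(words):
        # L_i: every word of L_input except the target word W_i
        others = list(words[:i]) + list(words[i + 1:])
        scores.append(score_fn(target, others))
    if occurrence_prob is None:
        return sum(scores) / len(scores)
    return sum(occurrence_prob[w] * s for w, s in zip(words, scores))
```

The weighted variant reduces to the plain mean when every word has the same occurrence probability 1/N.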

As shown in FIG. 2, the word misrecognition score calculation means 103 comprises an important opposing word extraction unit 201, an inter-word distance calculation unit 202, and a score conversion unit 203. It takes the target word W_i and the word list L_i as input and outputs the word misrecognition score SCA.
The word misrecognition score calculation means 103 receives the target word W_i and the word list L_i. It treats each word in L_i as an opposing word W_k, passes it together with the target word W_i to the word similarity calculation means 104, and receives the word similarity J as the result. The important opposing word extraction unit 201 selects, from all the words in L_i, a fixed number of words in descending order of word similarity J, and passes them as the important opposing word list L_imsb, together with the target word W_i, to the inter-word distance calculation unit 202.

The inter-word distance calculation unit 202 calculates the logarithm of the word similarity of the target word W_i with itself. For each word in the important opposing word list L_imsb, it subtracts the logarithm of that word's similarity to W_i from the logarithm of the self-similarity, and passes the result to the score conversion unit 203 as the inter-word distance.
The score conversion unit 203 converts the inter-word distance d of each word in the important opposing word list L_imsb so that a smaller d maps to a larger value and a larger d maps to a smaller value, then sums the converted values and outputs the sum as the word misrecognition score SCA. This reflects the fact that the smaller the inter-word distance, the more important the opposing word is as a word for which the target word may be mistaken. Various functions can serve as the conversion function; for example, the sigmoid function shown in equation (2) can be used.
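The processing of units 201 to 203 can be sketched as follows. Equation (2) is not reproduced in this text, so the decreasing sigmoid used here, 1 / (1 + exp(α(d − β))), is an assumed form. The `similarity` callback (standing in for means 104), the default `top_n`, and the default constants are likewise hypothetical.

```python
import math


def word_misrecognition_score(target, others, similarity,
                              top_n=2, alpha=1.0, beta=0.0):
    """Return the word misrecognition score SCA of `target` against `others`.

    similarity(a, b): hypothetical stand-in for means 104, returning a
        positive probability-like similarity of target a to word b
    top_n: number of important opposing words kept (unit 201)
    alpha, beta: sigmoid constants, assumed to be tuned beforehand
    """
    # Unit 201: keep the top_n words most similar to the target.
    ranked = sorted(others, key=lambda w: similarity(target, w), reverse=True)
    important = ranked[:top_n]

    # Unit 202: distance = log self-similarity - log similarity to the word.
    self_log = math.log(similarity(target, target))
    score = 0.0
    for word in important:
        d = self_log - math.log(similarity(target, word))
        # Unit 203: assumed decreasing sigmoid, larger for smaller d.
        score += 1.0 / (1.0 + math.exp(alpha * (d - beta)))
    return score
```

Any monotonically decreasing conversion would satisfy the description; the sigmoid bounds each opposing word's contribution between 0 and 1.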

Here, the constants α and β are assumed to have been adjusted in advance according to factors such as the sound environment in which speech recognition is performed.
As shown in FIG. 3, the word similarity calculation means 104 comprises a phoneme string generation unit 301, a phoneme duration assignment unit 302, a phoneme alignment unit 303, a dictionary 304, phoneme duration data 305, and a phoneme confusion matrix 306. It takes the target word W_i and the opposing word W_k as input and outputs the word similarity J.

The phoneme string generation unit 301 takes the target word W_i and the opposing word W_k as input and creates the corresponding phoneme sequence from the pronunciation information of each. It outputs the phoneme string W_ion corresponding to the target word W_i to the phoneme duration assignment unit 302, and the phoneme string W_kon corresponding to the opposing word W_k to the phoneme alignment unit 303.
The phoneme duration assignment unit 302 receives the phoneme string W_ion of the target word W_i, assigns to each of its phonemes the average duration of that phoneme given by the phoneme duration data 305, and outputs the result to the phoneme alignment unit 303 as the phoneme string W_iont with duration information.

The phoneme alignment unit 303 receives the phoneme string W_iont with duration information corresponding to the target word W_i and the phoneme string W_kon corresponding to the opposing word W_k. At each time point during which a phoneme of W_iont continues, it associates one of the phonemes of W_kon.
FIG. 4 shows an example of the association. The association is performed by dynamic programming so as to maximize the value obtained by accumulating, over the phoneme pairs at all time points, the probability that the phoneme on the target word side is recognized as the phoneme on the opposing word side. The probability that one phoneme is misrecognized as another is given by the phoneme confusion matrix 306. For every pair of phonemes, the matrix holds the probability that, when one phoneme is uttered, it is recognized as the other. It is prepared in advance according to the sound environment in which speech recognition is performed and the acoustic model used.

After the phonemes have been associated, the phoneme alignment unit 303 normalizes the probability value accumulated over the phoneme pairs at all time points by the total duration of the target word, and outputs the result as the word similarity.
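The alignment and normalization can be sketched with a small dynamic program. Several details are assumptions made for this illustration rather than statements of the patent: durations are whole frames, the alignment is monotonic and uses every phoneme of the opposing word at least once, probabilities are accumulated as a sum of logarithms, and normalization divides that log value by the number of frames.

```python
import math


def word_log_similarity(target, opposing, confusion):
    """Return the duration-normalized log similarity of target to opposing.

    target: list of (phoneme, duration_in_frames) for the target word
    opposing: list of phonemes of the opposing word
    confusion[p][q]: probability that uttered phoneme p is recognized as q
    """
    # Expand the target word into one phoneme label per frame.
    frames = [p for p, dur in target for _ in range(dur)]
    n, m = len(frames), len(opposing)
    if m == 0 or n < m:
        raise ValueError("target needs at least one frame per opposing phoneme")

    neg_inf = float("-inf")

    def logp(p, q):
        prob = confusion[p][q]
        return math.log(prob) if prob > 0 else neg_inf

    # best[j]: best accumulated log-probability with the current frame
    # aligned to opposing[j] and all earlier frames aligned monotonically.
    best = [neg_inf] * m
    best[0] = logp(frames[0], opposing[0])
    for t in range(1, n):
        new = [neg_inf] * m
        for j in range(m):
            # Stay on the same opposing phoneme or advance by one.
            prev = best[j] if j == 0 else max(best[j], best[j - 1])
            if prev > neg_inf:
                new[j] = prev + logp(frames[t], opposing[j])
        best = new

    # Normalize by the total duration of the target word.
    return best[m - 1] / n
```

Working in the log domain avoids underflow for long words and gives values that are negative and closer to 0 for more similar words.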

FIG. 5 shows an embodiment of the recognition trouble word extraction device according to the present invention. Reference numeral 501 in FIG. 5 denotes the recognition trouble word extraction device, which consists of a recognition trouble word search device 502 and the speech recognition performance estimation device 100.
The recognition trouble word search device 502 receives the candidate number narrowing reference value, the estimated word list L_input, and the extraction-blocked word list L_off. For each word W_i in L_input, it passes the new word list L_i, obtained by removing W_i from L_input, together with the candidate number narrowing reference value to the speech recognition performance estimation device 100, and receives from it the speech recognition rate r_i of the new word list L_i. Using these rates r_i as clues, it searches for recognition trouble words and outputs the recognition trouble word W_off. Here, the extraction-blocked word list L_off is a list of the important words among those in L_input: the important keywords, such as place names and personal names, needed to run the various services that use speech recognition. By blocking extraction of the words in this list, the word list can be adjusted without impairing the operation of the service.

The operation of the recognition trouble word extraction device 501 is described below using the operation flow shown in FIG. 6.
In step 601, the recognition trouble word extraction device 501 receives the candidate number narrowing reference value, the estimated word list L_input consisting of N words, and the extraction-blocked word list L_off consisting of M words. In step 602, it starts a loop in which each word in L_input other than the extraction-blocked words is taken in turn as the target word W_i (i = 0 to N−M−1).
In step 603, inside the loop, it creates the word list L_i (FIG. 9) by removing the target word W_i from L_input. In step 604, it inputs L_i and the candidate number narrowing reference value to the speech recognition performance estimation device 100 and obtains the speech recognition rate r.

When the loop ends, a fixed number of word lists L_i are selected in step 606 in descending order of speech recognition performance, and in step 607 the target words W_i corresponding to the selected word lists L_i are output as recognition trouble words.
Specific operation examples of each part are described below with reference to the drawings.
FIG. 7 is an example of the estimated word list L_input that is input to the speech recognition performance estimation device 100 according to the present invention. FIG. 8 is an example of the probabilities prepared in the phoneme confusion matrix 306. Japanese is generally considered to have around 40 phonemes, but for simplicity only the phonemes contained in the words of FIG. 7 and "pause", which denotes a silent interval, are considered here.

In the following example, the candidate number narrowing reference value is defined as the maximum number of candidates kept when narrowing down candidates, and its value is set to 6.
On receiving the estimated word list L_input, the average word misrecognition score calculation means 102 (FIG. 1) takes each word in L_input in turn as the target word W_i and passes it, together with the remaining words as a new word list L_i, to the word misrecognition score calculation means 103. Each row of FIG. 9 shows an example of a pair of target word W_i and word list L_i. Because the estimated word list L_input contains four words, four pairs of target word W_i and word list L_i are created in this case.
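The pairing of each target word with the list of remaining words can be sketched as follows. This is only an illustration: the function name is hypothetical, and the romanized spellings "aka", "asa", "ika", "ushi" are assumed stand-ins for the four example words of FIG. 7.

```python
def leave_one_out_pairs(words):
    """Return (target_word, remaining_words) for every position in the list."""
    return [(w, words[:i] + words[i + 1:]) for i, w in enumerate(words)]


# Assumed romanizations of the four example words (not spelled out in the text).
WORDS = ["aka", "asa", "ika", "ushi"]
pairs = leave_one_out_pairs(WORDS)

for target, remaining in pairs:
    print(target, remaining)
```

Removing by position rather than by value keeps the sketch correct even if a list happens to contain the same spelling twice.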

The important opposing word extraction unit 201 (FIG. 2) in the word misrecognition score calculation means 103 receives the target word W_i and word list L_i corresponding to each row of FIG. 9. For each word in the word list L_i, it calculates the word similarity to the target word W_i using the word similarity calculation means 104, and extracts a fixed number of the words most similar to W_i as the important opposing word list L_imsb. Each row of FIG. 10 shows an example of a target word W_i, its important opposing word list L_imsb, and their word similarities to W_i. In this example the number of important opposing words extracted is 2. The word similarity is expressed as the logarithm of a probability value and is therefore negative, but a larger value (closer to 0) means the word is closer to the target word W_i. The number entered in the target word W_i column is the similarity of the target word W_i to itself.
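A minimal sketch of this selection step, assuming the similarity is available as a callable that returns a log-probability-like value. The function name and the toy similarity table are hypothetical; only the value −0.57 for "asa" against "aka" is taken from the description.

```python
import heapq


def important_opposing_words(target, word_list, similarity, k=2):
    """Pick the k words of word_list with the largest similarity to target.

    similarity(target, other) is assumed to return a log value (<= 0);
    values closer to 0 mean the words are easier to confuse.
    """
    return heapq.nlargest(k, word_list, key=lambda w: similarity(target, w))


# Toy log-similarities: -0.57 is from the text, the other two are invented.
TOY_SIMILARITY = {("aka", "asa"): -0.57, ("aka", "ika"): -0.90, ("aka", "ushi"): -2.50}

top_two = important_opposing_words(
    "aka", ["asa", "ika", "ushi"], lambda t, w: TOY_SIMILARITY[(t, w)]
)
print(top_two)
```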

The inter-word distance calculation unit 202 (FIG. 2) receives the target word W_i and the important opposing word list L_imsb and, for each word in L_imsb, calculates the inter-word distance by subtracting the logarithmic value of its word similarity from the logarithmic value of the word similarity of the target word W_i to itself. Each row of FIG. 11 shows an example of the inter-word distances of the important opposing words. For example, when the target word W_i is "aka" (赤, "red"), the inter-word distance of the important opposing word "asa" (朝, "morning") is −0.21 − (−0.57) = 0.36. Thus, the closer a word is to the target word W_i, the smaller its inter-word distance.

The score conversion unit 203 converts the inter-word distance of each important opposing word with a conversion function so that smaller distances map to larger values and larger distances map to smaller values. Each row of FIG. 12 shows an example of the converted inter-word distances for the important opposing word list L_imsb. In this example the conversion uses the sigmoid function given in Equation (2) with α = 10 and β = 0.4. These values of α and β are not absolute and should be tuned to the environment in which speech recognition is used. The converted values are then summed to give the word misrecognition score SCA for each target word W_i. Each row of FIG. 13 shows an example of the word misrecognition score SCA for a target word W_i.
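The chain from log-similarities to a word misrecognition score can be sketched as below. Equation (2) is not reproduced in this passage, so the sigmoid form used here, 1 / (1 + exp(α(d − β))), is an assumption chosen only because it decreases as the distance d grows; all function names are hypothetical.

```python
import math


def inter_word_distance(self_log_sim, opposing_log_sim):
    """Distance = log-similarity of the target to itself minus that of the opposing word."""
    return self_log_sim - opposing_log_sim


def convert(distance, alpha=10.0, beta=0.4):
    """ASSUMED sigmoid form: small distances map near 1, large distances near 0."""
    return 1.0 / (1.0 + math.exp(alpha * (distance - beta)))


def misrecognition_score(self_log_sim, opposing_log_sims, alpha=10.0, beta=0.4):
    """Sum of converted distances over the important opposing words."""
    return sum(
        convert(inter_word_distance(self_log_sim, s), alpha, beta)
        for s in opposing_log_sims
    )


# Worked value from the description: "aka" against "asa".
d = inter_word_distance(-0.21, -0.57)
print(round(d, 2))  # 0.36
```

Under this assumed form, β acts as the distance at which the converted value is 0.5 and α controls how sharply the value falls off around that point.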

The average word misrecognition score calculation means 102 takes the average of the word misrecognition scores SCA in the rows of FIG. 13 and outputs it as the average word misrecognition score. In this example, 0.8 + 0.6 + 0.17 + 0.0024 is divided by 4, giving an average word misrecognition score of 0.394.
The speech recognition rate estimation device 100 then calculates the speech recognition rate with Equation (1) from the obtained average word misrecognition score 0.394, the number of words 4, and the candidate number narrowing reference value 6. If the values of a_n in Equation (1) are set to, for example, a_1 = −15, a_2 = −0.2, a_3 = 0.8, and a_4 = 80, then
−15 × 0.394 − 0.2 × 4 + 0.8 × (4/6) + 80 = 73.8 ... Equation (3)
and the speech recognition rate is estimated as 73.8%.
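The arithmetic of Equation (3) can be checked with a short sketch. The function names are hypothetical, and the linear form (weighted sum of the average score, the word count, and the word count divided by the maximum candidate count, plus a constant) is read off Equation (3). The coefficient a_1 = −15 is the value that appears in Equation (3) and reproduces 73.8. Note that the rounded per-word scores quoted in the text average to about 0.393, so the sketch feeds in the 0.394 that the description itself uses.

```python
def average_score(scores):
    """Plain arithmetic mean of the per-word misrecognition scores."""
    return sum(scores) / len(scores)


def estimate_recognition_rate(avg_score, n_words, max_candidates,
                              coeffs=(-15.0, -0.2, 0.8, 80.0)):
    """Weighted linear sum plus a constant, following the shape of Equation (3)."""
    a1, a2, a3, a4 = coeffs
    return a1 * avg_score + a2 * n_words + a3 * (n_words / max_candidates) + a4


rate = estimate_recognition_rate(0.394, n_words=4, max_candidates=6)
print(round(rate, 1))  # 73.8
```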

These values of a_n are not absolute and should be tuned to the environment in which speech recognition is used.
The word similarity calculation means 104 (FIG. 3) receives the target word W_i and the opposing word W_k, and the phoneme string generation unit 301 first generates the phoneme string corresponding to each word. In the example shown in FIG. 14, it receives "aka" (赤) and "asa" (朝), looks up the corresponding phoneme strings /a/, /k/, /a/ and /a/, /s/, /a/ in the dictionary, and generates them with pause segments added before and after.
Next, for the phoneme string W_ion corresponding to the target word W_i, the phoneme duration assignment unit 302 (FIG. 3) assigns a duration to each phoneme. In the example of FIG. 15 the target word W_i is "aka", so durations are assigned to /pause/, /a/, /k/, /a/, /pause/.
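The two preparation steps (dictionary lookup with surrounding pauses, then duration assignment) might look like the following sketch. The dictionary contents follow the /a/ /k/ /a/ and /a/ /s/ /a/ strings given in the description; the duration values, the unit (frames), and all names are assumptions made purely for illustration.

```python
# Pronunciations from the description; keys are romanized stand-ins.
DICTIONARY = {"aka": ["a", "k", "a"], "asa": ["a", "s", "a"]}

# HYPOTHETICAL average durations in frames; the real data (305) is not given here.
AVERAGE_FRAMES = {"pause": 3, "a": 4, "k": 2, "s": 3}


def phoneme_string(word):
    """Look up the word and add a pause segment before and after."""
    return ["pause"] + DICTIONARY[word] + ["pause"]


def with_durations(phonemes):
    """Attach the average duration to each phoneme of the target word."""
    return [(p, AVERAGE_FRAMES[p]) for p in phonemes]


def expand_to_frames(timed_phonemes):
    """Repeat each phoneme once per frame so every time point has a label."""
    return [p for p, n in timed_phonemes for _ in range(n)]


target = with_durations(phoneme_string("aka"))
frames = expand_to_frames(target)
print(target)
```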

Next, the phoneme alignment unit 303 (FIG. 3) associates each time point at which a phoneme of the phoneme string W_ion of the target word W_i continues with one of the phonemes of the phoneme string W_kon of the opposing word W_k. FIG. 16 shows an example of such an association. Many different associations are possible; the association is determined by dynamic programming so that the accumulated value of the probabilities given in the phoneme confusion matrix 306 for the associated phonemes is maximized.
The probability of each associated phoneme pair is explained using phoneme correspondence 1601 in FIG. 16 as an example. In phoneme correspondence 1601, the phoneme of the target word W_i is /a/ and the phoneme of the opposing word W_k is /s/, so the value 0.03 in the /a/ row and /k/ column of the phoneme confusion matrix 306 given in FIG. 8 is the probability value for this pair.
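A minimal dynamic-programming sketch of this alignment is given below. It is not the patented implementation: it assumes that the accumulated value is a product of confusion probabilities (so logs are summed), that the alignment is monotonic with every opposing phoneme covering at least one time step, and that the result is divided by the number of time steps as the description's normalization by time length suggests. The confusion values are invented except for 0.03 in the /a/ row and /k/ column.

```python
import math


def word_similarity(target_frames, opposing_phonemes, confusion):
    """Best monotonic alignment score in the log domain, divided by the frame count.

    target_frames:      target phonemes, one entry per time step
    opposing_phonemes:  phoneme string of the opposing word
    confusion[t][o]:    ASSUMED layout, P(phoneme t is recognized as o), all > 0
    """
    n_frames, n_opp = len(target_frames), len(opposing_phonemes)
    if n_frames < n_opp:
        raise ValueError("target has fewer time steps than opposing phonemes")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_opp for _ in range(n_frames)]
    best[0][0] = math.log(confusion[target_frames[0]][opposing_phonemes[0]])
    for t in range(1, n_frames):
        for j in range(min(t + 1, n_opp)):
            prev = best[t - 1][j]                        # stay on the same opposing phoneme
            if j > 0:
                prev = max(prev, best[t - 1][j - 1])     # or advance to the next one
            if prev > neg_inf:
                p = confusion[target_frames[t]][opposing_phonemes[j]]
                best[t][j] = prev + math.log(p)
    return best[-1][-1] / n_frames


# Toy confusion matrix (rows: spoken, columns: recognized); values are hypothetical.
CONFUSION = {
    "pause": {"pause": 0.94, "a": 0.02, "k": 0.02, "s": 0.02},
    "a": {"pause": 0.02, "a": 0.90, "k": 0.03, "s": 0.05},
    "k": {"pause": 0.05, "a": 0.05, "k": 0.60, "s": 0.30},
}
AKA_FRAMES = ["pause"] * 2 + ["a"] * 3 + ["k"] * 2 + ["a"] * 3 + ["pause"] * 2

self_sim = word_similarity(AKA_FRAMES, ["pause", "a", "k", "a", "pause"], CONFUSION)
asa_sim = word_similarity(AKA_FRAMES, ["pause", "a", "s", "a", "pause"], CONFUSION)
print(self_sim, asa_sim)
```

With these toy numbers the target word scores higher against itself than against the opposing word, which is the property the inter-word distance relies on.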

The value obtained by accumulating the probability values of the associated phonemes, normalized by the total duration of the phonemes of the target word, is the output of the word similarity calculation means 104. In this example the normalization divides the logarithm of the accumulated value by the time length, which gives −0.57, the value corresponding to "asa" in the first row of FIG. 10.
An operation example of the recognition failure word extraction device 501 is shown using FIG. 17. The candidate number narrowing reference value is defined as the maximum number of candidates kept when narrowing down candidates, and the number of recognition failure words to extract is 1.
In the example of FIG. 17, the estimated word list L_input shown in FIG. 7, the candidate number narrowing reference value, and the extraction prevention word list L_off are input to the recognition failure word extraction device 501.

In the loop starting at step 602 in FIG. 6, the target words W_i are the words not included in the predetermined extraction prevention word list. As shown in the first column of FIG. 17, they are the individual words of the estimated word list L_input, and the corresponding word list L_i, shown in the second column, is the list of words obtained by removing the target word W_i from L_input.
The word list L_i and the candidate number narrowing reference value are input to the speech recognition performance estimation device 100, and the resulting speech recognition performance for each word list L_i is shown in the third column of FIG. 17. In this example the word list "asa, ika, ushi" (朝, 烏賊, 牛) obtained by removing "aka" (赤) has the highest speech recognition rate, so the recognition failure word corresponding to this list is "aka". If every word in the estimated word list L_input is included in the extraction prevention word list, the extraction operation is not executed and "none" is output as the extraction result.
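The search loop of steps 602 to 607 can be sketched as a leave-one-out ranking. Everything here is illustrative: `estimate_rate` is a hypothetical callable standing in for the speech recognition performance estimation device 100, and the toy rates are invented (the values of FIG. 17 are not reproduced in the text); they are chosen only so that removing "aka" gives the best list, as in the described example.

```python
def find_recognition_failure_words(words, blocked, estimate_rate, n_extract=1):
    """Rank non-blocked words by the estimated rate of the list without them."""
    scored = []
    for i, word in enumerate(words):
        if word in blocked:
            continue                      # extraction prevention words are never candidates
        remaining = words[:i] + words[i + 1:]
        scored.append((estimate_rate(remaining), word))
    # Highest estimated rate first: removing these words helps the most.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:n_extract]]   # empty list plays the role of "none"


WORDS = ["aka", "asa", "ika", "ushi"]

# HYPOTHETICAL rates for the list obtained by removing each word.
TOY_RATE_WITHOUT = {"aka": 80.0, "asa": 78.0, "ika": 75.0, "ushi": 74.0}


def toy_estimate(word_list):
    """Stand-in estimator: looks up the rate by which word is missing."""
    removed = (set(WORDS) - set(word_list)).pop()
    return TOY_RATE_WITHOUT[removed]


print(find_recognition_failure_words(WORDS, blocked=[], estimate_rate=toy_estimate))
```

Passing the estimator in as a callable keeps the search independent of how the recognition rate is actually computed.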

The speech recognition performance estimation device and the recognition failure word extraction device according to the present invention described above are realized by installing the speech recognition performance estimation program and the recognition failure word extraction program according to the present invention on a computer and having the computer's central processing unit interpret and execute these programs.
The speech recognition performance estimation program and the recognition performance estimation program according to the present invention are written in a computer-readable programming language, recorded on a computer-readable recording medium such as a magnetic disk or CD-ROM, and installed on the computer from such a recording medium or through a communication line.

The speech recognition performance estimation method, speech recognition estimation device, speech recognition estimation program, recognition failure word extraction method, recognition failure word extraction device, and recognition failure word extraction program according to the present invention are used in fields such as the design of automatic guidance systems that use speech.

FIG. 1: Block diagram of the overall speech recognition performance estimation device according to the present invention.
FIG. 2: Block diagram of the configuration of the word misrecognition score calculation means shown in FIG. 1.
FIG. 3: Block diagram of the internal configuration of the word similarity calculation means shown in FIG. 1.
FIG. 4: Diagram showing how the phoneme alignment unit shown in FIG. 3 associates the target word phoneme string with the opposing word phoneme string.
FIG. 5: Block diagram of the configuration of the recognition failure word extraction device of the present invention.
FIG. 6: Flowchart of the operation of the recognition failure word search device shown in FIG. 5.
FIG. 7: Diagram of an example of the estimated word list input to the speech recognition performance estimation device of the present invention.
FIG. 8: Diagram of the contents of the phoneme confusion matrix shown in FIG. 3.
FIG. 9: Diagram of an example of the target words and word lists passed to the word misrecognition score calculation means shown in FIG. 1.
FIG. 10: Diagram of the list, output by the important opposing word extraction unit shown in FIG. 2, that associates target words with important opposing words.
FIG. 11: Diagram of an example of the inter-word distances between target words and important opposing words.
FIG. 12: Diagram of an example of the conversion of the inter-word distances shown in FIG. 11.
FIG. 13: Diagram of an example of the word recognition scores output by the score conversion unit shown in FIG. 1.
FIG. 14: Diagram of the operation of the phoneme string generation unit shown in FIG. 3.
FIG. 15: Diagram of the operation of the phoneme duration assignment unit shown in FIG. 3.
FIG. 16: Diagram of the operation of the phoneme alignment unit shown in FIG. 3.
FIG. 17: Diagram of the operation of the recognition failure word extraction device according to the present invention.

Explanation of symbols

100 Speech recognition performance estimation device
101 Speech recognition rate estimation means
102 Average word misrecognition score calculation means
103 Word misrecognition score calculation means
104 Word similarity calculation means
201 Important opposing word extraction unit
202 Inter-word distance calculation unit
203 Score conversion unit
301 Phoneme string generation unit
302 Phoneme duration assignment unit
303 Phoneme alignment unit
304 Dictionary
305 Phoneme duration data
306 Phoneme confusion matrix
501 Recognition failure word extraction device
502 Recognition failure word search device

Claims

On a computer,
a word similarity calculation process that receives two words, a target word and an opposing word, and calculates a word similarity indicating the degree to which an utterance corresponding to the target word is recognized as the opposing word when speech recognition is performed;
a word misrecognition score calculation process that receives the target word and a word list and calculates a word misrecognition score indicating the degree to which an utterance corresponding to the target word is recognized as any word in the word list when speech recognition is performed;
an average word misrecognition score calculation process that receives the word list and calculates an average word misrecognition score over all words included in the word list; and
a speech recognition rate estimation process that receives the number of words included in the word list, the average word misrecognition score, and a candidate number narrowing reference value set when searching for solution candidates in speech recognition processing, and calculates an estimated value of the speech recognition rate.
A speech recognition performance estimation method characterized by executing the above processes.

The speech recognition performance estimation method according to claim 1,
wherein the word similarity calculation process assumes that each phoneme of the phoneme string corresponding to the pronunciation of the target word lasts for the average duration of that phoneme; determines the duration of each phoneme of the phoneme string corresponding to the pronunciation of the opposing word so that the sum of the durations of the phonemes of the opposing word equals the sum of the durations of the phonemes of the target word and so that the probability that the phoneme of the target word is recognized as the phoneme of the opposing word at the same time point, accumulated over all time points, is highest; and uses, as the word similarity, the accumulated probability normalized by the sum of the durations of the phonemes of the target word.

The speech recognition performance estimation method according to claim 1 or 2,
wherein the word misrecognition score calculation process extracts from the word list a fixed number of words having a high word similarity to the target word; calculates, for each extracted word, as an inter-word distance, the difference obtained by subtracting the logarithm of the word similarity between that word and the target word from the logarithm of the word similarity of the target word to itself; calculates a converted value that is larger the smaller the inter-word distance; and uses the sum of the converted values of the extracted words as the word misrecognition score.

The speech recognition performance estimation method according to any one of claims 1 to 3,
wherein the average word misrecognition score calculation process calculates, for each word included in the estimated word list, the word misrecognition score obtained when that word is the target word and all remaining words of the estimated word list form a new word list, and uses as the average word misrecognition score the sum of the word misrecognition scores of each word included in the new word list divided by the number of words in the estimated word list.

The speech recognition performance estimation method according to any one of claims 1 to 3,
wherein the average word misrecognition score calculation process calculates, for each word included in the estimated word list, the word misrecognition score obtained when that word is the target word and all remaining words of the estimated word list form a new word list, and uses as the average word misrecognition score the sum, over the words included in the estimated word list, of each word's occurrence probability multiplied by its word misrecognition score.

The speech recognition performance estimation method according to any one of claims 1 to 5,
wherein the candidate number narrowing reference value set when searching for speech recognition solution candidates is the maximum number of solution candidates listed at each time point of the search, and
the speech recognition performance estimation process uses, as the estimated speech recognition rate, a weighted linear sum of the number of words included in the estimated word list, the average word misrecognition score, and the number of words included in the estimated word list divided by the maximum number of solution candidates, plus a constant.

On the computer,
Executing any one of the recognition failure word search processing and the speech recognition performance estimation method according to claim 1,
A recognition failure word extraction method characterized in that the recognition failure word search process, for each word included in the estimated word list given as input, estimates with the speech recognition performance estimation method the speech recognition performance of a new word list consisting of all remaining words after that word is removed from the estimated word list, extracts a fixed number of the new word lists in descending order of speech recognition performance, and outputs the removed word corresponding to each extracted new word list as a recognition failure word.

The recognition failure word extraction method according to claim 7,
wherein the input to the recognition failure word search process includes an extraction prevention word list, and the process, for each word in the estimated word list given as input other than the extraction prevention words, estimates with the speech recognition performance estimation method the speech recognition performance of a new word list consisting of all remaining words after that word is removed from the estimated word list, extracts a fixed number of word lists in descending order of speech recognition performance, and outputs the removed word corresponding to each extracted word list as a recognition failure word.

A word similarity calculation means that receives two words, a target word and an opposing word, and calculates a word similarity indicating the degree to which an utterance corresponding to the target word is recognized as the opposing word when speech recognition is performed;
a word misrecognition score calculation means that receives the target word and a word list and calculates a word misrecognition score indicating the degree to which an utterance corresponding to the target word is recognized as any word included in the word list when speech recognition is performed;
an average word misrecognition score calculation means that receives the word list and calculates an average word misrecognition score over all words included in the word list; and
a speech recognition rate estimation means that receives the number of words included in the word list, the average word misrecognition score, and a candidate number narrowing reference value set when searching for solution candidates in speech recognition processing, and calculates an estimated value of the speech recognition rate.
A speech recognition performance estimation device comprising the above means.

The speech recognition performance estimation device according to claim 9,
wherein the word similarity calculation means assumes that each phoneme of the phoneme string corresponding to the pronunciation of the target word lasts for the average duration of that phoneme; determines the duration of each phoneme of the phoneme string corresponding to the pronunciation of the opposing word so that the sum of the durations of the phonemes of the opposing word equals the sum of the durations of the phonemes of the target word and so that the probability that the phoneme of the target word is recognized as the phoneme of the opposing word at the same time point, accumulated over all time points, is highest; and uses, as the word similarity, the accumulated probability normalized by the sum of the durations of the phonemes of the target word.

The speech recognition performance estimation device according to claim 9 or 10,
wherein the word misrecognition score calculation means extracts from the word list a fixed number of words having a high word similarity to the target word; calculates, for each extracted word, as an inter-word distance, the difference obtained by subtracting the logarithm of the word similarity between that word and the target word from the logarithm of the word similarity of the target word to itself; calculates a converted value that is larger the smaller the inter-word distance; and uses the sum of the converted values of the extracted words as the word misrecognition score.

The speech recognition performance estimation device according to any one of claims 9 to 11,
wherein the average word misrecognition score calculation means calculates, for each word included in the estimated word list, the word misrecognition score obtained when that word is the target word and all remaining words of the estimated word list form a new word list, and uses as the average word misrecognition score the sum of the word misrecognition scores of each word included in the new word list divided by the number of words in the estimated word list.

The speech recognition performance estimation device according to any one of claims 9 to 12,
wherein the average word misrecognition score calculation means calculates, for each word included in the estimated word list, the word misrecognition score obtained when that word is the target word and all remaining words of the estimated word list form a new word list, and uses as the average word misrecognition score the sum, over the words included in the estimated word list, of each word's occurrence probability multiplied by its word misrecognition score.

The speech recognition performance estimation device according to any one of claims 9 to 12,
wherein the candidate number narrowing reference value set when searching for speech recognition solution candidates is the maximum number of solution candidates listed at each time point of the search, and
the speech recognition performance estimation device uses, as the estimated speech recognition rate, a weighted linear sum of the number of words included in the estimated word list, the average word misrecognition score, and the number of words included in the estimated word list divided by the maximum number of solution candidates, plus a constant.

On the computer,
A recognition failure word search means; and any one of the speech recognition performance estimation devices according to claims 9 to 14,
A recognition failure word extraction device characterized in that the recognition failure word search means, for each word included in the estimated word list given as input, estimates with the speech recognition performance estimation device the speech recognition performance of a new word list consisting of all remaining words after that word is removed from the estimated word list, extracts a fixed number of word lists in descending order of speech recognition performance, and outputs the removed word corresponding to each extracted word list as a recognition failure word.

The recognition failure word extraction device according to claim 15,
wherein the input to the recognition failure word search means includes an extraction prevention word list, and the means, for each word in the estimated word list given as input other than the extraction prevention words, estimates with the speech recognition performance estimation device the speech recognition performance of a new word list consisting of all remaining words after that word is removed from the estimated word list, extracts a fixed number of word lists in descending order of speech recognition performance, and outputs the removed word corresponding to each extracted word list as a recognition failure word.

A speech recognition performance estimation program that is described in a program language that can be read by a computer and causes the computer to function as at least the speech recognition performance estimation device according to any one of claims 9 to 14.

A recognition failure word extraction program, which is described in a computer-readable program language, and causes the computer to function as at least a recognition failure word extraction device according to claim 15.

A computer-readable recording medium on which either the speech recognition performance estimation program according to claim 17 or the recognition failure word extraction program according to claim 18 is recorded.