JP6078402B2

JP6078402B2 - Speech recognition performance estimation apparatus, method and program thereof

Info

Publication number: JP6078402B2
Application number: JP2013075948A
Authority: JP
Inventors: 太一浅見; 哲小橋川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-04-01
Filing date: 2013-04-01
Publication date: 2017-02-08
Anticipated expiration: 2033-04-01
Also published as: JP2014202781A

Description

The present invention relates to a speech recognition performance estimation device, and a method and program therefor, that estimate the recognition accuracy of speech recognition.

FIG. 1 illustrates the relationship between the recognition rate of speech recognition and the amount of learning data. The vertical axis represents the recognition rate [%], and the horizontal axis represents the amount of learning data [hours]. FIG. 1 thus shows how far learning of the acoustic model used for speech recognition has progressed. As the amount of data in the learning set is increased, recognition accuracy at first improves rapidly (acoustic model learning proceeds), but the gain obtained from additional data gradually shrinks, and accuracy saturates once it reaches a certain level. The recognition accuracy at saturation is referred to here as the “recognition accuracy upper limit”.

The learning data consists of pairs of learning speech data and correct text for learning, which a person produces by listening to the learning speech data and transcribing it. The learning data is generally created from speech recorded at the site where the speech recognition system is to be introduced. Creating the correct text for learning is costly, so there is a strong demand to keep the amount of learning data to the necessary minimum.

To judge whether the amount of learning data is necessary and sufficient, the “recognition accuracy upper limit” must be known in advance. A conventionally known way of estimating it is, for example, to use the recognition accuracy measured by the closed evaluation described in Non-Patent Document 1 as the estimated value of the “recognition accuracy upper limit”.

Closed evaluation is a performance evaluation method in which an acoustic model (a closed acoustic model) is learned from an evaluation set consisting of pairs of evaluation speech and correct text for evaluation, and the recognition accuracy on that same evaluation set is then measured with the closed acoustic model. In general, known data used for model learning can be recognized with high accuracy by the learned acoustic model. Because the recognition accuracy on known data measured by closed evaluation is the accuracy under a kind of ideal condition, it is used as an estimate of the upper limit of recognition accuracy.

Non-Patent Document 1: Atsushi Sako, Tomoyuki Yamagata, Tetsuya Takiguchi, Yasuo Ariki, “System Request Detection by Integration with Speech Recognition,” IPSJ SIG Technical Report, SLP (Spoken Language Processing), 2007(129), pp. 143-148, 2007.

In general, the “recognition accuracy upper limit” obtained by closed evaluation is higher than the actual “recognition accuracy upper limit”. The value measured by closed evaluation is the result of recognizing known data. Actual acoustic model learning, however, uses data different from the evaluation set, and unknown data not used for learning is input when recognition accuracy is evaluated, so the accuracy at evaluation time is always lower than the “recognition accuracy upper limit” measured by closed evaluation. If the sufficiency of the learning data is judged against an upper limit that is higher than the actual one, then even when learning has progressed to the point where little room for improvement remains, the gap to the “recognition accuracy upper limit” leads to the judgment that the learning set must be enlarged further. Unnecessary transcription work then continues and incurs wasted cost.

The present invention has been made in view of this problem, and its object is to provide a speech recognition performance estimation device capable of accurately estimating the “recognition accuracy upper limit”, and a method and program therefor.

The speech recognition performance estimation device of the present invention comprises a closed acoustic model multiple generation unit, a closed acoustic model group recording unit, a learning set speech recognition accuracy calculation unit, an acoustic model selection unit, and a speech recognition accuracy calculation unit. The closed acoustic model multiple generation unit receives an initial acoustic model and an evaluation set, extracts acoustic features from the evaluation speech data constituting the evaluation set, generates N (N ≧ 2) closed acoustic models, each being a model between the acoustic features and the initial acoustic model, and outputs them to the closed acoustic model group recording unit. The learning set speech recognition accuracy calculation unit receives a learning set and outputs N speech recognition accuracies obtained by recognizing the learning set with each of the N closed acoustic models recorded in the closed acoustic model group recording unit. The acoustic model selection unit receives the N speech recognition accuracies, refers to the closed acoustic model group recording unit, selects the closed acoustic model corresponding to the maximum of the N speech recognition accuracies, and outputs it as the selected closed acoustic model. The speech recognition accuracy calculation unit outputs, as the estimated value of the recognition accuracy upper limit, the speech recognition accuracy obtained by recognizing the evaluation set with the selected closed acoustic model.

According to the speech recognition performance estimation device of the present invention, N acoustic models lying between the acoustic features of the evaluation speech data and the initial acoustic model are generated, the acoustic model showing the highest recognition accuracy on the learning set is selected from among them, and the recognition accuracy obtained by recognizing the evaluation set with that model is output as the estimated value of the “recognition accuracy upper limit”. An approximate “recognition accuracy upper limit” can therefore be known in advance. As a result, the remaining room for improvement in the recognition rate can be grasped beforehand, superfluous transcription work can be avoided, and the development cost of introducing a speech recognition system can be reduced.

FIG. 1 is a diagram showing an example of the relationship between the recognition rate of speech recognition and the amount of learning data.
FIG. 2 is a diagram showing a functional configuration example of the speech recognition performance estimation apparatus 100 of the present invention.
FIG. 3 is a diagram showing the operation flow of the speech recognition performance estimation apparatus 100.
FIG. 4 is a diagram showing a functional configuration example of the closed acoustic model multiple generation unit 110.
FIG. 5 is a diagram showing a functional configuration example of the speech recognition performance estimation apparatuses 200 and 200′ of the present invention.
FIG. 6 is a diagram showing a functional configuration example of the speech recognition accuracy calculation unit 230.
FIG. 7 is a diagram showing a functional configuration example of the speech recognition performance estimation apparatus 300 of the present invention.
FIG. 8 is a diagram showing a functional configuration example of the speech recognition performance estimation apparatus 400 of the present invention.
FIG. 9 is a diagram showing a functional configuration example of the speech recognition performance estimation apparatus 500 of the present invention.
FIG. 10 is a diagram showing the operation flow of the speech recognition performance estimation apparatus 500.

Embodiments of the present invention are described below with reference to the drawings. Identical components in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 2 shows a functional configuration example of the speech recognition performance estimation apparatus 100 of the present invention, and FIG. 3 shows its operation flow. The speech recognition performance estimation apparatus 100 comprises a closed acoustic model multiple generation unit 110, a closed acoustic model group recording unit 120, a learning set speech recognition accuracy calculation unit 130, an acoustic model selection unit 140, a selected closed acoustic model 150, a speech recognition accuracy calculation unit 160, and a control unit 170. The apparatus is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The evaluation speech data and other speech data input to the speech recognition performance estimation apparatus 100 are digital signals sampled at, for example, 16 kHz. The speech recognition processing performed by the apparatus operates frame by frame, where one frame is formed from, for example, 256 of these samples.
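As a rough illustration of the framing described above, the following sketch splits a sample sequence into 256-sample frames at 16 kHz. Non-overlapping frames and dropping a trailing partial frame are assumptions made here for simplicity; the text does not specify a frame shift, and the function name is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # sampling frequency given in the text
FRAME_LENGTH = 256       # samples per frame given in the text


def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length]
            for i in range(n_frames)]


# Duration of one frame in milliseconds: 256 samples at 16 kHz.
frame_duration_ms = 1000 * FRAME_LENGTH / SAMPLE_RATE_HZ

# One second of silence as stand-in input.
one_second = [0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
```

Under these assumptions a frame spans 16 ms, and one second of audio yields 62 complete frames.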

The closed acoustic model multiple generation unit 110 receives the initial acoustic model and the evaluation set, extracts acoustic features from the evaluation speech data constituting the evaluation set, generates N (N ≧ 2) closed acoustic models, each being a model between the acoustic features and the initial acoustic model, and outputs them to the closed acoustic model group recording unit 120 (step S110). Here, N is the number of parameter values used; for example, the 20 values 50, 100, 150, ..., 1000 taken at intervals of 50 within the range 1 to 1000 give N = 20. These N values may be held in advance as constants in the closed acoustic model multiple generation unit 110, or may be supplied externally as the learning parameter τ, as indicated by the broken line in FIG. 2. The closed acoustic model multiple generation unit 110 is described in detail later.

The closed acoustic model multiple generation unit 110 repeats the model generation process until N closed acoustic models have been obtained. The control unit 170 controls this repetition (No in step S1701). Increasing N increases the number of closed acoustic models, which allows the “recognition accuracy upper limit” to be estimated more accurately. However, increasing N also increases the time required for speech recognition processing in the learning set speech recognition accuracy calculation unit 130.

The learning set speech recognition accuracy calculation unit 130 receives the learning set and outputs N speech recognition accuracies obtained by recognizing the learning set with each of the N closed acoustic models recorded in the closed acoustic model group recording unit 120 (step S130). The learning set is the learning data described above, consisting of pairs of learning speech data and correct text for learning transcribed by a person listening to that speech data. Speech recognition itself is a well-known existing technique, described for example in Reference 1 (Masataki et al., “Spontaneous Speech Recognition Technology ‘VoiceRex’ for Capturing Natural Conversations with Customers,” NTT Technical Journal, Vol. 18, No. 11, pp. 15-18, 2006). The speech recognition accuracy is the value obtained by subtracting the word error rate from 100, and the word error rate is calculated, for example, by the method described in Reference 2 (X. Huang, A. Acero and H.-W. Hon, “Spoken Language Processing,” Prentice Hall, pp. 419-421, 2001). The learning set speech recognition accuracy calculation unit 130 repeats speech recognition processing until N speech recognition accuracies have been obtained. The control unit 170 controls this repetition (No in step S1702).
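The accuracy measure described above (100 minus the word error rate) can be sketched as follows. This is a minimal illustration that assumes the word error rate is the word-level Levenshtein distance divided by the number of reference words; the function names are hypothetical and the code is not taken from Reference 2.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def recognition_accuracy(ref_words, hyp_words):
    """Return 100 minus the word error rate, both in percent."""
    if not ref_words:
        raise ValueError("reference transcript must not be empty")
    wer = 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)
    return 100.0 - wer


# Toy example: one substitution (b -> x) and one deletion (d) against 4 words.
reference = "a b c d".split()
hypothesis = "a x c".split()
accuracy = recognition_accuracy(reference, hypothesis)
```

With this definition the toy example has a word error rate of 50%, so the accuracy is 50.0. Note that the value can become negative when the hypothesis contains many insertions.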

The acoustic model selection unit 140 receives the N speech recognition accuracies output by the learning set speech recognition accuracy calculation unit 130, refers to the closed acoustic model group recording unit 120, selects the closed acoustic model corresponding to the maximum of the N speech recognition accuracies, and outputs it as the selected closed acoustic model 150 (step S140).

The speech recognition accuracy calculation unit 160 outputs, as the estimated value of the recognition accuracy upper limit, the speech recognition accuracy obtained by recognizing the evaluation set with the selected closed acoustic model chosen by the acoustic model selection unit 140 (step S160).

As described above, the speech recognition performance estimation apparatus 100 generates N acoustic models located between the acoustic features of the evaluation speech data and the initial acoustic model, selects from among them the acoustic model showing the highest recognition accuracy on the learning set, and outputs the recognition accuracy obtained by recognizing the evaluation set with that model as the estimated value of the “recognition accuracy upper limit”. In other words, the apparatus uses the speech recognition accuracy on the learning set to decide how far the acoustic model may be moved toward the acoustic features of the evaluation speech data, obtains the acoustic model that maximizes the learning set accuracy, and estimates the “recognition accuracy upper limit” with that model.
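The flow of steps S130, S140 and S160 can be sketched as below. This is only an outline under stated assumptions: the two accuracy callables stand in for full speech recognition runs on the learning set and the evaluation set, the function name is hypothetical, and the accuracy figures in the usage part are made-up toy numbers.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Model = TypeVar("Model")


def estimate_upper_limit(
    closed_models: Sequence[Model],
    learning_set_accuracy: Callable[[Model], float],
    evaluation_set_accuracy: Callable[[Model], float],
) -> Tuple[Model, float]:
    """Select the closed model that is best on the learning set, then score
    it on the evaluation set to obtain the upper-limit estimate."""
    if len(closed_models) < 2:
        raise ValueError("N >= 2 closed acoustic models are required")
    # Step S130: one learning-set accuracy per closed acoustic model.
    accuracies = [learning_set_accuracy(m) for m in closed_models]
    # Step S140: pick the model with the maximum accuracy
    # (in this sketch the first maximum wins when there is a tie).
    best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
    selected = closed_models[best_index]
    # Step S160: accuracy of the selected model on the evaluation set.
    return selected, evaluation_set_accuracy(selected)


# Toy usage: models are identified by their learning parameter value.
learning_acc = {50: 70.0, 100: 75.0, 150: 72.0}
evaluation_acc = {50: 90.0, 100: 85.0, 150: 80.0}
selected, upper_limit = estimate_upper_limit(
    [50, 100, 150], learning_acc.__getitem__, evaluation_acc.__getitem__
)
```

In the toy data the model closest to the evaluation data (50) scores highest on the evaluation set, but the model chosen is the one that generalizes best to the learning set (100), which is the point of the selection step.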

As a result, when a speech recognition system is introduced, the room for improvement in the recognition rate can be determined from the difference between the “recognition accuracy upper limit” and the speech recognition rate obtained with an arbitrary evaluation set that is used for accuracy evaluation and differs from the learning set used to learn the acoustic model. Superfluous transcription of correct text for learning can therefore be avoided.

Any method may be used to create the N acoustic models between the initial acoustic model and the acoustic features of the evaluation speech data. The special technical feature of the present invention is the configuration and method of creating a plurality of acoustic models in advance between the initial acoustic model and the acoustic features of the evaluation speech data and finding the maximum speech recognition accuracy. In the following, more specific functional configuration examples of each unit are given and the operation of the speech recognition performance estimation apparatus 100 is described in more detail.

[Closed acoustic model multiple generation unit]
FIG. 4 shows a functional configuration example of the closed acoustic model multiple generation unit 110. The closed acoustic model multiple generation unit 110 comprises an acoustic feature quantity extraction unit 1101, an each phoneme section extraction unit 1102, a model parameter update unit 1103, a learning parameter τ 1104, and an output unit 1105.

The acoustic feature quantity extraction unit 1101 extracts acoustic features x_t (MFCC: Mel-Frequency Cepstrum Coefficients) from the speech data of the evaluation speech. The acoustic features x_t can be obtained by a well-known calculation method.

The each phoneme section extraction unit 1102 receives the acoustic features x_t extracted by the acoustic feature quantity extraction unit 1101 and, referring to the initial acoustic model, extracts the phoneme sections t_a, t_k, ... of the phonemes, which roughly correspond to the alphabetic units (a, k, s, t, ..., N) obtained when words are written in Roman letters. Here, “t_a” with the subscript a denotes the section of the phoneme “a”. In the following, the phoneme label such as “a” is omitted from the notation.

The model parameter update unit 1103 receives the acoustic features x_t, the initial acoustic model μ_0 (the mean of the features) of the same phoneme, and the learning parameter τ 1104, and generates a closed acoustic model μ^ by, for example, the following equation (1).

Here, t denotes the frame index and T denotes the number of frames.
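The equation itself is not reproduced in this text. A form consistent with the surrounding description (a weighted average of μ_0 and the x_t controlled by τ, with t running over T frames) is the standard MAP mean update used in speaker adaptation. It is given here as an assumption, not as a quotation of the patent's equation (1):

```latex
\hat{\mu} = \frac{\tau \, \mu_0 + \sum_{t=1}^{T} x_t}{\tau + T}
```

Under this form, a large τ keeps \hat{\mu} near μ_0, while a small τ moves it toward the sample mean of the x_t.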

A closed acoustic model μ^ is generated for each phoneme and for each value of the learning parameter τ. If, as described above, τ takes for example the 20 values 50, 100, 150, ..., 1000 within the range 1 to 1000, then 20 closed acoustic models μ^_1 to μ^_20 are generated for every phoneme. These closed acoustic models μ^_1 to μ^_20 are recorded in the closed acoustic model group recording unit 120 via the output unit 1105. The output unit 1105 notifies the acoustic feature quantity extraction unit 1101 or the each phoneme section extraction unit 1102 that a closed acoustic model has been output for each phoneme and each learning parameter τ.

As is clear from equation (1), the closed acoustic model μ^ is a weighted average of the initial acoustic model μ_0 and the acoustic features x_t of the evaluation speech data. Its value approaches the initial acoustic model as the learning parameter τ increases, and approaches the acoustic features x_t of the evaluation speech data as τ decreases.
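The effect of τ can be checked numerically. The sketch below assumes the standard MAP mean update, (τ·μ_0 + Σ x_t) / (τ + T), as a stand-in for equation (1); the function name and the one-dimensional toy data are hypothetical.

```python
def map_mean(mu0, frames, tau):
    """Weighted average of the initial mean mu0 and the observed frames.

    Assumed form: (tau * mu0 + sum of frames) / (tau + number of frames),
    computed per feature dimension.
    """
    n_frames = len(frames)
    dim = len(mu0)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    return [(tau * mu0[d] + sums[d]) / (tau + n_frames) for d in range(dim)]


# The 20 learning parameter values from the text: 50, 100, ..., 1000.
taus = list(range(50, 1001, 50))

# Toy data: initial mean 0.0, and 100 evaluation frames all equal to 1.0.
mu0 = [0.0]
frames = [[1.0]] * 100

# One closed model (here just a mean vector) per value of tau.
closed_means = {tau: map_mean(mu0, frames, tau) for tau in taus}
```

With this toy data the mean for τ = 100 lies exactly halfway between the initial model and the data, τ = 50 lies closer to the data, and τ = 1000 stays close to the initial model.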

The closed acoustic model may be one in which only the mean of the features is updated by equation (1). In that case, the variance Σ_0 of the initial acoustic model is used unchanged. Alternatively, the variance Σ^ may also be updated using equation (2).

Here, α > k − 1, k is the number of dimensions, and ′ denotes transposition. Equations (1) and (2) are described in Reference 3 (Koichi Shinoda, “Speaker Adaptation Techniques for Speech Recognition Using Probabilistic Models,” IEICE Transactions D-II, Vol. J87-D-II, No. 2, pp. 371-386, February 2004).

[Learning set speech recognition accuracy calculation unit]
The learning set speech recognition accuracy calculation unit 130 recognizes the learning speech data of the learning set using the plurality of closed acoustic models recorded in the closed acoustic model group recording unit 120 together with a language model, and outputs a speech recognition accuracy for each of the closed acoustic models. As noted above, speech recognition itself is an existing technique.

Word accuracy, for example, is used as the speech recognition accuracy. Word accuracy is obtained by comparing the speech recognition result with the correct text for learning. As described above, the speech recognition accuracy may also be obtained from the word error rate (Reference 2).

[Acoustic model selection unit]
The acoustic model selection unit 140 receives the N speech recognition accuracies corresponding to the N acoustic models. Referring to the closed acoustic model group recording unit 120, it selects the closed acoustic model whose speech recognition accuracy is the maximum and outputs it as the selected closed acoustic model 150.

When several closed acoustic models share the maximum speech recognition accuracy, either the model with the largest change from the initial acoustic model or the model with the smallest change from the initial acoustic model may be selected from among them. Either rule may be used, provided the same rule is always applied.
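A consistent tie-breaking rule of this kind can be sketched as follows. The sketch identifies each closed model by its learning parameter τ and assumes that a smaller τ means a larger change from the initial acoustic model, in line with the weighted-average description of equation (1); the function name, the `prefer` argument, and the toy accuracies are hypothetical.

```python
def select_tau(candidates, prefer="largest_change"):
    """Pick the tau whose model has the maximum accuracy, breaking ties
    by a fixed rule.

    candidates: list of (tau, accuracy) pairs.
    prefer: "largest_change" selects the smallest tied tau (assumed to be
            farthest from the initial model); "smallest_change" selects
            the largest tied tau.
    """
    if prefer not in ("largest_change", "smallest_change"):
        raise ValueError("unknown tie-breaking rule: " + prefer)
    best = max(accuracy for _, accuracy in candidates)
    tied = [tau for tau, accuracy in candidates if accuracy == best]
    return min(tied) if prefer == "largest_change" else max(tied)


# Toy accuracies in which tau = 100 and tau = 150 tie for the maximum.
candidates = [(50, 80.0), (100, 82.5), (150, 82.5), (200, 81.0)]
by_largest_change = select_tau(candidates, "largest_change")
by_smallest_change = select_tau(candidates, "smallest_change")
```

Fixing the rule in a single argument makes it easy to apply the same choice across all runs, as the text requires.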

[Speech recognition accuracy calculation unit]
The speech recognition accuracy calculation unit 160 outputs, as the estimated value of the recognition accuracy upper limit, the speech recognition accuracy obtained by recognizing the evaluation set with the selected closed acoustic model. The speech recognition accuracy is obtained by aligning the correct text for evaluation with the speech recognition result; for example, the word accuracy obtained by subtracting from 100 the word error rate calculated by the method described in Reference 2 above is used as the estimated value of the recognition accuracy upper limit.

FIG. 5 shows a functional configuration example of the speech recognition performance estimation apparatus 200 of the present invention. The speech recognition performance estimation apparatus 200 differs from the speech recognition performance estimation apparatus 100 (FIG. 2) described above only in that the learning set speech recognition accuracy calculation unit 130 is replaced with a learning set speech recognition accuracy calculation unit 230. The learning set speech recognition accuracy calculation unit 230 receives learning speech data, recognizes it using the N closed acoustic models recorded in the closed acoustic model group recording unit and a language model, and outputs N speech recognition accuracies, one for each of the N closed acoustic models. It outputs these accuracies without using the correct text for learning.

FIG. 6 shows a functional configuration example of the learning set speech recognition accuracy calculation unit 230. The learning set speech recognition accuracy calculation unit 230 comprises a word confusion network creation unit 2301, a language model 2302, a word alignment network conversion unit 2303, and a word accuracy calculation unit 2304.

The word confusion network creation unit 2301 recognizes the learning speech data using the N acoustic models recorded in the closed acoustic model group recording unit 120 and the language model 2302, converts the recognition result word strings into a word confusion network (WCN), and outputs it. A word confusion network is an efficient representation of multiple recognition result word strings for a single utterance.

The word alignment network conversion unit 2303 converts the word confusion network created by the word confusion network creation unit 2301 into a word alignment network. Word confusion networks and word alignment networks are well known, as described for example in Reference 4 (Atsunori Ogawa, Takaaki Hori, Atsushi Nakamura, “Recognition Accuracy Estimation Using Word Alignment Networks and Discriminative Type Classification,” Acoustical Society of Japan 2012 Autumn Meeting, pp. 67-68, 2012).

The word accuracy calculation unit 2304 receives the word alignment network produced by the word alignment network conversion unit 2303 and outputs an estimated word accuracy WACC. The estimate WACC is calculated for each of the N acoustic models recorded in the closed acoustic model group recording unit 120, yielding the N speech recognition accuracies.

The estimated word accuracy WACC is obtained by the following equation.

Here, E(#C) is the estimated number of correct words obtained from the word alignment network, E(#S) is the estimated number of substitution errors, E(#I) is the estimated number of insertion errors, and E(#D) is the estimated number of deletion errors.
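The equation is not reproduced in this text. A form consistent with the four quantities defined above follows the usual definition of word accuracy, (correct − insertions) / (number of reference words), with the counts replaced by their estimates. It is given here as an assumption rather than as a quotation of the patent's equation:

```latex
\mathrm{WACC} = \frac{E(\#C) - E(\#I)}{E(\#C) + E(\#S) + E(\#D)} \times 100 \ [\%]
```

The denominator is the estimated number of words in the (unavailable) reference, since every reference word is either correct, substituted, or deleted.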

Because the speech recognition performance estimation apparatus 200 can output the N speech recognition accuracies even without correct text for learning corresponding to the learning speech data, it reduces the cost of creating that text. The idea of using only learning speech data as the learning set can also be applied to the evaluation set. The speech recognition performance estimation apparatus 200′ (FIG. 5) in that case comprises a closed acoustic model multiple generation unit 310 and a speech recognition accuracy calculation unit 360, both of which take evaluation speech data as input instead of the evaluation set, and the learning set speech recognition accuracy calculation unit 130, which takes the learning set as input instead of the learning speech data. The operations of the closed acoustic model multiple generation unit 310 and the speech recognition accuracy calculation unit 360 are described in the third embodiment.

FIG. 7 shows an example functional configuration of a speech recognition performance estimation device 300 of the present invention. The device 300 differs from the speech recognition performance estimation device 200 (FIG. 5) described above in that the closed acoustic model multiple generation unit 110 is replaced by a closed acoustic model multiple generation unit 310 and the speech recognition accuracy calculation unit 160 is replaced by a speech recognition accuracy calculation unit 360.

The closed acoustic model multiple generation unit 310 takes an initial acoustic model and evaluation speech data as input. It extracts acoustic features from the evaluation speech data and from the recognition result obtained by recognizing that data with the initial acoustic model, generates N (N ≧ 2) closed acoustic models, each a model lying between the acoustic features and the initial acoustic model, and outputs them to the closed acoustic model group recording unit 120. The unit 310 differs from the closed acoustic model multiple generation unit 110 in that it treats the recognition result obtained with the initial acoustic model in the same way as the correct evaluation text described above.

The speech recognition accuracy calculation unit 360 takes the evaluation speech data as input, recognizes it using the selected closed acoustic model 150 and a language model, and outputs the resulting speech recognition accuracy as the estimated upper limit of recognition accuracy. The language model 2302 described above can be used as the language model. Like the learning set speech recognition accuracy calculation unit 230 described above, the unit 360 outputs a speech recognition accuracy without using correct text.

The speech recognition performance estimation device 300 can output an estimated upper limit of recognition accuracy even when neither correct learning text nor correct evaluation text exists, and therefore reduces the cost of creating them.

FIG. 8 shows an example functional configuration of a speech recognition performance estimation device 400 of the present invention. The device 400 comprises the speech recognition performance estimation devices 100, 200, 200′, and 300 described above, a data set selection unit 480, and an averaging unit 490.

The data set selection unit 480 takes as input a plurality of data sets each consisting of speech data paired with correct text and a plurality of data sets each consisting of speech data only. It selects two data sets other than those already selected and routes them as follows. If both selected data sets have correct text, one is output as the learning set and the other as the evaluation set to the speech recognition performance estimation device 100. If only one of them has correct text, that data set is output as the evaluation set or the learning set, and the other as learning speech data or evaluation speech data, to the devices 200 and 200′ respectively. If neither has correct text, one is output as learning speech data and the other as evaluation speech data to the device 300. This operation is performed a predetermined number of times. The predetermined number may be 1, may be set to about 3 to 10 depending on the available computation time, or may be the maximum number M of data set combinations.
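The routing rule of the data set selection unit 480 can be sketched as follows. This is a minimal illustration, assuming that each data set is represented only by a name and a flag saying whether correct text is attached, and that pairs are enumerated as ordered (learning, evaluation) pairs. The function name `route_pairs` and the parameter `max_pairs` are hypothetical.

```python
from itertools import permutations


def route_pairs(datasets, max_pairs=None):
    """Assign each ordered (learning, evaluation) pair to a device.

    datasets: dict mapping a data set name to True if correct text exists.
    Returns a list of (device, learning_name, evaluation_name) tuples.
    """
    routed = []
    for learn, evaluate in permutations(datasets, 2):
        learn_has_text = datasets[learn]
        eval_has_text = datasets[evaluate]
        if learn_has_text and eval_has_text:
            device = "100"   # learning set and evaluation set
        elif eval_has_text:
            device = "200"   # learning speech data only, evaluation set
        elif learn_has_text:
            device = "200'"  # learning set, evaluation speech data only
        else:
            device = "300"   # speech data only on both sides
        routed.append((device, learn, evaluate))
        if max_pairs is not None and len(routed) >= max_pairs:
            break  # the "predetermined number of times"
    return routed


pairs = route_pairs({"A": True, "B": True, "C": False, "D": False})
for device, learn, evaluate in pairs:
    print(device, learn, evaluate)
```

With two data sets that have correct text and two that do not, this enumeration yields two pairs for device 100 and two for device 300, and the remaining pairs go to devices 200 and 200′.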

For example, suppose there are two data sets with correct text and two data sets without correct text. For the speech recognition performance estimation device 100 there are then two cases: the first data set serves as the learning set and the second as the evaluation set, or the reverse. Likewise, there are two cases for the device 300. The predetermined number of times may thus differ among the speech recognition performance estimation devices 100, 200, 200′, and 300.

The averaging unit 490 takes as input the estimated upper limits of recognition accuracy output by the speech recognition performance estimation devices 100, 200, and 300, and calculates and outputs the average of all of those estimates.

According to the speech recognition performance estimation device 400, an estimated upper limit of recognition accuracy can be obtained from a collection in which data sets with correct text and data sets without correct text are mixed. Averaging the estimates obtained from multiple combinations of data sets makes the estimate more accurate than when each of the devices 100, 200, 200′, and 300 is operated only once. Moreover, in a development environment where data sets pairing speech data with correct text are not always available, the device 400 can obtain an estimated upper limit of recognition accuracy from data sets consisting of speech data only. That is, the device 400 also has the advantage of operating regardless of the form of the data sets.

Note that the speech recognition performance estimation device 400 need only include one of the devices 200 and 200′. With either one, an estimated upper limit of recognition accuracy can still be obtained when the two selected data sets consist of one data set pairing speech data with correct text and one data set of speech data only.

FIG. 9 shows an example functional configuration of a speech recognition performance estimation device 500 of the present invention, and FIG. 10 shows its operation flow. The device 500 comprises a closed acoustic model multiple generation unit 510, a closed acoustic model group recording unit 120, a speech recognition accuracy calculation unit 560, a bracketing control unit 580, a recognition accuracy upper limit output unit 540, and a control unit 570.

Like the speech recognition performance estimation device 100 and the others described above, the device 500 is realized by loading a predetermined program into a computer comprising, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The closed acoustic model multiple generation unit 510 takes as input the initial acoustic model, the evaluation set, a bracketing control signal, and a speech recognition accuracy. It extracts acoustic features from the evaluation speech data constituting the evaluation set, generates, according to the bracketing control signal, a closed acoustic model within the range whose upper limit is the acoustic features and whose lower limit is the initial acoustic model, and outputs it to the closed acoustic model group recording unit 120. It also attaches the speech recognition accuracy that the speech recognition accuracy calculation unit 560 obtained with that closed acoustic model to the corresponding closed acoustic model already output (step S510). The bracketing control signal corresponds to the learning parameter τ described above. It is varied according to the bracketing method (hasamiuchi method), a well-known numerical technique, and is set so that the range of its values is successively narrowed.

The speech recognition accuracy calculation unit 560 takes the learning set as input, recognizes it using the closed acoustic models in the order in which they were recorded in the closed acoustic model group recording unit 120, and outputs the resulting speech recognition accuracy (step S560).

The bracketing control unit 580 takes the speech recognition accuracy as input, varies the bracketing control signal so that the speech recognition accuracy is maximized, and outputs it. It repeats, a predetermined number of times, the process of operating the closed acoustic model multiple generation unit 510 and the speech recognition accuracy calculation unit 560 with the varied bracketing control signal (step S580).

The recognition accuracy upper limit output unit 540 searches the closed acoustic models recorded in the closed acoustic model group recording unit 120 for the maximum speech recognition accuracy and outputs that value as the estimated upper limit of recognition accuracy (step S540).

Unlike the speech recognition performance estimation device 100 and the like, which first generate N closed acoustic models and then obtain the estimated upper limit of recognition accuracy, the device 500 searches for a closed acoustic model with high speech recognition accuracy by the bracketing method, which reduces the amount of computation.
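The search loop of the device 500 can be sketched as follows. The text names only a bracketing method, so this sketch swaps in a ternary search as one concrete way of successively narrowing the range of τ, and it assumes the learning-set accuracy has a single peak over that range. The callable `accuracy_for_tau` is a hypothetical stand-in for generating a closed acoustic model (unit 510) and scoring it on the learning set (unit 560).

```python
def bracketing_search(accuracy_for_tau, tau_low, tau_high, iterations=10):
    """Narrow [tau_low, tau_high] toward the accuracy peak.

    Returns the (tau, accuracy) record with the highest accuracy,
    mirroring the role of the upper limit output unit 540.
    """
    records = []  # stands in for the closed acoustic model group recording unit
    for _ in range(iterations):
        third = (tau_high - tau_low) / 3.0
        left = tau_low + third
        right = tau_high - third
        acc_left = accuracy_for_tau(left)
        acc_right = accuracy_for_tau(right)
        records.append((left, acc_left))
        records.append((right, acc_right))
        # Discard the third of the range that cannot contain the peak.
        if acc_left < acc_right:
            tau_low = left
        else:
            tau_high = right
    return max(records, key=lambda record: record[1])


def toy_accuracy(tau):
    # Hypothetical single-peaked accuracy curve with its maximum at tau = 3.
    return 90.0 - (tau - 3.0) ** 2


best_tau, best_accuracy = bracketing_search(toy_accuracy, 0.0, 10.0, iterations=30)
print(best_tau, best_accuracy)
```

Each iteration evaluates only two candidate models, which is where the reduction in computation relative to scoring a fixed grid of N models would come from.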

As described above, the speech recognition performance estimation device of the present invention determines how closely the acoustic model may be brought to the acoustic features of the evaluation speech data, and generates the acoustic model that maximizes the speech recognition accuracy on the learning set. The speech recognition accuracy obtained by recognizing the evaluation set with that acoustic model is then used as the estimated upper limit of recognition accuracy. This solves the problem that the "recognition accuracy upper limit" obtained by the closed evaluation of the prior art is higher than the actual "recognition accuracy upper limit".
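The overall procedure summarized above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation: the closed acoustic models are opaque objects, and the two scoring callables are hypothetical stand-ins for running a recognizer on the learning set and on the evaluation set.

```python
def estimate_upper_limit(closed_models, accuracy_on_learning_set,
                         accuracy_on_evaluation_set):
    """Select the model that scores best on the learning set, then
    report its evaluation-set accuracy as the estimated upper limit."""
    if len(closed_models) < 2:
        raise ValueError("at least two closed acoustic models are required")
    learning_scores = [accuracy_on_learning_set(m) for m in closed_models]
    best_index = max(range(len(closed_models)),
                     key=learning_scores.__getitem__)
    selected = closed_models[best_index]
    return selected, accuracy_on_evaluation_set(selected)


# Hypothetical accuracies: model "m2" is closest to the evaluation data.
learning_acc = {"m0": 70.0, "m1": 78.0, "m2": 74.0}
evaluation_acc = {"m0": 72.0, "m1": 81.0, "m2": 95.0}

selected, upper_limit = estimate_upper_limit(
    ["m0", "m1", "m2"], learning_acc.get, evaluation_acc.get)
print(selected, upper_limit)
```

In these toy numbers the model closest to the evaluation data scores highest on the evaluation set but not on the learning set, so it is not selected. That is the overly optimistic closed-evaluation figure the procedure is meant to avoid.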

Note that the idea described in Embodiment 1 of generating the N closed acoustic models as weighted averages according to the learning parameter τ can also be applied to Embodiments 2 and 3. Likewise, the idea of the speech recognition performance estimation device 500, which uses the bracketing method as its numerical technique, can also be applied to Embodiments 2 and 3.
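The weighted-average idea can be sketched as follows. How τ maps to the weighting is defined in Embodiment 1 and is not reproduced in this section, so the sketch assumes a plain convex combination with a hypothetical weight w between 0 (the initial acoustic model) and 1 (the acoustic features). Parameter vectors are shown as flat lists of numbers for simplicity.

```python
def interpolate_parameters(initial_params, feature_params, weights):
    """Return one interpolated parameter vector per weight.

    Assumption: w = 0 reproduces the initial acoustic model parameters
    and w = 1 reproduces the parameters derived from the acoustic features.
    """
    if len(initial_params) != len(feature_params):
        raise ValueError("parameter vectors must have the same length")
    models = []
    for w in weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError("each weight must lie in [0, 1]")
        models.append([(1.0 - w) * a + w * b
                       for a, b in zip(initial_params, feature_params)])
    return models


# Three closed models between hypothetical parameter vectors.
closed_models = interpolate_parameters([0.0, 2.0], [4.0, 6.0], [0.0, 0.5, 1.0])
print(closed_models)
```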

When the processing means in the above devices are realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

Claims

A speech recognition performance estimation device comprising:
a closed acoustic model multiple generation unit that takes an initial acoustic model and an evaluation set as input, extracts acoustic features from the evaluation speech data constituting the evaluation set, generates N (N ≧ 2) closed acoustic models, each being a model between the acoustic features and the initial acoustic model, and outputs them to a closed acoustic model group recording unit;
a learning set speech recognition accuracy calculation unit that takes a learning set as input and outputs N speech recognition accuracies obtained by recognizing the learning set using the N closed acoustic models recorded in the closed acoustic model group recording unit;
an acoustic model selection unit that takes the N speech recognition accuracies as input, refers to the closed acoustic model group recording unit, and selects and outputs, as a selected closed acoustic model, the closed acoustic model corresponding to the maximum of the N speech recognition accuracies; and
a speech recognition accuracy calculation unit that outputs, as an estimated upper limit of recognition accuracy, the speech recognition accuracy obtained by recognizing the evaluation set using the selected closed acoustic model.

A speech recognition performance estimation device comprising:
a closed acoustic model multiple generation unit that takes an initial acoustic model and an evaluation set as input, extracts acoustic features from the evaluation speech data constituting the evaluation set, generates N (N ≧ 2) closed acoustic models, each being a model between the acoustic features and the initial acoustic model, and outputs them to a closed acoustic model group recording unit;
a learning set speech recognition accuracy calculation unit that takes learning speech data as input, recognizes the learning speech data using the N closed acoustic models recorded in the closed acoustic model group recording unit and a language model, and outputs N speech recognition accuracies corresponding respectively to the N closed acoustic models;
an acoustic model selection unit that takes the N speech recognition accuracies as input, refers to the closed acoustic model group recording unit, and selects and outputs, as a selected closed acoustic model, the closed acoustic model corresponding to the maximum of the N speech recognition accuracies; and
a speech recognition accuracy calculation unit that outputs, as an estimated upper limit of recognition accuracy, the speech recognition accuracy obtained by recognizing the evaluation set using the selected closed acoustic model.

A speech recognition performance estimation device comprising:
a closed acoustic model multiple generation unit that takes an initial acoustic model and evaluation speech data as input, extracts acoustic features from the evaluation speech data and from the recognition result obtained by recognizing the evaluation speech data using the initial acoustic model, generates N (N ≧ 2) closed acoustic models, each being a model between the acoustic features and the initial acoustic model, and outputs them to a closed acoustic model group recording unit;
a learning set speech recognition accuracy calculation unit that takes learning speech data as input, recognizes the learning speech data using the N closed acoustic models recorded in the closed acoustic model group recording unit and a language model, and outputs N speech recognition accuracies corresponding respectively to the N closed acoustic models;
an acoustic model selection unit that takes the N speech recognition accuracies as input, refers to the closed acoustic model group recording unit, and selects and outputs, as a selected closed acoustic model, the closed acoustic model corresponding to the maximum of the N speech recognition accuracies; and
a speech recognition accuracy calculation unit that treats, as an evaluation set, the correct evaluation text output from the closed acoustic model multiple generation unit together with the externally input evaluation speech data, and outputs, as an estimated upper limit of recognition accuracy, the speech recognition accuracy obtained by recognizing that evaluation set using the selected closed acoustic model.

A speech recognition performance estimation device comprising:
the speech recognition performance estimation device according to claim 1 as a first speech recognition performance estimation device;
the speech recognition performance estimation device according to claim 2 as a second speech recognition performance estimation device;
the speech recognition performance estimation device according to claim 3 as a third speech recognition performance estimation device;
a data set selection unit that takes as input a plurality of data sets each consisting of speech data paired with correct text and a plurality of data sets each consisting of speech data only, and performs the following operation a predetermined number of times: selecting two data sets other than those already selected; when both selected data sets have correct text, outputting one as a learning set and the other as an evaluation set to the first speech recognition performance estimation device; when only one of the selected data sets has correct text, outputting the data set with correct text as an evaluation set and the other as learning speech data to the second speech recognition performance estimation device; and when neither selected data set has correct text, outputting one as learning speech data and the other as evaluation speech data to the third speech recognition performance estimation device; and
an averaging unit that takes as input the estimated upper limits of recognition accuracy output by the first, second, and third speech recognition performance estimation devices, and calculates and outputs the average of all of those estimates.

The speech recognition performance estimation device according to any one of claims 1 to 3, wherein
the closed acoustic model multiple generation unit
generates and outputs the N closed acoustic models as weighted averages corresponding to N (N ≧ 2) learning parameters τ, within a range whose lower limit is the parameters of the initial acoustic model and whose upper limit is the parameters of the acoustic features.

A speech recognition performance estimation device comprising:
a closed acoustic model multiple generation unit that takes as input an initial acoustic model, an evaluation set, a bracketing control signal, and a speech recognition accuracy, extracts acoustic features from the evaluation speech data constituting the evaluation set, generates, according to the bracketing control signal, a closed acoustic model within a range whose upper limit is the acoustic features and whose lower limit is the initial acoustic model, outputs it to a closed acoustic model group recording unit, and attaches the speech recognition accuracy that a speech recognition accuracy calculation unit obtained using that closed acoustic model to the corresponding closed acoustic model already output;
the speech recognition accuracy calculation unit, which takes a learning set as input and outputs the speech recognition accuracy obtained by recognizing the learning set using the closed acoustic models in the order in which they were recorded in the closed acoustic model group recording unit;
a bracketing control unit that takes the speech recognition accuracy as input, varies and outputs the bracketing control signal so that the speech recognition accuracy is maximized, and repeats, a predetermined number of times, the process of operating the closed acoustic model multiple generation unit and the speech recognition accuracy calculation unit with the varied bracketing control signal; and
a recognition accuracy upper limit output unit that searches the closed acoustic models recorded in the closed acoustic model group recording unit for the maximum speech recognition accuracy and outputs that value as an estimated upper limit of recognition accuracy.

A speech recognition performance estimation method comprising:
a closed acoustic model multiple generation step in which a closed acoustic model multiple generation unit takes an initial acoustic model and an evaluation set as input, extracts acoustic features from the evaluation speech data constituting the evaluation set, generates N (N ≧ 2) closed acoustic models, each being a model between the acoustic features and the initial acoustic model, and outputs them to a closed acoustic model group recording unit;
a learning set speech recognition accuracy calculation step in which a learning set speech recognition accuracy calculation unit takes a learning set as input and outputs N speech recognition accuracies obtained by recognizing the learning set using the N closed acoustic models recorded in the closed acoustic model group recording unit;
an acoustic model selection step in which an acoustic model selection unit takes the N speech recognition accuracies as input, refers to the closed acoustic model group recording unit, and selects and outputs, as a selected closed acoustic model, the closed acoustic model corresponding to the maximum of the N speech recognition accuracies; and
a speech recognition accuracy calculation step in which a speech recognition accuracy calculation unit outputs, as an estimated upper limit of recognition accuracy, the speech recognition accuracy obtained by recognizing the evaluation set using the selected closed acoustic model.

A program for causing a computer to operate as the speech recognition performance estimation device according to any one of claims 1 to 6.