JP3908878B2

JP3908878B2 - Phoneme recognition performance measuring device for continuous speech recognition device

Info

Publication number: JP3908878B2
Application number: JP27332899A
Authority: JP
Inventors: 健小早川; 寛之世木; 亨今井; 彰男安藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1999-09-27
Filing date: 1999-09-27
Publication date: 2007-04-25
Anticipated expiration: 2019-09-27
Also published as: JP2001100789A

Description

[0001]
[Technical field of the invention]
The present invention relates to a phoneme recognition performance measuring device for a continuous speech recognition device that recognizes continuous speech using an acoustic model and a language model.
[0002]
[Prior art]
Speech recognition apparatuses that recognize continuous speech using an acoustic model and a probabilistic language model are known ("Speech Recognition by Probabilistic Models", Seiichi Nakagawa, IEICE). FIG. 1 shows the functional configuration of a conventional speech recognition apparatus of this type.
[0003]
In FIG. 1, reference numeral 10 denotes an acoustic model that performs phoneme recognition on input speech and outputs each recognized phoneme (in label form) together with its likelihood; the HMM (hidden Markov model) is a well-known example. Reference numeral 11 denotes a language model, which holds, in the form of a database, words expressed as phoneme strings and the connection probabilities between adjacent words. Reference numeral 12 denotes an acoustic likelihood calculation unit. Using the phoneme sequence and likelihoods output in time series from the acoustic model 10, together with the words recorded in the database of the language model 11 and their connection probabilities, it outputs as the word recognition result the word in the language model that has the highest likelihood for the input phoneme string.
[0004]
In a continuous speech recognition apparatus with this configuration, continuous speech is first phoneme-recognized by the acoustic model 10 and then word-recognized by the acoustic likelihood calculation unit 12, which outputs a word string, that is, a sentence.
[0005]
Measuring the performance of the acoustic model used in a continuous speech recognition apparatus has been proposed by Matsuoka, Otsuki, Mori, Furui, and Shirai in the Transactions of the IEICE, December 1996, p. 2125, and by Matsuoka, Otsuki, Mori, Furui, and Shirai in the Proceedings of the Acoustical Society of Japan, March 1997, pp. 2-6-11.
[0006]
[Problems to be solved by the invention]
To measure the performance of the acoustic model, evaluation speech whose phoneme notation is known in advance is input to the acoustic model, and the phoneme string obtained as the phoneme recognition result is compared with the notation of the evaluation speech; this yields the phoneme recognition rate. Here, a phoneme refers to a phonological unit such as a vowel or consonant, or to a portion of speech shorter than such a unit. A phoneme output from the acoustic model means the identification name (label) or the like of the recognized phoneme.
[0007]
The use of the phoneme recognition rate of the acoustic model as an index indicating the phoneme recognition performance of the continuous speech recognition apparatus has the following problems.
[0008]
In the acoustic model 10, phoneme recognition is performed by comparing an acoustic feature obtained from input speech with an acoustic feature of a phoneme that is held in advance in the acoustic model.
[0009]
Since the acoustic model alone performs only phoneme recognition and does not perform word recognition, it produces phoneme misrecognition results that could not occur in any sentence or word sequence of the language.
[0010]
In the continuous speech recognition apparatus, such misrecognition results are corrected during word recognition in the acoustic likelihood calculation unit 12. Therefore, when the phoneme recognition rate of the acoustic model alone is used as the phoneme recognition performance measurement for the continuous speech recognition apparatus, the measured value tends to be lower than the actual performance.
[0011]
In view of the above, an object of the present invention is to provide a phoneme recognition performance measuring device for a continuous speech recognition apparatus that can correctly measure the phoneme recognition performance of the continuous speech recognition apparatus when the grammatical connection relationships of the language (word connection rates) are excluded.
[0012]
[Means for Solving the Problems]
To achieve this object, the invention provides a phoneme recognition performance measuring device for a continuous speech recognition apparatus, which measures the phoneme recognition performance of a continuous speech recognition apparatus that performs continuous speech recognition using a language model and an acoustic model. The device comprises: input means for inputting a phoneme-string notation of evaluation speech; phoneme processing means for inputting the evaluation speech to the continuous speech recognition apparatus, performing phoneme recognition on the evaluation speech with the acoustic model, and constraining the phonemes of the acoustic model's phoneme recognition result on the basis of character strings that exist in the language; and performance measuring means for calculating a phoneme recognition rate by comparing the phoneme string output from the phoneme processing means with the phoneme-string notation of the evaluation speech input from the input means, and outputting the calculated phoneme recognition rate as a measured value of phoneme recognition performance. Phoneme strings that do not exist in the language are excluded by the phoneme processing means constraining the phonemes, and the processing result of the phoneme processing means, from which those phoneme strings have been excluded, is used for the performance measurement by the performance measuring means.
[0013]
According to a second aspect of the present invention, in the phoneme recognition performance measuring apparatus of the continuous speech recognition apparatus according to the first aspect, the character string existing in the language is a word.
[0016]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0017]
FIG. 2 shows the basic configuration of a phoneme recognition performance measuring device provided in the continuous speech recognition apparatus. In FIG. 2, parts that are the same as in the conventional example of FIG. 1 are denoted by the same reference numerals, and their detailed description is omitted. Reference numeral 21 denotes a phoneme string generation unit having a word dictionary 21-1; it sequentially supplies the words recorded in phoneme string form in the word dictionary 21-1 to the acoustic likelihood calculation unit 12. The acoustic likelihood calculation unit 12 then compares the plurality of phonemes output from the acoustic model 10 over a fixed period of time, that is, a phoneme string, with the phoneme string of each word supplied from the phoneme string generation unit 21, and outputs the most similar word (in phoneme string form) as the word recognition result.
[0018]
Note that in this embodiment the word connection rates are not given to the acoustic likelihood calculation unit 12, so connection relationships between words in the language are not taken into account, and the speech recognition result is the phoneme string of a word that can exist in the language. For example, suppose that for the evaluation speech 「朝」 (asa, "morning") the phoneme recognition result of the acoustic model is "aaaskssaaa", in which a phoneme "k" that does not exist in the language at that position is mixed into the result. If a phoneme string corresponding to 「朝」 is among the words supplied from the phoneme string generation unit 21, the phoneme "k" in the phoneme recognition result is corrected and "aaassssaaa" is output as the word recognition result. In this specification, phoneme processing that excludes phonemes not existing in the language in this way is called "applying a word constraint to the phoneme recognition result of the acoustic model". In this embodiment the character string length to which the constraint is applied is a word, but it may be a phrase or a sentence depending on the application. In that case, the word dictionary 21-1 used by the phoneme string generation unit 21 describes phoneme strings of the corresponding character string length.
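The word constraint described above can be illustrated with a minimal sketch. It assumes that each phoneme label is a single character, that the acoustic model emits one label per frame, and that similarity is judged by counting mismatching frames rather than by the likelihood calculation the patent refers to. The names `align_to_word`, `apply_word_constraint`, and the three-word dictionary are hypothetical.

```python
from typing import Dict, Tuple

INF = float("inf")


def align_to_word(frames: str, phonemes: str) -> Tuple[float, str]:
    """Force frame labels onto one word's phoneme sequence.

    Each phoneme of the word covers at least one frame, in order.
    Returns (number of mismatching frames, constrained frame labels).
    """
    n, m = len(frames), len(phonemes)
    # cost[i][j]: fewest mismatches when frames[:i] are covered by phonemes[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either stay on phoneme j or enter it from phoneme j-1.
            prev = min(cost[i - 1][j], cost[i - 1][j - 1])
            if prev < INF:
                cost[i][j] = prev + (frames[i - 1] != phonemes[j - 1])
    if cost[n][m] == INF:
        return INF, ""  # the word has more phonemes than there are frames
    # Trace back to recover the label assigned to each frame.
    labels = []
    i, j = n, m
    while i > 0:
        labels.append(phonemes[j - 1])
        if cost[i - 1][j] > cost[i - 1][j - 1]:
            j -= 1
        i -= 1
    return cost[n][m], "".join(reversed(labels))


def apply_word_constraint(frames: str, dictionary: Dict[str, str]) -> Tuple[str, str]:
    """Return the dictionary word that fits best and its frame-level labels."""
    best_word, best_cost, best_labels = "", INF, ""
    for word, phonemes in dictionary.items():
        c, labels = align_to_word(frames, phonemes)
        if c < best_cost:
            best_word, best_cost, best_labels = word, c, labels
    return best_word, best_labels


# Hypothetical dictionary: word -> phoneme string
words = {"asa": "asa", "aka": "aka", "kasa": "kasa"}
print(apply_word_constraint("aaaskssaaa", words))  # ('asa', 'aaassssaaa')
```

With this toy dictionary the stray "k" is absorbed into the "s" segment of "asa", which reproduces the corrected string given in the text.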
[0019]
Reference numeral 22 denotes a phoneme string comparison unit. It compares the word recognition result in phoneme string form, output from the acoustic likelihood calculation unit 12 when the evaluation speech is input, with the notation of the evaluation speech, and calculates the phoneme recognition rate. In this embodiment, dynamic programming (also called DP) is used as the comparison method. The calculated recognition rate is output in visible form by a display or printer (not shown).
[0020]
FIG. 3 shows an example of a hardware configuration for realizing a continuous speech recognition apparatus having such a phoneme recognition performance measuring device. The continuous speech recognition apparatus is realized by installing a continuous speech recognition program on a general-purpose computer such as a personal computer. FIG. 3 shows the configuration of the main parts of such a computer.
[0021]
In FIG. 3, the CPU 100 executes the speech recognition program to perform continuous speech recognition processing, and also executes the phoneme recognition performance measurement program described later to measure the phoneme recognition performance of the continuous speech recognition apparatus.
[0022]
The system memory 110 temporarily stores input and output data for the information processing performed by the CPU 100. The hard disk drive (abbreviated as HDD) 130 stores the continuous speech recognition program, the data for the language model, the data used by the acoustic model, and so on, as well as the word dictionary (corresponding to 21-1 in FIG. 2) and the phoneme recognition performance measurement program of the present invention. In response to an execution instruction from a keyboard, mouse, or the like (not shown), the designated program is loaded from the HDD 130 into the system memory 110 and then executed by the CPU 100.
[0023]
The input interface (I/O) 120 performs A/D conversion on the speech signal input from a microphone (not shown) and passes the digital speech signal to the CPU 100 for continuous speech recognition.
[0024]
In such a configuration, when the user instructs execution of the phoneme recognition performance measurement program using a keyboard or a mouse, the phoneme recognition performance measurement program of FIG. 4 is loaded from the HDD 130 to the system memory 110 and executed by the CPU 100.
[0025]
The user is assumed to store the phoneme notation of the evaluation speech in the HDD 130 in advance, in the form of a document file. The phoneme notation file may be created with word processor software on the personal computer, or may be transferred off-line to the HDD 130 via a floppy disk or the like.
[0026]
The user inputs the evaluation speech from a microphone (not shown). The input speech is passed to the CPU 100 via the I/O 120. The CPU 100 performs phoneme recognition using the acoustic model portion (for example, in the form of a subroutine or function) of a conventional continuous speech recognition program, and sequentially stores the phoneme recognition results in a work area in the system memory 110.
[0027]
The CPU 100 retrieves from the word dictionary 21-1 the word (in phoneme string form) that is most similar to the phoneme recognition result. This processing corresponds to the processing of the phoneme string generation unit 21 and the acoustic likelihood calculation unit 12 in FIG. 2. More specifically, the phoneme strings of the words are read out one by one from the word dictionary 21-1, each is compared with the output of the acoustic model 10 under comparison, that is, the recognized phoneme string, and a likelihood is calculated by the same method as in the prior art. The likelihood calculated first is temporarily stored as a provisional maximum, together with the phoneme string and/or its storage position in the word dictionary 21-1. The likelihood of each subsequently read word is then compared with the provisional maximum; if it is larger, the provisional maximum and its associated data stored so far are replaced with the likelihood of that word and its phoneme string and/or storage position.
[0028]
In this way, every phoneme string of the words recorded in the word dictionary 21-1 is compared with the phoneme string under comparison, that is, a likelihood is calculated for each. The provisional maximum then held in the system memory 110, together with the corresponding phoneme string (notation) and storage position in the word dictionary, is determined as the word recognition result.
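The provisional-maximum search over the dictionary can be sketched as follows. The patent only says the likelihood is calculated "by the same method as in the prior art", so the scoring function is passed in as a parameter; `best_word` and `toy_likelihood` are hypothetical names, and the toy score (negative count of differing positions) is merely a stand-in for a real likelihood.

```python
from typing import Callable, Sequence, Tuple


def best_word(
    observed: str,
    dictionary: Sequence[str],
    likelihood: Callable[[str, str], float],
) -> Tuple[int, str, float]:
    """Scan every dictionary entry, keeping a provisional maximum.

    Returns (storage position, phoneme string, likelihood) of the winner.
    """
    if not dictionary:
        raise ValueError("dictionary must not be empty")
    best_index, best_score = 0, likelihood(observed, dictionary[0])
    for index in range(1, len(dictionary)):
        score = likelihood(observed, dictionary[index])
        # Replace the provisional maximum only when the new score is larger.
        if score > best_score:
            best_index, best_score = index, score
    return best_index, dictionary[best_index], best_score


def toy_likelihood(observed: str, entry: str) -> float:
    """Stand-in score: fewer differing positions means a higher value."""
    mismatches = sum(a != b for a, b in zip(observed, entry))
    return -(mismatches + abs(len(observed) - len(entry)))


print(best_word("aska", ["aka", "asa", "kasa"], toy_likelihood))  # (1, 'asa', -2)
```

Because the comparison is strictly "larger than", the earliest entry wins when two words tie, which matches the replacement rule stated in the text.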
[0029]
The determined phoneme string is stored in the speech recognition result storage area of the system memory 110 (step S10 in FIG. 4).
[0030]
The above processing is repeated at fixed time intervals until the speech input ends (the loop of steps S10 to S20), so that the word recognition results are accumulated in phoneme string form in the speech recognition result storage area of the system memory 110.
[0031]
When speech recognition of the input speech is finished, the CPU 100 reads (inputs) the phoneme notation file of the evaluation speech from the HDD 130 into the work area of the system memory 110. The CPU 100 then calculates the phoneme recognition rate by dynamic programming, that is, DP matching, using the phoneme string held in the word recognition result storage area of the system memory 110 and the phoneme notation of the evaluation speech held in the work area (steps S30 and S40 in FIG. 4).
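A minimal sketch of DP matching for this step is shown below. The patent does not give a formula for the phoneme recognition rate, so the sketch assumes the common definition of one minus the edit distance (substitutions, deletions, and insertions) divided by the number of reference phonemes; `edit_distance` and `phoneme_recognition_rate` are hypothetical names.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions (DP matching)."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                cur[j - 1] + 1,          # insertion of an extra phoneme
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def phoneme_recognition_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: 1 - (S + D + I) / N, with N reference phonemes."""
    if not ref:
        raise ValueError("reference notation must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


reference = list("asagakita")   # phoneme notation of the evaluation speech
recognized = list("asakakita")  # word-constrained recognition result
print(phoneme_recognition_rate(reference, recognized))
```

Under this definition the rate can become negative when the recognized string contains many insertions, so a real tool might report substitutions, deletions, and insertions separately.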
[0032]
The calculated phoneme recognition rate is output as a measurement value of the phoneme recognition performance to a display or a printer (step S50 in FIG. 4).
[0033]
By looking at the output measurement, the user can learn the phoneme recognition performance of the word-constrained phoneme recognition function used in the continuous speech recognition program.
[0034]
In addition to this embodiment, the following variations can be implemented.
[0035]
1) In the above-described embodiment, the continuous speech recognition program and the phoneme recognition performance measurement program are separate, but the phoneme recognition performance measurement program can also be incorporated into the continuous speech recognition program. In this case, a mode instruction indicating whether to perform continuous speech recognition or phoneme recognition performance measurement is received from the user via the keyboard or mouse. When continuous speech recognition is instructed, the acoustic likelihood calculation unit 12 of the continuous speech recognition program is set to use the language model; when phoneme recognition performance measurement is instructed, it is switched to the phoneme string generation unit (see FIG. 5).
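The mode switch in variation 1) might be sketched as below. The `Mode` enum and `select_word_source` function are hypothetical, and the two word sources are represented by plain strings purely for illustration.

```python
from enum import Enum
from typing import TypeVar

T = TypeVar("T")


class Mode(Enum):
    RECOGNITION = "continuous speech recognition"
    MEASUREMENT = "phoneme recognition performance measurement"


def select_word_source(mode: Mode, language_model: T, phoneme_string_generator: T) -> T:
    """Pick what feeds the acoustic likelihood calculation unit for this mode."""
    if mode is Mode.RECOGNITION:
        return language_model           # words plus connection probabilities
    return phoneme_string_generator     # words only, no connection probabilities


print(select_word_source(Mode.MEASUREMENT, "language model 11", "phoneme string generation unit 21"))
```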
[0036]
Furthermore, in the form described above the word connection probabilities output from the language model 11 are ignored. Alternatively, as shown in FIG. 6, if a fixed value is given to the acoustic likelihood calculation unit 12 as the word connection probability during phoneme recognition performance measurement, the word information in phoneme string form held by the language model can be used in place of the word dictionary (21-1) used for measuring the phoneme recognition performance.
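The effect of a fixed connection probability can be illustrated with a small sketch. It assumes that the acoustic likelihood calculation unit combines scores by adding the acoustic log-likelihood and the log of the connection probability, which the patent does not state; all numbers and the name `rank_words` are invented for illustration.

```python
import math
from typing import Callable, Dict, List


def rank_words(
    acoustic_loglik: Dict[str, float],
    connection_prob: Callable[[str], float],
) -> List[str]:
    """Rank candidate words by acoustic log-likelihood plus log connection probability."""
    return sorted(
        acoustic_loglik,
        key=lambda w: acoustic_loglik[w] + math.log(connection_prob(w)),
        reverse=True,
    )


# Hypothetical acoustic scores for three candidate words.
acoustic = {"asa": -4.0, "aka": -5.0, "kasa": -4.5}
# Hypothetical connection probabilities given the preceding word.
bigram = {"asa": 0.01, "aka": 0.6, "kasa": 0.3}

print(rank_words(acoustic, lambda w: bigram[w]))  # ['aka', 'kasa', 'asa']
print(rank_words(acoustic, lambda w: 0.5))        # ['asa', 'kasa', 'aka']
```

With a fixed value every word receives the same additive term, so the ranking depends only on the acoustic score while the word list itself still comes from the language model.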
[0037]
2) In the above-described embodiment, the recording medium that stores the phoneme recognition performance measurement program is a hard disk drive (HDD), but it may instead be an IC memory such as a ROM or RAM, or a portable recording medium such as a floppy disk or CD-ROM.
[0038]
3) In the above-described embodiment, the phoneme recognition rate is used as the measured value of phoneme recognition performance. The value of the phoneme recognition rate itself may be output, or the rate may be expressed in grades, for example good, normal, or bad.
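A graded output of the kind mentioned in variation 3) could look like the sketch below. The thresholds 0.9 and 0.7 and the function name `grade` are assumptions; the patent gives no boundary values.

```python
def grade(rate: float, good: float = 0.9, normal: float = 0.7) -> str:
    """Map a phoneme recognition rate to a coarse grade (thresholds are assumed)."""
    if rate >= good:
        return "good"
    if rate >= normal:
        return "normal"
    return "bad"


for r in (0.95, 0.8, 0.4):
    print(r, grade(r))
```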
[0039]
4) The above-described embodiment is one embodiment for explaining the present invention, and modifications other than those described above can be made to it in accordance with the technical idea set out in the claims. Such modifications also fall within the scope of this patent.
[0040]
[Effects of the invention]
As described above, according to the present invention, the phoneme recognition performance of a continuous speech recognition apparatus can be measured correctly by constraining the phoneme recognition result of the acoustic model, for example in units of words (that is, by excluding from the phoneme recognition result phonemes that do not exist in the language for character strings such as words).
[0041]
In addition, by referring to the measured phoneme recognition performance of the continuous speech recognition apparatus when analyzing the recognition performance of the apparatus as a whole, that analysis can be carried out efficiently.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a typical configuration of a conventional speech recognition apparatus.
FIG. 2 is a block diagram showing a basic configuration of the embodiment of the present invention.
FIG. 3 is a block diagram showing a hardware configuration of an embodiment of the present invention.
FIG. 4 is a flowchart showing a phoneme recognition performance measurement processing procedure of a CPU.
FIG. 5 is a block diagram showing another embodiment of the present invention.
FIG. 6 is a block diagram showing still another embodiment of the present invention.
[Explanation of symbols]
10 Acoustic model
11 Language model
12 Acoustic likelihood calculation unit
21 Phoneme string generation unit
21-1 Word dictionary
100 CPU
110 System memory
120 I/O
130 HDD

Claims

A phoneme recognition performance measuring device for a continuous speech recognition apparatus, which measures the phoneme recognition performance of a continuous speech recognition apparatus that performs continuous speech recognition using a language model and an acoustic model, the device comprising:
input means for inputting a phoneme-string notation of evaluation speech;
phoneme processing means for inputting the evaluation speech to the continuous speech recognition apparatus, performing phoneme recognition on the evaluation speech with the acoustic model, and constraining the phonemes of the acoustic model's phoneme recognition result on the basis of character strings that exist in the language; and
performance measuring means for calculating a phoneme recognition rate by comparing the phoneme string output from the phoneme processing means with the phoneme-string notation of the evaluation speech input from the input means, and outputting the calculated phoneme recognition rate as a measured value of phoneme recognition performance,
wherein phoneme strings that do not exist in the language are excluded by the phoneme processing means constraining the phonemes, and the processing result of the phoneme processing means, from which the phoneme strings that do not exist in the language have been excluded, is used for the performance measurement by the performance measuring means.

The phoneme recognition performance measuring apparatus of the continuous speech recognition apparatus according to claim 1, wherein the character string existing in the language is a word.