JP2001100789A

JP2001100789A - Instrument for measuring phoneme recognition capacity in continuous speech recognition device

Info

Publication number: JP2001100789A
Application number: JP27332899A
Authority: JP
Inventors: Takeshi Kobayakawa; 健小早川; Hiroyuki Segi; 寛之世木; Toru Imai; 亨今井; Akio Ando; 彰男安藤
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1999-09-27
Filing date: 1999-09-27
Publication date: 2001-04-13
Anticipated expiration: 2019-09-27
Also published as: JP3908878B2

Abstract

PROBLEM TO BE SOLVED: To correctly measure a phoneme recognition capacity of a continuous speech recognition device. SOLUTION: A speech for evaluation is inputted to the continuous speech recognition device to make an acoustic model 10 execute phoneme recognition. A phoneme string of a word generated by a phoneme string generation part 21 is used to put word restrictions on the phoneme recognition result by a speech likelihood calculation part 12. A speech recognition ratio is calculated in a phoneme comparison part 22 by a phoneme string outputted as the result of word restrictions and an expression of the speech for evaluation, and the calculation result is outputted as a measured value of the phoneme recognition capacity.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a phoneme recognition performance measuring device for a continuous speech recognition device that recognizes continuous speech using an acoustic model and a language model.

[0002]

[Prior art] Speech recognition devices that recognize continuous speech using an acoustic model and a stochastic language model are known ("Speech Recognition by Stochastic Models", Seiichi Nakagawa, IEICE). FIG. 1 shows the functional configuration of a conventional speech recognition device of this kind.

[0003] In FIG. 1, reference numeral 10 denotes an acoustic model that performs phoneme recognition on the input speech and outputs each recognized phoneme (as a label) together with its likelihood; the HMM (hidden Markov model) is a well-known example. Reference numeral 11 denotes a language model, which holds, in the form of a database, words expressed as phoneme strings and the connection probabilities between adjacent words. Reference numeral 12 denotes an acoustic likelihood calculation unit. It uses the phoneme sequence and likelihoods output in time series from the acoustic model 10, together with the words and connection probabilities recorded in the database of the language model 11, and outputs as the word recognition result the word in the language model that has the highest likelihood for the input phoneme string.

[0004] In a continuous speech recognition device configured in this way, continuous speech undergoes phoneme recognition in the acoustic model 10 and then word recognition in the acoustic likelihood calculation unit 12, and a word string, that is, a sentence, is output.

[0005] Measuring the performance of the acoustic model used in a continuous speech recognition device has been proposed in Matsuoka, Ohtsuki, Mori, Furui, and Shirai, IEICE Transactions, December 1996, p. 2125, and in Matsuoka, Ohtsuki, Mori, Furui, and Shirai, Proceedings of the Acoustical Society of Japan, March 1997 issue, p. 2-6-11.

[0006]

[Problem to be solved by the invention] To measure the performance of an acoustic model, evaluation speech whose phoneme notation is known in advance is input to the acoustic model, and the phoneme string obtained as the phoneme recognition result is compared with the notation of the evaluation speech to obtain the phoneme recognition rate. Here, a phoneme refers to a phonological unit such as a vowel or consonant, or to a portion of speech shorter than such a unit. A phoneme output from the acoustic model means the identifier (label) or the like of a recognized phoneme.

[0007] Using the phoneme recognition rate of the acoustic model as an index of the phoneme recognition performance of a continuous speech recognition device has had the following problem.

[0008] The acoustic model 10 performs phoneme recognition by comparing acoustic features obtained from the input speech with the acoustic features of phonemes held in the acoustic model in advance.

[0009] The acoustic model by itself performs only phoneme recognition, not word recognition, so it produces phoneme misrecognitions that could not occur in any linguistically valid sentence or sequence of words.

[0010] In a continuous speech recognition device, such misrecognitions are corrected during word recognition by the acoustic likelihood calculation unit 12. Consequently, if the phoneme recognition rate of the acoustic model alone is used as the measure of the phoneme recognition performance of the continuous speech recognition device, the measured value tends to be lower than the actual performance.

[0011] In view of the above, an object of the present invention is to provide a phoneme recognition performance measuring device for a continuous speech recognition device, and a recording medium, that can correctly measure the phoneme recognition performance of the continuous speech recognition device with the grammatical connection relations of the language (word connection probabilities) excluded.

[0012]

[Means for solving the problem] To achieve this object, the invention of claim 1 is a phoneme recognition performance measuring device for a continuous speech recognition device, which measures the phoneme recognition performance of a continuous speech recognition device that performs continuous speech recognition using a language model and an acoustic model, characterized by comprising: input means for inputting the phoneme-string notation of evaluation speech; phoneme processing means for inputting the evaluation speech to the continuous speech recognition device, performing phoneme recognition on the evaluation speech with the acoustic model, and constraining the phoneme recognition result of the acoustic model to character strings that exist in the language; and performance measuring means for calculating a phoneme recognition rate by comparing the phoneme string output from the phoneme processing means with the phoneme-string notation of the evaluation speech input from the input means, and outputting the calculated phoneme recognition rate as the measured value of phoneme recognition performance.

[0013] The invention of claim 2 is the phoneme recognition performance measuring device for a continuous speech recognition device according to claim 1, characterized in that the character string that exists in the language is a word.

[0014] The invention of claim 3 is a recording medium storing a program that is installed and executed in a continuous speech recognition device that performs continuous speech recognition using a language model and an acoustic model, the program being for measuring phoneme recognition performance, characterized by comprising: an input step of inputting the phoneme-string notation of evaluation speech; a phoneme processing step of inputting the evaluation speech to the continuous speech recognition device, performing phoneme recognition on the evaluation speech with the acoustic model, and constraining the phoneme recognition result of the acoustic model to character strings that exist in the language; and a performance measuring step of calculating a phoneme recognition rate by comparing the phoneme string output from the phoneme processing means with the phoneme-string notation of the evaluation speech input from the input means, and outputting the calculated phoneme recognition rate as the measured value of phoneme recognition performance.

[0015] The invention of claim 4 is the recording medium according to claim 3, characterized in that the character string that exists in the language is a word.

[0016]

[Embodiments of the invention] Embodiments of the present invention are described below in detail with reference to the drawings.

[0017] FIG. 2 shows the basic configuration of a phoneme recognition performance measuring device provided within a continuous speech recognition device. In FIG. 2, parts identical to those of the conventional example in FIG. 1 bear the same reference numerals, and their detailed description is omitted. Reference numeral 21 denotes a phoneme string generation unit that has a word dictionary 21-1 and supplies the words recorded in the word dictionary 21-1, in phoneme-string form, one after another to the acoustic likelihood calculation unit 12. The acoustic likelihood calculation unit 12 then compares the plurality of phonemes output from the acoustic model 10 over a fixed period of time, that is, a phoneme string, with the phoneme strings of the words supplied from the phoneme string generation unit 21, and outputs the most similar word (in phoneme-string form) as the word recognition result.

[0018] Note that in this embodiment no word connection probabilities are supplied to the acoustic likelihood calculation unit 12, so the connection relations between words and the like in the language are not taken into account, and the speech recognition result is simply the phoneme string of a word that can exist in the language. For example, suppose that for the evaluation speech "asa" (朝, "morning") the acoustic model outputs the phoneme recognition result "aaaskssaaa", in which a linguistically nonexistent phoneme "k" is mixed in. If the words supplied from the phoneme string generation unit 21 include the phoneme string corresponding to "asa", the phoneme "k" in the recognition result is corrected and "aaassssaaa" is output as the word recognition result. In this specification, phoneme processing that eliminates linguistically nonexistent phonemes in this way is referred to as "applying a word constraint to the phoneme recognition result of the acoustic model". In this embodiment the unit of character string to which the constraint applies is the word, but depending on the application it may be a phrase or a sentence; in that case, the word dictionary 21-1 used by the phoneme string generation unit 21 describes phoneme strings of the corresponding unit.
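The word constraint in the "asa" example can be sketched as follows. This is only an illustration under stated assumptions: the patent scores candidates with acoustic likelihoods, whereas this sketch substitutes a simple count of agreeing frame labels, and the function names `align_to_word` and `constrain` and the three-word dictionary are hypothetical.

```python
def align_to_word(frames: str, word: str) -> tuple[int, str]:
    """Best monotone alignment of frame labels to a word's phoneme string.

    Returns (number of frames whose label agrees with the word, relabelled
    frames). Every phoneme of the word must cover at least one frame.
    A score of -1 means the word has more phonemes than there are frames.
    """
    n_frames, n_phones = len(frames), len(word)
    if n_phones == 0 or n_frames < n_phones:
        return -1, ""
    unreachable = -1
    # best[t][j]: most agreeing frames in frames[:t + 1] with frame t on word[j]
    best = [[unreachable] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    best[0][0] = int(frames[0] == word[0])
    for t in range(1, n_frames):
        for j in range(min(t + 1, n_phones)):
            stay = best[t - 1][j]
            advance = best[t - 1][j - 1] if j > 0 else unreachable
            prev, src = (stay, j) if stay >= advance else (advance, j - 1)
            if prev == unreachable:
                continue
            best[t][j] = prev + int(frames[t] == word[j])
            back[t][j] = src
    # Trace back from the last frame on the last phoneme.
    labels, j = [], n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        labels.append(word[j])
        j = back[t][j]
    return best[-1][-1], "".join(reversed(labels))


def constrain(frames: str, dictionary: list[str]) -> tuple[str, str]:
    """Pick the dictionary word that explains the most frames (stand-in score)."""
    scored = [(align_to_word(frames, word), word) for word in dictionary]
    (_, labels), word = max(scored, key=lambda item: item[0][0])
    return word, labels


print(constrain("aaaskssaaa", ["aka", "asa", "ase"]))
```

With the toy dictionary above, the stray "k" frame is relabelled because "asa" agrees with nine of the ten frames, more than either competing word.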

[0019] Reference numeral 22 denotes a phoneme string comparison unit, which calculates the phoneme recognition rate by comparing the word recognition result, in phoneme-string form, output from the acoustic likelihood calculation unit 12 when the evaluation speech is input, with the notation of the evaluation speech. In this embodiment the comparison uses dynamic programming (also abbreviated DP). The calculated recognition rate is output in visible form on a display or printer (not shown).
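A minimal sketch of a DP comparison of this kind is given below. The patent does not state the exact formula for the phoneme recognition rate, so the definition used here, one minus the edit distance divided by the reference length, is an assumption (it is a common accuracy-style definition), and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions (DP matching)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # phoneme missing from hyp
                           cur[j - 1] + 1,           # extra phoneme in hyp
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def phoneme_recognition_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: 1 - edit_distance / len(ref)."""
    if not ref:
        raise ValueError("reference phoneme string must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Reference notation versus a recognition result with one dropped phoneme.
print(phoneme_recognition_rate("asanogohan", "asanogoan"))
```

Because insertions are counted as errors under this definition, the rate can fall below zero for very noisy output; a definition that ignores insertions would avoid that at the cost of not penalizing them.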

[0020] FIG. 3 shows an example of a hardware configuration for realizing a continuous speech recognition device that has such a phoneme recognition performance measuring device. The continuous speech recognition device is realized by installing a continuous speech recognition program on a general-purpose computer, for example a personal computer. FIG. 3 shows the configuration of the main parts of such a personal computer.

[0021] In FIG. 3, the CPU 100 executes the speech recognition program to perform continuous speech recognition processing, and also executes the phoneme recognition performance measurement program described later to measure the phoneme recognition performance of the continuous speech recognition device.

[0022] The system memory 110 temporarily stores input and output data for the information processing performed by the CPU 100. The hard disk drive (abbreviated HDD) 130 stores the continuous speech recognition program, the data for the language model, the data used by the acoustic model, and so on, as well as the word dictionary of the present invention (corresponding to 21-1 in FIG. 2) and the phoneme recognition performance measurement program. In response to an execution instruction from a keyboard, mouse, or the like (not shown), the specified program is loaded from the HDD 130 into the system memory 110 and then executed by the CPU 100.

[0023] The input interface (I/O) 120 A/D-converts the speech signal input from a microphone (not shown) and passes the resulting digital speech signal to the CPU 100 for continuous speech recognition.

[0024] In this configuration, when the user instructs execution of the phoneme recognition performance measurement program with the keyboard, mouse, or the like, the phoneme recognition performance measurement program of FIG. 4 is loaded from the HDD 130 into the system memory 110 and executed by the CPU 100.

[0025] The user is assumed to have stored the phoneme notation of the evaluation speech on the HDD 130 in advance as a document file. The phoneme notation file may be created with word-processing software on the personal computer, or transferred offline to the HDD 130 via a floppy disk or the like.

[0026] The user inputs the evaluation speech through a microphone (not shown). The input speech is passed to the CPU 100 via the I/O 120. The CPU 100 performs phoneme recognition using the acoustic model portion (for example, in the form of a subroutine or function) of a continuous speech recognition program like the conventional one, and stores the phoneme recognition results one after another in a work area in the system memory 110.

[0027] For each phoneme recognition result, the CPU 100 retrieves the most similar word (in phoneme-string form) from the word dictionary 21-1. This processing corresponds to the processing of the phoneme string generation unit 21 and the acoustic likelihood calculation unit 12 in FIG. 2. More specifically, the phoneme strings of the words are read out one by one from the word dictionary 21-1, each is compared with the output of the acoustic model 10 under comparison, that is, the recognized phoneme string, and a likelihood is calculated in the same way as in the conventional device. Words continue to be read out of the word dictionary in turn and compared with the phoneme string under comparison to calculate a likelihood. The likelihood calculated first is stored temporarily as the provisional maximum, together with its phoneme string and/or its storage position in the word dictionary 21-1. The likelihood of the next word read out is then compared with the provisional maximum; if the new likelihood is larger, the phoneme string and/or storage position of that word and its likelihood replace the provisional maximum and associated data stored so far.
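The provisional-maximum search described here can be sketched as a single pass over the dictionary. The likelihood function is passed in as a parameter because the patent only says it is computed "in the same way as in the conventional device"; `toy_likelihood` below is a hypothetical stand-in, not the patent's calculation.

```python
from typing import Callable, Optional, Sequence, Tuple


def search_dictionary(
    observed: str,
    dictionary: Sequence[str],
    likelihood: Callable[[str, str], float],
) -> Optional[Tuple[float, int, str]]:
    """Return (likelihood, storage position, phoneme string) of the best word."""
    best = None  # provisional maximum and its associated data
    for position, word in enumerate(dictionary):
        score = likelihood(observed, word)
        # Replace the provisional maximum only when the new value is larger,
        # so the first word read out wins any tie.
        if best is None or score > best[0]:
            best = (score, position, word)
    return best


def toy_likelihood(observed: str, word: str) -> int:
    """Hypothetical stand-in: fewer positional mismatches means higher score."""
    mismatches = sum(o != w for o, w in zip(observed, word))
    return -(mismatches + abs(len(observed) - len(word)))


print(search_dictionary("asa", ["aka", "asa", "ase"], toy_likelihood))
```

Keeping the storage position alongside the phoneme string mirrors the "and/or" in the description: either is enough to recover the recognized word afterwards.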

[0028] In this way, every word phoneme string recorded in the word dictionary 21-1 is compared with the phoneme string under comparison, that is, its likelihood is calculated. The provisional maximum then held in the system memory 110, together with the corresponding phoneme string (notation) and storage position in the word dictionary, is taken as the word recognition result.

[0029] The phoneme string so determined is stored in the speech recognition result storage area of the system memory 110 (step S10 in FIG. 4).

[0030] When this processing is repeated at fixed time intervals until the speech input ends (the loop of steps S10 to S20), the word recognition results accumulate, in phoneme-string form, in the speech recognition result storage area of the system memory 110.

[0031] When speech recognition of the input speech has finished, the CPU 100 reads (inputs) the file containing the phoneme notation of the evaluation speech from the HDD 130 into the work area of the system memory 110. The CPU 100 then uses the phoneme strings temporarily stored in the word recognition result storage area of the system memory 110 and the phoneme notation of the evaluation speech stored in the work area to calculate the phoneme recognition rate by dynamic programming, that is, by DP matching (steps S30 → S40 in FIG. 4).

[0032] The calculated phoneme recognition rate is output to a display or printer as the measured value of phoneme recognition performance (step S50 in FIG. 4).
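The flow of steps S10 to S50 can be summarized in a short skeleton. It is a sketch only: the recognizer, the reference loader and the comparison function are injected as parameters, and the stand-ins used in the usage lines (a lookup table and a positional comparison in place of DP matching) are hypothetical.

```python
from typing import Callable, Iterable


def measure_phoneme_recognition(
    segments: Iterable[str],
    recognize_word: Callable[[str], str],
    load_reference: Callable[[], str],
    compare: Callable[[str, str], float],
) -> float:
    recognized: list[str] = []                   # recognition result storage area
    for segment in segments:                     # loop S10 to S20 until input ends
        recognized.append(recognize_word(segment))  # S10: store constrained result
    reference = load_reference()                 # S30: read the phoneme notation
    rate = compare(reference, "".join(recognized))  # S40: compute the rate
    print(f"phoneme recognition rate: {rate:.1%}")  # S50: output the measurement
    return rate


# Usage with toy stand-ins for the recognizer and the comparison.
words = {"aaaskssaaa": "asa", "hhiirruu": "hiru"}
rate = measure_phoneme_recognition(
    ["aaaskssaaa", "hhiirruu"],
    recognize_word=words.__getitem__,
    load_reference=lambda: "asahiru",
    compare=lambda ref, hyp: sum(r == h for r, h in zip(ref, hyp)) / len(ref),
)
```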

[0033] By looking at the output measured value, the user can learn the phoneme recognition performance of the word-constrained phoneme recognition function used in the continuous speech recognition program.

[0034] In addition to this embodiment, the following forms can be implemented.

[0035] 1) In the embodiment above, the continuous speech recognition program and the phoneme recognition performance measurement program are separate, but the measurement program can also be incorporated into the continuous speech recognition program. In that case, a mode instruction specifying whether to perform continuous speech recognition or phoneme recognition performance measurement is accepted from the user via the keyboard or mouse. When continuous speech recognition is specified, the acoustic likelihood calculation unit 12 of the continuous speech recognition program is set to use the language model; when phoneme recognition performance measurement is specified, it is switched to the phoneme string generation unit (see FIG. 5).

[0036] Furthermore, in the configuration above the word connection probabilities output from the language model 11 are ignored. However, as shown in FIG. 6, if a fixed value is supplied to the acoustic likelihood calculation unit 12 as the word connection probability during phoneme recognition performance measurement, the word information in phoneme-string form held by the language model can be used in place of the word dictionary (21-1) used for measuring phoneme recognition performance.
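The effect of supplying a fixed connection probability can be illustrated with a small sketch. It assumes, as is common but not stated in the patent, that the acoustic likelihood calculation adds a log acoustic score to the log of the connection probability; the function name, the scores and the word pairs are all hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def pick_word(
    acoustic_loglik: Dict[str, float],
    connection_prob: Dict[Tuple[str, str], float],
    prev_word: str,
    fixed_prob: Optional[float] = None,
) -> str:
    """Choose the word with the highest combined score.

    In recognition mode (fixed_prob is None) the language model's connection
    probability is used. In measurement mode the same constant is used for
    every word, so the ranking depends on the acoustic scores alone.
    """
    def total(word: str) -> float:
        if fixed_prob is not None:
            p = fixed_prob
        else:
            p = connection_prob.get((prev_word, word), 1e-12)
        return acoustic_loglik[word] + math.log(p)

    return max(acoustic_loglik, key=total)


acoustic = {"asa": -4.0, "aka": -3.5}
connection = {("kyou", "asa"): 0.9, ("kyou", "aka"): 0.1}
print(pick_word(acoustic, connection, "kyou"))                  # recognition mode
print(pick_word(acoustic, connection, "kyou", fixed_prob=0.5))  # measurement mode
```

In this toy case the language model overrides the slightly better acoustic score of "aka" in recognition mode, while the fixed value removes that influence during measurement.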

[0037] 2) In the embodiment above, the recording medium that stores the phoneme recognition performance measurement program is a hard disk drive (HDD), but the recording medium may instead be an IC memory such as a ROM or RAM, or a portable recording medium such as a floppy disk or CD-ROM.

[0038] 3) In the embodiment above, the phoneme recognition rate serves as the measured value of phoneme recognition performance. The value of the phoneme recognition rate itself may be used, or the rate may be expressed in grades, for example good, normal, or bad.
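A graded output of this kind could be as simple as the following sketch. The patent names the grades but gives no boundaries, so the thresholds 0.9 and 0.7 and the function name `grade` are assumptions chosen only for illustration.

```python
def grade(rate: float, good: float = 0.9, normal: float = 0.7) -> str:
    """Map a phoneme recognition rate to a coarse grade (assumed thresholds)."""
    if rate >= good:
        return "good"
    if rate >= normal:
        return "normal"
    return "bad"


for value in (0.95, 0.80, 0.40):
    print(value, grade(value))
```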

[0039] 4) The embodiment above is one embodiment for describing the present invention, and modifications other than those described above are possible in accordance with the technical idea set out in the claims. Such modifications fall within the scope of rights of this patent.

[0040]

[Effect of the invention] As described above, according to the present invention, the phoneme recognition performance of a continuous speech recognition device can be measured correctly by constraining the phoneme recognition result of the acoustic model, for example on a word-by-word basis (that is, by excluding from the phoneme recognition result phonemes that do not occur in the language within character strings such as words).

[0041] In addition, by referring to the measured phoneme recognition performance when analyzing the recognition performance of the continuous speech recognition device as a whole, that overall performance can be analyzed efficiently.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a typical configuration of a conventional speech recognition device.

FIG. 2 is a block diagram showing the basic configuration of an embodiment of the present invention.

FIG. 3 is a block diagram showing the hardware configuration of the embodiment of the present invention.

FIG. 4 is a flowchart showing the phoneme recognition performance measurement procedure executed by the CPU.

FIG. 5 is a block diagram showing another form of the embodiment of the present invention.

FIG. 6 is a block diagram showing yet another form of the embodiment of the present invention.

[Explanation of symbols]

10: acoustic model
11: language model
12: acoustic likelihood calculation unit
21: phoneme string generation unit
21-1: word dictionary
100: CPU
110: system memory
120: I/O
130: HDD

Continued from the front page

(72) Inventor: Toru Imai, c/o Broadcasting Technology Research Laboratories, Japan Broadcasting Corporation, 1-10-11 Kinuta, Setagaya-ku, Tokyo

(72) Inventor: Akio Ando, c/o Broadcasting Technology Research Laboratories, Japan Broadcasting Corporation, 1-10-11 Kinuta, Setagaya-ku, Tokyo

F-term (reference): 5D015 AA01 BB02 HH11 LL00

Claims

[Claims]

1. A phoneme recognition performance measuring device for a continuous speech recognition device, which measures the phoneme recognition performance of a continuous speech recognition device that performs continuous speech recognition using a language model and an acoustic model, comprising: input means for inputting the phoneme-string notation of evaluation speech; phoneme processing means for inputting the evaluation speech to the continuous speech recognition device, performing phoneme recognition on the evaluation speech with the acoustic model, and constraining the phoneme recognition result of the acoustic model to character strings that exist in the language; and performance measuring means for calculating a phoneme recognition rate by comparing the phoneme string output from the phoneme processing means with the phoneme-string notation of the evaluation speech input from the input means, and outputting the calculated phoneme recognition rate as the measured value of phoneme recognition performance.

2. The phoneme recognition performance measuring device for a continuous speech recognition device according to claim 1, wherein the character string existing in the language is a word.

3. A recording medium storing a program that is installed and executed in a continuous speech recognition device that performs continuous speech recognition using a language model and an acoustic model, the program being for measuring phoneme recognition performance, comprising: an input step of inputting the phoneme-string notation of evaluation speech; a phoneme processing step of inputting the evaluation speech to the continuous speech recognition device, performing phoneme recognition on the evaluation speech with the acoustic model, and constraining the phoneme recognition result of the acoustic model to character strings that exist in the language; and a performance measuring step of calculating a phoneme recognition rate by comparing the phoneme string output from the phoneme processing means with the phoneme-string notation of the evaluation speech input from the input means, and outputting the calculated phoneme recognition rate as the measured value of phoneme recognition performance.

4. The recording medium according to claim 3, wherein the character string existing in the language is a word.