JP2002268679A

JP2002268679A - Method and device for detecting error of voice recognition result and error detecting program for voice recognition result

Info

Publication number: JP2002268679A
Application number: JP2001064031A
Authority: JP
Inventors: Takeshi Mishima; 剛三島; Nobumasa Seiyama; 信正清山; Atsushi Imai; 篤今井; Toru Tsugi; 徹都木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2001-03-07
Filing date: 2001-03-07
Publication date: 2002-09-20

Abstract

PROBLEM TO BE SOLVED: To support efficient manual correction by automatically detecting the recognition errors that are unavoidable when content expressed in spoken language is converted into text using speech recognition technology. SOLUTION: Results output from a speech recognition device, which contain recognition errors, and the acoustic information corresponding to them are stored separately for correct answers and errors. Acoustic models that best discriminate correct results from incorrect ones are generated from this acoustic information. Newly input acoustic information that was the object of recognition is then compared with these models to decide whether each recognition result is correct or incorrect.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method and a device for detecting errors in speech recognition results, and to a program for detecting errors in speech recognition results. More particularly, it relates to a method of creating a correct/incorrect discrimination model, which is built from correct/incorrect information extracted from past recognition results and is used to discriminate correct results from incorrect ones, and to a recognition error correction support device that points out recognition errors using that model. The aim is to make proofreading more efficient in applications such as captioning news speech in broadcasting, caption services for hearing-impaired people at lectures, and interview articles for magazines and newspapers.

[0002]

2. Description of the Related Art

Speech recognition technology has recently been widely introduced to support the transcription of speech, but at present no speech recognition device achieves a recognition rate of 100%. In particular, for speaker-independent continuous speech recognition, the recognition result is very likely to contain errors. Consequently, when a recognition result produced by a speech recognition device is actually to be used for some purpose, some means of correction is needed.

[0003] As this means of correction, it is common for a corrector to listen to the speech that was recognized, visually review the text produced from it, and find the errors.

[0004] When a person detects errors in this way, reading the recognition result output by the speech recognition system while listening to the speech actually uttered, the person tends to correct by following the flow of the sentence rather than by attending to each individual character, and therefore often overlooks local typographical errors and omissions. In particular, because listening to the uttered speech and correcting the text are done at the same time, the probability of overlooking an error is even higher than when proofreading from the text alone.

[0005] Furthermore, in online recognition error correction, where the recognition result is used immediately (for example, when continuously uttered speech is transcribed in real time to caption a live television program), the corrector must correct the recognition result while listening to speech that is uttered at a normal speaking rate and cannot be replayed. The conditions are therefore even more severe: the corrector bears a considerable burden, and the probability of overlooking errors inevitably rises.

[0006] Thus, to put a speech recognition system into practical use, the system must be built with the correction of recognition errors taken into account. For a practical system, improving correction efficiency is as indispensable as improving transcription efficiency. At present, however, manual correction remains the more reliable means of correction, so supporting the corrector, reducing the burden of correction, and reducing the number of uncorrected errors will improve the system as a whole.

[0007]

SUMMARY OF THE INVENTION

The present invention has been made in view of the above points. Its object is to provide a method, a device, and a program for detecting errors in speech recognition results that automatically extract the erroneous portions of the recognition result output by a speech recognition system and present them effectively to the corrector, thereby reducing both the mistakes made in correction work and its burden.

[0008]

MEANS FOR SOLVING THE PROBLEMS

The present invention (claim 1) is a method for detecting errors in speech recognition results when content expressed in spoken language is converted into text by speech recognition. Based on past speech recognition results, correct acoustic information and error acoustic information are stored; this information has features analyzed in the time domain and the frequency domain and is classified as correct or erroneous for each minimum unit. A learning process is performed on the correct and error acoustic information to generate a correct/incorrect discrimination model, which is an acoustic model that best discriminates the correct acoustic information from the error acoustic information. Errors are then detected, using the correct/incorrect discrimination model, from a newly input speech recognition result and the acoustic information corresponding to it.

[0009] In the present invention (claim 2), correct acoustic information and error acoustic information are extracted by acoustic analysis from the speech signals corresponding to the correct and erroneous portions of past speech recognition results, and the extracted information is stored. Representative points of the stored correct and error acoustic information are obtained and used as initial values, and an optimal acoustic model is generated by successively updating the features contained in the initial values and the initial model parameters obtained from the representative points.

[0010] In the present invention (claim 3), when errors are detected, the speech signal output from the speech recognition device is cut out of the continuous speech, the acoustic features of each recognition unit are obtained, and the acoustic features are compared with the correct/incorrect discrimination model. The unit is judged correct or erroneous according to whether it is more similar to the correct model or to the error model of the discrimination model.

[0011] The present invention (claim 4) is a device for detecting errors in speech recognition results when content expressed in spoken language is converted into text by speech recognition. The device has learning means and detection means. The learning means stores, for past speech recognition results, correct acoustic information and error acoustic information that have features analyzed in the time domain and the frequency domain and are classified as correct or erroneous for each minimum unit. It performs a learning process on this information and generates a correct/incorrect discrimination model, which is an acoustic model that best discriminates the correct acoustic information from the error acoustic information. The detection means detects errors, using the correct/incorrect discrimination model, from a newly input speech recognition result and the acoustic information corresponding to it.

[0012] In the present invention (claim 5), the learning means has: correct/incorrect discrimination means that sorts past speech recognition results into correct and erroneous portions, extracts correct acoustic information and error acoustic information from the sorted speech signals by acoustic analysis, and stores the extracted information; acoustic model initial value generation means that obtains representative points of the stored correct and error acoustic information as initial values of the acoustic model; and discriminative learning means that, taking the representative points as initial values, generates an optimal acoustic model by successively updating the features contained in the initial values and the initial model parameters obtained from the representative points.

[0013] In the present invention (claim 6), the detection means has: speech signal extraction means that cuts the speech signal output from the speech recognition device out of the continuous speech; feature extraction means that obtains the acoustic features of each recognition unit; and correct/incorrect model matching means that compares the acoustic features with the correct/incorrect discrimination model and judges the unit correct or erroneous according to whether it is more similar to the correct model or to the error model.

[0014] The present invention (claim 7) is a program for detecting errors in speech recognition results when content expressed in spoken language is converted into text by speech recognition. The program has a learning process and a detection process. The learning process stores, for past speech recognition results, correct acoustic information and error acoustic information that have features analyzed in the time domain and the frequency domain and are classified as correct or erroneous for each minimum unit. It performs a learning process on this information and generates a correct/incorrect discrimination model, which is an acoustic model that best discriminates the correct acoustic information from the error acoustic information. The detection process detects errors, using the correct/incorrect discrimination model, from a newly input speech recognition result and the acoustic information corresponding to it.

[0015] In the present invention (claim 8), the learning process of the error detection program has: a correct/incorrect discrimination process that sorts past speech recognition results into correct and erroneous portions, extracts correct acoustic information and error acoustic information from the sorted speech signals by acoustic analysis, and stores the extracted information; an acoustic model initial value generation process that obtains representative points of the stored correct and error acoustic information as initial values of the acoustic model; and a discriminative learning process that, taking the representative points as initial values, generates an optimal acoustic model by successively updating the features contained in the initial values and the initial model parameters obtained from the representative points.

[0016] In the present invention (claim 9), the detection process of the error detection program has: a speech signal extraction process that cuts the speech signal output from the speech recognition device out of the continuous speech; a feature extraction process that obtains the acoustic features of each recognition unit; and a correct/incorrect model matching process that compares the acoustic features with the correct/incorrect discrimination model and judges the unit correct or erroneous according to whether it is more similar to the correct model or to the error model.

[0017] As described above, the present invention stores the error-containing results output from a speech recognition device, together with the corresponding acoustic information, separately for correct answers and errors, and generates from this acoustic information acoustic models that best discriminate correct results from incorrect ones. By comparing the acoustic information newly input from the speech recognition device with these acoustic models (the correct/incorrect discrimination model), it judges whether each recognition result is correct or erroneous. This makes it possible to detect the recognition errors that are unavoidable when content expressed in spoken language is converted into text by speech recognition (automatic captioning from speech, transcription, and so on), and to support efficient manual correction.

[0018]

DESCRIPTION OF THE EMBODIMENTS

FIG. 1 shows the configuration of the recognition error detection system of the present invention.

[0019] The recognition error detection system 1 shown in the figure comprises an error detection device 4 and an error presentation unit 7. The error detection device 4 comprises a learning unit 5, which takes past speech recognition results 2 as input, and a detection unit 6, which takes speech recognition results from a speech recognition device 3 as input.

[0020] The speech recognition device 3 used in the present invention need not be a recognizer built to special specifications for this system; a general-purpose recognizer is used.

[0021] FIG. 2 is a flowchart outlining the error detection processing for speech recognition results according to the present invention.

[0022] The operation of the components shown in FIG. 1 is described below with reference to FIG. 2.

[0023] Step 101) Past speech recognition results 2 (recognized character strings and speech signals) are input to the learning unit 5. The past speech recognition results are assumed to be stored in a database or the like.

[0024] Step 102) The learning unit 5 extracts features (correct/incorrect acoustic information) by acoustic analysis of the speech signals corresponding to the correct and erroneous portions of the input past speech recognition results.

[0025] Step 103) A correct/incorrect discrimination model that best discriminates the extracted correct/incorrect acoustic information is generated by discriminative learning and output to the detection unit 6.

[0026] Step 104) A new speech recognition result (recognized character string and speech signal) is input from the speech recognition device 3 to the detection unit 6.

[0027] Step 105) The detection unit 6 compares the acoustic information corresponding to the new speech recognition result with the correct/incorrect discrimination model obtained from the learning unit 5, judges whether the result is correct or erroneous, and outputs the judgment to the error presentation unit 7.

[0028] Step 106) The error presentation unit 7 displays the errors in the recognition result obtained from the detection unit 6.

[0029] As a result, by looking at what the error presentation unit 7 presents, the corrector can more easily narrow down the places that need correction, and correction work can be expected to become more efficient. In particular, in corrections that must be made in real time, the corrector only needs to attend to the places flagged as errors in the recognition result and can otherwise concentrate on listening to the speech. This lightens the load of correction and lowers the probability of overlooking errors. Because correction time is also shortened, real-time performance improves further.

[0030]

EXAMPLES

Examples of the present invention are described below with reference to the drawings.

[0031] First, the configuration of the error detection device is described with reference to FIG. 1, introduced above.

[0032] The recognition error detection system 1 comprises: the past speech recognition results 2, which are the correct and erroneous character strings previously output from the speech recognition device 3 together with the corresponding speech signals; the speech recognition device 3, which converts unknown input speech into text; the error detection device 4, which detects errors in the input from the speech recognition device 3 using the correct/incorrect discrimination model generated by the learning process; and the presentation unit 7, which presents the detection results from the error detection device 4 to the corrector.

[0033] The error detection device 4 comprises the learning unit 5 and the detection unit 6. The learning unit 5 generates the correct/incorrect discrimination model from the correct and erroneous speech signals contained in the past speech recognition results 2. The detection unit 6 selects the correct/incorrect discrimination model that corresponds to the recognition result output from the speech recognition device 3 and compares the speech signal corresponding to that recognition result with the selected model.

[0034] The past speech recognition results 2 consist of the speech signals that were subjected to speech recognition and the recognition result character strings corresponding to those signals; the information that an ordinary speech recognition device inputs and outputs is used.

[0035] Almost any device that produces a recognition result can be used as the speech recognition device 3; neither the recognition target nor the recognition method is particularly limited.

[0036] The learning unit 5 generates the correct/incorrect discrimination model used for the correct/incorrect comparison from the speech signals obtained from the past speech recognition results 2 and the error-containing character strings output for them.

[0037] The learning unit 5 is described in detail below.

[0038] FIG. 3 shows the configuration of the learning unit in one example of the present invention.

[0039] The learning unit 5 shown in the figure comprises a correct/incorrect discrimination unit 51, a correct-answer database unit 52, an error database unit 53, a correct acoustic model initial value generation unit 54, an error acoustic model initial value generation unit 55, a discriminative learning unit 56, and a correct/incorrect discrimination model 57 that has a correct acoustic model and an error acoustic model.

[0040] FIG. 4 is a flowchart of the processing in the learning unit in one example of the present invention.

[0041] The operation of the learning unit 5 shown in FIG. 3 is described below with reference to the flowchart of FIG. 4.

Step 201) Using the correct/incorrect discrimination unit 51, the learning unit 5 sorts the error-containing recognition results obtained from the past speech recognition results 2 into correct and erroneous items for each minimum unit output by the speech recognition device (a word, morpheme, or the like; hereinafter called a recognition unit).

[0042] The sorting can be done in many ways, for example by comparing the speech signal that was recognized with the recognition unit character string, or by preparing a correct character string in advance and comparing it with the recognition unit character string; the sorting method is not particularly limited.
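As one illustration of the second approach mentioned in [0042] (comparison against a correct character string prepared in advance), the following minimal sketch aligns the recognized units with a reference transcript and labels each recognized unit. The function name `label_units` and the use of Python's `difflib` alignment are assumptions made for this example; the patent does not prescribe a particular alignment method.

```python
from difflib import SequenceMatcher


def label_units(recognized, reference):
    """Label each recognized unit 'correct' or 'error' by aligning it with a reference."""
    labels = ["error"] * len(recognized)
    matcher = SequenceMatcher(a=recognized, b=reference, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":  # units that also appear, in order, in the reference
            for i in range(i1, i2):
                labels[i] = "correct"
    return list(zip(recognized, labels))


recognized = ["the", "whether", "is", "fine", "today"]
reference = ["the", "weather", "is", "fine", "today"]
for unit, label in label_units(recognized, reference):
    print(unit, label)
```

Reference words that the recognizer dropped entirely have no recognized unit to attach to, so this sketch does not label them; a fuller implementation would need its own policy for such deletions.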

[0043] Step 202) Features are extracted by acoustic analysis from the correct and erroneous speech signals sorted by the correct/incorrect discrimination unit 51. For each recognition unit, correct items are stored in the correct-answer database unit 52 and erroneous items in the error database unit 53, so that the correct/incorrect information for each recognition unit, which exists in large quantities in past recognition results, and the accompanying acoustic information accumulate in the databases. Each data item pairs the character information of a recognition unit with its acoustic information, and the acoustic information is time-series data having various features analyzed in the time domain and the frequency domain.

[0044] Step 203) The correct acoustic model initial value generation unit 54 obtains representative points of the data in the acoustic information of each recognition unit held in the correct-answer database unit 52; the discriminative learning unit 56 uses them as initial values. Representative points can be obtained in various ways, for example by a clustering method that defines a distortion measure from the distances between data items and takes the points that minimize the distortion as representative points; the method is not particularly limited. The number of representative points can also be set freely according to how the data are distributed.

[0045] Step 204) Similarly, the error acoustic model initial value generation unit 55 obtains representative points from the acoustic information of each recognition unit held in the error database unit 53.

[0046] Step 205) To find the representative points that best discriminate all the correct and erroneous data of a given recognition unit, the discriminative learning unit 56 generates an optimal acoustic model by successively updating the features contained in the initial values produced by the correct acoustic model initial value generation unit 54 and the error acoustic model initial value generation unit 55, and the initial model parameters obtained from the representative points. Many update methods exist, such as a gradient method that defines a classification error measure and changes the features and model parameters in the direction that reduces it; the learning method is not particularly limited.

[0047] Step 206) The discriminative learning unit 56 stores the correct and error model data obtained by learning in the correct acoustic model and the error acoustic model, respectively, as the correct/incorrect discrimination model 57 for each recognition unit.

[0048] Step 207) The model data of the correct/incorrect discrimination model 57 are sent to the detection unit 6.

[0049] A clustering method is now described as one example of how the representative points in the acoustic information of each recognition unit are obtained in step 203 of the flowchart of FIG. 4.

[0050] First, in the correct acoustic model initial value generation unit 54 and the error acoustic model initial value generation unit 55, clusters that effectively represent the characteristics of correct and erroneous items are extracted as initial values for discriminative learning. The data collected in the correct-answer database unit 52 and in the error database unit 53 are each clustered with the k-means algorithm to determine the representative points. Because the samples used for clustering are time-series data, the distance d_ij between speech patterns (i, j: sample numbers) is computed as the DP (dynamic programming) distance d(X_i, X_j), where X_i and X_j are different time-series speech samples. To avoid deforming the data as far as possible, the centroid is computed by selecting the mini-max center y(ω) among the samples, as in equation (1) below, and using it as the representative point.

[0051]

(Equation 1)

Here, ω denotes an arbitrary cluster. The total intra-cluster distortion Δ_n^k of the n-th cluster, used for the convergence test, is obtained by equation (2) below,

[0052]

(Equation 2)

and an arbitrary threshold Δ_th is applied in equation (3) below to test for convergence.

[0053]

(Equation 3)

Here, L is the number of samples in a cluster, k is the iteration count, and M is the number of clusters, which depends on the iteration count.
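The clustering described in [0050] to [0053] can be sketched roughly as follows. This is only an illustration under stated assumptions: `dp_distance` is an ordinary dynamic time warping distance with Euclidean frame cost and length normalization, `minimax_center` picks the sample whose largest distance to the other members is smallest, and the convergence test in `cluster` simply compares successive total distortions with a threshold. Equations (1) to (3) are not reproduced in this text, so none of these details should be read as the exact formulas of the patent, and all function names are hypothetical.

```python
import numpy as np


def dp_distance(x, y):
    """DP (dynamic time warping) distance between sequences of shape (T, D)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(x[i - 1] - y[j - 1])  # frame-to-frame distance
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # length normalization (an assumption)


def minimax_center(samples):
    """Index of the sample whose largest distance to any other sample is smallest."""
    dist = np.array([[dp_distance(a, b) for b in samples] for a in samples])
    return int(np.argmin(dist.max(axis=1)))


def cluster(samples, centers, threshold=1e-6, max_iter=20):
    """k-means-style clustering whose centers are always actual samples."""
    centers = list(centers)  # indices into samples
    previous = np.inf
    for _ in range(max_iter):
        dist = np.array([[dp_distance(s, samples[c]) for c in centers] for s in samples])
        assign = dist.argmin(axis=1)
        total = dist.min(axis=1).sum()  # total distortion for this iteration
        for k in range(len(centers)):
            members = [i for i in range(len(samples)) if assign[i] == k]
            if members:
                centers[k] = members[minimax_center([samples[i] for i in members])]
        if abs(previous - total) < threshold:  # assumed convergence test
            break
        previous = total
    return centers, assign.tolist()
```

Choosing an existing sample as the center, rather than averaging time-warped sequences, matches the stated aim of avoiding deformation of the data: the representative point is always a real utterance.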

【００５４】[0054] Next, as an example of generating an optimal acoustic model by successively updating the initial model parameters obtained from the representative points in step 205 above, true/false discrimination templates are generated from the correct and error data present in the samples of each morpheme, by discriminative learning based on a criterion that minimizes decision errors. First, a misclassification measure that also takes discriminative weights into account is defined from the initial templates; learning then proceeds by successively updating each parameter with a gradient method in the descent direction of the misclassification measure.
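The gradient-based update described above can be illustrated with a small sketch. It assumes fixed-length feature vectors, one template per class, a squared Euclidean distance, and a sigmoid-smoothed loss; the discriminative weights mentioned in the description are omitted for brevity. The function name `mce_step` and the parameters `lr` and `slope` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def sq_dist(x: Vector, r: Vector) -> float:
    """Squared Euclidean distance between a sample and a template."""
    return sum((a - b) ** 2 for a, b in zip(x, r))


def mce_step(x: Vector, label: str, templates: Dict[str, Vector],
             lr: float = 0.1, slope: float = 1.0) -> float:
    """One descent step on a smoothed misclassification measure.

    label is "correct" or "error"; templates maps both labels to a template.
    Returns the loss measured before the update.
    """
    other = "error" if label == "correct" else "correct"
    # Misclassification measure: positive when x is closer to the wrong template
    d = sq_dist(x, templates[label]) - sq_dist(x, templates[other])
    z = max(-60.0, min(60.0, slope * d))      # clamp to keep exp() in range
    loss = 1.0 / (1.0 + math.exp(-z))         # smooth 0/1 error count
    g = slope * loss * (1.0 - loss)           # derivative of the loss w.r.t. d
    # Descent direction: pull the true-class template toward x,
    # push the competing template away from x
    pulled = [r + 2.0 * lr * g * (a - r) for a, r in zip(x, templates[label])]
    pushed = [r - 2.0 * lr * g * (a - r) for a, r in zip(x, templates[other])]
    templates[label], templates[other] = pulled, pushed
    return loss
```

Repeating the step over all stored correct and error samples moves the two templates apart where they would otherwise be confused, which is the intent of the discriminative learning stage.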

【００５５】[0055] Next, the operation of the detection unit 6 is described. The detection unit 6 uses the true/false discrimination model to detect errors from a newly input speech recognition result and the speech signal corresponding to that result.

【００５６】[0056] The detection unit 6 detects errors using the acoustic information corresponding to each recognition unit output from the speech recognition device 3, together with the true/false discrimination model, obtained by the learning unit 5, that corresponds to the recognition result character string.

【００５７】[0057] FIG. 5 shows the configuration of the detection unit according to one embodiment of the present invention.

【００５８】[0058] The detection unit 6 shown in the figure comprises a speech signal extraction unit 61, a feature extraction unit 62, a true/false model matching unit 63, and a true/false discrimination model 64.

【００５９】[0059] FIG. 6 is a flowchart of the processing in the detection unit according to one embodiment of the present invention.

【００６０】[0060] Step 301) The speech signal extraction unit 61 uses the speech signal output from the speech recognition device 3 to cut the speech signal corresponding to each recognition unit out of the continuous speech.
Step 302) The feature extraction unit 62 obtains the acoustic feature amount of each recognition unit. The feature amounts used are the same as those used when the true/false discrimination model was created. Various acoustic feature amounts exist, such as frequency-domain and time-domain features, and neither the feature type nor the feature extraction method is particularly limited.
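Steps 301 and 302 can be sketched as follows. The sketch assumes that the boundaries of each recognition unit are available as sample indices, which the description does not specify, and it uses frame log-energy purely as one example of a time-domain feature, since the description leaves the feature type open. The names `cut_units` and `frame_log_energy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cut_units(signal: Sequence[float],
              boundaries: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Step 301: cut one segment per recognition unit out of the continuous signal.

    boundaries holds (start, end) sample indices per unit; how they are obtained
    from the recognizer is an assumption of this sketch.
    """
    return [list(signal[start:end]) for start, end in boundaries]


def frame_log_energy(segment: Sequence[float],
                     frame_len: int = 4, hop: int = 2) -> List[float]:
    """Step 302: a simple time-domain feature, the log energy of each full frame."""
    feats = []
    for start in range(0, len(segment) - frame_len + 1, hop):
        frame = segment[start:start + frame_len]
        # A small constant avoids log(0) on silent frames
        feats.append(math.log(sum(v * v for v in frame) + 1e-10))
    return feats
```

Whatever feature is chosen, the same extraction has to be applied at training time and at detection time so that the features remain comparable with the stored models.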

【００６１】[0061] Step 303) The true/false model matching unit 63 determines whether each recognition unit is correct or erroneous by comparing the acoustic feature amount obtained by the feature extraction unit 62 with the model, among the true/false discrimination models 64 sent from the learning unit 5, that corresponds to the recognized character string output by the speech recognition device 3. For this determination, the acoustic information corresponding to the input recognition unit is matched against both the correct model and the error model, and the decision is made according to which model it more closely resembles. If the model is of a form expressed as a parameter distance, the model at the smaller distance gives the determination result; if it is expressed as a probability value, the model with the higher likelihood does.
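The decision rule of step 303 can be sketched as below. The sketch assumes that models are stored per recognized character string as lists of templates, that features are equal-length vectors compared with an L1 distance, and that ties are resolved in favor of "correct"; none of these details are fixed by the description. All names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Features = Sequence[float]
# Assumed layout: recognized string -> {"correct": [templates], "error": [templates]}
ModelStore = Dict[str, Dict[str, List[Features]]]


def l1_distance(a: Features, b: Features) -> float:
    """Sum of absolute differences between equal-length feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


def judge_by_distance(features: Features, word: str, store: ModelStore,
                      distance: Callable[[Features, Features], float] = l1_distance) -> str:
    """Distance-type models: the closer model decides the result."""
    model = store[word]  # model matching the recognized character string
    d_correct = min(distance(features, t) for t in model["correct"])
    d_error = min(distance(features, t) for t in model["error"])
    return "correct" if d_correct <= d_error else "error"


def judge_by_likelihood(loglik_correct: float, loglik_error: float) -> str:
    """Probability-type models: the model with the higher likelihood decides."""
    return "correct" if loglik_correct >= loglik_error else "error"
```

For example, with a store holding one template per class for a given word, a feature vector close to the correct template is judged "correct" and one close to the error template is judged "error".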

【００６２】[0062] Step 304) The true/false model matching unit 63 outputs the determination result to the error presentation unit 7.

【００６３】[0063] As a result, the error presentation unit 7 presents errors to the corrector on the basis of the determination result output from the true/false model matching unit 63. In presenting errors, the character information of the recognition result is displayed with various character attributes added, such as color, size, and spacing, so that the portions to be corrected are made explicit and the corrector is prompted to correct them.
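One way to attach character attributes to suspected errors is sketched below. HTML output is an assumption chosen only for illustration, since the description does not name a display technology, and joining units with spaces suits space-delimited text rather than Japanese. The function name `render_result` and its `style` parameter are hypothetical.

```python
import html
from typing import Iterable, Tuple


def render_result(units: Iterable[Tuple[str, str]],
                  style: str = "color:red;font-weight:bold") -> str:
    """Render recognized units, wrapping those judged "error" in a styled span.

    units yields (text, verdict) pairs, where verdict is "correct" or "error".
    """
    parts = []
    for text, verdict in units:
        safe = html.escape(text)  # keep recognized text from being read as markup
        if verdict == "error":
            parts.append(f'<span style="{style}">{safe}</span>')
        else:
            parts.append(safe)
    return " ".join(parts)
```

Other attributes named in the description, such as size or spacing, could be expressed by passing a different `style` string.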

【００６４】[0064] In the above embodiment, the operation of the learning unit 5 was described with reference to FIG. 4 and that of the detection unit 6 with reference to FIG. 6; the operations shown in FIGS. 4 and 5 can also be implemented as a program.

【００６５】[0065] The constructed program can also be installed in the CPU of a computer used as the error detection apparatus, or distributed via a network.

【００６６】[0066] The present invention is not limited to the above embodiment and can be modified and applied in various ways within the scope of the claims.

【００６７】[0067]

[Effects of the Invention] When current speech recognition systems are used, the recognition results always contain errors, so some means of correction is required; ordinarily a corrector makes the corrections manually. According to the present invention, errors can be detected automatically using the true/false discrimination model and presented effectively to the corrector. This not only reduces the corrector's burden but also makes it possible to reduce the number of portions left uncorrected.

【００６８】[0068] Furthermore, the present invention does not depend on the internal structure of the speech recognition system and is applicable to any system that yields recognition results. The speech recognition system therefore needs no modification, which widens the range of application.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of the recognition error detection system of the present invention.

FIG. 2 is a flowchart outlining the error detection processing for speech recognition results according to the present invention.

FIG. 3 is a configuration diagram of the learning unit according to one embodiment of the present invention.

FIG. 4 is a flowchart of the processing in the learning unit according to one embodiment of the present invention.

FIG. 5 is a configuration diagram of the detection unit according to one embodiment of the present invention.

FIG. 6 is a flowchart of the processing in the detection unit according to one embodiment of the present invention.

[Explanation of symbols]

1 recognition error detection system
2 past speech recognition results
3 speech recognition device
4 error detection device
5 learning unit
6 detection unit
7 error presentation unit
51 true/false discrimination unit
52 correct answer database unit
53 error database unit
54 correct answer acoustic model initial value generation unit
55 error acoustic model initial value generation unit
56 discriminative learning unit
57 true/false discrimination model
61 speech signal extraction unit
62 feature extraction unit
63 true/false model matching unit
64 true/false discrimination model

Continued from the front page
(72) Inventor: Atsushi Imai, c/o NHK (Japan Broadcasting Corporation) Science & Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo
(72) Inventor: Toru Tsugi, c/o NHK (Japan Broadcasting Corporation) Science & Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo
F-term (reference): 5D015 GG01 GG03 GG04 LL04

Claims

[Claims]

1. A speech recognition result error detection method for detecting an error in a speech recognition result when converting content expressed in spoken language into character information by speech recognition, comprising: storing, for each minimum unit, correct acoustic information and error acoustic information that have feature amounts analyzed in the time domain and the frequency domain and that have been discriminated as correct or erroneous; generating, by performing a learning process on the correct acoustic information and the error acoustic information, a true/false discrimination model, which is an acoustic model that best discriminates the correct acoustic information from the error acoustic information; and detecting an error from the corresponding acoustic information using the true/false discrimination model.

2. The speech recognition result error detection method according to claim 1, wherein correct acoustic information and error acoustic information are extracted by acoustic analysis of the speech signals corresponding to the correct answers and errors included in past speech recognition results; the extracted correct acoustic information and error acoustic information are stored; representative points in the stored correct acoustic information and error acoustic information are obtained; and, with the representative points as initial values, an optimal acoustic model is generated by sequentially updating the feature amounts included in the initial values and the initial model parameters obtained from the representative points.

3. The speech recognition result error detection method according to claim 1, wherein, when detecting the error, the speech signal output from a speech recognition device is cut out from continuous speech, an acoustic feature amount of each recognition unit is obtained, the acoustic feature amount is compared with the true/false discrimination model, and correctness or error is determined according to whether the acoustic feature amount is more similar to the correct model or to the error model of the true/false discrimination model.

4. A speech recognition result error detection apparatus for detecting an error in a speech recognition result when converting content expressed in spoken language into character information by speech recognition, comprising: learning means for storing, for each minimum unit, correct acoustic information and error acoustic information that have feature amounts analyzed in the time domain and the frequency domain and that have been discriminated as correct or erroneous, and for performing a learning process on the correct acoustic information and the error acoustic information to generate a true/false discrimination model, which is an acoustic model that best discriminates the correct acoustic information from the error acoustic information; and detecting means for detecting an error from the acoustic information by using the true/false discrimination model.

5. The speech recognition result error detection apparatus according to claim 4, wherein the learning means comprises: correct/error discriminating means for sorting past speech recognition results into correct answers and errors, extracting correct acoustic information and error acoustic information by acoustic analysis of the speech signals of the sorted correct answers and errors, and storing the extracted correct acoustic information and error acoustic information; acoustic model initial value generation means for obtaining representative points in the stored correct acoustic information and error acoustic information as initial values of an acoustic model; and discriminative learning means for generating, with the representative points as initial values, an optimal acoustic model by sequentially updating the feature amounts included in the initial values and the initial model parameters obtained from the representative points.

6. The speech recognition result error detection apparatus according to claim 4, wherein the detecting means comprises: speech signal extracting means for cutting out, from continuous speech, the speech signal output from a speech recognition device; feature extracting means for obtaining an acoustic feature amount of each recognition unit; and true/false model matching means for comparing the acoustic feature amount with the true/false discrimination model and determining whether the acoustic feature amount is more similar to the correct model or to the error model.

7. A speech recognition result error detection program for detecting an error in a speech recognition result when converting content expressed in spoken language into character information by speech recognition, comprising: a learning process of storing, for each minimum unit, correct acoustic information and error acoustic information that have feature amounts analyzed in the time domain and the frequency domain and that have been discriminated as correct or erroneous, and of performing a learning process on the correct acoustic information and the error acoustic information to generate a true/false discrimination model, which is an acoustic model that best discriminates the correct acoustic information from the error acoustic information; and a detection process of detecting an error from the corresponding acoustic information by using the true/false discrimination model.

8. The speech recognition result error detection program according to claim 7, wherein the learning process includes: a correct/error discrimination process of sorting past speech recognition results into correct answers and errors, extracting correct acoustic information and error acoustic information by acoustic analysis of the speech signals of the sorted correct answers and errors, and storing the extracted correct acoustic information and error acoustic information; an acoustic model initial value generation process of obtaining representative points in the stored correct acoustic information and error acoustic information as initial values of an acoustic model; and a discriminative learning process of generating, with the representative points as initial values, an optimal acoustic model by sequentially updating the feature amounts included in the initial values and the initial model parameters obtained from the representative points.

9. The speech recognition result error detection program according to claim 7, wherein the detection process includes: a speech signal extraction process of cutting out, from continuous speech, the speech signal output from a speech recognition device; a feature extraction process of obtaining an acoustic feature amount of each recognition unit; and a true/false model matching process of comparing the acoustic feature amount with the true/false discrimination model and determining whether the acoustic feature amount is more similar to the correct model or to the error model.