JP6325770B2

JP6325770B2 - Speech recognition error correction apparatus and program thereof

Info

Publication number: JP6325770B2
Application number: JP2013019376A
Authority: JP
Inventors: 庄衛佐藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-02-04
Filing date: 2013-02-04
Publication date: 2018-05-16
Anticipated expiration: 2033-02-04
Also published as: JP2014149490A

Description

The present invention relates to a speech recognition error correction apparatus, and a program therefor, that corrects speech recognition errors contained in a word string representing the speech recognition result of program audio.

Speech recognition technology has long been used to produce captions for broadcast programs. Because the recognition results contain errors, correctors (operators) are assigned to fix them, and the corrected character strings are broadcast as captions.

As prior art for correcting such recognition errors, an invention has been proposed in which several two-person teams, each consisting of one person who points out recognition errors and one person who corrects the pointed-out words, perform the correction (Patent Document 1). Another invention has been proposed in which, instead of dividing roles in this way, the recognized text is shared among one to several people, each of whom corrects the errors in their assigned text (Patent Document 2). In the inventions of Patent Documents 1 and 2, the operator touches the erroneous part of the recognition result displayed on a touch panel to specify it and, depending on the type of recognition error, enters a replacement character string from a keyboard if necessary.

Because this operation alternates between touching the screen and typing on the keyboard, quick correction is difficult unless the operator has mastered not only the procedure but also the physical movements. An invention for practicing this operation has therefore been proposed (Patent Document 3). Moreover, the characters needed to correct substitution errors and deletion errors are usually entered on a standard keyboard, so correctors must not only be accustomed to the correction operation but also be able to type Japanese quickly.

To reduce this keyboard input burden, an invention has been proposed that presents correction candidates for homophones, provides a palette showing manuscripts related to the recognition target, and uses character strings on the palette to correct recognition errors (Patent Document 4). However, the invention of Patent Document 4 cannot cover all recognition errors.

In addition, for re-speak caption production, a method has been proposed in which errors are corrected by having the re-speaker utter the erroneous part again. In this case, the corrector must edit the text so that the character string obtained by appropriately restating the recognition result is inserted at the right place, which demands considerable skill in working in coordination with the re-speaker.

To obtain the required correction character string efficiently, an invention has been proposed that re-evaluates the correction section or its surroundings, using the confirmed sections for which no correction was specified as constraints (Patent Document 5). Inventions have also been proposed that re-recognize the section needing correction using a detailed user dictionary (Patent Documents 6 and 7). With the inventions of Patent Documents 6 and 7, accurate correction is difficult when the speech to be recognized is itself unclear or misspoken.

An invention has also been proposed in which a correction character string is entered from a keyboard and the hypothesis lattice obtained from the recognition result is re-recognized with the correction section constrained to that string, thereby automatically correcting recognition errors outside the correction section (Patent Document 8). The invention of Patent Document 8 still leaves the burden of entering a character string from the keyboard.

An invention has also been proposed that identifies the location to be corrected by recognizing speech in which the corrector restates the part to be corrected (Patent Document 9). With the invention of Patent Document 9, even though the location is specified by voice, the correction character string must still be entered with a keyboard, so correction work is not easy. A method has therefore been proposed in which the corrector restates the entire sentence containing the error while emphasizing the misrecognized part, so that the location is identified and replaced with the recognition result of the restated speech (Patent Document 10). An invention has also been proposed that uses speech in which the corrector restates the misrecognized part to present an exhaustive set of correction candidates, from which the corrector selects the desired one (Patent Document 11). Further, an invention has been proposed that uses speech recognition results to complement handwritten character input (Patent Document 12).

Patent Document 1: JP 2001-60192 A
Patent Document 2: Japanese Patent No. 3986015
Patent Document 3: JP 2004-240234 A
Patent Document 4: Japanese Patent No. 3986009
Patent Document 5: Japanese Patent No. 4709887
Patent Document 6: Japanese Patent No. 4902617
Patent Document 7: JP 2000-187497 A
Patent Document 8: JP 2011-197410 A
Patent Document 9: Japanese Patent No. 4784120
Patent Document 10: JP 2003-316386 A
Patent Document 11: JP 2001-92493 A
Patent Document 12: JP 2007-18290 A

However, the inventions of Patent Documents 10 and 11 rely on restated speech to correct the speech recognition error part. Recognizing the restated speech alone still produces recognition errors, so the correct corrected word string cannot be estimated with high accuracy and additional input from the corrector is required. One option is to optimize the acoustic model used to recognize the speech recognition error part for the corrector. In that case, the recognition result for the speech recognition error part has a different error tendency from the recognition result of the program audio, which is obtained with a speaker-independent acoustic model.

In the invention of Patent Document 12, the speech recognition result and the handwritten character recognition result do not work complementarily; the speech recognition result merely assists handwritten character input. Because handwritten character recognition is a different modality from speech recognition, the recognition result of the corrector's handwriting has a different error tendency from the recognition result of the program audio.

Accordingly, if the speech recognition result and the handwritten character recognition result for the speech recognition error part, each of which has a different error tendency, are integrated complementarily with the recognition result of the program audio, the correct corrected word string can be estimated with high accuracy.

An object of the present invention is therefore to provide a speech recognition error correction apparatus, and a program therefor, with which correction work is easy and the correct corrected word string can be estimated with high accuracy.

In view of the above problem, the speech recognition error correction apparatus according to the first invention of the present application corrects a speech recognition error contained in a word string representing the speech recognition result of program audio with a correct corrected word string, and comprises speech recognition error part recognition means, hypothesis lattice integration means, and speech recognition error part correction means.

With this configuration, the speech recognition error correction apparatus includes speech recognition means that recognizes the corrector's utterance of the speech recognition error part, handwritten character recognition means that recognizes the corrector's handwriting of the speech recognition error part, and a switch for selecting either speech recognition or handwritten character recognition in advance. The corrector manually selects one of the two according to why the program audio was misrecognized. For example, when the cause is acoustic, such as unclear articulation or a slip of the tongue in the program audio, speech recognition allows quicker and more accurate correction, so it is selected and an utterance of the speech recognition error part is input. When the cause is a homophone, handwritten character recognition allows quicker and more accurate correction, so it is selected and handwritten characters for the speech recognition error part are input.

The speech recognition error correction apparatus also uses the speech recognition error part recognition means to output, as the result of the preselected speech recognition or handwritten character recognition, corrected word string candidates, which are candidates for the correct corrected word string, and a recognition score for each corrected word string candidate.

Speech recognition and handwritten character recognition of the corrector's input have error tendencies different from those of speech recognition of the program audio. The speech recognition error correction apparatus therefore uses the hypothesis lattice integration means to integrate the hypothesis lattice by connecting branches associated with the corrected word string candidates and their recognition scores to the nodes of the branches located at the start point and end point of the speech recognition error part in the input hypothesis lattice.

This hypothesis lattice consists of branches, each associated with a word evaluated during speech recognition of the program audio and that word's acoustic score, and of branch nodes that indicate the position of each word. It represents what was evaluated during speech recognition of the program audio.
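A hypothesis lattice of this kind can be pictured as a small directed graph. The following Python sketch is illustrative only: the class names `Branch` and `Lattice`, the integer node ids, and the example words and scores are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Branch:
    """One lattice branch: a word hypothesis between two nodes."""
    start: int      # node id where the word begins
    end: int        # node id where the word ends
    word: str
    score: float    # acoustic score (log domain is assumed here)


@dataclass
class Lattice:
    branches: list = field(default_factory=list)

    def add(self, start, end, word, score):
        self.branches.append(Branch(start, end, word, score))

    def outgoing(self, node):
        """Branches that leave the given node."""
        return [b for b in self.branches if b.start == node]


# Hypothetical three-node lattice with two competing words on the first span.
lattice = Lattice()
lattice.add(0, 1, "kisha", -12.0)
lattice.add(0, 1, "kasha", -15.5)
lattice.add(1, 2, "desu", -4.0)
```

Competing hypotheses for the same stretch of audio appear as parallel branches between the same pair of nodes, which is what makes it possible to attach further candidates to those nodes later.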

The speech recognition error correction apparatus also uses the speech recognition error part correction means to calculate, for each path of branches from the start point to the end point of the speech recognition error part in the integrated hypothesis lattice, an integrated score from the acoustic scores and recognition scores, and estimates the path with the highest integrated score as the correct corrected word string.
In addition, when the handwritten character recognition result consists entirely of hiragana or katakana, the speech recognition error part recognition means reads, from the pronunciation dictionary used for speech recognition of the program audio or from the pronunciation dictionary used for speech recognition of the speech recognition error part, the words whose phoneme strings correspond to the hiragana or katakana notation, and uses them as corrected word string candidates.
The apparatus thus treats all words whose phoneme string corresponds to the hiragana or katakana notation, including their kanji spellings, as corrected word string candidates. Even when the corrector cannot immediately recall the kanji spelling of the speech recognition error part, the corrector can enter that part in hiragana or katakana and correct it quickly.

The speech recognition error correction apparatus according to the second invention of the present application performs speech recognition using a speaker-dependent acoustic model specific to the corrector.
With this configuration, the apparatus can recognize the corrector's utterance accurately and estimate the correct corrected word string with higher accuracy.

In the speech recognition error correction apparatus according to the third invention of the present application, the speech recognition error part correction means calculates, for each branch path, the weighted sum of the acoustic score and the recognition score as the integrated score.
With this configuration, the apparatus can calculate the integrated score accurately from the weighted sum and estimate the correct corrected word string with higher accuracy.
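As a rough illustration of the weighted sum, the sketch below scores each candidate path and picks the best one. The weights `w_acoustic` and `w_recog`, the use of log-domain scores, and the example values are all assumptions; this part of the text does not give concrete values.

```python
def integrated_score(acoustic, recognition, w_acoustic=1.0, w_recog=1.0):
    """Weighted sum of a path's acoustic score and recognition score."""
    return w_acoustic * acoustic + w_recog * recognition


def best_path(paths, w_acoustic=1.0, w_recog=1.0):
    """paths: iterable of (words, acoustic_score, recognition_score)."""
    return max(
        paths,
        key=lambda p: integrated_score(p[1], p[2], w_acoustic, w_recog),
    )


# Hypothetical candidate paths over one error span (log-domain scores).
paths = [
    (("kisha",), -12.0, -9.0),
    (("kasha",), -15.5, -2.0),
]
words, _, _ = best_path(paths, w_acoustic=0.5, w_recog=1.0)
```

With these made-up numbers the second path wins, because its stronger recognition score outweighs its weaker acoustic score once the acoustic term is down-weighted.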

In the speech recognition error correction apparatus according to the fourth invention of the present application, the speech recognition error part correction means calculates, for each branch path, the sum of the posterior probability of the acoustic score and the posterior probability of the recognition score calculated by a preset log-likelihood calculation formula as the integrated score.
With this configuration, the apparatus can calculate the integrated score accurately using the log-likelihood calculation formula and estimate the correct corrected word string with higher accuracy.

The speech recognition error correction apparatus according to the present invention can also be realized by a speech recognition error correction program that causes a computer equipped with hardware resources such as a CPU (Central Processing Unit) and storage means (for example, memory or a hard disk) to operate cooperatively as each of the means described above (fifth invention of the present application). This program may be distributed over a communication line, or written to a recording medium such as a CD-ROM or flash memory and distributed.

The present invention provides the following advantageous effects.
According to the first and fifth inventions of the present application, the result of speech recognition of the corrector's utterance or of recognition of the corrector's handwritten characters is integrated complementarily with the hypothesis lattice, and the integrated score is calculated from the integrated hypothesis lattice, so the correct corrected word string can be estimated with high accuracy. Furthermore, the corrector does not need to use a keyboard, so correction work is easy.
According to the first and fifth inventions of the present application, a speech recognition error can be corrected quickly even when the corrector cannot immediately recall the kanji spelling.

According to the second invention of the present application, the corrector's utterance is recognized accurately and the correct corrected word string can be estimated with higher accuracy.
According to the third and fourth inventions of the present application, the integrated score is calculated accurately and the correct corrected word string can be estimated with higher accuracy.

FIG. 1 is a block diagram showing the configuration of a speech recognition error correction apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram of hypothesis lattice integration by the recognition hypothesis integration means of FIG. 1; (a) shows an example of words used to correct a speech recognition error, and (b) shows an example of the hypothesis lattice and of branches associated with corrected word string candidates and recognition scores.
FIG. 3 is an explanatory diagram of hypothesis lattice integration by the recognition hypothesis integration means of FIG. 1; (a) shows the integrated hypothesis lattice, and (b) shows the recognition scores associated with the branch paths.
FIG. 4 is a flowchart showing the operation of the speech recognition error correction apparatus of FIG. 1.

[Outline of caption generation system]
An outline of the caption generation system 1 according to an embodiment of the present invention is described with reference to FIG. 1.
The caption generation system 1 performs speech recognition on program audio and, when the recognition result contains an error, has a corrector correct that speech recognition error. As shown in FIG. 1, the caption generation system 1 includes a speech recognition apparatus 10, a speech recognition error correction apparatus 20, and a display device 30.

The speech recognition apparatus 10 receives program audio, which is the audio of a broadcast program, and performs speech recognition on it to generate a word string representing the speech recognition result (speech recognition result word string). It includes speech recognition means 11, an acoustic model 13, a language model 15, and a pronunciation dictionary 17.

The speech recognition means 11 performs speech recognition on the program audio using the acoustic model 13, the language model 15, and the pronunciation dictionary 17 described below, and generates a speech recognition result word string (maximum-likelihood word string) and a lattice of recognition hypotheses. For example, the speech recognition means 11 uses a speech recognition method that evaluates the program audio with a statistical model of the acoustic features of the phonemes appearing in each word (the acoustic model 13) and evaluates how plausible the recognition result is as Japanese text with a statistical model of how readily words follow one another (the language model 15).

The speech recognition means 11 outputs the speech recognition result word string generated by speech recognition to the speech recognition error correction apparatus 20 and the display device 30. It also outputs a hypothesis lattice representing the evaluation results of speech recognition to the speech recognition error correction apparatus 20.
The hypothesis lattice is described in detail later (FIG. 2).

The acoustic model 13 is a statistical model, for example a hidden Markov model (HMM), that represents the acoustic features of the phonemes appearing in each word.
The language model 15 is a statistical model, for example a bigram or trigram model, that represents how readily words follow one another.
The pronunciation dictionary 17 is, for example, a pronunciation model that specifies the phoneme string of each word, and it links the acoustic model 13 and the language model 15.

The speech recognition error correction apparatus 20 is used by the corrector to correct errors in the speech recognition result word string input from the speech recognition apparatus 10 when that word string is wrong. The apparatus outputs the corrected speech recognition result word string as captions to, for example, a broadcast transmission apparatus (not shown).
The display device 30 is a display that shows the speech recognition result word string input from the speech recognition apparatus 10 so that the corrector can spot speech recognition errors visually.

[Configuration of speech recognition error correction apparatus]
Next, the configuration of the speech recognition error correction apparatus 20 is described in detail.
As shown in FIG. 1, the speech recognition error correction apparatus 20 includes correction instruction input means 21, speech recognition error part recognition means 22, recognition hypothesis integration means (hypothesis lattice integration means) 26, and hypothesis rescoring means (speech recognition error part correction means) 27.

The correction instruction input means 21 lets the corrector enter a correction instruction for the speech recognition result word string input from the speech recognition apparatus 10. With it, the corrector specifies the position of the speech recognition error part and selects the type of speech recognition error, for example by touching a word displayed on the display device 30 or by making a gesture with a pointing device.
A gesture here means drawing, with the pointing device, a symbol defined for each type of speech recognition error.

When the type of speech recognition error is a deletion error or a substitution error, the speech recognition error part must be identified so that the corrected word string can be inserted into, or substituted in, the speech recognition result word string. Specifically, the correction instruction input means 21 identifies the span from the start time (start point) to the end time (end point) of the speech recognition error part indicated by the corrector in the speech recognition result word string input from the speech recognition apparatus 10, and outputs it to the recognition hypothesis integration means 26.

The speech recognition error part recognition means 22 obtains corrected word string candidates, which are candidates for the correct corrected word string, and a recognition score for each candidate, either by speech recognition of an utterance input by the corrector or by recognition of handwritten characters input by the corrector. As shown in FIG. 1, the speech recognition error part recognition means 22 includes speech recognition means 23, handwritten character recognition means 24, and a switch 25.

Like the speech recognition apparatus 10, the speech recognition means 23 includes an acoustic model, a language model, and a pronunciation dictionary (none of which are shown), and performs speech recognition on an utterance for the speech recognition error part. The speech recognition means 23 includes, for example, a microphone (not shown) for inputting the corrector's utterance. The language model and pronunciation dictionary of the speech recognition means 23 may be the same as those of the speech recognition apparatus 10. The speech recognition means 23 outputs the corrected word string candidates and the recognition score for each candidate to the switch 25 as the result of speech recognition.

The speech recognition means 23 preferably recognizes the corrector's utterance using a speaker-dependent acoustic model (not shown) specific to the corrector. This allows the speech recognition means 23 to recognize the corrector's utterance accurately.

The handwritten character recognition means 24 recognizes handwritten characters for the speech recognition error part. It includes, for example, a tablet terminal and a stylus pen (not shown) for recognizing the corrector's handwritten characters. The handwritten character recognition means 24 outputs the corrected word string candidates and the recognition score for each candidate to the switch 25 as the result of handwritten character recognition.

The handwritten character recognition result may consist entirely of hiragana or katakana. In that case, the handwritten character recognition means 24 preferably reads, from the pronunciation dictionary 17 of the speech recognition apparatus 10 or from the pronunciation dictionary of the speech recognition means 23, the words whose phoneme strings correspond to the hiragana or katakana notation, and uses them as corrected word string candidates. This allows a speech recognition error to be corrected quickly even when the corrector cannot immediately recall the kanji spelling.
The corrector can manually set whether the pronunciation dictionary 17 or the pronunciation dictionary of the speech recognition means 23 is used.
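The kana lookup described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the dictionary maps surface forms to kana readings that stand in for phoneme strings, and the four entries are invented sample data rather than contents of the pronunciation dictionary 17.

```python
def to_hiragana(text):
    """Map katakana code points (U+30A1..U+30F6) to hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def is_all_kana(text):
    """True if every character is hiragana, katakana, or the long-vowel mark."""
    return bool(text) and all(
        "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30f6" or ch == "\u30fc"
        for ch in text
    )


def candidates_from_kana(recognized, pronunciation_dict):
    """Return every dictionary word whose reading matches the kana input."""
    if not is_all_kana(recognized):
        # Input already contains kanji or other characters: keep it as is.
        return [recognized]
    reading = to_hiragana(recognized)
    return [w for w, r in pronunciation_dict.items() if to_hiragana(r) == reading]


# Hypothetical dictionary: surface form -> reading (kana stands in for phonemes).
pronunciation_dict = {"記者": "きしゃ", "汽車": "きしゃ", "帰社": "きしゃ", "貨車": "かしゃ"}
print(candidates_from_kana("キシャ", pronunciation_dict))
```

Writing the reading in kana thus yields all homophones as candidates, and the choice among them is left to the later scoring step rather than to the corrector's memory of the kanji.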

スイッチ２５は、音声認識手段２３又は手書き文字認識手段２４から入力された修正単語列候補及び認識スコアの一方を、認識仮説統合手段２６に出力するものである。
ここで、修正者は、音声認識誤りの理由に応じて、音声認識又は手書き文字認識の何れか一方を手動で選択する。
例えば、音声認識誤りが番組音声の不明瞭な発声や言い間違いといった音響的な理由の場合、より素早く正確に修正を行うには、音声認識が好ましい、この場合、修正者は、スイッチ２５で音声認識手段２３の側を選択し、音声認識誤り部分に対する発話を音声認識手段２３に入力する。
また、例えば、音声認識誤りが同音異義語といった理由の場合、より素早く正確に修正を行うには、手書き文字認識が好ましい。この場合、修正者は、スイッチ２５で手書き文字認識手段２４の側を選択し、音声認識誤り部分に対する手書き文字を手書き文字認識手段２４に入力する。 The switch 25 outputs one of the corrected word string candidate and the recognition score input from the speech recognition unit 23 or the handwritten character recognition unit 24 to the recognition hypothesis integration unit 26.
Here, the corrector manually selects either speech recognition or handwritten character recognition depending on the reason for the speech recognition error.
For example, when the speech recognition error has an acoustic cause, such as unclear articulation or a slip of the tongue in the program audio, speech recognition is preferable for correcting it more quickly and accurately. In this case, the corrector selects the speech recognition means 23 side with the switch 25 and inputs an utterance for the speech recognition error part to the speech recognition means 23.
In addition, for example, when the speech recognition error is caused by homophones, handwritten character recognition is preferable for correcting it more quickly and accurately. In this case, the corrector selects the handwritten character recognizing means 24 side with the switch 25 and inputs handwritten characters for the speech recognition error part to the handwritten character recognizing means 24.

このように、音声認識手段２３及び手書き文字認識手段２４が共に、修正単語列候補と、認識スコアとを出力する。従って、修正者が音声認識又は手書き文字認識の何れを選択して場合であっても、認識仮説統合手段２６の処理を共通化し、音声認識誤り修正装置２０の構成を簡素化することができる。 Thus, both the speech recognition means 23 and the handwritten character recognition means 24 output the corrected word string candidate and the recognition score. Therefore, even if the corrector selects either speech recognition or handwritten character recognition, the processing of the recognition hypothesis integrating means 26 can be made common, and the configuration of the speech recognition error correcting device 20 can be simplified.

認識仮説統合手段２６は、音声認識装置１０から入力された仮説ラティスの音声認識誤り部分の始点及び終点に位置する枝の節点に、修正単語列候補及び認識スコアが対応付けられた枝を接続することで、仮説ラティスを統合するものである。また、認識仮説統合手段２６は、統合された仮説ラティスを、仮説リスコアリング手段２７に出力する。 The recognition hypothesis integration unit 26 connects the branches associated with the corrected word string candidates and the recognition scores to the nodes of the branches located at the start and end points of the speech recognition error part of the hypothesis lattice input from the speech recognition apparatus 10. In this way, the hypothesis lattice is integrated. Further, the recognition hypothesis integration unit 26 outputs the integrated hypothesis lattice to the hypothesis rescoring unit 27.

＜仮説ラティスの統合＞
図２，図３を参照し、認識仮説統合手段２６による仮説ラティスの統合について、詳細に説明する（適宜図１参照）。
ここで、図２（ａ）に示す単語を一例として説明する。つまり、単語ｗ_１＝“多く”、単語ｗ_２＝“思い出す”、単語ｗ_３＝“淘汰”、単語ｗ_４＝“する”、単語ｗ_５＝“似通って”、単語ｗ_６＝“います”、単語ｗ_７＝“した”、単語ｗ_８＝“を”であることとする。
図２（ｂ）に示すように、この音声認識誤り部分の前に“ビデオを見て”という単語列があり、音声認識誤り部分の後に“ことができた”という単語列が続くこととする。 <Integration of hypothesis lattice>
The hypothesis lattice integration by the recognition hypothesis integration means 26 will be described in detail with reference to FIGS. 2 and 3 (see FIG. 1 as appropriate).
Here, the words shown in FIG. 2A are used as an example. That is, word w₁ = “多く” (“much”), word w₂ = “思い出す” (“remember”), word w₃ = “淘汰” (“weed out”), word w₄ = “する” (“do”), word w₅ = “似通って” (“resembling”), word w₆ = “います” (polite auxiliary), word w₇ = “した” (“did”), and word w₈ = “を” (object particle).
As shown in FIG. 2B, the word string “ビデオを見て” (“watching the video”) precedes the speech recognition error part, and the word string “ことができた” (“was able to”) follows it.

図２（ｂ）の上段には、音声認識誤り部分認識手段２２から入力された修正単語列候補ｗ_１〜ｗ_６と、認識スコアＰ_１〜Ｐ_３とが対応付けられた枝１０〜１２を図示した。つまり、枝１０は、修正単語列候補｛ｗ_１，ｗ_２｝＝“多く思い出す”の認識スコアがＰ_１であることを示す。また、枝１１は、修正単語列候補｛ｗ_１，ｗ_３，ｗ_４｝＝“多く淘汰する”の認識スコアがＰ_２であることを示す。また、枝１２は、修正単語列候補｛ｗ_５，ｗ_６｝＝“似通っています”の認識スコアがＰ_３であることを示す。 The upper part of FIG. 2B illustrates branches 10 to 12, in which the corrected word string candidates w₁ to w₆ input from the speech recognition error partial recognition means 22 are associated with the recognition scores P₁ to P₃. That is, branch 10 indicates that the recognition score of the corrected word string candidate {w₁, w₂} = “多く思い出す” is P₁. Branch 11 indicates that the recognition score of the corrected word string candidate {w₁, w₃, w₄} = “多く淘汰する” is P₂. Branch 12 indicates that the recognition score of the corrected word string candidate {w₅, w₆} = “似通っています” is P₃.

図２（ｂ）の下段には、音声認識装置１０から入力された仮説ラティスを図示した。
この仮説ラティスは、番組音声の音声認識での評価内容を表している。つまり、仮説ラティスは、番組音声の音声認識で評価された単語及び単語毎の音響スコアを対応付けた枝と、単語の開始時刻（位置）を示す枝の節点とで構成された有向グラフである。この音響スコアＬは、入力された番組音声がどれぐらい単語らしいかを示したスコアである。 The lower part of FIG. 2B illustrates the hypothesis lattice input from the speech recognition apparatus 10.
This hypothesis lattice represents what was evaluated during speech recognition of the program audio. In other words, the hypothesis lattice is a directed graph composed of branches, each associating a word evaluated in speech recognition of the program audio with the acoustic score of that word, and branch nodes indicating the start time (position) of each word. The acoustic score L indicates how likely the input program audio is to be that word.

この図２（ｂ）では、仮説ラティスの枝を矢印で図示し、節点を黒丸で図示した。また、図２（ｂ）の仮説ラティスにおいて、音声認識誤り部分の開始時刻がＴ_Ｓであり、終了時刻がＴ_Ｅである。また、開始時刻Ｔ_Ｓを示す節点及び終了時刻Ｔ_Ｅを示す節点には、最尤経路を表す枝（太線で図示）と、最尤経路以外の枝（破線で図示）とが接続されている。 In FIG. 2B, the branches of the hypothesis lattice are drawn as arrows and the nodes as black circles. In the hypothesis lattice of FIG. 2B, the start time of the speech recognition error part is T_S and the end time is T_E. The node indicating the start time T_S and the node indicating the end time T_E are connected to branches representing the maximum likelihood path (shown by bold lines) and to branches other than the maximum likelihood path (shown by broken lines).

図２（ｂ）の仮説ラティスにおいて、枝毎に、枝を一意に識別する枝番号と、枝に対応付けられた単語ｗ及び音響スコアＬとを図示した。つまり、枝１は、単語ｗ_１＝“多く”の音響スコアがＬ_１であることを示す。また、枝２は、単語ｗ_２＝“思い出す”の音響スコアがＬ_２であることを示す。また、枝３は、単語ｗ_１＝“多く”の音響スコアがＬ_３であることを示す。また、枝４は、単語ｗ_３＝“淘汰”の音響スコアがＬ_４であることを示す。また、枝５は、単語ｗ_４＝“する”の音響スコアがＬ_５であることを示す。また、枝６は、単語ｗ_７＝“した”の音響スコアがＬ_６であることを示す。また、枝７は、単語ｗ_８＝“を”の音響スコアがＬ_７であることを示す。 In the hypothesis lattice of FIG. 2B, each branch is shown with a branch number that uniquely identifies it, together with the word w and the acoustic score L associated with it. That is, branch 1 indicates that the acoustic score of the word w₁ = “多く” is L₁. Branch 2 indicates that the acoustic score of the word w₂ = “思い出す” is L₂. Branch 3 indicates that the acoustic score of the word w₁ = “多く” is L₃. Branch 4 indicates that the acoustic score of the word w₃ = “淘汰” is L₄. Branch 5 indicates that the acoustic score of the word w₄ = “する” is L₅. Branch 6 indicates that the acoustic score of the word w₇ = “した” is L₆. Branch 7 indicates that the acoustic score of the word w₈ = “を” is L₇.

すなわち、図２（ｂ）の仮説ラティスにおいて、開始時刻Ｔ_Ｓを示す節点には、３つの枝が入力される。また、開始時刻Ｔ_Ｓを示す節点から、枝１，３が分岐する。また、枝３の先端にある節点から、枝４，７が分岐する。枝１，７は同じ節点に合流し、この節点から枝２が出力され、枝２の先端が終了時刻Ｔ_Ｅを示す節点に合流する。また、枝４の先端にある節点から、枝５，６が分岐する。また、枝５は、終了時刻Ｔ_Ｅを示す節点に合流する。従って、図２（ｂ）の仮説ラティスにおいて、開始時刻Ｔ_Ｓから終了時刻Ｔ_Ｅまでの間には、枝１−２の経路Ｈ_１と、枝３−４−５の経路Ｈ_２と、枝３−７−２の経路Ｈ_３という、３つの経路が存在する。 That is, in the hypothesis lattice of FIG. 2B, three branches enter the node indicating the start time T_S. Branches 1 and 3 diverge from the node indicating the start time T_S. Branches 4 and 7 diverge from the node at the tip of branch 3. Branches 1 and 7 merge at the same node, branch 2 leaves this node, and the tip of branch 2 joins the node indicating the end time T_E. Branches 5 and 6 diverge from the node at the tip of branch 4. Branch 5 joins the node indicating the end time T_E. Therefore, in the hypothesis lattice of FIG. 2B, there are three paths between the start time T_S and the end time T_E: path H₁ of branches 1-2, path H₂ of branches 3-4-5, and path H₃ of branches 3-7-2.
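The branch topology described above can be checked with a small directed-graph sketch. The Python below is illustrative only. The intermediate node names (`n1` to `n4`) and the `enumerate_paths` helper are hypothetical, and the destination of branch 6 is assumed to be some node other than T_E, since the text says only that it does not rejoin the maximum likelihood path.

```python
from collections import defaultdict

# (branch number, source node, destination node, word).
# Intermediate node names n1..n4 are hypothetical labels for illustration.
EDGES = [
    (1, "T_S", "n1", "多く"),
    (2, "n1", "T_E", "思い出す"),
    (3, "T_S", "n2", "多く"),
    (4, "n2", "n3", "淘汰"),
    (5, "n3", "T_E", "する"),
    (6, "n3", "n4", "した"),  # assumed to end somewhere other than T_E
    (7, "n2", "n1", "を"),
]


def enumerate_paths(edges, start, end):
    """Depth-first enumeration of branch-number sequences from start to end."""
    outgoing = defaultdict(list)
    for number, src, dst, _word in edges:
        outgoing[src].append((number, dst))

    paths = []

    def walk(node, branches):
        if node == end:
            paths.append(tuple(branches))
            return
        for number, dst in outgoing[node]:
            walk(dst, branches + [number])

    walk(start, [])
    return paths


print(enumerate_paths(EDGES, "T_S", "T_E"))
```

With these edges the enumeration yields the branch sequences 1-2, 3-4-5, and 3-7-2, matching paths H₁ to H₃; the sequence 3-4-6 is dropped because it never reaches T_E.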

認識仮説統合手段２６は、図２（ｂ）上段の枝１０〜１２を、図２（ｂ）下段の仮説ラティスに統合する。まず、認識仮説統合手段２６は、枝１０〜１２に対応付けられた修正単語列候補が、経路Ｈ_１〜Ｈ_３の各枝に対応付けられた単語列に一致するか否かを判定する。 The recognition hypothesis integrating unit 26 integrates branches 10 to 12 in the upper part of FIG. 2B into the hypothesis lattice in the lower part of FIG. 2B. First, the recognition hypothesis integrating unit 26 determines whether the corrected word string candidates associated with branches 10 to 12 match the word strings associated with the branches of paths H₁ to H₃.

ここで、認識仮説統合手段２６は、修正単語列候補が経路Ｈ_１〜Ｈ_３の単語列に一致する場合、一致する修正単語列候補の認識スコアＰを経路Ｈ_１〜Ｈ_３に対応付ける。
例えば、枝１０の修正単語列候補｛ｗ_１，ｗ_２｝＝“多く思い出す”であり、枝１−２の経路Ｈ_１の単語列｛ｗ_１，ｗ_２｝と一致する。このため、認識仮説統合手段２６は、修正単語列候補｛ｗ_１，ｗ_２｝の認識スコアＰ_１を枝１−２の経路Ｈ_１に対応付ける。
また、例えば、枝１１の修正単語列候補｛ｗ_１，ｗ_３，ｗ_４｝＝“多く淘汰する”であり、枝３−４−５の経路Ｈ_２の単語列｛ｗ_１，ｗ_３，ｗ_４｝と一致する。このため、認識仮説統合手段２６は、修正単語列候補｛ｗ_１，ｗ_３，ｗ_４｝の認識スコアＰ_２を枝３−４−５の経路Ｈ_２に対応付ける。 Here, when a corrected word string candidate matches the word string of one of the paths H₁ to H₃, the recognition hypothesis integrating unit 26 associates the recognition score P of the matching candidate with that path.
For example, the corrected word string candidate of branch 10, {w₁, w₂} = “多く思い出す”, matches the word string {w₁, w₂} of path H₁ of branches 1-2. For this reason, the recognition hypothesis integration unit 26 associates the recognition score P₁ of the corrected word string candidate {w₁, w₂} with path H₁ of branches 1-2.
Likewise, the corrected word string candidate of branch 11, {w₁, w₃, w₄} = “多く淘汰する”, matches the word string {w₁, w₃, w₄} of path H₂ of branches 3-4-5. For this reason, the recognition hypothesis integration unit 26 associates the recognition score P₂ of the corrected word string candidate {w₁, w₃, w₄} with path H₂ of branches 3-4-5.

一方、認識仮説統合手段２６は、修正単語列候補が枝の経路Ｈ_１〜Ｈ_３の単語列に一致しない場合、この修正単語列候補が得られる枝の経路を仮説ラティスに追加し、追加した枝の経路に認識スコアＰを対応付ける。
例えば、枝１２の修正単語列候補｛ｗ_５，ｗ_６｝＝“似通っています“は、枝１−２の経路Ｈ_１の単語列｛ｗ_１，ｗ_２｝＝“多く思い出す”、枝３−４−５の経路Ｈ_２の単語列｛ｗ_１，ｗ_３，ｗ_４｝＝“多く淘汰する”、枝３−７−２の経路Ｈ_３の単語列｛ｗ_１，ｗ_８，ｗ_２｝＝“多くを思い出す”の何れにも一致しない。 On the other hand, when a corrected word string candidate does not match the word string of any of the paths H₁ to H₃, the recognition hypothesis integration unit 26 adds a branch path that yields this candidate to the hypothesis lattice and associates the recognition score P with the added path.
For example, the corrected word string candidate {w₅, w₆} = “似通っています” of branch 12 matches none of the following: the word string {w₁, w₂} = “多く思い出す” of path H₁ of branches 1-2, the word string {w₁, w₃, w₄} = “多く淘汰する” of path H₂ of branches 3-4-5, and the word string {w₁, w₈, w₂} = “多くを思い出す” of path H₃ of branches 3-7-2.

従って、認識仮説統合手段２６は、枝１２の修正単語列候補｛ｗ_５，ｗ_６｝に含まれる単語ｗ_５，ｗ_６がそれぞれ対応付けられた枝８，９を新たに生成する。このとき、単語ｗ_５，ｗ_６の音響スコアＬ_８，Ｌ_９が存在しないため、認識仮説統合手段２６は、この音響スコアＬ_８，Ｌ_９の計算を音声認識装置１０に要求する。そして、認識仮説統合手段２６は、この要求に応じて、音声認識装置１０から入力された音響スコアＬ_８，Ｌ_９を、枝８，９に対応付ける。 Therefore, the recognition hypothesis integrating unit 26 newly generates branches 8 and 9, with which the words w₅ and w₆ included in the corrected word string candidate {w₅, w₆} of branch 12 are respectively associated. At this time, since the acoustic scores L₈ and L₉ of the words w₅ and w₆ do not exist, the recognition hypothesis integrating unit 26 requests the speech recognition apparatus 10 to calculate them. The recognition hypothesis integration unit 26 then associates the acoustic scores L₈ and L₉ returned by the speech recognition apparatus 10 in response to this request with branches 8 and 9.

さらに、認識仮説統合手段２６は、図３（ａ）に示すように、生成した枝８−９の経路Ｈ_４を仮説ラティスに接続する。具体的には、認識仮説統合手段２６は、開始時刻Ｔ_Ｓを示す節点から、枝８を分岐させる。また、認識仮説統合手段２６は、枝８の先端にある節点に枝９を接続し、終了時刻Ｔ_Ｅの節点まで伸ばす。そして、認識仮説統合手段２６は、枝８−９の経路Ｈ_４に、この経路Ｈ_４の単語列に一致する修正単語列候補｛ｗ_５，ｗ_６｝の認識スコアＰ_３を対応付ける。 Furthermore, as shown in FIG. 3A, the recognition hypothesis integration unit 26 connects path H₄ of the generated branches 8-9 to the hypothesis lattice. Specifically, the recognition hypothesis integration unit 26 makes branch 8 diverge from the node indicating the start time T_S. It also connects branch 9 to the node at the tip of branch 8 and extends it to the node of the end time T_E. Then, the recognition hypothesis integration unit 26 associates with path H₄ of branches 8-9 the recognition score P₃ of the corrected word string candidate {w₅, w₆}, which matches the word string of this path H₄.

なお、枝３−７−２の経路Ｈ_３の単語列に一致する修正単語列候補｛ｗ_１，ｗ_８，ｗ_２｝の認識スコアＰ_４が存在しない。この場合、認識仮説統合手段２６は、この認識スコアＰ_４を対応付けるための枝１３を生成し、この認識スコアＰ_４の計算を音声認識誤り部分認識手段２２に要求する。そして、認識仮説統合手段２６は、この要求に応じて、音声認識誤り部分認識手段２２から入力された認識スコアＰ_４を枝１３に対応付ける。
また、枝３−４−６については、枝６が最尤経路の節点に接続されないため、認識スコアを対応付ける必要がない。
その結果、図３（ｂ）に示すように、統合された仮説ラティスの経路Ｈ_１〜Ｈ_４には、認識スコアＰ_１〜Ｐ_４が対応付けられることになる。 Note that there is no recognition score P₄ for the corrected word string candidate {w₁, w₈, w₂} that matches the word string of path H₃ of branches 3-7-2. In this case, the recognition hypothesis integration unit 26 generates a branch 13 with which to associate the recognition score P₄, and requests the speech recognition error partial recognition unit 22 to calculate P₄. The recognition hypothesis integration unit 26 then associates the recognition score P₄ returned by the speech recognition error partial recognition unit 22 in response to this request with branch 13.
As for branches 3-4-6, branch 6 is not connected to a node of the maximum likelihood path, so no recognition score needs to be associated with it.
As a result, as shown in FIG. 3B, the recognition scores P₁ to P₄ are associated with the paths H₁ to H₄ of the integrated hypothesis lattice.
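The matching and merging rules above can be summarized in a short sketch. The Python below is a simplified illustration under stated assumptions, not the patent's procedure: paths are reduced to word tuples, the recognition score values are invented, and the `integrate` helper is hypothetical. It only reports which paths still lack a recognition score, leaving out the requests for missing acoustic scores (L₈, L₉) and for P₄ that the text describes.

```python
def integrate(paths, candidates):
    """Attach recognition scores to lattice paths, adding paths for new candidates.

    paths:      dict mapping a path name to its word sequence (tuple of str)
    candidates: dict mapping a candidate word sequence to its recognition score
    """
    merged = dict(paths)
    scores = {}
    name_of = {words: name for name, words in paths.items()}
    for words, score in candidates.items():
        name = name_of.get(words)
        if name is None:
            # No existing path spells this candidate: add a new path for it.
            name = f"H{len(merged) + 1}"
            merged[name] = words
        scores[name] = score
    # Existing paths with no matching candidate still need a recognition score.
    unscored = [name for name in merged if name not in scores]
    return merged, scores, unscored


# Paths H1..H3 from the lattice; score values below are made up for illustration.
PATHS = {
    "H1": ("多く", "思い出す"),
    "H2": ("多く", "淘汰", "する"),
    "H3": ("多く", "を", "思い出す"),
}
CANDIDATES = {
    ("多く", "思い出す"): -3.0,      # stands in for P1
    ("多く", "淘汰", "する"): -7.5,  # stands in for P2
    ("似通って", "います"): -4.2,    # stands in for P3
}

merged, scores, unscored = integrate(PATHS, CANDIDATES)
print(merged["H4"], scores, unscored)
```

Here the third candidate matches no existing path and becomes H₄, while H₃ is reported as unscored, which corresponds to the case where P₄ has to be requested separately.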

図１に戻り、音声認識誤り修正装置２０の構成について、説明を続ける。
仮説リスコアリング手段２７は、認識仮説統合手段２６から入力された仮説ラティスにおける枝の経路Ｈ毎に、音響スコアＬ及び認識スコアＰを用いて統合スコアＬ´を算出し、仮説ラティスをリスコアリングするものである。そして、仮説リスコアリング手段２７は、算出した統合スコアが最高になる枝の経路を正しい修正単語列として推定し、推定した修正単語列で音声認識誤り部分を修正する。 Returning to FIG. 1, the description of the configuration of the speech recognition error correction apparatus 20 will be continued.
The hypothesis rescoring means 27 calculates an integrated score L′ using the acoustic score L and the recognition score P for each branch path H in the hypothesis lattice input from the recognition hypothesis integration means 26, and thereby rescores the hypothesis lattice. Then, the hypothesis rescoring means 27 estimates the branch path with the highest calculated integrated score as the correct corrected word string, and corrects the speech recognition error part with the estimated corrected word string.

＜仮説ラティスのリスコアリング＞
図３を参照し、仮説リスコアリング手段２７によるリスコアリングについて、詳細に説明する（適宜図１参照）。
仮説リスコアリング手段２７は、枝の経路Ｈ毎に、音響スコアＬ及び認識スコアＰの重み付け総和を、統合スコアＬ´として算出する。つまり、仮説リスコアリング手段２７は、各枝の音響スコアＬと、各枝の重みａ（ｎ）とを乗じた値の合計値を算出する。また、仮説リスコアリング手段２７は、各経路Ｈの単語列に一致する修正単語列候補の認識スコアＰと、この認識スコアＰが対応付けられた枝の重みｂ（ｍ）を乗じ、前記した合計値に加算する。
なお、ｎは音響スコアＬが対応付けられた枝番号であり（本実施形態では、１≦ｎ≦９）、ｍは認識スコアＰが対応付けられた枝番号である（本実施形態では、１０≦ｍ≦１３）。 <Rescoring Hypothesis Lattice>
With reference to FIG. 3, the rescoring by the hypothesis rescoring means 27 will be described in detail (see FIG. 1 as appropriate).
The hypothesis rescoring means 27 calculates, for each branch path H, the weighted sum of the acoustic score L and the recognition score P as an integrated score L ′. That is, the hypothesis rescoring means 27 calculates the total value of the values obtained by multiplying the acoustic score L of each branch and the weight a (n) of each branch. Further, the hypothesis rescoring means 27 multiplies the recognition score P of the modified word string candidate that matches the word string of each path H and the weight b (m) of the branch associated with this recognition score P, as described above. Add to the total value.
Note that n is a branch number associated with the acoustic score L (in this embodiment, 1 ≦ n ≦ 9), and m is a branch number associated with the recognition score P (in this embodiment, 10 ≦ m ≦ 13).
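As a rough illustration of the weighted sum just described, the following Python sketch computes an integrated score for a single path. The function name `integrated_score` and all numeric values are hypothetical; the weights a(n) and b(m) are simply passed in as dictionaries keyed by branch number rather than derived from the reliabilities.

```python
def integrated_score(branches, acoustic, a, recog_branch, recognition, b):
    """Weighted sum for one path H: sum over n of a(n) * L_n, plus b(m) * P.

    branches:     branch numbers n on the path that carry acoustic scores
    acoustic:     dict n -> acoustic score L_n
    a:            dict n -> weight a(n)
    recog_branch: branch number m that carries the path's recognition score
    recognition:  dict m -> recognition score P on branch m
    b:            dict m -> weight b(m)
    """
    acoustic_part = sum(a[n] * acoustic[n] for n in branches)
    return acoustic_part + b[recog_branch] * recognition[recog_branch]


# Path H1 = branches 1-2 with its recognition score on branch 10.
# All numbers are invented for illustration.
score_h1 = integrated_score(
    branches=[1, 2],
    acoustic={1: -4.0, 2: -6.0},
    a={1: 0.5, 2: 0.5},
    recog_branch=10,
    recognition={10: -3.0},
    b={10: 2.0},
)
print(score_h1)
```

Keeping the weights per branch rather than per path mirrors the a(n) and b(m) indexing in the text, so different branches can be scaled differently.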

図３（ａ）では、仮説リスコアリング手段２７は、枝１−２の経路Ｈ_１について、下記の式（１）に示すように、枝１の音響スコアＬ_１に重みａ（１）を乗じた値と、枝２の音響スコアＬ_２に重みａ（２）を乗じた値と、枝１−２の経路Ｈ_１の認識スコアＰ_１に重みｂ（１０）を乗じた値との和を、統合スコアＬ´（Ｈ_１）として算出する。 In FIG. 3A, the hypothesis rescoring means 27 assigns a weight a (1) to the acoustic score L ₁ of the branch 1 for the path H ₁ of the branch 1-2, as shown in the following equation (1). The sum of the multiplied value, the value obtained by multiplying the acoustic score L ₂ of the branch 2 by the weight a (2), and the value obtained by multiplying the recognition score P ₁ of the path H ₁ of the branch 1-2 by the weight b (10). Is calculated as an integrated score L ′ (H ₁ ).

また、仮説リスコアリング手段２７は、枝３−４−５の経路Ｈ_２について、下記の式（２）に示すように、枝３の音響スコアＬ_３に重みａ（３）を乗じた値と、枝４の音響スコアＬ_４に重みａ（４）を乗じた値と、枝５の音響スコアＬ_５に重みａ（５）を乗じた値と、枝３−４−５の経路Ｈ_２の認識スコアＰ_２に重みｂ（１１）を乗じた値との和を、統合スコアＬ´（Ｈ_２）として算出する。 For path H₂ of branches 3-4-5, the hypothesis rescoring unit 27 calculates, as shown in the following equation (2), the integrated score L′(H₂) as the sum of the acoustic score L₃ of branch 3 multiplied by the weight a(3), the acoustic score L₄ of branch 4 multiplied by the weight a(4), the acoustic score L₅ of branch 5 multiplied by the weight a(5), and the recognition score P₂ of path H₂ of branches 3-4-5 multiplied by the weight b(11).

また、仮説リスコアリング手段２７は、枝３−７−２の経路Ｈ_３について、下記の式（３）に示すように、枝３の音響スコアＬ_３に重みａ（３）を乗じた値と、枝７の音響スコアＬ_７に重みａ（７）を乗じた値と、枝２の音響スコアＬ_２に重みａ（２）を乗じた値と、枝３−７−２の経路Ｈ_３の認識スコアＰ_４に重みｂ（１３）を乗じた値との和を、統合スコアＬ´（Ｈ_３）として算出する。 For path H₃ of branches 3-7-2, the hypothesis rescoring unit 27 calculates, as shown in the following equation (3), the integrated score L′(H₃) as the sum of the acoustic score L₃ of branch 3 multiplied by the weight a(3), the acoustic score L₇ of branch 7 multiplied by the weight a(7), the acoustic score L₂ of branch 2 multiplied by the weight a(2), and the recognition score P₄ of path H₃ of branches 3-7-2 multiplied by the weight b(13).

また、仮説リスコアリング手段２７は、枝８−９の経路Ｈ_４について、下記の式（４）に示すように、枝８の音響スコアＬ_８に重みａ（８）を乗じた値と、枝９の音響スコアＬ_９に重みａ（９）を乗じた値と、枝８−９の経路Ｈ_４の認識スコアＰ_３に重みｂ（１２）を乗じた値との和を、統合スコアＬ´（Ｈ_４）として算出する。 For path H₄ of branches 8-9, the hypothesis rescoring unit 27 calculates, as shown in the following equation (4), the integrated score L′(H₄) as the sum of the acoustic score L₈ of branch 8 multiplied by the weight a(8), the acoustic score L₉ of branch 9 multiplied by the weight a(9), and the recognition score P₃ of path H₄ of branches 8-9 multiplied by the weight b(12).
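The equation images for equations (1) to (4) are not reproduced in this text. Written out from the prose descriptions above, they would read as follows; this is a reconstruction from the surrounding definitions, not a copy of the published formulas.

```latex
\begin{align}
L'(H_1) &= a(1)\,L_1 + a(2)\,L_2 + b(10)\,P_1 \tag{1}\\
L'(H_2) &= a(3)\,L_3 + a(4)\,L_4 + a(5)\,L_5 + b(11)\,P_2 \tag{2}\\
L'(H_3) &= a(3)\,L_3 + a(7)\,L_7 + a(2)\,L_2 + b(13)\,P_4 \tag{3}\\
L'(H_4) &= a(8)\,L_8 + a(9)\,L_9 + b(12)\,P_3 \tag{4}
\end{align}
```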

ここで、重みａ（ｎ）及びｂ（ｍ）は、番組音声の認識仮説（音響スコア）の信頼度をａ￣とし、音声認識誤り部分認識手段２２の認識仮説（認識スコア）の信頼度をｂ￣とすると、それぞれ、下記の式（５）及び式（６）で表すことができる。 Here, the weights a(n) and b(m) can be expressed by the following equations (5) and (6), respectively, where ā denotes the reliability of the recognition hypothesis (acoustic score) of the program audio and b̄ denotes the reliability of the recognition hypothesis (recognition score) of the speech recognition error partial recognition means 22.

番組音声の認識仮説に比べ、音声認識誤り部分認識手段２２の認識仮説の方が信頼できるため、信頼度ａ￣よりも信頼度ｂ￣が高くなるように予め設定されることが多い。また、Ｃ（ｎ）及びＣ（ｍ）は、音響スコアＬと認識スコアＰのダイナミックレンジを揃えるために予め設定される重みであり、音声認識や手書き文字認識の入力複雑さやパラメータ数を示す。一般的に、音声認識の入力複雑さは、その単語の専有時間(フレーム数)に重みをつけた量で評価される。また、手書き文字認識の場合、文字数や画数に重みをつけた量で評価される。 Because the recognition hypothesis of the speech recognition error partial recognition means 22 is more reliable than that of the program audio, the reliability b̄ is often set in advance to be higher than the reliability ā. C(n) and C(m) are weights set in advance to align the dynamic ranges of the acoustic score L and the recognition score P, and indicate the input complexity or the number of parameters of the speech recognition or handwritten character recognition. In general, the input complexity of speech recognition is evaluated as a weighted quantity based on the time (number of frames) occupied by the word. For handwritten character recognition, it is evaluated as a weighted quantity based on the number of characters or strokes.

前記したように、仮説リスコアリング手段２７は、仮説を展開して統合スコアＬ´を算出することができる。 As described above, the hypothesis rescoring means 27 can calculate the integrated score L ′ by expanding the hypothesis.

その後、仮説リスコアリング手段２７は、仮説ラティスで音声認識誤り部分以外の最尤経路を制約して、音声認識装置１０の言語モデルから算出される仮説文章全体の文章らしさを示す言語スコアと、統合スコアＬ´とを用いて再度評価を行い、この両者の合計スコアが最も高くなる仮説を正しい修正単語列として推定する。図３の例では、仮説リスコアリング手段２７は、音声認識誤り部分の単語候補を、その前後の単語列“ビデオを見て”及び“ことができた”とつなげた上で言語スコアを算出し、統合スコアＬ´とともに再度評価する。その後、仮説リスコアリング手段２７は、音声認識装置１０から入力された音声認識結果単語列の誤り部分を、正しい修正単語列に修正する。 Thereafter, the hypothesis rescoring means 27 constrains the maximum likelihood path other than the speech recognition error part in the hypothesis lattice, and indicates a language score indicating the sentence-likeness of the entire hypothesis sentence calculated from the language model of the speech recognition apparatus 10; The evaluation is performed again using the integrated score L ′, and the hypothesis having the highest total score of both is estimated as a correct corrected word string. In the example of FIG. 3, the hypothesis re-scoring means 27 calculates the language score after connecting the word candidates of the speech recognition error part with the word strings “watching the video” and “being able”. Then, the evaluation is performed again together with the integrated score L ′. Thereafter, the hypothesis rescoring means 27 corrects the error part of the speech recognition result word string input from the speech recognition device 10 to a correct corrected word string.
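The final selection step can be pictured as adding a sentence-level language score to each path's integrated score and taking the maximum. The Python sketch below assumes a toy language model implemented as a lookup table over whole sentences; `TOY_LM`, `language_score`, `select_correction`, the fallback value, and every number are hypothetical and say nothing about which candidate the patent's example actually selects.

```python
PREFIX = "ビデオを見て"   # fixed context before the error part
SUFFIX = "ことができた"   # fixed context after the error part

# Toy sentence-level language scores (invented values).
TOY_LM = {
    "ビデオを見て多く思い出すことができた": -5.0,
    "ビデオを見て多く淘汰することができた": -9.0,
}


def language_score(words):
    """Score the whole sentence with the candidate spliced into its context."""
    sentence = PREFIX + "".join(words) + SUFFIX
    return TOY_LM.get(sentence, -20.0)  # unseen sentences get a low fallback score


def select_correction(integrated):
    """Pick the word sequence maximizing language score + integrated score."""
    return max(integrated, key=lambda words: language_score(words) + integrated[words])


# Integrated scores L' per candidate path (invented values).
INTEGRATED = {
    ("多く", "思い出す"): -11.0,
    ("多く", "淘汰", "する"): -10.0,
    ("似通って", "います"): -8.0,
}

best = select_correction(INTEGRATED)
print("".join(best))
```

Splicing each candidate between the fixed prefix and suffix reflects the constraint that only the error part varies while the rest of the maximum likelihood path is held fixed.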

[音声認識誤り修正装置の動作]
図４を参照し、音声認識誤り修正装置２０の動作について、詳細に説明する。
音声認識誤り修正装置２０は、音声認識装置１０から、仮説ラティスが入力される（ステップＳ１）。
音声認識誤り修正装置２０は、修正指示入力手段２１によって、音声認識誤り部分を特定し（ステップＳ２）、音声認識誤りの種類を選択する（ステップＳ３）。 [Operation of voice recognition error correction device]
With reference to FIG. 4, the operation of the speech recognition error correction apparatus 20 will be described in detail.
The speech recognition error correction device 20 receives a hypothesis lattice from the speech recognition device 10 (step S1).
The speech recognition error correcting device 20 specifies the speech recognition error part by the correction instruction input means 21 (step S2), and selects the type of speech recognition error (step S3).

音声認識誤り修正装置２０は、音声認識誤り部分認識手段２２によって、修正者が発話又は手書き文字を入力し、修正単語列候補毎の認識スコアを算出する（ステップＳ４）。
音声認識誤り修正装置２０は、認識仮説統合手段２６によって、ステップＳ１で入力された仮説ラティスの音声認識誤り部分に、修正単語列候補毎の認識スコアが対応付けられた枝を統合する（ステップＳ５）。 In the speech recognition error correcting device 20, the corrector inputs utterances or handwritten characters by the speech recognition error partial recognition means 22, and calculates a recognition score for each corrected word string candidate (step S4).
The speech recognition error correction apparatus 20 uses the recognition hypothesis integration unit 26 to integrate branches, each associated with the recognition score of a corrected word string candidate, into the speech recognition error part of the hypothesis lattice input in step S1 (step S5).

音声認識誤り修正装置２０は、仮説リスコアリング手段２７によって、仮説ラティスにおける枝の経路毎に統合スコアを算出し、言語スコア及び統合スコアＬ´を用いてリスコアリングを行い、両者の合計スコアが最も高くなる仮説を正しい修正単語列として推定する（仮説リスコアリング：ステップＳ６）。
音声認識誤り修正装置２０は、仮説リスコアリング手段２７によって、推定した修正単語列で音声認識誤り部分を修正し、修正結果として出力する（ステップＳ７）。 The speech recognition error correction device 20 calculates an integrated score for each branch path in the hypothesis lattice by the hypothesis re-scoring means 27, performs re-scoring using the language score and the integrated score L ′, and calculates the total score of both. Is estimated as a correct corrected word string (hypothesis rescoring: step S6).
The speech recognition error correcting device 20 corrects the speech recognition error part with the estimated corrected word string by the hypothesis re-scoring means 27 and outputs it as a correction result (step S7).

以上のように、本願発明の実施形態に係る音声認識誤り修正装置２０は、異なる誤り傾向を有する音声認識結果と手書き文字の認識結果とを仮説ラティスの音声認識誤り部分に相補的に統合し、統合した仮説ラティスから統合スコアを算出するため、正しい修正単語列を高精度に推定することができる。 As described above, the speech recognition error correction apparatus 20 according to the embodiment of the present invention complementarily integrates speech recognition results having different error tendencies and handwritten character recognition results into the speech recognition error portion of the hypothesis lattice, Since an integrated score is calculated from the integrated hypothesis lattice, a correct corrected word string can be estimated with high accuracy.

さらに、音声認識誤り修正装置２０は、音声認識誤りを修正者が簡単に修正することが可能になり、修正者が修正操作に煩わされることなく、音声認識誤りの発見及び修正に専念することができる。
さらに、音声認識誤り修正装置２０は、手書き文字や発話など特殊な技能を必要としない入力方法を利用できるようになり、修正作業を行うにあたり、修正操作を熟知する手間が低減される。これにより、音声認識誤り修正装置２０は、より多くの人が修正作業に携われるようになり、字幕番組の拡充及び制作コストの低減が可能となる。 Furthermore, the speech recognition error correction device 20 allows the corrector to easily correct the speech recognition error, and the corrector can concentrate on finding and correcting the speech recognition error without being troubled by the correction operation. it can.
Furthermore, the speech recognition error correction device 20 can use an input method that does not require special skills such as handwritten characters and utterances, and the time and effort required to know the correction operation when performing the correction work is reduced. As a result, the speech recognition error correction apparatus 20 can make more people involved in the correction work, and can expand the caption program and reduce the production cost.

（変形例）
なお、音声認識誤り修正装置２０は、前記した実施形態に限定されず、その趣旨を逸脱しない範囲で種々の変形を加えることができる。
仮説リスコアリング手段２７は、枝の経路毎に、音響スコアＬに相当する事後確率と、認識スコアＰに相当する事後確率との総和を、統合スコアＬ´として算してもよい。 (Modification)
Note that the speech recognition error correction apparatus 20 is not limited to the above-described embodiment, and various modifications can be made without departing from the spirit thereof.
The hypothesis rescoring means 27 may calculate the sum of the posterior probability corresponding to the acoustic score L and the posterior probability corresponding to the recognition score P for each branch path as the integrated score L ′.

具体的には、仮説リスコアリング手段２７は、音響スコアＬに相当する事後確率Ｌ（ｉ）を、フォワードバックワードアルゴリズムを用いて算出できる。また、仮説リスコアリング手段２７は、音響スコアＬと同様に認識スコアＰが対数尤度に相当する場合、この認識スコアＰに相当する事後確率Ｐ´（ｍ）を、下記の式（７）で定義された対数尤度算出式を用いて算出できる。 Specifically, the hypothesis rescoring means 27 can calculate the posterior probability L (i) corresponding to the acoustic score L using a forward backward algorithm. Also, the hypothesis rescoring means 27, when the recognition score P corresponds to the log likelihood as in the case of the acoustic score L, the posterior probability P ′ (m) corresponding to the recognition score P is expressed by the following equation (7). It can be calculated using the log likelihood calculation formula defined in.

このように事後確率を求めた場合、仮説リスコアリング手段２７は、枝１−２の経路Ｈ_１について、下記の式（８）に示すように、枝１の音響スコアＬ_１に相当する事後確率Ｌ（１）と、枝２の音響スコアＬ_２に相当する事後確率Ｌ（２）と、枝１−２の経路Ｈ_１の認識スコアＰ_１に相当する事後確率Ｐ´（１０）との総和を、統合スコアＬ´（Ｈ_１）として算出する。 When the posterior probabilities are obtained in this way, the hypothesis rescoring means 27 calculates, for path H₁ of branches 1-2, as shown in the following equation (8), the integrated score L′(H₁) as the sum of the posterior probability L(1) corresponding to the acoustic score L₁ of branch 1, the posterior probability L(2) corresponding to the acoustic score L₂ of branch 2, and the posterior probability P′(10) corresponding to the recognition score P₁ of path H₁ of branches 1-2.
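The image for equation (8) is not reproduced in this text. From the description above it would take the following form; this is a reconstruction from the prose, not a copy of the published formula.

```latex
\begin{equation}
L'(H_1) = L(1) + L(2) + P'(10) \tag{8}
\end{equation}
```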

なお、仮説リスコアリング手段２７は、他の経路Ｈについても同様に統合スコアＬ´を算出できるため、詳細な説明を省略する。
また、仮説リスコアリング手段２７は、フォワードバックワードアルゴリズム以外、事後確率Ｌ（ｉ）を近似的に算出する手法も利用できる。 The hypothesis rescoring means 27 can calculate the integrated score L ′ in the same way for the other routes H, and thus detailed description thereof is omitted.
Further, the hypothesis rescoring means 27 can use a method of approximately calculating the posterior probability L (i) other than the forward backward algorithm.

１字幕生成システム
１０音声認識装置
１１音声認識手段
１３音響モデル
１５言語モデル
１７発音辞書
２０音声認識誤り修正装置
２１修正指示入力手段
２２音声認識誤り部分認識手段
２３音声認識手段
２４手書き文字認識手段
２５スイッチ
２６認識仮説統合手段（仮説ラティス統合手段）
２７仮説リスコアリング手段（音声認識誤り部分修正手段）
３０表示装置
DESCRIPTION OF SYMBOLS
1 Subtitle generation system
10 Speech recognition apparatus
11 Speech recognition means
13 Acoustic model
15 Language model
17 Pronunciation dictionary
20 Speech recognition error correction apparatus
21 Correction instruction input means
22 Speech recognition error partial recognition means
23 Speech recognition means
24 Handwritten character recognition means
25 Switch
26 Recognition hypothesis integration means (hypothesis lattice integration means)
27 Hypothesis rescoring means (speech recognition error partial correction means)
30 Display device

Claims

1. A speech recognition error correction apparatus that corrects a speech recognition error included in a word string indicating a speech recognition result of program audio with a correct corrected word string, the apparatus comprising:
speech recognition error part recognition means comprising speech recognition means for performing speech recognition of a corrector's utterance for a speech recognition error part, handwritten character recognition means for performing handwritten character recognition of the corrector's handwriting for the speech recognition error part, and a switch with which either the speech recognition or the handwritten character recognition is selected in advance and which outputs, as a result of the selected speech recognition or handwritten character recognition, corrected word string candidates that are candidates for the correct corrected word string and a recognition score for each of the corrected word string candidates;
hypothesis lattice integration means for receiving a hypothesis lattice composed of branches, each associating a word evaluated in the speech recognition of the program audio with an acoustic score of the word, and branch nodes indicating the position of each word, and for integrating the hypothesis lattice by connecting branches associated with the corrected word string candidates and the recognition scores to the branch nodes located at the start point and the end point of the speech recognition error part of the input hypothesis lattice; and
speech recognition error part correcting means for calculating an integrated score using the acoustic score and the recognition score for each branch path from the start point to the end point of the speech recognition error part of the integrated hypothesis lattice, and for estimating the branch path with the highest calculated integrated score as the correct corrected word string,
wherein, when the result of the handwritten character recognition consists entirely of hiragana or katakana, the speech recognition error part recognition means reads out, from a pronunciation dictionary used for speech recognition of the program audio or a pronunciation dictionary used for speech recognition of the speech recognition error part, words whose phoneme strings correspond to the hiragana or katakana notation, and uses them as the corrected word string candidates.

2. The speech recognition error correction apparatus according to claim 1, wherein the speech recognition error partial recognition means performs speech recognition using an acoustic model for a specific speaker unique to the corrector.

3. The speech recognition error correction apparatus according to claim 1, wherein the speech recognition error part correcting means calculates, as the integrated score for each branch path, a weighted sum of the acoustic score and the recognition score.

4. The speech recognition error correction apparatus according to claim 1, wherein the speech recognition error part correcting means calculates, as the integrated score for each branch path, the sum of a posterior probability corresponding to the acoustic score and a posterior probability corresponding to the recognition score calculated by a preset log-likelihood calculation formula.

5. A speech recognition error correction program for causing a computer to function as the speech recognition error correction apparatus according to any one of claims 1 to 4.