JP2015125232A

JP2015125232A - Voice recognition error correction device

Info

Publication number: JP2015125232A
Application number: JP2013268910A
Authority: JP
Inventors: 庄衛佐藤; Shoe Sato
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-12-26
Filing date: 2013-12-26
Publication date: 2015-07-06
Anticipated expiration: 2033-12-26
Also published as: JP6232282B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition error correction device that reduces the automatic correction errors that occur in the conventional block matching method. SOLUTION: A speech recognition error correction device 100 comprises: WFST storage means 110 storing a corresponding manuscript set, represented as a weighted finite state transducer (WFST), constructed by reading a manuscript text set 200 in advance; node data update means 120 that, each time a word of the recognized word string output by a speech recognition device 220 is received, calculates and updates, as node data, the scores of the states reachable on the WFST network; node data storage means 130; manuscript search means 140 that, each time a processing start condition is satisfied, traces back over the WFST network based on the node data stored at that point and sequentially confirms, as error correction results, hypotheses that partially approximate the final best hypothesis; and manuscript output means 150 that sequentially outputs the corresponding manuscripts confirmed as error correction results.

Description

The present invention relates to speech recognition technology, and more particularly to a speech recognition error correction device that automatically corrects recognition errors.

For example, when subtitles are generated by applying speech recognition to the broadcast audio of a television news program, an operator visually checks the speech recognition result presented on, for example, a touch panel monitor word by word and corrects any errors found. Patent Document 1 describes a technique that reduces this operator workload by correcting speech recognition errors automatically. Such automatic correction is possible because the recognition result for, say, a news program is the recognition result of speech uttered on the basis of a manuscript text that serves as the source of the utterance.

However, although the utterance content to be recognized is based on the manuscript, parts of it may be rephrased, some phrases may be dropped, or new information may be added. Meanwhile, because the speech recognition device is adapted using the manuscript text, it recognizes the parts of the uttered word string that match the manuscript with high accuracy, but its accuracy falls for the parts that do not match. The recognition result also contains errors caused by noise, unclear articulation, slips of the tongue, hesitations, and the like.

The technique described in Patent Document 1 uses a word chain block of length N (N words) as the manuscript section to be confirmed as an output candidate from the speech recognition result. Here, the manuscript section to be confirmed as an output candidate is the section over which the word string of the speech recognition result is compared with a word string in the manuscript to obtain their match rate. The technique assumes that the word string of the speech recognition result (N_h words) and the word string in the manuscript (N_r words) both have the same number of words N. Taking the word string of the speech recognition result (N = N_h words) as the reference, it searches for the manuscript word string with the smallest mismatch rate and outputs part of it as the correction result. This prior-art method is hereinafter called the block matching method.

For simplicity, the block matching method is described with reference to FIG. 8, taking N = N_h = 6. Assume that one of the manuscript texts to be spoken is the original manuscript 600, "西日本と東日本の太平洋側では先月気象庁が …" (roughly, "On the Pacific side of western and eastern Japan, last month the Meteorological Agency …"). Collation block B1 is set to cover six words starting at "西" (west) of the original manuscript 600, so collation manuscript 701 is the six-word chain from "西" to "の". Collation block B2 is shifted one word to the right of B1, so collation manuscript 702 is the six-word chain from "日本" to "太平洋". Likewise, collation block B3 is shifted one word from B2, giving collation manuscript 703, the six-word chain from "と" to "側". Collation block B4 is shifted one word from B3, giving collation manuscript 704, the six-word chain from "東" to "では". Finally, collation block B5 is shifted one word from B4, giving collation manuscript 705, the six-word chain from "日本" to "先月".

Suppose the recognition result 801 obtained by reading the original manuscript 600 aloud is "日本の大西洋側では先月" ("on the Atlantic side of Japan, last month"). The third input word, "大西洋" (Atlantic), is a misrecognition of "太平洋" (Pacific). Among collation blocks B1 to B5, collation manuscript 705 has the highest match rate with recognition result 801. The corrected output 901 is therefore "日本の太平洋側では先月" ("on the Pacific side of Japan, last month"), the same text as collation manuscript 705, and the error is corrected.
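The sliding-block search in this example can be sketched as follows. This is only an illustration, not the implementation of Patent Document 1: the position-wise match rate, the function names, and the split of "気象庁が" into two words are assumptions made for the sketch.

```python
# Original manuscript 600 as a word list, following the block boundaries in the text.
# The split of the trailing "気象庁が" into two words is an assumption.
MANUSCRIPT = ["西", "日本", "と", "東", "日本", "の",
              "太平洋", "側", "では", "先月", "気象庁", "が"]


def match_rate(block, hyp):
    """Fraction of positions at which block and hypothesis agree (assumed measure)."""
    return sum(a == b for a, b in zip(block, hyp)) / len(hyp)


def best_block(manuscript, hyp):
    """Slide a window of len(hyp) words over the manuscript, one word at a time,
    and return the 0-based index and contents of the best-matching block."""
    n = len(hyp)
    blocks = [manuscript[i:i + n] for i in range(len(manuscript) - n + 1)]
    best = max(range(len(blocks)), key=lambda i: match_rate(blocks[i], hyp))
    return best, blocks[best]


# Recognition result 801: "大西洋" (Atlantic) is a misrecognition of "太平洋" (Pacific).
hyp_801 = ["日本", "の", "大西洋", "側", "では", "先月"]
index, corrected = best_block(MANUSCRIPT, hyp_801)
print(f"B{index + 1}", "".join(corrected))
```

With N = 6, the fifth window is selected, which corresponds to collation block B5 and collation manuscript 705 in the description.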

Patent Document 1: JP 2012-128188 A

The prior-art block matching method can perform automatic correction, but it leaves room for improvement: it produces automatic correction errors because the block boundaries in the recognized word string are unknown. In the example above, "気象庁" (Meteorological Agency) follows "先月" (last month), the last word of the confirmed word string, in the original manuscript 600. For the word string input after recognition result 801, the match rate between the next recognition result and the collation manuscript is therefore expected to be highest when collation block B6 is set to cover six words starting at "気象庁", so that collation manuscript 706 is the six-word chain starting at "気象庁". However, it was not known in advance that a block boundary lies between "先月" and "気象庁".

For example, if a phrase is dropped from part of the utterance, the number of words N_r in the manuscript section to be confirmed as an output candidate becomes larger than the number of words N = N_h in the reference word string of the speech recognition result. Letting δ_d be the number of words in the dropped phrase, N_r = N_h + δ_d. From the standpoint of the reference word count N = N_h, this means N = N_h < N_r. Correction candidates should properly be sought in a manuscript section of length N_r, which is longer than the reference word count N, but the appropriate N_r is unknown and cannot be set.

As a concrete example, suppose the recognition result 802 shown in FIG. 9 is obtained from the original manuscript 600 shown in FIG. 8. Recognition result 802 is "東日本の側では先月気象庁が …". Looking at the block formed by the first to sixth input words, "東 … 先月", and comparing it with the original manuscript 600, "太平洋" has been dropped at the fourth position, where the recognized word "側" appears instead. That is, δ_d = 1. If the manuscript to output is searched for among the collation manuscripts of section length 6 shown in FIG. 8, collation manuscript 704 has the highest match rate with recognition result 802 of FIG. 9. Within this length-6 section, the dropped "太平洋" is thereby restored. The next collation block, however, causes a problem. For the block formed by the seventh and subsequent input words of recognition result 802, "気象庁が …", the highest match rate is obtained by collation manuscript 706. Consequently, in the corrected output 902 formed by concatenating the previous and current matches, "先月", the recognized word at the block boundary, is missing from the output.

Conversely, if the utterance contains added information or phrases repeated because of hesitation, the number of words N_r in the manuscript section to be confirmed as an output candidate becomes smaller than the number of words N = N_h in the reference word string of the speech recognition result. Letting δ_i be the number of added words, or the number of words uttered repeatedly because of hesitation and the like, N_r = N_h − δ_i. From the standpoint of the reference word count N = N_h, this means N = N_h > N_r. Correction candidates should properly be sought in a manuscript section of length N_r, which is shorter than the reference word count N, but the appropriate N_r is unknown and cannot be set.

As a concrete example, suppose the recognition result 803 shown in FIG. 10 is obtained from the original manuscript 600 shown in FIG. 8. Recognition result 803 is the six-word string "日本の泰平用賀までは". Compared with the original manuscript 600, the phrase segmentation differs and word errors have occurred, so that these six words cover only five words of the original manuscript. That is, δ_i = 1. If the manuscript to output is searched for among the collation manuscripts of section length 6 shown in FIG. 8, two collation manuscripts, 705 and 704, tie for the highest match rate with recognition result 803 of FIG. 10.

If, for example, collation manuscript 1 (collation manuscript 705) is adopted, corrected output 1 of the current match is corrected output 901, the same text as collation manuscript 705. However, if "先月" is then input as the recognized word following recognition result 803, the collation manuscript whose block begins at "先月" scores highest in the next match, so the output immediately after corrected output 1 (the output of the next match) is corrected output 903. In the concatenated corrected output, "先月", the recognized word at the block boundary, is therefore output twice.

Similarly, if collation manuscript 2 (collation manuscript 704) is adopted, corrected output 2 of the current match is corrected output 904, the same text as collation manuscript 704. However, if "… 西日本と東" was input as the recognized words preceding recognition result 803, the collation manuscript whose block ends at "東" scored highest in the immediately preceding match, so the output immediately before corrected output 2 (the output of the previous match) is corrected output 905. In the concatenated corrected output, "東", the recognized word at the block boundary, is therefore output twice.

Moreover, such mismatches in the manuscript section corresponding to the reference word string of the speech recognition result arise not only from the utterance content but also from recognition errors of the speech recognition device. For example, an error in which several words are recognized as one word causes the same kind of mismatch as a dropped phrase, while an error in which one word is recognized as several words causes the same kind of mismatch as added information. In particular, when a misrecognized word lies at a block boundary, the corresponding manuscript section becomes inappropriate and the error often cannot be corrected properly.

In the prior art, these inappropriate manuscript sections are matched against the word string of the speech recognition result, so automatic correction errors occur at block boundaries, such as a word going missing or the same word being output twice. As long as the boundaries (sentence boundaries) of the word hypothesis sequence produced by speech recognition are unknown, the same automatic correction errors occur even if the unit used to align the recognition result with the manuscript is a sentence or some other unit.
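The dropout and double-output cases described above can be reproduced with a small check on where consecutive blocks fall in the manuscript. This is a hedged sketch: the half-open index spans and the helper name `boundary_errors` are illustrative assumptions, the block choices are taken as stated in the description rather than computed, and spans are truncated to the known part of manuscript 600.

```python
# Known part of original manuscript 600 (the split of "気象庁が" is an assumption).
MANUSCRIPT = ["西", "日本", "と", "東", "日本", "の",
              "太平洋", "側", "では", "先月", "気象庁", "が"]


def boundary_errors(manuscript, spans):
    """spans: half-open (start, end) manuscript index ranges of consecutive outputs.
    Returns the words lost in gaps and the words repeated in overlaps."""
    dropped, doubled = [], []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        dropped += manuscript[prev_end:next_start]  # gap between blocks
        doubled += manuscript[next_start:prev_end]  # overlap between blocks
    return dropped, doubled


# Dropout: collation manuscript 704 ("東" to "では"), then 706 (from "気象庁").
print(boundary_errors(MANUSCRIPT, [(3, 9), (10, 12)]))
# Double output: collation manuscript 705 ("日本" to "先月"), then a block starting at "先月".
print(boundary_errors(MANUSCRIPT, [(4, 10), (9, 12)]))
# Double output: a block ending at "東", then collation manuscript 704 (from "東").
print(boundary_errors(MANUSCRIPT, [(0, 4), (3, 9)]))
```

The first call reports "先月" as dropped, and the other two report "先月" and "東" as output twice, matching the three cases in the description.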

The present invention has been made in view of the above problems, and its object is to provide a speech recognition error correction device that can reduce the automatic correction errors that occur in the conventional block matching method.

To solve the above problem, a speech recognition error correction device according to the present invention receives as input a recognized word string output by a speech recognition device that recognizes speech uttered by reading aloud manuscripts included in a manuscript text set, and corrects errors contained in the recognized word string by estimating the word string of the corresponding manuscript from a corresponding manuscript set stored in advance. The device comprises corresponding manuscript set storage means, node data update means, node data storage means, manuscript search means, and manuscript output means.

With this configuration, the corresponding manuscript set storage means stores the corresponding manuscript set, which is constructed by reading the manuscript text set in advance and is represented as a weighted finite state transducer whose network consists of nodes representing states and branches representing state transitions between nodes. Each time a word of the recognized word string is received, the node data update means calculates and updates, as node data, the scores of the states reachable on the network of the weighted finite state transducer, and the node data storage means stores the calculated node data for each update time. Each time a predetermined processing start condition is satisfied, without waiting for the recognition results of all recognized word strings for all manuscripts that would be needed to confirm the final best hypothesis, the manuscript search means traces back over the network based on the node data stored at that point and sequentially confirms, as error correction results, hypotheses that partially approximate the final best hypothesis. The manuscript output means then sequentially outputs the corresponding manuscripts confirmed as error correction results.

According to the present invention, the automatic correction errors caused by block boundaries in the conventional block matching method can be reduced.

FIG. 1 is a block diagram schematically showing a system including a speech recognition error correction device according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing an example of constructing a weighted finite state transducer.
FIG. 3 is a block diagram schematically showing the configuration of the speech recognition error correction device according to the embodiment of the present invention.
FIG. 4 is a diagram (part 1) for explaining traceback and manuscript division by the speech recognition error correction device according to the embodiment of the present invention.
FIG. 5 is a diagram (part 2) for explaining traceback and manuscript division by the speech recognition error correction device according to the embodiment of the present invention.
FIG. 6 is a flowchart showing the flow of processing by the speech recognition error correction device according to the embodiment of the present invention.
FIG. 7 is a schematic diagram showing examples of algorithms applicable to a weighted finite state transducer.
FIG. 8 is a schematic diagram for explaining the correction of a recognition error in the prior art.
FIG. 9 is a schematic diagram for explaining the dropout of a recognition result in the prior art.
FIG. 10 is a schematic diagram for explaining the double output of a recognition result in the prior art.

The speech recognition error correction device of the present invention is described in detail below.
The speech recognition error correction device 100 shown in FIG. 1 receives as input the word string of the recognition result (recognized word string) output by the speech recognition device 220, which recognizes speech uttered by reading aloud a manuscript 201 included in the manuscript text set 200, and corrects errors contained in the recognized word string by estimating the word string of a corresponding manuscript stored in advance. The information that the speech recognition error correction device 100 stores in advance for this estimation is a set of corresponding manuscripts constructed by reading the manuscript text set 200 in advance, represented as a weighted finite state transducer (hereinafter, WFST) whose network consists of nodes representing states and branches representing state transitions between nodes. The speech recognition error correction device 100 examines the best hypothesis on the WFST network sequentially and, using the edit distance between the word string of the corresponding manuscript on the WFST and the recognized word string as its criterion, approximates the final best hypothesis and confirms partial correction results sequentially without waiting for all words to be input.
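The edit distance used as the criterion here can be illustrated with the standard word-level Levenshtein distance. The sketch below assumes unit costs for substitution, insertion, and deletion; the description in this section does not state the costs the device actually uses, and the function name is illustrative.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between a manuscript word string (ref)
    and a recognized word string (hyp), with unit costs (an assumption)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # r is missing from hyp
                           cur[j - 1] + 1,           # h is an extra word in hyp
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


# Substitution: "大西洋" recognized in place of "太平洋".
print(edit_distance(["日本", "の", "太平洋", "側", "では", "先月"],
                    ["日本", "の", "大西洋", "側", "では", "先月"]))
# Dropout: "太平洋" missing from the recognized words.
print(edit_distance(["東", "日本", "の", "太平洋", "側", "では", "先月"],
                    ["東", "日本", "の", "側", "では", "先月"]))
```

Unlike a fixed-length block comparison, the two word strings may differ in length, so a dropped or added word costs a single edit instead of shifting every later position.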

The example in FIG. 1 schematically shows an entire system, including the speech recognition error correction device 100, for adding subtitles by speech recognition to a news program originating from a local broadcasting station. The speech in such a program is, for the most part, based on manuscript texts prepared in advance. This example was chosen because large key stations employ operators who manually correct the recognition errors in speech recognition results, whereas local stations currently find it difficult to employ such operators. The present embodiment can resolve this operator staffing problem at local stations.

The manuscript text set 200 shown in FIG. 1 represents the whole of the text written out in advance of what a person is scheduled to say. The manuscript text set 200 is divided into many finer individual items according to, for example, units that delimit word strings, such as sentences, passages, and paragraphs, or content categories such as theme and topic. Each such individual item is hereinafter simply called a manuscript. In the following description, the unit of a word string is assumed to be a sentence, as one example.

The present embodiment assumes, for example, the following conditions (A1) to (A7).
(A1) A plurality of manuscript sentences in the manuscript text set 200 are read aloud as the target of speech recognition.
(A2) Even for a single news item, several updated versions of the manuscript are prepared, and it is not known in advance which version will be read in which news program.
(A3) The order in which the manuscript sentences will be read is not known in advance.
(A4) Some manuscript sentences in the manuscript text set 200 are never read.
(A5) Depending on the reader, the wording may be changed deliberately instead of being read as written, or slips of the tongue may occur.
(A6) The overriding premise is to avoid sending out subtitles that have become unintelligible because of recognition errors of the speech recognition device 220 and that would mislead or annoy viewers. An unintelligible recognition result is therefore not sent out. Instead, the manuscript automatically estimated to be closest to the utterance content (the prior manuscript), which has been proofread and checked by an editor in advance, is sent out as the subtitle.
(A7) When no manuscript corresponding to the recognition result exists in the first place, as in interview segments, automatic estimation is impossible, so no subtitles are sent out for interview segments and the like that have no original manuscript.

The manuscript text set 200 is a set of electronic data of manuscripts submitted by reporters, for example for news programs, and is stored in a general storage device such as a hard disk or in storage means on a network. The manuscript text set 200 is also used to construct the WFST of the corresponding manuscript set in advance.

When raw speech data is input, the speech recognition device 220 recognizes it using an acoustic model based on hidden Markov models (HMMs) and a language model, and generates the result as a recognized word string. In the present embodiment, the speech recognition device 220 is not particularly limited, and a conventionally known device can be used.

As stated in condition (A2), several versions of the manuscript have been submitted for each news item, and it cannot be determined in advance which version will be broadcast or in what order. Under these circumstances, the speech recognition device 220 must perform speech recognition so that it can be determined immediately whether a corresponding manuscript exists for the uttered speech at all. To align speech recognition results with manuscripts accurately, the language model used for speech recognition is therefore preferably adapted using the manuscript text set 200, so that recognition accuracy is high when a manuscript is read as written.

The transducer construction device 240 constructs the WFST that serves as the set of corresponding manuscripts (corresponding manuscript set) used by the speech recognition error correction device 100.
The transducer construction device 240 constructs this WFST in advance from the manuscripts to be read aloud and recognized, that is, from the manuscript sentences included in the manuscript text set 200. A WFST is a finite state machine with input symbols, output symbols, and transition weights, and it can efficiently handle inputs and outputs of different granularity, such as words and sentences. The construction of the WFST is described later.
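As a rough picture of such a machine, the sketch below stores each transition with an input symbol, an output symbol, and a weight, and adds one linear path per manuscript sentence. It does not reproduce the network of FIG. 2 or the transducer construction device 240. The class names, the shared start and final states, and the choice to emit the sentence on the last arc of its path are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

EPS = "<eps>"  # empty output symbol


@dataclass
class Arc:
    src: int
    dst: int
    ilabel: str          # input symbol: one recognized word
    olabel: str          # output symbol: a manuscript sentence, or EPS
    weight: float = 0.0  # transition weight


@dataclass
class Wfst:
    start: int = 0
    final: int = 1
    n_states: int = 2
    arcs: list = field(default_factory=list)

    def add_sentence(self, words, weight=0.0):
        """Add a linear path start -> ... -> final that reads `words` one by one
        and outputs the whole sentence on its last arc (illustrative choice)."""
        sentence = "".join(words)
        src = self.start
        for k, word in enumerate(words):
            is_last = k == len(words) - 1
            if is_last:
                dst = self.final
            else:
                dst = self.n_states
                self.n_states += 1
            self.arcs.append(
                Arc(src, dst, word, sentence if is_last else EPS, weight))
            src = dst


wfst = Wfst()
wfst.add_sentence(["西", "日本"])
wfst.add_sentence(["東", "日本", "の"])
for arc in wfst.arcs:
    print(arc)
```

The word-in, sentence-out labeling is what lets one machine accept input at word granularity while producing output at sentence granularity.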

Each time a recognized word is input from the speech recognition device 220, the speech recognition error correction device 100 uses the WFST to find the transitions that can accept the input word and calculates their scores. It is premised on a search with the conventionally known Viterbi algorithm (Viterbi search), performed while pruning with a threshold on the cumulative score. The Viterbi algorithm is a method that, when estimating the code sequence closest to the transmitted code for a received sequence, that is, the code sequence that maximizes the likelihood, searches for the maximum-likelihood code sequence efficiently using a trellis diagram.

In an ordinary Viterbi search, the best hypothesis is output by tracing back the best-scoring path only after all inputs have been observed. An ordinary search therefore cannot output correction results sequentially, oldest input first, before all inputs have been observed. Consider, for example, producing subtitles from the recognition result of the broadcast audio of a television program and superimposing them on the television picture in real time: the maximum-likelihood sequence of an ordinary Viterbi search cannot be confirmed until the words up to the end of the program have been input. By then the program would be over, so an ordinary Viterbi search is unsuitable for this kind of operation.

In contrast, the speech recognition error correction device 100 uses a Viterbi search but traces back while successively approximating the maximum-likelihood sequence. That is, each time a predetermined processing start condition is satisfied, it traces back the path with the best score at that point and decides which output transitions can be confirmed, so correction results can be output sequentially. The path traced back here is an approximation of the best hypothesis. To improve the accuracy of the approximation, the device decides whether to confirm the path by using, as a measure of reliability, the edit distance between the input word string corresponding to each output transition and the word string of the manuscript. Details are given later.

[Example of constructed WFST]
FIG. 2 shows an example of a WFST constructed by the transducer construction device 240. The WFST has nodes representing states and branches representing state transitions. A state transition may be referred to simply as a transition. In the present embodiment, a WFST is constructed whose input symbols are words and whose output symbols are predetermined word strings. In the following, the predetermined word string is assumed to be a sentence.

In this example, each elliptical node is labeled with a three-digit number for identification. The start node is node 001 and the end node is node 008. Nodes 002 to 007 are arranged in a line between the start node and the end node. In parallel, nodes 010 to 015 are likewise arranged in a line between the start node and the end node, as are nodes 018 to 023. Transitions (branches) are set between states (nodes); here, "between nodes" includes a transition from a node to itself. Each transition is labeled either with a word or with one of the symbols <S>, <I>, <D>, <EmiX> (where X is one of 1 to 3), and <eps>.

To describe all transitions in FIG. 2 in general terms first: in this WFST, each transition between states carries the parameters (S^i / S^o : ω). Here, S^i denotes the word input that the transition accepts, S^o denotes the predetermined word string (sentence) that the transition outputs, and ω denotes the transition weight (state transition weight). In other words, each transition has a triple of parameters. For reasons of space, however, FIG. 2 does not show the parameters on every transition; on the 18 transitions labeled with words, only one element of the triple, either S^i or S^o, is shown.

Here, the words shown in FIG. 2 are written generically as word s. Uppercase and lowercase letters are distinguished. In FIG. 2, word s denotes a word contained in the word string of a manuscript. Each transition labeled with word s can be taken only when the same word as word s is input. That is, if the word input at the position in the recognized word string that corresponds to the position of a word s in the manuscript word string is the same as that word s, the state transition can be made. In short, each transition labeled with word s is a transition that advances by accepting a speech-recognized word. The WFST constructed here is thus a network in which all manuscript sentences can be freely connected.

In FIG. 2, the parameters of a transition labeled with word s are written (s / ε : 0.0). Here, s denotes the word input that the transition can accept, and ε means that the transition produces no output. The value 0.0 is a transition weight and means that no penalty is imposed when the same word as word s is input for this transition. For example, the transition labeled "先月" ("last month") in FIG. 2 is, as a triple, (先月 / ε : 0.0).

In FIG. 2, a transition labeled <S> is a transition for accepting a substituted word. That is, when the word input at the position in the recognized word string corresponding to the position of a word s in the manuscript word string has been replaced by an arbitrary word different from word s, this transition accepts the substituted word. Hereinafter, an arbitrary word different from word s at the position of that word s in the manuscript word string is written as arbitrary word *. Substitution also covers, for example, the case where "再開" ("resumption") is recognized as its homophone "再会" ("reunion").

A transition labeled <S> in FIG. 2 can accept an arbitrary word *. Its parameters are written (* / ε : ω_s). Here, * denotes any word input that the transition can accept, and ε means that the transition produces no output. ω_s is a transition weight: the penalty imposed when an arbitrary word * different from word s is input for this transition (hereinafter, the substitution penalty). The substitution penalty ω_s is a value that lowers the node score; for example, -1.0 is used. In that case, a transition labeled <S> in FIG. 2 is, as a triple, (* / ε : -1.0).

In FIG. 2, a transition labeled <I> is a transition for accepting an inserted word. On the speaker side, when information is added to the spoken content or segments are repeated because of hesitation, this transition accepts a word inserted at the position following a word string recognized as matching the manuscript or as substituted. On the side of the speech recognition device 220, when a recognition error causes what should be recognized as one word according to the manuscript to be recognized as several words, this transition accepts the word inserted at the position following that one manuscript word.

A transition labeled <I> in FIG. 2 can accept an arbitrary word *. Its parameters are written (* / ε : ω_i). Here, * denotes any word input that the transition can accept, and ε means that the transition produces no output. ω_i is a transition weight: the penalty imposed when an arbitrary word * is input for this transition (hereinafter, the insertion penalty). The insertion penalty ω_i is a value that lowers the node score; for example, -1.0 is used. In that case, a transition labeled <I> in FIG. 2 is, as a triple, (* / ε : -1.0).

In FIG. 2, a transition labeled <D> is a transition for accepting a dropped word. On the speaker side, when a phrase or the like is omitted from part of the spoken content, this transition identifies the position in the recognized word string of the word dropped from the manuscript. On the side of the speech recognition device 220, when a recognition error causes what should be recognized as several words according to the manuscript to be recognized as one word with words deleted, this transition likewise identifies the position in the recognized word string of the word dropped from the manuscript.

A transition labeled <D> in FIG. 2 can be taken without any word input. Its parameters are written (ε / ε : ω_d). Here, the first ε means that no word is input on this transition, and the second ε means that the transition produces no output. ω_d is a transition weight: the penalty imposed when a word is dropped on this transition (hereinafter, the deletion penalty). The deletion penalty ω_d is a value that lowers the node score; for example, -1.0 is used. In that case, a transition labeled <D> in FIG. 2 is, as a triple, (ε / ε : -1.0).

In FIG. 2, a transition labeled <EmiX> is a transition for outputting a sentence L as the predetermined word string, that is, for outputting a correction result. Its parameters are written (ε / L : 0.0). Here, ε means that no word is input on this transition, and L is the word string (sentence) output by the transition. For example, the transition labeled <Emi1> in FIG. 2 is, as a triple, (ε / 先月の関東甲信地方は … : 0.0). In this case L is the word string obtained by concatenating, in order, all the words "先月の関東甲信地方は …" ("last month, the Kanto-Koshin region ...") on the transitions from the start node 001 through node 002 to node 007. The weight 0.0 means that no penalty is imposed when a sentence is output on this transition.

In FIG. 2, the transition labeled <eps> connects the end node to the start node and is called an epsilon transition (ε transition). It imposes the constraint that the predetermined word strings (sentences) included in the manuscript text set are spoken one after another. Its parameters are written (ε / ε : ω_u). The first ε means that no word is input on this transition, and the second ε means that the transition produces no output. ω_u is a transition weight; giving it an appropriate value lets the WFST assign a higher score to sentences that match over a longer span.
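The six kinds of transition described above can be summarized in a short Python sketch. It is an illustration only: the names Arc, EPS, ANY, and the helper functions are hypothetical, the default penalty of -1.0 is the example value given in the text, and drawing the insertion transition as a self-loop is an assumption, since the text says only that a transition may connect a node to itself.

```python
from dataclasses import dataclass
from typing import Optional

EPS = None  # stands for ε: no input word, or no output
ANY = "*"   # stands for the arbitrary word *


@dataclass(frozen=True)
class Arc:
    """One transition with the triple (S^i / S^o : ω)."""
    src: int
    dst: int
    inp: Optional[str]   # S^i: word accepted (EPS if none)
    out: Optional[str]   # S^o: word string output (EPS if none)
    weight: float        # ω: transition weight


def word_arc(src: int, dst: int, word: str) -> Arc:
    return Arc(src, dst, word, EPS, 0.0)        # (s / ε : 0.0)


def sub_arc(src: int, dst: int, w_s: float = -1.0) -> Arc:
    return Arc(src, dst, ANY, EPS, w_s)         # <S>: (* / ε : ω_s)


def ins_arc(node: int, w_i: float = -1.0) -> Arc:
    return Arc(node, node, ANY, EPS, w_i)       # <I>: (* / ε : ω_i), self-loop assumed


def del_arc(src: int, dst: int, w_d: float = -1.0) -> Arc:
    return Arc(src, dst, EPS, EPS, w_d)         # <D>: (ε / ε : ω_d)


def emit_arc(src: int, dst: int, sentence: str) -> Arc:
    return Arc(src, dst, EPS, sentence, 0.0)    # <EmiX>: (ε / L : 0.0)


def eps_arc(end: int, start: int, w_u: float) -> Arc:
    return Arc(end, start, EPS, EPS, w_u)       # <eps>: (ε / ε : ω_u)
```

The text gives no example value for ω_u, so eps_arc takes it as a required argument.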

[Method of constructing the WFST]
The method by which the transducer construction device 240 constructs the WFST is described below.
The unit of word string at which output transitions (transitions labeled <EmiX>) are placed in the WFST is decided in advance. It can be set according to the error correction capability required. Output transitions can be placed so that each unit of manuscript included in the manuscript text set 200 forms one segment; usable units include the sentence, the phrase, and the line breaks that the reporter placed to make the manuscript easier to read. A longer unit gives higher correction accuracy but delays confirmation of the subtitle word string to be sent out. Conversely, a shorter unit speeds up confirmation of the subtitle word string but lowers correction accuracy. The unit should therefore be chosen according to the expected recognition accuracy of the speech recognition and the degree to which the manuscript matches the speech read aloud.

In the present embodiment, output transitions in the WFST are placed, as an example, in units of sentences. Viewed another way, the WFST of FIG. 2 contains, for each sentence (word string of the predetermined unit), branches between the start node 001 and the end node 008 that represent the input transitions of the words making up the sentence (the transitions labeled with word s in FIG. 2) and a branch that represents the output transition (the transition labeled <EmiX> in FIG. 2).

Construction of the WFST begins at the start node. Each time one word of the manuscript text in the manuscript text set 200 is read, a weight-0 transition that accepts that word and a new node are created in sequence. A weight-0 transition is, as a triple, (s / ε : 0.0). When the predetermined unit described above is reached, an output transition is added and connected to the end node of the WFST. If manuscript text remains, construction starts again from the start node: a weight-0 transition and a new node are created for each word read, and when the predetermined unit is reached, an output transition is added and connected to the end node. The same steps are repeated thereafter.

Once all manuscript text has been read from the manuscript text set 200, the end node and the start node are connected by an ε transition, which as a triple is (ε / ε : ω_u). An appropriate value is given to the transition weight ω_u. This lets the WFST assign a higher score to sentences that match over a longer span, so that it operates properly even when the manuscript contains a sentence that matches a prefix of another sentence. Finally, transitions that accept substitution, deletion, and insertion are added to the transition of each word.
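The construction steps up to the ε transition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name build_word_network, the integer node numbering, and the tuple layout (src, dst, input, output, weight) are all hypothetical, and the sentence is taken as the unit for output transitions. The final step of adding substitution, deletion, and insertion transitions is left out here.

```python
def build_word_network(sentences, w_u):
    """Build the word chains, output transitions, and the ε loop.

    sentences: list of sentences, each a list of words (already segmented).
    w_u: weight of the ε transition from the end node back to the start node.
    Returns (start, end, arcs); each arc is (src, dst, input, output, weight),
    with None standing for ε.
    """
    start, end = 0, 1
    next_id = 2
    arcs = []
    for words in sentences:
        node = start                      # every sentence begins at the start node
        for word in words:
            # weight-0 transition that accepts the word, plus a new node
            arcs.append((node, next_id, word, None, 0.0))
            node = next_id
            next_id += 1
        # unit complete: output transition emits the sentence, joins the end node
        arcs.append((node, end, None, tuple(words), 0.0))
    # all text read: connect the end node to the start node with the ε transition
    arcs.append((end, start, None, None, w_u))
    return start, end, arcs
```

Storing the output as a tuple of words avoids deciding how words should be joined into display text, which depends on the language of the manuscript.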

[Configuration example of the transducer construction device]
In the example shown in FIG. 1, the transducer construction device 240 includes a word network registration unit 241 and an editing network registration unit 242.
The word network registration unit 241 performs the following series of processes for each predetermined unit (for example, each sentence) of the manuscript text in the manuscript text set 200. Each time it reads a word of a word string in the manuscript text, it creates, starting from the start node of the WFST network, a branch for the input transition that accepts the word and a new node, and continues in sequence until the word string read so far reaches the predetermined unit. It then adds a branch for the output transition of the word string read and connects it to the end node.

The editing network registration unit 242 adds three kinds of branch between the nodes of the WFST network created by the word network registration unit 241: a branch representing a state transition that accepts an arbitrary word, corresponding to a word substitution; a branch representing a state transition that accepts an arbitrary word, corresponding to a word insertion; and a branch representing a state transition that moves toward the output side without any input, corresponding to a word deletion.
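As a rough illustration of what the editing network registration unit 242 adds, the Python sketch below attaches the three kinds of branch to every word transition. The function name add_edit_arcs and the tuple layout (src, dst, input, output, weight) are hypothetical. Placing the insertion branch as a self-loop on the node after each word is an assumption made for this sketch, and the penalties default to the example value -1.0.

```python
def add_edit_arcs(arcs, w_s=-1.0, w_i=-1.0, w_d=-1.0):
    """Return arcs plus substitution, insertion, and deletion branches.

    arcs: list of (src, dst, input, output, weight); None stands for ε.
    Only word transitions (input is a word) receive edit branches.
    """
    edit = []
    for src, dst, word, _out, _weight in arcs:
        if word is None:
            continue                               # skip output and ε transitions
        edit.append((src, dst, "*", None, w_s))    # substitution: accept any word
        edit.append((dst, dst, "*", None, w_i))    # insertion: self-loop (assumed)
        edit.append((src, dst, None, None, w_d))   # deletion: advance without input
    return list(arcs) + edit
```

For a two-word sentence, this adds six branches to the two word transitions and leaves the output transition untouched.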

[Configuration example of the speech recognition error correction device]
In the example shown in FIG. 1, the transducer construction device 240 is provided separately from the speech recognition error correction device 100, but the speech recognition error correction device 100 may instead include the transducer construction device 240, as shown in FIG. 3. As shown in FIG. 3, the speech recognition error correction device 100 includes a WFST storage unit (corresponding manuscript set storage unit) 110, a node data update unit 120, a node data storage unit 130, a manuscript search unit 140, and a manuscript output unit 150.

The WFST storage unit (corresponding manuscript set storage unit) 110 stores a WFST (corresponding manuscript set) constructed in advance from the manuscript text set 200. This WFST is the one constructed by the transducer construction device 240. It is the same as the WFST described with reference to FIG. 2, so its description is not repeated here.

The node data update unit 120 calculates and updates, as node data, the scores of the states reachable on the WFST network at each time at which it accepts a word of the recognized word string output by the speech recognition device 220. For example, each time one recognized word is input, the node data update unit 120 refers to the WFST stored in the WFST storage unit 110, performs the Viterbi search incrementally, and updates the node data.

The node data update unit 120 adds 0 to the score when the word input as part of the recognized word string is the same as the word in the corresponding manuscript, and adds a penalty of -1 when it differs.
For example, in FIG. 2, when the input word string is exactly the same as the corresponding manuscript, the word "先月" ("last month") is accepted at the start node 001 and the process advances to node 002 through the transition for the manuscript word, so the node data update unit 120 adds 0 to the score. When "の" is then accepted and the process advances to node 003, 0 is again added. Likewise, as "関東甲信" ("Kanto-Koshin"), ... are accepted, 0 is added each time.

Conversely, suppose that in the example of FIG. 2 the input word string differs from the corresponding manuscript. When the word "先週" ("last week") is accepted at the start node 001, the manuscript word "先月" ("last month") has been substituted, so the process advances to node 002 through the substitution transition. In this case the node data update unit 120 adds a penalty of -1 to the score. It likewise adds a penalty of -1 when the path passes through a transition for an insertion error or a deletion error.

Thus the score of a path is best when the input recognized words are the same as the words s in the WFST, and it worsens when there are substitution, insertion, or deletion edits. For example, transitions labeled <D> can be taken without any input, but on a path that passes only through <D> transitions the score becomes lower the closer the path gets to the output transition. The WFST is built as a network in which the score worsens in proportion to the errors and paraphrases contained in the recognized word string.
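The score update described above can be illustrated with a simplified Python sketch that tracks only a single manuscript sentence rather than the whole WFST. The names init_scores and step and the parameter threshold are hypothetical. Position j means "j manuscript words consumed", insertions are allowed at every position as a simplification, and pruning drops any state whose cumulative score falls below an absolute threshold.

```python
NEG_INF = float("-inf")


def init_scores(ref, w_d=-1.0):
    """Scores before any input: position j is reachable by j deletions."""
    return [j * w_d for j in range(len(ref) + 1)]


def step(scores, ref, word, w_s=-1.0, w_i=-1.0, w_d=-1.0, threshold=None):
    """Update the score of every position after one recognized word."""
    n = len(ref)
    new = [NEG_INF] * (n + 1)
    for j in range(n + 1):
        best = scores[j] + w_i                       # insertion: stay at j
        if j > 0:
            # match adds 0, substitution adds the penalty w_s
            gain = 0.0 if word == ref[j - 1] else w_s
            best = max(best, scores[j - 1] + gain)
        new[j] = best
    for j in range(1, n + 1):                        # deletion: move without input
        new[j] = max(new[j], new[j - 1] + w_d)
    if threshold is not None:                        # prune low cumulative scores
        new = [s if s >= threshold else NEG_INF for s in new]
    return new
```

Feeding the manuscript words unchanged leaves the final position at score 0, while one substituted word leaves it at -1, matching the worked example in the text.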

The node data storage unit 130 stores, for each update time, the node data calculated by the node data update unit 120. It is a general storage device such as a memory or a hard disk.

The manuscript search unit 140 does not wait for the recognition results of all recognized word strings for all manuscripts, which would be needed to determine the final best hypothesis. Instead, each time a predetermined processing start condition is satisfied, it traces back over the WFST network based on the node data stored at that point and sequentially confirms, as error correction results, hypotheses that partially approximate the final best hypothesis.

The manuscript search unit 140 approximates the final best hypothesis based on the edit distance between the word string of a corresponding manuscript in the WFST (corresponding manuscript set) and the input recognized word string. For each path section obtained by dividing the path into predetermined ranges on the WFST network, if the edit distance over the section from its start to its end is sufficiently small, the manuscript search unit 140 treats the section as reliable, confirms it, and outputs it. A small edit distance means that the path has passed through a section where the recognized word string and the manuscript word string almost match. Conversely, a path section with a large edit distance has low reliability, so it is not confirmed at that point and is used again at the next traceback. A path section whose reliability stays low indefinitely is presumed to be a section in which something not written in the manuscript was spoken, and such a section is not output. In the following, the path section of a predetermined range is taken, as an example, to be a path section between two output transitions on the WFST network.

The processing start condition is satisfied when, for example, a silent period with no speech reaches a predetermined length, or the number of words input as the recognized word string output by the speech recognition device 220 reaches a predetermined number. The predetermined length is not particularly limited; 3 seconds is one example. The predetermined number of words is not particularly limited either; 20 words is one example. The activation signal may, for example, be output automatically by the speech recognition device 220, or may be entered manually by an operator who judges that a pause has occurred or that the predetermined number of words has been reached. This reduces the processing load compared with starting the search process at every input of a recognized word. In addition, during a silent period of the predetermined length, for example, the sequential reception of recognition results is suspended, so the node scores at that point can be compared easily.
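The start condition can be written as a one-line check. In this Python sketch the function name and parameter names are hypothetical, the defaults are the example values from the text (3 seconds, 20 words), and it is assumed that the caller resets the word counter after each search.

```python
def should_start_search(silence_seconds, words_since_last_search,
                        max_silence=3.0, max_words=20):
    """True when either example start condition from the text holds."""
    return (silence_seconds >= max_silence
            or words_since_last_search >= max_words)
```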

To realize these functions, the manuscript search unit 140 of the present embodiment includes, as an example and as shown in FIG. 3, a maximum score node detection unit 141, a traceback unit 142, a manuscript division unit 143, an output candidate storage unit 144, an edit distance calculation unit 145, an edit distance determination unit 146, a confirmed output storage unit 147, and a confirmation time storage unit 148.

When the predetermined processing start condition is satisfied, the maximum score node detection unit 141 detects the node with the maximum score in the node data stored at that point. For example, each time a silent period (pause) with no speech reaches the predetermined length, or the number of words input as recognition results reaches the predetermined number, an activation signal indicating this is input to the maximum score node detection unit 141.

The traceback unit 142 starts from the node detected by the maximum score node detection unit 141 and follows the path that reached that node through the WFST network from downstream to upstream. It traces back to the time corresponding to the last input word of the word sequence that was confirmed and output in the previous traceback.

FIG. 4 is a schematic diagram in which a path P1 has been added to the WFST of FIG. 2. Suppose that in FIG. 4 the node with the maximum score is node 020, and that the node corresponding to the last input word confirmed in the previous traceback was node 007. In this case the traceback unit 142 follows path P1 backward from the position marked with a star through node 020, node 019, and node 018 in that order; on reaching the start node 001, it returns to the end node 008. It then reaches node 015 via the output transition <Emi2> of the second tree. Next, in FIG. 5, the traceback unit 142 continues backward through node 015, node 014, and so on; on reaching the start node 001, it returns to the end node 008 as shown by path P2. It then reaches node 007 via the output transition <Emi1> of the first tree.
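The backward walk can be sketched in Python, assuming that the Viterbi search recorded a back pointer for each (time, node) pair. That table, its layout, and the name trace_back are assumptions made for illustration; the text does not specify how the node data encodes the path.

```python
def trace_back(backptr, best_node, t_now, t_confirmed):
    """Follow back pointers from the best node down to the confirmed time.

    backptr maps (time, node) -> (previous time, previous node, arc label).
    Transitions that consume no input keep the same time value.
    Returns the arc labels in forward (oldest-first) order.
    """
    labels = []
    t, node = t_now, best_node
    while t > t_confirmed:
        t_prev, prev_node, label = backptr[(t, node)]
        labels.append(label)
        t, node = t_prev, prev_node
    labels.reverse()            # collected newest-first, so reverse
    return labels
```

For example, with back pointers for nodes 18, 19, and 20 at times 1, 2, and 3, a call with t_confirmed = 0 returns all three labels, while t_confirmed = 1 returns only the last two.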

Returning to FIG. 3, the description of the manuscript search unit 140 continues.
The manuscript division unit 143 cuts out, from the path being traced back this time, the word string of the corresponding manuscript in the WFST (corresponding manuscript set) for each path section between two output transitions. In the example described with reference to FIGS. 4 and 5, the path section between the output transition <Emi1> and the output transition <Emi2> is separated out by the manuscript division unit 143.

The output candidate storage means 144 stores, as an output candidate, the output symbol (the cut-out manuscript) of the output transition corresponding to each path section separated by the manuscript dividing means 143. It is a general storage means such as a memory or a hard disk. In the example described with reference to FIGS. 4 and 5, "今週もまとまった雨は …" ("This week, too, heavy rain ...") is stored as an output candidate.

The edit distance calculation means 145 calculates, for each corresponding manuscript cut out by the manuscript dividing means 143, the edit distance from the input recognition word string. In this embodiment, the edit distance is defined as the number of edit operations (insertions, substitutions, and deletions) in the path section divided by the number of words in that path section. Let $e$ be the number of substitution, insertion, and deletion operations on the words of the recognition word string, and let $N_r$ be the number of words in the manuscript corresponding to the output transition. The edit distance is then the ratio $e / N_r$ of the number of edit operations to the number of manuscript words.

Specifically, in the example shown in FIG. 2, suppose that the path section on the WFST is "node 007 → node 008 → node 001 → node 010 → node 011 → node 012 → node 013 → node 014 → node 015". If this path section consists of six words and the word "今週" ("this week") was recognized as "今月" ("this month"), the edit distance is 1/6.
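The 1/6 figure can be reproduced with a short sketch. Note the assumption: the device counts edit transitions along the WFST path, whereas this sketch recomputes the same count with an ordinary word-level Levenshtein alignment. The tokens `w2` to `w6` are hypothetical placeholders for the manuscript words that the text does not spell out.

```python
from typing import Sequence


def edit_operations(reference: Sequence[str], recognized: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(recognized) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, rec_word in enumerate(recognized, start=1):
            cost = 0 if ref_word == rec_word else 1
            curr.append(min(prev[j] + 1,          # manuscript word missing from recognition
                            curr[j - 1] + 1,      # extra recognized word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(reference: Sequence[str], recognized: Sequence[str]) -> float:
    """Edit operations e divided by the number of manuscript words N_r."""
    return edit_operations(reference, recognized) / len(reference)


# Six-word path section in which only the first word was misrecognized.
reference = ["今週", "w2", "w3", "w4", "w5", "w6"]
recognized = ["今月", "w2", "w3", "w4", "w5", "w6"]
distance = normalized_edit_distance(reference, recognized)  # one substitution over six words
```

Dividing by the manuscript length rather than the recognized length follows the definition $e / N_r$ given above.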

The edit distance determination means 146 selects path sections one at a time, moving through the WFST network from downstream to upstream, and determines for each whether the calculated edit distance is at or below a predetermined threshold. When the edit distance $e / N_r$ is at or below the threshold $T$, that is, when $e / N_r \le T$, it confirms the output transition of that path section on the WFST network and confirms its output symbol as an error correction result. When $e / N_r$ is greater than $T$, the output symbol is not adopted: the output of an output transition whose path section has an edit distance above the threshold is put on hold, and it is rejected if an output transition is confirmed after that path section. Because the edit distance $e / N_r$ takes a value in the range 0 to 1 by definition, the threshold satisfies $0 < T < 1$.

The confirmed output storage means 147 stores, as an error correction result, the output symbol of the output transition of a path section whose edit distance the edit distance determination means 146 has judged to be at or below the predetermined threshold. It is a general storage means such as a memory or a hard disk. Its storage structure is a stack, which holds data in last-in, first-out order.

The confirmed time storage means 148 stores the confirmed time established by the current traceback process. It is a general storage means such as a memory or a hard disk. Once the edit distance determination means 146 has finished its determination for all path sections traced back this time (all cut-out manuscripts), the confirmed time storage means 148 stores, as the confirmed time, the time of the most recent confirmed word corresponding to the output symbols on the stack.

The manuscript output means 150 sequentially outputs the corresponding manuscripts confirmed as error correction results by the manuscript search means 140. Once the determination of the edit distances calculated for the path sections of all corresponding manuscripts cut out from the path traced back this time has finished, it outputs the output symbol data confirmed and pushed onto the stack up to that point, continuing until the stack is empty.

The corrected output of the speech recognition error correction device 100 covers two things: correcting an error and withholding an erroneous output. Suppose a person could inspect the correction results in advance. The device detects, in the course of its processing, any portion wrong enough that the person would feel "this does not work as a sentence" or "the meaning is different", and it does not output that portion. This behavior is also included in error correction in the broad sense.

[Operation of the speech recognition error correction device]
The flow of processing by the speech recognition error correction device 100 according to the embodiment of the present invention is described with reference to FIG. 6 (and FIG. 3 as appropriate).
(Premise 1) Let the word inputs of the recognition result be $\{\omega_0, \omega_1, \ldots, \omega_k, \ldots, \omega_j, \ldots\}$.
(Premise 2) Let $\omega_k$ be the last input word of the portion confirmed by the previous traceback, and let the output transition at that time be $a_P$ (the $P$-th output transition along the time axis).
(Premise 3) Consider the case in which, after the recognition result word $\omega_j$ has been input, sequential confirmation is triggered by a predetermined period of silence.
(Premise 4) The node data update means 120 calculates all nodes that can be reached by accepting $\omega_j$, the last word input before the silence.

Triggered by the predetermined period of silence, the maximum score node detection means 141 detects the node with the highest score in the node data stored at that time (step S1). The state represented by the detected node is the maximum likelihood state at the start of the traceback. The traceback means 142 then follows the word history on the WFST backward from the detected node along the path that reached it, tracing back to the confirmed time corresponding to $\omega_k$, the last input word of the word sequence confirmed and output in the previous traceback (that is, to the WFST transition whose accepted word is $\omega_k$) (step S2). The confirmed time used here is the one stored in the confirmed time storage means 148. Instead of tracing back to the transition whose word is $\omega_k$, the traceback may continue until the output transition $a_P$ is reached.

Next, the manuscript dividing means 143 divides the manuscript at each path section lying between two output transitions on the path traced back this time, and stores each piece as an output candidate in the output candidate storage means 144 (step S3). The manuscript may be divided each time an outputtable output transition $a_L$ (the $L$-th output transition along the time axis, where $L > P$) is passed while moving backward toward the output transition $a_P$, or each time such a transition is passed while moving forward from $a_P$. An outputtable output transition $a_L$ is one whose symbol becomes an output candidate; this includes output transitions that are later rejected by the edit distance determination means 146 and never output. The edit distance of such an output candidate is denoted $D$.

The edit distance calculation means 145 then calculates the edit distance $D$ of each output candidate (step S4). Specifically, consider the section corresponding to the output symbol of the output transition $a_L$, that is, the path section between $a_L$ and the immediately preceding output transition $a_{L-1}$ encountered when moving backward on the WFST from $a_L$. The number of edit operations in this section (the number of passes through <S>, <D>, and <I>) divided by the number of words in the section gives the edit distance $D_L$ at the output transition $a_L$. With $e_L$ the number of edit operations in the section and $N_L^r$ the number of words in the section, $D_L = e_L / N_L^r$.

The edit distance determination means 146 then selects an output candidate and determines whether its calculated edit distance $D$ is at or below the threshold $T$ (step S5). If so (step S5: Yes), it confirms the output transition of that path section on the WFST and confirms its output symbol as an error correction result (step S6). It then pushes the data of the newly confirmed output symbol onto the stack held in the confirmed output storage means 147 (step S7) and proceeds to step S8.

If output candidates further toward the front of the path remain to be selected (step S8: No), the edit distance determination means 146 returns to step S5. Once all output candidates have been selected (step S8: Yes), that is, once the determination of the edit distances calculated for the path sections of all cut-out manuscripts has finished, the manuscript output means 150 outputs the output symbol data on the stack one item at a time until the stack is empty (step S9). As a result, the manuscripts are output in order starting from the one located furthest toward the front.

When all output candidates have been selected (step S8: Yes), the edit distance determination means 146 stores in the confirmed time storage means 148, as the confirmed time established by the current traceback process, the most recent of the times of the confirmed words corresponding to the output symbols on the stack.

If the edit distance $D$ is greater than the threshold $T$ in step S5 (step S5: No), the process proceeds to step S8 without pushing any data onto the stack.

In short, the manuscript output means 150 sequentially outputs, as confirmed manuscripts, the data pushed onto the stack in each traceback process. When the edit distance $D$ of a path section in the speech recognition result is greater than the threshold $T$, the path is regarded as unreliable, so the output symbol of the output transition of that path section is neither adopted as an error correction result nor output.
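Steps S5 to S9 above can be summarized in a short sketch. It assumes that the path sections have already been cut out and scored, and that they arrive in traceback order (newest first). The `Segment` type and its field names are hypothetical, and the hold-then-reject handling of above-threshold sections is simplified to skipping them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    output_symbol: str  # manuscript text attached to the output transition
    edit_ops: int       # e_L: edit operations counted in the path section
    num_words: int      # N_L^r: number of words in the path section
    end_time: int       # time of the last input word in the path section


def confirm_and_output(segments_newest_first: List[Segment],
                       threshold: float) -> Tuple[List[str], Optional[int]]:
    stack: List[Segment] = []
    for seg in segments_newest_first:                   # loop of steps S5 to S8
        if seg.edit_ops / seg.num_words <= threshold:   # step S5
            stack.append(seg)                           # steps S6 and S7
    # Newest confirmed word time among the stacked segments, or None if nothing was confirmed.
    confirmed_time = max((s.end_time for s in stack), default=None)
    outputs: List[str] = []
    while stack:                                        # step S9: pop until the stack is empty
        outputs.append(stack.pop().output_symbol)
    return outputs, confirmed_time


# Illustrative values only: the newest section is too unreliable to output.
example_segments = [
    Segment("sentence C", edit_ops=4, num_words=6, end_time=18),
    Segment("sentence B", edit_ops=1, num_words=6, end_time=12),
    Segment("sentence A", edit_ops=0, num_words=5, end_time=6),
]
example_result = confirm_and_output(example_segments, threshold=0.5)
```

Because sections are pushed newest first, popping the stack restores the original reading order, which is why a last-in, first-out structure suits this step.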

[How to determine the edit distance threshold T]
If the recognition accuracy of speech recognition is about 90%, the edit distance value may also be about 90%. The threshold T used for the determination is preferably set well below the recognition accuracy of speech recognition, for example lower by a margin equal to the reliability of the word match rate. The reliability of the word match rate depends on the number of words between two output transitions of the WFST network.

As another factor, the threshold T is preferably chosen with regard to how much the candidate manuscripts in the manuscript text set 200 overlap as sentences. For example, the sentences (E1) to (E3) below overlap by about 80%.
(E1) 今日の天気は晴れです (Today's weather is sunny)
(E2) 今日の天気は雨です (Today's weather is rainy)
(E3) 今日の天気は曇りです (Today's weather is cloudy)
In such a case, the desired behavior cannot be achieved if the edit distance threshold is also set to about 80%. In an experiment with an output transition placed for each sentence of the news manuscript and the threshold T set to 50%, the device was confirmed to operate without problems.
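The overlap among (E1) to (E3) can be checked with a small sketch. The word segmentation below is an assumption (a morphological analyzer might split the sentences differently), and the position-by-position comparison only works here because the three sentences have the same length.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed segmentation of the three example sentences into six words each.
candidates: Dict[str, List[str]] = {
    "E1": ["今日", "の", "天気", "は", "晴れ", "です"],
    "E2": ["今日", "の", "天気", "は", "雨", "です"],
    "E3": ["今日", "の", "天気", "は", "曇り", "です"],
}


def overlap_ratio(a: List[str], b: List[str]) -> float:
    """Fraction of word positions at which two sentences agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


# Pairwise overlap between every two candidate sentences.
overlaps: Dict[Tuple[str, str], float] = {
    (p, q): overlap_ratio(candidates[p], candidates[q])
    for p, q in combinations(candidates, 2)
}
```

Under this segmentation each pair agrees on five of six words, in line with the "about 80%" stated above; the candidates differ in a single word, so a loose threshold cannot tell them apart.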

[WFST options]
<Option 1: Constructing a WFST that accepts paraphrases>
A manuscript serving as the information source of the WFST may contain phrases that are skipped, paraphrased, or supplemented when it is read aloud. Some of these are formulaic and occur frequently. In a news program manuscript, for example, a phrase such as "警視庁によりますと、" ("According to the Metropolitan Police Department,"), which indicates the source of the report, is a formulaic phrase that is easily skipped. Even if it is skipped, the meaning of the main news sentence (5W1H) is unchanged, so there is no practical problem.

In Option 1, such formulaic expressions are added to the WFST in advance so that correction results can be output accurately. WFSTs are widely used in speech recognition decoders, machine translation, and the like, and various algorithms for operating on them are known. For example, algorithms for composition (see FIG. 7(a)), minimization (see FIG. 7(b)), and determinization (see FIG. 7(c)) can be applied, which makes it possible to build an efficient state transition machine. The expressions can be added efficiently by constructing a separate WFST for the additional expressions and composing it with the WFST constructed from the manuscript.

For example, paraphrases are prepared by comparing past manuscripts of programs of the same type with the word strings actually read aloud and selecting, from the differences, those paraphrases that occur frequently and do not change the meaning of the sentence. A WFST for composing each selected paraphrase is constructed in advance, and applying the composition operation to it and the WFST constructed from the manuscript yields a WFST that can handle the paraphrases. WFST composition is described with reference to FIG. 7(a).

In FIG. 7(a), nodes are drawn as circles. The upper left of FIG. 7(a) is a schematic diagram of an example WFST constructed from a manuscript, and the lower left is a schematic diagram of an example WFST to be added. The right side of FIG. 7(a) is a schematic diagram of the WFST obtained by composing the two.
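The effect of Option 1 on the set of accepted word strings can be illustrated without a transducer library. The sketch below is a stand-in, not WFST composition: it simply enumerates the word strings that would be accepted when one formulaic phrase may be skipped. The function name, the word segmentation, and the placeholder words `w1` and `w2` are hypothetical.

```python
from typing import List, Sequence


def variants_with_optional_phrase(words: Sequence[str],
                                  phrase: Sequence[str]) -> List[List[str]]:
    """Return the original word list plus a copy with the first occurrence of `phrase` removed."""
    variants = [list(words)]
    n = len(phrase)
    for i in range(len(words) - n + 1):
        if list(words[i:i + n]) == list(phrase):
            # Skipping the phrase yields a second accepted reading of the same manuscript.
            variants.append(list(words[:i]) + list(words[i + n:]))
            break
    return variants


# Assumed segmentation of the formulaic phrase, followed by placeholder manuscript words.
skippable = ["警視庁", "に", "よりますと"]
manuscript = skippable + ["w1", "w2"]
accepted = variants_with_optional_phrase(manuscript, skippable)
```

Both variants would map to the same manuscript output, so skipping the phrase does not count as an edit operation against the speaker.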

<Option 2: Option A when creating the WFST>
When the WFST is created, it may be minimized if necessary. WFST minimization is described with reference to FIG. 7(b). The left side of FIG. 7(b) is a schematic diagram of an example WFST constructed from a manuscript by the usual method, where a1 to a6 denote different words.

The right side of FIG. 7(b) is a schematic diagram of the WFST after the WFST constructed from the manuscript by the usual method has been minimized. In the minimized WFST, the nodes (states) for the prefix (words a1, a2) common to the three word strings of the original WFST are merged, taking word order (word position) into account, so that the minimum number of branches (transitions) is used.
With WFST minimization, word strings (sentences) that have the same prefix share the same transitions, which reduces the amount of computation.
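The saving from prefix sharing can be illustrated with a plain prefix tree. This is a sketch of the sharing idea only, not a WFST minimization algorithm, and the three word strings are hypothetical because FIG. 7(b) is not reproduced here.

```python
from typing import Dict, List

# Hypothetical word strings built from the word labels a1 to a6.
word_strings: List[List[str]] = [
    ["a1", "a2", "a4"],
    ["a1", "a2", "a5"],
    ["a1", "a3", "a6"],
]


def count_transitions_separately(strings: List[List[str]]) -> int:
    """Transitions needed when every word string gets its own chain of nodes."""
    return sum(len(words) for words in strings)


def build_prefix_tree(strings: List[List[str]]) -> Dict:
    """Nested dicts in which word strings with a common prefix share nodes."""
    root: Dict = {}
    for words in strings:
        node = root
        for word in words:
            node = node.setdefault(word, {})
    return root


def count_transitions(tree: Dict) -> int:
    """Count the edges of the prefix tree."""
    return sum(1 + count_transitions(child) for child in tree.values())


separate = count_transitions_separately(word_strings)      # one chain per string
shared = count_transitions(build_prefix_tree(word_strings))  # shared prefixes
```

With these strings, nine separate transitions shrink to six once common prefixes are shared, so fewer transitions have to be scored for each input word.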

<Option 3: Option B when creating the WFST>
The WFST may also be determinized when it is created, if necessary. WFST determinization is described with reference to FIG. 7(c). The left side of FIG. 7(c) is a schematic diagram of a WFST with the same shape as the one on the right side of FIG. 7(b). However, the transition labeled with word a4 in FIG. 7(b) is labeled with output sentence o1 instead. Likewise, the transition that was labeled with word a5 carries output sentence o2, and the one that was labeled with word a6 carries output sentence o3.

The right side of FIG. 7(c) is a schematic diagram of the WFST after the original WFST has been determinized. It differs from the original in that output sentence o3 is attached to the transition one position earlier (one to the left).
The original WFST branches at the transition from the second node from the left to the next node. As soon as the transition from this second node to the lower node in the figure is taken, it is certain that the output sentence will be o3 rather than o1 or o2. The determinized WFST therefore moves the output sentence to that position so that the estimation result can be output as early as possible.

WFST determinization has advantages such as moving each output sentence to the transition at which its prefix becomes unique, so that the output sentence can be confirmed early. If the WFST is determinized when it is created, however, the settings must be changed so that the maximum likelihood hypothesis search by the manuscript search means 140 can handle it. Specifically, compared with the non-determinized case, the path section used to calculate the edit distance must be shifted to before or after the output transition. In addition, the threshold T must be set to a stricter (smaller) value to absorb the resulting lengthening or shortening of the neighboring path sections.
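Where an output sentence can be moved to is determined by the point at which its word prefix becomes unique. The sketch below computes that position for each sentence. It assumes hypothetical word strings consistent with the branching described for FIG. 7(c), and it is not a transducer determinization algorithm.

```python
from typing import Dict, List

# Hypothetical word strings keyed by their output sentences o1 to o3.
sentences: Dict[str, List[str]] = {
    "o1": ["a1", "a2", "a4"],
    "o2": ["a1", "a2", "a5"],
    "o3": ["a1", "a3", "a6"],
}


def earliest_output_positions(strings: Dict[str, List[str]]) -> Dict[str, int]:
    """For each output, the number of words after which its prefix is unique."""
    positions: Dict[str, int] = {}
    for name, words in strings.items():
        others = [w for other, w in strings.items() if other != name]
        for k in range(1, len(words) + 1):
            prefix = words[:k]
            if all(other[:k] != prefix for other in others):
                positions[name] = k
                break
    return positions


positions = earliest_output_positions(sentences)
```

In this example o3 is decided after two words while o1 and o2 need all three, matching the description that o3 moves one transition earlier.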

[Other options]
The present invention can also be applied to generating multilingual subtitles. For example, in the WFST shown in FIG. 2, if the output symbol of the output transition labeled <Emi2> that follows node 015 is set to an English sentence corresponding to the Japanese sentence spanning nodes 010 to 015, English subtitles can be generated from Japanese speech input. If Japanese and English subtitles must be generated at the same time, the output symbol can hold "今週もまとまった雨は …" together with its English translation.

As described above, the speech recognition error correction device 100 according to this embodiment operates under the constraint that the sentences in the manuscripts are uttered consecutively in an arbitrary order. Without fixing sentence boundaries, it aligns the recognition result with the manuscripts word by word, which eliminates the automatic correction errors of the conventional block matching method. On the other hand, to obtain more accurate corrected output, it is preferable for the output to come in units of sentences or comparable units. To reconcile these two conflicting requirements, the speech recognition error correction device 100 finds the correspondence between the recognition result and the manuscripts using a weighted finite state transducer (WFST).

The speech recognition error correction device 100 also determines where the recognized word string best matches the manuscript word strings over a range longer than the word chain block of length N (N words) used in the conventional block matching method (the technique of Patent Document 1). Whereas the conventional method matches only over a section corresponding to a word chain block, the device matches over a much longer span of text, going back through the manuscript sentences. The best place to match is therefore clearer than before, and automatic correction errors can be reduced.

The speech recognition error correction device according to the present invention has been described above on the basis of an embodiment, but the present invention is not limited to it. For example, the embodiment uses the edit distance to determine whether the estimated corresponding manuscript for the recognized words of the utterance is reliable. In addition to the edit distance, the match rate, match accuracy, deletion rate, or insertion rate between the manuscript and the recognition result may be used, alone or in combination.

In addition, punctuation such as "、" and "。" and other symbols, which cannot be obtained from speech recognition results, can be embedded in the output symbols of the output transitions labeled <EmiX> in the WFST shown in FIG. 2, following the notation of the manuscript. This makes it possible to generate subtitles that are easier to read.

In the present invention, adding subtitles is not essential. Moreover, as long as the content to be spoken is determined to some extent and can be obtained in advance, the speech to be recognized need not be the audio of a broadcast program.

The following experiment was conducted to verify the performance of the speech recognition error correction device 100 of the present invention.
The speech recognition targets were 420 sentences, excluding the weather segments, taken from nine broadcasts of a news program (首都圏ニュース845). With the speech recognition results of these sentences as input, the correction results for the 420 sentences were checked and analyzed by human inspection and classified into four categories: "appropriate or equivalent in meaning", "inappropriate", "no output", and "difficult to judge". "Inappropriate" covers outputs that do not work as sentences or whose meaning differs. "Difficult to judge" covers paraphrases spanning multiple sentences and cases where it could not be decided whether the meaning was equivalent. The analysis results are shown in Table 1.

As the analysis results in Table 1 show, more sentences were classified as appropriate or equivalent in meaning in the example than in the comparative example. In the comparative example, 18% of the outputs were inappropriate sentences, whereas the example produced no inappropriate outputs. The example has more "no output" cases because, when the edit distance D of a path section in the speech recognition result is greater than the threshold T, the example treats the path as unreliable and does not output the output symbol of the output transition of that path section. In fact, all 43 sentences that were not output, such as interview VTR audio, had no corresponding manuscript to begin with.

Therefore, the example was confirmed to be suitable at least under condition (A6), the premise described in the embodiment for adding subtitles by speech recognition to news programs originating from local broadcasting stations: namely, where the overriding requirement is to avoid sending out subtitles that recognition errors of the speech recognition device 220 have rendered unintelligible, which would mislead or annoy viewers.

100 Speech recognition error correction device
110 WFST storage means (corresponding manuscript set storage means)
120 Node data update means
130 Node data storage means
140 Manuscript search means
141 Maximum score node detection means
142 Traceback means
143 Manuscript dividing means
144 Output candidate storage means
145 Edit distance calculation means
146 Edit distance determination means
147 Confirmed output storage means
148 Confirmed time storage means
150 Manuscript output means
200 Manuscript text set
220 Speech recognition device
240 Transducer construction device
241 Word network registration means
242 Edit network registration means

Claims

1. A speech recognition error correction device that receives, as input, a recognition word string output by a speech recognition device that recognizes speech uttered by reading aloud a manuscript included in a manuscript text set, estimates the word string of the corresponding manuscript from a corresponding manuscript set stored in advance, and thereby corrects errors contained in the recognition word string, the device comprising:
corresponding manuscript set storage means for storing the corresponding manuscript set, which is constructed by reading the manuscript text set in advance and is represented by a weighted finite state transducer having, as a network, nodes representing states and branches representing state transitions between the nodes;
node data updating means for calculating and updating, as node data, the scores of the states to which a transition is possible on the network of the weighted finite state transducer each time input of a word of the recognition word string is received;
node data storage means for storing the calculated node data for each update time;
manuscript search means for, each time a predetermined processing start condition is satisfied and without waiting for input of the recognition results of the recognition word strings for all manuscripts needed to determine the final best hypothesis, sequentially confirming as an error correction result a hypothesis that partially approximates the final best hypothesis, by tracing back on the network based on the node data stored at that time; and
manuscript output means for sequentially outputting the corresponding manuscripts confirmed as the error correction results.

2. The speech recognition error correction device according to claim 1, wherein the weighted finite state transducer constructed in advance as the corresponding manuscript set stored in the corresponding manuscript set storage means includes, as the network:
for each corresponding manuscript included in the corresponding manuscript set, between a start node and an end node of that corresponding manuscript, branches representing input transitions for the respective words constituting the word string of the corresponding manuscript and a branch representing an output transition of the word string;
a branch representing a state transition from the end node to the start node; and
at least one of a branch representing a state transition that accepts an arbitrary word, corresponding to substitution of a word, a branch representing a state transition that accepts an arbitrary word, corresponding to insertion of a word, and a branch representing a state transition that, corresponding to deletion of a word, proceeds toward the output side with no input.

3. The speech recognition error correction device according to claim 2, wherein the manuscript search means:
calculates, as an edit distance between the word string of a corresponding manuscript included in the corresponding manuscript set and the input recognition word string, the value obtained by dividing the number of insertion, substitution, and deletion editing operations applied to the word string of the corresponding manuscript in a path section of a predetermined range on the network of the weighted finite state transducer by the number of words in that path section; and
approximates the final best hypothesis by comparing the edit distance calculated for each path section with a predetermined threshold.

4. The speech recognition error correction device according to claim 3, wherein the manuscript search means,
when it has confirmed the word string of a corresponding manuscript whose edit distance is less than or equal to the threshold, traces back through the corresponding manuscripts whose output was sequentially confirmed before the path section of that corresponding manuscript on the network of the weighted finite state transducer, causes the manuscript output means to output, in order from the upstream side of the network, the manuscripts corresponding to all path sections whose edit distance is less than or equal to the threshold, and does not output the manuscripts corresponding to any path section whose edit distance is greater than the threshold.

5. The speech recognition error correction device according to claim 1, wherein the weighted finite state transducer constructed in advance as the corresponding manuscript set stored in the corresponding manuscript set storage means further includes, as the network,
a branch that accepts a word string predetermined as a paraphrase candidate having the same meaning as a word string included in the manuscript text set, and/or a branch that accepts a word string that is included in the manuscript text set and is predetermined as one that may be dropped from the recognition word string output by the speech recognition device.

6. The speech recognition error correction device according to claim 1, wherein the manuscript search means deems the processing start condition to be satisfied, and performs traceback on the network of the weighted finite state transducer, when a silent period containing no uttered speech reaches a predetermined length or when the number of words input as the recognition word string output by the speech recognition device reaches a predetermined number.