JP2018004947A

JP2018004947A - Text correction device, text correction method, and program

Info

Publication number: JP2018004947A
Application number: JP2016131807A
Authority: JP
Inventors: Takashi Nakamura (中村 孝); Hirokazu Masataki (政瀧 浩和)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-07-01
Filing date: 2016-07-01
Publication date: 2018-01-11
Anticipated expiration: 2036-07-01
Also published as: JP6552999B2

Abstract

PROBLEM TO BE SOLVED: To efficiently generate transcription text that includes unnecessary words. SOLUTION: A morphological analysis unit 1 performs morphological analysis on transcription text transcribed from uttered speech. A grammar generation unit 2 inserts unnecessary words at each morpheme boundary of the morphological analysis result and generates a grammar model. A speech recognition unit 3 generates a plurality of speech recognition result candidates by recognizing the speech data of the uttered speech using the grammar model. A recognition result selection unit 4 calculates, for each speech recognition result candidate, the similarity between an unnecessary word that is a syllable string included in the candidate and the word that follows the unnecessary word, and selects a speech recognition result for the uttered speech from the candidates on the basis of the calculated similarity. SELECTED DRAWING: Figure 1

Description

The present invention relates to speech recognition technology, and more particularly to a technique for correcting transcription text used for training an acoustic model.

In general, speech recognition is performed using three types of models: an acoustic model that models the acoustic features of speech, a language model that models how readily words connect to one another, and a pronunciation dictionary that associates words with phoneme strings.

Speech is articulated mainly by the shape of the oral cavity and tongue, the position of the tongue, and the movement of the lips; because this involves physical movement, transient states always occur. Broadly speaking, the acoustic features of a phoneme can therefore be expected to change with the surrounding phoneme environment, and acoustic models are often built per N-gram that takes into account the phoneme in question, the phoneme sequence preceding it, and the phoneme sequence following it. Accordingly, an acoustic model is generally trained using, as teacher data, speech or its feature values to which a correspondence with phonemes has been assigned. Because assigning this correspondence by hand is very costly, however, it is often estimated without human intervention by an automated method such as the Viterbi algorithm, based on the speech or its feature values and a phoneme string that accurately represents the utterance content, and the estimate is then used for training.

The phoneme string for acoustic model training described above is generally produced by manually transcribing the utterance content of the training speech as text (kana-kanji text in the case of Japanese), assigning readings with a morphological analyzer, and generating the phoneme string from the readings using a pronunciation dictionary.
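The final step of this pipeline, generating a phoneme string from a reading, can be sketched as follows. This is only an illustration under stated assumptions: the table `KANA_TO_PHONEMES`, its romanized phoneme symbols, and the function `reading_to_phonemes` are hypothetical, the table covers only the kana needed for the example, and it does not reflect the format of any particular pronunciation dictionary.

```python
# Hypothetical kana-to-phoneme table (assumption: each katakana character
# maps to a fixed phoneme sequence; a real dictionary is far larger).
KANA_TO_PHONEMES = {
    "ア": ["a"], "イ": ["i"], "カ": ["k", "a"],
    "ス": ["s", "u"], "セ": ["s", "e"], "ワ": ["w", "a"],
}


def reading_to_phonemes(reading):
    """Convert a katakana reading into a flat list of phoneme symbols."""
    phonemes = []
    for kana in reading:
        if kana not in KANA_TO_PHONEMES:
            raise KeyError(f"no dictionary entry for {kana!r}")
        phonemes.extend(KANA_TO_PHONEMES[kana])
    return phonemes


# Reading of 快晴 (カイセイ) taken from the example used later in this document.
kaisei_phonemes = reading_to_phonemes("カイセイ")
print(kaisei_phonemes)
```

Raising an error for unknown kana, rather than skipping them, keeps a gap in the dictionary from silently producing a phoneme string that no longer matches the utterance.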

To train an acoustic model more accurately, the transcription text must be created accurately. However, spoken language frequently exhibits phenomena peculiar to speech that are not directly related to the utterance content, such as fillers, hesitations, and restatements (hereinafter referred to as unnecessary words). It is therefore desirable that transcription text for acoustic model training accurately record these unnecessary words as well. Because people pay little attention to unnecessary words in daily life, however, transcribing them accurately requires skill and also increases the time needed for the transcription itself.

Various techniques have been developed for restoring (inserting) unnecessary words into incomplete transcription text that lacks them. For example, Non-Patent Document 1 focuses on fillers among the unnecessary words, defines the task as a sequence labeling problem of labeling whether a filler follows each morpheme of the incomplete transcription text, and realizes filler insertion using conditional random fields (CRF). As another example, Non-Patent Document 2 uses a statistical style conversion model to convert meeting minutes written in written language into spoken language, generates from the converted minutes a strongly constrained language model for each fine-grained unit of the meeting (for example, an utterance of about 10 seconds to 3 minutes, taking each speaker change as a turn), and performs speech recognition on the actual speech with that language model, thereby generating utterance text that includes unnecessary words.

Non-Patent Document 1: Kengo Ota, Masatoshi Tsuchiya, Seiichi Nakagawa, "Construction of a spoken language model based on a filler prediction model", Transactions of Information Processing Society of Japan, Vol. 50, No. 2, pp. 477-487, 2009
Non-Patent Document 2: Masato Mimura, Yuya Akita, Tatsuya Kawahara, "Semi-supervised learning of acoustic models using statistical language model transformation", IEICE Journal, Vol. J94-D, No. 2, pp. 460-468, 2011

However, although Non-Patent Document 1 estimates filler insertion positions and filler types more accurately than earlier techniques, fillers occur probabilistically in the first place, so estimating them statistically from text alone is difficult. In addition, it does not restore unnecessary words other than fillers.

Non-Patent Document 2 converts text into spoken language using a statistical spoken-language conversion model, but because fillers have a high probability of occurrence among unnecessary words, it is unclear whether unnecessary words other than fillers can be modeled in a statistically sound way. In fact, the only unnecessary words restored in Non-Patent Document 2 are fillers. Furthermore, a turn delimited by speaker changes is longer than the usual unit of speech recognition (a sentence), so it is unclear whether the language model constraints are applied correctly.

In view of the above, an object of the present invention is to efficiently generate transcription text that includes unnecessary words.

To solve the above problem, the text correction device of the present invention includes: a grammar generation unit that generates a grammar model by inserting unnecessary words at each morpheme boundary of the morphological analysis result of transcription text transcribed from uttered speech; a speech recognition unit that generates a plurality of speech recognition result candidates by recognizing the speech data of the uttered speech using the grammar model; and a recognition result selection unit that calculates, for each speech recognition result candidate, the similarity between an unnecessary word that is a syllable string included in the candidate and the word following that unnecessary word, and selects the speech recognition result of the uttered speech from the candidates based on the similarity.

According to the present invention, arbitrary unnecessary words can be restored, in accordance with the actual utterance, from transcription text that does not include unnecessary words. Transcription text that includes unnecessary words can therefore be generated efficiently.

FIG. 1 is a diagram illustrating the functional configuration of the text correction device.
FIG. 2 is a diagram illustrating the processing procedure of the text correction method.
FIG. 3 is a diagram for explaining the processing of the grammar generation unit.
FIG. 4 is a diagram for explaining the processing of the grammar generation unit.
FIG. 5 is a diagram for explaining the processing of the grammar generation unit.
FIG. 6 is a diagram for explaining the processing of the recognition result selection unit.
FIG. 7 is a diagram for explaining the processing of the recognition result selection unit.
FIG. 8 is a diagram for explaining the processing of the recognition result selection unit.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and redundant description is omitted.

As shown in FIG. 1, the text correction device of the embodiment includes a morphological analysis unit 1, a grammar generation unit 2, a speech recognition unit 3, a recognition result selection unit 4, and a pronunciation dictionary storage unit 5. The text correction method of the embodiment is realized by this text correction device performing the processing of each step described later.

The text correction device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The text correction device executes each process under the control of the central processing unit, for example. Data input to the text correction device and data obtained in each process are stored, for example, in the main storage device, and the data stored there is read out as needed and used for other processing. At least part of each processing unit of the text correction device may be implemented by hardware such as an integrated circuit. Each storage unit of the text correction device can be implemented by, for example, a main storage device such as RAM, an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store.

The processing procedure of the text correction method of the embodiment is described with reference to FIG. 2.

In step S1, the morphological analysis unit 1 receives as input the transcription text transcribed from the uttered speech and performs morphological analysis on it. The morphological analysis result is sent to the grammar generation unit 2. Any general morphological analyzer may be used, provided that it can output at least the surface form and the reading. If the morphological analysis result includes part-of-speech information, morphemes of a specific part of speech may be combined with morphemes of another part of speech to reduce the number of morphemes. The fewer morphemes the analysis result contains, the smaller the overall processing load.

For example, if the input transcription text is "明日は快晴ですよねー" ("Tomorrow will be clear, won't it"), the morphological analysis result is as follows (surface form; reading; part of speech).
明日; アス; noun
は; ワ; case particle
快晴; カイセイ; noun
ですよねー; デスヨネー; sentence-final particle

In this morphological analysis result, the number of morphemes can be reduced by, for example, combining the case-particle morpheme with the noun morpheme immediately before it. The result is shown below.
明日は; アスワ; noun
快晴; カイセイ; noun
ですよねー; デスヨネー; sentence-final particle
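The merging of a case particle into the preceding noun can be sketched as follows. This is a minimal illustration, assuming the morphological analysis result is available as a list of (surface form, reading, part of speech) records; the `Morpheme` class, the part-of-speech labels, and `merge_case_particles` are hypothetical names, not the output format of any specific analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str  # written form
    reading: str  # katakana reading
    pos: str      # part of speech (labels here are illustrative)


def merge_case_particles(morphemes):
    """Attach each case particle to the noun immediately before it."""
    merged = []
    for morpheme in morphemes:
        if morpheme.pos == "case particle" and merged and merged[-1].pos == "noun":
            noun = merged.pop()
            merged.append(Morpheme(noun.surface + morpheme.surface,
                                   noun.reading + morpheme.reading,
                                   noun.pos))
        else:
            merged.append(morpheme)
    return merged


analysis = [
    Morpheme("明日", "アス", "noun"),
    Morpheme("は", "ワ", "case particle"),
    Morpheme("快晴", "カイセイ", "noun"),
    Morpheme("ですよねー", "デスヨネー", "sentence-final particle"),
]
merged = merge_case_particles(analysis)
for m in merged:
    print(f"{m.surface}; {m.reading}; {m.pos}")
```

The merged morpheme keeps the part of speech of the noun, which matches the example above; other merge rules would need their own choice of label.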

In step S2, the grammar generation unit 2 receives as input the morphological analysis result output by the morphological analysis unit 1, reads the pronunciation dictionary stored in the pronunciation dictionary storage unit 5, and generates a grammar model by inserting unnecessary words at each morpheme boundary of the morphological analysis result. The grammar model is sent to the speech recognition unit 3.

The grammar model is generated as follows. First, as shown in FIG. 3, a grammar that accepts the transcription, such as a finite-state grammar, is generated with reference to the surface forms in the morphological analysis result. The example of FIG. 3 expresses the grammar generated from the transcription text "明日は快晴ですよねー" as a weighted finite-state transducer (WFST). Next, as shown in FIG. 4, the grammar is updated so that it also accepts a filler, a syllable, or silence (pause) inserted at each morpheme boundary of the morphological analysis result. The example of FIG. 4 inserts, into the grammar illustrated in FIG. 3, fillers ("えー", "あー", and so on) and syllables ("あ", "て", "ふ", and so on) between "明日は" and "快晴". In the example of FIG. 4, the weights α, β, and γ given to each unnecessary word are constants. Further, as shown in FIG. 5, the grammar is updated so that consecutive fillers, syllables, and pauses are also accepted. The example of FIG. 5 is a grammar that accepts a sequence of two fillers and a sequence of two syllables between "明日は" and "快晴". Finally, the pronunciation dictionary is used to convert the reading of each morpheme into phonemes, and the grammar is updated. The generated grammar model may take any form as long as the subsequent speech recognition unit 3 can handle it. In the above description the grammar is generated first and then updated, but the final grammar may instead be generated in a single pass.
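The grammar construction described above can be sketched as a plain arc list. This is an illustration under several assumptions, not the WFST of FIGS. 3 to 5: the arc tuple layout, the `PAUSE` symbol, `build_grammar`, `accepts`, and the default weights are all hypothetical; fillers, syllables, and pauses are allowed to mix freely within one boundary; labels are surface forms, so the final conversion to phonemes is omitted.

```python
PAUSE = "<pause>"  # hypothetical symbol for silence


def build_grammar(morphemes, fillers, syllables, max_repeat=2,
                  alpha=1.0, beta=1.0, gamma=1.0):
    """Return (arcs, start, final) for a linear grammar that optionally
    accepts up to max_repeat unnecessary words at each morpheme boundary."""
    arcs = []            # each arc is (source, destination, label, weight)
    entry_states = [0]   # states from which the next morpheme may start
    next_state = 1
    for index, word in enumerate(morphemes):
        word_end = next_state
        next_state += 1
        for state in entry_states:
            arcs.append((state, word_end, word, 0.0))
        entry_states = [word_end]
        if index == len(morphemes) - 1:
            break  # no boundary after the last morpheme
        previous = word_end
        for _ in range(max_repeat):
            inserted = next_state
            next_state += 1
            for filler in fillers:
                arcs.append((previous, inserted, filler, alpha))
            for syllable in syllables:
                arcs.append((previous, inserted, syllable, beta))
            arcs.append((previous, inserted, PAUSE, gamma))
            entry_states.append(inserted)
            previous = inserted
    return arcs, 0, entry_states[0]


def accepts(arcs, start, final, tokens):
    """Check whether the grammar accepts the token sequence (weights ignored)."""
    current = {start}
    for token in tokens:
        current = {dst for src, dst, label, _ in arcs
                   if src in current and label == token}
        if not current:
            return False
    return final in current


arcs, start, final = build_grammar(
    ["明日は", "快晴", "ですよねー"],
    fillers=["えー", "あー"],
    syllables=["あ", "か", "い", "て", "ふ"],
)
print(accepts(arcs, start, final, ["明日は", "か", "い", "快晴", "えー", "ですよねー"]))
```

Letting the next morpheme start from every state in the insertion chain avoids epsilon arcs, at the cost of a few duplicated word arcs per boundary.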

In step S3, the speech recognition unit 3 receives as input the speech data of the uttered speech and the grammar model output by the grammar generation unit 2, performs speech recognition on the speech data using the grammar model, and obtains one or more speech recognition result candidates. The speech recognition result candidates are sent to the recognition result selection unit 4. Any speech recognizer may be used for the speech recognition.

For example, if the actual utterance is "明日は、かい、快晴えーですよねー" and its transcription text is "明日は快晴ですよねー", the speech recognition result candidates are, for example, as follows. The case of outputting the candidates up to third place (3-best) is shown.
1st: 明日は、かひ、快晴ええですよねー
2nd: 明日はかい、快晴えーですよねー
3rd: 明日は、たい、快晴えですよねー

If the speech recognizer used in the speech recognition unit 3 can output a word lattice, a word lattice is generated and output as the speech recognition result candidates. FIG. 6 shows the above speech recognition result candidates output as a word lattice. In FIG. 6, blocks with bold solid outlines represent syllable strings, blocks with double outlines represent fillers, and blocks with bold dotted outlines represent silence (pause). Each path (arrow) has a corresponding score from speech recognition, which is omitted in FIG. 6.

In step S4, the recognition result selection unit 4 receives as input the speech recognition result candidates output by the speech recognition unit 3 and selects the speech recognition result of the uttered speech from among them. The selected speech recognition result is output as transcription text with unnecessary words.

When the input is a set of speech recognition result candidates, the similarity between each syllable string inserted into the transcription text and the word that follows it, skipping fillers and pauses, is calculated for each candidate. If one candidate contains a plurality of syllable strings, a single value is derived by some means, such as the mean, median, or maximum of the similarities calculated for the individual syllable strings. The candidate with the highest similarity calculated in this way is output as the final speech recognition result.
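The selection from N-best candidates can be sketched as follows. This is an illustration under stated assumptions: each candidate is given as a list of (kind, katakana reading) pairs, each katakana character is treated as one syllable, similarity is the count of position-wise matching syllables, a candidate without any syllable string scores 0.0, and the names `syllable_match` and `candidate_score` are hypothetical. The three candidates are a hand-made encoding of the 3-best example above.

```python
from statistics import mean


def syllable_match(syllables, word):
    """Count positions at which the two katakana strings agree."""
    return sum(1 for a, b in zip(syllables, word) if a == b)


def candidate_score(tokens, aggregate=mean):
    """Aggregate the similarity of every syllable string to its following word."""
    similarities = []
    for i, (kind, reading) in enumerate(tokens):
        if kind != "syllables":
            continue
        # Skip fillers and pauses and take the next regular word.
        following = next((r for k, r in tokens[i + 1:] if k == "word"), None)
        if following is not None:
            similarities.append(syllable_match(reading, following))
    return aggregate(similarities) if similarities else 0.0


candidates = [
    [("word", "アスワ"), ("pause", ""), ("syllables", "カヒ"), ("pause", ""),
     ("word", "カイセイ"), ("syllables", "エエ"), ("word", "デスヨネー")],
    [("word", "アスワ"), ("syllables", "カイ"), ("pause", ""),
     ("word", "カイセイ"), ("filler", "エー"), ("word", "デスヨネー")],
    [("word", "アスワ"), ("pause", ""), ("syllables", "タイ"), ("pause", ""),
     ("word", "カイセイ"), ("syllables", "エ"), ("word", "デスヨネー")],
]
scores = [candidate_score(c) for c in candidates]
best_index = max(range(len(candidates)), key=lambda i: scores[i])
print(scores, best_index)
```

Passing `aggregate=max` or `aggregate=statistics.median` switches to the other aggregation options mentioned above without changing the rest of the sketch.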

When the input is a word lattice, for each word position that has multiple candidates, the similarity between each candidate that is a syllable string and the word that follows it, skipping fillers and pauses, is calculated, and the score of the path passing through the syllable string with the highest similarity is rescored. In FIG. 7, the word shown as a black-filled block is an example of the word ("快晴") that follows, skipping fillers and pauses, a word position with multiple candidates ("かい", "かひ", ...). Because in hesitations and restatements the correct content often appears immediately after the unnecessary word, the syllable string recognized for an unnecessary word other than a filler is compared with the word that follows it. Finally, the maximum-likelihood path is found from the rescored paths and output as the speech recognition result. In the example of FIG. 8, the bold solid arrows indicate the maximum-likelihood path, showing that "明日はかい、快晴ええですよねー" was selected as the maximum-likelihood path.
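The rescoring idea can be sketched on a simplified lattice. This is not a general word lattice: it assumes a confusion-network-like list of slots, each holding alternatives of the form (katakana reading, kind, score), with higher scores being better and word slots holding a single alternative. The scores, the additive bonus `weight * similarity`, and the names `rescore` and `best_path` are hypothetical choices made for illustration.

```python
def syllable_match(syllables, word):
    """Count positions at which the two katakana strings agree."""
    return sum(1 for a, b in zip(syllables, word) if a == b)


def rescore(slots, weight=1.0):
    """Boost the competing syllable string most similar to the following word."""
    rescored = [list(slot) for slot in slots]
    for i, slot in enumerate(slots):
        competing = [j for j, alt in enumerate(slot) if alt[1] == "syllables"]
        if len(competing) < 2:
            continue  # nothing to choose between
        # Skip fillers and pauses and take the next regular word.
        following = next((s[0][0] for s in slots[i + 1:] if s[0][1] == "word"), None)
        if following is None:
            continue
        best = max(competing, key=lambda j: syllable_match(slot[j][0], following))
        reading, kind, score = slot[best]
        bonus = weight * syllable_match(reading, following)
        rescored[i][best] = (reading, kind, score + bonus)
    return rescored


def best_path(slots):
    """Pick the highest-scoring alternative in every slot."""
    return [max(slot, key=lambda alt: alt[2])[0] for slot in slots]


# Made-up scores; readings follow the 3-best example in this document.
slots = [
    [("アスワ", "word", 0.0)],
    [("カヒ", "syllables", -1.0), ("カイ", "syllables", -1.2), ("タイ", "syllables", -1.5)],
    [("<pause>", "pause", -0.1)],
    [("カイセイ", "word", 0.0)],
    [("エエ", "syllables", -0.8), ("エー", "filler", -0.9), ("エ", "syllables", -1.1)],
    [("デスヨネー", "word", 0.0)],
]
before = best_path(slots)
after = best_path(rescore(slots))
print(before)
print(after)
```

With these made-up scores the path changes from カヒ to カイ in the second slot while エエ remains selected, which corresponds to the reading of "明日はかい、快晴ええですよねー".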

Whichever form the input takes, any method may be used to calculate the similarity, provided that it is an index for which a larger value means greater similarity. Examples of calculation methods include (1) the agreement of syllable notation between the syllable string and the following word, and (2) the edit distance between the phoneme string of the syllable string and the phoneme string of the following word. The former uses, for example, the number of syllables whose notation matches as the similarity. For the latter, a general edit distance calculation method may be used; refinements are possible, such as making the distance smaller when the phonemes fall into similar classes by manner of articulation (plosives, fricatives, and so on).
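Both similarity measures can be sketched as follows. The assumptions are stated plainly: the `MANNER` table is a hypothetical, incomplete grouping of a few romanized consonants, the substitution cost of 0.5 for phonemes in the same manner class is an arbitrary choice, and negating the edit distance is one possible way (not stated in the text) to obtain an index in which a larger value means greater similarity.

```python
# Hypothetical manner-of-articulation classes for a few consonants only.
MANNER = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop", "k": "stop", "g": "stop",
    "s": "fricative", "z": "fricative", "h": "fricative",
    "m": "nasal", "n": "nasal",
}


def syllable_match_count(syllables_a, syllables_b):
    """Measure (1): number of positions with identical syllable notation."""
    return sum(1 for a, b in zip(syllables_a, syllables_b) if a == b)


def substitution_cost(a, b):
    """Cheaper substitution when both phonemes share a manner class."""
    if a == b:
        return 0.0
    if a in MANNER and MANNER.get(a) == MANNER.get(b):
        return 0.5
    return 1.0


def edit_distance(source, target):
    """Weighted Levenshtein distance between two phoneme sequences."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [float(i)]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,                        # delete s
                current[j - 1] + 1.0,                     # insert t
                previous[j - 1] + substitution_cost(s, t),
            ))
        previous = current
    return previous[-1]


def edit_similarity(source, target):
    """Measure (2) turned into a larger-is-more-similar index by negation."""
    return -edit_distance(source, target)


print(syllable_match_count("カイ", "カイセイ"))
print(edit_distance(["t", "a", "i"], ["k", "a", "i"]))
print(edit_distance(["s", "a", "i"], ["k", "a", "i"]))
```

Comparing a short syllable string against a whole following word penalizes the length difference through insertions, so in practice some normalization or prefix comparison might be preferable; that choice is outside what the text specifies.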

With the configuration described above, the text correction technique of the present invention generates a grammar model that includes unnecessary words from transcription text that does not include them, and restores arbitrary unnecessary words by performing speech recognition on the actual utterance using that grammar model. Transcription text that includes unnecessary words can therefore be generated efficiently.

Embodiments of the present invention have been described above, but the specific configuration is not limited to these embodiments, and it goes without saying that design changes and the like made as appropriate without departing from the spirit of the present invention are included in the present invention. The various processes described in the embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually according to the processing capability of the device executing them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiment are realized by a computer, the processing content of the functions that each device should have is described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing in accordance with the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or the computer may execute processing in accordance with the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

1 Morphological analysis unit
2 Grammar generation unit
3 Speech recognition unit
4 Recognition result selection unit
5 Pronunciation dictionary storage unit

Claims

A text correction device comprising:
a grammar generation unit that generates a grammar model by inserting unnecessary words at each morpheme boundary of a morphological analysis result of transcription text transcribed from uttered speech;
a speech recognition unit that generates a plurality of speech recognition result candidates by performing speech recognition on speech data of the uttered speech using the grammar model; and
a recognition result selection unit that calculates, for each speech recognition result candidate, a similarity between an unnecessary word that is a syllable string included in the speech recognition result candidate and a word that follows the unnecessary word, and selects a speech recognition result of the uttered speech from the speech recognition result candidates based on the similarity.

The text correction device according to claim 1,
wherein the grammar generation unit generates the grammar model by inserting, at each morpheme boundary of the morphological analysis result, the unnecessary words including a filler, a syllable, or silence.

The text correction device according to claim 2,
wherein the grammar generation unit generates the grammar model by inserting, at each morpheme boundary of the morphological analysis result, the unnecessary words including a filler sequence in which arbitrary fillers occur consecutively and a syllable sequence in which a plurality of syllables occur consecutively.

The text correction device according to any one of claims 1 to 3,
wherein the recognition result selection unit calculates, as the similarity, the number of syllables whose syllable notation matches between the unnecessary word that is a syllable string included in the speech recognition result candidate and the syllable string of the word following the unnecessary word.

The text correction device according to any one of claims 1 to 3,
wherein the recognition result selection unit calculates, as the similarity, an edit distance between the phoneme string of the unnecessary word that is a syllable string included in the speech recognition result candidate and the phoneme string of the syllable string of the word following the unnecessary word.

A text correction method in which:
a grammar generation unit generates a grammar model by inserting unnecessary words at each morpheme boundary of a morphological analysis result of transcription text transcribed from uttered speech;
a speech recognition unit generates a plurality of speech recognition result candidates by performing speech recognition on speech data of the uttered speech using the grammar model; and
a recognition result selection unit calculates, for each speech recognition result candidate, a similarity between the syllable string of each unnecessary word included in the speech recognition result candidate and the syllable string of the word following the unnecessary word, and selects a speech recognition result of the uttered speech from the speech recognition result candidates based on the similarity.

A program for causing a computer to function as the text correction device according to claim 1.