JP2012022069A

JP2012022069A - Speech recognition method, and device and program for the same

Info

Publication number: JP2012022069A
Application number: JP2010158472A
Authority: JP
Inventors: Taichi Asami; 太一浅見; Satoru Kobashigawa; 哲小橋川; Yoshikazu Yamaguchi; 義和山口; Hirokazu Masataki; 浩和政瀧
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-07-13
Filing date: 2010-07-13
Publication date: 2012-02-02
Anticipated expiration: 2030-07-13
Also published as: JP5406797B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition method, for instance, appropriate for generation of a text to be used for data mining.SOLUTION: The speech recognition method comprises: a speech recognition process; a speech document recognition reliability calculation process, a speech document removal process; and a word removal process. The speech recognition process outputs a speech recognition result, in which the reliability of word recognition is added to each word after speech recognition processing is applied to an input speech document, and the speech document recognition reliability calculation process accepts the speech recognition result as an input, and calculates and outputs speech document recognition reliability which is the recognition reliability for the whole speech document. Then the speech document removal process accepts the speech recognition result and the speech document recognition reliability as an input and removes a speech document that falls short of a prescribed threshold of speech document recognition reliability, and the word removal process removes a word that falls short of a prescribed threshold of word recognition reliability, from the speech recognition results of the speech documents that are not removed in the speech document removal process.

Description

この発明は、例えばデータマイニングに用いるテキストを生成するのに好適な音声認識方法とその装置と、プログラムに関する。 The present invention relates to a speech recognition method and apparatus suitable for generating text used for data mining, for example, and a program.

テキストデータとして収集されたデータを統計的に分析する手法を、一般的にテキストマイニングと称する。そのテキストデータを得る目的で音声認識を用いる場合がある。音声認識には、誤認識が付き物である。そこで、従来から、この音声認識誤りを減らす工夫が試みられている。 A technique for statistically analyzing data collected as text data is generally referred to as text mining. Voice recognition may be used for the purpose of obtaining the text data. Misrecognition is a natural part of speech recognition. Thus, conventionally, attempts have been made to reduce this speech recognition error.

例えば、音声認識結果に、その認識結果の確からしさを表す認識信頼度を付与する方法がある。特許文献１に、音声認識処理の探索の結果の上位Ｎ位までのＮベスト候補のスコア１位の単語w₁の認識信頼度を、単語w₁のスコアとスコア２位以下でw₁と異なる単語w₂とのスコア差を単語w₁の継続時間長で正規化した値とする考えが示されている。 For example, there is a method of giving a recognition reliability indicating the certainty of the recognition result to the voice recognition result. In Patent Document 1, the recognition reliability of the word w _{1 with} the highest score of the N best candidates up to the top N as a result of the search of the speech recognition process is different from the score of the word w ₁ with the score of 2 or lower and w _1. The idea that the score difference from the word w ₂ is normalized by the duration of the word w ₁ is shown.

また、別の方法として、音声認識結果中の各単語間の関連の強さを計測して周囲の単語と関連の強い単語に、高い認識信頼度を付与し、関連の弱い単語に低い認識信頼度を付与する方法がある（非特許文献１）。この方法は、単語w及び単語wの直前k個の単語と直後の１単語とのn個の単語集合N(w)を音声認識結果から取得する。そして、その単語集合N(w)に含まれる全ての２単語の組み合わせ（w_i,w_j）に対して、事前に学習コーパス上で算出した相互情報量MI（w_i,w_j）を用いて単語間の強さS（w_i,w_j）を計算する。また、単語集合N(w)中の全ての単語tについての関連の強さS（t,w_i）の平均値を文脈一貫性尺度SC(t)として計算する。 Another method is to measure the strength of association between each word in the speech recognition result, and give high recognition confidence to words that are strongly related to surrounding words, and low recognition confidence to weakly related words. There is a method of providing a degree (Non-Patent Document 1). In this method, n word sets N (w) of the word w and k words immediately before the word w and one word immediately after the word w are acquired from the speech recognition result. Then, the mutual information MI (w _i , w _j ) calculated in advance on the learning corpus is used for all combinations (w _i , w _j ) of two words included in the word set N (w). Then, the strength S (w _i , w _j ) between words is calculated. Further, the average value of the relation strength S (t, w _i ) for all the words t in the word set N (w) is calculated as the context consistency measure SC (t).

大量に蓄積された音声文書をデータマイニング処理する際は、上記したような認識信頼度が付与されたテキストが用いられる。 When data mining processing is performed on a large amount of stored voice documents, text with the above-described recognition reliability is used.

特開２００５−１４８３４２号公報JP 2005-148342 A

D. Inkpen, A. Desilets, “Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts,”Proceedings of HLT/EMNLP, pp.49-56, October 2005.D. Inkpen, A. Desilets, “Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts,” Proceedings of HLT / EMNLP, pp.49-56, October 2005.

大量に蓄積された音声文書を音声認識によってテキスト化し、全文検索・文書分類・パターン抽出などのテキストマイニング処理を行う際、実際には発声されていない単語が音声認識誤りによってテキスト中に現れることにより、誤った情報がテキストマイニング結果に含まれてしまう問題がある。テキストマイニング結果に誤った情報が多数含まれる場合、テキストマイニング利用者が得られた結果を見て有用な知見を得ることは難しい。 When a large amount of speech documents are converted into text by speech recognition and text mining processing such as full-text search, document classification, and pattern extraction is performed, words that are not actually spoken appear in the text due to speech recognition errors. There is a problem that incorrect information is included in the text mining result. When a lot of incorrect information is included in the text mining result, it is difficult for a text mining user to obtain useful knowledge by looking at the obtained result.

上記したように、音声認識結果の各単語に認識信頼度を付与して認識信頼度の低い単語をテキストマイニング処理の対象から除外することで、誤った情報の抽出量を削減することが可能である。しかし、音声認識結果全体の認識精度が低い場合、単語に対する認識信頼度は認識の確からしさを正しく表さない傾向がある。例えば、周囲雑音の大きな環境で収録された音声文書の認識誤り単語に適切に認識信頼度を付与することは難しい。また、話者の話し方の差異を原因とする場合も、音声文書全体の認識信頼度が低くなり各単語に適切な認識信頼度を付与できない。 As mentioned above, it is possible to reduce the amount of erroneous information extracted by giving recognition confidence to each word of the speech recognition result and excluding words with low recognition confidence from the target of text mining processing. is there. However, when the recognition accuracy of the entire speech recognition result is low, the recognition reliability for a word tends not to accurately represent the probability of recognition. For example, it is difficult to appropriately assign a recognition reliability to a recognition error word of an audio document recorded in an environment with a large ambient noise. In addition, even when the cause is a difference in the speaker's way of speaking, the recognition reliability of the entire voice document becomes low, and an appropriate recognition reliability cannot be given to each word.

この発明は、このような課題に鑑みてなされたものであり、音声認識によってテキストデータに認識誤り単語が含まれ難いようにした音声認識方法とその装置と、プログラムを提供することを目的とする。 The present invention has been made in view of the above problems, and an object of the present invention is to provide a speech recognition method, a device thereof, and a program that make it difficult for a recognition error word to be included in text data by speech recognition. .

この発明の音声認識方法は、音声認識過程と、音声文書認識信頼度計算過程と、音声文書除去過程と、単語除去過程と、を含む。音声認識過程は、入力される音声文書を音声認識処理した単語毎に単語認識信頼度を付与した音声認識結果を出力する。音声文書認識信頼度計算過程は、音声認識結果を入力として音声文書全体の認識信頼度である音声文書認識信頼度を計算して出力する。音声文書除去過程は、音声認識結果と音声文書認識信頼度とを入力として所定の音声文書認識信頼度閾値未満の音声文書を除去する。単語除去過程は、音声文書除去過程で除去されなかった音声文書の音声認識結果から所定の単語認識信頼度閾値未満の単語認識信頼度の単語を除去する。 The speech recognition method of the present invention includes a speech recognition process, a speech document recognition reliability calculation process, a speech document removal process, and a word removal process. The speech recognition process outputs a speech recognition result with word recognition reliability assigned to each word obtained by performing speech recognition processing on the input speech document. In the voice document recognition reliability calculation process, the voice recognition result, which is the recognition reliability of the entire voice document, is calculated and output using the voice recognition result as an input. In the voice document removal process, a voice document less than a predetermined voice document recognition reliability threshold is removed by inputting the voice recognition result and the voice document recognition reliability. In the word removal process, words having a word recognition reliability less than a predetermined word recognition reliability threshold are removed from the voice recognition result of the voice document that has not been removed in the voice document removal process.

この発明の音声認識方法は、音声文書単位での除去を行った後に、残った音声文書に対して単語単位の除去を行うので、認識信頼度が全体的に低い単語単位での除去が難しい音声認識結果を適切に除去することができる。したがって、例えば、テキストマイニング処理対象となるテキストデータに含まれる認識誤り単語の数を、効果的に削減することができる。その結果、テキストマイニングの利用者が有用な知見を得ることが可能となる。 In the speech recognition method according to the present invention, since the removal is performed in units of words after the removal in units of speech documents, it is difficult to remove in units of words whose recognition reliability is generally low. The recognition result can be appropriately removed. Therefore, for example, it is possible to effectively reduce the number of recognition error words included in text data to be subjected to text mining processing. As a result, users of text mining can obtain useful knowledge.

また、この発明の音声認識方法を、音声認識に用いる確率モデルの教師なし適応に利用しても効果的である。つまり、認識誤り単語の少ない音声認識結果を収集することが可能であるので、収集した認識誤りの少ない音声認識結果を教師なし適応に用いれば音声認識精度を向上させることができる。 It is also effective to use the speech recognition method of the present invention for unsupervised adaptation of a probability model used for speech recognition. That is, since speech recognition results with few recognition error words can be collected, the speech recognition accuracy can be improved by using the collected speech recognition results with few recognition errors for unsupervised adaptation.

この発明の音声認識装置１００の機能構成例を示す図。The figure which shows the function structural example of the speech recognition apparatus 100 of this invention. 音声認識装置１００の動作フローを示す図。The figure which shows the operation | movement flow of the speech recognition apparatus 100. Ｎベスト候補と単語認識信頼度について説明する図。The figure explaining N best candidate and word recognition reliability. 音声文書認識信頼度計算部２０の機能構成例を示す図。The figure which shows the function structural example of the audio | voice document recognition reliability calculation part 20. FIG. 音声文書認識信頼度計算部２０の動作フローを示す図。The figure which shows the operation | movement flow of the voice document recognition reliability calculation part 20. FIG. 音声認識部１０が出力する単語w_n、単語認識信頼度D(w_n)の例を示す図。Word w _n output by the speech recognition unit 10, shows an example of a word recognition reliability D (w _n). この発明の音声認識装置２００の機能構成例を示す図。The figure which shows the function structural example of the speech recognition apparatus 200 of this invention. 音声認識装置２００の動作フローを示す図。The figure which shows the operation | movement flow of the speech recognition apparatus 200. 単語関連度テーブル作成装置１５０の機能構成例を示す図。The figure which shows the function structural example of the word related degree table creation apparatus 150. FIG. 単語集合を概念的に示す図。The figure which shows a word set notionally. 単語関連度テーブルの一例を示す図。The figure which shows an example of a word related degree table. 音声文書認識信頼度計算部７０の機能構成例を示す図。The figure which shows the function structural example of the audio | voice document recognition reliability calculation part. この発明の音声認識装置３００の機能構成例を示す図。The figure which shows the function structural example of the speech recognition apparatus 300 of this invention. 音声文書認識信頼度高速計算部９０の機能構成例を示す図。The figure which shows the function structural example of the audio | voice document recognition reliability high speed calculation part 90. FIG. 音声文書認識信頼度高速計算部９０の動作フローを示す図。The figure which shows the operation | movement flow of the voice document recognition reliability high speed calculation part. 単語集合と直前重複フラグと直後重複フラグの例を示す図。The figure which shows the example of a word set, an immediately preceding duplication flag, and an immediately following duplication flag. 単語集合取得ステップの動作フローの例を示す図。The figure which shows the example of the operation | movement flow of a word set acquisition step. 単語集合音響信頼度高速計算ステップの動作フローの例を示す図。The figure which shows the example of the operation | movement flow of a word set acoustic reliability high-speed calculation step.

以下、この発明の実施の形態を図面を参照して説明する。複数の図面中同一のものには同じ参照符号を付し、説明は繰り返さない。 Embodiments of the present invention will be described below with reference to the drawings. The same reference numerals are given to the same components in a plurality of drawings, and the description will not be repeated.

図１にこの発明の音声認識装置１００の機能構成例を示す。その動作フローを図２に示す。音声認識装置１００は、音声認識部１０と、音声文書認識信頼度計算部２０と、音声文書除去部３０と、単語除去部４０と、制御部５０と、を具備する。音声認識装置１００の各部の機能は、例えばＲＯＭ、ＲＡＭ、ＣＰＵ等で構成されるコンピュータに所定のプログラムが読み込まれて、ＣＰＵがそのプログラムを実行することで実現されるものである。 FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention. The operation flow is shown in FIG. The speech recognition apparatus 100 includes a speech recognition unit 10, a speech document recognition reliability calculation unit 20, a speech document removal unit 30, a word removal unit 40, and a control unit 50. The function of each part of the speech recognition apparatus 100 is realized by reading a predetermined program into a computer composed of, for example, a ROM, a RAM, a CPU, etc., and executing the program by the CPU.

音声認識部１０は、入力される音声文書を音声認識処理した単語毎に単語認識信頼度を付与した音声認識結果を出力する（ステップＳ１０）。音声認識部１０は、図示しない内部の音響分析部によって、音声文書を、数十msecのフレームと呼ばれる単位でＬＰＣケプストラム、ＭＦＣＣ、その他の音響特徴パラメータ系列に分析する。そして、辞書と言語モデルを用いて入力音声に対する認識結果候補の探索を、音響特徴パラメータ系列について行う。探索の結果、上位Ｎ位までのＮベスト候補が、単語認識信頼度と共に音声認識結果として出力される。なお、音声文書とは、例えば、コールセンターにおける顧客とオペレータとの間で交わされる会話の集合のことであり、一要件についてまとめられた音声データである。また、例えば一つの講義が１個の音声ファイルにまとめられた様なものである。 The speech recognition unit 10 outputs a speech recognition result in which word recognition reliability is given to each word obtained by performing speech recognition processing on the input speech document (step S10). The voice recognition unit 10 analyzes the voice document into an LPC cepstrum, MFCC, and other acoustic feature parameter series in units called frames of several tens of msec by an internal acoustic analysis unit (not shown). Then, a recognition result candidate for the input speech is searched for the acoustic feature parameter series using a dictionary and a language model. As a result of the search, N best candidates up to the top N are output as a speech recognition result together with the word recognition reliability. Note that the voice document is, for example, a set of conversations exchanged between a customer and an operator at a call center, and is voice data compiled for one requirement. For example, one lecture is organized into one audio file.

ここで、図３を参照してＮベスト候補と単語認識信頼度について説明する。なお、Ｎベスト候補と単語認識信頼度については従来技術である。単語認識信頼度については、例えば特許文献１に記載されている。 Here, N best candidates and word recognition reliability will be described with reference to FIG. Note that N best candidates and word recognition reliability are conventional techniques. The word recognition reliability is described in Patent Document 1, for example.

図３の横軸は、経過時間でありフレームで表す。縦軸は、フレーム単位で探索された単語列候補を、スコアの高い順番に並べたＮベスト候補である。スコアとは探索時の尤度のことである。 The horizontal axis in FIG. 3 represents elapsed time and is represented by a frame. The vertical axis represents N best candidates in which word string candidates searched for in units of frames are arranged in descending order of scores. The score is the likelihood at the time of search.

単語認識信頼度は、フレームt_*において単語w_**(*は任意の整数)と異なる単語がＮベスト候補中に存在する場合、単語w_**のフレームt_*におけるスコアと対立候補単語のフレームt_*における次の順位のスコアとの間のスコア差によって与えられる。図３に示す例では、フレームt₁〜t₄の音響特徴パラメータ系列で探索された第１位候補の単語w₁₁（11は第１候補の１番目の単語であることを表す）の単語認識信頼度は、対向する単語が第３位候補の単語w₃₁と第２位候補の単語w₂₁であるので、それぞれのスコア差（●）の合計をフレーム数で除した値が単語認識信頼度となる。対立候補が存在しない単語w₁₃については、予め決められた固定値（○）が用いられて単語認識信頼度となる。この単語認識信頼度が候補毎に累積されて単語列の認識信頼度となる。 Word recognition reliability, if the word w _** in frame t _* (* is an arbitrary integer) is different word present in the N-best candidate words w _** frame t score paired candidacy word frames in _* Given by the score difference between the next ranking score at t _* . In the example shown in FIG. 3, the word recognition of the first candidate word w ₁₁ (11 represents the first word of the first candidate) searched for in the acoustic feature parameter series of frames t _{1 to} t _4. Since the opposite words are the third candidate word w ₃₁ and the second candidate word w ₂₁ , the value obtained by dividing the total score difference (●) by the number of frames is the word recognition reliability. It becomes. For the word w ₁₃ the conflict candidate does not exist, the word recognition reliability by predetermined fixed value (○) is used. This word recognition reliability is accumulated for each candidate and becomes the word string recognition reliability.

音声文書認識信頼度計算部２０は、単語毎の単語認識信頼度及び単語列の認識信頼度から音声文書全体の認識信頼度である音声文書認識信頼度を計算して出力する（ステップＳ２０）。 The voice document recognition reliability calculation unit 20 calculates and outputs a voice document recognition reliability that is the recognition reliability of the entire voice document from the word recognition reliability for each word and the word string reliability (step S20).

音声文書除去部３０は、音声認識部１０が出力する音声認識結果と音声文書認識信頼度計算部２０が出力する音声文書認識信頼度とを入力として、所定の音声文書認識信頼度閾値θ_d未満の音声文書を除去する（ステップＳ３０）。 The voice document removal unit 30 receives the voice recognition result output from the voice recognition unit 10 and the voice document recognition reliability output from the voice document recognition reliability calculation unit 20 as input, and is less than a predetermined voice document recognition reliability threshold θ _d. Are deleted (step S30).

単語除去部４０は、音声文書除去部３０で除去されなかった音声文書の音声認識結果から所定の単語認識信頼度閾値θ_w未満の単語認識信頼度の単語を除去する（ステップＳ４０）。音声文書認識信頼度閾値θ_dと単語認識信頼度閾値θ_wは、予め定数として各部に備えておいても良いし、外部から与えても良い。 Word removal unit 40 removes the word of a predetermined word recognition confidence threshold θ word recognition reliability of less than _w results speech recognition of the speech documents which were not removed in the spoken document removal unit 30 (step S40). The voice document recognition reliability threshold value θ _d and the word recognition reliability threshold value θ _w may be previously provided as constants in each unit, or may be given from the outside.

このように、音声認識装置１００は、音声文書単位での除去を行った後に、残った音声文書に対して単語単位の除去した音声認識結果を出力するので、音声認識結果に含まれる誤認識単語を削減することができる。 As described above, the speech recognition apparatus 100 outputs the speech recognition result in which the word unit is removed from the remaining speech document after performing the removal in the speech document unit, so that the erroneous recognition word included in the speech recognition result. Can be reduced.

図４に音声文書認識信頼度計算部２０の機能構成を示して更に詳しく動作を説明する。その動作フローを図５に示す。音声文書認識信頼度計算部２０は、単語時間長取得手段２１と、正規化解除手段２２と、音声文書総時間長計算手段２３と、信頼度累積手段２４と、音声文書認識信頼度算出手段２５と、を備える。 FIG. 4 shows the functional configuration of the voice document recognition reliability calculation unit 20, and the operation will be described in more detail. The operation flow is shown in FIG. The voice document recognition reliability calculation unit 20 includes a word time length acquisition unit 21, a denormalization unit 22, a voice document total time length calculation unit 23, a reliability accumulation unit 24, and a voice document recognition reliability calculation unit 25. And comprising.

単語時間長取得手段２１は、音声認識部２０が出力する単語の単語時間長を求める（ステップＳ２１）。単語時間長取得手段２１は、最初に音声文書の信頼度合計値D(W)と音声文書の総時間長WDを０に初期化する（ステップＳ５０）。ステップＳ５０〜Ｓ５２は、図１に示した制御部５０が処理する。 The word time length acquisition unit 21 obtains the word time length of the word output from the voice recognition unit 20 (step S21). The word time length acquisition unit 21 first initializes the total reliability D (W) of the voice document and the total time length WD of the voice document to 0 (step S50). Steps S50 to S52 are processed by the control unit 50 shown in FIG.

図６に、音声認識部２０が出力する単語w_n、単語認識信頼度D(w_n)の例を示す。ここでは、音声ファイルが１個のＮベスト候補から成る例で説明する。つまり添え字は一桁で説明する。例えば、単語w₁は名詞「本日」であり、単語認識信頼度D(w_n)は９８９１であり、その始端時刻（wd_nsFn）と終端時刻(wd_neFn)は０．００−０.９８[秒]である。単語認識信頼度D(w_n)の値は、マイナスの値もあり得る。図６の例では、「おいたわしい」がそれに該当する。第１位候補の単語のスコアが、下位の候補の単語のスコアよりも小さい場合に単語認識信頼度D(w_n)はマイナスとなる。この場合、第１位候補のその単語の信頼度は相当低いことを意味する。この単語認識信頼度D(w_n)の値は、音声認識結果の音響的な信頼度を表すものである。 FIG. 6 shows an example of the word w _n and the word recognition reliability D (w _n ) output from the speech recognition unit 20. Here, an example in which an audio file is composed of one N best candidate will be described. In other words, the subscript is described with one digit. For example, the word w ₁ is the noun “today”, the word recognition reliability D (w _n ) is 9891, and its start time (wd _n sFn) and end time (wd _n eFn) are 0.00-0. It is 98 [seconds]. The value of the word recognition reliability D (w _n ) may be a negative value. In the example of FIG. 6, “Nice” corresponds to this. When the score of the first candidate word is smaller than the score of the lower candidate word, the word recognition reliability D (w _n ) is negative. In this case, it means that the reliability of the word of the first candidate is considerably low. The value of the word recognition reliability D (w _n ) represents the acoustic reliability of the speech recognition result.

単語時間長取得手段２１は、単語w₁の終端時刻から始端時刻を引いたその単語時間長wd₁を、０．９８[秒]若しくは、フレーム時間を例えば１０msecとした場合、９８フレームとして取得する（ステップＳ２１）。 The word time length acquisition means 21 acquires the word time length wd ₁ obtained by subtracting the start time from the end time of the word w ₁ as 0.98 [sec] or 98 frames when the frame time is set to 10 msec, for example. (Step S21).

正規化解除手段２２は、単語認識信頼度D(w₁)＝９８９１に単語時間長wd₁を乗じた単語信頼度wcを計算する（ステップＳ２２）。正規化解除手段２２は、フレーム数で正規化されていた単語認識信頼度の正規化を解除する働きをする。 The denormalization means 22 calculates a word reliability wc obtained by multiplying the word recognition reliability D (w ₁ ) = 9891 by the word time length wd ₁ (step S22). The normalization cancellation means 22 functions to cancel the normalization of the word recognition reliability that has been normalized by the number of frames.

信頼度累積手段２４は、正規化が解除された単語信頼度wcを累積した単語信頼度累積値D(W)を求める（ステップＳ２４）音声文書総時間長計算手段２３は、単語時間長取得手段２１で取得した単語時間長wd_*を音声ファイル全体で累積した音声文書総時間長WDを計算する（ステップＳ２３）。上記したステップＳ２１〜Ｓ２３の処理は、音声ファイルの全ての単語w_nについて処理されるまで、単語を更新（ステップＳ５２）しながら繰り返される（ステップＳ５１のno）。 The reliability accumulating means 24 obtains a word reliability accumulated value D (W) obtained by accumulating the normalized word reliability wc (step S24). The voice document total time length calculating means 23 is a word time length obtaining means. The voice document total time length WD obtained by accumulating the word time length wd _* acquired in 21 in the entire voice file is calculated (step S23). Processing in steps S21~S23 described above is repeated until the processing for all words w _n of the audio file, while updating the word (step S52) (no in step S51).

音声文書認識信頼度算出手段２５は、単語信頼度累積値D(W)を音声文書総時間長WDで除した音声文書の音声文書認識信頼度docCを算出する（ステップＳ２５）。音声文書認識信頼度算出手段２５は、音声文書のフレーム当たりの認識信頼度を求めることになる。この音声文書認識信頼度docCは、音声文書の音声認識結果の音響的な品質を表す指標となる。 The voice document recognition reliability calculation means 25 calculates the voice document recognition reliability docC of the voice document obtained by dividing the word reliability cumulative value D (W) by the voice document total time length WD (step S25). The voice document recognition reliability calculation means 25 obtains the recognition reliability per frame of the voice document. The voice document recognition reliability docC is an index representing the acoustic quality of the voice recognition result of the voice document.

音声文書除去部３０は、音声文書の音響的な信頼度を表す音声文書認識信頼度docCと、事前に決められた音声文書認識信頼度閾値θ_dとを比較して音声文書認識信頼度docCがθ_d以上の場合、音声文書をそのまま単語除去部４０に出力する。音声文書認識信頼度docCがθ_d未満の場合は、その音声文書を単語除去部４０に出力しない。 The voice document removal unit 30 compares the voice document recognition reliability docC representing the acoustic reliability of the voice document with a predetermined voice document recognition reliability threshold θ _d to obtain a voice document recognition reliability docC. If it is equal to or greater than θ _d , the voice document is output as it is to the word removal unit 40. When the voice document recognition reliability docC is less than θ _d , the voice document is not output to the word removal unit 40.

音声文書認識信頼度閾値θ_dは実数値であり、大きな値（例えば30000程度）に設定すると出力される音声文書の音声認識精度は高くなる。小さな値（例えば-30000程度）に設定すると出力される音声文書の音声認識精度は低くなるが、出力される音声文書の数は増加する。 The voice document recognition reliability threshold θ _d is a real value, and when set to a large value (for example, about 30000), the voice recognition accuracy of the output voice document is increased. When a small value (for example, about −30000) is set, the voice recognition accuracy of the output voice document is lowered, but the number of output voice documents is increased.

単語除去部４０は、音声文書除去部３０が出力した音声文書を構成する単語とその単語に付与された単語認識信頼度D(w_n)を入力として、単語認識信頼度D(w_n)の値が所定の単語認識信頼度閾値θw未満であれば当該単語を、除去されたことを表す所定の記号、例えば“<rejected>”に置換して音声文書を出力する。 The word removal unit 40 receives the words constituting the voice document output from the voice document removal unit 30 and the word recognition reliability D (w _n ) assigned to the words, and inputs the word recognition reliability D (w _n ). If the value is less than a predetermined word recognition reliability threshold value θw, the word is replaced with a predetermined symbol indicating that it has been removed, for example, “<rejected>”, and a speech document is output.

以上の処理によって最終的に出力される音声認識結果は、音声認識精度の比較的高い音声文書から更に単語認識信頼度が低い単語が除去されたものになり、音声認識結果に含まれる誤認識単語を削減することができる。この音声認識装置１００を用いてデータマイニング用のテキストデータを取得すると、テキストデータに含まれる誤認識単語の数を減らすことができるので、テキストマイニングの利用者が有用な知見を得ることが可能になる。 The speech recognition result that is finally output by the above processing is obtained by removing words with lower word recognition reliability from a speech document having a relatively high speech recognition accuracy, and is a misrecognized word included in the speech recognition result. Can be reduced. When text data for data mining is acquired using the speech recognition apparatus 100, the number of misrecognized words included in the text data can be reduced, so that a user of text mining can obtain useful knowledge. Become.

図７にこの発明の音声認識装置２００の機能構成例を示す。その動作フローを図８に示す。音声認識装置２００は、上記した音声文書認識信頼度docCを、音響信頼度と文脈信頼度を組み合わせた値にしたものである。よって、音声文書認識信頼度計算部７０のみが、音声認識装置２００と異なる。音響信頼度とは、音声認識装置１００の音声文書認識信頼度docCと同じものである。以降、実施例１で述べた音声文書認識信頼度docCを音響信頼度と称する。 FIG. 7 shows a functional configuration example of the speech recognition apparatus 200 of the present invention. The operation flow is shown in FIG. The speech recognition apparatus 200 is a combination of the above-described speech document recognition reliability docC and a combination of acoustic reliability and context reliability. Therefore, only the speech document recognition reliability calculation unit 70 is different from the speech recognition apparatus 200. The acoustic reliability is the same as the voice document recognition reliability docC of the voice recognition apparatus 100. Hereinafter, the voice document recognition reliability docC described in the first embodiment is referred to as acoustic reliability.

音声認識装置２００の音声文書認識信頼度計算部７０は、音声認識結果から文脈信頼度を得るために、音声認識結果を構成する単語間の関連度を示す単語関連度テーブル６０を参照する。単語関連度テーブル６０を作成する単語関連度テーブル作成装置１５０の動作を説明する。 The speech document recognition reliability calculation unit 70 of the speech recognition apparatus 200 refers to the word association degree table 60 indicating the association degree between words constituting the speech recognition result in order to obtain context reliability from the speech recognition result. The operation of the word association degree table creation device 150 that creates the word association degree table 60 will be described.

〔単語関連度テーブル作成装置〕
図９に単語関連度テーブル作成装置１５０の機能構成例を示す。単語関連度テーブル作成装置１５０は、学習コーパス８１と、形態素解析部８２と、学習コーパス単語集合取得部８３と、単語リスト８４と、単語カウント部８５と、単語関連度計算部８６と、テーブル配列部８７と、を具備する。 [Word relevance table creation device]
FIG. 9 shows a functional configuration example of the word association degree table creation device 150. The word association degree table creation device 150 includes a learning corpus 81, a morpheme analysis unit 82, a learning corpus word set acquisition unit 83, a word list 84, a word count unit 85, a word association degree calculation unit 86, and a table array. Part 87.

学習コーパス８１は音声文書を大規模に集積したものである。形態素解析部８２は、学習コーパス８１から音声文書を読み出して単語に分割する周知の形態素解析処理を行い、各単語の前後に単語境界を表す記号、例えば「\ｎ」を付与した単語境界付き学習コーパスを出力する。形態素解析処理は周知であり、例えば参考文献「特許第３３７９６４３号」に記載されている。 The learning corpus 81 is a collection of voice documents on a large scale. The morpheme analysis unit 82 performs a well-known morpheme analysis process that reads a speech document from the learning corpus 81 and divides it into words, and learns with word boundaries provided with symbols representing word boundaries before and after each word, for example, “\ n” Output corpus. The morphological analysis processing is well known and is described in, for example, a reference document “Japanese Patent No. 3379643”.

学習コーパス単語集合取得部８３は、形態素解析部８２が出力する単語境界付き学習コーパスの先頭から末尾まで、窓幅n単語、窓シフト量m単語で窓かけを行い、各窓に含まれる単語リスト８４に記載された単語をまとめて単語集合とし、窓ごとの単語集合を出力する。単語リスト８４は、音声認識結果に出現し得る全ての単語が記載されたものであり、事前に作成しておく。図１０に、単語集合を概念的に示す。横方向は時間経過であり、単語集合をN₁〜N_ｈで示す。mは窓シフト量であり、nは窓幅である。隣り合う単語集合は、n-m個の単語を共有する関係にある。 The learning corpus word set acquisition unit 83 performs windowing with a window width n words and a window shift amount m words from the beginning to the end of the word-boundary learning corpus output by the morpheme analysis unit 82, and a word list included in each window The words described in 84 are combined into a word set, and a word set for each window is output. The word list 84 describes all the words that can appear in the speech recognition result, and is created in advance. FIG. 10 conceptually shows the word set. The horizontal direction is the passage of time, and the word set is indicated by N _{1 to} N _h . m is the window shift amount, and n is the window width. Adjacent word sets share a relationship of nm words.

単語カウント部８５は、学習コーパス単語集合取得部８３が出力する単語集合を入力として、単語集合内の各単語の単独生起回数C(w)、各単語ペアの生起回数C（w_i,w_j）、単語集合の総数をカウントして出力する。単語wの生起回数C(w)とは、単語wを含む単語集合の個数である。単語ペア（w_i,w_j）の生起回数C（w_i,w_j）とは、w_iとw_jを共に含む単語集合の個数である。 The word count unit 85 receives the word set output from the learning corpus word set acquisition unit 83 as an input, and the number of occurrences C (w) of each word in the word set and the number of occurrences C (w _i , w _{j of} each word pair) ), Count and output the total number of word sets. The number of occurrences C (w) of the word w is the number of word sets including the word w. The number of occurrences C (w _i , w _j ) of the word pair (w _i , w _j ) is the number of word sets including both w _i and w _j .

単語関連度計算部８６は、各単語ペア（w_i,w_j）の関連度S（w_i,w_j）を例えば式（１）で計算する。 The word relevance calculation unit 86 calculates the relevance S (w _i , w _j ) of each word pair (w _i , w _j ) using, for example, equation (1).

Nは単語集合の総数、C(w)は単語wの単独生起回数、C（w_i,w_j）は単語w_iとw_jの共起回数である。関連度Ｓ（w_i,w_j）の値が大きいと、それらの単語同士の関連性が高いことを意味する。 N is the total number of word sets, C (w) is the number of single occurrences of word w, and C (w _i , w _j ) is the number of co-occurrence of words w _i and w _j . When the value of the relevance S (w _i , w _j ) is large, it means that the relevance between these words is high.

テーブル配列部８７は、単語w_iとw_jから計算した関連度S（w_i,w_j）を参照できるテーブルの形式に配列する。図１１に単語関連度テーブル６０の一例を示す。最上列と最左列は単語w₁〜w_Nであり、各行と各列の交差する欄にそれぞれの単語の関連度S（w_i,w_j）が配列される。 The table arrangement unit 87 arranges the relevance S (w _i , w _j ) calculated from the words w _i and w _j in a table format that can be referred to. FIG. 11 shows an example of the word association degree table 60. The uppermost column and the leftmost column are words w _{1 to} w _N , and the relevance S (w _i , w _j ) of each word is arranged in a column where each row and each column intersect.

図１２に音声文書認識信頼度計算部７０の機能構成例を示す。音声文書認識信頼度計算部７０は、認識結果単語集合取得手段７１と、単語集合音響信頼度計算手段７２と、単語集合信頼度計算手段７３と、信頼度統合手段７４と、を具備する。 FIG. 12 shows a functional configuration example of the voice document recognition reliability calculation unit 70. The voice document recognition reliability calculation unit 70 includes a recognition result word set acquisition unit 71, a word set acoustic reliability calculation unit 72, a word set reliability calculation unit 73, and a reliability integration unit 74.

認識結果単語集合取得手段７１は、音声認識結果の単語w_kと単語認識信頼度C(w_k)を、その先頭から所定の数n個ずつnより小さい数のm個移動させながらh個の単語集合に分割する。つまり、音声文書の音声認識結果の先頭からｎ単語を取得し、得られた単語集合をN₁とする。そして、音声認識結果の先頭からm単語目からn単語を取得し、得られた単語集合をN₂とする。次に、音声認識結果の先頭から2m単語目からn単語を取得し、得られた単語集合をN₃とする。以上の処理を音声文書の末尾に到達するまで繰り返し、h個の単語集合N_k(N_k：N₁〜N_h)を取得する。ｋは、実施例１ではnを用いた着目している単語集合及び単語を識別する変数である。 The recognition result word set acquisition unit 71 moves the words w _k and the word recognition reliability C (w _k ) of the speech recognition result by moving a predetermined number n from the head by m, which is smaller than n. Divide into word sets. That is, n words are acquired from the head of the speech recognition result of the speech document, and the obtained word set is set to N ₁ . Then, to get the n words of m words from the beginning of the speech recognition results, resulting word set to N _2. Next, n words are acquired from the 2m-th word from the beginning of the speech recognition result, and the obtained word set is defined as N ₃ . The above processing is repeated until the end of the speech document is reached, and h word sets N _k (N _k : N _{1 to} N _h ) are acquired. k is a variable that identifies a focused word set and word using n in the first embodiment.

単語集合音響信頼度計算手段７２は、各単語集合に含まれる全ての単語の単語認識信頼度D(w)と単語時間長wd_kを取得して実施例１で説明済みの音声文書認識信頼度計算部２０と同じ処理をして単語集合N_k毎に音響信頼度CA(N_k)を計算する。 The word set acoustic reliability calculation means 72 acquires the word recognition reliability D (w) and the word time length wd _k of all the words included in each word set, and the speech document recognition reliability described in the first embodiment. The acoustic reliability CA (N _k ) is calculated for each word set N _k by performing the same processing as the calculation unit 20.

単語集合文脈信頼度計算手段７３は、各単語集合に含まれる全ての２単語の組み合わせ（w_i,w_j）に対して単語の関連の強さS（w_i,w_j）を単語関連度テーブル６０を参照して求め、その平均値を各単語集合の文脈信頼度CL(N_k)として計算する。 The word set context reliability calculation means 73 calculates the word association strength S (w _i , w _j ) for all combinations of two words (w _i , w _j ) included in each word set. The average value is calculated as the context reliability CL (N _k ) of each word set.

信頼度統合手段７４は、それぞれh個の音響信頼度CA(N_k)と文脈信頼度CL(N_k)の平均値を音声文書認識信頼度として計算して出力する。このように、音響信頼度CA(N_k)と文脈信頼度CL(N_k)を用いて音声文書の認識信頼度を算出することで、実施例１の音声文書認識信頼度よりも精度の高い音声文書認識信頼度とすることができる。 The reliability integration unit 74 calculates and outputs an average value of h acoustic reliability CA (N _k ) and context reliability CL (N _k ) as a voice document recognition reliability. In this way, by calculating the recognition reliability of the voice document using the acoustic reliability CA (N _k ) and the context reliability CL (N _k ), the accuracy is higher than the voice document recognition reliability of the first embodiment. The voice document recognition reliability can be obtained.

図１３に、音声文書認識信頼度を計算する処理を省力化したこの発明の音声認識装置３００の機能構成例を示す。音声認識装置３００は、音声文書認識信頼度高速計算部９０を備える点で、音声認識装置２００と異なる。音声文書認識信頼度高速計算部９０は、単語集合N_k間で重複する信頼度加算処理を省くことで、音響信頼度の計算を高速に計算するようにしたものである。 FIG. 13 shows an example of the functional configuration of the speech recognition apparatus 300 of the present invention in which the processing for calculating the speech document recognition reliability is saved. The speech recognition apparatus 300 is different from the speech recognition apparatus 200 in that it includes a speech document recognition reliability high speed calculation unit 90. The voice document recognition reliability high-speed calculation unit 90 calculates the acoustic reliability at high speed by omitting the reliability addition processing overlapping between the word sets _Nk .

図１４に音声文書認識信頼度高速計算部９０の機能構成例を示す。その動作フローを図１５に示す。音声文書認識信頼度高速計算部９０は、単語集合取得手段９１と単語集合音響信頼度高速計算手段９２を備える点で、音声認識装置２００の単語集合音響信頼度計算手段７２と異なる。 FIG. 14 shows a functional configuration example of the voice document recognition reliability high-speed calculation unit 90. The operation flow is shown in FIG. The speech document recognition reliability high-speed calculation unit 90 is different from the word set acoustic reliability calculation unit 72 of the speech recognition apparatus 200 in that it includes a word set acquisition unit 91 and a word set acoustic reliability high-speed calculation unit 92.

単語集合取得手段９１は、音響信頼度の計算量を減らす目的で使用する直前重複フラグと直後重複フラグの２つのフラグを各単語に付与して単語集合を取得する。直前重複フラグとは、当該単語が直前の単語集合にも含まれるか否かを示す真偽値である。直後重複フラグとは、当該単語が直後の単語集合にも含まれるか否かを示す真偽値である。 The word set obtaining unit 91 obtains a word set by assigning each word with two flags, the immediately preceding duplicate flag and the immediately following duplicate flag used for the purpose of reducing the calculation amount of the acoustic reliability. The immediately preceding duplicate flag is a true / false value indicating whether or not the word is also included in the immediately preceding word set. The immediately following duplication flag is a true / false value indicating whether or not the word is also included in the immediately following word set.

単語集合取得手段９１は、音声認識部１０から入力される音声認識結果の単語w_kを、その先頭から所定の数n個ずつnより小さい数のm個移動させながらh個の単語集合に分割する（ステップＳ９１）。単語集合に分割する際、単語集合取得手段９１は、１番目の単語集合に追加する１番目からm番目の単語の直前重複フラグBFと直後重複フラグAFを偽、m+1番目からn番目の単語の直前重複フラグBFを偽及び直後重複フラグAFを真とし、Ｎ番目の単語集合に追加する１番目から（N-1）・m番目の単語の直前重複フラグを真及び直後重複フラグを偽、（N-1）・m+1番目からn+（N-1）・m番目の単語の直前重複フラグを偽及び直後重複フラグを真とする。 The word set acquisition unit 91 divides the word w _k of the speech recognition result input from the speech recognition unit 10 into h word sets while moving a predetermined number n from the beginning, m less than n. (Step S91). When the word set is divided into word sets, the word set acquisition unit 91 sets the immediately preceding duplicate flag BF and the immediately following duplicate flag AF of the first to mth words to be added to the first word set to false, and the (m + 1) th to nth The word immediately preceding duplicate flag BF is false and the immediately following duplicate flag AF is true, and the first to (N-1) -mth word immediately preceding duplicate flags to be added to the Nth word set are true and the immediately following duplicate flag is false. , (N−1) · m + 1 to n + (N−1) · m-th word are set to false, and the next duplicate flag is set to true.

単語集合音響信頼度高速計算手段９２は、重複区間記憶部９２０を備え、直後重複フラグBFが真の単語の単語認識信頼度にその単語の継続時間長を乗算して時間正規化を解除した値とその継続時間長を、重複区間記憶部９２０に記憶する。そして、直前重複フラグBFが偽で直後重複フラグAFが真の単語の単語認識信頼度にその単語の継続時間長を乗算して時間正規化を解除した値とその継続時間長と、重複区間記憶部９２０に記憶された値とから単語集合の音響信頼度を計算する（ステップＳ９２）。 The word set acoustic reliability high-speed calculation unit 92 includes an overlapping section storage unit 920, and a value obtained by canceling time normalization by multiplying the word recognition reliability of a true word by the word recognition reliability of the true word and the duration of the word. And the duration length thereof are stored in the overlapping section storage unit 920. Then, the value obtained by multiplying the word recognition reliability of the true word by the duration length of the word and canceling the time normalization, the duration length, and the overlap interval memory The acoustic reliability of the word set is calculated from the value stored in the unit 920 (step S92).

図１６に、単語集合N₁〜N_hと、各単語集合に付与された直前重複フラグBFと直後重複フラグAFの例を示す。図１６の横方向は経過時間である。横方向に重なり合う単語集合を縦方向にずらして表記している。 FIG. 16 shows an example of the word sets N _{1 to} N _h and the immediately preceding duplicate flag BF and the immediately following duplicate flag AF given to each word set. The horizontal direction in FIG. 16 is the elapsed time. The word sets that overlap in the horizontal direction are shown shifted in the vertical direction.

単語集合N₁の先頭からm個の単語は、前後の単語集合内の単語と重ならないので直前重複フラグBFと直後重複フラグAFは（BF：０，AF：０）である。単語集合N₁の先頭からｍ+1番目からｎ番目の単語は、直後の単語集合（N₂）の単語と重複するので直前重複フラグBFと直後重複フラグAFは（BF：０，AF：１）である。以降、BF：とAF：の表記を省略する場合もある。 The m words from the beginning of the word set N _1, the front and rear because the word does not overlap with the words in the set immediately before overlap flag BF and after overlap flag AF is (BF: 0: 0, AF ). Since the words m + 1 to n from the beginning of the word set N ₁ overlap with the words in the immediately following word set (N ₂ ), the immediately preceding duplicate flag BF and the immediately following duplicate flag AF are (BF: 0, AF: 1 ). Hereinafter, the notation of BF: and AF: may be omitted.

２番目の単語集合N₂の1番目からm番目の単語は直前の単語集合（N₁）の単語とだけ重複するので（１，０）、m+1番目からn番目の単語は直前と直後の両方の単語集合の単語と重複するので（１，１）、n+1番目からn+m番目の単語は直後の単語集合の単語のみと重複するので（０，１）である。３番目以降の単語集合も２番目の単語集合と同じ関係である。 Since the 1st to m-th words in the second word set N ₂ overlap only with the words in the previous word set (N ₁ ) (1,0), the m + 1-th to n-th words are immediately before and after (1, 1), the (n + 1) th to n + mth words overlap with only the words of the immediately following word set (0, 1). The third and subsequent word sets have the same relationship as the second word set.

図１５の例では、mをm<n/2としたので（１，１）の状態が存在するが、m=n/2とすると直前重複フラグBFと直後重複フラグAFは、N₁が（（０，０），（０，１））、N₂以降が（（１，０），（０，１））となる。つまり、単語集合取得手段９１は１番目から（N-1）・m番目の単語の直前重複フラグBFを真及び直後重複フラグAFを偽、（N-1）・m+1番目からn+（N-1）・m番目の単語の直前重複フラグBFを偽及び直後重複フラグAFを真とする。 In the example of FIG. 15, since m is set to m <n / 2, there is a state (1, 1). However, when m = n / 2, the immediately preceding duplicate flag BF and the immediately following duplicate flag AF have N ₁ ( (0, 0), (0, 1)), and after N ₂ are ((1, 0), (0, 1)). That is, the word set acquisition unit 91 sets the immediately preceding duplicate flag BF of the first to (N−1) · mth words to be true and the immediately following duplicate flag AF to be false, and (N−1) · m + 1 to n + (N -1)-The preceding duplicate flag BF of the mth word is false and the immediately following duplicate flag AF is true.

図１６に、図１５に示した単語集合を取得する単語集合取得手段９１の動作フローを示す。図１６は窓シフト量mをm<n/2とした例である。 FIG. 16 shows an operation flow of the word set acquisition unit 91 for acquiring the word set shown in FIG. FIG. 16 shows an example in which the window shift amount m is m <n / 2.

音声文書認識信頼度計算部９０が処理を開始すると、単語集合取得手段９１は単語集合を特定する変数N_kと単語w_nを特定する変数w_k、及びカウント値iをそれぞれN_k＝N₁，w_k＝w₁,i=1に初期化する（ステップＳ９１ａ）。そして、音声認識結果の先頭からm番目までの単語w₁〜w_mまでの単語w_kの直前重複フラグBFを０（偽）、直後重複フラグAFを０（偽）に設定する（ステップＳ９１ｂ〜Ｓ９１ｄ）。更に、単語w_m+1〜w_nまでの単語w_kの直前重複フラグBFを０（偽）、直後重複フラグAFを１（真）に設定（ステップＳ９１ｅ〜Ｓ９１ｇ）して単語w₁〜w_nまでを１番目の単語集合N₁として取得する。 When spoken document recognition reliability calculating unit 90 starts the process, the word set obtaining unit 91 variable w _k, and the count value i of each N _k = N ₁ specifying the variable N _k and words w _n for specifying a word set , W _k = w ₁ , i = 1 is initialized (step S91a). Then, 0 last overlap flag BF word w _k to word w ₁ to w _m from the beginning of the speech recognition result to m-th (false), the immediately overlap flag AF is set to 0 (false) (step S91b~ S91d). Further, the immediately preceding duplication flag BF of words w _k _{+ 1 to} w _{n is set} to 0 (false) and the immediately following duplication flag AF is set to 1 (true) (steps S91e to S91g), and the words w ₁ to w are set. Up to _n are acquired as the _first word set N ₁ .

次に、単語集合N_kをN_k=N_k+1（ステップＳ９１ｈ）として２番目の単語集合N₂を取得する。ここで、着目する単語を表す変数w_kを、先頭の単語w₁からmi個移動させたw_k=w_miとする（ステップＳ９１ｉ）。ここではi=1なのでw_k=w_mである。 Next, the second word set N ₂ is obtained with the word set N _k as N _k = N _{k + 1} (step S91h). Here, it is assumed that the variable w _k representing the word of interest is w _k = w _{mi that} is moved mi from the _first word w ₁ (step S91i). Here, since i = 1, w _k = w _m .

そこで、単語w_m+m、つまり、先頭の単語から2m個の単語までの直前重複フラグBFを１（真）、直後重複フラグAFを０（偽）に設定する（ステップＳ９１ｊ〜Ｓ９１ｍ）。そして、先頭の単語から2m+1個からn+m(i-1)個までの単語の直前重複フラグBFを１（真）、直後重複フラグAFを１（真）に設定する（ステップＳ９１ｎ〜Ｓ９１ｐ）。更に、先頭の単語からn+m(i-1)+1個目からn+mi個までの単語の直前重複フラグBFを０（偽）、直後重複フラグAFを１（真）に設定する（ステップＳ９１ｑ〜Ｓ９１ｓ）。このような処理によって、単語w_m〜w_ｎ+ｍは、その単語の直前重複フラグBFと直後重複フラグAFが図１５に示すように設定され、単語集合N₂として取得される。 Thus, the word w _{m + m} , that is, the immediately preceding overlap flag BF from the first word to 2m words is set to 1 (true), and the immediately after overlap flag AF is set to 0 (false) (steps S91j to S91m). Then, the immediately preceding duplication flag BF is set to 1 (true) and the immediately following duplication flag AF is set to 1 (true) for 2m + 1 to n + m (i-1) words from the first word (steps S91n to S91n). S91p). Further, the immediately preceding overlap flag BF is set to 0 (false) and the immediately following overlap flag AF is set to 1 (true) for the n + m (i-1) +1 to n + mi words from the first word ( Steps S91q to S91s). By such processing, the words w _{m to} w _{n + m} are acquired as the word set N ₂ with the immediately preceding overlapping flag BF and the immediately following overlapping flag AF of the word set as shown in FIG.

そして、先頭からn+mi+1個目の単語が存在すれば、カウント値iをインクリメント（ステップＳ９１ｕ）、単語集合を現す変数N_kもN_k=N_k+1とインクリメント（ステップＳ９１ｖ）され、ステップＳ９１ｉの処理に戻る。 If the n + mi + 1-th word from the head exists, the count value i is incremented (step S91u), and the variable N _{k representing the} word set is also incremented to N _k = N _{k + 1} (step S91v). The process returns to step S91i.

以上の処理（ステップＳ９１ｉ〜Ｓ９１ｖ）は、音声認識結果の最後の単語になるまで繰り返される（ステップＳ９１ｔのｎｏ）。その結果、音声認識結果の単語は、その先頭から所定の数n個ずつnより小さい数のm個移動させながらｈ個の単語集合に分割される。そして、各単語集合には、図１５に示すように直前重複フラグBFと直後重複フラグAFとが付与される。 The above processing (steps S91i to S91v) is repeated until the last word of the speech recognition result is reached (no in step S91t). As a result, the words of the speech recognition result are divided into h word sets while moving a predetermined number n from the beginning by m, which is smaller than n. Then, as shown in FIG. 15, the immediately preceding overlap flag BF and the immediately following overlap flag AF are assigned to each word set.

単語集合音響信頼度高速計算手段９２は、重複区間記憶部９２０を備え、直後重複フラグAFが真の単語の単語信頼度を２度計算しないようにしたものである。図１８に示す単語集合音響信頼度高速計算手段９２の動作フローを参照して説明する。 The word set acoustic reliability high-speed calculation means 92 includes an overlapping section storage unit 920, and does not calculate the word reliability of a word whose true overlap flag AF is true twice. The operation will be described with reference to the operation flow of the word set acoustic reliability high speed calculation means 92 shown in FIG.

単語集合音響信頼度高速計算手段９２は、最初に単語集合を特定する変数N_kと単語w_nを特定する変数w_kを、それぞれN_k＝N₁，w_k＝w₁に初期化する（ステップＳ９２ａ）。そして、単語集合音響信頼度高速計算手段９２は、単語（N₁，w₁）〜（N₁，w_n）の単語認識信頼度にその単語の継続時間長を乗算して時間正規化を解除した単語信頼度を計算（ステップＳ９２ｂ）し、その単語信頼度wc_nを累積する（ステップＳ９２ｃ）。と共に、単語の継続時間長wd_nも累積する（ステップＳ９２ｄ）。この処理は、単語集合N₁の最後の単語w_nになるまで繰り返される（ステップＳ９２ｉ）。 Word set sound reliability fast computation means 92, first the variable w _k for specifying the variable N _k and words w _n for specifying a word set is initialized to _{_{_{N k = N 1, w k}}} = w 1 , respectively ( Step S92a). Then, the word set acoustic reliability fast calculation means 92 multiplies the word recognition reliability of the words (N ₁ , w ₁ ) to (N ₁ , w _n ) by the duration time of the word to cancel the time normalization. Get words reliability (step S92b) and, accumulating the word confidence wc _n (step S92c). At the same time, the word duration time wd _n is also accumulated (step S92d). This process is repeated until the last word w _n of the word set N ₁ is reached (step S92i).

そして、単語集合N₁の先頭からm個目の単語w_mから最後の単語w_nまでの単語の直後重複フラグAFが１（真）となるので、その単語w_m〜w_nまでの単語信頼度wc_nはBackDupDMに保存される（ステップＳ９２ｆ）。また、単語w_m〜w_nの継続時間長wd_nはBackDupWDに保存される（ステップＳ９２ｇ）。BackDupDMとBackDupWDは、例えばキュー（queue）と呼ばれるデータ構造で単語信頼度wc_ｋと継続時間長wd_ｋを保存するものとする。 Then, since the duplication flag AF immediately after the word from the _m-th word w _m to the last word w _n of the word set N ₁ is 1 (true), the word trust for the words w _{m to} w _n degrees wc _n is stored in BackDupDM (step S92f). Also, duration wd _n words w _m to w _n are stored in BackDupWD (Step S92G). BackDupDM and BackDupWD store the word reliability wc _k and the duration time wd _k in a data structure called a queue, for example.

最後の単語w_nまでの単語信頼度の計算が終わる（ステップＳ９２ｉのyes）と、単語集合N₁の音響信頼度を、単語信頼度wc_kの累積値S（wc_k）を継続時間長の累積値S（wd_k）で除して計算する（ステップＳ９２ｉ）。 When the calculation of the word reliability up to the last word w _n is finished (yes in step S92i), the acoustic reliability of the word set N ₁ and the cumulative value S (wc _k ) of the word reliability wc _k are set as the duration length. Calculation is performed by dividing by the cumulative value S (wd _k ) (step S92i).

２番目の単語集合N₂の先頭からn-ｍ番目、つまり先頭の単語からn個目の単語wm〜wnは、直前の単語集合N₁と重複しているので既に計算済みである。よって、BackDupDMに保存されている計算済みの単語信頼度をPreDMにコピーする（ステップＳ９２ｍ）と共にBackDupWDに保存されている継続時間長をPreWDにコピーする（ステップＳ９２ｎ）。そして、単語集合N₁で単語信頼度を計算していない直後重複フラグBＦが０（偽）の単語w_n+1〜w_n+miの正規化を解除して単語信頼度と継続時間長の累積を計算する（ステップＳ９２ｐとＳ９２ｑ）。なお、正規化解除のステップは作図の都合により表記を省略している。 N-m-th from the second head of the word set N _2, i.e. the top of n-th word wm~wn from the word is already computed so overlaps with the immediately preceding word set N _1. Therefore, the calculated word reliability stored in BackDupDM is copied to PreDM (step S92m), and the duration length stored in BackDupWD is copied to PreWD (step S92n). Then, duplicate flag BF immediately after the word set N ₁ does not calculate the word reliability is 0 word w _{_{n +} 1} ~w _{n +} _mi to release the normalized word reliability and duration of the (false) The accumulation is calculated (steps S92p and S92q). Note that the denormalization step is omitted for convenience of drawing.

新たに計算される直後重複フラグBＦが０（偽）の単語w_n+1〜w_n+miの単語信頼度wc_ｋと継続時間長wd_ｋは、BackDupDMとBackDupWDに保存される（ステップＳ９２ｒ）。BackDupDMとBackDupWDは、古い順番にデータが消去される構造なので最新の単語w_2m〜w_n+mの単語信頼度wc_kと継続時間長wd_kが保存される。音声認識結果の先頭の単語からn個目以降の単語（w_n+1〜）では、直前重複フラグBF=0の単語についてのみ正規化を解除する計算と継続時間長の累積計算を行い、新たに計算した値をキュー（BackDupDM,BackDupWD）に保存する(図１６参照)。 The word reliability wc _k and the duration time wd _k of the words wn _{+ 1 to} wn _{+ mi} whose duplicate flag BF is 0 (false) immediately after being newly calculated are stored in BackDupDM and BackDupWD (step S92r). . Since BackDupDM and BackDupWD are structures in which data is erased in the oldest order, the word reliability wc _k and duration length wd _{k of the} latest words w _{2m to} w _{n + m} are stored. For the nth and subsequent words (w _{n + 1} ) from the first word in the speech recognition result, only the word with the previous duplication flag BF = 0 is denormalized and the cumulative duration is calculated. The calculated values are stored in queues (BackDupDM, BackDupWD) (see FIG. 16).

単語集合N₂の最後の単語w_n+miまで処理が終了する（ステップＳ９２ｔのyes）と、単語集合N₂の音響信頼度をBackDupDMに保存されている単語信頼度の和に新たに計算した単語信頼度を加えた累積値を、BackDupWDに保存されている継続時間長の和に新たに計算した継続時間長を加えた値で除して、単語集合N₂の音響信頼度を計算する（ステップＳ９２ｕ）。 Last word w _{n + mi} to process word set N ₂ is completed (step yes in S92t), newly computed sum word confidence stored acoustic confidence word set N ₂ to BackDupDM Calculate the acoustic reliability of the word set N ₂ by dividing the cumulative value including the word reliability by the value obtained by adding the newly calculated duration to the sum of the durations stored in BackDupWD ( Step S92u).

そして、カウント値ｉがインクリメント（ステップＳ９２ｘ）され、着目する単語集合N_kもインクリトされN₃以降の単語集合に対して同じ処理が、音声認識結果の単語が無くなるまで繰り返される（ステップＳ９２ｗのyes）。このように、直前の単語集合N_kで計算済みの単語集合と継続時間長は、コピーして計算されるので２度計算することが無い。 Then, the count value i is incremented (step S92x), the word set N _{k of} interest is incremented, and the same processing is repeated for the word sets after N ₃ until there are no words in the speech recognition result (yes in step S92w). ). In this way, the word set already calculated in the immediately preceding word set N _k and the duration length are copied and calculated, and thus are not calculated twice.

つまり、単語集合音響信頼度高速計算手段９２は、重複区間記憶部９２０に直後重複フラグBFが真の単語の単語認識信頼度D(w_k)にその単語の継続時間長wd_kを乗算して時間正規化を解除した値wc_kとその継続時間長wd_kを記憶し、直前重複フラグAFが偽で直後重複フラグBFが真の単語の単語認識信頼度D(w_k)にその単語w_kの継続時間長を乗算して時間正規化を解除した値wc_kとその継続時間長wd_kと、重複区間記憶部９２０に記憶された値とから単語集合w_kの音響信頼度を計算する。よって、実施例１と２の音声認識装置よりも高速に音響信頼度を計算することができる。 That is, the word set acoustic reliability high-speed calculation unit 92 multiplies the word recognition reliability D (w _k ) of a word whose true overlap flag BF is true by the overlap interval storage unit 920 by the duration duration wd _k of the word. Stores the value wc _k with the time normalization canceled and its duration wd _k, and sets the word w _k to the word recognition reliability D (w _k ) of the word with the immediately preceding duplicate flag AF being false and the immediately following duplicate flag BF being true. The acoustic reliability of the word set w _k is calculated from the value wc _{k obtained} by canceling the time normalization by multiplying the continuous time length, the continuous time length wd _k, and the value stored in the overlapping section storage unit 920. Therefore, the acoustic reliability can be calculated at a higher speed than the speech recognition apparatuses according to the first and second embodiments.

なお、直前重複フラグBFと直後重複フラグAFを用いて音響信頼度を高速に計算する例を説明したが、その処理方法は上記した例に限定されない。例えば、図１８に示した例に加えて、重複区間の単語信頼度の合計と重複区間の継続時間長の合計を、それぞれ記憶して置くようにしても良い。そのようにすれば、音響信頼度を計算する度（ステップＳ９２ｕ）に行うキューに記憶された単語信頼度と継続時間長の累積値を計算する処理も削減することが可能である。 Although an example in which the acoustic reliability is calculated at high speed using the immediately preceding overlap flag BF and the immediately following overlap flag AF has been described, the processing method is not limited to the above-described example. For example, in addition to the example shown in FIG. 18, the total word reliability of overlapping sections and the total duration time of overlapping sections may be stored. By doing so, it is possible to reduce processing for calculating the cumulative value of the word reliability and the duration length stored in the queue every time the acoustic reliability is calculated (step S92u).

上記装置における処理手段をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、各装置における処理手段がコンピュータ上で実現される。 When the processing means in the above apparatus is realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. Then, by executing this program on the computer, the processing means in each apparatus is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ（Random Access Memory）、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）/ＲＷ（ReWritable）等を、光磁気記録媒体として、ＭＯ（Magneto Optical disc）等を、半導体メモリとしてＥＥＰ−ＲＯＭ（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used. Specifically, for example, as a magnetic recording device, a hard disk device, a flexible disk, a magnetic tape or the like, and as an optical disk, a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only). Memory), CD-R (Recordable) / RW (ReWritable), etc., magneto-optical recording medium, MO (Magneto Optical disc), etc., semiconductor memory, EEP-ROM (Electronically Erasable and Programmable-Read Only Memory), etc. Can be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記録装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a recording device of a server computer and transferring the program from the server computer to another computer via a network.

また、各手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Each means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

Claims

A speech recognition process for outputting a speech recognition result to which word recognition reliability is given for each word obtained by performing speech recognition processing on an input speech document;
A speech recognition reliability calculation process for calculating and outputting a speech document recognition reliability, which is a recognition reliability of the entire speech document, using the speech recognition result as an input;
An audio document removal process of removing an audio document having a threshold value less than a predetermined audio document recognition reliability threshold by inputting the audio recognition result and the audio document recognition reliability;
A word removal process for removing words having a word recognition reliability less than a predetermined word recognition reliability threshold from the voice recognition result of the voice document that has not been removed in the voice document removal process;
A speech recognition method including:

The speech recognition method according to claim 1,
The above speech recognition reliability calculation process is as follows:
A word set acquisition step of dividing the word of the speech recognition result into a word set while moving a predetermined number n from the beginning by m, which is a number smaller than n;
Multiply the word recognition reliability by the duration of the word and cancel the time normalization, add the sum, and divide the sum by the sum of the durations of all words in the word set. A word set sound reliability calculation step to be obtained as a set sound reliability;
By referring to a word association degree table representing the degree of association between two words among all the combinations between words included in the speech recognition result, an average value of association strengths of the combinations of two words included in the word set is obtained. A word set context reliability calculation step to obtain as word set context reliability,
A reliability integration step of obtaining a value obtained by averaging the word set acoustic reliability and the word set context reliability of the entire voice document as a voice document recognition reliability;
A speech recognition method comprising:

The speech recognition method according to claim 1,
The above speech recognition reliability calculation process is as follows:
The word of the speech recognition result is divided into a word set while moving a predetermined number n from the beginning by m less than n, and immediately before the first to mth words to be added to the first word set The duplicate flag and the immediately following duplicate flag are set to false, the immediately preceding duplicate flag of the first to N · mth words to be added to the Nth word set is true and the immediately following duplicate flag is false, and the N · m + 1 to n + N A word set acquisition step in which the immediately preceding duplicate flag of the mth word is false and the immediately following duplicate flag is true;
The overlapping section storage step for storing the value obtained by multiplying the word recognition reliability of the word for which the duplication flag is true by the duration length of the word and canceling the time normalization, and the duration length, and
A value obtained by multiplying the word recognition reliability of the word whose previous duplicate flag is false and the true duplicate flag is true by the duration length of the word and canceling the time normalization, the duration duration, and the overlapping section storing step A word set acoustic reliability fast calculation step for calculating the acoustic reliability of the word set from the value stored in
A speech recognition method comprising:

A speech recognition unit that outputs a speech recognition result to which word recognition reliability is given for each word obtained by performing speech recognition processing on the input speech document;
A speech recognition reliability calculation unit that calculates and outputs a speech document recognition reliability that is a recognition reliability of the entire speech document by using the speech recognition result as an input;
A voice document removal unit that removes a voice document that is less than a predetermined voice document recognition reliability threshold by inputting the voice recognition result and the voice document recognition reliability;
A word removal unit that removes words having a word recognition reliability less than a predetermined word recognition reliability threshold from the voice recognition result of the voice document that has not been removed by the voice document removal unit;
A speech recognition apparatus comprising:

A speech recognition method program for causing a computer to execute the speech recognition method according to claim 1.