JP5325176B2 - 2-channel speech recognition method, apparatus and program thereof - Google Patents


Info

Publication number: JP5325176B2
Application number: JP2010162629A
Authority: JP (Japan)
Prior art keywords: speech recognition, speech, word, channel, recognition result
Legal status: Expired - Fee Related (assumed status, not a legal conclusion)
Other versions: JP2012027065A (Japanese)
Inventors: 太一 浅見, 済央 野本, 哲 小橋川, 浩和 政瀧
Applicant and assignee: Nippon Telegraph and Telephone Corp
Priority: JP2010162629A
Publication of application: JP2012027065A
Publication of grant: JP5325176B2

Description

The present invention relates to a speech recognition method, and to an apparatus and program therefor, for recognizing two-channel speech consisting of customer-side speech and operator-side speech, such as the calls handled at a call center.

Conventionally, when the two channels of customer-side speech (hereinafter, transmitting-side speech) and operator-side speech (hereinafter, receiving-side speech) are recognized, each channel has been recognized separately, and each recognition result has been assigned a recognition confidence indicating how certain that result is.

As a conventional technique for computing this recognition confidence, Patent Document 1, for example, proposes taking the confidence of the word w1 ranked first among the N-best candidates found by the recognition search to be the score difference between w1 and the word w2 that differs from w1 and is ranked second or lower, normalized by the duration of w1.

As another method for computing recognition confidence, there is a method that measures the strength of association between the words in a recognition result, assigning high confidence to words strongly related to their surrounding words and low confidence to weakly related words (Non-Patent Document 1). This method takes from the recognition result the set N(w) of n words consisting of the word w, the k words immediately preceding it, and the one word immediately following it. For every pair of words (wi, wj) in N(w), the strength of association S(wi, wj) is computed using the mutual information MI(wi, wj) calculated in advance on a training corpus. The context consistency measure SC(w) is then computed as the average of S(w, t) over all words t in the word set N(w).
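The prior-art measure can be illustrated with a minimal Python sketch (not code from the patent; the words and MI values in the toy table are hypothetical stand-ins for a corpus-trained MI(wi, wj)):

```python
# Toy stand-in for the corpus-trained mutual information MI(wi, wj);
# the words and values here are hypothetical.
MI = {
    frozenset(("account", "balance")): 2.1,
    frozenset(("account", "bank")): 1.8,
    frozenset(("balance", "bank")): 1.5,
}

def relation_strength(wi, wj):
    # S(wi, wj): relation strength derived from MI; 0 for unseen pairs.
    return MI.get(frozenset((wi, wj)), 0.0)

def context_consistency(words, i, k=2):
    # N(w): the word w = words[i], the k words before it and the 1 word after.
    w = words[i]
    n_w = words[max(0, i - k): i + 2]
    others = [t for t in n_w if t != w]
    # SC(w): average relation strength between w and the other words of N(w).
    return sum(relation_strength(w, t) for t in others) / len(others)
```

A recognized word surrounded by strongly related words thus receives a high SC value, and a likely misrecognition a low one.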

JP 2005-148342 A

D. Inkpen, A. Desilets, "Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts," Proceedings of HLT/EMNLP, pp. 49-56, October 2005.

In general, transmitting-side speech is sent from a wide variety of acoustic environments, so its quality varies greatly. Receiving-side speech, by contrast, is conversation in a comparatively quiet office, so its quality is good. When two channels that differ in quality like this are recognized separately and assigned confidences, the confidence of the transmitting-side recognition result may be rated worse than its actual reliability warrants.

The present invention was made in view of this problem, and its object is to provide a two-channel speech recognition method, and an apparatus and program therefor, that can assign an appropriate recognition confidence to the transmitting side.

The two-channel speech recognition method of the present invention includes a speech recognition step and a recognition confidence calculation step. The speech recognition step takes the transmitting-side speech and the receiving-side speech as input and outputs a transmitting-side recognition result and a receiving-side recognition result in which each recognized word is given a word recognition confidence. The recognition confidence calculation step takes the two recognition results as input and, referring to a word relevance table that gives the degree of relevance between every pair of words appearing in the recognition results, computes an in-channel context consistency measure for each recognition result and an inter-channel context consistency measure between the transmitting-side and receiving-side recognition results, then computes and outputs the weighted sum of the transmitting-side in-channel measure and the inter-channel measure as the transmitting-side recognition confidence.

In the calls exchanged at a call center and the like, the transmitting-side and receiving-side speech often share common or related words. The inter-channel context consistency measure of this invention, which focuses on word co-occurrence between the transmitting-side and receiving-side speech, takes a large value when the two utterances are strongly related. The two-channel speech recognition method of this invention, which computes the weighted sum of this inter-channel measure and the transmitting-side in-channel measure as the transmitting-side recognition confidence, can therefore assign the transmitting-side confidence appropriately.

FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention.
FIG. 2 shows the operation flow of the speech recognition apparatus 100.
FIG. 3 illustrates N-best candidates and word recognition confidence.
FIG. 4 shows a functional configuration example of the word relevance table creation apparatus 150.
FIG. 5 conceptually shows word sets.
FIG. 6 shows an example of the word relevance table.
FIG. 7 shows a functional configuration example of the recognition confidence calculation unit 30.
FIG. 8 shows a functional configuration example of the speech recognition apparatus 200 of the present invention.
FIG. 9 shows a functional configuration example of the weight calculation unit 80.

Embodiments of the present invention are described below with reference to the drawings. The same reference numerals are given to the same components across the drawings, and their description is not repeated.

FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The speech recognition apparatus 100 comprises a speech recognition unit 20, a recognition confidence calculation unit 30, and a word relevance table 40. The functions of each part of the apparatus are realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The speech recognition unit 20 takes the call speech exchanged between a customer and an operator at a call center or the like, consisting of transmitting-side speech and receiving-side speech, and outputs a transmitting-side recognition result and a receiving-side recognition result in which each recognized word is given a word recognition confidence (step S20). Using an internal acoustic analysis unit (not shown), the speech recognition unit 20 analyzes the transmitting-side and receiving-side speech into an acoustic feature parameter sequence such as LPC cepstrum or MFCC, in units of frames of several tens of milliseconds. It then searches for recognition result candidates for the input speech over the acoustic feature parameter sequence using a dictionary and a language model. As a result of the search, the top N candidates (the N-best candidates) are output as the recognition result together with word recognition confidences. When the call speech is a single recording in which the transmitting-side and receiving-side speech are mixed, a speech channel division unit 10 is provided that divides the call speech into the two channels.

Here, N-best candidates and word recognition confidence are explained with reference to FIG. 3. Both are conventional techniques; word recognition confidence is described, for example, in Patent Document 1.

The horizontal axis of FIG. 3 is elapsed time expressed in frames. The vertical axis lists the word string candidates found by the frame-by-frame search, arranged in descending order of score, i.e., the N-best candidates. The score is the likelihood computed during the search.

The word recognition confidence of a word w** (where * is an arbitrary integer) is given, when a different word exists among the N-best candidates at frame t*, by the score difference at frame t* between w** and the next-ranked competing word. In the example of FIG. 3, the first-rank candidate word w11 (the subscript 11 denotes the first word of the first candidate), found over the acoustic feature parameter sequence of frames t1 to t4, faces the third-rank candidate word w31 and the second-rank candidate word w21, so its word recognition confidence is the sum of the respective score differences (●) divided by the number of frames. For a word with no competing candidate, such as w13, a predetermined fixed value (○) is used as the word recognition confidence. These word recognition confidences are accumulated for each candidate to give the recognition confidence of the word string.
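The frame-wise score-gap idea can be sketched as follows (a minimal illustration under the assumptions above, not the patent's implementation; the per-frame score arrays are hypothetical inputs):

```python
def word_confidence(top_scores, rival_scores, default=0.5):
    # top_scores[t]: score of the rank-1 word at each of its frames.
    # rival_scores[t]: score of the best differing rival word at that frame,
    # or None where no rival candidate exists.
    gaps = [a - b for a, b in zip(top_scores, rival_scores) if b is not None]
    if not gaps:
        # No rival at any frame: a predetermined fixed value is used.
        return default
    # Total score gap divided by the number of frames of the word.
    return sum(gaps) / len(top_scores)
```

A wide, sustained gap to the rival candidates yields a high confidence; a word that barely outscores its rivals yields a low one.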

The recognition confidence calculation unit 30 takes the transmitting-side and receiving-side recognition results as input and, referring to the word relevance table 40 that gives the degree of relevance between every pair of words appearing in the recognition results, computes the in-channel context consistency measures of the transmitting-side and receiving-side recognition results and the inter-channel context consistency measure between them, then computes and outputs the weighted sum of the transmitting-side in-channel measure and the inter-channel measure as the transmitting-side recognition confidence (step S30).

Since the call speech is a dialogue between a customer and an operator, the customer's utterances are usually related to the content of the operator's utterances. Therefore, when the customer-side recognition result contains words related to the operator-side recognition result, that recognition result can be considered correct. The two-channel speech recognition method of this invention exploits the word co-occurrence that appears when two utterances are strongly related, and thereby improves the accuracy of the transmitting-side recognition confidence.

Here, the word relevance table creation apparatus 150 that creates the word relevance table is described.
[Word relevance table creation apparatus]
FIG. 4 shows a functional configuration example of the word relevance table creation apparatus 150. The apparatus comprises a learning corpus 151, a morphological analysis unit 152, a learning corpus word set acquisition unit 153, a word list 154, a word count unit 155, a word relevance calculation unit 156, and a table arrangement unit 157.

The learning corpus 151 is a large-scale collection of spoken-language documents. The morphological analysis unit 152 reads documents from the learning corpus 151, performs well-known morphological analysis to divide them into words, and outputs a word-segmented learning corpus in which a symbol marking a word boundary, such as "\n", is inserted before and after each word. Morphological analysis is well known and is described, for example, in Japanese Patent No. 3379643.

The learning corpus word set acquisition unit 153 slides a window of width n words with a shift of m words from the beginning to the end of the word-segmented learning corpus output by the morphological analysis unit 152, collects the words in each window that appear in the word list 154 into a word set, and outputs one word set per window. The word list 154, which is created in advance, lists every word that can appear in a recognition result. FIG. 5 shows the word sets conceptually: the horizontal direction is the passage of time, the word sets are denoted N1 to Nh, m is the window shift, and n is the window width. Adjacent word sets share n − m words.
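The windowing step can be sketched as follows (a minimal Python sketch; the function name and defaults are illustrative, not from the patent):

```python
def window_word_sets(tokens, word_list, n=10, m=5):
    # Slide a window of width n words with shift m over the token sequence;
    # for each window, keep the set of words that appear in the word list.
    word_sets = []
    for start in range(0, max(len(tokens) - n, 0) + 1, m):
        window = tokens[start:start + n]
        word_sets.append({w for w in window if w in word_list})
    return word_sets
```

With n = 4 and m = 2, adjacent windows overlap by n − m = 2 words, matching the sharing relation shown in FIG. 5.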

The word count unit 155 takes the word sets output by the learning corpus word set acquisition unit 153 and counts and outputs the single occurrence count C(w) of each word, the occurrence count C(wi, wj) of each word pair, and the total number of word sets. The occurrence count C(w) of a word w is the number of word sets that contain w; the occurrence count C(wi, wj) of a word pair (wi, wj) is the number of word sets that contain both wi and wj.
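The counting step can be sketched as follows (a minimal Python sketch under the definitions above; names are illustrative):

```python
from itertools import combinations
from collections import Counter

def count_occurrences(word_sets):
    # C(w): number of word sets containing w.
    # C(wi, wj): number of word sets containing both wi and wj.
    # Also returns N, the total number of word sets.
    single = Counter()
    pair = Counter()
    for s in word_sets:
        single.update(s)
        pair.update(frozenset(p) for p in combinations(sorted(s), 2))
    return single, pair, len(word_sets)
```

Because each word set is a set, a word counts at most once per window, which is exactly the "number of word sets containing w" definition.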

The word relevance calculation unit 156 calculates the relevance S(wi, wj) of each word pair (wi, wj), for example, by Equation (1).

S(wi, wj) = log( ( N · C(wi, wj) ) / ( C(wi) · C(wj) ) )    (1)

N is the total number of word sets, C(w) is the single occurrence count of the word w, and C(wi, wj) is the co-occurrence count of the words wi and wj. A large value of the relevance S(wi, wj) means the two words are strongly related. Instead of Equation (1), the Jaccard coefficient (Equation (2)), for example, may be used for the relevance S(wi, wj).
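A sketch of the relevance computation, assuming the pointwise-mutual-information reading of Equation (1) reconstructed here (the fallback value for never-co-occurring pairs is an assumption of this sketch):

```python
import math

def relevance_pmi(wi, wj, single, pair, n_sets):
    # Equation (1) as reconstructed here (a pointwise-mutual-information form):
    # S(wi, wj) = log( N * C(wi, wj) / (C(wi) * C(wj)) )
    c_ij = pair.get(frozenset((wi, wj)), 0)
    if c_ij == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log(n_sets * c_ij / (single[wi] * single[wj]))
```

The counts `single` and `pair` are the C(w) and C(wi, wj) tables produced by the word count unit.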

S(wi, wj) = C(wi, wj) / ( C(wi) + C(wj) − C(wi, wj) )    (2)

The Dice coefficient (Equation (3)) or the Simpson coefficient (Equation (4)) can also be used.
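The three alternative coefficients are all simple functions of the same counts; a sketch:

```python
def jaccard(c_i, c_j, c_ij):
    # Equation (2): C(wi, wj) / (C(wi) + C(wj) - C(wi, wj))
    return c_ij / (c_i + c_j - c_ij)

def dice(c_i, c_j, c_ij):
    # Equation (3): 2 * C(wi, wj) / (C(wi) + C(wj))
    return 2 * c_ij / (c_i + c_j)

def simpson(c_i, c_j, c_ij):
    # Equation (4): C(wi, wj) / min(C(wi), C(wj))
    return c_ij / min(c_i, c_j)
```

Unlike the logarithmic Equation (1), all three are bounded in [0, 1], which can simplify the later averaging.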

S(wi, wj) = 2 · C(wi, wj) / ( C(wi) + C(wj) )    (3)

S(wi, wj) = C(wi, wj) / min( C(wi), C(wj) )    (4)

The table arrangement unit 157 arranges the relevance values S(wi, wj) computed for the words wi and wj into a table that can be looked up. FIG. 6 shows an example of the word relevance table 40: the top row and leftmost column hold the words w1 to wN, and the relevance S(wi, wj) of each word pair is placed in the cell where the corresponding row and column intersect.

FIG. 7 shows a more specific functional configuration example of the recognition confidence calculation unit 30, which is now described in more detail. The recognition confidence calculation unit 30 comprises a receiving-side recognition result word set acquisition means 31, a transmitting-side recognition result word set acquisition means 32, an in-channel context consistency measure calculation means 33, an inter-channel context consistency measure calculation means 34, a receiving-side context consistency integration means 35, a transmitting-side context consistency integration means 36, and the word list 154.

Like the learning corpus word set acquisition unit 153 of the word relevance table creation apparatus 150, the receiving-side recognition result word set acquisition means 31 slides a window of width n words with a shift of m words from the beginning to the end of the receiving-side recognition result, collects the words in each window that appear in the word list 154 into a word set, and outputs a time-stamped word set per window. The word list 154 is the same as that of the word relevance table creation apparatus 150. The transmitting-side recognition result word set acquisition means 32 likewise takes the transmitting-side recognition result as input and outputs a time-stamped word set per window.

The in-channel context consistency measure calculation means 33 takes the time-stamped transmitting-side and receiving-side word sets NUi and NOi as input and computes the context consistency measure of each word set as its in-channel context consistency measure. The in-channel context consistency measure is the average of the relevance values S(wi, wj), obtained from the word relevance table 40, over all word pairs (wi, wj) in the word set (where wi ≠ wj).
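The averaging over all pairs can be sketched as follows (a minimal Python sketch; the `relevance` callback stands in for a lookup in the word relevance table 40, and the return value for sets with fewer than two words is an assumption):

```python
from itertools import combinations

def sc_in(word_set, relevance):
    # In-channel context consistency: the mean of S(wi, wj) over all pairs of
    # distinct words in the word set, looked up via `relevance`.
    pairs = list(combinations(sorted(word_set), 2))
    if not pairs:
        return 0.0  # a set with fewer than two words has no pairs
    return sum(relevance(wi, wj) for wi, wj in pairs) / len(pairs)
```

The same function serves both channels: applied to NUi it yields SCin(NUi), and applied to NOi it yields SCin(NOi).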

The in-channel context consistency measure calculation means 33 performs this computation for each of the transmitting-side and receiving-side word sets NUi and NOi, yielding the transmitting-side in-channel measure SCin(NUi) and the receiving-side in-channel measure SCin(NOi). These measures, which express the strength of relevance within each word set, are indices of the consistency of context within the transmitting-side and receiving-side utterances.

The inter-channel context consistency measure calculation means 34 takes the time-stamped transmitting-side and receiving-side word sets NUi and NOi as input and computes the inter-channel context consistency measure SCinter(NUi) by the following procedure.

First, the receiving-side word set NOi whose time stamp immediately precedes the time TUi attached to the transmitting-side word set NUi is obtained. If no receiving-side word set precedes TUi, the receiving-side word set with the earliest time stamp is used.

Next, the average relevance over all combinations of a word in that receiving-side word set NOi and a word in the transmitting-side word set under consideration is computed as the inter-channel context consistency measure SCinter(NUi). For each transmitting-side word set, SCinter(NUi) expresses the strength of word-level relevance to the immediately preceding receiving-side word set, and is an index of how strongly related the transmitting-side and receiving-side utterances are.
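The two steps can be sketched together as follows (a minimal Python sketch; the data layout of `receiver_sets` as (time, word list) pairs and the `relevance` callback are assumptions of this illustration):

```python
def sc_inter(sender_set, sender_time, receiver_sets, relevance):
    # receiver_sets: list of (time, word_list) pairs for the receiving side.
    # Step 1: pick the receiving-side set whose time immediately precedes
    # sender_time; if none precedes it, fall back to the earliest set.
    earlier = [ts for ts in receiver_sets if ts[0] < sender_time]
    if earlier:
        no_i = max(earlier, key=lambda ts: ts[0])[1]
    else:
        no_i = min(receiver_sets, key=lambda ts: ts[0])[1]
    # Step 2: average relevance over every (sender word, receiver word) pair.
    pairs = [(u, o) for u in sender_set for o in no_i]
    return sum(relevance(u, o) for u, o in pairs) / len(pairs) if pairs else 0.0
```

Pairing each transmitting-side set with the immediately preceding receiving-side set reflects the turn-taking of a call: the customer's words are checked against what the operator just said.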

The transmitting-side context consistency integration means 36 takes the transmitting-side in-channel measure SCin(NUi) output by the in-channel context consistency measure calculation means 33 and the inter-channel measure SCinter(NUi) output by the inter-channel context consistency measure calculation means 34, and computes and outputs the transmitting-side recognition confidence CU by Equation (5).

CU = (1/h) Σi=1..h [ (1 − α) · SCin(NUi) + α · SCinter(NUi) ]    (5)

h is the number of word sets and α is a weight. The weight α is an arbitrary value (0 < α < 1) set in advance, for example so as to maximize the correlation coefficient between recognition accuracy and the recognition confidence obtained by weighted addition of the transmitting-side and receiving-side recognition confidences computed on a development set that pairs actual call speech with manually created transcripts. Since the recognition accuracy of the receiving side is usually higher, α should be set to a larger value; this is expected to evaluate the transmitting-side recognition confidence more accurately.

In this way, the transmitting-side context consistency integration means 36 computes, as the transmitting-side recognition confidence, the weighted sum of the within-set relevance of the transmitting-side recognition result and the relevance between each transmitting-side word set and the immediately preceding receiving-side word set, averaged over the number of transmitting-side word sets. The transmitting-side recognition confidence CU takes a large value when the transmitting-side and receiving-side utterances are strongly related.
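The integration step can be sketched as follows (a minimal sketch; note that which of the two terms α multiplies is an assumption of this illustration, since the text only states a weighted sum with 0 < α < 1):

```python
def sender_confidence(sc_in_list, sc_inter_list, alpha):
    # Equation (5) as reconstructed in this sketch: for each transmitting-side
    # word set NU_i, a weighted sum of the in-channel and inter-channel
    # measures, averaged over the h word sets.
    h = len(sc_in_list)
    return sum((1 - alpha) * a + alpha * b
               for a, b in zip(sc_in_list, sc_inter_list)) / h
```

A larger α shifts trust toward the inter-channel term, i.e., toward agreement with the (usually more accurately recognized) receiving side.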

The receiving-side context consistency integration means 35 computes and outputs the receiving-side recognition confidence CO by averaging the receiving-side in-channel measure SCin(NOi), output by the in-channel context consistency measure calculation means 33, over the number of word sets NOi (Equation (6)).

CO = (1/h) Σi=1..h SCin(NOi)    (6)

The weight α may also be computed from the word recognition confidences.

FIG. 8 shows a functional configuration example of a speech recognition apparatus 200 provided with a weight calculation unit 80 that obtains the weight α from the word recognition confidences. The speech recognition apparatus 200 differs from the speech recognition apparatus 100 only in that it includes the weight calculation unit 80, which takes the transmitting-side and receiving-side recognition results output by the speech recognition unit 20 as input, computes the weight α, and supplies it to the transmitting-side context consistency integration means 36 of the recognition confidence calculation unit 30.

FIG. 9 shows a functional configuration example of the weight calculation unit 80, which comprises a receiving-side recognition score calculation means 81, a transmitting-side recognition score calculation means 82, and a sigmoid function operation means 83. The receiving-side recognition score calculation means 81 takes the receiving-side recognition result output by the speech recognition unit 20 and outputs the receiving-side recognition score PO, which is the sum of the word recognition confidences assigned to the words of the recognition result divided by the sum of the word durations.

The transmitting-side recognition score calculation means 82 outputs the transmitting-side recognition score PU, which is the sum of the word recognition confidences assigned to the words of the transmitting-side recognition result output by the speech recognition unit 20 divided by the sum of the word durations. The sigmoid function operation means 83 takes the receiving-side recognition score PO and the transmitting-side recognition score PU as input and computes the weight α by Equation (7).

α = 1 / ( 1 + exp( −g · (PO − PU) + d ) )    (7)

g is the gain constant of the weight α and d is the shift constant; both are values set in advance, for example the shift constant d in the range 0 to 50000 and the gain constant g in the range 1000 to 5000.

The weight calculation unit 80 exploits the fact that the difference between the receiving-side recognition score PO and the transmitting-side recognition score PU grows as the recognition accuracy of the receiving-side channel exceeds that of the transmitting-side channel, and outputs as the weight α the value obtained by converting this score difference into the range 0 to 1 with the sigmoid function.
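The conversion can be sketched as follows (a minimal sketch; the exact placement of the gain g and shift d inside the sigmoid is an assumption of this illustration):

```python
import math

def weight_alpha(po, pu, gain, shift):
    # Map the score difference PO - PU into (0, 1) with a sigmoid; larger
    # po - pu (receiving side more reliable) yields a larger alpha.
    return 1.0 / (1.0 + math.exp(-gain * (po - pu) + shift))
```

When the two channels score equally and the shift is zero, α = 0.5; as the receiving side pulls ahead, α approaches 1 and the inter-channel evidence dominates.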

上記装置における処理手段をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、各装置における処理手段がコンピュータ上で実現される。   When the processing means in the above apparatus is realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. Then, by executing this program on the computer, the processing means in each apparatus is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、DVD（Digital Versatile Disc）、DVD−RAM（Random Access Memory）、CD−ROM（Compact Disc Read Only Memory）、CD−R（Recordable）/RW（ReWritable）等を、光磁気記録媒体として、MO（Magneto Optical disc）等を、半導体メモリとしてEEP−ROM（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。   The program describing the processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) as the semiconductor memory.

また、このプログラムの流通は、例えば、そのプログラムを記録したDVD、CD−ROM等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記録装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。   The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a recording device of a server computer and transferring the program from the server computer to another computer via a network.

また、各手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。   Each means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

Claims (6)

送話側音声と受話側音声を入力としてそれぞれの音声を音声認識処理した単語毎に単語認識信頼度を付与した送話側音声認識結果と受話側音声認識結果を出力する音声認識過程と、
上記送話側音声認識結果と受話側音声認識結果を入力として音声認識結果の全ての単語間の組み合わせの2単語間の関連度を示す単語関連度テーブルを参照して上記それぞれの音声認識結果のチャネル内文脈一貫性尺度と、上記送話側音声認識結果と上記受話側音声認識結果との間のチャネル間文脈一貫性尺度とを求め、送話側チャネル内文脈一貫性尺度と上記チャネル間文脈一貫性尺度の重み付き和を送話側認識信頼度として計算して出力する認識信頼度計算過程と、
を含む2チャネル音声の音声認識方法。
A speech recognition process that takes transmitting-side speech and receiving-side speech as inputs and outputs a transmitting-side speech recognition result and a receiving-side speech recognition result in which a word recognition reliability is assigned to each word obtained by speech recognition of the respective speech; and
a recognition reliability calculation process that takes the transmitting-side speech recognition result and the receiving-side speech recognition result as inputs, obtains, by referring to a word association degree table indicating the degree of association between every pair of words in the speech recognition results, an intra-channel context consistency measure for each speech recognition result and an inter-channel context consistency measure between the transmitting-side speech recognition result and the receiving-side speech recognition result, and calculates and outputs a weighted sum of the transmitting-side intra-channel context consistency measure and the inter-channel context consistency measure as a transmitting-side recognition reliability;
A speech recognition method for two-channel speech including the above.
請求項1に記載した音声認識方法において、
上記認識信頼度計算過程は、
更に、受話側チャネル内文脈一貫性尺度の平均値を受話側認識信頼度として計算して出力する過程であることを特徴とする2チャネル音声の音声認識方法。
The speech recognition method according to claim 1, wherein the recognition reliability calculation process is further a process of calculating and outputting the average value of the receiving-side intra-channel context consistency measure as a receiving-side recognition reliability.
請求項2に記載した2チャネル音声の音声認識方法において、
上記重み付き和の重みは、上記送話側音声認識結果と上記受話側音声認識結果の各単語に付与された単語認識信頼度の総和を、各単語の継続時間長の総和で除した受話側音声認識スコアと送話側音声認識スコアの差をゲインとするシグモイド関数で求められた値であることを特徴とする2チャネル音声の音声認識方法。
The speech recognition method for two-channel speech according to claim 2, wherein the weight of the weighted sum is a value obtained by a sigmoid function applied to the difference between a receiving-side speech recognition score and a transmitting-side speech recognition score, each obtained by dividing the sum of the word recognition reliabilities assigned to the words of the corresponding speech recognition result by the sum of the durations of those words.
送話側音声と受話側音声を入力としてそれぞれの音声を音声認識処理した単語毎に単語認識信頼度を付与した送話側音声認識結果と受話側音声認識結果を出力する音声認識部と、
上記送話側音声認識結果と受話側音声認識結果を入力として音声認識結果の全ての単語間の組み合わせの2単語間の関連度を示す単語関連度テーブルを参照して上記それぞれの音声認識結果のチャネル内文脈一貫性尺度と、上記送話側音声認識結果と上記受話側音声認識結果との間のチャネル間文脈一貫性尺度とを求め、送話側チャネル内文脈一貫性尺度と上記チャネル間文脈一貫性尺度の重み付き和を送話側認識信頼度として計算して出力する認識信頼度計算部と、
を具備する2チャネル音声の音声認識装置。
A speech recognition unit that takes transmitting-side speech and receiving-side speech as inputs and outputs a transmitting-side speech recognition result and a receiving-side speech recognition result in which a word recognition reliability is assigned to each word obtained by speech recognition of the respective speech; and
a recognition reliability calculation unit that takes the transmitting-side speech recognition result and the receiving-side speech recognition result as inputs, obtains, by referring to a word association degree table indicating the degree of association between every pair of words in the speech recognition results, an intra-channel context consistency measure for each speech recognition result and an inter-channel context consistency measure between the transmitting-side speech recognition result and the receiving-side speech recognition result, and calculates and outputs a weighted sum of the transmitting-side intra-channel context consistency measure and the inter-channel context consistency measure as a transmitting-side recognition reliability;
A speech recognition apparatus for two-channel speech comprising the above.
請求項4に記載した音声認識装置において、
上記認識信頼度計算部は、
更に、受話側チャネル内文脈一貫性尺度の平均値を受話側認識信頼度として計算して出力するものであることを特徴とする2チャネル音声の音声認識装置。
The speech recognition apparatus according to claim 4, wherein the recognition reliability calculation unit further calculates and outputs the average value of the receiving-side intra-channel context consistency measure as a receiving-side recognition reliability.
請求項1乃至3の何れかに記載した2チャネル音声の音声認識方法を、コンピュータに実行させるための2チャネル音声の音声認識方法プログラム。   A program for causing a computer to execute the speech recognition method for two-channel speech according to any one of claims 1 to 3.
JP2010162629A 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof Expired - Fee Related JP5325176B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010162629A JP5325176B2 (en) 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2010162629A JP5325176B2 (en) 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof

Publications (2)

Publication Number Publication Date
JP2012027065A JP2012027065A (en) 2012-02-09
JP5325176B2 true JP5325176B2 (en) 2013-10-23

Family

ID=45780100

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2010162629A Expired - Fee Related JP5325176B2 (en) 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof

Country Status (1)

Country Link
JP (1) JP5325176B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9870765B2 (en) 2016-06-03 2018-01-16 International Business Machines Corporation Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3621922B2 (en) * 2001-02-01 2005-02-23 松下電器産業株式会社 Sentence recognition apparatus, sentence recognition method, program, and medium
JP4128342B2 (en) * 2001-07-19 2008-07-30 三菱電機株式会社 Dialog processing apparatus, dialog processing method, and program
EP1450350A1 (en) * 2003-02-20 2004-08-25 Sony International (Europe) GmbH Method for Recognizing Speech with attributes
JP2005010691A (en) * 2003-06-20 2005-01-13 P To Pa:Kk Apparatus and method for speech recognition, apparatus and method for conversation control, and program therefor
JP4734155B2 (en) * 2006-03-24 2011-07-27 株式会社東芝 Speech recognition apparatus, speech recognition method, and speech recognition program
JP5044783B2 (en) * 2007-01-23 2012-10-10 国立大学法人九州工業大学 Automatic answering apparatus and method

Also Published As

Publication number Publication date
JP2012027065A (en) 2012-02-09


Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20121101

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20130627

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20130709

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20130719

R150 Certificate of patent or registration of utility model

Ref document number: 5325176

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20130829

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees