JP5325176B2 - 2-channel speech recognition method, apparatus and program thereof - Google Patents


Info

Publication number: JP5325176B2
Application number: JP2010162629A
Authority: JP (Japan)
Prior art keywords: speech recognition, speech, word, channel, recognition result
Legal status: Expired - Fee Related (assumed status, not a legal conclusion)
Other versions: JP2012027065A (Japanese)
Inventors: 太一 浅見, 済央 野本, 哲 小橋川, 浩和 政瀧
Applicant and assignee: Nippon Telegraph and Telephone Corp
Priority: JP2010162629A
Publication of application: JP2012027065A
Publication of grant: JP5325176B2

Description

The present invention relates to a speech recognition method, and to an apparatus and program therefor, for recognizing two-channel speech consisting of customer-side speech and operator-side speech, such as the calls handled at a call center.

Conventionally, when the two channels of customer-side speech (hereinafter, transmitting-side speech) and operator-side speech (hereinafter, receiving-side speech) are recognized, each channel has been recognized separately, and each recognition result has been assigned a recognition confidence indicating how certain that result is.

As a conventional technique for computing this recognition confidence, Patent Document 1, for example, proposes taking the confidence of the word w1 ranked first among the N-best candidates found by the recognition search to be the score difference between w1 and the word w2 that differs from w1 and is ranked second or lower, normalized by the duration of w1.

As another method for computing recognition confidence, there is a method that measures the strength of association between the words in a recognition result, assigning high confidence to words strongly related to their surrounding words and low confidence to weakly related words (Non-Patent Document 1). This method takes from the recognition result the set N(w) of n words consisting of the word w, the k words immediately preceding it, and the one word immediately following it. For every pair of words (wi, wj) in N(w), the strength of association S(wi, wj) is computed using the mutual information MI(wi, wj) calculated in advance on a training corpus. The context consistency measure SC(w) is then computed as the average of S(w, t) over all words t in the word set N(w).
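The prior-art measure can be illustrated with a minimal Python sketch (not code from the patent; the words and MI values in the toy table are hypothetical stand-ins for a corpus-trained MI(wi, wj)):

```python
# Toy stand-in for the corpus-trained mutual information MI(wi, wj);
# the words and values here are hypothetical.
MI = {
    frozenset(("account", "balance")): 2.1,
    frozenset(("account", "bank")): 1.8,
    frozenset(("balance", "bank")): 1.5,
}

def relation_strength(wi, wj):
    # S(wi, wj): relation strength derived from MI; 0 for unseen pairs.
    return MI.get(frozenset((wi, wj)), 0.0)

def context_consistency(words, i, k=2):
    # N(w): the word w = words[i], the k words before it and the 1 word after.
    w = words[i]
    n_w = words[max(0, i - k): i + 2]
    others = [t for t in n_w if t != w]
    # SC(w): average relation strength between w and the other words of N(w).
    return sum(relation_strength(w, t) for t in others) / len(others)
```

A recognized word surrounded by strongly related words thus receives a high SC value, and a likely misrecognition a low one.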

JP 2005-148342 A

D. Inkpen, A. Desilets, "Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts," Proceedings of HLT/EMNLP, pp. 49-56, October 2005.

In general, transmitting-side speech is sent from a wide variety of acoustic environments, so its quality varies greatly. Receiving-side speech, by contrast, is conversation in a comparatively quiet office, so its quality is good. When two channels that differ in quality like this are recognized separately and assigned confidences, the confidence of the transmitting-side recognition result may be rated worse than its actual reliability warrants.

The present invention was made in view of this problem, and its object is to provide a two-channel speech recognition method, and an apparatus and program therefor, that can assign an appropriate recognition confidence to the transmitting side.

The two-channel speech recognition method of the present invention includes a speech recognition step and a recognition confidence calculation step. The speech recognition step takes the transmitting-side speech and the receiving-side speech as input and outputs a transmitting-side recognition result and a receiving-side recognition result in which each recognized word is given a word recognition confidence. The recognition confidence calculation step takes the two recognition results as input and, referring to a word relevance table that gives the degree of relevance between every pair of words appearing in the recognition results, computes an in-channel context consistency measure for each recognition result and an inter-channel context consistency measure between the transmitting-side and receiving-side recognition results, then computes and outputs the weighted sum of the transmitting-side in-channel measure and the inter-channel measure as the transmitting-side recognition confidence.

In the calls exchanged at a call center and the like, the transmitting-side and receiving-side speech often share common or related words. The inter-channel context consistency measure of this invention, which focuses on word co-occurrence between the transmitting-side and receiving-side speech, takes a large value when the two utterances are strongly related. The two-channel speech recognition method of this invention, which computes the weighted sum of this inter-channel measure and the transmitting-side in-channel measure as the transmitting-side recognition confidence, can therefore assign the transmitting-side confidence appropriately.

FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention.
FIG. 2 shows the operation flow of the speech recognition apparatus 100.
FIG. 3 illustrates N-best candidates and word recognition confidence.
FIG. 4 shows a functional configuration example of the word relevance table creation apparatus 150.
FIG. 5 conceptually shows word sets.
FIG. 6 shows an example of the word relevance table.
FIG. 7 shows a functional configuration example of the recognition confidence calculation unit 30.
FIG. 8 shows a functional configuration example of the speech recognition apparatus 200 of the present invention.
FIG. 9 shows a functional configuration example of the weight calculation unit 80.

Embodiments of the present invention are described below with reference to the drawings. The same reference numerals are given to the same components across the drawings, and their description is not repeated.

FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The speech recognition apparatus 100 comprises a speech recognition unit 20, a recognition confidence calculation unit 30, and a word relevance table 40. The functions of each part of the apparatus are realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The speech recognition unit 20 takes the call speech exchanged between a customer and an operator at a call center or the like, consisting of transmitting-side speech and receiving-side speech, and outputs a transmitting-side recognition result and a receiving-side recognition result in which each recognized word is given a word recognition confidence (step S20). Using an internal acoustic analysis unit (not shown), the speech recognition unit 20 analyzes the transmitting-side and receiving-side speech into an acoustic feature parameter sequence such as LPC cepstrum or MFCC, in units of frames of several tens of milliseconds. It then searches for recognition result candidates for the input speech over the acoustic feature parameter sequence using a dictionary and a language model. As a result of the search, the top N candidates (the N-best candidates) are output as the recognition result together with word recognition confidences. When the call speech is a single recording in which the transmitting-side and receiving-side speech are mixed, a speech channel division unit 10 is provided that divides the call speech into the two channels.

Here, N-best candidates and word recognition confidence are explained with reference to FIG. 3. Both are conventional techniques; word recognition confidence is described, for example, in Patent Document 1.

The horizontal axis of FIG. 3 is elapsed time expressed in frames. The vertical axis lists the word string candidates found by the frame-by-frame search, arranged in descending order of score, i.e., the N-best candidates. The score is the likelihood computed during the search.

The word recognition confidence of a word w** (where * is an arbitrary integer) is given, when a different word exists among the N-best candidates at frame t*, by the score difference at frame t* between w** and the next-ranked competing word. In the example of FIG. 3, the first-rank candidate word w11 (the subscript 11 denotes the first word of the first candidate), found over the acoustic feature parameter sequence of frames t1 to t4, faces the third-rank candidate word w31 and the second-rank candidate word w21, so its word recognition confidence is the sum of the respective score differences (●) divided by the number of frames. For a word with no competing candidate, such as w13, a predetermined fixed value (○) is used as the word recognition confidence. These word recognition confidences are accumulated for each candidate to give the recognition confidence of the word string.
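The frame-wise score-gap idea can be sketched as follows (a minimal illustration under the assumptions above, not the patent's implementation; the per-frame score arrays are hypothetical inputs):

```python
def word_confidence(top_scores, rival_scores, default=0.5):
    # top_scores[t]: score of the rank-1 word at each of its frames.
    # rival_scores[t]: score of the best differing rival word at that frame,
    # or None where no rival candidate exists.
    gaps = [a - b for a, b in zip(top_scores, rival_scores) if b is not None]
    if not gaps:
        # No rival at any frame: a predetermined fixed value is used.
        return default
    # Total score gap divided by the number of frames of the word.
    return sum(gaps) / len(top_scores)
```

A wide, sustained gap to the rival candidates yields a high confidence; a word that barely outscores its rivals yields a low one.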

The recognition confidence calculation unit 30 takes the transmitting-side and receiving-side recognition results as input and, referring to the word relevance table 40 that gives the degree of relevance between every pair of words appearing in the recognition results, computes the in-channel context consistency measures of the transmitting-side and receiving-side recognition results and the inter-channel context consistency measure between them, then computes and outputs the weighted sum of the transmitting-side in-channel measure and the inter-channel measure as the transmitting-side recognition confidence (step S30).

Since the call speech is a dialogue between a customer and an operator, the customer's utterances are usually related to the content of the operator's utterances. Therefore, when the customer-side recognition result contains words related to the operator-side recognition result, that recognition result can be considered correct. The two-channel speech recognition method of this invention exploits the word co-occurrence that appears when two utterances are strongly related, and thereby improves the accuracy of the transmitting-side recognition confidence.

Here, the word relevance table creation apparatus 150 that creates the word relevance table is described.
[Word relevance table creation apparatus]
FIG. 4 shows a functional configuration example of the word relevance table creation apparatus 150. The apparatus comprises a learning corpus 151, a morphological analysis unit 152, a learning corpus word set acquisition unit 153, a word list 154, a word count unit 155, a word relevance calculation unit 156, and a table arrangement unit 157.

The learning corpus 151 is a large-scale collection of spoken-language documents. The morphological analysis unit 152 reads documents from the learning corpus 151, performs well-known morphological analysis to divide them into words, and outputs a word-segmented learning corpus in which a symbol marking a word boundary, such as "\n", is inserted before and after each word. Morphological analysis is well known and is described, for example, in Japanese Patent No. 3379643.

The learning corpus word set acquisition unit 153 slides a window of width n words with a shift of m words from the beginning to the end of the word-segmented learning corpus output by the morphological analysis unit 152, collects the words in each window that appear in the word list 154 into a word set, and outputs one word set per window. The word list 154, which is created in advance, lists every word that can appear in a recognition result. FIG. 5 shows the word sets conceptually: the horizontal direction is the passage of time, the word sets are denoted N1 to Nh, m is the window shift, and n is the window width. Adjacent word sets share n − m words.
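The windowing step can be sketched as follows (a minimal Python sketch; the function name and defaults are illustrative, not from the patent):

```python
def window_word_sets(tokens, word_list, n=10, m=5):
    # Slide a window of width n words with shift m over the token sequence;
    # for each window, keep the set of words that appear in the word list.
    word_sets = []
    for start in range(0, max(len(tokens) - n, 0) + 1, m):
        window = tokens[start:start + n]
        word_sets.append({w for w in window if w in word_list})
    return word_sets
```

With n = 4 and m = 2, adjacent windows overlap by n − m = 2 words, matching the sharing relation shown in FIG. 5.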

The word count unit 155 takes the word sets output by the learning corpus word set acquisition unit 153 and counts and outputs the single occurrence count C(w) of each word, the occurrence count C(wi, wj) of each word pair, and the total number of word sets. The occurrence count C(w) of a word w is the number of word sets that contain w; the occurrence count C(wi, wj) of a word pair (wi, wj) is the number of word sets that contain both wi and wj.
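The counting step can be sketched as follows (a minimal Python sketch under the definitions above; names are illustrative):

```python
from itertools import combinations
from collections import Counter

def count_occurrences(word_sets):
    # C(w): number of word sets containing w.
    # C(wi, wj): number of word sets containing both wi and wj.
    # Also returns N, the total number of word sets.
    single = Counter()
    pair = Counter()
    for s in word_sets:
        single.update(s)
        pair.update(frozenset(p) for p in combinations(sorted(s), 2))
    return single, pair, len(word_sets)
```

Because each word set is a set, a word counts at most once per window, which is exactly the "number of word sets containing w" definition.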

The word relevance calculation unit 156 calculates the relevance S(wi, wj) of each word pair (wi, wj), for example, by Equation (1).

S(wi, wj) = log( ( N · C(wi, wj) ) / ( C(wi) · C(wj) ) )    (1)

N is the total number of word sets, C(w) is the single occurrence count of the word w, and C(wi, wj) is the co-occurrence count of the words wi and wj. A large value of the relevance S(wi, wj) means the two words are strongly related. Instead of Equation (1), the Jaccard coefficient (Equation (2)), for example, may be used for the relevance S(wi, wj).
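A sketch of the relevance computation, assuming the pointwise-mutual-information reading of Equation (1) reconstructed here (the fallback value for never-co-occurring pairs is an assumption of this sketch):

```python
import math

def relevance_pmi(wi, wj, single, pair, n_sets):
    # Equation (1) as reconstructed here (a pointwise-mutual-information form):
    # S(wi, wj) = log( N * C(wi, wj) / (C(wi) * C(wj)) )
    c_ij = pair.get(frozenset((wi, wj)), 0)
    if c_ij == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log(n_sets * c_ij / (single[wi] * single[wj]))
```

The counts `single` and `pair` are the C(w) and C(wi, wj) tables produced by the word count unit.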

S(wi, wj) = C(wi, wj) / ( C(wi) + C(wj) − C(wi, wj) )    (2)

The Dice coefficient (Equation (3)) or the Simpson coefficient (Equation (4)) can also be used.
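The three alternative coefficients are all simple functions of the same counts; a sketch:

```python
def jaccard(c_i, c_j, c_ij):
    # Equation (2): C(wi, wj) / (C(wi) + C(wj) - C(wi, wj))
    return c_ij / (c_i + c_j - c_ij)

def dice(c_i, c_j, c_ij):
    # Equation (3): 2 * C(wi, wj) / (C(wi) + C(wj))
    return 2 * c_ij / (c_i + c_j)

def simpson(c_i, c_j, c_ij):
    # Equation (4): C(wi, wj) / min(C(wi), C(wj))
    return c_ij / min(c_i, c_j)
```

Unlike the logarithmic Equation (1), all three are bounded in [0, 1], which can simplify the later averaging.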

S(wi, wj) = 2 · C(wi, wj) / ( C(wi) + C(wj) )    (3)

S(wi, wj) = C(wi, wj) / min( C(wi), C(wj) )    (4)

The table arrangement unit 157 arranges the relevance values S(wi, wj) computed for the words wi and wj into a table that can be looked up. FIG. 6 shows an example of the word relevance table 40: the top row and leftmost column hold the words w1 to wN, and the relevance S(wi, wj) of each word pair is placed in the cell where the corresponding row and column intersect.

FIG. 7 shows a more specific functional configuration example of the recognition confidence calculation unit 30, which is now described in more detail. The recognition confidence calculation unit 30 comprises a receiving-side recognition result word set acquisition means 31, a transmitting-side recognition result word set acquisition means 32, an in-channel context consistency measure calculation means 33, an inter-channel context consistency measure calculation means 34, a receiving-side context consistency integration means 35, a transmitting-side context consistency integration means 36, and the word list 154.

Like the learning corpus word set acquisition unit 153 of the word relevance table creation apparatus 150, the receiving-side recognition result word set acquisition means 31 slides a window of width n words with a shift of m words from the beginning to the end of the receiving-side recognition result, collects the words in each window that appear in the word list 154 into a word set, and outputs a time-stamped word set per window. The word list 154 is the same as that of the word relevance table creation apparatus 150. The transmitting-side recognition result word set acquisition means 32 likewise takes the transmitting-side recognition result as input and outputs a time-stamped word set per window.

The in-channel context consistency measure calculation means 33 takes the time-stamped transmitting-side and receiving-side word sets NUi and NOi as input and computes the context consistency measure of each word set as its in-channel context consistency measure. The in-channel context consistency measure is the average of the relevance values S(wi, wj), obtained from the word relevance table 40, over all word pairs (wi, wj) in the word set (where wi ≠ wj).
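The averaging over all pairs can be sketched as follows (a minimal Python sketch; the `relevance` callback stands in for a lookup in the word relevance table 40, and the return value for sets with fewer than two words is an assumption):

```python
from itertools import combinations

def sc_in(word_set, relevance):
    # In-channel context consistency: the mean of S(wi, wj) over all pairs of
    # distinct words in the word set, looked up via `relevance`.
    pairs = list(combinations(sorted(word_set), 2))
    if not pairs:
        return 0.0  # a set with fewer than two words has no pairs
    return sum(relevance(wi, wj) for wi, wj in pairs) / len(pairs)
```

The same function serves both channels: applied to NUi it yields SCin(NUi), and applied to NOi it yields SCin(NOi).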

The in-channel context consistency measure calculation means 33 performs this computation for each of the transmitting-side and receiving-side word sets NUi and NOi, yielding the transmitting-side in-channel measure SCin(NUi) and the receiving-side in-channel measure SCin(NOi). These measures, which express the strength of relevance within each word set, are indices of the consistency of context within the transmitting-side and receiving-side utterances.

The inter-channel context consistency measure calculation means 34 takes the time-stamped transmitting-side and receiving-side word sets NUi and NOi as input and computes the inter-channel context consistency measure SCinter(NUi) by the following procedure.

First, the receiving-side word set NOi whose time stamp immediately precedes the time TUi attached to the transmitting-side word set NUi is obtained. If no receiving-side word set precedes TUi, the receiving-side word set with the earliest time stamp is used.

Next, the average relevance over all combinations of a word in that receiving-side word set NOi and a word in the transmitting-side word set under consideration is computed as the inter-channel context consistency measure SCinter(NUi). For each transmitting-side word set, SCinter(NUi) expresses the strength of word-level relevance to the immediately preceding receiving-side word set, and is an index of how strongly related the transmitting-side and receiving-side utterances are.
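The two steps can be sketched together as follows (a minimal Python sketch; the data layout of `receiver_sets` as (time, word list) pairs and the `relevance` callback are assumptions of this illustration):

```python
def sc_inter(sender_set, sender_time, receiver_sets, relevance):
    # receiver_sets: list of (time, word_list) pairs for the receiving side.
    # Step 1: pick the receiving-side set whose time immediately precedes
    # sender_time; if none precedes it, fall back to the earliest set.
    earlier = [ts for ts in receiver_sets if ts[0] < sender_time]
    if earlier:
        no_i = max(earlier, key=lambda ts: ts[0])[1]
    else:
        no_i = min(receiver_sets, key=lambda ts: ts[0])[1]
    # Step 2: average relevance over every (sender word, receiver word) pair.
    pairs = [(u, o) for u in sender_set for o in no_i]
    return sum(relevance(u, o) for u, o in pairs) / len(pairs) if pairs else 0.0
```

Pairing each transmitting-side set with the immediately preceding receiving-side set reflects the turn-taking of a call: the customer's words are checked against what the operator just said.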

The transmitting-side context consistency integration means 36 takes the transmitting-side in-channel measure SCin(NUi) output by the in-channel context consistency measure calculation means 33 and the inter-channel measure SCinter(NUi) output by the inter-channel context consistency measure calculation means 34, and computes and outputs the transmitting-side recognition confidence CU by Equation (5).

CU = (1/h) Σi=1..h [ (1 − α) · SCin(NUi) + α · SCinter(NUi) ]    (5)

h is the number of word sets and α is a weight. The weight α is an arbitrary value (0 < α < 1) set in advance, for example so as to maximize the correlation coefficient between recognition accuracy and the recognition confidence obtained by weighted addition of the transmitting-side and receiving-side recognition confidences computed on a development set that pairs actual call speech with manually created transcripts. Since the recognition accuracy of the receiving side is usually higher, α should be set to a larger value; this is expected to evaluate the transmitting-side recognition confidence more accurately.

In this way, the transmitting-side context consistency integration means 36 computes, as the transmitting-side recognition confidence, the weighted sum of the within-set relevance of the transmitting-side recognition result and the relevance between each transmitting-side word set and the immediately preceding receiving-side word set, averaged over the number of transmitting-side word sets. The transmitting-side recognition confidence CU takes a large value when the transmitting-side and receiving-side utterances are strongly related.
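The integration step can be sketched as follows (a minimal sketch; note that which of the two terms α multiplies is an assumption of this illustration, since the text only states a weighted sum with 0 < α < 1):

```python
def sender_confidence(sc_in_list, sc_inter_list, alpha):
    # Equation (5) as reconstructed in this sketch: for each transmitting-side
    # word set NU_i, a weighted sum of the in-channel and inter-channel
    # measures, averaged over the h word sets.
    h = len(sc_in_list)
    return sum((1 - alpha) * a + alpha * b
               for a, b in zip(sc_in_list, sc_inter_list)) / h
```

A larger α shifts trust toward the inter-channel term, i.e., toward agreement with the (usually more accurately recognized) receiving side.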

The receiving-side context consistency integration means 35 computes and outputs the receiving-side recognition confidence CO by averaging the receiving-side in-channel measure SCin(NOi), output by the in-channel context consistency measure calculation means 33, over the number of word sets NOi (Equation (6)).

CO = (1/h) Σi=1..h SCin(NOi)    (6)

The weight α may also be computed from the word recognition confidences.

FIG. 8 shows a functional configuration example of a speech recognition apparatus 200 provided with a weight calculation unit 80 that obtains the weight α from the word recognition confidences. The speech recognition apparatus 200 differs from the speech recognition apparatus 100 only in that it includes the weight calculation unit 80, which takes the transmitting-side and receiving-side recognition results output by the speech recognition unit 20 as input, computes the weight α, and supplies it to the transmitting-side context consistency integration means 36 of the recognition confidence calculation unit 30.

FIG. 9 shows a functional configuration example of the weight calculation unit 80, which comprises a receiving-side recognition score calculation means 81, a transmitting-side recognition score calculation means 82, and a sigmoid function operation means 83. The receiving-side recognition score calculation means 81 takes the receiving-side recognition result output by the speech recognition unit 20 and outputs the receiving-side recognition score PO, which is the sum of the word recognition confidences assigned to the words of the recognition result divided by the sum of the word durations.

The transmitting-side recognition score calculation means 82 outputs the transmitting-side recognition score PU, which is the sum of the word recognition confidences assigned to the words of the transmitting-side recognition result output by the speech recognition unit 20 divided by the sum of the word durations. The sigmoid function operation means 83 takes the receiving-side recognition score PO and the transmitting-side recognition score PU as input and computes the weight α by Equation (7).

α = 1 / ( 1 + exp( −g · (PO − PU) + d ) )    (7)

g is the gain constant of the weight α and d is the shift constant; both are values set in advance, for example the shift constant d in the range 0 to 50000 and the gain constant g in the range 1000 to 5000.

The weight calculation unit 80 exploits the fact that the difference between the receiving-side recognition score PO and the transmitting-side recognition score PU grows as the recognition accuracy of the receiving-side channel exceeds that of the transmitting-side channel, and outputs as the weight α the value obtained by converting this score difference into the range 0 to 1 with the sigmoid function.
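The conversion can be sketched as follows (a minimal sketch; the exact placement of the gain g and shift d inside the sigmoid is an assumption of this illustration):

```python
import math

def weight_alpha(po, pu, gain, shift):
    # Map the score difference PO - PU into (0, 1) with a sigmoid; larger
    # po - pu (receiving side more reliable) yields a larger alpha.
    return 1.0 / (1.0 + math.exp(-gain * (po - pu) + shift))
```

When the two channels score equally and the shift is zero, α = 0.5; as the receiving side pulls ahead, α approaches 1 and the inter-channel evidence dominates.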

上記装置における処理手段をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、各装置における処理手段がコンピュータ上で実現される。   When the processing means in the above apparatus is realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. Then, by executing this program on the computer, the processing means in each apparatus is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、DVD（Digital Versatile Disc）、DVD−RAM（Random Access Memory）、CD−ROM（Compact Disc Read Only Memory）、CD−R（Recordable）/RW（ReWritable）等を、光磁気記録媒体として、MO（Magneto Optical disc）等を、半導体メモリとしてEEP−ROM（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。   The program describing the processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) as the semiconductor memory.

また、このプログラムの流通は、例えば、そのプログラムを記録したDVD、CD−ROM等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記録装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。   The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a recording device of a server computer and transferring the program from the server computer to another computer via a network.

また、各手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。   Each means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

Claims (6)

送話側音声と受話側音声を入力としてそれぞれの音声を音声認識処理した単語毎に単語認識信頼度を付与した送話側音声認識結果と受話側音声認識結果を出力する音声認識過程と、
上記送話側音声認識結果と受話側音声認識結果を入力として音声認識結果の全ての単語間の組み合わせの2単語間の関連度を示す単語関連度テーブルを参照して上記それぞれの音声認識結果のチャネル内文脈一貫性尺度と、上記送話側音声認識結果と上記受話側音声認識結果との間のチャネル間文脈一貫性尺度とを求め、送話側チャネル内文脈一貫性尺度と上記チャネル間文脈一貫性尺度の重み付き和を送話側認識信頼度として計算して出力する認識信頼度計算過程と、
を含む2チャネル音声の音声認識方法。
A speech recognition process that takes transmitting-side speech and receiving-side speech as inputs and outputs a transmitting-side speech recognition result and a receiving-side speech recognition result in which a word recognition reliability is assigned to each word obtained by speech recognition of the respective speech; and
a recognition reliability calculation process that takes the transmitting-side speech recognition result and the receiving-side speech recognition result as inputs, obtains, by referring to a word association degree table indicating the degree of association between every pair of words in the speech recognition results, an intra-channel context consistency measure for each speech recognition result and an inter-channel context consistency measure between the transmitting-side speech recognition result and the receiving-side speech recognition result, and calculates and outputs a weighted sum of the transmitting-side intra-channel context consistency measure and the inter-channel context consistency measure as a transmitting-side recognition reliability;
A speech recognition method for two-channel speech including the above.
請求項1に記載した音声認識方法において、
上記認識信頼度計算過程は、
更に、受話側チャネル内文脈一貫性尺度の平均値を受話側認識信頼度として計算して出力する過程であることを特徴とする2チャネル音声の音声認識方法。
The speech recognition method according to claim 1, wherein the recognition reliability calculation process is further a process of calculating and outputting the average value of the receiving-side intra-channel context consistency measure as a receiving-side recognition reliability.
請求項2に記載した2チャネル音声の音声認識方法において、
上記重み付き和の重みは、上記送話側音声認識結果と上記受話側音声認識結果の各単語に付与された単語認識信頼度の総和を、各単語の継続時間長の総和で除した受話側音声認識スコアと送話側音声認識スコアの差をゲインとするシグモイド関数で求められた値であることを特徴とする2チャネル音声の音声認識方法。
The speech recognition method for two-channel speech according to claim 2, wherein the weight of the weighted sum is a value obtained by a sigmoid function applied to the difference between a receiving-side speech recognition score and a transmitting-side speech recognition score, each obtained by dividing the sum of the word recognition reliabilities assigned to the words of the corresponding speech recognition result by the sum of the durations of those words.
送話側音声と受話側音声を入力としてそれぞれの音声を音声認識処理した単語毎に単語認識信頼度を付与した送話側音声認識結果と受話側音声認識結果を出力する音声認識部と、
上記送話側音声認識結果と受話側音声認識結果を入力として音声認識結果の全ての単語間の組み合わせの2単語間の関連度を示す単語関連度テーブルを参照して上記それぞれの音声認識結果のチャネル内文脈一貫性尺度と、上記送話側音声認識結果と上記受話側音声認識結果との間のチャネル間文脈一貫性尺度とを求め、送話側チャネル内文脈一貫性尺度と上記チャネル間文脈一貫性尺度の重み付き和を送話側認識信頼度として計算して出力する認識信頼度計算部と、
を具備する2チャネル音声の音声認識装置。
A speech recognition unit that takes transmitting-side speech and receiving-side speech as inputs and outputs a transmitting-side speech recognition result and a receiving-side speech recognition result in which a word recognition reliability is assigned to each word obtained by speech recognition of the respective speech; and
a recognition reliability calculation unit that takes the transmitting-side speech recognition result and the receiving-side speech recognition result as inputs, obtains, by referring to a word association degree table indicating the degree of association between every pair of words in the speech recognition results, an intra-channel context consistency measure for each speech recognition result and an inter-channel context consistency measure between the transmitting-side speech recognition result and the receiving-side speech recognition result, and calculates and outputs a weighted sum of the transmitting-side intra-channel context consistency measure and the inter-channel context consistency measure as a transmitting-side recognition reliability;
A speech recognition apparatus for two-channel speech comprising the above.
請求項4に記載した音声認識装置において、
上記認識信頼度計算部は、
更に、受話側チャネル内文脈一貫性尺度の平均値を受話側認識信頼度として計算して出力するものであることを特徴とする2チャネル音声の音声認識装置。
The speech recognition apparatus according to claim 4, wherein the recognition reliability calculation unit further calculates and outputs the average value of the receiving-side intra-channel context consistency measure as a receiving-side recognition reliability.
請求項1乃至3の何れかに記載した2チャネル音声の音声認識方法を、コンピュータに実行させるための2チャネル音声の音声認識方法プログラム。   A program for causing a computer to execute the speech recognition method for two-channel speech according to any one of claims 1 to 3.
JP2010162629A 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof Expired - Fee Related JP5325176B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010162629A JP5325176B2 (en) 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2010162629A JP5325176B2 (en) 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof

Publications (2)

Publication Number Publication Date
JP2012027065A JP2012027065A (en) 2012-02-09
JP5325176B2 true JP5325176B2 (en) 2013-10-23

Family

ID=45780100

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2010162629A Expired - Fee Related JP5325176B2 (en) 2010-07-20 2010-07-20 2-channel speech recognition method, apparatus and program thereof

Country Status (1)

Country Link
JP (1) JP5325176B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9870765B2 (en) 2016-06-03 2018-01-16 International Business Machines Corporation Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3621922B2 (en) * 2001-02-01 2005-02-23 松下電器産業株式会社 Sentence recognition apparatus, sentence recognition method, program, and medium
JP4128342B2 (en) * 2001-07-19 2008-07-30 三菱電機株式会社 Dialog processing apparatus, dialog processing method, and program
EP1450350A1 (en) * 2003-02-20 2004-08-25 Sony International (Europe) GmbH Method for Recognizing Speech with attributes
JP2005010691A (en) * 2003-06-20 2005-01-13 P To Pa:Kk Apparatus and method for speech recognition, apparatus and method for conversation control, and program therefor
JP4734155B2 (en) * 2006-03-24 2011-07-27 株式会社東芝 Speech recognition apparatus, speech recognition method, and speech recognition program
JP5044783B2 (en) * 2007-01-23 2012-10-10 国立大学法人九州工業大学 Automatic answering apparatus and method

Also Published As

Publication number Publication date
JP2012027065A (en) 2012-02-09


Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20121101

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20130627

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20130709

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20130719

R150 Certificate of patent or registration of utility model

Ref document number: 5325176

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20130829

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees