JP2005084071A - Speech recognizing apparatus - Google Patents

Speech recognizing apparatus

Info

Publication number
JP2005084071A
JP2005084071A
Authority
JP
Japan
Prior art keywords
standard pattern
recognition
speech
voice
human voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2003312385A
Other languages
Japanese (ja)
Other versions
JP4526057B2 (en)
Inventor
Masaki Naito
正樹 内藤
Kengo Fujita
顕吾 藤田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KDDI Corp
Original Assignee
KDDI Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KDDI Corp filed Critical KDDI Corp
Priority to JP2003312385A priority Critical patent/JP4526057B2/en
Publication of JP2005084071A publication Critical patent/JP2005084071A/en
Application granted granted Critical
Publication of JP4526057B2 publication Critical patent/JP4526057B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition apparatus that can recognize input speech with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

SOLUTION: A speech collation unit 1 matches speech input through a mobile phone against recognition standard patterns and outputs a recognition result. The recognition standard patterns 2 consist of recognition vocabulary standard patterns 2-1, generated by modeling the speech to be recognized, with background human voice standard patterns added before and after them. The background human voice standard patterns are either a background human voice standard pattern group 2-3, trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern 2-4, trained on the background voices contained in mobile phone speech.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that, when recognizing speech input via a mobile phone, can recognize the speech with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

FIG. 8 is a block diagram showing a conventional speech recognition apparatus. Speech input through a mobile phone is matched against recognition standard patterns 2 in a speech collation unit 1. The recognition standard patterns 2 are stored in advance for matching and consist either of recognition vocabulary standard patterns 2-1 alone or of a combination of recognition vocabulary standard patterns 2-1 and speech standard patterns 2-2.

FIG. 9 shows recognition standard patterns 2 consisting only of recognition vocabulary standard patterns 2-1, and FIG. 10 shows recognition standard patterns 2 consisting of a combination of recognition vocabulary standard patterns 2-1 and speech standard patterns 2-2. The speech collation unit 1 matches the input speech against the recognition standard patterns 2 and outputs the vocabulary entry with the highest matching score (similarity) as the recognition result.

When the recognition standard patterns 2 of FIG. 9 are used, input speech delimited at both ends by silent sections is matched in the speech collation unit 1 against the recognition standard patterns 2, which consist of the recognition vocabulary standard patterns (recognition vocabulary 1 to N) 2-1, and the vocabulary entry with the highest matching score is output as the recognition result.
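The FIG. 9 style matching amounts to scoring the input against each stored vocabulary pattern and taking the best. The following is a minimal sketch of that idea; the patent does not specify the matching algorithm, so the use of dynamic time warping (DTW) as the similarity measure, and the function names, are illustrative assumptions.

```python
import numpy as np

def dtw_score(seq, template):
    """Similarity between a feature sequence and a stored template
    (negative DTW distance, so higher means more similar)."""
    n, m = len(seq), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return -cost[n, m]

def recognize(features, vocab_patterns):
    """FIG. 9 style matching: score the input against every vocabulary
    pattern and return the entry with the highest score."""
    scores = {word: dtw_score(features, tpl) for word, tpl in vocab_patterns.items()}
    return max(scores, key=scores.get)
```

With real feature vectors (e.g. cepstral frames) the same argmax structure applies; only the per-pattern scoring changes.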

When the recognition standard patterns 2 of FIG. 10 are used, input speech delimited at both ends by silent sections is matched against recognition standard patterns 2 in which speech standard patterns 2-2, which accept arbitrary Japanese speech composed of consonants and vowels, are added before and after the recognition vocabulary standard patterns 2-1.
Tatsuya Kawahara, Toshihiko Sotsugi, Kiyoichi Miki, Shuji Doshita, "A study of language models for word spotting in conversational speech," IEICE Technical Report SP94-28, 1994, pp. 41-48.

However, when the recognition standard patterns 2 of FIG. 9, consisting only of the recognition vocabulary standard patterns 2-1, are used, a background voice mixed into the input speech may itself be matched against the recognition vocabulary and output as a recognition result, so misrecognition increases.

When the recognition standard patterns 2 of FIG. 10, consisting of the combination of recognition vocabulary standard patterns 2-1 and speech standard patterns 2-2, are used, misrecognition caused by mixed-in background voices can be reduced, but a new problem arises: a vocabulary entry, or part of one, may be matched against the preceding or following speech standard patterns 2-2, producing new misrecognitions.

An object of the present invention is to provide a speech recognition apparatus that can recognize input speech with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

To solve the above problem, the present invention provides a speech recognition apparatus having a speech collation unit that matches speech input via a mobile phone against recognition standard patterns and outputs a recognition result, wherein the recognition standard patterns include recognition vocabulary standard patterns that model the speech to be recognized and background human voice standard patterns added before and after them, and the background human voice standard patterns are either a background human voice standard pattern group trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern trained on the background voices contained in mobile phone speech.

The present invention also provides a speech recognition apparatus having a plurality of speech collation units, each of which matches speech input via a mobile phone against recognition standard patterns and outputs a recognition result, and a judgment unit. In at least one of the speech collation units, the recognition standard patterns include recognition vocabulary standard patterns that model the speech to be recognized and background human voice standard patterns added before and after them; the background human voice standard patterns are either a background human voice standard pattern group trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern trained on the background voices contained in mobile phone speech. The judgment unit adds a predetermined value to each of the matching scores output by the speech collation units, compares the resulting values, and outputs as the recognition result the recognition vocabulary entry from the speech collation unit that produced the highest value.

Here, in the recognition standard patterns, a boundary standard pattern representing the boundary between the recognition vocabulary standard pattern and the background human voice standard pattern, for example a standard pattern corresponding to silence, can be inserted between the two.

According to the present invention, recognition standard patterns consisting of a combination of recognition vocabulary standard patterns and background human voice standard patterns are used, where the background human voice standard patterns are either a background human voice standard pattern group trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern trained on the background voices contained in mobile phone speech. As a result, speech input via a mobile phone can be recognized with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

In addition, by using background human voice standard patterns trained on speech that is distorted more heavily than normal input speech, the phenomenon in which a vocabulary entry, or part of one, is matched against the preceding or following background human voice standard patterns and misrecognized can be suppressed.

Furthermore, by inserting a standard pattern representing the boundary between the recognition vocabulary standard pattern and the background human voice standard pattern, the phenomenon in which the vocabulary portion is misrecognized as background voice is suppressed, and speech can be recognized with even higher accuracy.

The present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of a speech recognition apparatus according to the present invention; parts that are the same as or equivalent to those in FIG. 8 are given the same reference numerals.

In FIG. 1, speech input via a mobile phone is matched against recognition standard patterns 2 in the speech collation unit 1. The recognition standard patterns 2 are stored in advance for matching and consist of a combination of the recognition vocabulary standard patterns 2-1 and either the background human voice standard pattern group 2-3 or the background human voice garbage standard pattern 2-4. Since both the background human voice standard pattern group 2-3 and the background human voice garbage standard pattern 2-4 are standard patterns for background voices, they are collectively referred to below as background human voice standard patterns.

The background human voice standard pattern group 2-3 is created by training on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone. Here, such low-level input speech means speech whose level is sufficiently lower than that of a caller speaking normally into a mobile phone that it suffers large distortion (such as coding distortion) in the mobile phone transmission system.

For example, in the speech coding scheme used in mobile phones, as described in "ARIB standard STD-T64-C.S0014-0, Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems" (page 4-23, section 4.3.1.2), the code length used to encode the speech is determined by an equation that takes the speech input level into account. The shorter the code length, the greater the distortion that encoding adds to the speech. Therefore, for example, speech at levels that would be encoded with a short code length according to the decision equation used for code-length selection is used to train the background human voice standard patterns.
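To make the selection rule concrete, here is a rough sketch, not taken from the standard or the patent, of collecting training material by input level: frames whose level falls below a threshold stand in for speech that the codec's rate-decision rule would encode with a short code length. The threshold value, frame layout, and function name are assumptions for illustration.

```python
import numpy as np

def select_low_level_frames(frames, level_threshold_db=-30.0):
    """Collect frames whose input level falls below a threshold, as a
    proxy for the codec's rate-decision rule (threshold is an assumption).
    `frames` is an iterable of 1-D sample arrays; returns the selected frames."""
    selected = []
    for frame in frames:
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12   # avoid log(0)
        level_db = 20.0 * np.log10(rms)
        if level_db < level_threshold_db:
            # would be coded at a short code length -> heavy coding distortion
            selected.append(frame)
    return selected
```

The frames selected this way would then be passed through the actual transmission system and used as training data for the background human voice standard pattern group.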

The background human voice garbage standard pattern is created by training on the background voices contained in mobile phone speech. FIG. 2 shows the distribution of input levels for speech spoken into a mobile phone and for background voices. As the difference between the two distributions shows, background voices have a lower input level than speech directed at the mobile phone and therefore suffer large distortion (such as coding distortion) in the mobile phone transmission system.

FIGS. 3 and 4 show examples of the recognition standard patterns 2 used in the present invention. FIG. 3 shows recognition standard patterns 2 consisting of a combination of the recognition vocabulary standard patterns 2-1 and background human voice standard patterns (the background human voice standard pattern group) 2-3, and FIG. 4 shows recognition standard patterns 2 consisting of a combination of the recognition vocabulary standard patterns 2-1 and a background human voice standard pattern (the background human voice garbage standard pattern) 2-4. The speech collation unit 1 matches the input speech against the recognition standard patterns and outputs the vocabulary entry with the highest matching score (similarity) as the recognition result.

When the recognition standard patterns 2 of FIG. 3 are used, input speech delimited at both ends by silent sections is matched against recognition standard patterns 2 consisting of the background human voice standard patterns 2-3, the recognition vocabulary standard patterns (recognition vocabulary 1 to N) 2-1, and the background human voice standard patterns 2-3.

When the recognition standard patterns 2 of FIG. 4 are used, speech delimited at both ends by silent sections is matched against recognition standard patterns consisting of the background human voice standard pattern (background human voice garbage standard pattern) 2-4, the recognition vocabulary standard patterns (recognition vocabulary 1 to N) 2-1, and the background human voice standard pattern (background human voice garbage standard pattern) 2-4.

Background voices normally have a lower input level than speech directed at the mobile phone, so in the phone's speech coding they are subjected to coding distortion different from that of speech spoken into the phone. By using recognition standard patterns that include the background human voice standard pattern group 2-3 or the background human voice garbage standard pattern 2-4, created with this coding distortion taken into account, speech directed at the mobile phone can be recognized well even when background voices are mixed in.
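The effect of the composed pattern of FIGS. 3 and 4, background voice, then a vocabulary word, then background voice again, can be sketched as follows. This is an illustrative brute-force search over segmentations with per-frame scoring functions; a real recognizer would use HMMs and Viterbi decoding, and all names here are hypothetical.

```python
import numpy as np

def best_composed_score(features, word_frame_score, bg_frame_score):
    """Best score of the path [background][word][background] over all
    segmentations of `features` (brute force; real systems use Viterbi).
    The *_frame_score functions map one frame to a log-likelihood-like value."""
    n = len(features)
    bg = [bg_frame_score(f) for f in features]
    wd = [word_frame_score(f) for f in features]
    best = -np.inf
    for s in range(n):
        for e in range(s + 1, n + 1):       # the word occupies frames [s, e)
            score = sum(bg[:s]) + sum(wd[s:e]) + sum(bg[e:])
            best = max(best, score)
    return best

def recognize_with_background(features, word_models, bg_frame_score):
    """FIG. 3 / FIG. 4 style decoding: pick the vocabulary word whose
    background-word-background composition scores highest."""
    scores = {w: best_composed_score(features, m, bg_frame_score)
              for w, m in word_models.items()}
    return max(scores, key=scores.get)
```

Because the surrounding frames are absorbed by the background model, a vocabulary word spoken amid background voices can still win the comparison.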

FIG. 5 is a block diagram showing another embodiment of a speech recognition apparatus according to the present invention. In this embodiment, a plurality of speech collation units 1 (1-1, 1-2, ..., 1-n) are provided, and the input speech is fed to them in parallel. Each speech collation unit 1 matches the input speech against its own recognition standard patterns 2 (2-1, 2-2, ..., 2-n) and sends its matching result 1, 2, ..., n to a judgment unit 3.

Here, at least one of the speech collation units 1 (1-1, 1-2, ..., 1-n) uses, as in the embodiment of FIG. 1, recognition standard patterns consisting of recognition vocabulary standard patterns and background human voice standard patterns (the background human voice standard pattern group or the background human voice garbage standard pattern); the other speech collation units may use the recognition standard patterns of a conventional speech recognition apparatus, for example those of FIG. 9 or FIG. 10.

Each speech collation unit 1 (1-1, 1-2, ..., 1-n) outputs, as the result of matching the input speech against its recognition standard patterns, the vocabulary entry with the highest similarity together with its matching score Pi (i = 1 to n).

As shown in the equation below, the judgment unit 3 adds a preset value Wi to the matching score Pi (i = 1 to n) output by each speech collation unit 1 (1-1, 1-2, ..., 1-n), compares the resulting values Si, and outputs the vocabulary entry with the highest value as the recognition result m.

Si = log Pi + Wi

Here, the value Wi is a weight applied to each matching score Pi (i = 1 to n). For example, if the apparatus is expected to be used in an environment with much mixed-in speech not intended as input, such as background voices or muttering, the value Wi for the results of speech collation units using the noise-tolerant recognition standard patterns of FIGS. 3 and 4 is made large, and the value Wi for the results of speech collation units using the recognition standard patterns of FIGS. 9 and 10 is made small.
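The judgment rule Si = log Pi + Wi can be written down directly. The following is a minimal sketch; the function name and data layout are assumptions, not taken from the patent.

```python
import math

def combine_results(results, weights):
    """Judgment unit of FIG. 5: each collation unit i reports (word_i, P_i);
    compute S_i = log(P_i) + W_i and return the word with the largest S_i."""
    best_word, best_s = None, -math.inf
    for (word, p), w in zip(results, weights):
        s = math.log(p) + w
        if s > best_s:
            best_word, best_s = word, s
    return best_word
```

Raising the weight of a noise-tolerant collation unit lets its hypothesis win even when a conventional unit reports a slightly higher raw score.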

According to the embodiment of FIG. 5, by using a variety of recognition standard patterns and weighting the individual matching results, the speech recognition apparatus can be adapted so that recognition works well in the environment where it is used, whether that environment contains much or little speech not intended as input.

The embodiments above use recognition standard patterns in which the background human voice standard patterns and the recognition vocabulary standard patterns are directly adjacent. By instead using recognition standard patterns in which a boundary standard pattern representing the boundary is inserted between the recognition vocabulary standard pattern and the background human voice standard pattern, the phenomenon in which the vocabulary portion is misrecognized as background voice can be reduced further.

FIG. 6 shows an example of recognition standard patterns 2 in which boundary standard patterns 2-5 representing the boundary are inserted between the background human voice standard patterns 2-3 and the recognition vocabulary standard patterns 2-1; in the figure, GB denotes the background human voice garbage standard pattern. Input speech delimited at both ends by silent sections is matched against recognition standard patterns consisting of the background human voice standard pattern 2-3, a boundary standard pattern 2-5, a recognition vocabulary standard pattern 2-1, a boundary standard pattern 2-5, and the background human voice standard pattern 2-3. As a result of this matching, the vocabulary entry with the highest matching score is output as the recognition result.

FIG. 7 shows an example in which a standard pattern corresponding to silence is used as the boundary standard pattern 2-5 between the background human voice standard patterns 2-3 and the recognition vocabulary standard patterns 2-1. Input speech delimited at both ends by silent sections is matched against recognition standard patterns consisting of the background human voice standard pattern 2-3, a silence pattern (boundary standard pattern) 2-5, a recognition vocabulary standard pattern 2-1, a silence pattern (boundary standard pattern) 2-5, and the background human voice standard pattern 2-3. As a result of this matching, the vocabulary entry with the highest matching score is output as the recognition result.
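The pattern sequences of FIGS. 3, 4, 6, and 7 can be summarized as a small helper that builds one recognition path per vocabulary entry. The labels "BG" and "SIL" are illustrative placeholders for the background human voice model and the silence boundary model, not identifiers from the patent.

```python
def composed_pattern(word, use_silence_boundary=True):
    """Build the model sequence for one recognition path.
    FIG. 3/4 style: BG, word, BG.
    FIG. 6/7 style: BG, SIL, word, SIL, BG (silence as the boundary pattern)."""
    path = ["BG"]
    if use_silence_boundary:
        path.append("SIL")
    path.append(word)
    if use_silence_boundary:
        path.append("SIL")
    path.append("BG")
    return path
```

A decoder would score the input against each such path and output the word from the highest-scoring one; the silence entries keep vocabulary frames from being absorbed into the background model.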

The above describes embodiments that use the recognition standard patterns of FIG. 3 or FIG. 4 on their own, but a background human voice standard pattern formed by combining the background human voice standard pattern group and the background human voice garbage standard pattern in parallel can also be placed before and after the recognition vocabulary standard patterns 2-1. FIGS. 6 and 7 use the background human voice garbage standard pattern as the background human voice standard pattern, but the background human voice standard pattern group, or a combination of the group and the garbage standard pattern, can also be used.

INDUSTRIAL APPLICABILITY: The present invention is useful not only for voice portal services for mobile phones but also for spoken dialogue systems for call centers, spoken dialogue systems for car navigation, and the like.

FIG. 1 is a block diagram showing an embodiment of a speech recognition apparatus according to the present invention.
FIG. 2 is a characteristic diagram showing the distribution of input levels for speech input to a mobile phone and for background voices.
FIG. 3 shows an example of the recognition standard patterns used in the present invention.
FIG. 4 shows another example of the recognition standard patterns used in the present invention.
FIG. 5 is a block diagram showing another embodiment of a speech recognition apparatus according to the present invention.
FIG. 6 shows still another example of the recognition standard patterns used in the present invention.
FIG. 7 shows still another example of the recognition standard patterns used in the present invention.
FIG. 8 is a block diagram showing a conventional speech recognition apparatus.
FIG. 9 shows recognition standard patterns used in a conventional speech recognition apparatus.
FIG. 10 shows another set of recognition standard patterns used in a conventional speech recognition apparatus.

Explanation of symbols

1, 1-1 to 1-n: speech collation unit; 2, 2-1 to 2-n: recognition standard patterns; 2-1: recognition vocabulary standard pattern; 2-2: speech standard pattern; 2-3: background human voice standard pattern group; 2-4: background human voice garbage standard pattern; 2-5: boundary (silence) standard pattern; 3: judgment unit

Claims (4)

携帯電話を介して入力された音声と認識用標準パタンとを照合して認識結果を送出する音声照合部を有する音声認識装置において、
前記認識用標準パタンは、認識対象とする音声をモデル化した認識語彙標準パタンと該認識語彙標準パタンの前後に付加された背景人声標準パタンとを含む認識用標準パタンであり、
前記背景人声標準パタンは、携帯電話伝送系を介して入力され、携帯電話への通常の入力音声と比較して大きな歪みを受ける低レベルの入力音声を元に学習して得られた背景人声標準パタン群、あるいは携帯電話音声に含まれる背後の人声を元に学習して得られた背景人声ガーベージ標準パタンであることを特徴とする音声認識装置。
In a speech recognition apparatus having a speech collation unit that collates speech input via a mobile phone with a recognition standard pattern and sends a recognition result,
The recognition standard pattern is a recognition standard pattern including a recognition vocabulary standard pattern that models speech to be recognized and a background human voice standard pattern added before and after the recognition vocabulary standard pattern,
The background human voice standard pattern is obtained through learning based on low-level input speech that is input via a mobile phone transmission system and receives large distortion compared to normal input speech to the mobile phone. A voice recognition device characterized by being a voice standard pattern group or a background human voice garbage standard pattern obtained by learning based on a human voice behind a mobile phone voice.
携帯電話を介して入力された音声と認識用標準パタンとを照合して認識結果を送出する複数の音声照合部と判定部とを有する音声認識装置において、
前記複数の音声照合部のうちの少なくとも1つにおける認識用標準パタンは、認識対象とする音声をモデル化した認識語彙標準パタンと、該認識語彙標準パタンの前後に付加された背景人声標準パタンとを含む認識用標準パタンであり、
前記背景人声標準パタンは、携帯電話伝送系を介して入力され、携帯電話への通常の入力音声と比較して大きな歪みを受ける低レベルの入力音声を元に学習して得られた背景人声標準パタン群、あるいは携帯電話音声に含まれる背後の人声を元に学習して得られた背景人声ガーベージ標準パタンであり、
前記判定部は、前記複数の音声照合部より送出される照合スコアの各々に対して予め定めた値を加算し、その結果得られた値を互いに比較して最も値が高い値を送出した音声照合部での認識語彙を認識結果として送出することを特徴とする音声認識装置。
In a speech recognition apparatus having a plurality of speech collation units and a determination unit that collate speech input via a mobile phone with a standard pattern for recognition and send a recognition result,
The recognition standard pattern in at least one of the plurality of speech collating units includes a recognition vocabulary standard pattern that models speech to be recognized, and a background human voice standard pattern that is added before and after the recognition vocabulary standard pattern. A standard pattern for recognition including
The background human voice standard pattern is input through a mobile phone transmission system, and a background person obtained by learning based on low-level input voice that receives large distortion compared to normal input voice to a mobile phone. Voice standard pattern group, or background human voice garbage standard pattern obtained by learning based on the human voice behind mobile phone voice,
The determination unit adds a predetermined value to each of the collation scores sent from the plurality of voice collation units, compares the obtained values with each other, and sends the highest value. A speech recognition apparatus characterized in that a recognition vocabulary in a collation unit is transmitted as a recognition result.
The speech recognition apparatus according to claim 1 or 2, characterized in that, in the recognition standard pattern, a boundary standard pattern representing the boundary between the recognition vocabulary standard pattern and the background human voice standard pattern is inserted between the two. The speech recognition apparatus according to claim 3, characterized in that the boundary standard pattern is a standard pattern corresponding to silence.
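The claims above can be illustrated with a minimal sketch (all names are invented for illustration and are not from the patent): (1) a recognition standard pattern built by adding background human voice patterns, separated by silence boundary patterns, before and after the recognition vocabulary pattern, and (2) a determination unit that adds a predetermined offset to each matching unit's score and outputs the vocabulary word of the unit with the highest adjusted score.

```python
# Hypothetical illustration of the claimed structure and decision rule.

def build_recognition_pattern(vocab, background="bg_voice", boundary="sil"):
    """Concatenate sub-patterns as in claims 1, 3 and 4: background voice
    pattern - silence boundary - vocabulary pattern - silence boundary -
    background voice pattern."""
    return [background, boundary, vocab, boundary, background]

def decide(match_results, offsets):
    """Determination unit of claim 2.

    match_results: (collation_score, recognized_word) from each matching unit.
    offsets: the predetermined value added to each unit's score.
    Returns the recognition vocabulary of the highest-scoring unit."""
    adjusted = [(score + off, word)
                for (score, word), off in zip(match_results, offsets)]
    best_score, best_word = max(adjusted)
    return best_word

print(build_recognition_pattern("hello"))
# -> ['bg_voice', 'sil', 'hello', 'sil', 'bg_voice']
print(decide([(10.0, "hello"), (9.0, "stop")], [0.0, 2.0]))
# -> 'stop'  (9.0 + 2.0 beats 10.0 + 0.0)
```

The per-unit offsets act as a bias that lets the determination unit favor one acoustic-condition-specific matcher over another before comparing scores.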
JP2003312385A 2003-09-04 2003-09-04 Voice recognition device Expired - Fee Related JP4526057B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003312385A JP4526057B2 (en) 2003-09-04 2003-09-04 Voice recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003312385A JP4526057B2 (en) 2003-09-04 2003-09-04 Voice recognition device

Publications (2)

Publication Number Publication Date
JP2005084071A true JP2005084071A (en) 2005-03-31
JP4526057B2 JP4526057B2 (en) 2010-08-18

Family

ID=34413655

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003312385A Expired - Fee Related JP4526057B2 (en) 2003-09-04 2003-09-04 Voice recognition device

Country Status (1)

Country Link
JP (1) JP4526057B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008089625A (en) * 2006-09-29 2008-04-17 Honda Motor Co Ltd Voice recognition apparatus, voice recognition method and voice recognition program
WO2013145578A1 (en) * 2012-03-30 2013-10-03 日本電気株式会社 Audio processing device, audio processing method, and audio processing program
CN111785302A (en) * 2020-06-23 2020-10-16 北京声智科技有限公司 Speaker separation method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01309099A (en) * 1987-06-04 1989-12-13 Ricoh Co Ltd Speech responding device
JP2000284792A (en) * 1999-03-31 2000-10-13 Canon Inc Device and method for recognizing voice
JP2003108188A (en) * 2001-09-28 2003-04-11 Kddi Corp Voice recognizing device

Also Published As

Publication number Publication date
JP4526057B2 (en) 2010-08-18

Similar Documents

Publication Publication Date Title
US11776540B2 (en) Voice control of remote device
US11455995B2 (en) User recognition for speech processing systems
US10600414B1 (en) Voice control of remote device
US10593328B1 (en) Voice control of remote device
US20030120486A1 (en) Speech recognition system and method
CN109192202B (en) Voice safety recognition method, device, computer equipment and storage medium
JP4838351B2 (en) Keyword extractor
KR100923896B1 (en) Method and apparatus for transmitting speech activity in distributed voice recognition systems
US6792408B2 (en) Interactive command recognition enhancement system and method
US6836758B2 (en) System and method for hybrid voice recognition
JP2001188784A (en) Device and method for processing conversation and recording medium
KR20070009688A (en) Detection of end of utterance in speech recognition system
US7676364B2 (en) System and method for speech-to-text conversion using constrained dictation in a speak-and-spell mode
KR20080049826A (en) A method and a device for speech recognition
EP1525577B1 (en) Method for automatic speech recognition
US7509257B2 (en) Method and apparatus for adapting reference templates
JP4526057B2 (en) Voice recognition device
US20070129945A1 (en) Voice quality control for high quality speech reconstruction
Venkatagiri Speech recognition technology applications in communication disorders
CN112565242B (en) Remote authorization method, system, equipment and storage medium based on voiceprint recognition
JP2003177788A (en) Audio interactive system and its method
JP2005092310A (en) Voice keyword recognizing device
JP2003044085A (en) Dictation device with command input function
KR100931790B1 (en) Recognition dictionary generation method using phonetic name list in speech recognition system and method of processing similar phonetic name using same
JPH086590A (en) Voice recognition method and device for voice conversation

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20060901

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20090929

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20091014

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20091211

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20100526

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20100528

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130611

Year of fee payment: 3

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

Ref document number: 4526057

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

LAPS Cancellation because of no payment of annual fees