JP2005084071A - Speech recognizing apparatus - Google Patents

Speech recognizing apparatus

Info

Publication number
JP2005084071A
JP2005084071A
Authority
JP
Japan
Prior art keywords
standard pattern
recognition
speech
voice
human voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2003312385A
Other languages
Japanese (ja)
Other versions
JP4526057B2 (en)
Inventor
Masaki Naito
正樹 内藤
Kengo Fujita
顕吾 藤田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KDDI Corp
Original Assignee
KDDI Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KDDI Corp filed Critical KDDI Corp
Priority to JP2003312385A priority Critical patent/JP4526057B2/en
Publication of JP2005084071A publication Critical patent/JP2005084071A/en
Application granted granted Critical
Publication of JP4526057B2 publication Critical patent/JP4526057B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition apparatus that can recognize input speech with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

SOLUTION: A speech collation unit 1 matches speech input through a mobile phone against recognition standard patterns and outputs a recognition result. The recognition standard patterns 2 consist of recognition vocabulary standard patterns 2-1, generated by modeling the speech to be recognized, with background human voice standard patterns added before and after them. The background human voice standard patterns are either a background human voice standard pattern group 2-3, trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern 2-4, trained on the background voices contained in mobile phone speech.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that, when recognizing speech input via a mobile phone, can recognize the speech with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

FIG. 8 is a block diagram showing a conventional speech recognition apparatus. Speech input through a mobile phone is matched against recognition standard patterns 2 in a speech collation unit 1. The recognition standard patterns 2 are stored in advance for matching and consist either of recognition vocabulary standard patterns 2-1 alone or of a combination of recognition vocabulary standard patterns 2-1 and speech standard patterns 2-2.

FIG. 9 shows recognition standard patterns 2 consisting only of recognition vocabulary standard patterns 2-1, and FIG. 10 shows recognition standard patterns 2 consisting of a combination of recognition vocabulary standard patterns 2-1 and speech standard patterns 2-2. The speech collation unit 1 matches the input speech against the recognition standard patterns 2 and outputs the vocabulary entry with the highest matching score (similarity) as the recognition result.

When the recognition standard patterns 2 of FIG. 9 are used, input speech delimited at both ends by silent sections is matched in the speech collation unit 1 against the recognition standard patterns 2, which consist of the recognition vocabulary standard patterns (recognition vocabulary 1 to N) 2-1, and the vocabulary entry with the highest matching score is output as the recognition result.
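The FIG. 9 style matching amounts to scoring the input against each stored vocabulary pattern and taking the best. The following is a minimal sketch of that idea; the patent does not specify the matching algorithm, so the use of dynamic time warping (DTW) as the similarity measure, and the function names, are illustrative assumptions.

```python
import numpy as np

def dtw_score(seq, template):
    """Similarity between a feature sequence and a stored template
    (negative DTW distance, so higher means more similar)."""
    n, m = len(seq), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return -cost[n, m]

def recognize(features, vocab_patterns):
    """FIG. 9 style matching: score the input against every vocabulary
    pattern and return the entry with the highest score."""
    scores = {word: dtw_score(features, tpl) for word, tpl in vocab_patterns.items()}
    return max(scores, key=scores.get)
```

With real feature vectors (e.g. cepstral frames) the same argmax structure applies; only the per-pattern scoring changes.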

When the recognition standard patterns 2 of FIG. 10 are used, input speech delimited at both ends by silent sections is matched against recognition standard patterns 2 in which speech standard patterns 2-2, which accept arbitrary Japanese speech composed of consonants and vowels, are added before and after the recognition vocabulary standard patterns 2-1.
Tatsuya Kawahara, Toshihiko Sotsugi, Kiyoichi Miki, Shuji Doshita, "A study of language models for word spotting in conversational speech," IEICE Technical Report SP94-28, 1994, pp. 41-48.

However, when the recognition standard patterns 2 of FIG. 9, consisting only of the recognition vocabulary standard patterns 2-1, are used, a background voice mixed into the input speech may itself be matched against the recognition vocabulary and output as a recognition result, so misrecognition increases.

When the recognition standard patterns 2 of FIG. 10, consisting of the combination of recognition vocabulary standard patterns 2-1 and speech standard patterns 2-2, are used, misrecognition caused by mixed-in background voices can be reduced, but a new problem arises: a vocabulary entry, or part of one, may be matched against the preceding or following speech standard patterns 2-2, producing new misrecognitions.

An object of the present invention is to provide a speech recognition apparatus that can recognize input speech with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

To solve the above problem, the present invention provides a speech recognition apparatus having a speech collation unit that matches speech input via a mobile phone against recognition standard patterns and outputs a recognition result, wherein the recognition standard patterns include recognition vocabulary standard patterns that model the speech to be recognized and background human voice standard patterns added before and after them, and the background human voice standard patterns are either a background human voice standard pattern group trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern trained on the background voices contained in mobile phone speech.

The present invention also provides a speech recognition apparatus having a plurality of speech collation units, each of which matches speech input via a mobile phone against recognition standard patterns and outputs a recognition result, and a judgment unit. In at least one of the speech collation units, the recognition standard patterns include recognition vocabulary standard patterns that model the speech to be recognized and background human voice standard patterns added before and after them; the background human voice standard patterns are either a background human voice standard pattern group trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern trained on the background voices contained in mobile phone speech. The judgment unit adds a predetermined value to each of the matching scores output by the speech collation units, compares the resulting values, and outputs as the recognition result the recognition vocabulary entry from the speech collation unit that produced the highest value.

Here, in the recognition standard patterns, a boundary standard pattern representing the boundary between the recognition vocabulary standard pattern and the background human voice standard pattern, for example a standard pattern corresponding to silence, can be inserted between the two.

According to the present invention, recognition standard patterns consisting of a combination of recognition vocabulary standard patterns and background human voice standard patterns are used, where the background human voice standard patterns are either a background human voice standard pattern group trained on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone, or a background human voice garbage standard pattern trained on the background voices contained in mobile phone speech. As a result, speech input via a mobile phone can be recognized with high accuracy even when speech not intended as input to the apparatus, such as background voices or muttering, is mixed in.

In addition, by using background human voice standard patterns trained on speech that is distorted more heavily than normal input speech, the phenomenon in which a vocabulary entry, or part of one, is matched against the preceding or following background human voice standard patterns and misrecognized can be suppressed.

Furthermore, by inserting a standard pattern representing the boundary between the recognition vocabulary standard pattern and the background human voice standard pattern, the phenomenon in which the vocabulary portion is misrecognized as background voice is suppressed, and speech can be recognized with even higher accuracy.

The present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of a speech recognition apparatus according to the present invention; parts that are the same as or equivalent to those in FIG. 8 are given the same reference numerals.

In FIG. 1, speech input via a mobile phone is matched against recognition standard patterns 2 in the speech collation unit 1. The recognition standard patterns 2 are stored in advance for matching and consist of a combination of the recognition vocabulary standard patterns 2-1 and either the background human voice standard pattern group 2-3 or the background human voice garbage standard pattern 2-4. Since both the background human voice standard pattern group 2-3 and the background human voice garbage standard pattern 2-4 are standard patterns for background voices, they are collectively referred to below as background human voice standard patterns.

The background human voice standard pattern group 2-3 is created by training on low-level input speech that passes through the mobile phone transmission system and is distorted more heavily than normal input speech to the mobile phone. Here, such low-level input speech means speech whose level is sufficiently lower than that of a caller speaking normally into a mobile phone that it suffers large distortion (such as coding distortion) in the mobile phone transmission system.

For example, in the speech coding scheme used in mobile phones, as described in "ARIB standard STD-T64-C.S0014-0, Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems" (page 4-23, section 4.3.1.2), the code length used to encode the speech is determined by an equation that takes the speech input level into account. The shorter the code length, the greater the distortion that encoding adds to the speech. Therefore, for example, speech at levels that would be encoded with a short code length according to the decision equation used for code-length selection is used to train the background human voice standard patterns.
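To make the selection rule concrete, here is a rough sketch, not taken from the standard or the patent, of collecting training material by input level: frames whose level falls below a threshold stand in for speech that the codec's rate-decision rule would encode with a short code length. The threshold value, frame layout, and function name are assumptions for illustration.

```python
import numpy as np

def select_low_level_frames(frames, level_threshold_db=-30.0):
    """Collect frames whose input level falls below a threshold, as a
    proxy for the codec's rate-decision rule (threshold is an assumption).
    `frames` is an iterable of 1-D sample arrays; returns the selected frames."""
    selected = []
    for frame in frames:
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12   # avoid log(0)
        level_db = 20.0 * np.log10(rms)
        if level_db < level_threshold_db:
            # would be coded at a short code length -> heavy coding distortion
            selected.append(frame)
    return selected
```

The frames selected this way would then be passed through the actual transmission system and used as training data for the background human voice standard pattern group.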

The background human voice garbage standard pattern is created by training on the background voices contained in mobile phone speech. FIG. 2 shows the distribution of input levels for speech spoken into a mobile phone and for background voices. As the difference between the two distributions shows, background voices have a lower input level than speech directed at the mobile phone and therefore suffer large distortion (such as coding distortion) in the mobile phone transmission system.

FIGS. 3 and 4 show examples of the recognition standard patterns 2 used in the present invention. FIG. 3 shows recognition standard patterns 2 consisting of a combination of the recognition vocabulary standard patterns 2-1 and background human voice standard patterns (the background human voice standard pattern group) 2-3, and FIG. 4 shows recognition standard patterns 2 consisting of a combination of the recognition vocabulary standard patterns 2-1 and a background human voice standard pattern (the background human voice garbage standard pattern) 2-4. The speech collation unit 1 matches the input speech against the recognition standard patterns and outputs the vocabulary entry with the highest matching score (similarity) as the recognition result.

When the recognition standard patterns 2 of FIG. 3 are used, input speech delimited at both ends by silent sections is matched against recognition standard patterns 2 consisting of the background human voice standard patterns 2-3, the recognition vocabulary standard patterns (recognition vocabulary 1 to N) 2-1, and the background human voice standard patterns 2-3.

When the recognition standard patterns 2 of FIG. 4 are used, speech delimited at both ends by silent sections is matched against recognition standard patterns consisting of the background human voice standard pattern (background human voice garbage standard pattern) 2-4, the recognition vocabulary standard patterns (recognition vocabulary 1 to N) 2-1, and the background human voice standard pattern (background human voice garbage standard pattern) 2-4.

Background voices normally have a lower input level than speech directed at the mobile phone, so in the phone's speech coding they are subjected to coding distortion different from that of speech spoken into the phone. By using recognition standard patterns that include the background human voice standard pattern group 2-3 or the background human voice garbage standard pattern 2-4, created with this coding distortion taken into account, speech directed at the mobile phone can be recognized well even when background voices are mixed in.
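The effect of the composed pattern of FIGS. 3 and 4, background voice, then a vocabulary word, then background voice again, can be sketched as follows. This is an illustrative brute-force search over segmentations with per-frame scoring functions; a real recognizer would use HMMs and Viterbi decoding, and all names here are hypothetical.

```python
import numpy as np

def best_composed_score(features, word_frame_score, bg_frame_score):
    """Best score of the path [background][word][background] over all
    segmentations of `features` (brute force; real systems use Viterbi).
    The *_frame_score functions map one frame to a log-likelihood-like value."""
    n = len(features)
    bg = [bg_frame_score(f) for f in features]
    wd = [word_frame_score(f) for f in features]
    best = -np.inf
    for s in range(n):
        for e in range(s + 1, n + 1):       # the word occupies frames [s, e)
            score = sum(bg[:s]) + sum(wd[s:e]) + sum(bg[e:])
            best = max(best, score)
    return best

def recognize_with_background(features, word_models, bg_frame_score):
    """FIG. 3 / FIG. 4 style decoding: pick the vocabulary word whose
    background-word-background composition scores highest."""
    scores = {w: best_composed_score(features, m, bg_frame_score)
              for w, m in word_models.items()}
    return max(scores, key=scores.get)
```

Because the surrounding frames are absorbed by the background model, a vocabulary word spoken amid background voices can still win the comparison.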

FIG. 5 is a block diagram showing another embodiment of a speech recognition apparatus according to the present invention. In this embodiment, a plurality of speech collation units 1 (1-1, 1-2, ..., 1-n) are provided, and the input speech is fed to them in parallel. Each speech collation unit 1 matches the input speech against its own recognition standard patterns 2 (2-1, 2-2, ..., 2-n) and sends its matching result 1, 2, ..., n to a judgment unit 3.

Here, at least one of the speech collation units 1 (1-1, 1-2, ..., 1-n) uses, as in the embodiment of FIG. 1, recognition standard patterns consisting of recognition vocabulary standard patterns and background human voice standard patterns (the background human voice standard pattern group or the background human voice garbage standard pattern); the other speech collation units may use the recognition standard patterns of a conventional speech recognition apparatus, for example those of FIG. 9 or FIG. 10.

Each speech collation unit 1 (1-1, 1-2, ..., 1-n) outputs, as the result of matching the input speech against its recognition standard patterns, the vocabulary entry with the highest similarity together with its matching score Pi (i = 1 to n).

As shown in the equation below, the judgment unit 3 adds a preset value Wi to the matching score Pi (i = 1 to n) output by each speech collation unit 1 (1-1, 1-2, ..., 1-n), compares the resulting values Si, and outputs the vocabulary entry with the highest value as the recognition result m.

Si = log Pi + Wi

Here, the value Wi is a weight applied to each matching score Pi (i = 1 to n). For example, if the apparatus is expected to be used in an environment with much mixed-in speech not intended as input, such as background voices or muttering, the value Wi for the results of speech collation units using the noise-tolerant recognition standard patterns of FIGS. 3 and 4 is made large, and the value Wi for the results of speech collation units using the recognition standard patterns of FIGS. 9 and 10 is made small.
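The judgment rule Si = log Pi + Wi can be written down directly. The following is a minimal sketch; the function name and data layout are assumptions, not taken from the patent.

```python
import math

def combine_results(results, weights):
    """Judgment unit of FIG. 5: each collation unit i reports (word_i, P_i);
    compute S_i = log(P_i) + W_i and return the word with the largest S_i."""
    best_word, best_s = None, -math.inf
    for (word, p), w in zip(results, weights):
        s = math.log(p) + w
        if s > best_s:
            best_word, best_s = word, s
    return best_word
```

Raising the weight of a noise-tolerant collation unit lets its hypothesis win even when a conventional unit reports a slightly higher raw score.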

According to the embodiment of FIG. 5, by using a variety of recognition standard patterns and weighting the individual matching results, the speech recognition apparatus can be adapted so that recognition works well in the environment where it is used, whether that environment contains much or little speech not intended as input.

The embodiments above use recognition standard patterns in which the background human voice standard patterns and the recognition vocabulary standard patterns are directly adjacent. By instead using recognition standard patterns in which a boundary standard pattern representing the boundary is inserted between the recognition vocabulary standard pattern and the background human voice standard pattern, the phenomenon in which the vocabulary portion is misrecognized as background voice can be reduced further.

FIG. 6 shows an example of recognition standard patterns 2 in which boundary standard patterns 2-5 representing the boundary are inserted between the background human voice standard patterns 2-3 and the recognition vocabulary standard patterns 2-1; in the figure, GB denotes the background human voice garbage standard pattern. Input speech delimited at both ends by silent sections is matched against recognition standard patterns consisting of the background human voice standard pattern 2-3, a boundary standard pattern 2-5, a recognition vocabulary standard pattern 2-1, a boundary standard pattern 2-5, and the background human voice standard pattern 2-3. As a result of this matching, the vocabulary entry with the highest matching score is output as the recognition result.

FIG. 7 shows an example in which a standard pattern corresponding to silence is used as the boundary standard pattern 2-5 between the background human voice standard patterns 2-3 and the recognition vocabulary standard patterns 2-1. Input speech delimited at both ends by silent sections is matched against recognition standard patterns consisting of the background human voice standard pattern 2-3, a silence pattern (boundary standard pattern) 2-5, a recognition vocabulary standard pattern 2-1, a silence pattern (boundary standard pattern) 2-5, and the background human voice standard pattern 2-3. As a result of this matching, the vocabulary entry with the highest matching score is output as the recognition result.
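The pattern sequences of FIGS. 3, 4, 6, and 7 can be summarized as a small helper that builds one recognition path per vocabulary entry. The labels "BG" and "SIL" are illustrative placeholders for the background human voice model and the silence boundary model, not identifiers from the patent.

```python
def composed_pattern(word, use_silence_boundary=True):
    """Build the model sequence for one recognition path.
    FIG. 3/4 style: BG, word, BG.
    FIG. 6/7 style: BG, SIL, word, SIL, BG (silence as the boundary pattern)."""
    path = ["BG"]
    if use_silence_boundary:
        path.append("SIL")
    path.append(word)
    if use_silence_boundary:
        path.append("SIL")
    path.append("BG")
    return path
```

A decoder would score the input against each such path and output the word from the highest-scoring one; the silence entries keep vocabulary frames from being absorbed into the background model.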

The above describes embodiments that use the recognition standard patterns of FIG. 3 or FIG. 4 on their own, but a background human voice standard pattern formed by combining the background human voice standard pattern group and the background human voice garbage standard pattern in parallel can also be placed before and after the recognition vocabulary standard patterns 2-1. FIGS. 6 and 7 use the background human voice garbage standard pattern as the background human voice standard pattern, but the background human voice standard pattern group, or a combination of the group and the garbage standard pattern, can also be used.

INDUSTRIAL APPLICABILITY: The present invention is useful not only for voice portal services for mobile phones but also for spoken dialogue systems for call centers, spoken dialogue systems for car navigation, and the like.

FIG. 1 is a block diagram showing an embodiment of a speech recognition apparatus according to the present invention.
FIG. 2 is a characteristic diagram showing the distribution of input levels for speech input to a mobile phone and for background voices.
FIG. 3 shows an example of the recognition standard patterns used in the present invention.
FIG. 4 shows another example of the recognition standard patterns used in the present invention.
FIG. 5 is a block diagram showing another embodiment of a speech recognition apparatus according to the present invention.
FIG. 6 shows still another example of the recognition standard patterns used in the present invention.
FIG. 7 shows still another example of the recognition standard patterns used in the present invention.
FIG. 8 is a block diagram showing a conventional speech recognition apparatus.
FIG. 9 shows recognition standard patterns used in a conventional speech recognition apparatus.
FIG. 10 shows another set of recognition standard patterns used in a conventional speech recognition apparatus.

Explanation of symbols

1, 1-1 to 1-n: speech collation unit; 2, 2-1 to 2-n: recognition standard patterns; 2-1: recognition vocabulary standard pattern; 2-2: speech standard pattern; 2-3: background human voice standard pattern group; 2-4: background human voice garbage standard pattern; 2-5: boundary (silence) standard pattern; 3: judgment unit

Claims (4)

携帯電話を介して入力された音声と認識用標準パタンとを照合して認識結果を送出する音声照合部を有する音声認識装置において、
前記認識用標準パタンは、認識対象とする音声をモデル化した認識語彙標準パタンと該認識語彙標準パタンの前後に付加された背景人声標準パタンとを含む認識用標準パタンであり、
前記背景人声標準パタンは、携帯電話伝送系を介して入力され、携帯電話への通常の入力音声と比較して大きな歪みを受ける低レベルの入力音声を元に学習して得られた背景人声標準パタン群、あるいは携帯電話音声に含まれる背後の人声を元に学習して得られた背景人声ガーベージ標準パタンであることを特徴とする音声認識装置。
In a speech recognition apparatus having a speech collation unit that collates speech input via a mobile phone with a recognition standard pattern and sends a recognition result,
The recognition standard pattern is a recognition standard pattern including a recognition vocabulary standard pattern that models speech to be recognized and a background human voice standard pattern added before and after the recognition vocabulary standard pattern,
The background human voice standard pattern is obtained through learning based on low-level input speech that is input via a mobile phone transmission system and receives large distortion compared to normal input speech to the mobile phone. A voice recognition device characterized by being a voice standard pattern group or a background human voice garbage standard pattern obtained by learning based on a human voice behind a mobile phone voice.
携帯電話を介して入力された音声と認識用標準パタンとを照合して認識結果を送出する複数の音声照合部と判定部とを有する音声認識装置において、
前記複数の音声照合部のうちの少なくとも1つにおける認識用標準パタンは、認識対象とする音声をモデル化した認識語彙標準パタンと、該認識語彙標準パタンの前後に付加された背景人声標準パタンとを含む認識用標準パタンであり、
前記背景人声標準パタンは、携帯電話伝送系を介して入力され、携帯電話への通常の入力音声と比較して大きな歪みを受ける低レベルの入力音声を元に学習して得られた背景人声標準パタン群、あるいは携帯電話音声に含まれる背後の人声を元に学習して得られた背景人声ガーベージ標準パタンであり、
前記判定部は、前記複数の音声照合部より送出される照合スコアの各々に対して予め定めた値を加算し、その結果得られた値を互いに比較して最も値が高い値を送出した音声照合部での認識語彙を認識結果として送出することを特徴とする音声認識装置。
In a speech recognition apparatus having a plurality of speech collation units and a determination unit that collate speech input via a mobile phone with a standard pattern for recognition and send a recognition result,
The recognition standard pattern in at least one of the plurality of speech collating units includes a recognition vocabulary standard pattern that models speech to be recognized, and a background human voice standard pattern that is added before and after the recognition vocabulary standard pattern. A standard pattern for recognition including
The background human voice standard pattern is input through a mobile phone transmission system, and a background person obtained by learning based on low-level input voice that receives large distortion compared to normal input voice to a mobile phone. Voice standard pattern group, or background human voice garbage standard pattern obtained by learning based on the human voice behind mobile phone voice,
The determination unit adds a predetermined value to each of the collation scores sent from the plurality of voice collation units, compares the obtained values with each other, and sends the highest value. A speech recognition apparatus characterized in that a recognition vocabulary in a collation unit is transmitted as a recognition result.
The speech recognition apparatus according to claim 1 or 2, characterized in that, in the recognition standard pattern, a boundary standard pattern representing the boundary between the recognition vocabulary standard pattern and the background human voice standard pattern is inserted between the two. The speech recognition apparatus according to claim 3, characterized in that the boundary standard pattern is a standard pattern corresponding to silence.
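The claims above can be illustrated with a minimal sketch (all names are invented for illustration and are not from the patent): (1) a recognition standard pattern built by adding background human voice patterns, separated by silence boundary patterns, before and after the recognition vocabulary pattern, and (2) a determination unit that adds a predetermined offset to each matching unit's score and outputs the vocabulary word of the unit with the highest adjusted score.

```python
# Hypothetical illustration of the claimed structure and decision rule.

def build_recognition_pattern(vocab, background="bg_voice", boundary="sil"):
    """Concatenate sub-patterns as in claims 1, 3 and 4: background voice
    pattern - silence boundary - vocabulary pattern - silence boundary -
    background voice pattern."""
    return [background, boundary, vocab, boundary, background]

def decide(match_results, offsets):
    """Determination unit of claim 2.

    match_results: (collation_score, recognized_word) from each matching unit.
    offsets: the predetermined value added to each unit's score.
    Returns the recognition vocabulary of the highest-scoring unit."""
    adjusted = [(score + off, word)
                for (score, word), off in zip(match_results, offsets)]
    best_score, best_word = max(adjusted)
    return best_word

print(build_recognition_pattern("hello"))
# -> ['bg_voice', 'sil', 'hello', 'sil', 'bg_voice']
print(decide([(10.0, "hello"), (9.0, "stop")], [0.0, 2.0]))
# -> 'stop'  (9.0 + 2.0 beats 10.0 + 0.0)
```

The per-unit offsets act as a bias that lets the determination unit favor one acoustic-condition-specific matcher over another before comparing scores.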
JP2003312385A 2003-09-04 2003-09-04 Voice recognition device Expired - Fee Related JP4526057B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003312385A JP4526057B2 (en) 2003-09-04 2003-09-04 Voice recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003312385A JP4526057B2 (en) 2003-09-04 2003-09-04 Voice recognition device

Publications (2)

Publication Number Publication Date
JP2005084071A true JP2005084071A (en) 2005-03-31
JP4526057B2 JP4526057B2 (en) 2010-08-18

Family

ID=34413655

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003312385A Expired - Fee Related JP4526057B2 (en) 2003-09-04 2003-09-04 Voice recognition device

Country Status (1)

Country Link
JP (1) JP4526057B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008089625A (en) * 2006-09-29 2008-04-17 Honda Motor Co Ltd Voice recognition apparatus, voice recognition method and voice recognition program
WO2013145578A1 (en) * 2012-03-30 2013-10-03 日本電気株式会社 Audio processing device, audio processing method, and audio processing program
CN111785302A (en) * 2020-06-23 2020-10-16 北京声智科技有限公司 Speaker separation method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01309099A (en) * 1987-06-04 1989-12-13 Ricoh Co Ltd Speech responding device
JP2000284792A (en) * 1999-03-31 2000-10-13 Canon Inc Device and method for recognizing voice
JP2003108188A (en) * 2001-09-28 2003-04-11 Kddi Corp Voice recognizing device

Also Published As

Publication number Publication date
JP4526057B2 (en) 2010-08-18

Similar Documents

Publication Publication Date Title
US11776540B2 (en) Voice control of remote device
US11455995B2 (en) User recognition for speech processing systems
US10600414B1 (en) Voice control of remote device
US10593328B1 (en) Voice control of remote device
US20030120486A1 (en) Speech recognition system and method
CN109192202B (en) Voice safety recognition method, device, computer equipment and storage medium
JP4838351B2 (en) Keyword extractor
KR100923896B1 (en) Method and apparatus for transmitting speech activity in distributed voice recognition systems
US6792408B2 (en) Interactive command recognition enhancement system and method
US6836758B2 (en) System and method for hybrid voice recognition
JP2001188784A (en) Device and method for processing conversation and recording medium
KR20070009688A (en) Detection of end of utterance in speech recognition system
US7676364B2 (en) System and method for speech-to-text conversion using constrained dictation in a speak-and-spell mode
KR20080049826A (en) A method and a device for speech recognition
EP1525577B1 (en) Method for automatic speech recognition
US7509257B2 (en) Method and apparatus for adapting reference templates
JP4526057B2 (en) Voice recognition device
US20070129945A1 (en) Voice quality control for high quality speech reconstruction
Venkatagiri Speech recognition technology applications in communication disorders
CN112565242B (en) Remote authorization method, system, equipment and storage medium based on voiceprint recognition
JP2003177788A (en) Audio interactive system and its method
JP2005092310A (en) Voice keyword recognizing device
JP2003044085A (en) Dictation device with command input function
KR100931790B1 (en) Recognition dictionary generation method using phonetic name list in speech recognition system and method of processing similar phonetic name using same
JPH086590A (en) Voice recognition method and device for voice conversation

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20060901

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20090929

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20091014

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20091211

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20100526

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20100528

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130611

Year of fee payment: 3

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

Ref document number: 4526057

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

LAPS Cancellation because of no payment of annual fees