JP2003108188A

JP2003108188A - Voice recognizing device

Info

Publication number: JP2003108188A
Application number: JP2001303696A
Authority: JP
Inventors: Tsuneo Kato; 恒夫加藤; Masaki Naito; 正樹内藤; Toru Shimizu; 徹清水
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2001-09-28
Filing date: 2001-09-28
Publication date: 2003-04-11

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device that recognizes speech with high accuracy by accurately discriminating CODEC-dependent non-stationary noise even when such noise is contained in the input speech. SOLUTION: Viterbi search units 2-1 to 2-N execute Viterbi search algorithms using a speech HMM, a silent HMM, and a noise HMM, and the noise HMM of each of the Viterbi search units 2-1 to 2-N is modeled depending on a CODEC of a respective type. A maximum likelihood search result selection unit 4 selects the most likely search result from among the search results of the Viterbi search units 2-1 to 2-N, and a recognition result output unit 5 outputs the selected search result as the recognition result.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice recognition device, and more particularly to a voice recognition device that can recognize speech with high accuracy even when the speech has undergone the nonlinear distortion specific to a coder/decoder (hereinafter, CODEC) of any of various types.

[0002]

[Prior Art] Conventionally, mobile phones and similar equipment have used voice recognition devices that recognize speech by a Viterbi search (maximum likelihood path search) algorithm using hidden Markov models (hereinafter, HMMs).

[0003] For example, Japanese Patent Laid-Open No. 9-230886 describes a speech recognition method that outputs a recognition result from the similarity between input speech and a speech HMM. In that method, noise-robust speech HMMs for various kinds of noise are created from training speech recorded in each real environment, and the recognition result is output from the similarity between these noise-robust speech HMMs and the input speech.

[0004] Japanese Patent Laid-Open No. 2000-284792 describes creating in advance, by training with a noise database, non-speech HMM parameters for each telephone type, such as outside-line telephones, extension telephones, ordinary telephones, and digital mobile phones; using these non-speech HMM parameters to determine which telephone type an acoustic signal comes from; and then performing speech recognition with the acoustic model for that telephone type. It also states that the speech recognition can be performed by, for example, a Viterbi search algorithm using HMMs.

[0005] It has also been proposed to collect and model various kinds of noise that fluctuate over short periods (hereinafter, non-stationary noise), for example unwanted sounds made by the speaker such as coughs, sneezes, and interjections, voices of people nearby, knocking sounds, footsteps, vehicle engine sounds, and noise caused by radio disturbance, and to use the resulting models as noise HMMs so that sections of non-stationary noise are not misrecognized as speech sections. See T. Schultz, I. Rogina, "Acoustic and Language Modeling of Human and Nonhuman Noises for Human-to-human Spontaneous Speech Recognition," Proc. ICASSP 95, pp. 293-296 (199), and T. Yamada et al., "Voice Activity Detection using Non-speech Models and HMM composition," Proc. HSC 2001, pp. 131-134 (2001).

[0006]

[Problems to Be Solved by the Invention] Mobile phones and similar equipment use CODECs of various types depending on the kind of phone. For example, some use CS-ACELP (Conjugate-Structure Algebraic Code Excited Linear Prediction), a coding scheme based on CELP (Code Excited Linear Prediction), while others, such as cdmaOne phones, use EVRC (Enhanced Variable Rate Codec). Because CODECs of different types have different structures, each has its own nonlinear characteristics, and a signal that has passed through a CODEC undergoes CODEC-dependent nonlinear distortion. In mobile phones in particular, narrow-bandwidth CODECs are used and quantization is performed with a small number of quantization levels, so the nonlinear distortion in the transmitted speech is large, and the distortion also differs greatly depending on which type of CODEC the other party's device uses.

[0007] Noise with CODEC-dependent nonlinear distortion is a factor that, in speech recognition, causes the start of a speech section in particular to be detected earlier than it actually occurs, leads to misidentification of speech sections, and increases the rate of speech misrecognition. The prior proposals above, however, merely use noise HMMs created by modeling various kinds of non-stationary noise and give no consideration to the fact that the noise in the input speech has CODEC-dependent nonlinear distortion. With those proposals, therefore, the accuracy of speech section identification and of speech recognition is degraded because the noise is CODEC-dependent. This problem is not limited to devices that recognize speech from mobile phones; to varying degrees, it is common to any voice recognition device that receives speech signals that have passed through CODECs of various types.

[0008] The object of the present invention is to provide a voice recognition device that recognizes speech by a Viterbi search algorithm using HMMs and that can identify noise sections with high accuracy, and thus identify speech sections and recognize speech with high accuracy, even when the input speech contains non-stationary noise with CODEC-dependent nonlinear distortion.

[0009]

[Means for Solving the Problems] To solve the above problems, the present invention comprises a plurality of Viterbi search units, a maximum likelihood search result selection unit that selects the search result with the highest likelihood from among the search results of the plurality of Viterbi search units, and a recognition result output unit that outputs the selected search result as the recognition result. A first feature of the invention is that each of the plurality of Viterbi search units includes a speech HMM and a noise HMM and executes a Viterbi search algorithm using them, and that the noise HMM of each of the plurality of Viterbi search units is modeled depending on a respective one of the various types of CODECs. A second feature is that each of the plurality of Viterbi search units further includes a silent HMM. A third feature is that the speech HMM is also modeled depending on a respective one of the various types of CODECs. A fourth feature is that the noise HMM is obtained by training with a noise database obtained by cutting out the noise sections of speech input through each of the various types of CODECs. A fifth feature is that the speech input to the voice input unit is input through a telephone line.

[0010] According to the first feature, noise sections can be identified with high accuracy even when the input speech contains non-stationary noise with CODEC-dependent nonlinear distortion, so speech sections can be identified and speech recognized with high accuracy. According to the second and third features, speech sections can be identified and speech recognized with still higher accuracy. According to the fourth feature, noise HMMs that depend on each of the various types of CODECs can be created easily. According to the fifth feature, speech sections can be identified and speech recognized with high accuracy, particularly for speech from mobile phones and the like input through a telephone line.

[0011]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of the present invention. In FIG. 1, speech input through the voice input unit 1 is fed to a plurality of Viterbi search units 2-1, 2-2, ..., 2-N and processed in parallel.

[0012] The voice input unit 1 captures, for example, speech transmitted over a telephone line from the other party's device during a mobile phone call. The captured speech has undergone CODEC-dependent nonlinear distortion corresponding to the CODEC used by the other party's device.

[0013] The Viterbi search units 2-1, 2-2, ..., 2-N are described in detail later. Each of them includes a speech HMM, a silent HMM, and a noise HMM. By executing a Viterbi search algorithm with these models, each unit identifies the speech sections, silent sections, and noise sections of the input speech and recognizes the phonemes making up the speech in the speech sections; it then recognizes the speech by applying the speech recognition dictionary/grammar 3 to this result. The speech HMM, silent HMM, and noise HMM are created and stored in advance, before recognition processing. A silent section is by nature a section with no speech; if such sections do not affect the identification of speech sections and are identified as a by-product of speech section identification, a dedicated silent HMM need not be provided.

[0014] Each of the Viterbi search units 2-1, 2-2, ..., 2-N constitutes a speech recognition system corresponding to one of the various types of CODECs. From among the search results of the Viterbi search units 2-1, 2-2, ..., 2-N, the one with the highest likelihood is selected by the maximum likelihood search result selection unit 4, and the selected search result is output from the recognition result output unit 5 as the recognition result.
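The selection step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the device: the names `SearchResult`, `select_most_likely`, and `recognize`, and the choice to represent each Viterbi search unit as a callable that returns a log-likelihood, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SearchResult:
    codec: str             # CODEC whose models produced this result (hypothetical field)
    hypothesis: str        # recognized word or phoneme sequence
    log_likelihood: float  # score of the best Viterbi path


def select_most_likely(results: Sequence[SearchResult]) -> SearchResult:
    """Role of selection unit 4: keep the result with the highest likelihood."""
    if not results:
        raise ValueError("at least one search result is required")
    return max(results, key=lambda r: r.log_likelihood)


def recognize(features, search_units: Sequence[Callable]) -> SearchResult:
    """Feed the same input to every search unit (2-1 .. 2-N), then select."""
    results = [unit(features) for unit in search_units]
    return select_most_likely(results)
```

Here each element of `search_units` stands in for one CODEC-specific Viterbi search unit; the units are called one after another for simplicity, although nothing prevents running them concurrently.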

[0015] The speech HMM is a speech model used to identify the speech sections of the input speech and to recognize the phonemes making up that speech; it consists of a set of phoneme HMMs for the phonemes to be recognized. The silent HMM is a silence model used to identify the silent sections of the input speech.

[0016] The noise HMM is a noise model used to identify noise sections in the input speech; it consists of a set of noise HMMs, one for each of various kinds of non-stationary noise. Because the noise in the input speech is CODEC-dependent according to the type of the other party's device, the noise HMM is modeled depending on the CODEC. For this reason, a plurality of Viterbi search units 2-1, 2-2, ..., 2-N are provided, one for each of the various types of CODECs. That is, when calls are expected with various devices using CODEC 1, CODEC 2, ..., CODEC N, there are provided a Viterbi search unit 2-1 including a noise HMM modeled depending on CODEC 1, a Viterbi search unit 2-2 including a noise HMM modeled depending on CODEC 2, ..., and a Viterbi search unit 2-N including a noise HMM modeled depending on CODEC N. It is not strictly necessary to provide a Viterbi search unit and a noise HMM for the CODEC of every possible device; they may be provided only for the CODECs used in typical models, which simplifies the configuration without causing practical problems.

[0017] The Viterbi search algorithm executed in parallel by the Viterbi search units 2-1, 2-2, ..., 2-N compares the input speech with the phoneme HMMs making up the speech HMM, the silent HMM, and the noise HMMs for the various kinds of non-stationary noise, computes the similarity, and on the basis of these values identifies speech sections, silent sections, and noise sections and recognizes the phonemes making up the speech. Because noise HMMs modeled depending on the CODEC are used in this processing, non-stationary noise is recognized and noise sections are identified with the influence of the CODEC-dependent nonlinear distortion removed, so speech sections are identified and speech is recognized with high accuracy and misrecognition is reduced.
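For readers unfamiliar with the search itself, the following is a generic log-domain Viterbi recursion, not code taken from this publication. It assumes that the per-frame similarity of each model state has already been computed as a log score; the function name `viterbi` and the argument layout are illustrative choices.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return (best log score, best state path).

    log_init[s]      : log probability of starting in state s
    log_trans[p][s]  : log probability of moving from state p to state s
    log_emit[t][s]   : log similarity of frame t to state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

With three states standing for silence, noise, and speech, the returned path gives one label per frame, and the returned score is the quantity that would be compared across the CODEC-specific search units.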

[0018] In the embodiment described above, it is specifically the noise HMM that is modeled depending on the CODEC. In addition, the speech HMM can also be modeled depending on the CODEC, which further improves the accuracy of speech section identification and speech recognition.

[0019] FIG. 2 is a waveform diagram showing a specific example of input speech, and FIG. 3 is an explanatory diagram showing a specific example of the Viterbi search algorithm executed on the speech of FIG. 2. This example assumes that a silent section, a noise section, a silent section, a speech section, a silent section, a noise section, and a silent section follow one another in that order, and that one of the words "aka" (red), "ao" (blue), or "kiiro" (yellow) is spoken in the speech section. "sil" denotes the silent HMM, "noise1" to "noise3" denote the noise HMMs for each kind of non-stationary noise, and "a", "k", "o", and so on denote phoneme HMMs included in the speech HMM.
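The model sequences that such a search can follow in this example can be enumerated with a short Python sketch. It is only an illustration of the network structure: the romanized phoneme spellings in `LEXICON` (in particular "k i i r o" for "kiiro") and the function name `candidate_paths` are assumptions, and a real decoder would search the network frame by frame rather than list every path.

```python
from itertools import product

# Romanized phoneme spellings are an assumption made for illustration.
LEXICON = {
    "aka": ["a", "k", "a"],
    "ao": ["a", "o"],
    "kiiro": ["k", "i", "i", "r", "o"],
}
NOISE_MODELS = ["noise1", "noise2", "noise3"]


def candidate_paths(lexicon, noise_models):
    """List (word, model sequence) pairs for sil-noise-sil-word-sil-noise-sil."""
    paths = []
    for (word, phonemes), pre, post in product(lexicon.items(), noise_models, noise_models):
        sequence = ["sil", pre, "sil", *phonemes, "sil", post, "sil"]
        paths.append((word, sequence))
    return paths
```

Three words combined with three possible noise models before and after the word give 27 candidate sequences; the Viterbi search effectively picks the one whose models best match the input.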

[0020] First, the initial silent state of the input speech is identified because the similarity between that speech and the silent HMM "sil" is the maximum likelihood, and identification using the silent HMM "sil" continues throughout the silent section. That is, the silent section is identified by processing that continues through the self-loop of the silent HMM "sil".

[0021] When non-stationary noise is input, the similarity with a noise HMM becomes the maximum likelihood, so processing leaves the loop of the silent HMM "sil" and passes along a transition path into noise section identification. This processing identifies the noise section by continuing through a self-loop on the basis that the similarity between the noise in the input speech and one of the noise HMMs "noise1" to "noise3" is the maximum likelihood. Which of the noise HMMs "noise1" to "noise3" is entered and continued depends on the kind of non-stationary noise, but in any case this processing identifies the noise section. The following silent section is identified in the same way as the first silent section.

[0022] Next, suppose for example that the word "aka" is spoken; in that case the recognition procedure shown in the upper row of FIG. 3 is executed. That is, when the sound "a" is input, "a" is identified and recognized because its similarity with the phoneme HMM "a" is the maximum likelihood. When "ka" is then input, "k" and "a" are identified in turn because the similarity with the phoneme HMMs "k" and "a" is the maximum likelihood, and "ka" is recognized. The middle row of the speech section in FIG. 3 shows the recognition procedure executed when "ao" is spoken, and the lower row shows the procedure executed when "kiiro" is spoken.

[0023] Speech sections are identified and speech is recognized as described above, and the subsequent silent section, noise section, and silent section are identified in the same way as described earlier. The problem that noise preceding a speech section is identified as part of the speech section, so that the detected start of the speech section is earlier than the actual start, is reduced by using noise HMMs modeled depending on the CODEC. The search algorithm described above is only an example; how the HMMs used in the Viterbi search are concatenated and which phoneme HMMs are used naturally depend on the input speech.
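Once a search has assigned a model label to every frame, the sections and the start of the speech section follow from grouping consecutive labels. The sketch below assumes that labeling is already available as a list of strings; the function names `kind`, `segment`, and `speech_start` and the frame-index representation are hypothetical.

```python
from itertools import groupby


def kind(label):
    """Map a model label to its section type."""
    if label == "sil":
        return "silence"
    if label.startswith("noise"):
        return "noise"
    return "speech"  # any phoneme HMM label


def segment(frame_labels):
    """Group per-frame labels into (section type, start frame, end frame) runs."""
    sections = []
    start = 0
    for section_type, run in groupby(frame_labels, key=kind):
        length = len(list(run))
        sections.append((section_type, start, start + length))
        start += length
    return sections


def speech_start(frame_labels):
    """Return the first frame of the first speech section, or None."""
    for section_type, start, _end in segment(frame_labels):
        if section_type == "speech":
            return start
    return None
```

If frames of noise ahead of the word were absorbed by phoneme models instead of a noise model, `speech_start` would return an earlier frame, which is the early start-point detection the description aims to reduce.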

[0024] FIG. 4 is a flowchart showing a specific example of a method of creating the noise HMMs used in the present invention. Steps S1 to S5 are performed for each of the various CODECs. The figure illustrates an example in which CODEC 1 is assumed to be CS-ACELP and CODEC 2 is assumed to be EVRC, and a noise HMM is created for each. The following describes the creation of the noise HMM for CS-ACELP.

[0025] First, speech containing various kinds of non-stationary noise is collected through CS-ACELP in locations such as streets, inside buildings, offices, parks, urban roadsides, and station premises (step S1). The various kinds of non-stationary noise in the collected speech contain the nonlinear distortion specific to CS-ACELP. Next, the non-stationary noise in the noise sections of the collected speech is cut out and classified by type of noise source (step S2). The non-stationary noise can be cut out by, for example, extracting the sections where the level is high, and the classification by noise source can be based on characteristics of the noise from each source, such as level, frequency, and waveform. This yields a noise database of non-stationary noise containing the nonlinear distortion that arises when CS-ACELP is in the path.
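The cutting-out part of step S2 is described only as extracting high-level sections, so the following is one possible reading rather than the method actually used: a fixed-length frame energy is compared against a threshold, and runs of frames above it are returned as sample ranges. The frame length, the threshold, and the function names are all assumptions.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def cut_high_level_sections(samples, frame_len, threshold):
    """Return (start sample, end sample) for runs of frames above the threshold."""
    sections = []
    run_start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy > threshold and run_start is None:
            run_start = idx                      # a high-level run begins
        elif energy <= threshold and run_start is not None:
            sections.append((run_start * frame_len, idx * frame_len))
            run_start = None                     # the run has ended
    if run_start is not None:                    # run continues to the last frame
        sections.append((run_start * frame_len, len(energies) * frame_len))
    return sections
```

The ranges returned this way would still need to be classified by noise source, which the description bases on level, frequency, and waveform characteristics.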

[0026] Next, HMM training is performed using this noise database (step S3) to obtain HMM parameters. The HMM parameters obtained in this way are stored in ROM or RAM (step S4) and serve as the noise HMM of the Viterbi search unit for CS-ACELP (step S5).

[0027] Noise HMMs for EVRC or other CODECs can be created by the same steps. Because a noise HMM created in this way is obtained by training with a noise database of non-stationary noise containing CODEC-dependent nonlinear distortion, it is modeled depending on the respective CODEC.
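The per-CODEC training loop can be illustrated with a deliberately simplified stand-in. The text does not specify the HMM topology or the training algorithm, so the sketch below swaps in a single-state model per noise type (a diagonal Gaussian plus a self-loop probability estimated from segment lengths) instead of full HMM training such as Baum-Welch. The database layout `database[codec][noise_type]` and all function names are assumptions.

```python
from statistics import fmean, pvariance


def train_single_state_model(segments):
    """Estimate a one-state noise model from segments of feature vectors.

    segments: list of segments; each segment is a list of equal-length
    feature vectors (lists of floats).
    """
    frames = [frame for seg in segments for frame in seg]
    if not frames:
        raise ValueError("no training frames")
    dims = list(zip(*frames))                      # one tuple per feature dimension
    mean = [fmean(d) for d in dims]
    var = [max(pvariance(d), 1e-6) for d in dims]  # floor to avoid zero variance
    # Each segment leaves the state once; every other frame is a self-loop.
    self_loop = (len(frames) - len(segments)) / len(frames)
    return {"mean": mean, "var": var, "self_loop": self_loop}


def train_noise_models(database):
    """Train one model per noise type, separately for each CODEC."""
    return {
        codec: {
            noise_type: train_single_state_model(segments)
            for noise_type, segments in by_type.items()
        }
        for codec, by_type in database.items()
    }
```

The point of the structure is that the data for CS-ACELP and EVRC are never pooled: each CODEC's noise models are estimated only from noise that passed through that CODEC.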

[0028] The present invention can be used as a voice recognition device for voice calls, where mobile phones are widely used, but it is not limited to this and can be used as any voice recognition device that receives speech signals from CODECs of various types.

[0029]

[Effects of the Invention] As is clear from the above description, according to the invention of claim 1, even when the input speech contains CODEC-dependent non-stationary noise, that noise can be identified with high accuracy, and speech sections can be identified and speech recognized with high accuracy. According to the inventions of claims 2 and 3, speech sections can be identified and speech recognized with still higher accuracy. According to the invention of claim 4, noise HMMs that depend on each of the various types of CODECs can be created easily. According to the invention of claim 5, speech sections can be identified and speech recognized with high accuracy, particularly for speech from mobile phones and the like input through a telephone line.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a waveform diagram showing a specific example of input speech.

FIG. 3 is an explanatory diagram showing a specific example of the Viterbi search algorithm for the speech of FIG. 2.

FIG. 4 is a flowchart showing a specific example of a method of creating the noise HMMs used in the present invention.

[Explanation of symbols]

1: voice input unit; 2-1 to 2-N: Viterbi search units; 3: speech recognition dictionary/grammar; 4: maximum likelihood search result selection unit; 5: recognition result output unit

Continuation of front page: (72) Inventor: Toru Shimizu, c/o KDDI R&D Laboratories, Inc., 2-1-15 Ohara, Kamifukuoka-shi, Saitama. F-term (reference): 5D015 GG00 HH23 KK02

Claims


1. A voice recognition device based on a Viterbi search algorithm using hidden Markov models (HMMs), comprising: a voice input unit to which a voice is input; a plurality of Viterbi search units that process the signal from the voice input unit in parallel; a maximum likelihood search result selection unit that selects the search result with the highest likelihood from among the search results of the plurality of Viterbi search units; and a recognition result output unit that outputs the search result selected by the maximum likelihood search result selection unit as a recognition result, wherein each of the plurality of Viterbi search units includes a speech HMM and a noise HMM and executes the Viterbi search algorithm using them, and the noise HMM of each of the plurality of Viterbi search units is modeled depending on a respective one of various types of CODECs.

2. The voice recognition device according to claim 1, wherein each of the plurality of Viterbi search units further includes a silent HMM.

3. The voice recognition device according to claim 1 or 2, wherein the speech HMM is also modeled depending on a respective one of the various types of CODECs.

4. The voice recognition device according to claim 1, wherein the noise HMM is obtained by training with a noise database obtained by cutting out the noise sections of speech input through each of the various types of CODECs.

5. The voice recognition device according to claim 1, wherein the voice input to the voice input unit is input via a telephone line.