JPH10198394A

JPH10198394A - Voice recognition method

Info

Publication number: JPH10198394A
Application number: JP9002647A
Authority: JP
Inventors: Hiroo Ikura; 啓雄居倉
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1997-01-10
Filing date: 1997-01-10
Publication date: 1998-07-31

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition method that can recognize noise-superimposed speech with high probability even when the type of noise and the S/N ratio change simultaneously during an utterance. SOLUTION: A speech recognition device comprises a speech signal input part 1, a feature extraction part 2, a data storage part 3 and a recognition result determination part 4. When the word HMMs (hidden Markov models) used for recognition are created, a noise HMM created in advance from several kinds of noise is used and, in addition, S/N ratios at a plurality of levels are taken into account. Speech can thereby be recognized with high probability even when the kind of superimposed noise and the S/N ratio change partway through.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a speech recognition method using a hidden Markov model (HMM).

[0002]

[Prior art] HMM-based methods are widely used for automatic speech recognition by computer. An HMM is, in general, a nondeterministic stochastic finite automaton with multiple states, each of which is a stochastic stationary signal source. In other words, an HMM can be regarded as a non-stationary signal source that emits a signal while switching nondeterministically among stochastic stationary signal sources.

[0003] One technique used for speech recognition with HMMs is called the maximum likelihood estimation method. To use HMMs for word recognition, an HMM is first prepared for each word to be recognized, and the internal parameters that define each HMM are adjusted so that the HMM readily outputs the feature parameter sequences extracted from speech samples of its own word. In the maximum likelihood estimation method, when unknown speech is input, the ease with which each HMM outputs the feature parameter sequence extracted from that speech (the likelihood) is calculated, and the word corresponding to the HMM that yields the maximum likelihood is taken as the recognition result.
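The recognition rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the source: the HMMs emit discrete symbols rather than feature vectors, the likelihood is computed with the standard forward algorithm in the log domain, and the two toy word models ("up" and "down") and the names `log_likelihood` and `recognize` are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def safe_log(p):
    return math.log(p) if p > 0 else -math.inf


def log_likelihood(hmm, observations):
    """Forward algorithm: log P(observations | hmm) for a discrete-output HMM."""
    initial, transition, emission = hmm["initial"], hmm["transition"], hmm["emission"]
    n = len(initial)
    alpha = [safe_log(initial[s]) + safe_log(emission[s][observations[0]]) for s in range(n)]
    for symbol in observations[1:]:
        alpha = [
            log_sum_exp([alpha[p] + safe_log(transition[p][s]) for p in range(n)])
            + safe_log(emission[s][symbol])
            for s in range(n)
        ]
    return log_sum_exp(alpha)


def recognize(word_hmms, observations):
    """Return the word whose HMM gives the maximum likelihood, plus all scores."""
    scores = {word: log_likelihood(hmm, observations) for word, hmm in word_hmms.items()}
    return max(scores, key=scores.get), scores


# Hypothetical two-state left-to-right models over the symbols 0 and 1.
left_to_right = {"initial": [1.0, 0.0], "transition": [[0.6, 0.4], [0.0, 1.0]]}
word_hmms = {
    "up": dict(left_to_right, emission=[[0.9, 0.1], [0.1, 0.9]]),
    "down": dict(left_to_right, emission=[[0.1, 0.9], [0.9, 0.1]]),
}
best_word, scores = recognize(word_hmms, [0, 0, 1, 1])
```

In a real recognizer the emission term would be a probability density over feature vectors, but the final step stays the same: score every word HMM and pick the maximum.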

[0004] One method for recognizing noise-superimposed word speech using HMMs whose recognition unit is the word is the method using the NOVO-HMM, proposed by Franc Martin in "Recognition of Noisy Speech by Composition of Hidden Markov Models" (IEICE Technical Report SP92-96). In this method, the internal parameters of a noise HMM and a word HMM are combined by a technique that the report calls NOVO conversion, and the resulting NOVO-HMM is used to recognize noise-superimposed word speech with high accuracy.

[0005] FIG. 6 is a conceptual diagram of conventional NOVO conversion. In the conventional method, a recognition target word HMM is generated by training on training sample data for each recognition target word, and a noise HMM is generated by training on training sample data for one type of noise. The recognition target word HMMs and the noise HMM are then combined by NOVO conversion to obtain a NOVO-HMM for each recognition target word.
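The text does not reproduce the internals of NOVO conversion, so the following Python sketch only illustrates the general idea of composing two HMMs into one model over pairs of states. It assumes, as a simplification not stated in the source, that the word chain and the noise chain evolve independently, so that each combined transition probability is the product of the two individual ones. The function name `compose_transitions` and the example matrices are hypothetical.

```python
def compose_transitions(word_trans, noise_trans):
    """Transition matrix over (noise state, word state) pairs.

    Composite states are numbered noise_index * n_word + word_index, so the
    result reads as one copy ("row") of the word HMM per noise state.
    """
    n_word, n_noise = len(word_trans), len(noise_trans)
    size = n_word * n_noise
    combined = [[0.0] * size for _ in range(size)]
    for ni in range(n_noise):
        for wi in range(n_word):
            for nj in range(n_noise):
                for wj in range(n_word):
                    combined[ni * n_word + wi][nj * n_word + wj] = (
                        noise_trans[ni][nj] * word_trans[wi][wj]
                    )
    return combined


# Hypothetical three-state left-to-right word HMM and two-state noise HMM.
word_trans = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
noise_trans = [[0.9, 0.1], [0.2, 0.8]]
combined = compose_transitions(word_trans, noise_trans)
```

The combined model has six states here (two noise states times three word states), which is why adding noise states multiplies the cost of every likelihood calculation.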

[0006]

[Problems to be solved by the invention] To obtain a high recognition rate with the conventional NOVO-HMM recognition method, the noise used when creating the NOVO-HMM, that is, the noise taken into account during recognition, must not change greatly during the utterance. If the type of noise or the S/N ratio changes greatly partway through the utterance, the recognition rate drops sharply.

[0007] Accordingly, an object of the present invention is to provide a speech recognition method that can recognize noise-superimposed speech with high probability even when the type of noise and the S/N ratio change simultaneously during an utterance.

[0008]

[Means for solving the problems] The invention according to claim 1 is a speech recognition method for recognizing noise-superimposed speech using HMMs whose recognition unit is the word. When the word HMMs used for recognition are generated, a noise HMM generated in advance from a plurality of types of noise is used, and a plurality of S/N ratio levels are also taken into account, so that speech is recognized with high probability even when the type of superimposed noise and the S/N ratio change partway through.

[0009] The invention according to claim 2 is a speech recognition method for recognizing noise-superimposed speech using HMMs whose recognition unit is the word. When the likelihood of each word HMM generated in consideration of a plurality of noises and a plurality of S/N ratio levels is calculated, the type of noise on the path that gives the maximum likelihood is recorded. If the transitions of the noise type are roughly the same for the first several word HMMs, the sequence of noise transitions is fixed to that transition pattern for the likelihood calculation of each word HMM, which reduces the amount of computation.

[0010]

[Embodiments of the invention] In generating a NOVO-HMM that recognizes noise-superimposed word speech, the present invention takes a plurality of types of noise and a plurality of S/N ratio levels into account. It thereby generates a NOVO-HMM whose recognition accuracy does not drop greatly even if the type of noise, or the S/N ratio between speech and noise, changes during the utterance.

[0011] A speech recognition method according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention. Reference numeral 1 denotes a speech signal input unit that converts a speech signal, which is either training sample data or data to be recognized, into digital values; 2 denotes a feature extraction unit that calculates a feature value for each frame of the input signal; 3 denotes a data storage unit that stores the training sample data, noise data, recognition target word HMMs, noise HMMs and NOVO-HMMs; and 4 denotes a recognition result determination unit that calculates the output probability of the input word and determines the recognition result.

[0012] FIG. 2 is a circuit block diagram of the speech recognition device according to an embodiment of the present invention. Reference numeral 11 denotes a microphone, 12 a central processing unit (CPU), 13 a read-only memory (ROM), 14 a writable memory (RAM), and 15 an output device.

[0013] The speech signal input unit 1 in FIG. 1 consists of the microphone 11 and the CPU 12. The feature extraction unit 2, the data storage unit 3 and the recognition result determination unit 4 are realized by the CPU 12 executing a program written in the ROM 13 and accessing the RAM 14.

[0014] FIG. 3 is a flowchart of the speech recognition method according to an embodiment of the present invention. First, speech is input from the microphone 11, and the feature values obtained through the feature extraction unit 2 are written to the RAM 14 (step 2). Next, for the NOVO-HMM created for each recognition target word and stored in the ROM 13, the likelihood of the feature values in the RAM 14 is calculated (step 3). Finally, the word name corresponding to the NOVO-HMM that yielded the maximum likelihood is output to the output device 15 as the recognition result (step 4).

[0015] FIG. 4 is a conceptual diagram of the NOVO conversion used in the speech recognition method according to an embodiment of the present invention, and shows the process of creating a NOVO-HMM. The figure illustrates the case in which the number of states of the noise HMM subjected to NOVO conversion is two, two types of noise, A and B, are considered during recognition, and two S/N ratio levels, x [dB] and y [dB], are used.

[0016] Compared with the conventional method of FIG. 6, the structure of the noise HMM generated by training on training sample data is different. In the conventional method, a two-state noise HMM is generated directly by training on one type of noise, and NOVO conversion is then applied to that noise HMM and the word HMM. In the present invention, a one-state noise HMM is first generated by training for each of the two types of noise, and the state of each noise HMM is then replicated so that there is one copy per S/N ratio level to be considered. Next, the states of all the noise HMMs obtained by training and replication are joined by assigning state transition probabilities between them by hand (in this case, a self-transition probability of about 0.7 and a probability of about 0.1 for each transition to another state), which produces a four-state noise HMM. When conventional NOVO conversion is applied to this four-state noise HMM and a recognition target word HMM, the resulting NOVO-HMM has the form of four rows of word HMMs, as shown at the bottom of FIG. 4. The states in the top row use the coefficient for noise A at x [dB] in the step of the NOVO conversion that adds the speech linear spectrum and the noise linear spectrum; the states in the second row use the coefficient for noise A at y [dB]; the states in the third row use the coefficient for noise B at x [dB]; and the states in the fourth row use the coefficient for noise B at y [dB].
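The construction of the four-state noise HMM can be sketched in Python as follows. The state ordering (A at x, A at y, B at x, B at y) and the 0.7 and 0.1 transition probabilities come from the description; everything else is an assumption made for illustration. In particular, the concrete levels 20 dB and 5 dB stand in for the unspecified x and y, and the way `snr_gain` derives the coefficient for the spectrum addition (scaling the noise so that total speech power over total noise power matches the target S/N ratio) is a hypothetical choice, since the text does not say how the coefficient is obtained.

```python
import math
from itertools import product


def build_noise_hmm(noise_names, snr_levels_db, self_prob=0.7):
    """One state per (noise type, S/N level) pair with hand-assigned transitions."""
    states = list(product(noise_names, snr_levels_db))
    n = len(states)
    if n == 1:
        return states, [[1.0]]
    other = (1.0 - self_prob) / (n - 1)  # 0.1 each when there are four states
    transition = [[self_prob if i == j else other for j in range(n)] for i in range(n)]
    return states, transition


def snr_gain(speech_spectrum, noise_spectrum, snr_db):
    """Hypothetical coefficient that scales the noise to a target S/N ratio."""
    speech_power = sum(speech_spectrum)
    noise_power = sum(noise_spectrum)
    return speech_power / (noise_power * 10 ** (snr_db / 10))


def noisy_spectrum(speech_spectrum, noise_spectrum, snr_db):
    """Add linear (power) spectra: speech + coefficient * noise."""
    k = snr_gain(speech_spectrum, noise_spectrum, snr_db)
    return [s + k * n for s, n in zip(speech_spectrum, noise_spectrum)]


# Assumed values: x = 20 dB and y = 5 dB; the spectra are toy numbers.
states, transition = build_noise_hmm(["A", "B"], [20.0, 5.0])
speech = [4.0, 2.0, 1.0]
noise_a = [1.0, 1.0, 1.0]
mixed = noisy_spectrum(speech, noise_a, 5.0)
achieved_snr_db = 10 * math.log10(sum(speech) / (sum(mixed) - sum(speech)))
```

Each of the four noise states would supply its own noise spectrum and coefficient to this addition step, which is what yields the four rows of word-HMM states described for FIG. 4.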

[0017] FIG. 5 is a conceptual diagram of the computation reduction in the speech recognition method according to an embodiment of the present invention. A NOVO-HMM created by the method described above has more states than a NOVO-HMM that considers only a single type of noise or only a single S/N ratio level, so calculating the likelihood of each NOVO-HMM requires an enormous amount of computation. Therefore, when the likelihood of each NOVO-HMM is calculated, the noise type and S/N ratio on the path that gives the maximum likelihood are recorded. If the transitions of noise and S/N ratio turn out to be roughly the same in the NOVO-HMMs of the first few words, the sequence of noise and S/N ratio transitions is fixed and the likelihoods of all NOVO-HMMs are recalculated. This greatly reduces the amount of computation.
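One way to picture this reduction is a Viterbi search that can optionally be restricted to a fixed noise state at each frame. The Python sketch below is a hedged illustration, not the procedure of FIG. 5: the composite state numbering (noise index times the number of word states, plus the word index), the helper names, and the 0.9 agreement threshold in `tracks_agree` are all assumptions, because the text only says the transitions should be "roughly the same".

```python
import math

NEG_INF = -math.inf


def viterbi(log_init, log_trans, log_emit, allowed=None):
    """Best-path log score and state path.

    log_emit[t][s] is the log output probability of frame t in state s.
    allowed[t], when given, lists the states that may be occupied at frame t.
    """
    n_states = len(log_init)

    def permitted(t):
        return range(n_states) if allowed is None else allowed[t]

    score = [NEG_INF] * n_states
    for s in permitted(0):
        score[s] = log_init[s] + log_emit[0][s]
    back = []
    for t in range(1, len(log_emit)):
        new_score = [NEG_INF] * n_states
        pointers = [0] * n_states
        for s in permitted(t):
            best_prev = max(permitted(t - 1), key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[best_prev] + log_trans[best_prev][s] + log_emit[t][s]
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


def noise_track(path, n_word_states):
    """Noise-state index at each frame of a best path."""
    return [state // n_word_states for state in path]


def tracks_agree(tracks, min_agreement=0.9):
    """True if every track matches the first one on enough frames (assumed rule)."""
    reference = tracks[0]
    for track in tracks[1:]:
        same = sum(a == b for a, b in zip(reference, track))
        if same / len(reference) < min_agreement:
            return False
    return True


def allowed_states(track, n_word_states):
    """Per-frame state ranges once the noise sequence has been fixed."""
    return [range(n * n_word_states, (n + 1) * n_word_states) for n in track]


# Toy model: 2 noise states x 2 word states, uniform transitions, 3 frames.
n_word_states = 2
uniform = math.log(0.25)
log_init = [uniform] * 4
log_trans = [[uniform] * 4 for _ in range(4)]
frames = [[0.6, 0.1, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1], [0.1, 0.1, 0.2, 0.6]]
log_emit = [[math.log(p) for p in frame] for frame in frames]

full_score, full_path = viterbi(log_init, log_trans, log_emit)
track = noise_track(full_path, n_word_states)
fast_score, fast_path = viterbi(
    log_init, log_trans, log_emit, allowed_states(track, n_word_states)
)
```

With the noise sequence fixed, each frame visits only the word states of one noise row instead of every composite state, so the per-frame work shrinks by roughly the number of noise states.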

[0018]

[Effects of the invention] The speech recognition method of the present invention can recognize noise-superimposed speech with high probability even when the type of noise and the S/N ratio change simultaneously during an utterance.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a circuit block diagram of the speech recognition device according to an embodiment of the present invention.

FIG. 3 is a flowchart of the speech recognition method according to an embodiment of the present invention.

FIG. 4 is a conceptual diagram of the NOVO conversion used in the speech recognition method according to an embodiment of the present invention.

FIG. 5 is a conceptual diagram of the computation reduction in the speech recognition method according to an embodiment of the present invention.

FIG. 6 is a conceptual diagram of conventional NOVO conversion.

[Explanation of symbols]

1 speech signal input unit
2 feature extraction unit
3 data storage unit
4 recognition result determination unit
11 microphone
12 CPU
13 ROM
14 RAM
15 output device

Claims

1. A speech recognition method for recognizing noise-superimposed speech using HMMs whose recognition unit is the word, characterized in that, when the word HMMs used for recognition are generated, a noise HMM generated in advance from a plurality of types of noise is used and a plurality of S/N ratio levels are further taken into account, whereby speech is recognized with high probability even when the type of superimposed noise and the S/N ratio change partway through.

2. A speech recognition method for recognizing noise-superimposed speech using HMMs whose recognition unit is the word, characterized in that, when the likelihood of each word HMM generated in consideration of a plurality of noises and a plurality of S/N ratio levels is calculated, the type of noise on the path that gives the maximum likelihood is recorded, and, if the transitions of the noise type are roughly the same for the first several word HMMs, the sequence of noise transitions is fixed to that transition pattern for the likelihood calculation of each word HMM, thereby reducing the amount of computation.