JP3513284B2

JP3513284B2 - Voice recognition method and apparatus

Info

Publication number: JP3513284B2
Application number: JP26152795A
Authority: JP
Inventors: 康弘小森; 雅章山田; 恭則大洞
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1995-10-09
Filing date: 1995-10-09
Publication date: 2004-03-31
Anticipated expiration: 2015-10-09
Also published as: JPH09106295A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition method and an apparatus therefor.

[0002]

[Prior Art] In recent years, speech recognition systems have increasingly been put into practical use. In practical speech recognition, the technique of identifying what was input is very important, but so is the technique of determining that an input cannot be accepted by the speech recognition system. For speech recognition systems using hidden Markov models (hereinafter, HMMs), two approaches to this problem are known to be both common and effective: (1) normalizing the likelihood of the input speech by the likelihood of a phoneme-chain-driven sequence of phoneme HMMs, and (2) using a Garbage Model HMM in the language search process.

[0003] In method (1), the output probabilities and transition probabilities of the phoneme HMMs used for speech recognition are computed according to a "phoneme chain description" (for example, repeated consonant + vowel or vowel + vowel chains). In method (2), an HMM is trained in advance on the set of inputs to be rejected, and the likelihood is computed according to each state of that HMM.

[0004]

[Problems to Be Solved by the Invention] However, conventional method (1) requires, separately from the language search process and search space used to obtain recognition candidates (trellis or Viterbi search, and the trellis space), a further search process and search space for driving the phoneme chain (for example, trellis or Viterbi search, and a trellis space).

[0005] Conventional method (2) requires output probability computation for the Garbage Model in addition to that for the HMMs serving as recognition units, and also requires a language search space and processing for the Garbage Model on top of the ordinary language search. In either method, the need for a dedicated rejection search process (trellis or Viterbi search) with its search space (trellis space), or for special output probability computation such as that of a Garbage Model, is a problem in terms of computation and memory capacity.

[0006] The present invention removes these conventional drawbacks. It provides a speech recognition method and apparatus that eliminate the rejection search process (trellis or Viterbi search) and further reduce the rejection search space (trellis space), thereby realizing compact, fast, high-performance speech recognition.

[0007]

[0008]

[Means for Solving the Problems] To solve this problem, the speech recognition method of the present invention is a method for recognizing speech using hidden Markov models, comprising: a step of acoustically analyzing input speech to obtain speech parameters and computing the output probability for each state of each hidden Markov model; a step of performing a language search using linguistic information such as a dictionary and grammar, the hidden Markov models, and the output probabilities of their states, to obtain a recognition candidate for the input speech and its likelihood, and also obtaining, from the computed output probabilities for each state of each hidden Markov model, the maximum state output probability and its state for the analyzed speech parameters, and obtaining a likelihood for rejecting the speech input from rejection state transition probabilities determined by how the states transition; and a step of deciding, based on the ratio of the two likelihoods and a threshold, whether to output the recognition result of the speech input or to reject it.

[0009] Here, in determining the state transition probability that depends on how the state showing the maximum state output probability transitions, the transition probability may be given by an empirical method. Alternatively, all states of all hidden Markov models are decomposed, initial values are given such that all states can transition freely while the internals that determine the output probabilities are kept fixed, and the transition probabilities are learned anew from the speech data on which the hidden Markov models were trained.

[0010] The speech recognition apparatus of the present invention recognizes speech using hidden Markov models and comprises: means for acoustically analyzing input speech to obtain speech parameters; means for computing the output probability of the analyzed speech parameters for each state of each hidden Markov model; means for performing a language search using linguistic information such as a dictionary and grammar, the hidden Markov models, and the output probabilities of their states, to obtain a recognition candidate for the input speech and its likelihood; means for obtaining, from the computed output probabilities for each state of each hidden Markov model, the maximum state output probability and its state for the analyzed speech parameters; means for obtaining a likelihood for rejecting the speech input from rejection state transition probabilities determined by how the obtained states transition; and means for deciding, based on the ratio of the two likelihoods and a threshold, whether to output the recognition result of the speech input or to reject it.

[0011] Here, the means for obtaining the rejection likelihood may assign transition probabilities by an empirical method and determine the state transition probability according to how the state showing the maximum state output probability transitions. Alternatively, the means for obtaining the rejection likelihood decomposes all states of all hidden Markov models, gives initial values such that all states can transition freely while the internals that determine the output probabilities are kept fixed, learns the transition probabilities anew from the speech data on which the hidden Markov models were trained, and determines the state transition probability according to how the state showing the maximum state output probability transitions.

[0012]

[Embodiments of the Invention]

<Concept of the speech recognition method of this embodiment>
In the speech recognition method of this embodiment, the phoneme HMMs used for speech recognition are taken apart into their individual states, and for each input frame a transition is made to the state showing the maximum output probability. The transition probabilities used here can be determined independently of the transition probabilities of the phoneme recognition HMMs. However, if the likelihood is computed simply by following the maximum at each frame, the rejection likelihood r is necessarily larger than the likelihood R of the recognition result. A threshold th is therefore introduced, chosen so that R/r > th when a recognition target is input and R/r ≤ th when something other than a recognition target is input.

[0013] With this method, when speech that is a recognition target is input, the likelihood R of the sequence of recognition HMMs forming that target takes a large value. The rejection likelihood r is also large, but R/r > th holds for the above threshold, so recognition is performed correctly. When speech that is not a recognition target is input, the likelihood R of the HMM sequence forming any recognition target takes a small value, whereas the rejection likelihood r takes a relatively large value for any speech. R/r ≤ th therefore holds, and the input is rejected.

[0014] <Configuration example of the speech recognition apparatus of this embodiment>
FIG. 1 shows a configuration example of the speech recognition apparatus of this embodiment. In the figure, 101 is a speech input unit including a microphone and an A/D converter for inputting speech, 102 is an acoustic analysis unit that obtains speech parameters, 103 is an output probability calculation unit that computes output probabilities, 104 is the HMMs used to compute the output probabilities, 105 is a language search unit that performs language processing, 106 is the grammar and dictionary used for language processing, 107 is a rejection maximum output probability calculation unit, 108 is a rejection state transition probability calculation unit, 109 is a rejection decision unit, and 110 is a display unit that outputs the result.

[0015] This speech recognition apparatus may be realized by running a specific program on a general-purpose computer, or it may use a special computer such as a neurocomputer. In the case of a general-purpose computer, the program implementing this speech recognition method, or special data (tables and the like), may be loaded into main memory from an external storage medium or via communication and then executed.

[0016] <Processing flow of the speech recognition apparatus of this embodiment>
The speech recognition apparatus composed of the above elements operates according to the processing flow shown in FIG. 2. Speech segmented by the speech input unit 201 (corresponding to 101) is analyzed into speech parameters frame by frame in the acoustic analysis unit 202 (corresponding to 102), and the output probability calculation unit 203 (corresponding to 103) computes output probabilities using the HMMs 204 (corresponding to 104). The language search unit 205 (corresponding to 105) performs a language search using the output probabilities from the output probability calculation unit 203, the HMMs 204, and the linguistic information of the grammar, dictionary, and the like 206 (corresponding to 106), and outputs a recognition candidate and its likelihood R (205-1).

[0017] Meanwhile, the rejection maximum output probability calculation unit 207 (corresponding to 107) obtains, for each frame of the input speech, the maximum output probability and the state that gives it. The rejection state transition probability calculation unit 208 (corresponding to 108) then obtains the rejection likelihood r (208-1) based on the transition probabilities between the states that output the maximum probability. Finally, the rejection decision unit 209 (corresponding to 109) compares the likelihood R of the recognition candidate with the rejection likelihood r, decides whether to output the recognition result for the input speech or to reject it (R/r > th or R/r ≤ th), and outputs the corresponding result (recognition result 210-1 or rejection 210-2).
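The accumulation of the rejection likelihood r described above can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions, not the implementation of the patent: the function name `reject_log_likelihood`, the list-of-lists input layout, the log-domain scoring, and the choice to add no transition term for the first frame are all hypothetical.

```python
import math


def reject_log_likelihood(frame_log_probs, log_trans):
    """Accumulate a rejection score without any trellis.

    frame_log_probs[t][s]: log output probability of state s at frame t
        (assumed to be shared with the recognizer's own computation).
    log_trans[i][j]: log of the rejection transition score from state i to j
        (assumed to be supplied separately from the recognition HMMs).
    Returns the accumulated score and the sequence of best states.
    """
    total = 0.0
    prev = None
    path = []
    for log_probs in frame_log_probs:
        # Pick the state with the maximum output probability for this frame.
        best = max(range(len(log_probs)), key=lambda s: log_probs[s])
        total += log_probs[best]
        # Add the transition term between consecutive best states
        # (assumption: the first frame contributes no transition term).
        if prev is not None:
            total += log_trans[prev][best]
        prev = best
        path.append(best)
    return total, path


# Tiny example with two states and three frames (made-up numbers).
frames = [
    [math.log(0.6), math.log(0.1)],
    [math.log(0.5), math.log(0.2)],
    [math.log(0.1), math.log(0.7)],
]
table = [
    [math.log(0.95), math.log(0.05)],
    [math.log(0.05), math.log(0.95)],
]
score, path = reject_log_likelihood(frames, table)
print(path, score)
```

Only the running total and the previous best state are carried from frame to frame, which is the point of the approach: no per-state search space has to be stored for rejection.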

[0018] FIG. 3 illustrates the process of obtaining the rejection likelihood r. As shown in FIG. 3, in the speech recognition method of this embodiment, the phoneme HMMs used for speech recognition are taken apart into their individual states, and for each input frame a transition is made to the state showing the maximum output probability. The transition probabilities are determined independently of those of the phoneme recognition HMMs.

[0019] In general, HMM likelihoods are computed in the log domain, as sums of log probabilities. Accordingly, the following two approaches are conceivable for the state transition probabilities.

[0020] (1) In the empirical method, log(0.95) is added when the state does not change (a transition to the same state), and log(0.05) is added when a transition is made to a different state. The state transition probability table in this case (strictly speaking, its entries are not probabilities) is given in Table 1.
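As an illustration of the empirical rule in (1), the table can be generated directly from the two constants. This is a minimal sketch; the function name `empirical_log_transitions` and the square list-of-lists layout are assumptions made for the example. One plausible reading of the remark that the entries are not strictly probabilities is that, with more than two states, a row of 0.95 and 0.05 values no longer sums to 1, which the last line shows.

```python
import math


def empirical_log_transitions(num_states, stay=0.95, move=0.05):
    """Build a num_states x num_states table of log transition scores.

    Diagonal entries (same state) get log(stay); all other entries
    (a different state) get log(move).
    """
    log_stay, log_move = math.log(stay), math.log(move)
    return [
        [log_stay if i == j else log_move for j in range(num_states)]
        for i in range(num_states)
    ]


table = empirical_log_transitions(3)
# Row sum in the linear domain: 0.95 + 2 * 0.05, which is not 1.
print(sum(math.exp(v) for v in table[0]))
```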

[0021]

[Table 1]

(2) Alternatively, all states of all HMMs are decomposed, initial values are given such that free transitions are possible between all states while the internals that determine the output probabilities are kept fixed, and the state transition probabilities are learned anew from the speech data on which the HMMs were trained. As a result of this learning, a state transition probability table such as Table 2 (one example) is obtained, and the logarithms of its entries are used. The EM algorithm, the usual training method for HMMs, can be used for the learning, with the parameters related to the output probabilities held fixed.
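The learning in (2) can be sketched as a standard Baum-Welch (EM) re-estimation restricted to the transition matrix. The code below is an assumption-laden illustration rather than the procedure of the patent: the function name `reestimate_transitions`, the input layout (per-frame output probabilities already computed from the frozen output parameters), the uniform initial-state distribution, and the fixed iteration count are all hypothetical.

```python
def reestimate_transitions(utterances, trans, iterations=10):
    """Re-estimate a fully connected transition matrix by EM.

    utterances: list of utterances; each is a list of frames, and each
        frame is a list b[s] of state output probabilities. These come
        from the frozen output parameters and are never updated here.
    trans: initial row-stochastic matrix allowing every transition.
    """
    n = len(trans)
    for _ in range(iterations):
        counts = [[0.0] * n for _ in range(n)]
        for frames in utterances:
            # Scaled forward pass (uniform initial state is an assumption).
            alpha, scale = [], []
            for t, b in enumerate(frames):
                if t == 0:
                    raw = [b[j] / n for j in range(n)]
                else:
                    raw = [b[j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
                           for j in range(n)]
                norm = sum(raw)
                alpha.append([v / norm for v in raw])
                scale.append(norm)
            # Scaled backward pass, accumulating expected transition counts.
            beta = [1.0] * n
            for t in range(len(frames) - 2, -1, -1):
                nxt = frames[t + 1]
                for i in range(n):
                    for j in range(n):
                        counts[i][j] += (alpha[t][i] * trans[i][j] * nxt[j] * beta[j]
                                         / scale[t + 1])
                beta = [sum(trans[i][j] * nxt[j] * beta[j] for j in range(n)) / scale[t + 1]
                        for i in range(n)]
        # M-step: normalize each row of expected counts.
        trans = [[v / sum(row) for v in row] if sum(row) > 0 else list(old)
                 for row, old in zip(counts, trans)]
    return trans


# Two states; the data favor state 0 for five frames, then state 1.
data = [[[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 5]
learned = reestimate_transitions(data, [[0.5, 0.5], [0.5, 0.5]])
print(learned)
```

The logarithms of the learned entries would then play the role of the transition scores when accumulating the rejection likelihood.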

[0022]

[Table 2]

<Processing example of the speech recognition apparatus of this embodiment>
The result of speech recognition according to this embodiment is compared below with the result of speech recognition by the conventional methods, using a concrete example.

[0023] FIG. 4 shows a simulation of the rejection process according to this embodiment. Here, the words that can be accepted are "Osaka" and "Tokyo", the inputs are "Osaka" and "Hokkaido", and the rejection threshold th is assumed to be 0.90. When the input is "Osaka", a likelihood is obtained for each of the recognition targets "Tokyo" and "Osaka" in its search section, and the larger of the two becomes the likelihood R of the recognition result. In this example, the 950.0 obtained for "Osaka" becomes R, while the rejection likelihood r is 1000.0. Since R/r = 0.95 and th = 0.90, R/r > th holds; the input is not rejected, and the recognition result "Osaka" is output.

[0024] When the input is "Hokkaido", a likelihood is likewise obtained for each of the recognition targets "Tokyo" and "Osaka" in its search section, and the larger of the two becomes the likelihood R of the recognition result. In this example, the 800.0 obtained for "Tokyo" becomes R, while the rejection likelihood r is 1000.0. The text gives R/r = 0.85 (800.0/1000.0 evaluates to 0.80; both values are below th) and th = 0.90, so R/r < th holds; the input is rejected, and no recognition result is output.
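The two cases of this simulation can be replayed with a one-line decision rule. The sketch below assumes the scores are used exactly as quoted (positive values compared by a plain ratio); the function name `accept_recognition` is hypothetical. Computed from the quoted scores, the "Hokkaido" ratio is 800.0/1000.0 = 0.80, which is below the threshold.

```python
def accept_recognition(R, r, th):
    """Return True to output the recognition result, False to reject.

    R: likelihood of the best recognition candidate.
    r: rejection likelihood.
    th: rejection threshold; the input is accepted only if R / r > th.
    """
    return R / r > th


TH = 0.90
print(accept_recognition(950.0, 1000.0, TH))  # input "Osaka": True, accepted
print(accept_recognition(800.0, 1000.0, TH))  # input "Hokkaido": False, rejected
```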

[0025] FIG. 5 shows a simulation of the rejection process according to the conventional methods. The left side shows the case using phoneme chains, and the right side shows the case using a Garbage Model (GB model). The conventional phoneme-chain and Garbage Model methods require search-space memory for the total number of states of the models they use, together with Viterbi or trellis computation over that space.

[0026] Comparing this embodiment with the conventional examples shows that, in this embodiment, memory for a single state suffices and no Viterbi or trellis computation is needed. The rejection likelihood is obtained by selecting the state with the maximum output probability and then performing a simple calculation that merely combines that maximum output probability with the state transition probability. Although this embodiment uses HMMs whose unit is the phoneme, any unit may be used without problem, and the present invention encompasses such cases. The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device. Needless to say, the present invention is also applicable where it is achieved by supplying a program to a system or apparatus. In that case, the storage medium storing the program according to the present invention constitutes the present invention, and by reading the program from the storage medium into the system or apparatus, the system or apparatus operates in a predetermined manner.

[0027]

[Effects of the Invention] The output computation for rejection is replaced by a simple maximum-value computation: the phoneme HMMs used for speech recognition are taken apart into their individual states, a transition is made for each input frame to the state showing the maximum output probability, and the rejection transition probabilities can be determined independently of the transition probabilities of the phoneme recognition HMMs. The rejection search process (trellis or Viterbi search) required by the conventional methods is thereby eliminated, and the rejection search space (trellis space) is reduced to memory holding the accumulated rejection likelihood (a single value) obtained from the maximum output probability and its rejection transition probability. A method and apparatus realizing compact, fast, high-performance speech recognition can thus be provided.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram of the speech recognition apparatus of this embodiment.

FIG. 2 is a diagram showing the flow of processing in the speech recognition apparatus of this embodiment.

FIG. 3 is a diagram illustrating the process of obtaining the rejection likelihood r in this embodiment.

FIG. 4 is a diagram showing a simulation of the rejection process of this embodiment.

FIG. 5 is a diagram showing a simulation of the conventional rejection process.

Continuation of the front page

(56) References cited
JP-A-8-6588 (JP, A)
Japanese Patent No. 3315565 (JP, B2)
Watanabe, Tsukada, "Rejection of unknown utterances by likelihood correction using speech recognition", IEICE Transactions D-II, Japan, December 1992, Vol. J75-D-II, No. 12, pp. 2002-2009
Ozeki, "The mutual information as a scoring function for speech recognition", IEICE Technical Report [Speech], Japan, December 15, 1995, SP95-101, pp. 53-60
Noda, Sagayama, "HMM-LR speech recognition by A* beam search using forward likelihood", IEICE Technical Report [Speech], Japan, June 17, 1994, SP94-23, pp. 1-7

(58) Fields searched (Int. Cl.7, DB name)
G10L 15/00 - 15/28
JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition method for recognizing speech using hidden Markov models, comprising: a step of acoustically analyzing input speech to obtain speech parameters and computing the output probability for each state of each hidden Markov model; a step of performing a language search using linguistic information such as a dictionary and grammar, the hidden Markov models, and the output probabilities of their states, to obtain a recognition candidate for the input speech and its likelihood, and also obtaining, from the computed output probabilities for each state of each hidden Markov model, the maximum state output probability and its state for the analyzed speech parameters, and obtaining a likelihood for rejecting the speech input from rejection state transition probabilities determined by how the states transition; and a step of deciding, based on the ratio of the two likelihoods and a threshold, whether to output the recognition result of the speech input or to reject it.

2. The speech recognition method according to claim 1, wherein, in determining the state transition probability that depends on how the state showing the maximum state output probability transitions, the transition probability is given by an empirical method.

3. The speech recognition method according to claim 1, wherein, in determining the state transition probability that depends on how the state showing the maximum state output probability transitions, all states of all hidden Markov models are decomposed, initial values are given such that all states can transition freely while the internals that determine the output probabilities are kept fixed, and the transition probabilities are learned anew from the speech data on which the hidden Markov models were trained.

4. A speech recognition apparatus for recognizing speech using hidden Markov models, comprising: means for acoustically analyzing input speech to obtain speech parameters; means for computing the output probability of the analyzed speech parameters for each state of each hidden Markov model; means for performing a language search using linguistic information such as a dictionary and grammar, the hidden Markov models, and the output probabilities of their states, to obtain a recognition candidate for the input speech and its likelihood; means for obtaining, from the computed output probabilities for each state of each hidden Markov model, the maximum state output probability and its state for the analyzed speech parameters; means for obtaining a likelihood for rejecting the speech input from rejection state transition probabilities determined by how the obtained states transition; and means for deciding, based on the ratio of the two likelihoods, whether to output the recognition result of the speech input or to reject it.

5. The speech recognition apparatus according to claim 4, wherein the means for obtaining the rejection likelihood assigns transition probabilities by an empirical method and determines the state transition probability according to how the state showing the maximum state output probability transitions.

6. The speech recognition apparatus according to claim 4, wherein the means for obtaining the rejection likelihood decomposes all states of all hidden Markov models, gives initial values such that all states can transition freely while the internals that determine the output probabilities are kept fixed, learns the transition probabilities anew from the speech data on which the hidden Markov models were trained, and determines the state transition probability according to how the state showing the maximum state output probability transitions.