JPH05257491A

JPH05257491A - Voice recognizing system

Info

Publication number: JPH05257491A
Application number: JP4054711A
Authority: JP
Inventors: Hiroshi Matsuura; 博松浦
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1992-03-13
Filing date: 1992-03-13
Publication date: 1993-10-08

Abstract

PURPOSE: To obtain a system that recognizes speech at high speed without loss of accuracy. CONSTITUTION: Input speech is analyzed by a speech analysis unit 1, and the resulting feature parameters are passed to a conversion unit 3, which obtains a symbol sequence using a symbol recognition dictionary 2. The symbol sequence is passed through a first HMM recognition unit 5, in which first HMMs are set, to obtain the probability that each model outputs the symbol sequence. A top-candidate extraction unit 6 extracts a plurality of words in descending order of probability. The symbol sequence is then passed through a second HMM recognition unit 8, in which second HMMs having more states than the first HMMs are set, to obtain the probability that each of those models outputs the symbol sequence. A word is identified on the basis of the resulting probabilities.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition system for recognizing uttered speech.

[0002]

[Prior Art] A method that performs vector quantization to convert speech into a symbol sequence and then recognizes the quantized symbol sequence with a hidden Markov model (hereinafter, HMM) has been successful in recent years.

[0003] The general formulation of the HMM is as follows. An HMM has n states S1, S2, ..., Sn, and the initial state is assumed to be probabilistically distributed over these n states. For speech, a model is used that changes state with a certain probability (the transition probability) at every fixed frame period. At each transition a symbol is output with a certain probability (the output probability); a null transition, which changes state without outputting any symbol, may also be introduced. Even when the output symbol sequence is given, the state transition sequence is not uniquely determined. Because only the symbol sequence can be observed, the model is called a hidden Markov model. An HMM model M is defined by the following six parameters.
n: number of states (states S1, S2, ..., Sn)
h: number of symbols (symbols R1, R2, ..., Rh)
Pij: transition probability (probability of moving from Si to Sj)
Qij(h): probability of outputting symbol h on the transition from Si to Sj
mi: initial state probability (probability that the initial state is Si)
F: set of final states

[0004] Next, transition restrictions that reflect the characteristics of speech are imposed on the model M. In speech, a loop transition that returns from a state Si to a previously visited state (Si-1, Si-2, ...) is generally not allowed because it would disturb the temporal order. A typical HMM structure of this kind is shown in FIG. 5. An HMM is evaluated by obtaining the probability Pr(O/M) that the model M outputs the symbol sequence O = o1, o2, ..., ot. At recognition time, the HMM recognition unit hypothesizes each model in turn and uses the Viterbi algorithm to find the model M that maximizes Pr(O/M). For training, a large number of symbol sequences O are supplied to the HMM learning unit, and the parameters of the model M that maximize Pr(O/M) on average are estimated. Recognizing uttered input speech in this way makes it possible to recognize the input speech with high accuracy.
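The evaluation of Pr(O/M) along the best state path can be illustrated with a minimal sketch. It assumes the arc-emission form described above (a symbol is output on each transition), omits null transitions, and works in the log domain to avoid underflow. The function and argument names are hypothetical and do not come from the patent.

```python
import math

def viterbi_log_score(symbols, P, Q, m, finals):
    """Best-path log probability that an arc-emission HMM outputs `symbols`.

    P[i][j]    : transition probability from state i to state j
    Q[i][j][o] : probability of outputting symbol o on the transition i -> j
    m[i]       : initial state probability
    finals     : iterable of allowed final state indices
    """
    n = len(P)
    neg_inf = float("-inf")
    # delta[i]: best log probability of being in state i after the symbols seen so far
    delta = [math.log(p) if p > 0 else neg_inf for p in m]
    for o in symbols:
        new_delta = [neg_inf] * n
        for i in range(n):
            if delta[i] == neg_inf:
                continue
            # Left-to-right restriction: never return to an earlier state.
            for j in range(i, n):
                p = P[i][j] * Q[i][j][o]
                if p > 0:
                    new_delta[j] = max(new_delta[j], delta[i] + math.log(p))
        delta = new_delta
    return max((delta[i] for i in finals), default=neg_inf)
```

A recognizer would call this once per word model and compare the scores; a sequence the model cannot produce scores negative infinity.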

[0005]

[Problems to Be Solved by the Invention] With the conventional technique described above, once the vocabulary reaches about 200 words, the computation required for recognition becomes so large that current workstations cannot produce a result in real time (within 200 to 300 ms after the utterance). The present invention has been made in view of these circumstances, and its object is to provide a speech recognition system that performs recognition at high speed without loss of accuracy.

[0006]

[Means for Solving the Problems] To solve the above problem, the present invention comprises: speech analysis means for receiving a speech signal, analyzing it, and obtaining feature parameters; conversion means for converting the feature parameters obtained by the speech analysis means into a symbol sequence; first probability determining means for passing the symbol sequence through first hidden Markov models created in advance for each word and obtaining the probability that each model outputs the symbol sequence; top-candidate extraction means for extracting a plurality of words in descending order of the probabilities obtained by the first probability determining means; and second probability determining means for passing the symbol sequence through second hidden Markov models, created in advance with more states than the first hidden Markov models, for the plurality of words extracted by the top-candidate extraction means, and obtaining the probability that each model outputs the symbol sequence. Speech recognition is performed by identifying a word on the basis of the probabilities determined by the second probability determining means.

[0007] The present invention may also configure the first probability determining means as k stages of sets (k being an integer of 2 or more). Each set comprises first probability determining means, which passes the symbol sequence through a plurality of the first hidden Markov models created in advance for n words and obtains the probability that each model outputs the symbol sequence, and top-candidate extraction means, which extracts m words (m being an integer satisfying 2 ≤ m < n) in descending order of the probabilities obtained by that first probability determining means. The first probability determining means of the first stage passes the symbol sequence through all of the first hidden Markov models for the n words, and the first probability determining means of each later stage passes the symbol sequence through the first hidden Markov models for the m words extracted by the top-candidate extraction means of the preceding set. The k stages are configured so that the number of states of the first hidden Markov models increases toward later stages and the number m of words extracted by the top-candidate extraction means decreases toward later stages. Second probability determining means is further provided, which passes the symbol sequence through second hidden Markov models, created in advance with more states than the first hidden Markov models, for the m words extracted by the top-candidate extraction means of the final stage, and obtains the probability that each model outputs the symbol sequence. Speech recognition is performed by identifying a word on the basis of the probabilities determined by the second probability determining means.

[0008]

[Operation] With the above configuration, a speech signal is input and analyzed to obtain feature parameters, the feature parameters are converted into a symbol sequence, and the symbol sequence is passed through the first hidden Markov models created in advance for each word, with the first probability determining means determining the probability that each model outputs the symbol sequence. On the basis of these probabilities, the top-candidate extraction means extracts a plurality of words in descending order of probability. This first stage performs a rough recognition that yields candidate words for the input speech.

[0009] For the plurality of words extracted by the top-candidate extraction means, the symbol sequence is passed through the second hidden Markov models, which are created in advance with more states than the first hidden Markov models. The second probability determining means obtains the probability that each of these models outputs the symbol sequence, and a word is identified on the basis of the determined probabilities.

[0010] In this way, detailed recognition with the second hidden Markov models, which have many states, is performed only on the candidate words produced by the rough first-stage recognition, so that recognition is performed at high speed without loss of accuracy.

[0011] Recognition need not be limited to the two stages described above. It can also be performed in multiple stages: a rough recognition first extracts candidate words, those candidates are then recognized in somewhat more detail, and the candidates are narrowed down further at each stage.

[0012]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings.

[0013] (First Embodiment) FIG. 1 is a block diagram of a speech recognition device according to a first embodiment of the present invention. The speech recognition device of FIG. 1 comprises a speech analysis unit 1, a symbol recognition dictionary 2, a conversion unit 3, a first HMM setting unit 4, a first HMM recognition unit 5, a top-candidate extraction unit 6, a second HMM setting unit 7, and a second HMM recognition unit 8.

[0014] The speech analysis unit 1 analyzes the input speech and extracts feature parameters. The symbol recognition dictionary 2 is an identification dictionary created from a plurality of standard patterns for each symbol. The conversion unit 3 matches the feature parameters extracted by the speech analysis unit 1 against the symbols registered in the symbol recognition dictionary 2 and obtains a symbol sequence.
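The matching performed by the conversion unit can be sketched as a nearest-pattern lookup. The patent does not state which matching measure is used, so squared Euclidean distance is an assumption here, as are the function name `to_symbol_sequence` and the dictionary layout (symbol code mapped to a list of standard patterns).

```python
def to_symbol_sequence(frames, dictionary):
    """Map each feature vector to the code of the symbol whose closest
    standard pattern is nearest to it.

    frames     : list of feature vectors, one per analysis frame
    dictionary : {symbol_code: [standard_pattern, ...]} (hypothetical layout)
    """
    def sq_dist(a, b):
        # Squared Euclidean distance; an assumed matching measure.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    sequence = []
    for frame in frames:
        best_code = min(
            dictionary,
            key=lambda code: min(sq_dist(frame, pat) for pat in dictionary[code]),
        )
        sequence.append(best_code)
    return sequence
```

Each frame yields one symbol code, so an utterance analyzed at a fixed frame period becomes a symbol sequence of the same length as its frame count.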

[0015] The first HMM setting unit 4 sets, in the first HMM recognition unit 5, HMMs (first HMMs) prepared in advance for each of, for example, 32 words. The first HMM recognition unit 5 receives the symbol sequence obtained by the conversion unit 3 and obtains the probability that each of the first HMMs thus set outputs the symbol sequence. The top-candidate extraction unit 6 extracts a plurality of words in descending order of these probabilities.

[0016] The second HMM setting unit 7 sets, in the second HMM recognition unit 8, HMMs (second HMMs) prepared in advance for each of, for example, 32 words. The second HMM recognition unit 8 passes the symbol sequence obtained by the conversion unit 3 through the second HMMs for the plurality of words extracted by the top-candidate extraction unit 6, obtains the probability that each HMM outputs the symbol sequence, and identifies a word on the basis of these probabilities.

[0017] FIG. 2 shows the first HMM set in the first HMM recognition unit 5 of FIG. 1. This HMM is of the left-to-right type and has five states S1, S2, ..., S5. The initial state is S1 only, and the model outputs symbols with certain output probabilities at a frame period of 8 ms. The parameters of the HMMs in this system, for example for the 32 models, are as follows.
n: number of states = 5 (states S1, S2, ..., S5)
h: number of symbols = 191 (each symbol is represented by a code R = 1, 2, ..., 191)
Pij: transition probability (probability of moving from Si to Sj)
Qij(h): probability of outputting symbol h on the transition from Si to Sj
The final state is restricted to S5.

[0018] FIG. 3 shows the second HMM set in the second HMM recognition unit 8 of FIG. 1. This HMM is of the left-to-right type and has ten states S1, S2, ..., S10. The initial state is S1 only, and the model changes state with fixed transition probabilities at a frame period of 8 ms, outputting a symbol with fixed output probabilities at each transition. The parameters of the 32 HMM models of the system in this embodiment are as follows.
n: number of states = 10 (states S1, S2, ..., S10)
h: number of symbols = 191 (each symbol is represented by a code R = 1, 2, ..., 191)
Pij: transition probability (probability of moving from Si to Sj)
Qij(h): probability of outputting symbol h on the transition from Si to Sj
The final state is restricted to S10. Next, the speech recognition processing performed by the configuration of FIG. 1 is described.

[0019] When speech is input, the speech analysis unit 1 analyzes it, for example by linear predictive coding (LPC) or with band-pass filters (BPF), and extracts feature parameters. The conversion unit 3 matches the extracted feature parameters against the standard patterns registered for each symbol in the symbol recognition dictionary 2 and obtains a symbol sequence.

[0020] The first HMMs shown in FIG. 2 are trained in advance for a predetermined set of 32 words and stored in the first HMM setting unit 4, which sets them in the first HMM recognition unit 5. The symbol sequence obtained for the input speech is passed through these first HMMs in the first HMM recognition unit 5, the probability Pr(O/M) that each HMM outputs the sequence is obtained, and the top-candidate extraction unit 6 selects the HMMs whose probabilities rank first through fifth.

[0021] The second HMMs shown in FIG. 3, which have more states than the first HMMs, are trained in advance for the predetermined 32 words and stored in the second HMM setting unit 7, which sets them in the second HMM recognition unit 8. In the second HMM recognition unit 8, the symbol sequence is passed through the second HMMs of the five words ranked first through fifth by the top-candidate extraction unit 6, the probability Pr(O/M) that each outputs the sequence is obtained, and the word with the highest probability is taken as the recognition result.
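The two-stage flow of this embodiment can be sketched as follows. This is only an illustration: `two_stage_recognize`, its arguments, and the injected `score` callable (standing in for the Pr(O/M) computation of the HMM recognition units) are hypothetical names, and the default of five candidates simply mirrors the example in the text.

```python
def two_stage_recognize(symbols, coarse_models, fine_models, score, top_n=5):
    """Rough pass over every word, then a detailed pass over the top candidates.

    coarse_models, fine_models : {word: model} for the few-state and
                                 many-state HMMs respectively
    score(model, symbols)      : returns a value that is larger for a more
                                 probable model (for example a log probability)
    """
    # Stage 1: score all words with the small models and keep the best top_n.
    coarse = {word: score(model, symbols) for word, model in coarse_models.items()}
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:top_n]

    # Stage 2: score only the surviving candidates with the larger models.
    fine = {word: score(fine_models[word], symbols) for word in candidates}
    return max(fine, key=fine.get), candidates
```

The detailed models are evaluated only for the surviving candidates, which is where the time saving comes from; a word pruned in the first stage can never be the final result, so the candidate count trades speed against the risk of discarding the correct word.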

[0022] In recognition processing, the more states an HMM has, the more accurate the recognition result obtained by passing the symbol sequence through it, but the longer the processing takes. With fewer states, accuracy is lower but the processing time is shorter.

[0023] In this embodiment, therefore, recognition is performed in two stages. In the first stage, the symbol sequence of the input speech is passed through 5-state HMMs for the 32 words, the probability that each outputs the sequence is obtained, and the five HMMs with the highest probabilities are selected.

[0024] In the second stage, 10-state HMMs are prepared for the 32 words, the symbol sequence of the input speech is passed through the HMMs of the five words with the highest first-stage probabilities, and the word with the highest probability is taken as the recognition result.

[0025] As a concrete example of the recognition processing time, a conventional method that does not perform recognition in two stages, and instead recognizes all 32 words with 10-state HMMs and takes the word with the highest probability as the result, completes the processing in about 60 ms.

[0026] In contrast, in this embodiment, which performs recognition in two stages, the first stage (recognition with 5-state HMMs and extraction of the five most probable words) takes about 30 ms. The second stage (recognition of the five extracted words with 10-state HMMs, taking the most probable word as the result) takes 10 ms. The total is therefore about 40 ms, roughly one third less than the conventional method. Moreover, in testing, the final number of recognition errors did not increase at all.

[0027] (Second Embodiment) In the first embodiment, a word is identified in two stages using two probability determining means, namely the first HMM recognition unit 5 and the second HMM recognition unit 8. A word can also be identified in three or more stages, as in the second embodiment described below.

[0028] FIG. 4 is a block diagram of a speech recognition device according to the second embodiment; parts identical to those in FIG. 1 bear the same reference numerals. Like the device of FIG. 1, the speech recognition device of FIG. 4 comprises the speech analysis unit 1, the symbol recognition dictionary 2, the conversion unit 3, the second HMM setting unit 7, and the second HMM recognition unit 8. In place of the first HMM setting unit 4, the first HMM recognition unit 5, and the top-candidate extraction unit 6 of FIG. 1, it comprises k sets connected in series, each set consisting of a first HMM setting unit 4-i, a first HMM recognition unit 5-i, and a top-candidate extraction unit 6-i (i = 1 to k).

[0029] Only the differences in configuration are described. The first HMM setting unit 4-i sets, in the first HMM recognition unit 5-i, HMMs (first HMMs) prepared in advance for each of, for example, 32 words. The first HMM recognition unit 5-1 of the first stage receives the symbol sequence obtained by the conversion unit 3 and obtains the probability that each first HMM for the 32 words outputs the symbol sequence. The top-candidate extraction unit 6-1 extracts m1 words (m1 being an integer satisfying 2 ≤ m1 < 32) in descending order of the probabilities obtained by the first HMM recognition unit 5-1. The HMM recognition unit 5-j of each later stage (j = 2 to k) obtains the probability that each first HMM for the m(j-1) words extracted by the top-candidate extraction unit 6-(j-1) of the preceding stage outputs the symbol sequence. The top-candidate extraction unit 6-j then extracts mj words (mj being an integer satisfying 2 ≤ mj < 32 and mj < m(j-1)) in descending order of the probabilities obtained by the first HMM recognition unit 5-j. The number of states of the first HMMs is set to increase toward later stages. Next, the speech recognition processing performed by the configuration of FIG. 4 is described. The processing by the speech analysis unit 1, the symbol recognition dictionary 2, and the conversion unit 3 is the same as in the first embodiment.

[0030] The first HMMs are trained in advance for the predetermined 32 words and held in the first HMM setting unit 4-1, which sets them in the first HMM recognition unit 5-1. The symbol sequence obtained by the conversion unit 3 is passed through these first HMMs in the first HMM recognition unit 5-1, and the probability that each outputs the symbol sequence is obtained. The top-candidate extraction unit 6-1 extracts m1 words (m1 being an integer satisfying 2 < m1 < 32) in descending order of these probabilities.

[0031] For each first HMM setting unit 4-j of the second and later stages (j being an integer with 2 ≤ j ≤ k), HMMs for the predetermined 32 words with more states than the HMMs of the preceding first HMM setting unit 4-(j-1) are trained in advance and set in the first HMM recognition unit 5-j. The HMM recognition unit 5-j obtains the probability that each of these first HMMs, for the m(j-1) words extracted by the top-candidate extraction unit 6-(j-1) of the preceding stage, outputs the symbol sequence. The top-candidate extraction unit 6-j extracts mj words in descending order of these probabilities. Because the number mj of words extracted by the top-candidate extraction unit 6-j is set to decrease toward later stages, the candidate words are narrowed down gradually. Because the number of states of the first HMMs is set to increase toward later stages, the candidate words are recognized in more detail at each later stage. Since the number of candidate words decreases toward later stages, the processing time can be reduced even though the number of HMM states increases.

[0032] Second HMMs with more states than the first HMMs are trained in advance for the predetermined 32 words and stored in the second HMM setting unit 7, which sets them in the second HMM recognition unit 8. In the second HMM recognition unit 8, the symbol sequence is passed through the second HMMs for the mk words obtained by the top-candidate extraction unit 6-k of the k-th (final) stage, the probability that each outputs the sequence is obtained, and the word with the highest probability is taken as the recognition result. Although the above embodiments have been described using discrete HMMs, the invention can be implemented in the same way with continuous HMMs.
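The k-stage narrowing followed by the second-HMM pass can be sketched as a loop over stages. As before, this is an illustration under assumptions: `cascade_recognize`, its arguments, and the injected `score` callable are hypothetical, and the validation reflects the constraints in the text (each stage keeps at least 2 words and fewer than the stage before).

```python
def cascade_recognize(symbols, stage_models, keep_counts, final_models, score):
    """Narrow the candidates over k stages, then decide with the final models.

    stage_models : list of k dicts {word: model}; later stages use models
                   with more states
    keep_counts  : list of k integers m1 > m2 > ... > mk, each at least 2
    final_models : {word: model} for the second (most detailed) HMMs
    score(model, symbols) : larger means more probable
    """
    if len(stage_models) != len(keep_counts):
        raise ValueError("one keep count is required per stage")
    previous = len(stage_models[0])  # n, the full vocabulary size
    for keep in keep_counts:
        if not 2 <= keep < previous:
            raise ValueError("each stage must keep at least 2 words and fewer than the stage before")
        previous = keep

    candidates = list(stage_models[0])  # the first stage scores every word
    for models, keep in zip(stage_models, keep_counts):
        ranked = sorted(candidates, key=lambda w: score(models[w], symbols), reverse=True)
        candidates = ranked[:keep]

    best = max(candidates, key=lambda w: score(final_models[w], symbols))
    return best, candidates
```

With a single stage this reduces to the two-stage flow of the first embodiment.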

[0033]

[Effects of the Invention] As described in detail above, according to the present invention, hidden Markov models with few states (the first hidden Markov models) are first used to obtain the probability (the first probability) of outputting the symbol sequence derived from the feature parameters of the input speech to be recognized, and a plurality of words are extracted in descending order of this probability (top-candidate extraction), so that candidates are roughly selected in a coarse classification. Next, hidden Markov models with more states than those models (the second hidden Markov models) are used to obtain, for the extracted words, the probability (the second probability) of outputting the symbol sequence, and the word of the input speech is identified on the basis of this probability. With this configuration, recognition can be performed at high speed without loss of accuracy.

[0034] Furthermore, according to the present invention, the first probability may be obtained in multiple stages. First hidden Markov models whose number of states increases toward later stages are provided in multiple stages, and at each stage the probability of outputting the symbol sequence is obtained for the plurality of words with the highest probabilities extracted (top-candidate extraction) by the first hidden Markov models of the preceding stage. The number of words is smaller than the number extracted at the preceding stage, except that the first stage covers all candidate words for the input speech. With this configuration, when the number of words to be recognized is large, recognition can be performed even faster without loss of accuracy.

[Brief description of drawings]

FIG. 1 is a block diagram of a first embodiment of a speech recognition device to which the present invention is applied.

FIG. 2 is a diagram showing the HMM set in the first HMM recognition unit 5 of FIG. 1.

FIG. 3 is a diagram showing the HMM set in the second HMM recognition unit 8 of FIG. 1.

FIG. 4 is a block diagram of a second embodiment of a speech recognition device to which the present invention is applied.

FIG. 5 is a diagram showing a general HMM.

[Explanation of Reference Numerals] 1: speech analysis unit; 2: symbol recognition dictionary; 3: conversion unit; 4, 4-1 to 4-k: first HMM setting unit; 5, 5-1 to 5-k: first HMM recognition unit (first probability determining means); 6, 6-1 to 6-k: top-candidate extraction unit; 7: second HMM setting unit; 8: second HMM recognition unit (second probability determining means).

Claims

[Claims]

1. A speech recognition system comprising: speech analysis means for receiving a speech signal, analyzing it, and obtaining feature parameters; conversion means for converting the feature parameters obtained by the speech analysis means into a symbol sequence; first probability determining means for passing the symbol sequence through first hidden Markov models created in advance for each word and obtaining the probability that each model outputs the symbol sequence; top-candidate extraction means for extracting a plurality of words in descending order of the probabilities obtained by the first probability determining means; and second probability determining means for passing the symbol sequence through second hidden Markov models, created in advance with more states than the first hidden Markov models, for the plurality of words extracted by the top-candidate extraction means, and obtaining the probability that each model outputs the symbol sequence; wherein a word is identified on the basis of the probabilities determined by the second probability determining means.

2. A speech recognition system comprising: speech analysis means for receiving a speech signal, analyzing it, and obtaining feature parameters; conversion means for converting the feature parameters obtained by the speech analysis means into a symbol sequence; k stages of sets (k being an integer of 2 or more), each set comprising first probability determining means for passing the symbol sequence through a plurality of first hidden Markov models created in advance for n words and obtaining the probability that each model outputs the symbol sequence, and top-candidate extraction means for extracting m words (m being an integer satisfying 2 ≤ m < n) in descending order of the probabilities obtained by that first probability determining means, wherein the first probability determining means of the first stage passes the symbol sequence through all of the first hidden Markov models for the n words, the first probability determining means of each stage other than the first passes the symbol sequence through the first hidden Markov models for the m words extracted by the top-candidate extraction means of the preceding set, and the k stages are configured so that the number of states of the first hidden Markov models increases toward later stages and the number m of words extracted by the top-candidate extraction means decreases toward later stages; and second probability determining means for passing the symbol sequence through second hidden Markov models, created in advance with more states than the first hidden Markov models, for the m words extracted by the top-candidate extraction means of the final stage of the k stages, and obtaining the probability that each model outputs the symbol sequence; wherein a word is identified on the basis of the probabilities determined by the second probability determining means.