JP3006496B2

JP3006496B2 - Voice recognition device

Info

Publication number: JP3006496B2
Application number: JP8171422A
Authority: JP
Inventors: 優高野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1996-03-21
Filing date: 1996-07-01
Publication date: 2000-02-07
Anticipated expiration: 2016-07-01
Also published as: JPH09311694A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech recognition device that detects specific words in an utterance.

[0002]

2. Description of the Related Art

Compared with ordinary speech recognition, word detection has the advantage that the interval in which a word occurs in the input speech need not be determined in advance.

[0003] Conventionally, the method described in Reference 1 (Trans. IEICE (D), J67-D, No. 11, pp. 1242-1249) is known as a way to detect specific words in an utterance. In that method, a likelihood is calculated independently for each candidate word at every frame, and a word is detected whenever its likelihood exceeds a fixed threshold, regardless of whether other candidate words are detected. As another method, the one described in Reference 2 ("Speech Recognition by Probabilistic Models", Seiichi Nakagawa, pp. 20-26) is known. That method determines and outputs, at each frame, the word that is optimal under the language model. As a method for suppressing the detection of words that overlap in time, the one described in Reference 3 (IEICE Technical Report SP95-77, pp. 31-38) is known. That method makes a provisional detection decision for each candidate word at each frame and, applying frame-synchronous processing to the likelihood of each detection, can suppress detections that overlap in time.

[0004]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, the method of Reference 1 admits hypotheses in which candidate words overlap in time within the utterance. This does not happen with the method of Reference 2, but that method can detect, partway through an utterance, a word that resembles part of the word being spoken. Preventing this requires fixing, by some means, the time range within the input speech in which the word occurs. The method of Reference 3 has no such requirement, nor does it admit hypotheses in which candidate words overlap in time. However, when using a model such as the one shown in Reference 3, in which a model that accepts an arbitrary syllable sequence (a garbage model) is attached only in front of the recognition target word, the likelihood of a word resembling part of the spoken word may be estimated higher than the likelihood of the word actually spoken. The similar word is then detected and the detection of the word actually spoken is cancelled (partial matching), and how to remedy this is not obvious. On the other hand, it is known that when a model with garbage models attached both before and after the recognition target word is used with the method of Reference 2, partial matching can be avoided by setting the pattern-matching score of the garbage model low.

[0005] An object of the present invention is to provide a word recognition method that combines the advantages of the methods of References 1 to 3 described above, by using for word detection a method that reduces the detection of multiple candidates for a single spoken word and that also retains the robustness of Reference 2 against partial matching.

[0006]

SUMMARY OF THE INVENTION

The present invention relates to a method for reducing the detection of multiple candidate words during the utterance of a single word, in a speech recognition device that detects candidate words in input speech data frame by frame on the basis of likelihood.

[0007]

BEST MODE FOR CARRYING OUT THE INVENTION

(First Embodiment) FIG. 1 is a diagram showing the configuration of the speech recognition device according to the first embodiment of the present invention.

[0008] The speech recognition device according to the first embodiment of the present invention comprises: a speech analysis unit 101 that extracts a feature quantity from the input speech at fixed time intervals (hereinafter, frames); a word dictionary 102 that stores candidate words; a model generation unit 103 that generates a language model from the candidate words in the word dictionary; a likelihood calculation unit 104 that, from the feature quantities and the language model, finds at each frame the optimal word sequence fitting the language model (hereinafter, the optimal sequence) and its likelihood; a detection unit 105 that takes the output of the likelihood calculation unit 104 as input and outputs candidate words; and a storage unit 106 attached to the detection unit 105.

[0009] The speech analysis unit 101 performs frequency analysis on each frame of the input speech and generates a feature vector for each frame (hereinafter, the feature quantity). The elements of the feature quantity include power, change in power, mel-cepstrum, change in mel-cepstrum, second-order change in mel-cepstrum, and the like. The feature quantity of each frame is output to the likelihood calculation unit 104 at every frame.

[0010] The word dictionary 102 stores each word to be recognized as a chain of unit acoustic models, for example a chain of acoustic models representing the syllables that make up the word ("o"-"o"-"sa"-"ka" in the case of "Osaka").

[0011] The model generation unit 103 constructs a language model that accepts utterances in which arbitrary syllable chains are attached before and after each word model in the word dictionary 102. The likelihood calculation unit 104 calculates the optimal sequence and its likelihood for each frame from the language model generated by the model generation unit 103 and the per-frame feature quantities output by the speech analysis unit 101, and outputs them to the detection unit 105.
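The kind of utterance such a language model accepts can be illustrated symbolically. The sketch below is only an assumption-laden illustration: it treats an utterance as a list of syllable labels and ignores likelihoods entirely, whereas the actual model is matched against acoustic features. The dictionary contents, the name `words_accepted`, and the choice to let either surrounding syllable chain be empty are all hypothetical.

```python
# Hypothetical dictionary: each word is stored as a chain of syllable units.
WORD_DICTIONARY = {
    "Osaka": ["o", "o", "sa", "ka"],
    "Naha": ["na", "ha"],
}


def words_accepted(syllables, dictionary=WORD_DICTIONARY):
    """Return the words w for which syllables == any chain + w + any chain.

    Assumption: the chains before and after the word may be empty, so any
    contiguous occurrence of the word's syllable chain counts as a match.
    """
    found = []
    for word, chain in dictionary.items():
        n = len(chain)
        starts = range(len(syllables) - n + 1)
        if any(syllables[i:i + n] == chain for i in starts):
            found.append(word)
    return found


print(words_accepted(["e", "o", "o", "sa", "ka", "de"]))
```

A real implementation would score every path through the model and keep the best one per frame; this sketch only shows which syllable strings the model structure admits.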

[0012] The storage unit 106 stores a predetermined threshold and the candidate word output from the detection unit 105. The detection unit 105 receives the optimal sequence and its likelihood from the likelihood calculation unit 104 at every frame, determines the candidate word, decides whether to output it, and outputs the candidate word if required.

[0013] The speech recognition device according to the first embodiment corresponds to claims 1, 2, and 5.

[0014] (Operation of the First Embodiment) FIG. 2 is a diagram showing the operation of the speech recognition device according to the first embodiment of the present invention.

[0015] The speech analysis unit 101 performs frequency analysis on each frame of the input speech, generates the feature quantity, and outputs it to the likelihood calculation unit 104 at every frame. The likelihood calculation unit 104 uses the automaton-controlled One Pass DP method described in Reference 2 to pattern-match the language model generated by the model generation unit 103 against the per-frame feature quantities output by the speech analysis unit 101. It thereby calculates the optimal sequence and its likelihood for each frame, and outputs them to the detection unit 105.

[0016] At each frame, the detection unit 105 first compares the likelihood output from the likelihood calculation unit 104 with the threshold stored in the storage unit 106 (step 1). If the likelihood is at or above the threshold, the candidate word in the optimal sequence output from the likelihood calculation unit 104 (hereinafter, the optimal word) is compared with the candidate word stored in the storage unit 106, and if they differ, the optimal word is output (step 2).

[0017] Next, if the likelihood is at or above the threshold, the optimal word is stored in the storage unit 106 (step 3A). If it is below the threshold, the empty word is stored in the storage unit 106 (step 3B).
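The per-frame decision of steps 1 to 3B can be sketched as follows. This is a minimal illustration rather than the device itself: the class name `FrameDetector`, the use of `None` to stand for the empty word, and the frame values in the usage lines are assumptions made for the example.

```python
EMPTY_WORD = None  # assumption: None stands in for the "empty word"


class FrameDetector:
    """Holds the threshold and the stored candidate word (storage unit role)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.stored_word = EMPTY_WORD  # initial candidate word is empty

    def step(self, optimal_word, likelihood):
        """Process one frame; return the word to output, or None."""
        output = None
        above = likelihood >= self.threshold              # step 1
        if above and optimal_word != self.stored_word:    # step 2
            output = optimal_word
        # Step 3A stores the optimal word; step 3B stores the empty word.
        self.stored_word = optimal_word if above else EMPTY_WORD
        return output


# Hypothetical frame stream: three low-likelihood frames, then three frames
# in which the same word scores above the threshold.
detector = FrameDetector(threshold=-100.0)
frames = [("Hekinan", -200.0)] * 3 + [("Hekinan", -99.0)] * 3
print([detector.step(word, score) for word, score in frames])
```

Because the stored word is compared before it is overwritten, a word is reported only on the first frame in which it becomes the above-threshold optimal word, not on every later frame.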

[0018] Next, the effect of the first embodiment of the present invention will be described. The method of Reference 2 provides a means of finding the optimal sequence and likelihood for each frame, but using these directly as the recognition result for word speech would mean outputting a recognition result at every frame, which is inconvenient. The first embodiment provides a means of extracting a single recognized word from the per-frame optimal sequences and likelihoods.

[0019] (Example of the First Embodiment) Next, the operation of one example of the first embodiment of the present invention will be described in detail.

[0020] In this example, the model shown in FIG. 3 is used as the language model. The model of FIG. 3 is a concatenation of three models: a front garbage model that accepts any syllable sequence including the empty sequence, a candidate word, and a rear garbage model that accepts any syllable sequence including the empty sequence. It accepts every utterance that passes through these models in this order, that is, every utterance consisting of one candidate word with arbitrary syllable sequences attached before and after it. The model generation unit 103 is assumed to have created and stored the model of FIG. 3 in advance from the candidate words in the word dictionary 102. The word dictionary 102 is assumed to contain three words: "Seki", "Hekinan", and "Naha". At every frame, the likelihood calculation unit 104 pattern-matches the feature sequence from the first frame up to the current frame against the language model of FIG. 3 by the method described in Reference 2, and calculates the optimal sequence and likelihood on that language model. The likelihood used is the natural logarithm of the probability in the final state of the language model. The storage unit 106 stores the threshold -100.0 in advance, and stores the empty word as the initial value of the candidate word. The frame interval in this example is 10 ms.

[0021] Suppose that "Hekinan" is uttered, as shown in FIG. 4. At every frame, the detection unit 105 receives the optimal sequence and its likelihood output from the likelihood calculation unit 104. The candidate word contained in the optimal sequence at each frame is assumed to follow the series shown in FIG. 5.

[0022] In frame 1, step 1 first compares the likelihood -200.0 with the threshold -100.0. Since the likelihood is below the threshold, nothing is done in step 2. Then, in step 3B, the empty word is stored in the storage unit 106.

[0023] From then until frame 120, every likelihood is below the threshold, so, as in frame 1, only the operation of storing the empty word in the storage unit 106 in step 3B is performed.

[0024] In frame 121, step 1 compares the likelihood -99.0 with the threshold -100.0. Since the likelihood is at or above the threshold, step 2 compares the optimal word "Hekinan" of frame 121 with the candidate word stored in the storage unit 106 (the empty word) and, because they differ, outputs "Hekinan". Then, in step 3A, the optimal word "Hekinan" is stored in the storage unit 106.

[0025] From frame 122 onward, every likelihood is at or above the threshold, so step 1 proceeds as in frame 121. In step 2, the optimal word "Hekinan" is compared with the candidate word "Hekinan" stored in the storage unit 106, and because they are identical the optimal word is not output. In step 3A, the optimal word "Hekinan" is stored in the storage unit 106.

[0026] (Second Embodiment) FIG. 6 is a diagram showing the configuration of the speech recognition device according to the second embodiment of the present invention.

[0027] The speech recognition device according to the second embodiment of the present invention comprises: a speech analysis unit 101 that extracts a feature quantity from the input speech at fixed time intervals (hereinafter, frames); a word dictionary 102 that stores words; a model generation unit 103 that generates a language model from the words in the word dictionary 102; a likelihood calculation unit 104 that, from the feature quantities and the language model, finds at each frame the optimal word sequence fitting the language model (hereinafter, the optimal sequence) and its likelihood; an intermediate model generation unit 107 that generates intermediate models from the words in the word dictionary 102; an intermediate likelihood calculation unit 108 that, from the feature quantities and the intermediate models, finds the likelihood of the intermediate models at each frame; a detection unit 105 that takes the outputs of the likelihood calculation unit 104 and the intermediate likelihood calculation unit 108 as input and outputs candidate words; and a storage unit 106 attached to the detection unit 105.

[0028] The speech analysis unit 101, word dictionary 102, model generation unit 103, and likelihood calculation unit 104 are the same as those described in the first embodiment.

[0029] The intermediate model generation unit 107 generates, for every initial subsequence (prefix) of each word model in the word dictionary 102, a language model (an intermediate model) that accepts utterances in which an arbitrary syllable chain is attached immediately before that subsequence.

[0030] The intermediate likelihood calculation unit 108 pattern-matches all the intermediate models generated by the intermediate model generation unit 107 against the per-frame feature quantities output by the speech analysis unit 101, and calculates the likelihood of each intermediate model at each frame. Then, for each candidate word, it finds the largest of the likelihoods of that word's intermediate models (hereinafter, the intermediate likelihood) and outputs it to the detection unit 105.
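The relation between a word's initial subsequences and its intermediate likelihood can be sketched as below. This is an illustrative assumption, not the patented procedure: the function names are hypothetical, the initial subsequences are taken to be the non-empty proper prefixes of the syllable chain, and the per-prefix log-likelihoods are made-up numbers standing in for the results of pattern matching.

```python
def prefix_chains(chain):
    """Non-empty proper prefixes of a word's syllable chain.

    Assumption: the full word itself is not counted as "partway through".
    """
    return [chain[:k] for k in range(1, len(chain))]


def intermediate_likelihoods(prefix_scores):
    """Map each candidate word to the maximum likelihood over its prefix models.

    prefix_scores: dict of word -> list of log-likelihoods, one per
    intermediate (prefix) model of that word, for the current frame.
    """
    return {word: max(scores) for word, scores in prefix_scores.items()}


print(prefix_chains(["he", "ki", "na", "n"]))

# Hypothetical per-frame scores for the prefix models of two words.
frame_scores = {
    "Hekinan": [-120.0, -101.0, -130.0],
    "Naha": [-150.0],
}
print(intermediate_likelihoods(frame_scores))
```

Taking the maximum means a word counts as possibly in progress if any one of its prefixes matches the speech so far well enough.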

[0031] The storage unit 106 stores a predetermined threshold and a predetermined preliminary threshold. It also stores the candidate word, the optimal words whose output is on hold, and the list of competing candidate words for each of them.

[0032] The detection unit 105 receives the output of the likelihood calculation unit 104 and the output of the intermediate likelihood calculation unit 108, and uses the information in the storage unit 106 to determine and output the output word.

[0033] The speech recognition device according to the second embodiment realizes a speech recognition device corresponding to claims 1, 2, 3, 6, 7, and 14.

[0034] (Operation of the Second Embodiment) FIG. 7 is a diagram showing the operation of the speech recognition device according to the second embodiment of the present invention. The speech analysis unit 101, word dictionary 102, model generation unit 103, and likelihood calculation unit 104 operate in the same way as in the first embodiment.

[0035] First, step 1 of the first embodiment is executed. If the likelihood is at or above the threshold, the optimal word output from the likelihood calculation unit 104 is compared with the candidate word stored in the storage unit 106 (step 2). If the likelihood is below the threshold, the empty word is stored in the storage unit 106 (step 3B). If the likelihood is at or above the threshold, the optimal word is stored in the storage unit 106 (step 3A). If the likelihood is at or above the threshold and the optimal word differs from the candidate word stored in the storage unit 106, the optimal word is stored in the storage unit 106 together with a list of all candidate words whose intermediate likelihood in the current frame is at or above the preliminary threshold (hereinafter, competing candidate words) (step 4). If any past competing candidate list stored in the storage unit 106 contains the optimal word of the current frame, that list and the information on its corresponding past optimal word are deleted from the storage unit 106 (step 4-2). As post-processing, every word whose intermediate likelihood in the current frame is below the preliminary threshold is removed from the past competing candidate word lists stored in the storage unit 106 (step 5). Next, among the optimal words stored in the storage unit 106, those whose competing candidate word list is empty are output, and their information is deleted from the storage unit 106 (step 6). Note that, in many cases, it is preferable for the candidate words whose intermediate likelihood is compared with the preliminary threshold to exclude the optimal word of the current frame rather than to include all candidate words.
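Steps 1 to 6 of the second embodiment can be sketched as one per-frame update. The sketch below rests on several assumptions that are not stated in the text: the class name `HoldingDetector`, `None` for the empty word, a list of (word, competitor set) pairs as the stored state, exclusion of the current optimal word from its own competitor list, and the illustrative likelihood values in the usage lines.

```python
EMPTY_WORD = None  # assumption: None stands in for the "empty word"


class HoldingDetector:
    def __init__(self, threshold, preliminary_threshold):
        self.threshold = threshold
        self.preliminary = preliminary_threshold
        self.stored_word = EMPTY_WORD  # candidate word kept between frames
        self.held = []                 # [(held optimal word, set of competitors)]

    def step(self, optimal_word, likelihood, intermediate):
        """intermediate: dict of word -> intermediate likelihood this frame."""
        above = likelihood >= self.threshold                       # step 1
        changed = above and optimal_word != self.stored_word       # step 2
        self.stored_word = optimal_word if above else EMPTY_WORD   # steps 3A/3B
        if changed:
            # Step 4: hold the new optimal word with its competing candidates.
            competitors = {w for w, s in intermediate.items()
                           if s >= self.preliminary and w != optimal_word}
            self.held.append((optimal_word, competitors))
            # Step 4-2: cancel held words that were waiting on this word.
            self.held = [(w, c) for w, c in self.held if optimal_word not in c]
        # Step 5: drop competitors that fell below the preliminary threshold.
        low = float("-inf")
        self.held = [(w, {x for x in c if intermediate.get(x, low) >= self.preliminary})
                     for w, c in self.held]
        # Step 6: output held words with no competitors left and forget them.
        outputs = [w for w, c in self.held if not c]
        self.held = [(w, c) for w, c in self.held if c]
        return outputs


detector = HoldingDetector(threshold=-100.0, preliminary_threshold=-101.5)
frame = ("Seki", -99.0, {"Seki": -99.5, "Hekinan": -100.0, "Naha": -150.0})
print(detector.step(*frame), detector.held)
```

When a new optimal word has no competitors at all, it is held and released within the same frame, so the behavior reduces to that of the first embodiment.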

[0036] Next, the effect of the second embodiment of the present invention will be described.

[0037] In the first embodiment, a word resembling part of a candidate word may be detected. For example, when "Hekinan" is uttered, "Seki", which resembles the "Heki" of "Hekinan", may be detected. In such a case, if the speech can be estimated to be partway through another word, namely "Hekinan", the second embodiment puts the detection of "Seki" on hold (step 4) and cancels it once "Hekinan" is completed (step 4-2). If the intermediate likelihood of "Hekinan" falls sufficiently, the estimate that the speech is partway through "Hekinan" is discarded (step 5), and the held detection of "Seki" is carried out (step 6). In this way, the detection of words that resemble a subsequence of another word can be prevented. Moreover, the optimal word is output in the very frame in which every estimate that the speech is partway through another candidate word has been discarded, so the delay introduced by preventing such detections is kept to the necessary minimum.

[0038] (Example of the Second Embodiment) Next, the operation of one example of the second embodiment of the present invention will be described in detail.

[0039] In this example, the language model is the model shown in FIG. 3, the same one used in the first example. The word dictionary 102, the likelihood calculation unit 104, the values stored in advance in the storage unit 106, the likelihood calculation method, and the frame interval are the same as in the first embodiment.

[0040] The intermediate model, shown in FIG. 8, is a concatenation of two models: a front garbage model that accepts any syllable sequence including the empty sequence, and a candidate word. It accepts every utterance that passes through these models in this order, that is, every utterance consisting of one candidate word preceded by an arbitrary syllable sequence. The intermediate likelihood can be obtained at each frame by reading out the likelihoods of all mid-word states of the model of FIG. 8; for example, when each syllable has one state, these are the states marked with black dots in FIG. 8. The intermediate model generation unit 107 is assumed to have created and stored the intermediate models of FIG. 8 from the candidate words in the word dictionary 102. However, the intermediate model is a partial model of the language model shown in FIG. 3 and uses the same pattern-matching method, so the intermediate likelihood can also be obtained by reading out the likelihoods of the intermediate states of the language model of FIG. 3 as needed. This example can therefore also be realized by sharing the model of FIG. 3 between the language model and the intermediate model, without providing a separate intermediate model generation unit 107 and intermediate likelihood calculation unit 108.

[0041] In this example, the preliminary threshold is -101.5. The preliminary threshold selects the candidate words that may be partway through being uttered at the current frame. The lower it is set, the more competing candidate words there are and the less likely a word in progress is overlooked, but the longer the delay before a candidate word is output. Normally a value lower than the threshold is chosen. In this example, the competing candidate words do not include the optimal word of the frame in question.

[0042] A concrete example of the operation of the second embodiment is given below.

[0043] Suppose that "Hekinan" is uttered, as shown in FIG. 9. At every frame, the detection unit 105 receives the optimal sequence and its likelihood output from the likelihood calculation unit 104. The candidate word contained in the optimal sequence at each frame is assumed to be the word shown in FIG. 10.

[0044] In frame 1, step 1 first compares the likelihood -200.0 with the threshold -100.0. In step 3B, the empty word is stored in the storage unit 106. Since no optimal word with a corresponding competing candidate list is stored in the storage unit 106, nothing is done in the subsequent steps. From then until frame 90, every likelihood is below the threshold and no pair of an optimal word and a competing candidate list is stored in the storage unit 106, so, as in frame 1, only the operation of storing the empty word in the storage unit 106 in step 3B is performed.

[0045] In frame 91, step 1 compares the likelihood -99.0 with the threshold -100.0. Since the likelihood is at or above the threshold, processing proceeds through step 2, and the optimal word "Seki" of frame 91 is stored in the storage unit 106 (step 3A). Next, a list of competing candidate words (here, only "Hekinan") is created, and the optimal word "Seki" is stored in the storage unit 106 together with the list containing only "Hekinan" (step 4).

[0046] In frame 91, no past optimal word is stored in the storage unit 106, so nothing is done in the subsequent steps.

[0047] From frame 92 through frame 120, the optimal word remains "Seki", so from step 1 through step 4 the only operation is writing "Seki" to the storage unit 106. In step 4-2, nothing is done because the optimal word "Seki" of the current frame does not appear in any past competing candidate list.

[0048] In step 5, nothing is done because the intermediate likelihood of "Hekinan", the element of the competing candidate word list, is at or above the preliminary threshold. In step 6, nothing is done because the list still contains "Hekinan".

[0049] In frame 121, the optimal word changes to "Hekinan". In steps 1 to 2, "Hekinan" is stored in the storage unit 106. In steps 3 to 4, the optimal word "Hekinan" and a competing candidate word list containing only "Naha" are stored in the storage unit 106.

[0050] In step 4-2, the competing candidate word list of the past optimal word "Seki" contains the optimal word "Hekinan" of frame 121, so the past optimal word "Seki" and its competing candidate word list are deleted from the storage unit 106. In steps 5 and 6, nothing is done because no applicable competing candidate word list exists.

[0051] From frame 122 through frame 140, neither the optimal word nor the competing candidate words change, so the only operation is updating the optimal word in the storage unit 106.

[0052] In frame 141, the optimal word has not changed, so steps 1 through 4-2 proceed as in the preceding frame. In step 5, the intermediate likelihood of "Naha" has fallen below the preliminary threshold, so "Naha" is removed from the alternative candidate word list of the past optimal word "Hekinan". This leaves the alternative candidate word list of "Hekinan" empty, so "Hekinan" is detected in step 6. In this way, "Seki", a word similar to a subsequence of "Hekinan", is not output, and the uttered word "Hekinan" is output as soon as the intermediate estimate for the alternative candidate word "Naha" is discarded.
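The bookkeeping walked through in frames 121 to 141 (steps 4-2, 5 and 6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the detection unit 105: the function name `update_pending`, the dictionary layout, the contents of the list for "Seki", and the numeric likelihoods and threshold are all hypothetical; only the words and the order of events come from the example above.

```python
def update_pending(pending, optimal_word, intermediate, prelim_threshold):
    """Apply steps 4-2, 5 and 6 to the stored entries for one frame.

    pending          : dict mapping a stored optimal word to the set of its
                       alternative candidate words (hypothetical layout)
    optimal_word     : optimal word of the current frame
    intermediate     : dict mapping candidate word -> intermediate likelihood
    prelim_threshold : preliminary threshold
    Returns the list of words detected (output) in this frame.
    """
    # Step 4-2: drop stored words whose list contains the current optimal word.
    for word in [w for w, rivals in pending.items() if optimal_word in rivals]:
        del pending[word]

    detected = []
    for word in list(pending):
        rivals = pending[word]
        # Step 5: remove alternatives whose intermediate likelihood fell
        # below the preliminary threshold.
        rivals.difference_update(
            {r for r in rivals
             if intermediate.get(r, float("-inf")) < prelim_threshold}
        )
        # Step 6: an empty list means the stored word is detected.
        if not rivals:
            detected.append(word)
            del pending[word]
    return detected


# Hypothetical values reproducing the order of events in the example.
pending = {"Seki": {"Hekinan"}}
pending["Hekinan"] = {"Naha"}          # steps 3 and 4 in frame 121
frame_121 = update_pending(pending, "Hekinan", {"Hekinan": 0.9, "Naha": 0.6}, 0.5)
frame_141 = update_pending(pending, "Hekinan", {"Hekinan": 0.9, "Naha": 0.3}, 0.5)
print(frame_121, frame_141)
```

With these assumed numbers, nothing is output in frame 121 (the entry for "Seki" is dropped), and "Hekinan" is output in frame 141 once "Naha" falls below the threshold.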

[0053] (Modification of the second embodiment) As a modification of the second embodiment, the detection unit 105 may store the current frame number in the storage unit 106 along with the other data in step 4, and in step 6 output not only the entries whose alternative candidate word list is empty but also the entries whose stored frame number is at least a fixed time earlier than the current frame. With this operation, the delay from the utterance of a candidate word to its output is forcibly kept within a fixed time, which is useful when building a speech recognition device that needs recognition results as quickly as possible.
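The bounded-delay output rule of this modification might look like the sketch below. The class `PendingWord`, the function `should_output`, and the parameter `max_delay_frames` (the fixed time expressed as a frame count) are names assumed for illustration; the text only states that the frame number is stored in step 4 and compared with the current frame in step 6.

```python
from dataclasses import dataclass, field


@dataclass
class PendingWord:
    word: str                 # stored optimal word
    stored_frame: int         # frame number recorded in step 4
    rivals: set = field(default_factory=set)  # alternative candidate words


def should_output(entry, current_frame, max_delay_frames):
    """Step 6 with the timeout: output when the list is empty, or when the
    stored frame is at least max_delay_frames older than the current frame."""
    list_empty = not entry.rivals
    timed_out = current_frame - entry.stored_frame >= max_delay_frames
    return list_empty or timed_out


entry = PendingWord("Hekinan", stored_frame=121, rivals={"Naha"})
print(should_output(entry, current_frame=130, max_delay_frames=50))
print(should_output(entry, current_frame=171, max_delay_frames=50))
```

The timeout is an OR-condition on top of the empty-list test, so it can only make output earlier, never later.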

[0054] This modification realizes a speech recognition device corresponding to claims 1, 2, 6, 7, 13, and 14.

[0055] As a further modification, in step 6 of the above modification, the corresponding optimal word may be output when the frame number stored in the storage unit 106 is at or above a fixed value. This is useful for a device that accepts input only within a fixed time after the speech recognition device is started.

[0056] This modification realizes a speech recognition device corresponding to claims 1, 2, 6, 7, 10, and 14.

[0057] As another modification of the second embodiment, in step 6 the detection unit 105 may output the optimal word stored in the storage unit 106 not only when the alternative candidate word list is empty but also when, in that frame, an input signal is received from an external input device such as a keyboard, a switch, or another speech recognition device. Because the speaker then decides the timing of recognition, the section in which a word can exist is limited, and recognition performance can be expected to improve. This is useful, for example, when the speaker can be asked to press a button after finishing the utterance.

[0058] This modification realizes a speech recognition device corresponding to claims 1, 2, 6, 7, 11, and 14.

[0059] In this case, a speech detection device may be used as the external input device. The device described in JP-A-7-225592, for example, can be used, but any device that determines the utterance section in the input speech will do. Since speech detection is expected to determine the start and end of the speech section in step with the speaker's utterance, using a speech detection device in place of a button or the like yields a speech recognition device that requires no extra action from the speaker. To allow for the case where the speech detection device fails to detect the end of the speech section, the detection unit 105 may also store the current frame number in the storage unit 106 in step 4 and, in step 6, output the entries whose stored frame number is at least a fixed time earlier than the current frame even when no signal arrives from the speech detection device. This realizes a speech recognition device corresponding to claims 1, 2, 6, 7, and 17. In this case, the output conditions adopted in claim 17 are the output conditions of claims 12 and 14.

[0060] As a further modification of the second embodiment, after the alternative candidate word list is created in step 4 of the detection unit 105, and for as long as the optimal word does not change and the optimal sequence ends with that optimal word, every candidate word whose intermediate likelihood has reached the preliminary threshold in the frame may be added to the list, so that alternative candidate words are listed more reliably. With this added operation, an alternative candidate word whose likelihood happens to be low in the frame where the optimal word first appears is no longer missed from the list.

[0061] This modification realizes a speech recognition device corresponding to claims 1, 2, 4, 6, 8, 10, and 14. It is an example in which the fixed time after the detection section in claim 8 is 0 ms.

[0062] In this case, the device may also be designed as follows. After the alternative candidate word list is created in step 4 of the detection unit 105, every candidate word whose intermediate likelihood has reached the preliminary threshold in the frame is added to the list, both while the optimal word does not change and the optimal sequence ends with that optimal word, and for a fixed time thereafter, for example about 200 ms. The output operation is performed only once the intermediate likelihoods of all alternative candidate words in the list have fallen below the preliminary threshold. This method reduces false detections by exploiting an assumption, where such an assumption is acceptable, such as that multiple candidate words are never spoken within 200 ms of one another. It realizes a speech recognition device corresponding to claims 1, 2, 4, 6, 8, 10, and 17, and is an example in which the fixed time after the detection section in claim 8 is 200 ms.
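One way to picture the extended collection window is the sketch below. It is an assumption-laden illustration: the function name `update_rivals`, the convention that `frames_after_section` is 0 while the detection section is still running, and the value `hold_frames=20` (200 ms at an assumed 10 ms frame period) are not taken from the text. The text also does not spell out whether output must additionally wait for the window to end; this sketch does not enforce that.

```python
def update_rivals(rivals, scores, prelim_threshold, frames_after_section, hold_frames):
    """Collect alternative candidate words during the detection section and
    for hold_frames afterwards, then report whether output is allowed.

    rivals               : set of alternative candidate words (updated in place)
    scores               : dict of other candidate word -> intermediate likelihood
    frames_after_section : 0 inside the detection section, otherwise the number
                           of frames elapsed since it ended (assumed convention)
    """
    if frames_after_section <= hold_frames:
        # Still collecting: add every candidate at or above the threshold.
        rivals.update(w for w, s in scores.items() if s >= prelim_threshold)
    # Output only when every listed alternative is below the threshold.
    return all(scores.get(w, float("-inf")) < prelim_threshold for w in rivals)


rivals = set()
print(update_rivals(rivals, {"Naha": 0.7}, 0.5, frames_after_section=0, hold_frames=20))
print(update_rivals(rivals, {"Naha": 0.2}, 0.5, frames_after_section=5, hold_frames=20))
```

Setting `hold_frames` to 0 gives the 0 ms case of the preceding modification, in which candidates are collected only during the detection section itself.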

[0063] As a further modification of the second embodiment, the alternative candidate word list stored in the storage unit 106 in step 4 of the detection unit 105 may always be the list of all candidate words. In this case, after a word is detected, the detection is held pending for as long as the intermediate likelihood of any word exceeds the preliminary threshold. In addition, when the optimal word is replaced, the new optimal word is necessarily in the alternative candidate word list, so the detection of the past optimal word is abandoned. This modification also suppresses not only word detections that overlap in time but also detections that are close together in time.
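A minimal sketch of the hold test when the list always contains every candidate word is given below. The function name `detection_held` and the score values are hypothetical. Excluding the detected word itself from the test and treating a score exactly equal to the threshold as still blocking are assumptions made here, the latter chosen to match the "at or above" wording used for step 5.

```python
def detection_held(detected_word, intermediate, prelim_threshold):
    """Return True while any other candidate word keeps the detection pending.

    intermediate maps every candidate word to its intermediate likelihood in
    the current frame; the detected word itself is skipped (assumption).
    """
    return any(
        score >= prelim_threshold
        for word, score in intermediate.items()
        if word != detected_word
    )


scores = {"Hekinan": 0.9, "Naha": 0.6, "Seki": 0.1}
print(detection_held("Hekinan", scores, 0.5))   # "Naha" still blocks output
scores["Naha"] = 0.3
print(detection_held("Hekinan", scores, 0.5))   # nothing blocks output now
```

Because every candidate is a potential blocker, no per-word list needs to be stored, at the cost of holding detections more often.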

[0064] This modification is a speech recognition device corresponding to claims 1, 2, 4, 6, 8, and 16. It is an example in which the fixed time after the detection section in claim 8 is 0 ms.

[0065] As a further variation of this modification, a speech detection device may be added as an external input device, and the output condition may be made stricter by additionally requiring that an end-of-speech detection signal be received from that speech detection device. This strengthens the holding of detections by using both temporal proximity and the possibility that another candidate word is in the middle of being uttered.

[0066] This modification corresponds to claims 1, 2, 4, 6, 8, and 18, and the output conditions in claim 18 are the output conditions of claims 12 and 16. This too is an example in which the fixed time after the detection section in claim 8 is 0 ms.
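The stricter output condition of this variation, in which two conditions must hold together, could be sketched as follows. The function name `output_allowed` and the boolean `speech_end_signal` (standing in for the end-of-speech signal from the speech detection device) are assumptions for illustration, as is skipping the detected word itself when checking the other candidates.

```python
def output_allowed(detected_word, intermediate, prelim_threshold, speech_end_signal):
    """Both conditions must hold: every other candidate word is below the
    preliminary threshold AND the end-of-speech signal has been received."""
    others_below = all(
        score < prelim_threshold
        for word, score in intermediate.items()
        if word != detected_word
    )
    return others_below and speech_end_signal


scores = {"Hekinan": 0.9, "Naha": 0.3}
print(output_allowed("Hekinan", scores, 0.5, speech_end_signal=False))
print(output_allowed("Hekinan", scores, 0.5, speech_end_signal=True))
```

Replacing `and` with `or` would give the looser style of combination, in which satisfying any one of the selected conditions is enough.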

[0067]

[Effects of the Invention] As described above, the present invention reduces overlapping word detections and word detections that are too close together, and realizes a speech recognition device with fewer false detections at the minimum necessary delay.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of the speech recognition device according to the first embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the speech recognition device according to the first embodiment of the present invention.

FIG. 3 is a network representing the language model used in the speech recognition device according to the first embodiment of the present invention.

FIG. 4 is a diagram showing the change over time of the optimal word and the likelihood in one example of the speech recognition device according to the first embodiment of the present invention.

FIG. 5 is a table showing the optimal word and the likelihood for each frame in one example of the speech recognition device according to the first embodiment of the present invention.

FIG. 6 is a configuration diagram of the speech recognition device according to the second embodiment of the present invention.

FIG. 7 is a flowchart showing the operation of the speech recognition device according to the second embodiment of the present invention.

FIG. 8 is a network representing the language model used in the speech recognition device according to the second embodiment of the present invention.

FIG. 9 is a diagram showing the change over time of the optimal word, the likelihood, and the intermediate likelihood in one example of the speech recognition device according to the second embodiment of the present invention.

FIG. 10 is a table showing the optimal word, the likelihood, and the alternative candidate words for each frame in one example of the speech recognition device according to the second embodiment of the present invention.

Continuation of front page

(56) References
JP-A-9-166995 (JP, A)
JP-A-8-83091 (JP, A)
JP-A-6-266386 (JP, A)
JP-A-5-19784 (JP, A)
JP-A-4-362698 (JP, A)
JP-A-9-179581 (JP, A)
JP-A-9-106297 (JP, A)
Proceedings of the Acoustical Society of Japan 1996 Spring Meeting, I, 3-5-2, "Word spotting for word recognition based on semisyllable units", pp. 111-112 (issued March 26, 1996)
IEICE Technical Report [Speech], Vol. 95, No. 355, SP95-77, "Word recognition by word spotting using semisyllables", pp. 31-38 (issued November 16, 1995)
Proceedings of 1985 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 3, "Keyword Recognition using Template Concatenation", pp. 1233-1236

(58) Fields searched (Int. Cl.7, DB name)
G10L 3/00 - 9/20
JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition device comprising: a speech analysis unit that extracts a feature amount for each frame by frequency analysis of input speech at fixed time intervals (frames); a likelihood calculation unit that, for each frame, performs pattern matching between the feature amount and a language model in which speech models accepting arbitrary syllable strings are added before and after candidate words, selects the optimal word sequence (optimal sequence) on the language model for each frame, and calculates its likelihood; and a detection unit that determines and outputs an optimal word for each frame from the optimal sequence and the likelihood, wherein, when the optimal word is repeated one or more times, the optimal word within the corresponding time (detection section) is output at most once.

2. A speech recognition device comprising: a speech analysis unit that extracts a feature amount for each frame by frequency analysis of input speech at fixed time intervals (frames); a likelihood calculation unit that, for each frame, performs pattern matching between the feature amount and a language model in which speech models accepting arbitrary syllable strings are added before and after predetermined candidate words, selects the optimal word sequence (optimal sequence) on the language model for each frame, and calculates its likelihood; and a detection unit that determines and outputs an optimal word for each frame from the optimal sequence and the likelihood, wherein, when the optimal word is repeated one or more times and the likelihood reaches a predetermined constant value (threshold) at least once within the corresponding time (detection section), the optimal word in the detection section is output at most once.

3. The speech recognition device according to claim 1, wherein the candidate words included in the optimal sequence are taken as the optimal words.

4. The speech recognition device according to claim 1 or 2, wherein, if the last word of the optimal sequence is a candidate word, that candidate word is taken as the optimal word, and otherwise no optimal word is output.

5. The speech recognition device according to claim 1, wherein the time at which the optimal word is output is the first time of the detection section.

6. The speech recognition device according to claim 1, wherein the time at which the optimal word is output is the first time at which a predetermined condition (output condition) is satisfied.

7. The speech recognition device according to claim 1, further comprising: language models (intermediate models) that accept an utterance in which an arbitrary syllable string is placed immediately before each first-half subsequence of each of the candidate words; and a device that, in each frame, calculates the likelihood obtained by pattern matching between each intermediate model and the feature amount, and calculates, for each candidate word, the maximum likelihood among the likelihoods of all of its intermediate models (intermediate likelihood), wherein output is suspended only when there is a candidate word (alternative candidate word) whose intermediate likelihood has reached a predetermined constant value (preliminary threshold), and output is thereafter performed at the first time at which a predetermined condition (output condition) is satisfied.

8. The speech recognition device according to claim … or 4, further comprising: language models (intermediate models) that accept an utterance in which an arbitrary syllable string is placed immediately before each first-half subsequence of each of the candidate words; and a device that, in each frame, calculates the likelihood obtained by pattern matching between each intermediate model and the feature amount, and calculates, for each candidate word, the maximum likelihood among the likelihoods of all of its intermediate models (intermediate likelihood), wherein output is suspended only when there is a candidate word (alternative candidate word) whose intermediate likelihood has reached a predetermined constant value (preliminary threshold) at least once within the detection section, and output is thereafter performed at the first time at which a predetermined condition (output condition) is satisfied.

9. The speech recognition device according to claim … or 4, further comprising: language models (intermediate models) that accept an utterance in which an arbitrary syllable string is placed immediately before each first-half subsequence of each of the candidate words; and a device that, in each frame, calculates the likelihood obtained by pattern matching between each intermediate model and the feature amount, and calculates, for each candidate word, the maximum likelihood among the likelihoods of all of its intermediate models (intermediate likelihood), wherein output is suspended only when there is a candidate word (alternative candidate word) whose intermediate likelihood has reached a predetermined constant value (preliminary threshold) in the frames within the detection section and within a predetermined time thereafter, and output is thereafter performed at the first time at which a predetermined condition (output condition) is satisfied.

10. The speech recognition device according to claim 7, 8 or 9, wherein the output condition is that, in a frame at which a predetermined time has elapsed since the device was started, no alternative candidate word has become the optimal word since the detection time.

11. The speech recognition device according to claim 6, further comprising an input device that receives an external input, wherein the output condition is that a signal from the input device is supplied after the detection time.

12. The speech recognition device according to claim 11, wherein the input device is a speech detection device that determines the range of a speech section, and the output condition is that a signal indicating the end of the speech section is supplied from the speech detection device.

13. The speech recognition device according to claim 7, 8 or 9, wherein the output condition is that, in a frame at which a predetermined time has elapsed after the detection time, no alternative candidate word has become the optimal word since the detection time.

14. The speech recognition device according to claim 7, wherein the output condition is that, after the detection time, the intermediate likelihood of every alternative candidate word has fallen below the preliminary threshold at least once, and none of those alternative candidate words has become the optimal word even once since the detection time.

15. The speech recognition device according to claim 7, wherein the output condition is that, in a frame after the detection time, the intermediate likelihoods of all of the alternative candidate words are below the preliminary threshold, and none of those alternative candidate words has become the optimal word even once since the detection time.

16. The speech recognition device according to claim 7, wherein the output condition is that, in a frame after the detection time, the intermediate likelihoods of all other candidate words are below the preliminary threshold, and no other candidate word has become the optimal word even once.

17. The speech recognition device according to claim 7, 8 or 9, wherein the output condition is that any one of a plurality of output conditions selected in advance from among the output conditions of claims 10, 11, 12, 13, 14 and 15 is satisfied.

18. The speech recognition device according to claim 7, wherein the output condition is that all of a plurality of output conditions selected in advance from among the output conditions of claims 10, 11, 12, 13, 14, 15 and 16 are satisfied.