JPH09311694A

JPH09311694A - Speech recognition device

Info

Publication number: JPH09311694A
Application number: JP8171422A
Authority: JP
Inventors: Masaru Takano; 優高野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1996-03-21
Filing date: 1996-07-01
Publication date: 1997-12-02
Anticipated expiration: 2016-07-01
Also published as: JP3006496B2

Abstract

PROBLEM TO BE SOLVED: To provide a word spotting device, and a method therefor, with fewer erroneous detections. SOLUTION: A likelihood calculation unit 104 performs frame-synchronous DP matching against a model in which an arbitrary syllable chain is added before and after each candidate word, and outputs the result to a detection unit 105. An intermediate likelihood calculation unit 108 performs frame-synchronous DP matching against a model in which an arbitrary syllable chain is added only before each candidate word, and outputs the intermediate likelihood of each candidate word to the detection unit 105. When the optimum word output by the likelihood calculation unit 104 has a likelihood at or above a threshold, the detection unit 105 searches the outputs of the intermediate likelihood calculation unit 108 for rival candidate words whose intermediate likelihood is at or above a fixed value; if any such word is found, detection of the optimum word is held pending. The detection is confirmed once all of the rival candidate words have been eliminated. Conversely, the detection is discarded if any one of the rival candidate words is itself detected.

Description

Detailed Description of the Invention

[0001]

Field of the Invention

The present invention relates to a speech recognition device that detects specific words within an utterance.

[0002]

Description of the Related Art

Compared with ordinary speech recognition, word detection has the advantage that the interval in which a word occurs in the input speech need not be determined in advance.

[0003] Conventionally, a method such as that described in Reference 1 (IEICE Transactions (D), J67-D, 11, pp. 1242-1249) is known for detecting specific words within an utterance. In that method, when the likelihood computed independently for each candidate word at each frame exceeds a fixed threshold, the word is detected regardless of whether other candidate words are detected. As another method, the one described in Reference 2 (Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," pp. 20-26) is known. That method determines and outputs, for each frame, the word that is optimal under the language model. As a method of suppressing the detection of temporally overlapping words, the one described in Reference 3 (IEICE Technical Report SP95-77, pp. 31-38) is known. That method decides a provisional detection of each candidate word at each frame and, from the likelihood of each detection, uses frame-synchronous processing to suppress temporally overlapping detections.

[0004]

Problems to Be Solved by the Invention

However, the method of Reference 1 permits hypotheses in which candidate words overlap in time within an utterance. This does not happen with the method of Reference 2, but that method can detect, partway through an utterance, a word that resembles part of the word actually spoken. To prevent this, the time range in which a word exists in the input speech must be fixed by some means. The method of Reference 3 has no such requirement, nor does it permit hypotheses in which candidate words overlap in time. However, when using a model such as that shown in Reference 3, in which a model accepting arbitrary syllable sequences (a garbage model) is attached only in front of the recognition target word, the likelihood of a word resembling part of the spoken word may be estimated higher than that of the word actually spoken. The similar word is then detected and the detection of the actually spoken word is cancelled (partial matching), and how to remedy this is not obvious. On the other hand, when a model with garbage models attached both before and after the recognition target word is used with the method of Reference 2, it is known that partial matching can be avoided by setting the pattern-matching score of the garbage model low.

[0005] An object of the present invention is to provide a word recognition method that combines the above-described advantages of the methods of References 1 to 3, by using a method that reduces the detection of multiple candidates for a single spoken word, as described above, while also retaining the robustness against partial matching of Reference 2.

[0006]

Means for Solving the Problems

The present invention relates to a method for reducing the detection of multiple candidate words during the utterance of a single word, in a speech recognition device that detects candidate words from input speech data frame by frame on the basis of likelihood.

[0007]

BEST MODE FOR CARRYING OUT THE INVENTION

(First Embodiment) FIG. 1 shows the configuration of a speech recognition device according to a first embodiment of the present invention.

[0008] The speech recognition device according to the first embodiment comprises: a speech analysis unit 101 that extracts feature values from the input speech at fixed time intervals (hereinafter, frames); a word dictionary 102 that stores candidate words; a model generation unit 103 that generates a language model from the candidate words in the word dictionary; a likelihood calculation unit 104 that, from the feature values and the language model, obtains at each frame the optimum word sequence fitting the language model (hereinafter, the optimum sequence) and its likelihood; a detection unit 105 that takes the output of the likelihood calculation unit 104 as input and outputs candidate words; and a storage unit 106 attached to the detection unit 105.

[0009] The speech analysis unit 101 performs frequency analysis of the input speech frame by frame and generates a feature vector (hereinafter, feature values) for each frame. The elements of the feature vector include power, power change, mel-cepstrum, mel-cepstrum change, and second-order mel-cepstrum change. The feature values are output to the likelihood calculation unit 104 at every frame.
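As an illustration of how such a feature vector might be assembled, the sketch below concatenates power, power change, mel-cepstrum, mel-cepstrum change, and second-order change for each frame. The patent does not say how the change terms are computed; taking them as simple frame-to-frame differences, and the function names `delta` and `build_features`, are assumptions made only for this example.

```python
def delta(seq):
    """Frame-to-frame change of a sequence of equal-length vectors.

    Assumption: the change is a plain first difference, and the first
    frame, which has no predecessor, gets a change of zero.
    """
    out = [[0.0] * len(seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def build_features(power, mel_cepstrum):
    """Concatenate, per frame: power, power change, mel-cepstrum,
    mel-cepstrum change, and second-order mel-cepstrum change."""
    power_vecs = [[p] for p in power]
    d_power = delta(power_vecs)
    d_cep = delta(mel_cepstrum)
    dd_cep = delta(d_cep)  # change of the change
    return [
        pw + dp + c + dc + ddc
        for pw, dp, c, dc, ddc in zip(power_vecs, d_power, mel_cepstrum, d_cep, dd_cep)
    ]
```

Here `power` is a list of per-frame scalars and `mel_cepstrum` a list of per-frame coefficient lists; both would come from the frequency analysis, which is outside this sketch.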

[0010] The word dictionary 102 stores each word to be recognized as a chain of unit acoustic models, for example a chain of acoustic models representing the syllables that make up the word (for "Osaka", the chain "o"-"o"-"sa"-"ka").
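A word dictionary of this kind can be pictured as a mapping from each word to its chain of unit models. The sketch below is only an illustration: the romanized keys, the syllable labels, and the helper `syllable_chain` are hypothetical, and a real dictionary would reference trained acoustic models rather than strings.

```python
# Hypothetical representation: each word maps to its chain of syllable-level
# unit acoustic models, identified here by plain string labels.
WORD_DICTIONARY = {
    "Osaka": ["o", "o", "sa", "ka"],
}


def syllable_chain(word):
    """Return the stored chain in the hyphen-joined form used in the text."""
    return "-".join(WORD_DICTIONARY[word])
```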

[0011] The model generation unit 103 constructs a language model that accepts utterances in which an arbitrary syllable chain is added before and after each word model in the word dictionary 102. The likelihood calculation unit 104 computes the optimum sequence and its likelihood at each frame from the language model generated by the model generation unit 103 and the per-frame feature values output by the speech analysis unit 101, and outputs them to the detection unit 105.

[0012] The storage unit 106 stores a predetermined threshold and the candidate word output from the detection unit 105. The detection unit 105 receives the optimum sequence and its likelihood from the likelihood calculation unit 104 at every frame, determines the candidate word, decides whether to output it, and outputs the candidate word if required.

[0013] The speech recognition device according to the first embodiment corresponds to claims 1, 2, and 5.

[0014] (Operation of the First Embodiment) FIG. 2 shows the operation of the speech recognition device of the first embodiment of the present invention.

[0015] The speech analysis unit 101 performs frequency analysis of the input speech frame by frame, generates feature values, and outputs them to the likelihood calculation unit 104 at every frame. The likelihood calculation unit 104 uses the automaton-controlled One Pass DP method described in Reference 2 to pattern-match the language model generated by the model generation unit 103 against the per-frame feature values output by the speech analysis unit 101, thereby computing the optimum sequence and its likelihood at each frame, and outputs both to the detection unit 105.

[0016] At each frame, the detection unit 105 first compares the likelihood output from the likelihood calculation unit 104 with the threshold stored in the storage unit 106 (step 1). If the likelihood is at or above the threshold, the candidate word in the optimum sequence output from the likelihood calculation unit 104 (hereinafter, the optimum word) is compared with the candidate word stored in the storage unit 106, and if they differ, the optimum word is output (step 2).

[0017] Next, if the likelihood is at or above the threshold, the optimum word is stored in the storage unit 106 (step 3A). If it is below the threshold, the empty word is stored in the storage unit 106 (step 3B).
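Steps 1 to 3 amount to a small per-frame state machine. The following is a minimal sketch under stated assumptions: the class name `ThresholdDetector`, the use of `None` for the empty word, and the convention of returning the output word from `step` are all illustrative choices, and the optimum word and likelihood are taken as given inputs rather than computed by DP matching.

```python
EMPTY = None  # stands for the "empty word"


class ThresholdDetector:
    """Sketch of the detection unit of the first embodiment."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.stored = EMPTY  # candidate word held in the storage unit

    def step(self, best_word, likelihood):
        """Process one frame; return the word to output, or None."""
        output = None
        if likelihood >= self.threshold:    # step 1: compare with threshold
            if best_word != self.stored:    # step 2: output only on change
                output = best_word
            self.stored = best_word         # step 3A: remember optimum word
        else:
            self.stored = EMPTY             # step 3B: remember empty word
        return output
```

Because the stored word is reset to the empty word whenever the likelihood drops below the threshold, the same word can be output again if it is detected a second time after such a gap.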

[0018] The effect of the first embodiment is as follows. The method of Reference 2 provides a means of obtaining the optimum sequence and likelihood at each frame, but using these directly as the word recognition result would mean outputting a recognition result at every frame, which is inconvenient. The first embodiment provides a means of extracting a single recognized word from the per-frame optimum sequences and likelihoods.

[0019] (Example of the First Embodiment) The operation of one example of the first embodiment of the present invention will now be described in detail.

[0020] This example uses the model shown in FIG. 3 as the language model. The model of FIG. 3 is a concatenation of three models: a front garbage model that accepts any syllable sequence including the empty sequence, a candidate word, and a rear garbage model that accepts any syllable sequence including the empty sequence. It accepts every utterance that passes through these models in this order, that is, every utterance consisting of one candidate word with arbitrary syllable sequences added before and after it. The model generation unit 103 is assumed to have built the model of FIG. 3 in advance from the candidate words of the word dictionary 102 and to hold it. The word dictionary 102 contains three words: "Seki" (関, se-ki), "Hekinan" (碧南, he-ki-na-n), and "Naha" (那覇, na-ha). At every frame, the likelihood calculation unit 104 uses the method of Reference 2 to pattern-match the feature sequence from the first frame up to the current frame against the language model of FIG. 3, and computes the optimum sequence and likelihood under that model. The likelihood is the natural logarithm of the probability value in the final state of the language model. The storage unit 106 holds the threshold -100.0 in advance and holds the empty word as the initial candidate word. The frame interval in this example is 10 ms.
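At the symbolic level, the garbage-word-garbage structure means a syllable sequence is accepted exactly when the candidate word's syllable chain appears somewhere inside it as a contiguous run. The sketch below checks only this acceptance condition; it ignores likelihoods entirely, and the function name `accepts` and the syllable labels are assumptions for illustration.

```python
def accepts(syllables, word_syllables):
    """True if `syllables` is front garbage + word + rear garbage.

    Both garbage parts may be empty, so this reduces to testing whether
    the word's chain occurs as a contiguous run inside the sequence.
    """
    n, m = len(syllables), len(word_syllables)
    return any(syllables[i:i + m] == word_syllables for i in range(n - m + 1))
```

For instance, with "Hekinan" written as `["he", "ki", "na", "n"]`, the sequence `["e", "he", "ki", "na", "n", "de"]` is accepted, while `["he", "ki"]` alone is not.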

[0021] Suppose that "Hekinan" is uttered as shown in FIG. 4. At every frame the detection unit 105 receives the optimum sequence and its likelihood from the likelihood calculation unit 104. Assume that the candidate word contained in the optimum sequence at each frame follows the series shown in FIG. 5.

[0022] At frame 1, step 1 compares the likelihood -200.0 with the threshold -100.0. Because the likelihood is below the threshold, nothing is done in step 2. In step 3B, the empty word is stored in the storage unit 106.

[0023] From then up to frame 120, every likelihood is below the threshold, so as at frame 1 the only operation is storing the empty word in the storage unit 106 in step 3B.

[0024] At frame 121, step 1 compares the likelihood -99.0 with the threshold -100.0. Because the likelihood is at or above the threshold, step 2 compares the optimum word "Hekinan" of frame 121 with the candidate word held in the storage unit 106 (the empty word); since they differ, "Hekinan" is output. Step 3A then stores the optimum word "Hekinan" in the storage unit 106.

[0025] From frame 122 onward, every likelihood is at or above the threshold, so step 1 proceeds as at frame 121. Step 2 compares the optimum word "Hekinan" with the candidate word "Hekinan" held in the storage unit 106; since they are identical, no word is output. Step 3A stores the optimum word "Hekinan" in the storage unit 106.

[0026] (Second Embodiment) FIG. 6 shows the configuration of a speech recognition device according to a second embodiment of the present invention.

[0027] The speech recognition device according to the second embodiment comprises: a speech analysis unit 101 that extracts feature values from the input speech at fixed time intervals (hereinafter, frames); a word dictionary 102 that stores words; a model generation unit 103 that generates a language model from the words in the word dictionary 102; a likelihood calculation unit 104 that, from the feature values and the language model, obtains at each frame the optimum word sequence fitting the language model (hereinafter, the optimum sequence) and its likelihood; an intermediate model generation unit 107 that generates intermediate models from the words in the word dictionary 102; an intermediate likelihood calculation unit 108 that, from the feature values and the intermediate models, obtains the likelihood of the intermediate models at each frame; a detection unit 105 that takes the outputs of the likelihood calculation unit 104 and the intermediate likelihood calculation unit 108 as input and outputs candidate words; and a storage unit 106 attached to the detection unit 105.

[0028] The speech analysis unit 101, word dictionary 102, model generation unit 103, and likelihood calculation unit 104 are the same as those described for the first embodiment.

[0029] For every leading subsequence of each word model in the word dictionary 102, the intermediate model generation unit 107 generates a language model (an intermediate model) that accepts utterances in which an arbitrary syllable chain is added immediately before that subsequence.

[0030] The intermediate likelihood calculation unit 108 pattern-matches every intermediate model generated by the intermediate model generation unit 107 against the per-frame feature values output by the speech analysis unit 101, and computes the likelihood of each intermediate model at each frame. Then, for each candidate word, it takes the largest of the likelihoods of that word's intermediate models (hereinafter, the intermediate likelihood) and outputs it to the detection unit 105.
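The two operations described here, enumerating the leading subsequences of a word model and taking the maximum likelihood over them, can be sketched as follows. The function names are hypothetical, the per-subsequence log-likelihoods are taken as given (computing them requires the DP matching, which is not shown), and including the complete chain among the leading subsequences is an assumption of this sketch rather than something the text specifies.

```python
def leading_subchains(word_syllables):
    """All leading sub-chains of a word model: he, he-ki, he-ki-na, ...

    Assumption: the full chain is included as the last element.
    """
    return [word_syllables[:k] for k in range(1, len(word_syllables) + 1)]


def intermediate_likelihood(subchain_scores):
    """Largest log-likelihood among one word's intermediate models at one frame.

    `subchain_scores` maps each leading sub-chain (as a tuple) to the
    log-likelihood of its intermediate model at the current frame.
    """
    return max(subchain_scores.values())
```

Taking the maximum means a word counts as possibly in progress as long as any one of its leading subsequences matches the speech so far well enough.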

[0031] The storage unit 106 holds a predetermined threshold and a predetermined preliminary threshold. It also stores the candidate word, the optimum words whose detection is pending, and the rival candidate word list of each pending optimum word.

[0032] The detection unit 105 receives the outputs of the likelihood calculation unit 104 and the intermediate likelihood calculation unit 108, and uses the information in the storage unit 106 to determine and output the output word.

[0033] The speech recognition device according to the second embodiment realizes a speech recognition device corresponding to claims 1, 2, 3, 6, 7, and 14.

[0034] (Operation of the Second Embodiment) FIG. 7 shows the operation of the speech recognition device of the second embodiment of the present invention. The speech analysis unit 101, word dictionary 102, model generation unit 103, and likelihood calculation unit 104 operate as in the first embodiment.

[0035] First, step 1 of the first embodiment is executed. If the likelihood is at or above the threshold, the optimum word output from the likelihood calculation unit 104 is compared with the candidate word held in the storage unit 106 (step 2). If the likelihood is below the threshold, the empty word is stored in the storage unit 106 (step 3B). If the likelihood is at or above the threshold, the optimum word is stored in the storage unit 106 (step 3A). If the likelihood is at or above the threshold and the optimum word differs from the candidate word held in the storage unit 106, the optimum word is stored in the storage unit 106 together with a list of all candidate words whose intermediate likelihood at the current frame is at or above the preliminary threshold (hereinafter, rival candidate words) (step 4). If any past rival candidate list held in the storage unit 106 contains the optimum word of the current frame, that list and the corresponding past optimum word are erased from the storage unit 106 (step 4-2). As post-processing, every word whose intermediate likelihood at the current frame is below the preliminary threshold is removed from the past rival candidate word lists held in the storage unit 106 (step 5). Next, each optimum word held in the storage unit 106 whose rival candidate word list is empty is output, and its information is erased from the storage unit 106 (step 6). Note that in many cases it is preferable for the candidate words whose intermediate likelihood is compared with the preliminary threshold to exclude the optimum word of the current frame rather than to cover all candidate words.
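The interaction of steps 1 to 6 is easier to follow as code. The sketch below is one possible reading, not the patented implementation: the class name `PendingDetector`, the data layout of the pending entries, and the use of `None` for the empty word are illustrative. It also makes two assumptions where the text leaves room: the comparison in step 2 uses the candidate word stored in the previous frame (so it is evaluated before step 3A overwrites it), and step 4-2 is evaluated in every frame whose likelihood is at or above the threshold. Rival lists exclude the current optimum word, following the preference stated at the end of the paragraph.

```python
EMPTY = None  # stands for the "empty word"


class PendingDetector:
    """Sketch of the detection unit of the second embodiment."""

    def __init__(self, threshold, preliminary_threshold):
        self.threshold = threshold
        self.preliminary = preliminary_threshold
        self.stored = EMPTY  # candidate word remembered from the previous frame
        self.pending = []    # entries: [held optimum word, set of rival candidates]

    def step(self, best_word, likelihood, intermediate):
        """Process one frame and return the list of words output at this frame.

        `intermediate` maps each candidate word to its intermediate likelihood.
        """
        if likelihood >= self.threshold:                 # step 1
            changed = best_word != self.stored           # step 2
            self.stored = best_word                      # step 3A
            if changed:                                  # step 4: hold detection
                rivals = {w for w, s in intermediate.items()
                          if s >= self.preliminary and w != best_word}
                self.pending.append([best_word, rivals])
            # step 4-2: this word has completed, so cancel detections that
            # were held because this word seemed to be in progress
            self.pending = [e for e in self.pending if best_word not in e[1]]
        else:
            self.stored = EMPTY                          # step 3B
        for entry in self.pending:                       # step 5: drop faded rivals
            entry[1] = {w for w in entry[1]
                        if intermediate.get(w, float("-inf")) >= self.preliminary}
        outputs = [w for w, rivals in self.pending if not rivals]  # step 6
        self.pending = [e for e in self.pending if e[1]]
        return outputs
```

If a newly detected word has no rivals at all, its entry is created and released within the same frame, so the behavior then matches the first embodiment.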

[0036] The effect of the second embodiment of the present invention will now be described.

[0037] In the first embodiment, a word that resembles part of a candidate word may be detected; for example, when "Hekinan" (he-ki-na-n) is uttered, "Seki" (se-ki), which resembles the "he-ki" of "Hekinan", may be detected. In such a case, when the utterance can be presumed to be partway through another word, here "Hekinan", the second embodiment holds the detection of "Seki" pending (step 4) and then, once "Hekinan" is completed, abandons the detection of "Seki" (step 4-2). If the intermediate likelihood of "Hekinan" falls far enough, the presumption that the utterance is partway through "Hekinan" is discarded (step 5) and the pending detection of "Seki" is carried out (step 6). In this way, detection of words that resemble a subsequence of another word can be prevented. Moreover, because the optimum word is output immediately at the frame where every presumption that another candidate word is in progress has been discarded, the delay introduced by preventing such detections is kept to the necessary minimum.

[0038] (Example of the Second Embodiment) The operation of one example of the second embodiment of the present invention will now be described in detail.

[0039] This example uses as the language model the same model, shown in FIG. 3, as the example of the first embodiment. The word dictionary 102, the likelihood calculation unit 104, the values stored in advance in the storage unit 106, the likelihood calculation method, and the frame interval are the same as in the first embodiment.

[0040] The intermediate model, shown in FIG. 8, is a concatenation of two models: a front garbage model that accepts any syllable sequence including the empty sequence, and a candidate word. It accepts every utterance that passes through these models in this order, that is, every utterance consisting of one candidate word preceded by an arbitrary syllable sequence. The intermediate likelihood can be obtained by taking out, at each frame, the likelihoods of all mid-word states of the model of FIG. 8; for example, with one state per syllable, the likelihoods of the states marked with black dots in FIG. 8. The intermediate model generation unit 107 is assumed to have built the intermediate model of FIG. 8 from the candidate words of the word dictionary 102 and to hold it. However, because the intermediate model is a sub-model of the language model shown in FIG. 3 and the pattern-matching method is the same, the intermediate likelihood can also be obtained by taking out the likelihoods of the intermediate states of the language model of FIG. 3 as appropriate. This example can therefore also be realized by sharing the model of FIG. 3 between the language model and the intermediate model, without providing a separate intermediate model generation unit 107 and intermediate likelihood calculation unit 108.

[0041] In this example the preliminary threshold is -101.5. The preliminary threshold selects the candidate words that may be in progress at the current frame, so the lower it is, the more rival candidate words there are and the less likely it is that a word in progress is missed, but the longer the delay before a candidate word is output. Normally a value smaller than the threshold is chosen. In this example, the rival candidate words do not include the optimum word of the frame concerned.

[0042] A concrete example of the operation of the second embodiment follows.

[0043] Suppose that "Hekinan" is uttered as shown in FIG. 9. At every frame the detection unit 105 receives the optimum sequence and its likelihood from the likelihood calculation unit 104. Assume that the candidate word contained in the optimum sequence at each frame is the word shown in FIG. 10.

[0044] At frame 1, step 1 compares the likelihood -200.0 with the threshold -100.0. In step 3B, the empty word is stored in the storage unit 106. Because the storage unit 106 holds no optimum word with an associated rival candidate list, nothing is done in the subsequent steps. From then up to frame 90, every likelihood is below the threshold and the storage unit 106 holds no pair of optimum word and rival candidate list, so as at frame 1 the only operation is storing the empty word in the storage unit 106 in step 3B.

[0045] At frame 91, step 1 compares the likelihood -99.0 with the threshold -100.0. Because the likelihood is at or above the threshold, step 2 is followed by storing the optimum word "Seki" of frame 91 in the storage unit 106 (step 3A). Next, the list of rival candidate words (here only "Hekinan") is created, and the optimum word "Seki" is stored in the storage unit 106 together with the list containing only "Hekinan" (step 4).

[0046] At frame 91, no past optimum word is held in the storage unit 106, so nothing is done in the subsequent steps.

[0047] From frame 92 through frame 120, the optimum word remains "Seki", so steps 1 through 4 only write "Seki" to the storage unit 106. In step 4-2, nothing is done because the current optimum word "Seki" does not appear in any past rival candidate list.

[0048] In step 5, nothing is done because the intermediate likelihood of "Hekinan", the element of the rival candidate word list, is at or above the preliminary threshold. In step 6, nothing is done because the list still contains "Hekinan".

[0049] At frame 121, the optimum word changes to "Hekinan". In steps 1 and 2, "Hekinan" is stored in the storage unit 106. In steps 3 and 4, the optimum word "Hekinan" and a rival candidate word list containing only "Naha" are stored in the storage unit 106.

[0050] In step 4-2, the conflict candidate word list of the past optimum word "Seki" contains "Hekinan", the optimum word of the 121st frame, so the past optimum word "Seki" and its conflict candidate word list are deleted from the storage unit 106. In steps 5 and 6, nothing is done, because no applicable conflict candidate word list exists.

[0051] From the 122nd frame through the 140th frame, neither the optimum word nor the conflict candidate words change, so the only operation is to update the optimum word in the storage unit 106.

[0052] In the 141st frame, the optimum word does not change, so steps 1 through 4-2 behave as in the preceding frame. In step 5, the intermediate likelihood of "Naha" has fallen below the preliminary threshold, so "Naha" is removed from the conflict candidate word list of the past optimum word "Hekinan". This leaves the conflict candidate word list of "Hekinan" empty, so "Hekinan" is detected in step 6. In this way, "Seki", a word similar to a subsequence of "Hekinan", is never output, and the spoken word "Hekinan" is output as soon as the intermediate hypothesis for the conflict candidate word "Naha" is discarded.
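
The frame-by-frame walk-through above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation disclosed here: the names `Pending`, `run_detector`, and `demo_frames` are hypothetical, the preliminary threshold of -50.0 and all intermediate likelihood values are made-up numbers chosen to reproduce the "Seki" / "Hekinan" / "Naha" example, and the step numbering in the comments follows the description only loosely.

```python
from dataclasses import dataclass, field


@dataclass
class Pending:
    """A detected optimum word whose output is being held (hypothetical type)."""
    word: str
    rivals: set = field(default_factory=set)  # conflict candidate word list


def run_detector(frames, threshold, pre_threshold):
    """Return the list of (frame_number, word) outputs.

    Each frame is (frame_number, best_word, likelihood, mid), where `mid`
    maps every candidate word to its intermediate likelihood.
    """
    pending = []       # stands in for the contents of the storage unit
    outputs = []
    prev_best = None   # optimum word of the previous frame, if any
    for number, best, likelihood, mid in frames:
        detected = best if likelihood >= threshold else None  # steps 1-3
        if detected is not None:
            # Step 4-2: drop held words that listed the new optimum word
            # as a conflict candidate.
            pending = [p for p in pending if detected not in p.rivals]
            # Step 4: when an optimum word first appears, store it with
            # its conflict candidate word list.
            if detected != prev_best:
                rivals = {w for w, v in mid.items()
                          if w != detected and v >= pre_threshold}
                pending.append(Pending(detected, rivals))
        prev_best = detected
        # Step 5: remove conflict candidates that fell below the
        # preliminary threshold.
        for p in pending:
            p.rivals = {w for w in p.rivals if mid[w] >= pre_threshold}
        # Step 6: output every held word whose list is now empty.
        for p in [p for p in pending if not p.rivals]:
            outputs.append((number, p.word))
            pending.remove(p)
    return outputs


def demo_frames():
    """Frames 91-149 with made-up intermediate likelihoods (assumption)."""
    frames = []
    for n in range(91, 150):
        best = "Seki" if n <= 120 else "Hekinan"
        mid = {
            "Seki": -40.0 if n <= 120 else -80.0,
            "Hekinan": -40.0,
            "Naha": -40.0 if 121 <= n <= 140 else -80.0,
        }
        frames.append((n, best, -99.0, mid))
    return frames


print(run_detector(demo_frames(), threshold=-100.0, pre_threshold=-50.0))
```

With these assumed values, "Seki" is discarded at frame 121 and "Hekinan" is output once, at frame 141, which matches the behaviour described in the text.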

[0053] (Modification of the second embodiment) As a modification of the second embodiment, the detection unit 105 may store the current frame number in the storage unit 106 together with the optimum word in step 4, and in step 6 output not only optimum words whose conflict candidate word list is empty but also those whose stored frame number lies at least a certain time before the current frame. This forcibly bounds the delay from the utterance of a candidate word to its output, which is useful for a voice recognition device that needs recognition results as quickly as possible.
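
The modified step 6 test can be written as a single predicate. The sketch below is an assumption-laden illustration: the function name `should_output`, its parameters, and the choice to measure the time limit in frames are all hypothetical and are not taken from the specification.

```python
def should_output(rivals, stored_frame, current_frame, max_delay_frames):
    """Return True if a held optimum word should be output now.

    rivals           -- remaining conflict candidate words for the held word
    stored_frame     -- frame number stored with the word in step 4
    current_frame    -- number of the frame being processed
    max_delay_frames -- assumed upper bound on the holding time, in frames
    """
    list_is_empty = not rivals
    held_too_long = current_frame - stored_frame >= max_delay_frames
    return list_is_empty or held_too_long
```

For example, with an assumed limit of 50 frames, a word stored at frame 121 whose list still contains "Naha" would be held at frame 140 but forced out at frame 171.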

[0054] This modification realizes a voice recognition device corresponding to claims 1, 2, 6, 7, 13, and 14.

[0055] As a further modification, in step 6 of the above modification, the corresponding optimum word may be output when the frame number stored in the storage unit 106 is at or above a certain value. This is useful for a device that accepts only input made within a fixed time after the voice recognition device is activated.

[0056] This modification realizes a voice recognition device corresponding to claims 1, 2, 6, 7, 10, and 14.

[0057] As another modification of the second embodiment, in step 6 the detection unit 105 may output the optimum word stored in the storage unit 106 not only when its conflict candidate word list is empty but also when, in that frame, an input signal is received from an external input device such as a keyboard, a switch, or another voice recognition device. Because the speaker then decides the recognition timing, the interval in which a word can exist is limited, and recognition performance can be expected to improve. This is useful when the speaker can be asked, for example, to press a button after finishing the utterance.

[0058] This modification realizes a voice recognition device corresponding to claims 1, 2, 6, 7, 11, and 14.

[0059] In this case, a voice detection device may be used as the external input device. For example, the device described in Japanese Patent Laid-Open No. 7-225592 may be used, but any device that determines the utterance interval in the input speech will do. Because voice detection determines the start and end of the speech interval in step with the speaker's utterance, using a voice detection device instead of a button or the like yields a voice recognition device that requires no extra action from the speaker. To allow for the voice detection device failing to detect the end of the speech interval, the detection unit 105 may also store the current frame number in the storage unit 106 in step 4 and, in step 6, output optimum words whose stored frame number lies at least a certain time before the current frame even when no signal arrives from the voice detection device. This realizes a voice recognition device corresponding to claims 1, 2, 6, 7, and 17. In this case, the output conditions adopted in claim 17 are those of claims 12 and 14.

[0060] As a further modification of the second embodiment, after the conflict candidate word list is created in step 4 of the detection unit 105, every candidate word whose intermediate likelihood has reached the preliminary threshold in a given frame may be added to the list for as long as the optimum word does not change and the optimum sequence ends with that optimum word. This makes the listing of conflict candidate words more reliable: a conflict candidate word whose likelihood happens to be low in the frame where the optimum word first appears is no longer missed.
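
The list-extension rule of this modification can be illustrated with a small helper that is called in every frame in which the optimum word is unchanged. The function name `extend_rivals`, the preliminary threshold of -50.0, and the likelihood values in the usage note are hypothetical; this is a sketch of the idea, not the disclosed implementation.

```python
def extend_rivals(rivals, best, mid, pre_threshold):
    """Return the conflict candidate word list grown by the current frame.

    rivals        -- conflict candidate words collected so far (a set)
    best          -- the unchanged optimum word
    mid           -- candidate word -> intermediate likelihood in this frame
    pre_threshold -- preliminary threshold
    """
    newly_active = {w for w, v in mid.items()
                    if w != best and v >= pre_threshold}
    return rivals | newly_active
```

If "Naha" was below the preliminary threshold in the frame where "Hekinan" first became the optimum word, the list starts empty; once "Naha" rises to, say, -45.0 in a later frame, calling `extend_rivals(set(), "Hekinan", mid, -50.0)` adds it to the list.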

[0061] This modification realizes a voice recognition device corresponding to claims 1, 2, 4, 6, 8, 10, and 14. It is an example in which the fixed time after the detection section in claim 8 is 0 ms.

[0062] In this case, the device can also be designed so that, after the conflict candidate word list is created in step 4 of the detection unit 105, every candidate word whose intermediate likelihood has reached the preliminary threshold in a given frame is added to the list both while the optimum word does not change and the optimum sequence ends with that optimum word and for a fixed time afterwards, for example about 200 ms, and so that output takes place only once the intermediate likelihoods of all conflict candidate words in the list have fallen below the preliminary threshold. This reduces false detections when it is safe to assume, for example, that several candidate words are never spoken in succession within 200 ms. It realizes a voice recognition device corresponding to claims 1, 2, 4, 6, 8, 10, and 17, and is an example in which the fixed time after the detection section in claim 8 is 200 ms.

[0063] As yet another modification of the second embodiment, the conflict candidate word list stored in the storage unit 106 in step 4 of the detection unit 105 may always be the list of all candidate words. In this case, after a word is detected, the detection is held for as long as the intermediate likelihood of any word exceeds the preliminary threshold. In addition, when the optimum word changes, the new optimum word is necessarily in the conflict candidate word list, so the detection of the past optimum word is abandoned. This modification therefore suppresses not only word detections that overlap in time but also detections that are close together in time.

[0064] This modification is a voice recognition device corresponding to claims 1, 2, 4, 6, 8, and 16. It is an example in which the fixed time after the detection section in claim 8 is 0 ms.

[0065] As a further variation of this modification, a voice detection device may be added as an external input device, and the output condition may be made stricter by additionally requiring that the voice detection device supply an end-of-speech detection signal. The hold on detection is thereby strengthened using both temporal proximity and the possibility that another candidate word is still being spoken.

[0066] This modification corresponds to claims 1, 2, 4, 6, 8, and 18, where the output conditions in claim 18 are those of claims 12 and 16. It is also an example in which the fixed time after the detection section in claim 8 is 0 ms.
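
The combined output conditions used in the voice-detector variations (output when any one of several conditions holds, or only when all of them hold) can be modelled as predicates joined by two small combinators. Everything named here (`HoldState`, `list_empty`, `timed_out`, `speech_ended`, `any_of`, `all_of`, and the 50-frame limit) is a hypothetical illustration and is not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class HoldState:
    """State of one held optimum word in a given frame (hypothetical)."""
    rivals_left: int        # conflict candidates still at or above the preliminary threshold
    frames_held: int        # frames elapsed since the word was stored
    speech_end_seen: bool   # end-of-speech signal received from the voice detector


def list_empty(state):
    return state.rivals_left == 0


def timed_out(state, limit=50):  # the limit of 50 frames is an assumption
    return state.frames_held >= limit


def speech_ended(state):
    return state.speech_end_seen


def any_of(*conditions):
    """Output condition satisfied when at least one condition holds."""
    return lambda state: any(c(state) for c in conditions)


def all_of(*conditions):
    """Output condition satisfied only when every condition holds."""
    return lambda state: all(c(state) for c in conditions)
```

Under these assumptions, `any_of(speech_ended, timed_out)` expresses the fallback for a missed end-of-speech signal, while `all_of(list_empty, speech_ended)` expresses the stricter condition of the last variation.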

【００６７】[0067]

[Effects of the Invention] As described above, the present invention reduces overlapping word detections and detections of words that are too close together, and realizes a voice recognition device with fewer false detections at the minimum necessary delay.

[Brief description of drawings]

FIG. 1 is a configuration diagram of the voice recognition device according to the first embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the voice recognition device according to the first embodiment of the present invention.

FIG. 3 is a network representing the language model used in the voice recognition device according to the first embodiment of the present invention.

FIG. 4 is a diagram showing the change over time of the optimum word and the likelihood in one example of the voice recognition device according to the first embodiment of the present invention.

FIG. 5 is a table showing the optimum word and the likelihood for each frame in one example of the voice recognition device according to the first embodiment of the present invention.

FIG. 6 is a configuration diagram of the voice recognition device according to the second embodiment of the present invention.

FIG. 7 is a flowchart showing the operation of the voice recognition device according to the second embodiment of the present invention.

FIG. 8 is a network representing the language model used in the voice recognition device according to the second embodiment of the present invention.

FIG. 9 is a diagram showing the change over time of the optimum word, the likelihood, and the intermediate likelihood in one example of the voice recognition device according to the second embodiment of the present invention.

FIG. 10 is a table showing the optimum word, the likelihood, and the conflict candidate words for each frame in one example of the voice recognition device according to the second embodiment of the present invention.

Claims

[Claims]

1. A voice recognition device comprising: a voice analysis unit that extracts a feature amount for each frame by frequency analysis of input voice at fixed time intervals (frames); a likelihood calculation unit that performs pattern matching, for each frame, between the feature amount and a language model in which a speech model accepting an arbitrary syllable string is added before and after predetermined candidate words, selects the optimum word sequence (optimum sequence) on the language model for each frame, and calculates its likelihood; and a detection unit that determines and outputs an optimum word for each frame based on the optimum sequence and the likelihood, wherein, when the same optimum word continues one or more times in succession, the optimum word is output at most once within that period (detection section).

2. A voice recognition device comprising: a voice analysis unit that extracts a feature amount for each frame by frequency analysis of input voice at fixed time intervals (frames); a likelihood calculation unit that performs pattern matching, for each frame, between the feature amount and a language model in which a speech model accepting an arbitrary syllable string is added before and after predetermined candidate words, selects the optimum word sequence (optimum sequence) on the language model for each frame, and calculates its likelihood; and a detection unit that determines and outputs an optimum word for each frame based on the optimum sequence and the likelihood, wherein, when the same optimum word continues one or more times in succession and the likelihood reaches a predetermined value (threshold) at least once within that period (detection section), the optimum word of the detection section is output at most once.

3. The voice recognition device according to claim 1, wherein a candidate word included in the optimum sequence is taken as the optimum word.

4. The voice recognition device according to claim 1 or 2, wherein, if the last word of the optimum sequence is a candidate word, that word is output once as the optimum word, and otherwise no optimum word is output.

5. The voice recognition device according to claim 1, wherein the optimum word is output at the first time of the detection section.

6. The voice recognition device according to claim 1, wherein the optimum word is output at the first time at which a predetermined condition (output condition) is satisfied.

7. The voice recognition device according to claim 1, further comprising: language models (intermediate models) that accept utterances in which an arbitrary syllable string is added immediately before each first-half subsequence of each candidate word; and a device that, in each frame, calculates the likelihood of pattern matching between each intermediate model and the feature amount and calculates, for each candidate word, the maximum likelihood over all of its intermediate models (intermediate likelihood), wherein, at the time when the optimum word would be detected, output is held only if there is a candidate word (conflict candidate word) whose intermediate likelihood has reached a predetermined value (preliminary threshold), and the optimum word is thereafter output at the first time at which a predetermined condition (output condition) is satisfied.

8. The voice recognition device according to claim 1, 2, 3, or 4, further comprising: language models (intermediate models) that accept utterances in which an arbitrary syllable string is added immediately before each first-half subsequence of each candidate word; and a device that, in each frame, calculates the likelihood of pattern matching between each intermediate model and the feature amount and calculates, for each candidate word, the maximum likelihood over all of its intermediate models (intermediate likelihood), wherein output is held only if there is a candidate word (conflict candidate word) whose intermediate likelihood has reached a predetermined value (preliminary threshold) at least once within the detection section, and the optimum word is thereafter output at the first time at which a predetermined condition (output condition) is satisfied.

9. Language models (intermediate models) that accept utterances in which an arbitrary syllable string is added immediately before each first-half subsequence of each candidate word are provided, together with a device that, in each frame, calculates the likelihood of pattern matching between each intermediate model and the feature amount and calculates, for each candidate word, the maximum likelihood over all of its intermediate models (intermediate likelihood); output is held only if there is a candidate word (conflict candidate word) whose intermediate likelihood has reached a predetermined value (preliminary threshold) over the frames within the detection section and within a predetermined time after it, and the optimum word is thereafter output at the first time at which a predetermined condition (output condition) is satisfied.
Alternatively, the voice recognition device according to item 4.

10. The voice recognition device according to claim 7, 8, or 9, wherein the output condition is that, in the frame at which a predetermined time has elapsed since activation of the device, none of the conflict candidate words has become the optimum word since the detection time.

11. The voice recognition device described, comprising an input device that receives input from the outside, wherein the output condition is that a signal from the input device is given after the detection time.

12. The voice recognition device according to claim 11, wherein the input device is a voice detection device that determines the extent of a speech section, and the output condition is that the voice detection device gives a signal indicating the end of the speech section.

13. The voice recognition device according to claim 7, 8, or 9, wherein the output condition is that, in the frame at which a certain time has elapsed since the detection time, none of the conflict candidate words has become the optimum word since the detection time.

14. The voice recognition device according to claim 7, 8, or 9, wherein the output condition is that, after the detection time, the intermediate likelihood of every conflict candidate word has fallen below the preliminary threshold at least once, and none of those conflict candidate words has been the optimum word since the detection time.

15. The voice recognition device according to claim 7, wherein the output condition is that, in some frame after the detection time, the intermediate likelihoods of all the conflict candidate words are below the preliminary threshold, and no conflict candidate word has been the optimum word even once since the detection time.

16. The voice recognition device according to claim 7, wherein the output condition is that, in some frame after the detection time, the intermediate likelihoods of all the other candidate words are below the preliminary threshold, and none of the other candidate words has been the optimum word even once since the detection time.

17. The voice recognition device according to claim 7, 8, or 9, wherein the output condition is that any one of a plurality of output conditions selected in advance from the output conditions according to claim 10, 11, 12, 13, 14, or 15 is satisfied.

18. The voice recognition device according to claim 7, wherein the output condition is that all of a plurality of output conditions selected in advance from the output conditions according to claim 10, 11, 12, 13, 14, 15, or 16 are satisfied.