JPH06266386A

JPH06266386A - Word spotting method

Info

Publication number: JPH06266386A
Application number: JP5056214A
Authority: JP
Inventors: Akihiro Imamura; 明弘今村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1993-03-16
Filing date: 1993-03-16
Publication date: 1994-09-22

Abstract

PURPOSE: To detect keywords in input speech accurately and at high speed, in synchronism with the time of the input speech. CONSTITUTION: A likelihood calculation unit 3 receives a time series of speech feature quantities from a speech analysis unit 2. Using the keyword and garbage hidden Markov models held in storage units 7 and 8, which represent partial word sequences, it computes, time-synchronously, the likelihood between every partial word sequence of recognition target words and unknown words that is acceptable from the start state of the automaton held in storage unit 9 up to each state, and the feature sequence of the input speech from the speech start to each time. A posterior probability calculation unit 4 computes, at each time, the posterior probability that each partial word sequence ending in a recognition target word has just finished being uttered, and the posterior probability that each partial word sequence is still being uttered. A recognition determination unit 5 compares these posterior probabilities at each time and decides which recognition target words are present in a partial word sequence.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] The present invention relates to a word spotting method in which a machine recognizes or detects keywords present in speech freely uttered by a human.

【０００２】[0002]

[Prior Art] In recent years, research and development of speech recognition technology has been active, and some products have been commercialized. In particular, if continuous speech recognition, which takes sentences uttered continuously by a human as its recognition target, became possible, many human-machine interfaces could be improved dramatically. At present, however, continuous speech recognition is possible only with a limited vocabulary of a few hundred words, and when the input speech contains an unknown word that is not registered in the recognition device, a correct recognition result cannot be obtained.

[0003] Word spotting technology aims to solve this problem. It estimates where a keyword registered in the recognition device is located in an input speech signal that consists of continuously uttered sentence speech, possibly with ambient noise from the utterance environment attached before and after the speech segment. It is therefore a recognition technology that tolerates unknown words in the input speech.

[0004] Conventional word spotting methods include, for example, the paper "Telephone Speech Spotting Using HMM" published on pp. 29-30 of Proceedings I of the Acoustical Society of Japan 1990 Spring Meeting (March 1990) (hereinafter, the first method), and the paper "Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models" published in IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 38, No. 11 (November 1990), pp. 1870-1878 (hereinafter, the second method).

[0005] In the first method, the recognition device holds statistical probabilistic acoustic models (keyword hidden Markov models) only for the recognition target words. It treats each time of the input speech as the possible end of each recognition target word and searches for a plausible start time. A keyword is detected when the estimated duration of the word and the likelihood of the probabilistic acoustic model over the estimated interval fall within a threshold range set for each keyword. The first method can therefore find, at high speed and in synchronism with time, the keywords that end at each time point, even when the input speech contains unknown words.

[0006] In the second method, the order in which keywords and unknown words may appear in the input speech is specified by a finite-state automaton. The method uses probabilistic acoustic models representing the recognition target words (keyword hidden Markov models) and a probabilistic acoustic model built from non-speech noise intervals and a number of anticipated unknown words (garbage hidden Markov model). Recognition is performed by computing the likelihood of the input speech for every word model sequence, including those with unknown words in the middle, that the given automaton can accept, and detecting the word sequence with the maximum likelihood. By taking the order of appearance of words into account in this way, the second method can reduce both keyword detections at wrong positions and missed detections.

【０００７】[0007]

[Problems to Be Solved by the Invention] Of the conventional techniques above, the first method can find, at high speed and in synchronism with time, the keywords that end at each time point even when the input speech contains unknown words. However, depending on how the threshold ranges for keyword detection are set, keywords may be detected at wrong positions or correct keywords may be missed. Dealing with this requires complicated post-processing of the word spotting results. In addition, because the first method uses neither information about the order in which keywords and unknown words appear in the input speech nor a probabilistic acoustic model for unknown words, it can suffer from partial matching, in which a short word contained within a longer word is detected. Solving this requires further post-processing that uses information on the relative positions of word pairs for which partial matching can occur.

[0008] The second method, in contrast, takes into account the order in which keywords and unknown words appear in the input speech, and uses keyword hidden Markov models representing the recognition target words together with a garbage hidden Markov model built from non-speech noise intervals and a number of anticipated unknown words. This reduces the wrong-position detections, missed detections, and partial matching that are problems in the first method. However, the second method treats an unknown word as a single word and is equivalent to continuous word recognition that estimates which automaton-accepted word sequence the input speech corresponds to. Unlike the first method, it therefore cannot find the keywords that end at each time point in synchronism with the input. The recognition result cannot be obtained until the input speech segment is fixed, that is, until the utterance has ended.

[0009] An object of the present invention is to eliminate the problems of conventional word spotting methods, as represented by the first and second methods above, and to provide a word spotting method that takes the order of appearance of keywords and unknown words into account while detecting, at high speed, with high accuracy, and in synchronism with the time of the input speech, the keywords present in the input speech and temporal chains of several keywords.

【００１０】[0010]

[Means for Solving the Problems] To achieve this object, the present invention first prepares in advance a finite-state automaton that specifies the order in which the recognition target words to be detected and other unknown words appear, keyword hidden Markov models that represent the speech feature time series of the recognition target words, and a garbage hidden Markov model that comprehensively represents the speech feature time series of unknown words and the feature time series of non-speech such as noise. When speech is input from a speaker, the likelihood is computed sequentially, in synchronism with time, between every partial word sequence of recognition target words and unknown words that is acceptable from the start state of the automaton up to each state, and the feature sequence of the input speech from the speech start to each time, using the keyword and garbage hidden Markov models that represent the partial word sequence. From this likelihood, two posterior probabilities are calculated: the posterior probability that each time is the end of the utterance of each partial word sequence whose last word is a recognition target word, and the posterior probability that each time is in the middle of the utterance of each partial word sequence. These posterior probabilities are compared at every time. When the largest of them corresponds to the end of the utterance of some partial word sequence, the recognition target words in that partial word sequence are recognized as having appeared, in their order within the partial word sequence, by the time at which the maximum was detected.

【００１１】[0011]

[Operation] The present invention uses a finite-state automaton, prepared in advance, that specifies the order of appearance of recognition target words and unknown words, keyword hidden Markov models that represent the recognition target words, and a garbage hidden Markov model that comprehensively represents unknown words, noise, and the like. With these, the likelihood is computed sequentially, in synchronism with time, between every partial word sequence of recognition target words and unknown words that is acceptable from the start state of the automaton up to each state, and the feature sequence of the input speech from the speech start to each time. From this likelihood, the method calculates the posterior probability that each time is the end of the utterance of each partial word sequence whose last word is a recognition target word, and the posterior probability that each time is in the middle of the utterance of each partial word sequence. These posterior probabilities are compared at every time, and when the largest of them corresponds to the end of the utterance of some partial word sequence, the recognition target words in that partial word sequence are recognized as having appeared by that time. In other words, the invention can detect, in synchronism with the input, whether a part of a word chain specified by the automaton has been uttered, before the input speech has been uttered to the end and the speech segment has been fixed.

【００１２】[0012]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

[0013] FIG. 1 is a block diagram of a recognition device according to an embodiment of the present invention. In the figure, 1 is a speech input unit, 2 is a speech analysis unit, 3 is a likelihood calculation unit, 4 is a posterior probability calculation unit, 5 is a recognition determination unit, 6 is a recognition result output unit, 7 is a keyword hidden Markov model storage unit, 8 is a garbage hidden Markov model storage unit, 9 is an automaton storage unit, and 10 is an overall control unit.

[0014] The core of this recognition device's operation lies in the likelihood calculation unit 3, the posterior probability calculation unit 4, and the recognition determination unit 5. First, however, the speech analysis unit 2, the keyword hidden Markov model storage unit 7, the garbage hidden Markov model storage unit 8, and the automaton storage unit 9 are described. Here, consider a recognition target word set {N^w} of N^w words that are the targets of word spotting, and an unknown word set {N^g} of N^g words that represent unknown words and noise, other than the recognition target words, appearing in the input speech. The vocabulary of N = N^w + N^g words in total is taken as the word set {N} used by the finite automaton.

[0015] In this recognition device, the order in which recognition target words and other unknown words may appear in recognizable input speech is constrained, and is assumed to be specified by a finite-state automaton with Q + 1 states such as the one shown in FIG. 2. In the automaton of FIG. 2, state 0 is the start state. Starting from this state, the automaton moves from state to state, outputting at each transition one of the words written on the transition arc. The word sequence output by the time a given state is reached is a word sequence that the automaton can accept up to that state, that is, a word sequence that this recognition device can recognize. A transition defined by such an automaton from state p to state q that outputs word n is written δ(p, n) = q. The automaton storage unit 9 stores in advance a finite-state automaton that specifies the order in which recognition target words and other unknown words appear.

[0016] The speech signal input to the speech input unit 1 undergoes feature extraction in the speech analysis unit 2 and is converted into a feature quantity x_t at every fixed time interval (hereinafter called a frame). Various techniques can be used to extract the features in the speech analysis unit 2, such as linear predictive analysis, Fourier transform, and filter bank analysis.

[0017] Each recognition target word and unknown word can be represented by a hidden Markov model that expresses, word by word, in what order and with what frequency the time series of feature quantities x_t output by the speech analysis unit 2 appears. The basic parameters that characterize the hidden Markov model of each word n are the number of states J^n, the probability π_j^n that state j of the model is the initial state, the transition probability a_ij^n from state i to state j, and the symbol output probability b_ij^n(x_t) of outputting feature quantity x_t of the input speech on the transition from state i to state j. Among the states of the hidden Markov model of word n, the set of states whose initial state probability π_j^n is nonzero, and which can therefore be initial states, is written {SI^n}, and the set of final states marking the end of the word is written {SF^n}. These parameters are assumed to be stored in the keyword hidden Markov model storage unit 7 for recognition target words and in the garbage hidden Markov model storage unit 8 for unknown words. The initial state probabilities π_j^n, state transition probabilities a_ij^n, and symbol output probabilities b_ij^n(x_t) can be set to optimal values for each word from training data by applying, for example, the Baum-Welch re-estimation method introduced in the paper "An Introduction to Hidden Markov Models" published in IEEE ASSP Magazine, Vol. 3, No. 1 (January 1986), pp. 4-16.
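The parameter set described above can be held in a small container. The following is a minimal sketch, assuming a plain-Python representation; the class name `WordHMM`, its field names, and the example values are hypothetical, and the symbol output probability b_ij^n(x_t) is left out because its form is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class WordHMM:
    """Hypothetical container for the parameters of the HMM of one word n."""
    name: str
    is_keyword: bool          # True: keyword model, False: garbage model
    initial: list             # pi_j for j = 0..J-1
    transition: list          # a_ij as a J x J nested list
    final_states: set = field(default_factory=set)  # the set SF

    @property
    def num_states(self):
        """Number of states J."""
        return len(self.initial)

    def initial_states(self):
        """The set SI: states whose initial probability is nonzero."""
        return {j for j, p in enumerate(self.initial) if p > 0.0}


# Example: a three-state left-to-right model with made-up probabilities.
example = WordHMM(
    name="tokyo",
    is_keyword=True,
    initial=[1.0, 0.0, 0.0],
    transition=[[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]],
    final_states={2},
)
```

Keeping `is_keyword` on the model mirrors the split between storage units 7 and 8 while letting the rest of the procedure treat both kinds of word uniformly.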

[0018] Next, the variables that appear in the description of the operation of the likelihood calculation unit 3, the posterior probability calculation unit 4, and the recognition determination unit 5 of FIG. 1 are defined as follows.

[0019] L_q^n(t, j): cumulative likelihood up to frame time t in state j of the hidden Markov model of word n reaching automaton state q.

[0020] B_q^n(t, j): back pointer for the optimal state transition path corresponding to L_q^n(t, j).

[0021] L_q(t): maximum cumulative likelihood of the hidden Markov models of the word sequence reaching automaton state q at frame time t.

[0022] N_q(t): name of the last word of the word sequence corresponding to L_q(t).

[0023] B_q(t): start frame time of the word corresponding to N_q(t), minus 1.

[0024] Q_q(t): number of the state immediately preceding state q in the word sequence corresponding to L_q(t).

[0025] P_q^n(t, j): posterior probability up to frame time t in state j of the hidden Markov model of word n reaching automaton state q.

[0026] PF_q^n(t): posterior probability that word n reaching automaton state q finishes being uttered at frame time t.

[0027] PC_q^n(t): posterior probability that word n reaching automaton state q is in the middle of being uttered at frame time t.

[0028] S_q^n(t): optimal final state of the hidden Markov model of word n reaching automaton state q at frame time t.

[0029] FIG. 3 is an overall flowchart of the word spotting procedure in the recognition device of FIG. 1. Likelihood calculation, posterior probability calculation, and recognition determination are the processes performed by the likelihood calculation unit 3, the posterior probability calculation unit 4, and the recognition determination unit 5, respectively. The likelihood calculation, posterior probability calculation, and recognition determination processes are repeated at every frame time (). Within each frame, the likelihood calculation and posterior probability calculation processes are each repeated for every word reaching automaton state q ( and ), and that loop is in turn repeated for every automaton state ( and ). The control unit 10 governs this loop control. The control unit 10 also performs the prescribed initialization before word spotting begins.

[0030] The word spotting procedure in this embodiment is described in detail below. Word spotting operates by repeatedly performing steps 1 through 21 below. Steps 1, 2, and 21 are performed by the control unit 10, steps 3 through 11 by the likelihood calculation unit 3, steps 12 through 14 by the posterior probability calculation unit 4, and steps 15 through 20 by the recognition determination unit 5.

[0031] <Initialization> Step 1 (initialization): Before speech is input, each variable is initialized to the following values.

【００３２】[0032]

【数１】 [Equation 1]

[0033] <Frame loop control> Step 2 (loop over frame times): For frame times t = 1, 2, ..., T, repeat steps 3 through 21. Here T is the total number of frames of the input speech.

[0034] <Likelihood calculation> Step 3 (loop over automaton states): For automaton states q = 1, 2, ..., Q, repeat steps 4 through 11.

[0035] Step 4 (loop over words reaching automaton state q): For every word n reaching automaton state q, as given by the following equation, repeat steps 5 through 10.

【００３６】[0036]

【数２】 [Equation 2]

[0037] Step 5 (loop over initial states of word n): For every initial state j ∈ {SI^n} of word n, repeat steps 6 through 7.

[0038] Step 6 (determination of the optimal path): If the cumulative likelihood L_q^n(t-1, j) up to frame time t-1 in initial state j of the hidden Markov model of word n reaching automaton state q satisfies the condition of the following equation, execute step 7.

【００３９】[0039]

【数３】 [Equation 3]

[0040] Step 7 (resetting the back pointer of the optimal path): Reset the cumulative likelihood L_q^n(t-1, j) up to frame time t-1 in initial state j of the hidden Markov model of word n reaching automaton state q, and the corresponding back pointer B_q^n(t-1, j) of the optimal path, as follows.

【００４１】[0041]

【数４】 [Equation 4]

[0042] Step 8 (loop over states of word n): For each state of word n (j = 1, 2, ..., J^n), repeat step 9.

[0043] Step 9 (calculation of the likelihood and the optimal path): Compute the cumulative likelihood L_q^n(t, j) up to frame time t in each state j of the hidden Markov model of word n reaching automaton state q, and the corresponding back pointer B_q^n(t, j) of the optimal path, from the parameters of the hidden Markov model of word n and the feature quantity x_t of the input speech at frame time t, as follows.

【００４４】[0044]

【数５】 [Equation 5]

[0045] Through step 9, the likelihood computed by the keyword and garbage hidden Markov models representing each partial word sequence has been obtained between every partial word sequence of recognition target words and unknown words that is acceptable from the start state of the automaton up to each state, and the feature sequence of the input speech from the speech start to frame time t.

[0046] Step 10 (determination of the optimal final state at frame time t): Determine the optimal final state S_q^n(t) of the hidden Markov model of word n reaching automaton state q at frame time t by the following equation.

【００４７】[0047]

【数６】 [Equation 6]

[0048] Step 11 (determination of the optimal word sequence): Select the best of the words n reaching automaton state q at frame time t, and obtain the following L_q(t), N_q(t), B_q(t), and Q_q(t).

【００４９】[0049]

【数７】 [Equation 7]

[0050] <Posterior probability calculation> Step 12 (loop over automaton states): For automaton states q = 1, 2, ..., Q, repeat steps 13 through 14.

[0051] Step 13 (loop over words reaching automaton state q): For each word n reaching automaton state q, as given by the following equation, repeat step 14.

【００５２】[0052]

【数８】 [Equation 8]

[0053] Step 14 (calculation of the posterior probabilities): The following equation gives the posterior probability PF_q^n(t) that a recognition target word n ∈ {N^w} reaching automaton state q finishes being uttered at frame time t, and the posterior probability PC_q^n(t) that a recognition target word or unknown word n ∈ {N} reaching automaton state q is in the middle of being uttered at frame time t.
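The equation for step 14 is not reproduced in this text, so the following does not claim to be it. As a generic illustration of how cumulative log-likelihoods of competing hypotheses can be turned into posterior-like values, this sketch normalizes them over all hypotheses at one frame, assuming equal prior probabilities. The function name `normalize_posteriors` and the hypothesis keys are hypothetical.

```python
import math


def normalize_posteriors(log_likelihoods):
    """Map {hypothesis: log-likelihood} to {hypothesis: normalized probability}.

    Assumes a uniform prior over hypotheses; uses the log-sum-exp trick so that
    very negative log-likelihoods do not underflow to zero all at once.
    """
    if not log_likelihoods:
        return {}
    peak = max(log_likelihoods.values())
    if peak == float("-inf"):
        return {h: 0.0 for h in log_likelihoods}  # nothing is reachable yet
    weights = {h: math.exp(v - peak) for h, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}
```

Under this assumption, a hypothesis key might combine the automaton state, the word, and whether the word is treated as finished or still in progress, so that both kinds of hypothesis compete within the same normalization.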

【００５４】[0054]

【数９】 [Equation 9]

[0055] <Recognition determination> Step 15 (determination of the words with maximum posterior probability): From the posterior probability PF_q^n(t) obtained in step 14 that a recognition target word n reaching automaton state q finishes being uttered at frame time t, and the posterior probability PC_q^n(t) that a recognition target word or unknown word n reaching automaton state q is in the middle of being uttered at frame time t, two results are obtained as follows: the word n^F with the maximum posterior probability under the assumption that frame time t is the end of a word's utterance, together with its automaton state q^F, and the word n^C with the maximum posterior probability under the assumption that frame time t is in the middle of a word's utterance, together with its automaton state q^C.

【００５６】[0056]

【数１０】 [Equation 10]

[0057] Here, the respective maximum posterior probabilities are defined as follows.

【００５８】[0058]

【数１１】 [Equation 11]

[0059] Step 15 further compares A and B. If A is larger, it is judged that there is a word spotting result consisting of the word n^F ending at the current frame time t, and processing moves to step 16. If B is larger, it is judged that there is no word spotting result, and processing moves to step 21.

[0060] Step 16 (determination of the word sequence of the word spotting result): To obtain each recognition target word in the word sequence whose last word, reaching the word-spotted automaton state q^F, is n^F, set q_0 = q^F, b_0 = t, i = 0, and k = 0, and repeat steps 17 through 19.

【００６１】ステップ１７（認識対象かどうかの判定）もし、Ｎ_qi（ｂ_i）が認識対象単語であれば、次式のよ
うにｋを１増加させると同時に、認識結果Ｗ_kとして登
録する。 Step 17 (Determination of whether the word is a recognition target) If N_qi(b_i) is a recognition target word, k is incremented by 1 and the word is registered as the recognition result W_k, as in the following expression.

【００６２】[0062]

【数１２】 [Equation 12]

【００６３】ステップ１８（直前の単語の終端時刻と状
態の決定）Ｎ_qi（ｂ_i）の直前の単語の終端フレーム時刻ｂ_i+1と、
その単語が至ったオートマトンの状態ｑ_i+1を、次式に
よって求める。 Step 18 (Determination of the end time and state of the immediately preceding word) The end frame time b_i+1 of the word immediately preceding N_qi(b_i), and the automaton state q_i+1 reached by that word, are obtained by the following expression.

【００６４】[0064]

【数１３】 [Equation 13]

【００６５】ステップ１９（音声の始端まで達したかど
うかの判定）もし、ｂ_i+1＝０ならば、音声の始端にまで遡って単語
をすべて検索し終ったことになり、ステップ２０へ移
る。それ以外の場合は、ｉ＝ｉ＋１として、ステップ１
７へ戻る。 Step 19 (Determination of whether the beginning of the speech has been reached) If b_i+1 = 0, the search has traced back to the beginning of the speech and retrieved all the words, so the process proceeds to step 20. Otherwise, set i = i + 1 and return to step 17.

【００６６】ステップ２０（ワードスポッティング結果
の出力）認識結果出力部６に、フレーム時刻ｔで終端するｋ個の
認識対象単語からなる単語列Ｗ_k，Ｗ_k-1，…，Ｗ₁がワ
ードスポッティングされたことを出力する。 Step 20 (Output of the word spotting result) The recognition result output unit 6 outputs that the word string W_k, W_k-1, ..., W₁, consisting of k recognition target words and ending at frame time t, has been word-spotted.
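
The trace-back of steps 16 to 20 can be sketched as below. The tables `word_at`, `prev_end`, and `prev_state` are hypothetical stand-ins for N_q(b) and for the back-pointers that [Equation 13] would supply. Their names and the `(state, frame)` key layout are assumptions of this sketch, since the equation bodies are not reproduced in this text.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical key layout: (automaton state q, end frame time b).
Node = Tuple[int, int]


def trace_back(q_f: int, t: int,
               word_at: Dict[Node, str],
               prev_end: Dict[Node, int],
               prev_state: Dict[Node, int],
               keywords: Set[str]) -> List[str]:
    """Return the spotted keywords in spoken order (W_k, ..., W_1)."""
    q, b = q_f, t                  # step 16: q_0 = q^F, b_0 = t
    found: List[str] = []          # W_1, W_2, ... (last word first)
    while True:
        word = word_at[(q, b)]     # N_{q_i}(b_i)
        if word in keywords:       # step 17: register targets only
            found.append(word)
        b_next = prev_end[(q, b)]      # step 18: b_{i+1}
        q_next = prev_state[(q, b)]    # step 18: q_{i+1}
        if b_next == 0:            # step 19: speech start reached
            break
        q, b = q_next, b_next      # i = i + 1, back to step 17
    return found[::-1]             # step 20: output W_k, ..., W_1
```

Because words are registered while walking backwards, W₁ is the last spoken keyword and W_k the first, so reversing the list yields the order in which they were uttered.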

【００６７】＜フレーム時刻の更新制御＞ステップ２１（フレーム時刻の更新）フレーム時刻ｔを一つ進めて、入力音声の終端に達して
いないならば、ステップ２に戻る。<Frame Time Update Control> Step 21 (Frame Time Update) The frame time t is advanced by one, and if the end of the input voice has not been reached, the process returns to step 2.
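
The frame-synchronous control flow can be pictured as a single loop over frames, sketched below. The three callables are hypothetical placeholders for the per-frame likelihood and posterior update, the step 15 decision, and the trace-back of steps 16 to 19. Numbering frames from 1 is an assumption, made because b = 0 denotes the beginning of the speech in step 19.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, int]            # hypothetical: (word, automaton state)
Posteriors = Dict[Key, float]


def spot_words(
    frames: Iterable[object],
    update_posteriors: Callable[[int, object], Tuple[Posteriors, Posteriors]],
    decide: Callable[[Posteriors, Posteriors], Optional[Key]],
    trace_back: Callable[[int, int], List[str]],
) -> List[Tuple[int, List[str]]]:
    """Run one pass over the input and collect (end frame, keyword string)."""
    detections: List[Tuple[int, List[str]]] = []
    for t, feature in enumerate(frames, start=1):   # step 21 advances t
        pf, pc = update_posteriors(t, feature)      # per-frame update
        hit = decide(pf, pc)                        # step 15
        if hit is not None:
            _, state = hit
            # Steps 16 to 20: report the keyword string ending at frame t.
            detections.append((t, trace_back(state, t)))
    return detections
```

Results are produced while the input is still arriving, which is the time-synchronous property the method aims for.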

【００６８】以上のような動作によって、図１の実施例
では、予め作成して各記憶部７，８，９に記憶しておい
た、認識対象単語とその他の未知語が出現する順序関係
を規定した有限状態オートマトンと、認識対象単語の音
声特徴時系列をあらわすキーワード隠れマルコフモデル
および未知語の音声特徴時系列や雑音などの非音声の特
徴時系列を包括的にあらわすガーベッジ隠れマルコフモ
デルを用いて、制御部１０の繰り返し動作制御下で、尤
度計算部３では、オートマトンの開始状態から各状態ま
でで受理可能な認識対象単語および未知語からなる全て
の部分単語系列と、音声始端から各時刻までの入力音声
の特徴系列との間で、部分単語列を表すキーワードおよ
びガーベッジ隠れマルコフモデルによって計算される尤
度が時刻に同期して逐次的に求められる。同様に、事後
確率計算部４では、この尤度を用いることによって、各
時刻が認識対象単語を最後尾単語とするような各部分単
語系列の発声終了である場合の事後確率と、各時刻が各
部分単語系列の発声途中である場合の事後確率が算出さ
れる。認識判定部５では、これらの事後確率のうちで最
大値を示すものが、ある部分単語系列の発声終了である
場合に対応するときに、その部分単語系列中に存在する
認識対象単語が部分単語系列内での出現順序通りに現れ
たと判定し、その認識結果が認識結果出力部６から出力
される。Through the above operation, the embodiment of FIG. 1 uses three models that are created in advance and stored in the storage units 7, 8, and 9: a finite-state automaton that defines the order in which recognition target words and other unknown words appear, a keyword hidden Markov model that represents the speech feature time series of the recognition target words, and a garbage hidden Markov model that comprehensively represents the speech feature time series of unknown words and the feature time series of non-speech such as noise. Under the repetitive operation control of the control unit 10, the likelihood calculation unit 3 sequentially obtains, in synchronization with time, the likelihood between every partial word sequence of recognition target words and unknown words that is acceptable from the start state of the automaton to each state and the feature sequence of the input speech from the start of the speech to each time. This likelihood is computed with the keyword and garbage hidden Markov models that represent the partial word sequence. Similarly, the posterior probability calculation unit 4 uses this likelihood to calculate the posterior probability that each time is the end of utterance of each partial word sequence whose last word is a recognition target word, and the posterior probability that each time is in the middle of utterance of each partial word sequence. When the maximum of these posterior probabilities corresponds to the end of utterance of some partial word sequence, the recognition determination unit 5 determines that the recognition target words in that partial word sequence have appeared in their order of appearance within the sequence, and the recognition result is output from the recognition result output unit 6.

【００６９】したがって、入力音声の時刻に同期して、
各時刻までに存在するキーワードの連鎖を高速に検出で
き、また、オートマトンによって単語列を規定すること
によって誤ったキーワードの検出や脱落を最小限にとど
めることが可能となる。Therefore, in synchronization with the time of the input speech, the chain of keywords present up to each time can be detected at high speed, and defining the word string by an automaton makes it possible to minimize erroneous keyword detections and omissions.

【００７０】[0070]

【発明の効果】以上説明したように、この発明によるワ
ードスポッティング方法では、入力音声が最後まで発声
されて音声区間が確定する以前に、オートマトンで推定
されるような一連の単語連鎖の一部分までが達成された
かどうかを、入力の時刻に同期して検出できる。したが
って、従来の代表的方法が持つ問題点、すなわち、先の
第一の方法のような、誤った位置でのキーワードの検出
や正解の脱落などを生じてしまうという問題点、部分マ
ッチングに対応するために単語対の相互位置関係に関す
る情報を用いた後処理が必要であるという問題点、ま
た、先の第二の方法のような、入力の時刻に同期して、
各時点で終端するキーワードを求めることが不可能であ
り、入力される音声区間が確定（発声が終了）しなけれ
ば認識結果が求められないという問題点をいずれも同時
に解消し、キーワードや未知語の出現順序を考慮しなが
ら、入力音声の時刻に同期して高速に、かつ精度良く、
入力音声中に存在するキーワードおよびいくつかのキー
ワードの時間的連鎖を検出することが可能になる。As described above, the word spotting method according to the present invention can detect, in synchronization with the time of the input and before the input speech has been uttered to the end and the speech section is fixed, whether part of a series of word chains estimated by the automaton has been completed. It therefore resolves all of the problems of the conventional representative methods at the same time, namely: the problem, as in the first method described above, that keywords are detected at wrong positions or correct answers are dropped; the problem that post-processing using information on the relative positions of word pairs is needed to handle partial matching; and the problem, as in the second method described above, that a keyword ending at each point in time cannot be obtained in synchronization with the time of the input, so that no recognition result is obtained until the input speech section is fixed (the utterance ends). Taking the order of appearance of keywords and unknown words into account, the method can detect keywords present in the input speech, and temporal chains of several keywords, at high speed and with good accuracy in synchronization with the time of the input speech.

[Brief description of drawings]

【図１】この発明を適用した認識装置の一実施例のブロ
ック構成図である。FIG. 1 is a block configuration diagram of an embodiment of a recognition device to which the present invention is applied.

【図２】この発明の実施例において用いられている有限
状態オートマトンの一例を示す図である。FIG. 2 is a diagram showing an example of a finite state automaton used in an embodiment of the present invention.

【図３】この発明の実施例におけるワードスポッティン
グ手順の全体的フローチャートである。FIG. 3 is an overall flowchart of a word spotting procedure in the embodiment of the present invention.

[Explanation of symbols]

１音声入力部２音声分析部３尤度計算部４事後確率計算部５認識判定部６認識結果出力部７キーワード隠れマルコフモデル記憶部８ガーベッジ隠れマルコフモデル記憶部９オートマトン記憶部１０制御部 1: speech input unit, 2: speech analysis unit, 3: likelihood calculation unit, 4: posterior probability calculation unit, 5: recognition determination unit, 6: recognition result output unit, 7: keyword hidden Markov model storage unit, 8: garbage hidden Markov model storage unit, 9: automaton storage unit, 10: control unit

Claims

[Claims]

1. A word spotting method characterized in that: a finite-state automaton that defines the order in which recognition target words and other unknown words appear, a keyword hidden Markov model that represents the speech feature time series of the recognition target words, and a garbage hidden Markov model that comprehensively represents the speech feature time series of unknown words and the feature time series of non-speech such as noise are created in advance; the likelihood between every partial word sequence of recognition target words and unknown words that is acceptable from the start state of the automaton to each state and the feature sequence of the input speech from the start of the speech to each time is obtained sequentially, in synchronization with time, using the keyword and garbage hidden Markov models that represent the partial word sequence; using this likelihood, the posterior probability that each time is the end of utterance of each partial word sequence whose last word is a recognition target word, and the posterior probability that each time is in the middle of utterance of each partial word sequence, are calculated; and when the maximum of these posterior probabilities corresponds to the end of utterance of a partial word sequence, the recognition target words present in that partial word sequence are recognized as having appeared in their order of appearance within the partial word sequence.