JPH09258767A - Voice spotting device

Info

Publication number: JPH09258767A
Application number: JP8068211A
Authority: JP
Inventors: Toshiyuki Hanazawa; 利行花沢; Yoshiharu Abe; 芳春阿部; Kunio Nakajima; 邦男中島
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1996-03-25
Filing date: 1996-03-25
Publication date: 1997-10-03

Abstract

PROBLEM TO BE SOLVED: To spot word and phrase speech in speech containing unknown words, to spot the unknown-word intervals as well, and to output information on how likely each interval is to be an unknown word. SOLUTION: A heuristic forward likelihood 7 and a heuristic backward likelihood 8 are computed using a heuristic phrase model 12, which models the time series of feature vectors of phrase speech with various utterance contents. Phrase spotting is performed using the heuristic forward likelihood 7, the heuristic backward likelihood 8, and a phrase model 13, which models the time series of spectral feature vectors of the phrase speech to be spotted. The speech interval of an unknown word is spotted using the heuristic forward likelihood 7, the heuristic backward likelihood 8, and an unknown word model 15, which models the time series of spectral feature vectors of arbitrary speech.

Description

Detailed Description of the Invention

【０００１】[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a voice spotting device that spots word and phrase speech in speech containing unknown words, also spots unknown-word intervals, and outputs information on how likely each interval is to be an unknown word.

【０００２】[0002]

2. Description of the Related Art

The technique of detecting a specific word or phrase, or a partial sentence forming a semantic unit, in continuously uttered speech is generally called spotting. Word spotting is used here as an example to explain the technique. Spotting a word means, specifically, finding in continuously uttered speech the utterance start time (hereinafter, start time) and utterance end time (hereinafter, end time) of the word to be spotted, together with a spotting score. The spotting score is a numerical expression of the confidence that the word actually exists in the spotted speech interval. It is generally obtained by computing the similarity between a word model, which models the spectral feature time series of the speech of the word to be spotted, and the spectral feature time series of the input speech.

【０００３】[0003]

In a method that performs continuous speech recognition by spotting words, all words to be spotted are normally spotted first, and the spotting results are then sent to a module that performs language processing (hereinafter, the language processing unit). Using various kinds of linguistic information, the language processing unit connects the spotted words to assemble sentences and performs processing such as meaning extraction. The language processing unit can use syntax rules as linguistic information, or it can perform this processing using only the semantic information of words, without syntax rules. This approach therefore allows far greater freedom in the sentences that can be recognized than a syntax-driven continuous speech recognition method, in which the syntax of recognizable sentences, that is, the connection rules between words, is given top-down and recognition is performed sentence by sentence.

【０００４】[0004]

However, in a method that performs continuous speech recognition by spotting words, the spotting scores of the words must be compared while the spotted words are being connected into sentences. With the usual start- and end-point-free spotting method ("Word spotting based on HMM phoneme recognition", IEICE Technical Report SP88-23, Vol. 88, No. 92, pp. 17-22, Takeshi Kawabata, Toshiyuki Hanazawa, Kiyohiro Shikano, June 24, 1988, published by the Institute of Electronics, Information and Communication Engineers), a comparison between the spotting scores of words spotted at different times on the time axis lacks accuracy. The start- and end-point-free spotting method has a second problem: even in an interval of the input speech where the word to be spotted is absent, the word is often spotted erroneously if the interval is acoustically similar to it. This phenomenon is called a "false alarm" (literally, a "welling-up error").

【０００５】[0005]

In recent years, a method called "heuristic spotting" was proposed as a word spotting method that reduces these two problems, in "Word spotting in conversational speech using a heuristic language model" (Transactions of the IEICE, Vol. J78-D-II, No. 7, pp. 1013-1020, Tatsuya Kawahara, Toshihiko Munetsugu, Shuji Doshita, July 25, 1995, published by the Information and Systems Society of the Institute of Electronics, Information and Communication Engineers) (hereinafter, Reference 1).

【０００６】[0006]

FIG. 5 is a block diagram of the word spotting method described in Reference 1. In the figure, 1 is the speech signal input terminal, 2 is the speech signal input from the input terminal 1, and 3 is analysis means that computes the time series of acoustic feature vectors of the speech signal 2. 4 is the time series of feature vectors output by the analysis means 3; 5 is a heuristic language model, which models the time series of feature vectors of speech with various utterance contents; and 6 is heuristic language model matching means, which takes the feature vector time series 4 of the input speech as input and uses the heuristic language model 5 to compute the heuristic forward likelihood 7 and the heuristic backward likelihood 8 for that time series. 9 is a word model, which models the time series of spectral feature vectors of the speech of a word to be spotted; 10 is spotting means, which takes the word model 9, the feature vector time series 4, the heuristic forward likelihood 7, and the heuristic backward likelihood 8 as input and spots words using the word model 9; and 11 is the word spotting result.

【０００７】[0007]

In this configuration, the heuristic language model 5 and the word model 9 are modeled using HMMs (hidden Markov models). Because HMMs are used, the similarity between the feature vector time series 4 and the heuristic language model 5 or word model 9 is measured by the likelihood, a probabilistic measure of how likely the time series 4 is to be generated by that model. This spotting method assumes that the input speech is not unstructured but satisfies certain linguistic constraints, and it expresses those constraints by a model called the heuristic language model. When a word is spotted, the heuristic language model 5 is used together with the word model 9 to be spotted to compute a spotting score that takes the entire input speech into account.

【０００８】[0008]

Specifically, as shown in FIG. 6, the heuristic language model 5 is connected before and after the word model 9 to be spotted so that the entire input speech is modeled. That is, the interval of the input speech in which the word to be spotted exists is modeled by the word model 9, and the remaining speech intervals are modeled by the heuristic language model 5. The likelihood for the feature vector time series 4 is then computed by the method described below to obtain the spotting score.

【０００９】[0009]

Following Reference 1, this configuration is described for the case of spotting words related to personal schedule management. The word models 9 to be spotted are therefore models of a set of words determined by the personal schedule management task. Naturally, these are words expected to occur in the input speech. As shown in FIG. 7, the heuristic language model 5 consists of models of exactly the same words as the word models 9 to be spotted, connected so that any model can follow any other. In FIG. 7 each word model 9 is shown enclosed in a rectangle.

【００１０】[0010]

If the input speech is assumed to contain only words that make up the heuristic language model 5, then the heuristic language model 5, built so that word models can be chained arbitrarily, can model every feature vector time series 4. If, on the other hand, the input speech contains words other than those that make up the heuristic language model 5 (hereinafter, unknown words), the heuristic language model 5 built in this way cannot model the feature vector time series 4 in the unknown-word intervals. The effect of this is described later.

【００１１】[0011]

Next, an example of the operation of this spotting method is described with reference to FIG. 5. The speech signal 2 input from the input terminal 1 is converted by the analysis means 3 into the feature vector time series 4, X_1, X_2, X_3, ..., X_T. Here X denotes a feature vector and the subscript denotes its time. The feature vector X is, for example, the LPC cepstrum. The heuristic language model matching means 6 takes the feature vector time series 4 output by the analysis means 3 as input and, using the heuristic language model 5, computes and outputs the heuristic forward likelihood 7, S_fw(t) (t = 1 to T), by equation (3) and the heuristic backward likelihood 8, S_bw(t) (t = 1 to T), by equation (6).

【００１２】[0012]

As preparation for explaining how the heuristic forward likelihood 7 and the heuristic backward likelihood 8 are computed, the forward likelihood α(S_i, t) and the backward likelihood β(S_i, t) are defined by equations (1) and (2), respectively.

【００１３】[0013]

【数１】 [Equation 1]

【００１４】[0014]

【数２】 [Equation 2]

【００１５】[0015]

That is, α(S_i, t) is the probability that the heuristic language model 5, modeled as an HMM, starts its transitions from the initial state at time 0, outputs the partial sequence X_1, X_2, X_3, ..., X_t of the feature vector time series 4, and reaches state S_i at time t. β(S_i, t) is the corresponding probability with the time axis reversed: the heuristic language model 5 starts its transitions from the final state at time T, outputs the partial sequence X_T, X_{T-1}, X_{T-2}, ..., X_{t+1} of the feature vector time series 4, and reaches state S_i at time t.

【００１６】[0016]

The heuristic forward likelihood 7, S_fw(t) (t = 1 to T), and the heuristic backward likelihood 8, S_bw(t) (t = 1 to T), are computed from the forward and backward likelihoods defined in this way. The heuristic forward likelihood 7, S_fw(t) (t = 1 to T), is obtained by using the heuristic language model 5 to compute, by equation (3), the forward likelihood for the feature vector time series 4, X_1, X_2, X_3, ..., X_T.

【００１７】[0017]

【数３】 (Equation 3)

【００１８】[0018]

This forward likelihood α(S_J, t) is obtained by evaluating the recurrences of equations (4) and (5) forward along the time axis. In equations (4) and (5), a_ji is the probability of a transition from state S_j to state S_i, and b_ji(X_t) is the probability of outputting the feature vector X_t on the transition from state S_j to state S_i.

[Initial values at t = 0]

【００１９】[0019]

【数４】 (Equation 4)

【００２０】[0020]

[Recurrence for t = 1 to T and S_i (i = 1 to J)]

【００２１】[0021]

【数５】 (Equation 5)
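The forward recurrence described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of equations (4) and (5), which are not legible in this text: it assumes the summed form of the recurrence, an arc-emission HMM in which `b[j][i]` is a callable returning the output probability on the transition from state j to state i, and 0-based state indices. The names `forward_likelihood`, `a`, `b`, and `initial_state` are hypothetical.

```python
def forward_likelihood(a, b, observations, initial_state=0):
    """Forward recurrence for an arc-emission HMM (illustrative sketch).

    a[j][i]    : probability of a transition from state j to state i
    b[j][i](x) : probability of outputting x on that transition
    Returns alpha, where alpha[t][i] is the probability of having output
    the first t observations and being in state i at time t.
    """
    n_states = len(a)
    alpha = [[0.0] * n_states for _ in range(len(observations) + 1)]
    alpha[0][initial_state] = 1.0  # t = 0: start in the initial state
    for t, x in enumerate(observations, start=1):
        for i in range(n_states):
            # Sum over every predecessor state j that can reach state i.
            alpha[t][i] = sum(
                alpha[t - 1][j] * a[j][i] * b[j][i](x)
                for j in range(n_states)
                if a[j][i] > 0.0
            )
    return alpha
```

If the last state is taken as the final state S_J, the heuristic forward likelihood at time t would correspond to `alpha[t][-1]`. A practical implementation would normally work with log likelihoods to avoid underflow on long utterances.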

【００２２】[0022]

The heuristic backward likelihood 8, S_bw(t) (t = 1 to T), is obtained by using the heuristic language model 5 to compute, by equation (6), the backward likelihood for the spectral feature vector time series 4.

【００２３】[0023]

【数６】 (Equation 6)

【００２４】[0024]

This backward likelihood β(S_1, t) is obtained by evaluating the recurrences of equations (7) and (8) backward along the time axis.

[Initial values at t = T]

【００２５】[0025]

【数７】 (Equation 7)

【００２６】[0026]

[Recurrence for t = T-1 to 1 and S_i (i = 1 to J)]

【数８】 (Equation 8)
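The backward recurrence can be sketched in the same way. As before, this is only an illustration under assumptions: the summed form of the recurrence, an arc-emission HMM with callable output probabilities `b[i][j]`, and 0-based indices. The function name and parameters are hypothetical, and the loop is continued down to t = 0 purely so that the total likelihood can be read off.

```python
def backward_likelihood(a, b, observations, final_state):
    """Backward recurrence for an arc-emission HMM (illustrative sketch).

    beta[t][i] is the probability of outputting observations[t:] when
    starting from state i at time t and ending in final_state at time T.
    """
    n_states = len(a)
    T = len(observations)
    beta = [[0.0] * n_states for _ in range(T + 1)]
    beta[T][final_state] = 1.0  # t = T: must already be in the final state
    for t in range(T - 1, -1, -1):
        x = observations[t]  # X_{t+1} in the 1-based notation of the text
        for i in range(n_states):
            # Sum over every successor state j reachable from state i.
            beta[t][i] = sum(
                a[i][j] * b[i][j](x) * beta[t + 1][j]
                for j in range(n_states)
                if a[i][j] > 0.0
            )
    return beta
```

With state 0 as the initial state S_1, the heuristic backward likelihood at time t would correspond to `beta[t][0]`, and `beta[0][0]` equals the total likelihood of the utterance.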

【００２７】[0027]

The spotting means 10 takes the feature vector time series 4, the heuristic forward likelihood 7, and the heuristic backward likelihood 8 as input and, using the word models 9, computes by equation (9) the spotting score F^(n)(t) (t = 1 to T) of each word to be spotted. The superscript n is the word model number, n = 1, 2, 3, ..., N (N: total number of words to be spotted). When the spotting score F^(n)(t) is equal to or greater than a predetermined threshold, the word model number n, the start time stime^(n)(t) obtained by equation (10), the end time t, and the spotting score F^(n)(t) are output as the spotting result 11.

【００２８】[0028]

【数９】 [Equation 9]

【００２９】[0029]

【数１０】 (Equation 10)

【００３０】[0030]

The product in the numerator on the right-hand side of equation (9) means that the heuristic language model 5 is connected after the word model 9 to be spotted and a score is obtained for the whole feature vector time series 4, X_1, X_2, X_3, ..., X_T.

【００３１】[0031]

The denominator on the right-hand side of equation (9) is the heuristic forward likelihood 7 at time T, that is, the likelihood obtained when the word models making up the heuristic language model 5 are connected so as to maximize the likelihood for the whole feature vector time series 4, X_1, X_2, X_3, ..., X_T. Because every word model 9 to be spotted is included among the word models making up the heuristic language model 5, the relation of equation (11) always holds for the right-hand side of equation (9), and the spotting score F^(n)(t) takes a value from 0 to 1.

【００３２】[0032]

【数１１】 [Equation 11]
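The scoring and thresholding step can be sketched as follows. Equation (9) is not legible in this text, so the sketch assumes the form suggested by the description of its numerator and denominator: the word forward likelihood at time t multiplied by the heuristic backward likelihood at time t, divided by the heuristic forward likelihood at time T. The names `Hit` and `spot_word`, and the convention that list index t holds the value for time t, are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    word: int     # word model number n
    start: int    # start time of the spotted word
    end: int      # end time t
    score: float  # spotting score, expected to lie between 0 and 1


def spot_word(word: int,
              word_alpha: List[float],
              s_bw: List[float],
              s_fw_total: float,
              start_time: List[int],
              threshold: float) -> List[Hit]:
    """Return every end time t whose normalized score reaches the threshold.

    word_alpha[t] : word forward likelihood at the word's final state
    s_bw[t]       : heuristic backward likelihood at time t
    s_fw_total    : heuristic forward likelihood at time T
    start_time[t] : start time recorded by the back pointer for end time t
    """
    hits = []
    for t in range(1, len(word_alpha)):
        score = word_alpha[t] * s_bw[t] / s_fw_total
        if score >= threshold:
            hits.append(Hit(word, start_time[t], t, score))
    return hits
```

Because the score is normalized by the likelihood of the whole utterance, scores from different end times share one scale, which is what allows a single threshold to be applied to all of them.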

【００３３】[0033]

The word forward likelihood α^(n)(S_J, t) in equation (9) is obtained, like the heuristic forward likelihood 7, by evaluating the recurrences of equations (12), (14), and (16) forward along the time axis. As equation (14) shows, however, the computation at the initial state differs from that for the heuristic forward likelihood 7. This corresponds to connecting the heuristic language model 5 in front of the word model 9 to be spotted when computing the word forward likelihood α(S_J, t). The back pointer BTK^(n)(S_J, t), which stores the start time of word n, is obtained by the recurrences of equations (13), (15), (17), and (18).

[Initial values at t = 0]

【００３４】[0034]

【数１２】 (Equation 12)

【００３５】[0035]

【数１３】 (Equation 13)

【００３６】[0036]

[Recurrence for t = 1 to T at the initial state S_1 of the word model]

【００３７】[0037]

【数１４】 [Equation 14]

【００３８】[0038]

【数１５】 (Equation 15)

【００３９】[0039]

[Recurrence for t = 1 to T at the states S_i (i = 2 to J) other than the initial state of the word model]

【００４０】[0040]

【数１６】 (Equation 16)

【００４１】[0041]

【数１７】 [Equation 17]

【００４２】[0042]

Here,

【００４３】[0043]

【数１８】 (Equation 18)

【００４４】[0044]

As described above, this spotting method connects the heuristic language model 5 before and after the word model 9 to be spotted. The interval of the input speech in which the word to be spotted exists is modeled by the word model 9, and the remaining speech intervals are modeled by the heuristic language model 5.

【００４５】[0045]

The likelihood is then computed against the whole feature vector time series 4, X_1, X_2, X_3, ..., X_T, to obtain the spotting score. The spotting score therefore always takes the whole time series into account, so the spotting scores of words spotted at different times on the time axis can be compared accurately. In addition, because the heuristic language model 5 models the speech intervals before and after the word to be spotted, false alarms can be reduced.

【００４６】[0046]

The effectiveness of this reduction in false alarms depends on the linguistic constraints that the heuristic language model 5 assumes when modeling the speech intervals before and after the word to be spotted. For example, as described in Reference 1, besides the chain of word models 9 described in this example, the heuristic language model 5 can be a syllable chain model: syllable models, each modeling the feature vector time series of one syllable occurring in Japanese, are connected to one another so that the feature vector time series of an arbitrary syllable sequence can be modeled.

【００４７】[0047]

The difference between these two heuristic language models is illustrated by an example. Consider spotting the word "kyou" (today) in the utterance "kyou, toukyou e iku" ("today, I go to Tokyo"). If the syllable chain model is used as the heuristic language model, the utterance can be modeled in two ways: as spotting target word model ("kyou") + heuristic language model ("to" + "u" + "kyo" + "u" + "e" + "i" + "ku"), or as heuristic language model ("kyo" + "u" + "to" + "u") + spotting target word model ("kyou") + heuristic language model ("e" + "i" + "ku"). As a result, the word model "kyou" is spotted not only at the correct position but also at the "kyou" portion of "toukyou", producing a false alarm.
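The two admissible placements can be reproduced at the symbol level with a small sketch. This is only a symbolic illustration under assumptions: real spotting operates on acoustic likelihoods, not on syllable labels, and the romanized syllable lists and the helper name `occurrences` are hypothetical.

```python
def occurrences(utterance, word):
    """Return every position where the word's syllables occur in the utterance."""
    n = len(word)
    return [i for i in range(len(utterance) - n + 1) if utterance[i:i + n] == word]


# "kyou, toukyou e iku" split into syllables
utterance = ["kyo", "u", "to", "u", "kyo", "u", "e", "i", "ku"]
kyou = ["kyo", "u"]

# A syllable chain filler accepts any syllable sequence around the target,
# so every occurrence is a valid placement of the word model.
print(occurrences(utterance, kyou))
```

The second position corresponds to the "kyou" inside "toukyou", which is the false alarm described in the text.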

【００４８】[0048]

If, on the other hand, the word chain model is used as the heuristic language model and it consists of the four word models "kyou" (today), "toukyou" (Tokyo), "e" (to), and "iku" (go), the input utterance "kyou, toukyou e iku" can be modeled only as spotting target word model ("kyou") + heuristic language model ("toukyou" + "e" + "iku"). The word "kyou" is therefore prevented from being spotted at the "kyou" portion of "toukyou". The drawback is that, when the word chain model is used as the heuristic language model, unknown words in the input speech cause false alarms.
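The effect of the word chain constraint can likewise be illustrated at the symbol level. The sketch below enumerates every way of covering a syllable sequence with words from a lexicon; it is an assumption-laden illustration (labels instead of acoustic likelihoods, a hypothetical `segmentations` helper and romanized lexicon), not the patent's procedure.

```python
def segmentations(syllables, lexicon):
    """Enumerate every exact cover of the syllable sequence by lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for word, pron in lexicon.items():
        n = len(pron)
        if syllables[:n] == pron:
            # The word matches the front; cover the remainder recursively.
            results += [[word] + rest for rest in segmentations(syllables[n:], lexicon)]
    return results


lexicon = {
    "kyou": ["kyo", "u"],
    "toukyou": ["to", "u", "kyo", "u"],
    "e": ["e"],
    "iku": ["i", "ku"],
}

known = ["kyo", "u", "to", "u", "kyo", "u", "e", "i", "ku"]
print(segmentations(known, lexicon))
```

Only one cover exists for the in-vocabulary utterance, so "kyou" cannot be placed inside "toukyou". An utterance that begins with an out-of-vocabulary word such as "kinou" has no exact cover at all; an acoustic model would then force the closest in-vocabulary word onto that interval.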

【００４９】[0049]

For example, suppose again that the heuristic language model consists of the four word models "kyou", "toukyou", "e", and "iku", and that the utterance is "kinou, toukyou e itta" ("yesterday, I went to Tokyo"). The words "kinou" (yesterday) and "itta" (went) are unknown words, so the speech intervals of "kinou" and "itta" are forcibly modeled by whichever of the word models "kyou", "toukyou", "e", and "iku" is most similar. That is, the heuristic language model models the utterance as the connection ("kyou" + "toukyou" + "e" + "iku"). If the word "kyou" is spotted in this situation, the spotting target word model "kyou" and the word model "kyou" inside the heuristic language model are the same model. The target word model "kyou" is therefore spotted in the speech interval that the heuristic language model modeled as "kyou", and a false alarm results.

【００５０】[0050]

Moreover, as equation (9) shows, the spotting score is normalized by the likelihood obtained when the heuristic language model models the entire input speech. In the speech interval that the word-chain heuristic language model modeled as "kyou", the spotting score of the target word model "kyou" therefore takes its maximum value (1.0). Even though the input speech contains the unknown word "kinou", the spotting score gives no indication whatsoever that an unknown word may be present.

【００５１】[0051]

If, on the other hand, the syllable chain model is used as the heuristic language model, the utterance "kinou, toukyou e itta" can be modeled by connecting the syllable models as ("ki" + "no" + "u" + "to" + "u" + "kyo" + "u" + "e" + "i" + "t" + "ta"). In the "kinou" interval of the utterance, modeling by the heuristic language model ("ki" + "no" + "u") gives a higher likelihood than modeling by the target word model "kyou". So even if the word "kyou" is spotted in the "kinou" interval, its spotting score never reaches the maximum (1.0).
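The contrast between the two normalizations can be shown with a toy calculation. The likelihood values below are invented purely for illustration, and the helper `normalized_score` is hypothetical; it reduces the normalization to a single interval, whereas the text normalizes over the whole utterance.

```python
def normalized_score(target_likelihood, filler_likelihoods):
    """Target likelihood divided by the best likelihood any filler path
    assigns to the same interval (single-interval simplification)."""
    return target_likelihood / max(filler_likelihoods.values())


# Hypothetical likelihoods for the acoustic interval of the unknown word "kinou".
target_kyou = 1e-6

# Word chain filler: the best available filler is the target word itself.
word_chain = {"kyou": 1e-6, "toukyou": 1e-9, "e": 1e-12, "iku": 1e-10}

# Syllable chain filler: the true syllable sequence fits the interval better.
syllable_chain = {"ki-no-u": 1e-4, "kyo-u": 1e-6}

print(normalized_score(target_kyou, word_chain))
print(normalized_score(target_kyou, syllable_chain))
```

With the word chain filler the ratio is exactly 1.0, so the score carries no hint of an unknown word. With the syllable chain filler the ratio drops well below 1.0.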

【００５２】[0052]

Thus, the stronger the linguistic constraints that the heuristic language model imposes on the input speech, the greater the suppression of false alarms; for speech that does not satisfy the assumed constraints, however, false alarms may instead increase. In other words, there is a trade-off between false alarm suppression and robustness to speech that does not satisfy the assumed linguistic constraints.

【００５３】[0053]

Problems to be Solved by the Invention

As described above, in the spotting method of Reference 1 there is a trade-off between false alarm suppression and robustness to speech that does not satisfy the assumed linguistic constraints, and it is difficult to keep both high. Furthermore, when a heuristic language model composed of a chain of the word models to be spotted is used, the spotting score may give no indication that an unknown word may be present even when the input speech contains one. In that case there is no means of learning that the input speech may contain an unknown word.

【００５４】[0054]

The present invention was made to solve these problems. Its object is to provide a voice spotting device that spots both the speech units to be spotted, such as words and phrases, and unknown-word intervals, and that can output information indicating that the input speech may contain an unknown word.

【００５５】[0055]

Means for Solving the Problems

A voice spotting device according to the present invention comprises: analysis means that performs acoustic analysis of input speech and outputs the time series of acoustic feature vectors of the input speech; a heuristic language model that models the time series of feature vectors of speech with various utterance contents; heuristic language model matching means that takes the feature vector time series of the input speech as input and uses the heuristic language model to compute a heuristic forward likelihood and a heuristic backward likelihood; a spotting target speech model that models the time series of spectral feature vectors of the speech units to be spotted; spotting means that takes the feature vector time series of the input speech, the heuristic forward likelihood, and the heuristic backward likelihood as input and spots the speech units to be spotted using the spotting target speech model; an unknown word model that models the time series of spectral feature vectors of arbitrary speech; and unknown word spotting means that takes the feature vector time series of the input speech, the heuristic forward likelihood, and the heuristic backward likelihood as input and spots the speech intervals of unknown words using the unknown word model. The device outputs both the spotting results for the speech units to be spotted and the spotting results for the unknown-word speech intervals.

【００５６】また次の発明に係る音声スポッティング装
置は、ヒューリスティック言語モデルは、種々の発話内
容の文節音声の特徴ベクトルの時系列をモデル化したヒ
ューリスティック文節モデルとし、スポッティング対象
とする音声単位を文節とし、スポッティング対象音声モ
デルとしてスポッティング対象とする文節音声のスペク
トル特徴ベクトルの時系列をモデル化した文節モデルを
用いることとし、未知語スポッティング手段は、文節区
間を単位として未知語を含む区間をスポッティングする
未知語文節区間スポッティング手段としたものである。In the voice spotting device according to the next invention, the heuristic language model is a heuristic phrase model that models time series of feature vectors of phrase speech with various utterance contents; the speech unit to be spotted is the phrase (bunsetsu); a phrase model that models the time series of spectral feature vectors of the phrase speech to be spotted is used as the spotting target speech model; and the unknown word spotting means is unknown word phrase section spotting means that spots sections containing unknown words in units of phrase sections.

【００５７】また次の発明に係る音声スポッティング装
置は、未知語スポッティング手段として、時間軸に対し
て逆方向に未知語のスポッティングを行う後ろ向き未知
語スポッティング手段を備え、任意の開始時刻につい
て、未知語を含む音声区間のスポッティング結果が得ら
れるようにしたものである。The voice spotting device according to the next invention comprises, as the unknown word spotting means, backward unknown word spotting means that spots unknown words in the reverse direction along the time axis, so that a spotting result for a speech section containing an unknown word is obtained for any start time.

【００５８】また次の発明に係る音声スポッティング装
置は、未知語スポッティング手段として、同一の終了時
刻で開始時刻の異なる複数個の未知語を含む音声区間を
スポッティングする複数未知語スポッティング手段を備
えたものである。The voice spotting device according to the next invention comprises, as the unknown word spotting means, plural unknown word spotting means that spots a plurality of speech sections containing unknown words that share the same end time but have different start times.

【００５９】[0059]

【発明の実施の形態】以下図面を参照しながら、この発
明の実施の形態について説明する。Embodiments of the present invention will be described below with reference to the drawings.

【００６０】実施の形態1．図５との対応部分に同一符
号を付けて示す図１は、この発明による音声スポッティ
ング装置の実施の形態１の構成である。この実施の形態
１においても、従来と同様にスポッティングのタスクは
個人スケジュール管理に関するものとする。従来につい
て説明した図５と異なるこの発明の特徴的な部分は、未
知語モデル１５と未知語文節区間スポッティッグ手段１
６を新たに設け、未知語を含む文節区間をスポッティン
グする点である。Embodiment 1. FIG. 1, in which parts corresponding to those in FIG. 5 are given the same reference numerals, shows the configuration of Embodiment 1 of the voice spotting device according to the present invention. In Embodiment 1, as in the prior art, the spotting task concerns personal schedule management. The characteristic difference from FIG. 5, which was described for the prior art, is that an unknown word model 15 and unknown word phrase section spotting means 16 are newly provided to spot phrase sections containing unknown words.

【００６１】またヒューリスティック言語モデルは文節
モデルの連鎖として構成したヒューリスティック文節モ
デル１２とし、またスポッティング手段１０においても
文節モデル１３を用いて単語ではなく文節を単位とスポ
ッティングを行うことが、従来との相違点である。図１
において文節モデル１３は、文節を構成する自立語のモ
デルに付属語のモデルを幾つか接続したモデルを用いる
ものとし、スポッティング対象とする全ての文節に対し
て文節モデルを用意する。A further difference from the prior art is that the heuristic language model is a heuristic phrase model 12 constructed as a chain of phrase models, and that the spotting means 10 also uses the phrase models 13 to perform spotting in units of phrases rather than words. In FIG. 1, each phrase model 13 is formed by connecting several function word models to the model of the content word that makes up the phrase, and a phrase model is prepared for every phrase to be spotted.

【００６２】また、ヒューリスティック文節モデル１２
は、図２に示すように文節モデル１３間で全ての接続を
許すように構成する。図において四角で囲まれた部分は
単語モデルを表しており、単語モデルのネットワークと
して文節モデル１２が構成される。スポッティングのタ
スクに現れる全ての文節のモデルをヒューリスティック
文節モデル１２に組み込むことにより、タスク内の発声
であれば、全ての発声に対する特徴ベクトルの時系列を
ヒューリスティック文節モデル１２によってモデル化す
ることができる。As shown in FIG. 2, the heuristic phrase model 12 is configured to allow all connections between the phrase models 13. In the figure, each boxed portion represents a word model, and the heuristic phrase model 12 is configured as a network of word models. By incorporating the models of all phrases that appear in the spotting task into the heuristic phrase model 12, the time series of feature vectors of any utterance within the task can be modeled by the heuristic phrase model 12.

【００６３】ヒューリスティック言語モデルとして、文
節モデル１３の連鎖で構成されるヒューリスティック文
節モデル１２を用いるということは、「入力音声の発話
内容は文節のつながりとして表現できる」という仮定を
たてることである。これは、従来で述べた単語の連鎖よ
りも強い言語制約であるが、日本語では多くの場合その
仮定を満たしているので、ヒューリスティック言語モデ
ルとして、単語モデルの連鎖を用いる場合と比較してさ
らに湧き出し誤りを少なくすることができる。Using the heuristic phrase model 12, which consists of a chain of phrase models 13, as the heuristic language model amounts to assuming that the utterance content of the input speech can be expressed as a sequence of phrases. This is a stronger language constraint than the chain of words described for the prior art, but Japanese utterances satisfy the assumption in most cases, so false alarms can be reduced further than when a chain of word models is used as the heuristic language model.

【００６４】一方、ヒューリスティック言語モデルとし
て文節モデルの連鎖を用いる場合、問題となるのは入力
音声中に未知語が含まれる場合である。この場合には、
従来でヒューリスティック言語モデルとして単語モデル
の連鎖を用いる場合の例で説明したときと同じ理由によ
り、未知語を含む文節区間においても、スポッティング
対象文節のうちで、前記未知語を含む文節区間の特徴ベ
クトルの時系列に対して最も尤度の高い文節モデルがス
ポッティッグされ、湧き出し誤りの原因となるととも
に、入力音声中に未知語が含まれている可能性があると
いう情報はスポッティングスコアには全く反映されな
い。On the other hand, when a chain of phrase models is used as the heuristic language model, a problem arises if the input speech contains an unknown word. In that case, for the same reason as explained for the prior-art example that uses a chain of word models as the heuristic language model, the phrase model with the highest likelihood for the time series of feature vectors of the phrase section containing the unknown word is spotted from among the spotting target phrases even in that section. This causes false alarms, and the information that the input speech may contain an unknown word is not reflected in the spotting score at all.

【００６５】そこでこの発明に係わる音声スポッティン
グ装置においては、未知語文節区間スポッティッグ手段
１６を新たに設け、未知語を含む文節区間をスポッティ
ッグすることによって、入力音声中に未知語が含まれて
いる可能性があるという情報を得るようにする。未知語
モデル１５はＨＭＭでモデル化する。未知語モデル１５
は、日本語の任意の音声の特徴ベクトルの時系列をモデ
ル化できることが必要である。本実施の形態では例え
ば、未知語モデル１５としてガベージモデルを用いるこ
とにする。ここでいうガベージモデルとは、日本語に現
れる全ての音節音声を偏りなく含むような大量の音声デ
ータを用いて、ガベージモデルのパラメータである遷移
確率ａ_ijと、出力確率ｂ_ij(Ｘ)を学習したモデルであ
り、日本語の任意の音声の特徴ベクトルの時系列をモデ
ル化できる。Therefore, in the voice spotting device according to the present invention, unknown word phrase section spotting means 16 is newly provided to spot phrase sections containing unknown words, thereby obtaining the information that the input speech may contain an unknown word. The unknown word model 15 is modeled as an HMM. The unknown word model 15 must be able to model the time series of feature vectors of arbitrary Japanese speech. In this embodiment, a garbage model is used as the unknown word model 15, for example. The garbage model here is a model whose parameters, the transition probabilities a_ij and the output probabilities b_ij(X), have been trained on a large amount of speech data that covers all syllables occurring in Japanese without bias, so it can model the time series of feature vectors of arbitrary Japanese speech.

【００６６】但し、ガベージモデルは少ないパラメータ
で日本語の任意の音声の特徴ベクトルの時系列をモデル
化するので、ヒューリスティック文節モデル１２を構成
する文節モデル１３と比較してモデル化の精度は低くな
っており、「東京へ」という入力音声に対する尤度は、
「東京へ」をモデル化した文節モデル１２よりも低くな
る。しかし入力音声中に未知語が含まれる場合には、そ
の未知語を含む文節区間では、ガベージモデルで構成さ
れる未知語モデル１５のほうが、ヒューリスティック文
節モデル１５を構成する文節モデル１６よりも尤度が高
くなることが期待できる。However, because the garbage model models the time series of feature vectors of arbitrary Japanese speech with few parameters, its modeling accuracy is lower than that of the phrase models 13 that make up the heuristic phrase model 12. For the input speech "to Tokyo", for instance, its likelihood is lower than that of the phrase model 13 that models "to Tokyo". When the input speech contains an unknown word, however, the unknown word model 15 formed by the garbage model can be expected to give a higher likelihood in the phrase section containing that unknown word than the phrase models 13 that make up the heuristic phrase model 12.

【００６７】次に動作について説明する。ヒューリステ
ィック言語モデル照合手段６は、ヒューリスティック言
語モデルとしてヒューリスティック文節モデルを用いる
こと以外は、従来で述べたヒューリスティック言語モデ
ル照合手段６と同じ動作をして、ヒューリスティック前
向き尤度７とヒューリスティック後ろ向き尤度８を計算
し出力する。スポッティング手段１０は、特徴ベクトル
の時系列４とヒューリスティック前向き尤度７とヒュー
リスティック後ろ向き尤度８とを入力とし、文節モデル
１３を用いて、従来で述べたスポッティング手段１０と
同じ動作をして、各文節モデル９のスポッティング結果
１４を出力する。Next, the operation will be described. The heuristic language model matching means 6 operates in the same way as the heuristic language model matching means 6 described for the prior art, except that it uses the heuristic phrase model as the heuristic language model, and calculates and outputs the heuristic forward likelihood 7 and the heuristic backward likelihood 8. The spotting means 10 takes the time series 4 of feature vectors, the heuristic forward likelihood 7, and the heuristic backward likelihood 8 as inputs, operates in the same way as the spotting means 10 described for the prior art but with the phrase models 13, and outputs the spotting result 14 for each phrase model 13.

【００６８】未知語文節区間スポッティング手段１６
は、未知語を含む文節区間をスポッティングする。未知
語のスポッティングスコアの計算方法は、スポッティン
グ手段１０におけるスポッティングスコアの計算方法と
同一であるが、スポッティング手段１０では、文節モデ
ル１３を用いて文節のスポッティングを行うのに対し、
未知語文節区間スポッティング手段１６では未知語モデ
ル１５を用いて未知語を含む文節区間のスポッティング
を行う点が異なっている。The unknown word phrase section spotting means 16 spots phrase sections containing unknown words. The spotting score of an unknown word is calculated in the same way as the spotting score in the spotting means 10. The difference is that the spotting means 10 spots phrases using the phrase models 13, whereas the unknown word phrase section spotting means 16 spots phrase sections containing unknown words using the unknown word model 15.

【００６９】すなわち未知語スポッティング手段１６
は、特徴ベクトルの時系列４とヒューリスティック前向
き尤度７とヒューリスティック後ろ向き尤度８を入力と
し、未知語モデル１５を用いて（１９）式により、未知
語のスポッティングスコアであるＦ^(u)(t)（ｔ＝１〜
Ｔ）を計算する。そして前記スポッティングスコアＦ
^(u)(t)が予め定められた閾値以上である場合に、（２
０）式によって求められる開始時刻ｓｔｉｍｅ^(u)(t)、
終了時刻ｔ及びスポッティングスコアＦ^(u)(t)を未知語
区間スポッティング結果１７として出力する。入力音声
中に未知語が含まれない場合は、既に説明したようにガ
ベージモデルにより構成される未知語モデル１５の尤度
は、ヒューリスティック文節モデル１２の尤度よりも低
いので、（１９）式で計算される未知語のスポッティン
グスコアＦ^(u)(t)は特徴ベクトルの時系列４の全区間で
低い値をとる。That is, the unknown word phrase section spotting means 16 takes the time series 4 of feature vectors, the heuristic forward likelihood 7, and the heuristic backward likelihood 8 as inputs and uses the unknown word model 15 to calculate the unknown word spotting score F^(u)(t) (t = 1 to T) by equation (19). When the spotting score F^(u)(t) is equal to or greater than a predetermined threshold, the start time stime^(u)(t) obtained by equation (20), the end time t, and the spotting score F^(u)(t) are output as the unknown word section spotting result 17. When the input speech contains no unknown word, the likelihood of the unknown word model 15 formed by the garbage model is, as already explained, lower than that of the heuristic phrase model 12, so the unknown word spotting score F^(u)(t) calculated by equation (19) takes a low value over the entire time series 4 of feature vectors.
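As a rough illustration of the output step described above, the following Python sketch collects a spotting result for every end time t whose score reaches the threshold. The function name `collect_unknown_spots`, the `UnknownSpot` record, and the list-based inputs are assumptions made for this example. The scores and start times are taken as already computed by equations (19) and (20), which are not reproduced in this text.

```python
from typing import List, NamedTuple, Sequence


class UnknownSpot(NamedTuple):
    """One unknown word section spotting result (hypothetical record type)."""
    start: int    # start time stime^(u)(t)
    end: int      # end time t
    score: float  # spotting score F^(u)(t)


def collect_unknown_spots(scores: Sequence[float],
                          start_times: Sequence[int],
                          threshold: float) -> List[UnknownSpot]:
    """Return one result per end time t = 1..T whose score is >= threshold.

    scores[t - 1] holds F^(u)(t) and start_times[t - 1] holds stime^(u)(t).
    """
    spots = []
    for t, (score, start) in enumerate(zip(scores, start_times), start=1):
        if score >= threshold:
            spots.append(UnknownSpot(start=start, end=t, score=score))
    return spots
```

With no unknown word in the input, every score would stay below the threshold and the returned list would be empty, which matches the behavior described in the paragraph above.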

【００７０】一方、入力音声中に未知語が含まれる場合
には、ヒューリスティック文節モデル１２を構成する文
節モデル１３で、その未知語を含む文節区間を強制的に
モデル化するよりは、未知語モデル１５でその未知語を
含む文節区間をモデル化したほうが尤度が高くなり、結
果として未知語を含む文節区間では、（１９）式で計算
される未知語のスポッティングスコアＦ^(u)(t)は高い値
をとる。すなわち、未知語を含む文節区間で未知語モデ
ル１５がスポッティングされる。未知語モデル１５がス
ポッティングされる区間が、単語ではなく文節区間にす
ることができるのは、ヒューリスティック言語モデルと
してヒューリスティック文節モデル１２を用いているの
で入力音声が文節区間ごとにモデル化されるためであ
る。On the other hand, when the input speech contains an unknown word, modeling the phrase section containing that unknown word with the unknown word model 15 gives a higher likelihood than forcing the phrase models 13 that make up the heuristic phrase model 12 to model it. As a result, the unknown word spotting score F^(u)(t) calculated by equation (19) takes a high value in the phrase section containing the unknown word. That is, the unknown word model 15 is spotted in that section. The section in which the unknown word model 15 is spotted can be a phrase section rather than a word because the heuristic phrase model 12 is used as the heuristic language model, so the input speech is modeled phrase section by phrase section.

【００７１】[0071]

【数１９】 [Equation 19]

【００７２】[0072]

【数２０】 [Equation 20]

【００７３】（１９）式中のα^(u)(Ｓ_J,ｔ）は未知語前
向き尤度である。また（２０）式中のＢＴＫ^(u)(Ｓ_J,
ｔ）は未知語前向きバックポインタである。前記未知語
前向き尤度は、未知語モデル１５を用いて以下の（２
１）式、（２３）式、（２５）式の漸化式を計算するこ
とによって得られる。また前記未知語前向きバックポイ
ンタは、未知語モデル１５を用いて以下の（２２）式、
（２４）式、（２６）式、（２７）式の漸化式を計算す
ることによって得られる。［ｔ＝０のときの初期値設定］In equation (19), α^(u)(S_J, t) is the unknown word forward likelihood, and in equation (20), BTK^(u)(S_J, t) is the unknown word forward back pointer. The unknown word forward likelihood is obtained by calculating the recurrences of equations (21), (23), and (25) below using the unknown word model 15. The unknown word forward back pointer is obtained by calculating the recurrences of equations (22), (24), (26), and (27) below using the unknown word model 15. [Initial value setting at t = 0]

【００７４】[0074]

【数２１】 [Equation 21]

【００７５】[0075]

【数２２】 [Equation 22]

【００７６】［ｔ＝１〜Ｔ、未知語モデルの初期状態Ｓ
₁ についての漸化式計算］[For t = 1 to T: recurrence calculation for the initial state S_1 of the unknown word model]

【００７７】[0077]

【数２３】 [Equation 23]

【００７８】[0078]

【数２４】 [Equation 24]

【００７９】［ｔ＝１〜Ｔ、未知語モデルの初期状態以
外Ｓ_i （ｉ＝２〜Ｊ）についての漸化式計算］[For t = 1 to T: recurrence calculation for the states S_i (i = 2 to J) other than the initial state of the unknown word model]

【００８０】[0080]

【数２５】 [Equation 25]

【００８１】[0081]

【数２６】 [Equation 26]

【００８２】但し、where

【００８３】[0083]

【数２７】 [Equation 27]
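
Equations (21) to (27) are not reproduced in this text, so the following Python sketch is only one plausible reading of the recurrences and not the exact formulas of the patent. It assumes log-domain likelihoods, output probabilities attached to states rather than to transitions, entry into the initial state S_1 from the heuristic forward likelihood of the preceding frame, and a back pointer that carries the entry time forward along the best path. The names `unknown_forward`, `log_a`, `log_b`, and `h_fwd` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = -math.inf


def unknown_forward(log_a: Sequence[Sequence[float]],
                    log_b: Sequence[Sequence[float]],
                    h_fwd: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Viterbi-style forward pass over the unknown word model (a sketch).

    log_a[i][j] : log transition probability from state i to state j
    log_b[t-1][j]: log output probability of frame t in state j (assumed form)
    h_fwd[t]    : heuristic forward log-likelihood of frames 1..t, h_fwd[0] = 0
    Returns, for t = 1..T, the final-state likelihood and its start time.
    """
    num_frames, num_states = len(log_b), len(log_a)
    alpha = [NEG_INF] * num_states  # initial values at t = 0
    start = [0] * num_states        # back pointers (0 means "no path yet")
    final_alpha, final_start = [], []
    for t in range(1, num_frames + 1):
        new_alpha = [NEG_INF] * num_states
        new_start = [0] * num_states
        for j in range(num_states):
            best, best_start = NEG_INF, 0
            if j == 0:
                # Entering the unknown word model at frame t.
                best, best_start = h_fwd[t - 1], t
            for i in range(num_states):
                cand = alpha[i] + log_a[i][j]
                if cand > best:
                    # Staying inside the model: inherit the start time.
                    best, best_start = cand, start[i]
            new_alpha[j] = best + log_b[t - 1][j]
            new_start[j] = best_start
        alpha, start = new_alpha, new_start
        final_alpha.append(alpha[num_states - 1])
        final_start.append(start[num_states - 1])
    return final_alpha, final_start
```

Under these assumptions, a score in the spirit of equation (19) could be formed by combining `final_alpha[t - 1]` with the heuristic backward likelihood at time t, and `final_start[t - 1]` would play the role of the start time in equation (20).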

【００８４】実施の形態２.図１との対応部分に同一符
号を付けて示す図３は、この発明による音声スポッティ
ング装置の実施の形態２の構成である。この実施の形態
２において特徴的な点は、未知語スポッティング手段と
して後ろ向き未知語スポッティング手段１８を用いるこ
とである。Embodiment 2. FIG. 3, in which parts corresponding to those in FIG. 1 are given the same reference numerals, shows the configuration of Embodiment 2 of the voice spotting device according to the present invention. The characteristic feature of Embodiment 2 is that backward unknown word spotting means 18 is used as the unknown word spotting means.

【００８５】動作について説明する。ヒューリスティッ
ク言語モデル照合手段６は、実施の形態１と同一の動作
をして、ヒューリスティック前向き尤度７とヒューリス
ティック後ろ向き尤度８を計算し出力する。スポッティ
ング手段１０は、実施の形態１と同一の動作をして、各
文節モデル９のスポッティング結果１４を出力する。後
ろ向き未知語スポッティング手段１８は、特徴ベクトル
の時系列４とヒューリスティック前向き尤度７とヒュー
リスティック後ろ向き尤度８とを入力とし、未知語モデ
ル１５を用いて（２８）式により、未知語の後ろ向きス
ポッティングスコアであるＦ^(u) _bw(t)（ｔ＝Ｔ〜１）を
計算する。The operation will now be described. The heuristic language model matching means 6 operates as in Embodiment 1 and calculates and outputs the heuristic forward likelihood 7 and the heuristic backward likelihood 8. The spotting means 10 operates as in Embodiment 1 and outputs the spotting result 14 for each phrase model 13. The backward unknown word spotting means 18 takes the time series 4 of feature vectors, the heuristic forward likelihood 7, and the heuristic backward likelihood 8 as inputs and uses the unknown word model 15 to calculate the backward spotting score F^(u)_bw(t) (t = T down to 1) of the unknown word by equation (28).

【００８６】スポッティングスコアＦ^(u) _bw(t)が予め定
められた閾値以上である場合に、開始時刻ｔ、（２９）
式によって求められる終了時刻ｅｔｉｍｅ^(u)(t)及びス
ポッティングスコアＦ^(u) _bw(t)を後ろ向き未知語区間ス
ポッティング結果１９として出力する。実施の形態１で
述べた（１９）式と、この実施の形態２の（２８）式か
ら明らかなように、実施の形態１では、未知語区間のス
ポッティングスコアは、終了時刻に関しては全ての時刻
で求まるが、開始時刻は（２０）式によって決ってしま
うので、任意の始端時刻における未知語区間のスポッテ
ィングスコアは得られなかった。一方、この実施の形態
２では、未知語区間のスポッティングスコアは、始端時
刻に関しては全ての時刻で求められる。終端時刻は（２
９）式によって決まる。When the spotting score F^(u)_bw(t) is equal to or greater than a predetermined threshold, the start time t, the end time etime^(u)(t) obtained by equation (29), and the spotting score F^(u)_bw(t) are output as the backward unknown word section spotting result 19. As is clear from equation (19) of Embodiment 1 and equation (28) of Embodiment 2, in Embodiment 1 the spotting score of an unknown word section is obtained for every end time, but the start time is fixed by equation (20), so the spotting score of an unknown word section for an arbitrary start time cannot be obtained. In Embodiment 2, by contrast, the spotting score of an unknown word section is obtained for every start time, and the end time is determined by equation (29).

【００８７】[0087]

【数２８】 [Equation 28]

【００８８】[0088]

【数２９】 [Equation 29]

【００８９】（２８）式の右辺の積は、未知語モデル１
５の前に、ヒューリスティック文節モデル１２を接続し
て、前記入力音声の特徴ベクトルの時系列Ｘ₁ Ｘ₂ Ｘ
₃ 、……、Ｘ_T 全体のスコアを求めることを意味してい
る。後向き尤度β^(u)(Ｓ₁,ｔ）は、従来の説明で述べた
ヒューリスティック言語モデルの後ろ向き尤度と同様に
して、時間軸上で逆方向に以下の（３０）式、（３２）
式、（３４）式の漸化式を計算することによって得られ
る。但し式から分かるように、ヒューリスティック言語
モデルの後ろ向き尤度計算とは、最終状態における計算
方法が異なっており、これは前記未知語モデルの後方に
ヒューリスティック言語モデルを接続してて、後ろ向き
尤度β^(u)(Ｓ₁,ｔ）を計算することを意味している。ま
た未知語の終了時刻を記憶しているバックポインタであ
るＢＴＫ^(u) _bw (Ｓ₁,ｔ)も、以下の（３１）式、（３
３）式、（３５）式、（３６）式の漸化式計算によって
求めることができる。［ｔ＝Ｔのときの初期値設定］The product on the right side of equation (28) means that the heuristic phrase model 12 is connected in front of the unknown word model 15 and a score is obtained for the entire time series X_1, X_2, X_3, ..., X_T of feature vectors of the input speech. The backward likelihood β^(u)(S_1, t) is obtained, in the same way as the backward likelihood of the heuristic language model described for the prior art, by calculating the recurrences of equations (30), (32), and (34) below in the reverse direction along the time axis. As the equations show, however, the calculation for the final state differs from that in the backward likelihood calculation of the heuristic language model. This corresponds to connecting the heuristic language model behind the unknown word model when calculating the backward likelihood β^(u)(S_1, t). The back pointer BTK^(u)_bw(S_1, t), which stores the end time of the unknown word, can likewise be obtained by the recurrence calculations of equations (31), (33), (35), and (36) below. [Initial value setting at t = T]

【００９０】[0090]

【数３０】 [Equation 30]

【００９１】[0091]

【数３１】 [Equation 31]

【００９２】［ｔ＝Ｔ−１〜１、未知語モデルの最終状
態Ｓ_J についての漸化式計算］[For t = T-1 down to 1: recurrence calculation for the final state S_J of the unknown word model]

【００９３】[0093]

【数３２】 [Equation 32]

【００９４】[0094]

【数３３】 [Equation 33]

【００９５】［ｔ＝Ｔ−１〜１、未知語モデルの最終状
態以外Ｓ_i(ｉ＝１〜Ｊ−１）についての漸化式計算］[For t = T-1 down to 1: recurrence calculation for the states S_i (i = 1 to J-1) other than the final state of the unknown word model]

【００９６】[0096]

【数３４】 [Equation 34]

【００９７】[0097]

【数３５】 [Equation 35]

【００９８】但し、where

【００９９】[0099]

【数３６】 [Equation 36]
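
Equations (28) to (36) are likewise not reproduced here, so the following Python sketch shows only an assumed time-reversed counterpart of the forward recursion, not the exact formulas of the patent. It assumes log-domain likelihoods, state-attached output probabilities, exit from the final state S_J into the heuristic backward likelihood, and a back pointer that records the frame at which the unknown word section ends. The names `unknown_backward`, `log_a`, `log_b`, and `h_bwd` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = -math.inf


def unknown_backward(log_a: Sequence[Sequence[float]],
                     log_b: Sequence[Sequence[float]],
                     h_bwd: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Viterbi-style backward pass over the unknown word model (a sketch).

    log_a[i][j] : log transition probability from state i to state j
    log_b[t-1][i]: log output probability of frame t in state i (assumed form)
    h_bwd[t]    : heuristic backward log-likelihood of frames t+1..T,
                  with h_bwd[T] = 0
    Returns, for each start time t = 1..T, the initial-state likelihood
    and the end time of the best unknown word section starting at t.
    """
    num_frames, num_states = len(log_b), len(log_a)
    beta = [NEG_INF] * num_states
    end = [0] * num_states  # back pointers (0 means "no path yet")
    entry_beta = [NEG_INF] * num_frames
    entry_end = [0] * num_frames
    for t in range(num_frames, 0, -1):
        new_beta = [NEG_INF] * num_states
        new_end = [0] * num_states
        for i in range(num_states):
            best, best_end = NEG_INF, 0
            if i == num_states - 1:
                # Leaving the unknown word model after frame t.
                best, best_end = h_bwd[t], t
            for j in range(num_states):
                cand = log_a[i][j] + beta[j]
                if cand > best:
                    # Continuing inside the model: inherit the end time.
                    best, best_end = cand, end[j]
            new_beta[i] = log_b[t - 1][i] + best
            new_end[i] = best_end
        beta, end = new_beta, new_end
        entry_beta[t - 1] = beta[0]
        entry_end[t - 1] = end[0]
    return entry_beta, entry_end
```

Because the pass runs from t = T down to 1 and reads the value at the initial state, it yields a likelihood for every candidate start time, which is the property that distinguishes Embodiment 2 from Embodiment 1.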

【０１００】実施の形態３.図１との対応部分に同一符
号を付けて示す図４は、この発明による音声スポッティ
ング装置の実施の形態３の構成である。この実施の形態
３において特徴的な点は、未知語スポッティング手段と
して複数未知語スポッティング手段２０を用いることで
ある。Embodiment 3. FIG. 4, in which parts corresponding to those in FIG. 1 are given the same reference numerals, shows the configuration of Embodiment 3 of the voice spotting device according to the present invention. The characteristic feature of Embodiment 3 is that plural unknown word spotting means 20 is used as the unknown word spotting means.

【０１０１】動作について説明する。ヒューリスティッ
ク言語モデル照合手段６は、実施の形態１と同一の動作
をして、ヒューリスティック前向き尤度７とヒューリス
ティック後ろ向き尤度８を計算し、出力する。スポッテ
ィング手段１０は、実施の形態１と同一の動作をして、
各文節モデル９のスポッティング結果１４を出力する。
複数未知語スポッティング手段２０は、特徴ベクトルの
時系列４とヒューリスティック前向き尤度７とヒューリ
スティック後ろ向き尤度８を入力とし、未知語モデル１
５を用いて（３７）式により、未知語のスポッティング
スコアであるＦ^(u)(t)(ｔ＝１〜Ｔ)を計算する。The operation will now be described. The heuristic language model matching means 6 operates as in Embodiment 1 and calculates and outputs the heuristic forward likelihood 7 and the heuristic backward likelihood 8. The spotting means 10 operates as in Embodiment 1 and outputs the spotting result 14 for each phrase model 13. The plural unknown word spotting means 20 takes the time series 4 of feature vectors, the heuristic forward likelihood 7, and the heuristic backward likelihood 8 as inputs and uses the unknown word model 15 to calculate the unknown word spotting score F^(u)(t) (t = 1 to T) by equation (37).

【０１０２】そしてスポッティングスコアＦ^(u)(t)が予
め定められた閾値以上である場合に、（３８）式によっ
て求められる開始時刻ｓｔｉｍｅ^(u) _k(t) 、（ｋ＝１〜
Ｋ）、終了時刻ｔ及びスポッティングスコアＦ_(u) ^k(t)
(ｋ＝１〜Ｋ）を複数未知語区間スポッティング結果２
１として出力する。ここで添字ｋは、同一の終端時刻で
始端時刻の異なるＫ個のスポッティング結果を区別する
ためのものである。When the spotting score F^(u)(t) is equal to or greater than a predetermined threshold, the start times stime^(u)_k(t) (k = 1 to K) obtained by equation (38), the end time t, and the spotting scores F^(u)_k(t) (k = 1 to K) are output as the plural unknown word section spotting result 21. The subscript k distinguishes the K spotting results that share the same end time but have different start times.

【０１０３】[0103]

【数３７】 [Equation 37]

【０１０４】[0104]

【数３８】 [Equation 38]

【０１０５】(３７）式中のα^(u) _k(Ｓ_J,ｔ) は未知語前
向き尤度である。また（３８）式中のＢＴＫ^(u) _k(Ｓ_J,
ｔ)は未知語前向きバックポインタである。未知語前向
き尤度α^(u) _k(Ｓ_J,ｔ)と、未知語前向きバックポインタ
ＢＴＫ^(u) (Ｓ_J,ｔ)_k は、実施の形態１で述べた（２
３）式、（２４）式、（２５）式、（２６）式、（２
７）式における最大値選択の操作を、上位Ｋ個を選択す
る操作に置き換えることによって計算できる。In equation (37), α^(u)_k(S_J, t) is the unknown word forward likelihood, and in equation (38), BTK^(u)_k(S_J, t) is the unknown word forward back pointer. The unknown word forward likelihood α^(u)_k(S_J, t) and the unknown word forward back pointer BTK^(u)_k(S_J, t) can be calculated by replacing the maximum-selection operation in equations (23), (24), (25), (26), and (27) of Embodiment 1 with an operation that selects the top K candidates.
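
The replacement of maximum selection by top-K selection can be sketched as follows. This Python example is an illustration under stated assumptions rather than the patent's exact procedure: it uses log-domain likelihoods, state-attached output probabilities, entry into the initial state from the heuristic forward likelihood, and it keeps only the best hypothesis per start time before choosing the top K so that the K results have different start times. The names `unknown_forward_kbest`, `log_a`, `log_b`, `h_fwd`, and `k` are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

NEG_INF = -math.inf
Hyp = Tuple[float, int]  # (log-likelihood, start time)


def unknown_forward_kbest(log_a: Sequence[Sequence[float]],
                          log_b: Sequence[Sequence[float]],
                          h_fwd: Sequence[float],
                          k: int) -> List[List[Hyp]]:
    """K-best forward pass: up to k (score, start) pairs per end time."""
    num_frames, num_states = len(log_b), len(log_a)
    hyps: List[List[Hyp]] = [[] for _ in range(num_states)]
    results: List[List[Hyp]] = []
    for t in range(1, num_frames + 1):
        new_hyps: List[List[Hyp]] = []
        for j in range(num_states):
            cands: List[Hyp] = []
            if j == 0:
                # Entering the unknown word model at frame t.
                cands.append((h_fwd[t - 1], t))
            for i in range(num_states):
                if log_a[i][j] == NEG_INF:
                    continue  # transition not allowed
                cands.extend((s + log_a[i][j], st) for s, st in hyps[i])
            # Keep the best score for each distinct start time.
            best_by_start: Dict[int, float] = {}
            for score, st in cands:
                if score > best_by_start.get(st, NEG_INF):
                    best_by_start[st] = score
            # Top-K selection in place of the single maximum.
            top = heapq.nlargest(
                k, ((s, st) for st, s in best_by_start.items()))
            new_hyps.append([(s + log_b[t - 1][j], st) for s, st in top])
        hyps = new_hyps
        results.append(list(hyps[num_states - 1]))
    return results
```

With `k = 1` this reduces to ordinary maximum selection, so the sketch behaves like a single-best forward pass in that case.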

【０１０６】[0106]

【発明の効果】以上述べたようにこの発明によれば、入
力音声の特徴ベクトルの時系列を入力として、種々の発
話内容の音声の特徴ベクトルの時系列をモデル化したヒ
ューリスティック言語モデルを用いてヒューリスティッ
ク前向き尤度とヒューリスティック後ろ向き尤度を計算
し、入力音声の特徴ベクトルの時系列とヒューリスティ
ック前向き尤度とヒューリスティック後ろ向き尤度とを
入力として、スポッティング対象とする音声単位のスペ
クトル特徴ベクトルの時系列をモデル化したスポッティ
ング対象音声モデルを用いて、スポッティング対象とす
る音声単位のスポッティングを行い、任意の音声のスペ
クトル特徴ベクトルの時系列をモデル化した未知語モデ
ルと、入力音声の特徴ベクトルの時系列とヒューリステ
ィック前向き尤度と、ヒューリスティック後ろ向き尤度
とを入力として未知語モデルを用いて未知語の音声区間
のスポッティングを行ようにしたことにより、スポッテ
ィング対象とする音声単位のスポッティング結果の他に
未知語区間らしい音声区間の開始時刻と終了時刻及び未
知語区間らしさの程度を数値化した未知語スポッティン
グを出力し得る音声スポッティング装置を実現できる。EFFECTS OF THE INVENTION As described above, according to the present invention, the heuristic forward likelihood and the heuristic backward likelihood are calculated from the time series of feature vectors of the input speech using a heuristic language model that models time series of feature vectors of speech with various utterance contents. The speech units to be spotted are spotted using a spotting target speech model that models the time series of spectral feature vectors of those speech units, with the time series of feature vectors of the input speech, the heuristic forward likelihood, and the heuristic backward likelihood as inputs. Speech sections of unknown words are spotted using an unknown word model that models the time series of spectral feature vectors of arbitrary speech, with the same inputs. This realizes a voice spotting device that can output, in addition to the spotting results for the speech units to be spotted, the start time and end time of a speech section that is likely to be an unknown word section, together with an unknown word spotting score that quantifies how likely the section is to be an unknown word section.

【０１０７】また次の発明によれば、ヒューリスティッ
ク言語モデルは、種々の発話内容の文節音声の特徴ベク
トルの時系列をモデル化したヒューリスティック文節モ
デルとしたので、スポッティング対象音声モデルとして
文節モデルを用いて文節スポッティングを行うと、ヒュ
ーリスティック文節モデルによる言語制約が有効に働
き、湧き出し誤りの少ない文節スポッティングができ
る。また同時にヒューリスティック文節モデルを用いて
いるので入力音声が文節区間毎にモデル化され、未知語
スポッティング手段で文節区間を単位として未知語を含
む区間をスポッティングして結果を出力でき、言語処理
部では未知語は常に単語か音節等ではなく文節であると
いう仮定をおけるので、言語処理の探索区間を格段的に
狭めることができる。According to the next invention, the heuristic language model is a heuristic phrase model that models time series of feature vectors of phrase speech with various utterance contents, so when phrase spotting is performed with phrase models as the spotting target speech models, the language constraint imposed by the heuristic phrase model works effectively and phrase spotting with few false alarms is possible. At the same time, because the heuristic phrase model is used, the input speech is modeled phrase section by phrase section, and the unknown word spotting means can spot sections containing unknown words in units of phrase sections and output the results. The language processing stage can therefore assume that an unknown word always appears as a phrase rather than as a word or a syllable, which greatly narrows the search range of language processing.

【０１０８】また次の発明によれば、未知語スポッティ
ング手段として時間軸に対して逆方向に未知語のスポッ
ティングを行うようにしたことにより、任意の開始時刻
について、未知語を含む音声区間のスポッティング結果
を得ることができ、言語処理の過程で未知語を含む音声
区間の開始時刻仮説を自由に生成し、そのスポッティン
グスコアを参照できるので、柔軟な言語処理が可能にな
る。According to the next invention, the unknown word spotting means spots unknown words in the reverse direction along the time axis, so a spotting result for a speech section containing an unknown word can be obtained for any start time. During language processing, start time hypotheses for speech sections containing unknown words can be generated freely and their spotting scores referenced, which enables flexible language processing.

【０１０９】また次の発明によれば、未知語スポッティ
ング手段として複数未知語スポッティング手段を備えた
ので、同一の終了時刻で開始時刻の異なる複数個の未知
語を含む音声区間をスポッティングすることができ、後
ろ向き未知語スポッティング手段を備えたのと同様の効
果を得ることができる。According to the next invention, plural unknown word spotting means is provided as the unknown word spotting means, so a plurality of speech sections containing unknown words that share the same end time but have different start times can be spotted, and an effect similar to that of providing the backward unknown word spotting means can be obtained.

[Brief description of drawings]

【図１】この発明による音声スポッティング装置の実
施の形態１の構成を示すブロック図である。FIG. 1 is a block diagram showing a configuration of a first embodiment of a voice spotting device according to the present invention.

【図２】ヒューリスティック文節モデルの説明に供す
る略線図である。FIG. 2 is a schematic diagram for explaining a heuristic clause model.

【図３】この発明による音声スポッティング装置の実
施の形態２の構成を示すブロック図である。FIG. 3 is a block diagram showing a configuration of a second embodiment of a voice spotting device according to the present invention.

【図４】この発明による音声スポッティング装置の実
施の形態３の構成を示すブロック図である。FIG. 4 is a block diagram showing a configuration of a third embodiment of a voice spotting device according to the present invention.

【図５】従来の単語スポッティング方式による装置の
構成を示すブロック図である。FIG. 5 is a block diagram showing a configuration of a conventional word spotting device.

【図６】ヒューリスティック言語モデルと単語モデル
の接続方法の説明に供する略線図である。FIG. 6 is a schematic diagram for explaining a method for connecting a heuristic language model and a word model.

【図７】単語モデルの連鎖として構成されるヒューリ
スティック言語モデルの説明に供する略線図である。FIG. 7 is a schematic diagram for explaining a heuristic language model configured as a chain of word models.

[Explanation of symbols]

１ 音声信号の入力端 (1: speech signal input terminal)
２ 音声信号の入力端１から入力された音声信号 (2: speech signal input from speech signal input terminal 1)
３ 分析手段 (3: analysis means)
４ 特徴ベクトルの時系列 (4: time series of feature vectors)
５ ヒューリスティック言語モデル (5: heuristic language model)
６ ヒューリスティック言語モデル照合手段 (6: heuristic language model matching means)
７ ヒューリスティック前向き尤度 (7: heuristic forward likelihood)
８ ヒューリスティック後ろ向き尤度 (8: heuristic backward likelihood)
９ 単語モデル (9: word model)
１０ スポッティング手段 (10: spotting means)
１１ 単語のスポッティング結果 (11: word spotting result)
１２ ヒューリスティック文節モデル (12: heuristic phrase model)
１３ 文節モデル (13: phrase model)
１４ 文節のスポッティング結果 (14: phrase spotting result)
１５ 未知語モデル (15: unknown word model)
１６ 未知語文節区間スポッティッグ手段 (16: unknown word phrase section spotting means)
１７ 未知語文節区間のスポッティッグ結果 (17: unknown word phrase section spotting result)
１８ 後ろ向き未知語スポッティング手段 (18: backward unknown word spotting means)
１９ 後ろ向き未知語区間スポッティング結果 (19: backward unknown word section spotting result)
２０ 複数未知語スポッティング手段 (20: plural unknown word spotting means)
２１ 複数未知語区間スポッティング結果 (21: plural unknown word section spotting result)

Claims

[Claims]

1. A voice spotting device comprising: analysis means for acoustically analyzing input speech and outputting a time series of acoustic feature vectors of the input speech; a heuristic language model that models time series of feature vectors of speech with various utterance contents; heuristic language model matching means that takes the time series of feature vectors of the input speech as input and uses the heuristic language model to calculate a heuristic forward likelihood and a heuristic backward likelihood; a spotting target speech model that models the time series of spectral feature vectors of the speech units to be spotted; spotting means that takes the time series of feature vectors of the input speech, the heuristic forward likelihood, and the heuristic backward likelihood as inputs and uses the spotting target speech model to spot the speech units to be spotted; an unknown word model that models the time series of spectral feature vectors of arbitrary speech; and unknown word spotting means that takes the time series of feature vectors of the input speech, the heuristic forward likelihood, and the heuristic backward likelihood as inputs and uses the unknown word model to spot speech sections of unknown words, wherein the device outputs both the spotting results for the speech units to be spotted and the spotting results for the speech sections of unknown words.

2. The voice spotting device according to claim 1, wherein the heuristic language model is a heuristic phrase model that models time series of feature vectors of phrase speech with various utterance contents; the speech unit to be spotted is the phrase; a phrase model that models the time series of spectral feature vectors of the phrase speech to be spotted is used as the spotting target speech model; and the unknown word spotting means is unknown word phrase section spotting means that spots sections containing unknown words in units of phrase sections.

3. The voice spotting device according to claim 1 or 2, wherein the unknown word spotting means comprises backward unknown word spotting means that spots unknown words in the reverse direction along the time axis, so that a spotting result for a speech section containing an unknown word is obtained for any start time.

4. The voice spotting device according to claim 1, wherein the unknown word spotting means comprises plural unknown word spotting means that spots a plurality of speech sections containing unknown words that share the same end time but have different start times.