JP3403838B2

JP3403838B2 - Phrase boundary probability calculator and phrase boundary probability continuous speech recognizer

Info

Publication number: JP3403838B2
Application number: JP26566894A
Authority: JP
Inventors: 利行花沢
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1994-10-28
Filing date: 1994-10-28
Publication date: 2003-05-06
Anticipated expiration: 2018-05-06
Also published as: JPH08123469A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a phrase boundary probability calculation device that calculates the accent phrase boundary probability at each time of sentence speech, and to a continuous speech recognition device using phrase boundary probabilities that improves speech recognition accuracy by using the accent phrase boundary probability.

[0002]

[Prior Art] Speech recognition is generally realized by taking, as feature parameters, a time series of spectral feature vectors obtained by frequency analysis of the speech signal at fixed intervals, and pattern-matching it against standard patterns of the recognition target words or sentences prepared in advance. Humans, however, are thought to recognize speech using accent and intonation information as well. In recent years, attempts have therefore been made to detect the boundaries of accent phrases, that is, phrases containing one accent nucleus, and to use them for speech recognition.

[0003] For example, FIG. 10 is a block diagram showing one embodiment of the phrase boundary detection method described in Nakai, Shimodaira, and Sagayama, "Phrase boundary detection of speaker-independent continuous speech based on clustering of pitch patterns", IEICE Transactions A, Vol. J77-A, No. 2, pp. 206-214, February 1994.

[0004] In FIG. 10, 1 is a speech signal input terminal; 2 is the speech signal input from the input terminal 1; 3 is pitch analysis means for performing pitch analysis of the speech signal 2; 4 is pause detection means for detecting pause sections of the speech signal 2; 5 is a pitch pattern template representing the pitch pattern of an accent phrase; and 6 is phrase boundary detection means for detecting the phrase boundary times of the speech signal 2 using the pitch pattern templates 5.

[0005] The pitch pattern templates 5 in FIG. 10 are created in advance from a large number of accent phrase pitch patterns by dividing the pitch patterns into several types and averaging each type. That is, each pitch pattern template 5 is the representative pattern of one type of accent phrase pitch pattern.

[0006] The accent phrase pitch patterns are extracted from sentence speech uttered by a large number of unspecified speakers. Because several pitch patterns appear in Japanese accent phrases, such as the flat (heiban), head-high (atamadaka), and mid-high (nakadaka) types, a plurality of pitch pattern templates are created.

[0007] As the pitch pattern described above, for example, a time series of logarithmic pitch frequencies is used.

[0008] The pitch pattern templates 5 are created by the following procedure.

[0009] (Template creation procedure 1) The pitch patterns of a large number of accent phrases are each linearly stretched or compressed to a fixed duration, so that all pitch patterns have the same duration.

[0010] (Template creation procedure 2) Clustering is performed on the pitch patterns normalized to the same duration, and each pitch pattern is classified into one of n clusters. The number of clusters n is set to, for example, 4, in consideration of the accent phrase patterns that appear in Japanese.

[0011] (Template creation procedure 3) For each cluster, the average of the duration-normalized pitch patterns is computed and used as a pitch pattern template 5.

[0012] The above procedure yields n (= 4) pitch pattern templates.
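
The three template creation procedures can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the cited paper's implementation: the text does not name the clustering algorithm, so plain k-means with a naive initialization is assumed, and the names `resample` and `make_templates` and the default length of 20 frames are hypothetical.

```python
def resample(pattern, length):
    """Linearly stretch or compress a log-pitch contour to `length` samples (length >= 2)."""
    if len(pattern) == 1:
        return [pattern[0]] * length
    out = []
    for k in range(length):
        pos = k * (len(pattern) - 1) / (length - 1)  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        frac = pos - lo
        out.append((1 - frac) * pattern[lo] + frac * pattern[hi])
    return out


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_templates(patterns, n_clusters=4, length=20, iterations=10):
    """Procedure 1: normalize length. Procedure 2: cluster. Procedure 3: average per cluster."""
    data = [resample(p, length) for p in patterns]
    centers = [list(v) for v in data[:n_clusters]]  # naive initialization (assumption)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in data:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            groups[nearest].append(v)
        # The cluster average is the template; an empty cluster keeps its old center.
        centers = [mean_vector(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Toy data: two rising and two falling contours of different lengths (log Hz).
patterns = [[4.0, 4.5, 5.0], [5.0, 4.5, 4.0],
            [4.0, 4.2, 4.6, 5.0], [5.0, 4.8, 4.4, 4.0, 4.0]]
templates = make_templates(patterns, n_clusters=2, length=5)
```

With this toy input the first template rises and the second falls, mirroring how each template stands for one accent type.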

[0013] Phrase boundaries are detected as follows.

[0014] The speech signal 2 input from the speech signal input terminal 1 is fed to the pitch analysis means 3 and the pause detection means 4.

[0015] The pitch analysis means 3 performs pitch analysis of the speech signal 2 at fixed intervals and obtains a time series of the logarithmic pitch frequency of the speech signal 2. For the pitch analysis, for example, the lag window method is used.

[0016] The pause detection means 4 obtains a time series of the power of the speech signal 2, extracts the sections at or below a predetermined power threshold, detects as pause sections those extracted sections whose duration is at or above a predetermined threshold, and outputs the start time and end time of each pause section.
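
The two-threshold pause detection described above can be sketched as follows. The function name `detect_pauses`, the frame-based time unit, and the exclusive end index are assumptions made for illustration; the text specifies only a power threshold and a minimum duration.

```python
def detect_pauses(power, power_threshold, min_frames):
    """Return (start, end) frame pairs of low-power runs lasting at least min_frames.

    `end` is exclusive. A frame counts as low power when power <= power_threshold.
    """
    pauses = []
    start = None  # start frame of the current low-power run, if any
    for t, p in enumerate(power):
        if p <= power_threshold:
            if start is None:
                start = t
        else:
            if start is not None and t - start >= min_frames:
                pauses.append((start, t))
            start = None
    # Close a run that extends to the end of the signal.
    if start is not None and len(power) - start >= min_frames:
        pauses.append((start, len(power)))
    return pauses


frame_power = [9, 1, 1, 9, 1, 1, 1, 1, 9]
pauses = detect_pauses(frame_power, power_threshold=2, min_frames=3)
```

In this toy input the two-frame dip is rejected as too short, and only the four-frame run is reported as a pause.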

[0017] The phrase boundary detection means 6 performs pattern matching between the logarithmic pitch frequency time series output by the pitch analysis means 3 and templates formed by concatenating several pitch pattern templates 5, and outputs the concatenation boundary times of the pitch pattern templates 5 as phrase boundary times.

[0018] That is, the phrase boundary detection means 6 takes as input the logarithmic pitch frequency time series output by the pitch analysis means 3 and the start and end times of the pause sections output by the pause detection means 4, and detects phrase boundaries in the sections other than pause sections by the following procedure.

[0019] (Phrase boundary detection procedure 1) One-Stage DP matching is performed using the n (= 4) pitch pattern templates 5, with the logarithmic pitch frequency time series as the input pattern.

[0020] (Phrase boundary detection procedure 2) After DP matching is complete, the DP path is backtraced to find the concatenated sequence of pitch pattern templates with the smallest DP distance to the logarithmic pitch frequency time series, and the concatenation boundary times are output as phrase boundary times.
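
The idea of matching a chain of templates and reading the boundaries off the backtrace can be illustrated with a simplified dynamic program. The sketch below is not One-Stage DP itself: instead of frame-level time warping, each template is linearly stretched to the candidate segment length, an assumption made to keep the example short. The names `stretch` and `segment` and the segment length limits are hypothetical.

```python
def stretch(template, length):
    """Linearly resample a template to `length` frames."""
    if length == 1:
        return [template[0]]
    step = (len(template) - 1) / (length - 1)
    out = []
    for k in range(length):
        pos = k * step
        lo = min(int(pos), len(template) - 1)
        hi = min(lo + 1, len(template) - 1)
        out.append(template[lo] + (pos - lo) * (template[hi] - template[lo]))
    return out


def segment(contour, templates, min_len=2, max_len=6):
    """Return (boundaries, labels) of the lowest-distance chain of templates."""
    T = len(contour)
    INF = float("inf")
    best = [INF] * (T + 1)   # best[t]: minimum distance covering frames [0, t)
    back = [None] * (T + 1)  # back[t]: (segment start, template index)
    best[0] = 0.0
    for end in range(1, T + 1):
        for length in range(min_len, min(max_len, end) + 1):
            start = end - length
            if best[start] == INF:
                continue
            for k, tpl in enumerate(templates):
                ref = stretch(tpl, length)
                d = sum((x - y) ** 2 for x, y in zip(contour[start:end], ref))
                if best[start] + d < best[end]:
                    best[end] = best[start] + d
                    back[end] = (start, k)
    if back[T] is None:
        raise ValueError("contour cannot be covered with the given segment lengths")
    # Backtrace: walk from the end to recover segment starts and template labels.
    boundaries, labels, t = [], [], T
    while t > 0:
        start, k = back[t]
        boundaries.append(start)
        labels.append(k)
        t = start
    return boundaries[::-1], labels[::-1]


rise, fall = [0.0, 1.0, 2.0], [2.0, 1.0, 0.0]
boundaries, labels = segment([0.0, 1.0, 2.0, 2.0, 1.0, 0.0], [rise, fall])
```

Here the contour is covered by a rising template followed by a falling one, and the segment start frames play the role of the phrase boundary times. As in the method described above, the result is a hard decision with no confidence attached.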

[0021]

[Problems to Be Solved by the Invention] The conventional phrase boundary detection method described above determines only whether or not each time of the input speech is a phrase boundary. However, it is difficult to detect 100% of the phrase boundaries without falsely detecting non-boundaries, so the detection results cannot be assumed to be 100% correct. Moreover, the conventional technique cannot quantify the reliability of a detected phrase boundary. It has therefore been difficult to use the detected phrase boundaries for speech recognition.

[0022] The present invention has been made to solve the above problem. Rather than making a 1/0 decision as to whether each time of the input speech is a phrase boundary, it obtains the likelihood of a phrase boundary at each time of the input speech as a probability, with the aim of making phrase boundary information easier to use in speech recognition. A further aim is to provide a method of improving speech recognition accuracy by using this phrase boundary probability for speech recognition.

[0023]

[Means for Solving the Problems] The phrase boundary probability calculation device and the continuous speech recognition device using phrase boundary probabilities according to the present invention comprise: a pitch pattern model memory storing one or more pitch pattern models, each modeling the time series of pitch features of accent phrases; and phrase boundary probability calculation means for calculating, using the pitch pattern models, a pitch forward probability and a pitch backward probability for the time series of pitch features, and for calculating the accent phrase boundary probability at each time of the input speech from the pitch forward probability and the pitch backward probability.

[0024] In addition, accent phrase pitch patterns are clustered by a clustering method in which an HMM of identical structure is trained for each of a plurality of vector time series of differing durations and the mean vectors of each trained HMM are used as the clustering data, and a pitch pattern model is trained for each cluster.

[0025] In addition, phrase boundary probability weighting means is provided that holds a weighting coefficient for the accent phrase boundary probability and weights the accent phrase boundary probability.

[0026] In addition, the device comprises: a background model memory storing one or more background models that model the time series of spectral feature vectors of speech; background model matching means that takes the time series of spectral feature vectors of the input speech as input and uses the background models to calculate a spectral feature forward probability and a spectral feature backward probability for that time series; forward probability integration means for calculating an integrated forward probability, which is the product of the phrase boundary probability and the spectral feature forward probability; backward probability integration means for calculating an integrated backward probability, which is the product of the phrase boundary probability and the spectral feature backward probability; a bunsetsu (phrase unit) model that models the time series of spectral feature vectors of the bunsetsu speech to be spotted; and spotting means that takes as input the time series of spectral feature vectors of the speech, the integrated forward probability, and the integrated backward probability, and spots bunsetsu using the bunsetsu model.

[0027] In addition, a chain of bunsetsu models is used as the background model.

[0028] In addition, the device comprises: a sentence model network memory storing a sentence model network that models the time series of spectral feature vectors of the sentence speech to be recognized; continuous speech recognition means that takes the time series of spectral feature vectors of the input speech as input, recognizes the input speech using the sentence model network, and outputs a plurality of recognition result candidate sentences, the spectral feature recognition score of each candidate sentence, and, for each candidate sentence, the boundary times of the bunsetsu constituting that sentence; and probability integration means that takes as input the phrase boundary probability, the spectral feature recognition score of each candidate sentence, and the bunsetsu boundary times of each candidate sentence, corrects the spectral feature recognition score using the phrase boundary probability at the bunsetsu boundary times of each candidate sentence, and determines the recognition result candidate sentence on the basis of the corrected recognition score.

[0029]

[Operation] The pitch pattern model statistically models the time series of pitch features of accent phrases. The phrase boundary probability calculation means uses the pitch pattern model to calculate the pitch forward probability and the pitch backward probability for the time series of pitch features, and calculates the accent phrase boundary probability at each time of the input speech from these two probabilities.

[0030] By training an HMM of identical structure for each of a plurality of feature vector time series of differing durations, the time series are nonlinearly compressed and brought to the same length, and clustering is then performed.

[0031] The phrase boundary probability weighting means holds a weighting coefficient for the accent phrase boundary probability and, by weighting the accent phrase boundary probability, adjusts its contribution when it is integrated with the speech recognition score calculated from spectral features.

[0032] The forward probability integration means calculates the integrated forward probability as the product of the phrase boundary probability and the spectral feature forward probability; the backward probability integration means calculates the integrated backward probability as the product of the phrase boundary probability and the spectral feature backward probability; and the spotting means performs spotting using the integrated forward probability and the integrated backward probability.

[0033] A background model composed of a chain of bunsetsu models keeps the spectral feature forward probability and the spectral feature backward probability small at times other than bunsetsu boundary times.

[0034] The continuous speech recognition means recognizes the input speech using the sentence model network and calculates a plurality of recognition result candidate sentences, the spectral feature recognition score of each candidate sentence, and, for each candidate sentence, the boundary times of its constituent bunsetsu. The probability integration means corrects the spectral feature recognition score using the phrase boundary probability at the bunsetsu boundary times of each candidate sentence.

[0035]

[Embodiments]

Embodiment 1. FIG. 1 is a block diagram showing an example configuration of the phrase boundary probability calculation device according to the invention of claim 1. In FIG. 1, functional blocks identical to those in FIG. 10, which illustrates the prior art, are given the same reference numerals, and their description is omitted.

[0036] The characteristic features of this embodiment are a pitch pattern model memory 7 that stores a plurality of pitch pattern models representing the pitch patterns of accent phrases, a pitch pattern model network memory 8 that stores a network of these pitch pattern models, and phrase boundary probability calculation means 9 that uses this pitch pattern model network to calculate the phrase boundary probability at each time for the speech signal 2.

[0037] Each pitch pattern model described above represents the average of one type of accent phrase, and thus represents a representative accent phrase pitch pattern.

[0038] Before the phrase boundary probability of a speech signal can be calculated, the pitch pattern models and the pitch pattern model network must be created.

[0039] First, the method of creating the pitch pattern models is described.

[0040] In this embodiment, continuous HMMs (hidden Markov models) are used as the pitch pattern models. As in the prior art, a time series of logarithmic pitch frequencies is used as the parameter representing the pitch pattern.

[0041] An HMM is used as the pitch pattern model so that the likelihood of a phrase boundary at each time of the input speech can be obtained as a probability. When pitch pattern templates are used, as in the prior art, only a one-bit decision of whether or not a time is a phrase boundary is obtained, and the probability of a phrase boundary cannot be obtained.

[0042] As shown in FIG. 2, an HMM consists of several states (S_1, S_2, S_3, S_4, S_5, S_6) and arcs connecting the states. Each arc is given two parameters: a transition probability a_ij of moving between states along that arc, and an output probability b_ij(x) of the speech feature vector x. The subscript ij denotes the transition from state S_i to state S_j, and the transition probability a_ij is the probability that the transition from S_i to S_j occurs. The output probability b_ij(x) is usually expressed as a multidimensional normal distribution whose parameters are the mean and variance of the speech feature vector, and represents the probability density with which the speech feature vector x is output on the transition from state S_i to S_j.

[0043] Some arcs have only a transition probability a_ij as a parameter and contribute only to the transition between states, without outputting a speech feature vector. A state transition along such an arc is called a null transition.

[0044] In this embodiment the logarithmic pitch frequency is used as the speech feature vector, so the feature vector has one dimension.

[0045] An HMM starts its transitions from a state called the initial state (S_1 in FIG. 2) and, through the state transitions made on the way to the final state (S_6 in FIG. 2), can generate various time series of speech feature vectors with probabilities computed from the transition probabilities a_ij and the output probabilities b_ij(x). That is, an accent phrase pitch pattern is generated during the transitions from state S_1 to state S_6.

[0046] For example, when the logarithmic pitch frequency is used as the speech feature vector, the HMM can generate various time series of logarithmic pitch frequencies. In this case, the probability that the HMM generates the pitch pattern of each accent phrase, as a time series of logarithmic pitch frequencies, can be calculated. The HMM can therefore statistically model the pitch patterns of various accent phrases.

[0047] Modeling accent phrase pitch patterns with an HMM is achieved by using a large number of accent phrase pitch patterns to estimate the HMM parameters, namely the transition probabilities a_ij and the output probabilities b_ij(x), so that the probability of these pitch patterns being generated by the HMM is high. This estimation method is called maximum likelihood estimation; the parameter estimation procedure is called HMM training, and the data used for HMM training is called training data.

[0048] In this embodiment, as in the prior art, the accent phrase pitch patterns used for training are extracted from sentence speech uttered by a large number of unspecified speakers.

[0049] The training procedure is divided into two stages: clustering of the pitch patterns and training of the HMMs.

[0050] The clustering and training procedures are shown below.

[0051] First, the clustering procedure is as follows.

[0052] (Clustering procedure 1) For each of a large number of accent phrase pitch patterns, an HMM is trained using that pitch pattern as the training data. That is, one HMM is trained per pitch pattern. Even though the sequence length of each accent phrase, that is, the length of its logarithmic pitch frequency time series, differs, the same number of states and the same transition structure are used for the HMM of every pattern, for example the HMM structure shown in FIG. 2.

[0053] (Clustering procedure 2) From the HMM created for each pitch pattern in clustering procedure 1, the mean vectors of the output probabilities b_ij(x) are extracted. For example, when training uses the HMM structure shown in FIG. 2, there are ten output probabilities in total, b_11(x), b_12(x), b_22(x), b_23(x), b_33(x), b_34(x), b_44(x), b_45(x), b_55(x), and b_56(x), so ten mean vectors are extracted for each pitch pattern.

[0054] (Clustering procedure 3) The ten mean vectors of each pitch pattern obtained in clustering procedure 2 are used as clustering data, and the pitch patterns are clustered using, for example, the LBG algorithm. The number of clusters is set to, for example, 4, as in the prior art. This procedure therefore produces four clusters of pitch patterns.
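
Clustering procedure 3 can be sketched as a small LBG-style codebook splitting routine. This is a minimal illustration, assuming that the ten mean values of each pitch pattern have already been extracted and concatenated into one fixed-length vector; the function name `lbg`, the additive perturbation `eps`, and the fixed iteration count are hypothetical choices, not details given in the text.

```python
def lbg(vectors, n_clusters=4, eps=0.01, iterations=20):
    """Cluster equal-length vectors by LBG splitting; n_clusters should be a power of two."""

    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]

    def nearest(v, codebook):
        return min(range(len(codebook)),
                   key=lambda c: sum((x - y) ** 2 for x, y in zip(v, codebook[c])))

    codebook = [mean(vectors)]  # start from the global centroid
    while len(codebook) < n_clusters:
        # Split every centroid into a slightly perturbed pair.
        codebook = [[x + s * eps for x in c] for c in codebook for s in (1, -1)]
        for _ in range(iterations):  # Lloyd refinement of the enlarged codebook
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(v, codebook)].append(v)
            codebook = [mean(cell) if cell else c for cell, c in zip(cells, codebook)]
    labels = [nearest(v, codebook) for v in vectors]
    return codebook, labels


# Toy stand-ins for per-pattern mean vectors (two dimensions instead of ten).
toy_vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
codebook, labels = lbg(toy_vectors, n_clusters=2)
```

Because every pattern contributes a vector of the same length regardless of its duration, ordinary vector clustering applies directly, which is the point of extracting the HMM means first.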

[0055] Next, the training procedure is shown.

[0056] (Training procedure 1) For each cluster, an HMM is trained using the pitch patterns belonging to that cluster. Training uses the original, uncompressed pitch patterns, not the data created in clustering procedure 2, which is used only for clustering. For example, the forward-backward algorithm is used for training. This procedure yields the HMM of the representative pitch pattern of each cluster, that is, the pitch pattern model. The pitch pattern models are stored in the pitch pattern model memory 7 shown in FIG. 1.

[0057] As described above, the data used for clustering is obtained by nonlinearly compressing pitch patterns of differing durations through HMM training, so that every pitch pattern has the same data length. In addition, the mean vectors of each HMM are obtained by maximum likelihood estimation. Consequently, the pattern distortion caused by stretching or compressing the data is kept smaller than when the data lengths are equalized by linear stretching and compression as in the prior art, and accurate clustering becomes possible.

[0058] The pitch pattern model network is generated by connecting the pitch pattern models of the clusters as shown in FIG. 3.

[0059] First, a network initial state S_1 and a network final state S_26 are newly created. Next, the network initial state S_1 is connected to the initial state of each pitch pattern model by a null transition. That is, states S_1 and S_2, S_1 and S_8, S_1 and S_14, and S_1 and S_20 are each connected by a null transition. In FIG. 3, null transitions are shown as dotted lines.

[0060] Next, the final states of the models, S_7, S_13, S_19, and S_25, are connected to the network final state S_26 by null transitions.

[0061] To allow a loop from the network final state S_26 back to the initial state S_1, a null transition from the final state S_26 to the initial state S_1 is created. This completes the pitch pattern model network, which is stored in the pitch pattern model network memory 8 shown in FIG. 1.

[0062] By constructing the pitch pattern model network with a loop from the network final state S_26 to the initial state S_1 in this way, arbitrary transitions between the pitch pattern models of the clusters become possible, and the logarithmic pitch frequency time series of the input speech can be represented as a chain of pitch pattern models.
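
The network topology described above can be written down as two arc lists. The sketch below is illustrative only: the function name `build_network` is hypothetical, and the arcs inside each model (a self-loop and an advance arc per non-final state) are inferred from the ten output probabilities b_11 through b_56 listed for the FIG. 2 structure, since the figures themselves are not reproduced here.

```python
def build_network(n_models=4, states_per_model=6):
    """Return (null_arcs, emitting_arcs) as lists of (from_state, to_state) pairs.

    States are numbered as in the text: S_1 is the network initial state,
    S_2..S_25 belong to the four models, and S_26 is the network final state.
    """
    net_initial = 1
    net_final = n_models * states_per_model + 2
    null_arcs, emitting_arcs = [], []
    for m in range(n_models):
        first = 2 + m * states_per_model
        last = first + states_per_model - 1
        null_arcs.append((net_initial, first))   # enter the model without output
        for s in range(first, last):
            emitting_arcs.append((s, s))         # self-loop, outputs one frame
            emitting_arcs.append((s, s + 1))     # advance, outputs one frame
        null_arcs.append((last, net_final))      # leave the model without output
    null_arcs.append((net_final, net_initial))   # loop back to chain another model
    return null_arcs, emitting_arcs


null_arcs, emitting_arcs = build_network()
```

With the default arguments this yields null arcs from S_1 to S_2, S_8, S_14, and S_20, from S_7, S_13, S_19, and S_25 to S_26, and the loop arc from S_26 back to S_1.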

[0063] Next, the method of calculating the phrase boundary probability is described.

[0064] The speech signal 2 input from the speech signal input terminal 1 is fed to the pitch analysis means 3 and the pause detection means 4.

[0065] The operation of the pitch analysis means 3 and the pause detection means 4 is the same as in the prior art, so its description is omitted.

[0066] The phrase boundary probability calculation means 9 takes as input the logarithmic pitch frequency time series output by the pitch analysis means 3 and the start and end times of the pause sections output by the pause detection means 4, and, using the pitch pattern model network in the pitch pattern model network memory 8, calculates the phrase boundary probability for the sections other than pause sections as follows.

【００６７】対数ピッチ周波数の時系列をＰ₁Ｐ
₂Ｐ₃，・・・，Ｐ_T（添え字は時刻を表す）としたと
き、まず前向き確率α（Ｓ_i，ｔ）を（ｌ）式のように
定義する。The time series of the logarithmic pitch frequency is P ₁ P
₂ P ₃ , ..., P _T (subscript indicates time), the forward probability α (S _i , t) is first defined as in equation (1).

【００６８】[0068]

【数１】すなわちα（Ｓ_i，ｔ）は、ピッチパタンモデルネット
ワークにおける初期状態Ｓ₁から遷移を開始し、対数ピ
ッチ周波数の時系列Ｐ₁Ｐ₂Ｐ₃，・・・，Ｐ_tまでを
出力して状態Ｓｉに到達する確率である。[Equation 1] That is, α(S_i, t) is the probability of starting from the initial state S₁ of the pitch pattern model network, outputting the logarithmic pitch frequency time series P₁ P₂ P₃, ..., P_t, and reaching state S_i.

【００６９】また後ろ向き確率β（Ｓ_i，ｔ）を（２）
式のように定義する。The backward probability β(S_i, t) is defined as in equation (2).

【００７０】[0070]

【数２】すなわちβ（Ｓ_i，ｔ）は時間軸を逆方向にして、ピッ
チパタンモデルネットワークにおける最終状態Ｓ₂₆から
遷移を開始し、対数ピッチ周波数の後ろ向き時系列であ
るＰ_TＰ_T-1Ｐ_T-2，．．．，Ｐ_t+1までを出力して状
態Ｓ_iに到達する確率である。[Equation 2] That is, β(S_i, t) is the probability, with the time axis reversed, of starting from the final state S₂₆ of the pitch pattern model network, outputting the backward time series of logarithmic pitch frequencies P_T P_{T-1} P_{T-2}, ..., P_{t+1}, and reaching state S_i.
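The images for equations (1) and (2) are not reproduced in this text. Under the standard HMM reading of the two definitions above, and writing λ for the pitch pattern model network and q_t for the state occupied at time t (both symbols are assumptions introduced here, not taken from the source), they would take a form such as:

```latex
\begin{align}
\alpha(S_i, t) &= P\left(P_1 P_2 \cdots P_t,\; q_t = S_i \mid q_0 = S_1,\; \lambda\right) \tag{1} \\
\beta(S_i, t)  &= P\left(P_T P_{T-1} \cdots P_{t+1},\; q_T = S_{26} \mid q_t = S_i,\; \lambda\right) \tag{2}
\end{align}
```

With these definitions the product α(S_i, t) β(S_i, t) is the joint probability of the whole series and of passing through S_i at time t, which is the quantity the later boundary probability relies on.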

【００７１】前向き確率α（Ｓ_i，ｔ）は以下のような
漸化式によって計算することができる。The forward probability α (S _i , t) can be calculated by the following recurrence formula.

【００７２】［初期値設定］[Initial value setting]

【数３】［ｔ＝１〜Ｔ，Ｓ_i（ｉ＝１〜Ｊ）についての漸化式計
算］[Equation 3] [Recursion formula calculation for t = 1 to T, S _i (i = 1 to J)]

【数４】 [Equation 4]

【数５】また後ろ向き確率β（Ｓ_i，ｔ）は以下のような漸化式
によって計算することができる。[Equation 5] The backward probability β (S _i , t) can be calculated by the following recurrence formula.

【００７３】［初期値設定］[Initial value setting]

【数６】［ｔ＝Ｔ−１〜１，Ｓ_i（ｉ＝Ｊ〜１）についての漸化
式計算］[Equation 6] [Recursion Formula Calculation for t = T-1 to 1, S _i (i = J to 1)]

【数７】 [Equation 7]

【数８】上記の前向き確率α（Ｓ_i，ｔ）と後ろ向き確率β（Ｓ
_i，ｔ）を用いて、時刻ｔにおける句境界確率Ｓ
_p（ｔ）（ｔ＝ｌ〜Ｔ）を（９）式によって計算する。[Equation 8] Using the forward probability α(S_i, t) and the backward probability β(S_i, t) above, the phrase boundary probability S_p(t) (t = 1 to T) at time t is calculated by equation (9).

【００７４】[0074]

【数９】前向き確率α（Ｓ_i，ｔ）と後ろ向き確率β（Ｓ_i，
ｔ）の定義より、（９）式において、分母は全ての状態
遷移を考慮した場合の前記ピッチパタンの時系列Ｐ₁Ｐ
₂Ｐ₃，．．．，Ｐ_Tが生成される確率である。すなわ
ち、ピッチパタンモデルネットワークにより前記時系列
が生成される全確率である。また、分子は時刻ｔにおい
て各クラスタＨＭＭの初期状態を通過した遷移により前
記時系列Ｐ₁Ｐ₂Ｐ₃，．．．，Ｐ_Tが生成される確率
の和である。ゆえに、両者の比をとることにより時刻ｔ
において各クラスタＨＭＭの初期状態を遷移した確率、
すなわち句の境界である確率を求めることができる。こ
れはＨＭＭの学習におけるフォワード・バックワードア
ルゴリズムを句境界確率計算に用いたものと考えること
ができる。[Equation 9] From the definitions of the forward probability α(S_i, t) and the backward probability β(S_i, t), the denominator of equation (9) is the probability that the pitch time series P₁ P₂ P₃, ..., P_T is generated when all state transitions are taken into account, that is, the total probability that the pitch pattern model network generates the series. The numerator is the sum of the probabilities that the series P₁ P₂ P₃, ..., P_T is generated by transitions that pass through the initial state of one of the cluster HMMs at time t. The ratio of the two therefore gives the probability that the initial state of a cluster HMM is passed at time t, that is, the probability that time t is a phrase boundary. This can be regarded as applying the forward-backward algorithm used in HMM training to the calculation of phrase boundary probabilities.
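The ratio described in this paragraph can be sketched with a plain forward-backward pass. The code below is a minimal illustration under stated assumptions, not the patent's equations (3) to (9), whose images are not reproduced here: emission likelihoods are passed in as a precomputed matrix `B`, null transitions are folded into the transition matrix `A`, a path is allowed to end in any state, no scaling is applied (so it suits short series only), and the names `entry` and `exit_` (first and last emitting state of each cluster model) are hypothetical.

```python
import numpy as np

def forward_backward(A, B, pi):
    """A: (J, J) transition matrix, B: (T, J) emission likelihoods, pi: (J,) start."""
    T, J = B.shape
    alpha = np.zeros((T, J))
    beta = np.ones((T, J))            # assumption: a path may end in any state
    alpha[0] = pi * B[0]
    for t in range(1, T):             # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    for t in range(T - 2, -1, -1):    # backward recursion
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return alpha, beta

def boundary_probability(A, B, pi, entry, exit_):
    """Posterior probability that a cluster model is entered at each time step."""
    alpha, beta = forward_backward(A, B, pi)
    total = alpha[-1].sum()           # denominator: probability of the whole series
    T = B.shape[0]
    p = np.zeros(T)
    p[0] = (alpha[0, entry] * beta[0, entry]).sum() / total
    for t in range(1, T):
        # numerator: paths that leave some model at t-1 and enter a model at t
        inflow = alpha[t - 1, exit_] @ A[np.ix_(exit_, entry)]
        p[t] = (inflow * B[t, entry] * beta[t, entry]).sum() / total
    return p

# Toy network: two 2-state models; states 0 and 2 are entries, 1 and 3 are exits.
A = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.00, 0.50, 0.50],
              [0.25, 0.00, 0.25, 0.50]])
pi = np.array([0.5, 0.0, 0.5, 0.0])
B = np.random.default_rng(0).uniform(0.1, 1.0, size=(6, 4))
p = boundary_probability(A, B, pi, [0, 2], [1, 3])
```

Each value of `p` is a soft score between 0 and 1 rather than a hard boundary decision, which is what allows it to be combined with spectral scores later in the description.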

【００７５】また（９）式において、分子と分母の確率
和を最大値選択に置き換えることも可能である。すなわ
ち以下の（ｌ０）式を用いても句境界確率を計算するこ
とができる。Further, in the equation (9), it is possible to replace the probability sum of the numerator and the denominator with the maximum value selection. That is, the phrase boundary probability can also be calculated using the following formula (10).

【００７６】[0076]

【数１０】（ｌ０）式では、式の値がｌとなる時刻が、ピッチパタ
ンモデルの最適系列を求めたときのモデルの境界時刻と
なっており、前記従来技術と等価な句境界検出も可能で
ある。[Equation 10] In the equation (10), the time when the value of the equation becomes 1 is the boundary time of the model when the optimum series of the pitch pattern model is obtained, and phrase boundary detection equivalent to the above-mentioned conventional technique is also possible.
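The maximum-selection variant can be sketched in the same way by replacing every sum with a maximum. This is an illustration only, since the image of equation (10) is not reproduced here; the matrix inputs and the names `entry` and `exit_` (first and last emitting state of each model) are assumptions made for the sketch.

```python
import numpy as np

def max_boundary_score(A, B, pi, entry, exit_):
    """Best path through a model entry at time t, divided by the best path overall."""
    T, J = B.shape
    d_fw = np.zeros((T, J))           # best-path analogue of the forward probability
    d_bw = np.ones((T, J))            # best-path analogue of the backward probability
    d_fw[0] = pi * B[0]
    for t in range(1, T):
        d_fw[t] = (d_fw[t - 1][:, None] * A).max(axis=0) * B[t]
    for t in range(T - 2, -1, -1):
        d_bw[t] = (A * (B[t + 1] * d_bw[t + 1])[None, :]).max(axis=1)
    best = d_fw[-1].max()             # likelihood of the single best path
    score = np.zeros(T)
    score[0] = (d_fw[0, entry] * d_bw[0, entry]).max() / best
    for t in range(1, T):
        inflow = (d_fw[t - 1, exit_][:, None] * A[np.ix_(exit_, entry)]).max(axis=0)
        score[t] = (inflow * B[t, entry] * d_bw[t, entry]).max() / best
    return score

# Toy network: two 2-state models; states 0 and 2 are entries, 1 and 3 are exits.
A = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.00, 0.50, 0.50],
              [0.25, 0.00, 0.25, 0.50]])
pi = np.array([0.5, 0.0, 0.5, 0.0])
B = np.random.default_rng(0).uniform(0.1, 1.0, size=(6, 4))
score = max_boundary_score(A, B, pi, [0, 2], [1, 3])
```

The score reaches 1 (up to floating-point rounding) exactly at the times where the single best path enters a model, which matches the statement that the times with value 1 are the model boundaries of the optimal sequence.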

【００７７】実施例２．図４は本発明の実施例２に係わ
る句境界確率計算装置の一構成例を示すブロック構成図
である。図４において実施例ｌの説明図である図ｌと同
一機能ブロックには同一の番号を付し、説明は省略す
る。Example 2. FIG. 4 is a block diagram showing a configuration example of the phrase boundary probability calculation device according to Example 2 of the present invention. In FIG. 4, functional blocks identical to those in FIG. 1, which illustrates Example 1, are given the same reference numerals, and their description is omitted.

【００７８】本実施例において特徴的な点は、句境界確
率重み付け手段ｌ０を新たに付加したことである。ピッ
チパタンモデルおよびピッチパタンモデルネットワーク
は、実施例１と同様にして作成しておく。A characteristic point of this embodiment is that a phrase boundary probability weighting means 10 is newly added. The pitch pattern model and the pitch pattern model network are created in the same manner as in the first embodiment.

【００７９】次に動作について説明する。Next, the operation will be described.

【００８０】音声信号の入力端ｌから入力された音声信
号２はピッチ分折手段３およびポーズ検出手段４に入力
される。そして実施例ｌと同様の動作によって句境界確
率計算手段９は、時刻ｔにおける句境界確率Ｓ_p（ｔ）
（ｔ＝ｌ〜Ｔ）を出力する。The speech signal 2 supplied from the speech signal input terminal 1 is fed to the pitch analysis means 3 and the pause detection means 4. Operating as in Example 1, the phrase boundary probability calculation means 9 then outputs the phrase boundary probability S_p(t) (t = 1 to T) at each time t.

【００８１】句境界確率重み付け手段ｌ０は、句境界確
率Ｓ_p（ｔ）（ｔ＝ｌ〜Ｔ）を入力として、（ｌｌ）式
により重み付き句境界確率Ｓ’_p（ｔ）（ｔ＝ｌ〜Ｔ）
を計算して出力する。（ｌｌ）式においてｗは重み付け
の程度を決める定数であり、後の実施例で述べるよう
に、アクセント句境界確率とスペクトル特徴量から計算
される音声認識スコアとを統合する場合に、アクセント
句境界確率の寄与率を調整するためのものである。この
寄与率を調整することにより、音声認識をより高精度で
実施することができる。The phrase boundary probability weighting means 10 receives the phrase boundary probability S_p(t) (t = 1 to T) and calculates and outputs the weighted phrase boundary probability S'_p(t) (t = 1 to T) by equation (11). In equation (11), w is a constant that determines the degree of weighting. As described in the later examples, when the accent phrase boundary probability is integrated with a speech recognition score calculated from spectral features, w adjusts the contribution of the accent phrase boundary probability. Adjusting this contribution allows speech recognition to be performed with higher accuracy.
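The image of equation (11) is not reproduced in this text, so the exact weighting function is unknown. Purely as an assumed example, the sketch below uses exponent weighting, S'_p(t) = S_p(t)^w, which is a common way to control how strongly one probability contributes to a product of probabilities; the function name is hypothetical.

```python
import numpy as np

def weight_boundary_probability(s_p, w):
    """Return a weighted boundary probability for each time step.

    ASSUMED form (not confirmed by the source): S'_p(t) = S_p(t) ** w.
    With w = 0 every value becomes 1, so the boundary probability has no
    effect on a later product; larger w strengthens its influence.
    """
    return np.power(np.asarray(s_p, dtype=float), w)

weighted = weight_boundary_probability([0.25, 1.0, 0.04], 0.5)
```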

【００８２】[0082]

【数１１】実施例３．図５は本発明の実施例３に係わる連続音声認
識装置の一構成例を示す構成ブロック図である。本発明
に係わる連続音声認識装置は、句境界確率計算部ｌｌと
スポッティング部ｌ９から構成される。[Equation 11] Example 3. FIG. 5 is a configuration block diagram showing a configuration example of a continuous speech recognition apparatus according to the third embodiment of the present invention. The continuous speech recognition apparatus according to the present invention comprises a phrase boundary probability calculation unit 11 and a spotting unit 19.

【００８３】本実施例における句境界確率計算部１ｌ
は、実施例２で述べた句境界計算装置を用いるので、同
一の符号を付し、説明は省路する。尚、句境界確率計算
部１ｌとして、実施例１で述べた句境界計算装置を用い
ることもできる。The phrase boundary probability calculation unit 11 in this example uses the phrase boundary probability calculation device described in Example 2, so the same reference numerals are used and the description is omitted. The device described in Example 1 can also be used as the phrase boundary probability calculation unit 11.

【００８４】スポッティング部ｌ９は、入力音声のスペ
クトル特徴ベクトルの時系列を算出するスペクトル分折
手段ｌ２と、音声のスペクトル特徴ベクトルの時系列を
モデル化したバックグランドモデルを記憶するバックグ
ランドモデルメモリｌ３と、前記入力音声のスペクトル
特徴ベクトルの時系列を入力とし、バックグランドモデ
ルｌ３を用いて前記スペクトル特徴ベクトル時系列に対
するスペクトル特徴前向き確率とスペクトル特徴後ろ向
き確率を計算するバックグランドモデル照合手段ｌ４
と、句境界確率計算部ｌｌの出力である重み付き句境界
確率と前記スペクトル特徴前向き確率との積を求め、統
合化前同き確率を算出する前向き確率統合手段ｌ５と、
句境界確率計算部ｌｌの出力である重み付き句境界確率
と前記スペクトル特徴後ろ向き確率との積を求め、統合
化後ろ向き確率を算出する後ろ向き確率統合手段ｌ６
と、スポッティング対象とする文節音声のスペクトル特
徴ベクトルの時系列をモデル化した文節モデルを記憶す
る文節モデルメモリｌ７と、前記音声のスペクトル特徴
ベクトルの時系列と前記統合化前向き確率と統合化後ろ
向き確率とを入力とし、文節モデルを用いて文節のスポ
ッティングを行うスポッティング手段ｌ８から構成され
る。The spotting unit 19 comprises: spectrum analysis means 12, which calculates the time series of spectral feature vectors of the input speech; a background model memory 13, which stores a background model that models time series of spectral feature vectors of speech; background model matching means 14, which takes the time series of spectral feature vectors of the input speech and uses the background model from memory 13 to calculate a spectral feature forward probability and a spectral feature backward probability for that time series; forward probability integrating means 15, which multiplies the weighted phrase boundary probability output by the phrase boundary probability calculation unit 11 by the spectral feature forward probability to obtain an integrated forward probability; backward probability integrating means 16, which multiplies the weighted phrase boundary probability output by the phrase boundary probability calculation unit 11 by the spectral feature backward probability to obtain an integrated backward probability; a bunsetsu model memory 17, which stores bunsetsu models that model the time series of spectral feature vectors of the bunsetsu speech to be spotted; and spotting means 18, which takes the time series of spectral feature vectors of the input speech, the integrated forward probability, and the integrated backward probability, and spots bunsetsu using the bunsetsu models.

【００８５】スポッティングとは、入力音声中から所定
の単語や文節を抽出する技術である。例えば、「明日、
東京へ行きます」と発声された音声中から「東京へ」と
いう文節をスポッティングするということは、「東京
ヘ」いう文節の発声開始時刻、発声終了時刻や、後述す
るスポッティングスコア等を求めることである。Spotting is a technique for extracting predetermined words or bunsetsu from input speech. For example, spotting the bunsetsu 「東京へ」 ("to Tokyo") in the utterance 「明日、東京へ行きます」 ("Tomorrow I will go to Tokyo") means finding the utterance start time and end time of that bunsetsu, together with quantities such as the spotting score described later.

【００８６】スポッティングの方法は幾つかあるが、本
実施例では、後述するように、スポッティング対象とな
る文節モデルの前後に、任意の音声の特徴ベクトル時系
列を表現できるモデルを接続して、入力音声の全区間の
特徴ベクトル時系列とのパタンマッチングを行う過程で
前記文節モデルと入力音声とのマッチング区間を求める
方法を用いる。ここで前記文節モデルの前後に接続する
モデルのことをバックグランドモデルという。Although there are several spotting methods, in the present embodiment, as will be described later, a model capable of expressing a feature vector time series of an arbitrary voice is connected before and after a phrase model to be spotted and input. A method of obtaining a matching section between the phrase model and the input speech is used in the process of performing pattern matching with the feature vector time series of all sections of the speech. Here, a model connected before and after the phrase model is called a background model.

【００８７】本実施例では、上記のバックグランドモデ
ルと文節モデルとして、ともにＨＭＭを用いる。ピッチ
パタンモデルとの違いは、特徴ベクトル時系列がピッチ
パタンではなく、スペクトル特徴ベクトルの時系列であ
る点である。ここで、スペクトル特徴ベクトルの時系列
を用いるのは、スペクトル特徴ベクトルでないと文節の
認識ができないからである。これに対して、ピッチパタ
ンモデルは句境界検出に大変有用であり、本実施例にお
いても、句境界確率計算部ｌｌにはピッチパタンモデル
が使用されている。In this example, HMMs are used for both the background model and the bunsetsu models. They differ from the pitch pattern models in that the feature vector time series is a time series of spectral feature vectors rather than a pitch pattern. Spectral feature vectors are used because bunsetsu cannot be recognized without them. The pitch pattern models, in contrast, are very useful for phrase boundary detection, and in this example too they are used in the phrase boundary probability calculation unit 11.

【００８８】バックグランドモデルとしては、例えば図
６に示されるような音節モデルネットワークを用いる。
これは実施例ｌで述べたピッチパタンモデルネットワー
クの構成法と全く同一である。日本語に現れる全ての音
節に対して、音節モデルを用意しておくことにより、日
本語の任意の発声のスペクトル特徴ベクトルの時系列を
モデル化することができる。As the background model, for example, a syllable model network as shown in FIG. 6 is used.
This is exactly the same as the method of constructing the pitch pattern model network described in the first embodiment. By preparing a syllable model for all syllables appearing in Japanese, it is possible to model the time series of spectral feature vectors of arbitrary utterances in Japanese.

【００８９】また、文節モデルは文節を構成する自立
語、例えば「東京」のモデルに、付属語、例えば「へ」
のモデルを幾つか接続したモデルを用いるものとし、ス
ポッティング対象とする全ての文節に対して文節モデル
を用意しておく。Each bunsetsu model is formed by connecting the model of an independent word that constitutes the bunsetsu, for example 「東京」 (Tokyo), with several models of attached words, for example 「へ」 ("to"). A bunsetsu model is prepared for every bunsetsu to be spotted.

【００９０】次に動作について説明する。Next, the operation will be described.

【００９１】音声信号の入力端１から入力された音声信
号２は、句境界確率計算部１１とスポッティング部１９
に入力される。The speech signal 2 supplied from the speech signal input terminal 1 is fed to the phrase boundary probability calculation unit 11 and the spotting unit 19.

【００９２】句境界確率計算部ｌｌは実施例２と全く同
じ動作をし、時刻ｔにおける重み付き句境界確率Ｓ’
（ｔ）（ｔ＝ｌ〜Ｔ）を出力する。The phrase boundary probability calculation unit 11 operates exactly as in Example 2 and outputs the weighted phrase boundary probability S'_p(t) (t = 1 to T) at each time t.

【００９３】スポッティング部ｌ９に入力された音声信
号２は、スペクトル分折手段ｌ２によってスペクトル分
折され、スペクトル特徴ベクトルの時系列Ｘ₁Ｘ
₂Ｘ₃，・・・，Ｘ_Tに変換される。スペクトル特徴ベ
クトルＸは例えばＬＰＣケプストラムである。The speech signal 2 input to the spotting unit 19 is spectrally analyzed by the spectrum analysis means 12 and converted into a time series of spectral feature vectors X₁ X₂ X₃, ..., X_T. The spectral feature vector X is, for example, an LPC cepstrum.

【００９４】バックグランドモデル照合手段ｌ４は、ス
ペクトル分析手段ｌ２の出力であるスペクトル特徴ベク
トルの時系列を入力として、バックグランドモデルメモ
リｌ３からのバックグランドモデルを用いて、以下のよ
うにスペクトル特徴前向き確率であるＳ_fw（ｔ）（ｔ＝
ｌ〜Ｔ）と、スペクトル特徴後ろ向き確率であるＳ
_bw（ｔ）（ｔ＝ｌ〜Ｔ）を算出する。The background model matching means 14 receives the time series of spectral feature vectors output by the spectrum analysis means 12 and, using the background model from the background model memory 13, calculates the spectral feature forward probability S_fw(t) (t = 1 to T) and the spectral feature backward probability S_bw(t) (t = 1 to T) as follows.

【００９５】スペクトル特徴前向き確率Ｓ_fw（ｔ）（ｔ
＝ｌ〜Ｔ）は（ｌ２）式により計算する。The spectral feature forward probability S_fw(t) (t = 1 to T) is calculated by equation (12).

【００９６】[0096]

【数１２】すなわち、Ｓ_fw（ｔ）はバックグランドモデルとして図
６に示される音節モデルネットワークにおける初期状態
（Ｓ₁）から遷移を開始し、スペクトル特徴ベクトルの
時系列Ｘ₁Ｘ₂Ｘ₃，・・・，Ｘ_tまでを出力して最終
状態（Ｓ_J）に到達する確率である。[Equation 12] That is, S_fw(t) is the probability of starting from the initial state (S₁) of the syllable model network shown in FIG. 6, which serves as the background model, outputting the time series of spectral feature vectors X₁ X₂ X₃, ..., X_t, and reaching the final state (S_J).

【００９７】但し、［初期値設定］However, [initial value setting]

【数１３】［ｔ＝ｌ〜Ｔ，Ｓ_i（ｉ＝ｌ〜Ｊ）についての漸化式計
算］[Equation 13] [Recursion formula calculation for t = 1 to T, S _i (i = 1 to J)]

【数１４】 [Equation 14]

【数１５】また、スペクトル特徴後ろ向き確率Ｓ_bw（ｔ）は（ｌ
６）式により計算する。[Equation 15] The spectral feature backward probability S_bw(t) is calculated by equation (16).

【００９８】[0098]

【数１６】すなわち、Ｓ_bw（ｔ）は時間軸を逆方向にして、図６に
示される音節モデルネットワークにおける最終状態（Ｓ
_J）から遷移を開始し、スペクトル特徴ベクトルの後ろ
向き時系列であるＸ_TＸ_T-1Ｘ_T-2，・・・，Ｘ_t+1ま
でを出力して初期状態（Ｓ₁）に到達する確率である。[Equation 16] That is, S_bw(t) is the probability, with the time axis reversed, of starting from the final state (S_J) of the syllable model network shown in FIG. 6, outputting the backward time series of spectral feature vectors X_T X_{T-1} X_{T-2}, ..., X_{t+1}, and reaching the initial state (S₁).
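The images for equations (12) and (16) are not reproduced in this text. Read against the two descriptions above, they correspond to evaluating forward and backward probabilities of the background model at its final and initial states. Writing α_bg and β_bg for the forward and backward probabilities computed over the background model on the spectral series X₁, ..., X_T (the subscript bg is an assumption introduced here), one consistent reading is:

```latex
\begin{align}
S_{fw}(t) &= \alpha_{bg}(S_J, t) \tag{12} \\
S_{bw}(t) &= \beta_{bg}(S_1, t) \tag{16}
\end{align}
```

Under this reading S_fw(t) is large when some complete sequence of background units can end at time t, and S_bw(t) is large when one can start immediately after time t.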

【００９９】但し、［初期値設定］However, [initial value setting]

【数１７】［ｔ＝Ｔ−ｌ〜ｌ，Ｓ_i（ｉ＝Ｊ〜ｌ）についての漸化
式計算］[Equation 17] [Recursion Formula Calculation for t = T-1 to S _i (i = J to 1)]

【数１８】 [Equation 18]

【数１９】前向き確率統合手段１５は、句境界確率計算部１１の出
力である、重み付き句境界確率Ｓ’_p（ｔ）と、バック
グランドモデル照合手段１４の出力である前記スペクト
ル特徴前向き確率Ｓ_fw（ｔ）を入力として、（２０）式
にしたがって統合化前向き確率であるＳ’_fw（ｔ）（ｔ
＝１〜Ｔ）を算出する。[Equation 19] The forward probability integrating means 15 receives the weighted phrase boundary probability S'_p(t) output by the phrase boundary probability calculation unit 11 and the spectral feature forward probability S_fw(t) output by the background model matching means 14, and calculates the integrated forward probability S'_fw(t) (t = 1 to T) according to equation (20).

【０１００】[0100]

【数２０】後ろ向き確率統合手段１６は、句境界確率計算部１１の
出力である重み付き句境界確率Ｓ’_p（ｔ）と、バック
グランドモデル照合手段１４の出力である前記スペクト
ル特徴後ろ向き確率Ｓ_bw（ｔ）を入力として、（２１）
式にしたがって統合化後ろ向き確率であるＳ’_bw（ｔ）
（ｔ＝１〜Ｔ）を算出する。[Equation 20] The backward probability integrating means 16 receives the weighted phrase boundary probability S'_p(t) output by the phrase boundary probability calculation unit 11 and the spectral feature backward probability S_bw(t) output by the background model matching means 14, and calculates the integrated backward probability S'_bw(t) (t = 1 to T) according to equation (21).
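The description of the integrating means states that each integrated probability is the product of the weighted phrase boundary probability and the corresponding spectral feature probability. A minimal sketch of that product follows; it assumes that both factors are taken at the same time index t, which the missing equation images (20) and (21) would need to confirm, and the function name is hypothetical.

```python
import numpy as np

def integrate_probabilities(s_p_weighted, s_fw, s_bw):
    """Return (S'_fw, S'_bw) as element-wise products with S'_p.

    Assumption: all three sequences are aligned on the same time index t.
    """
    s_p_weighted = np.asarray(s_p_weighted, dtype=float)
    integrated_fw = s_p_weighted * np.asarray(s_fw, dtype=float)
    integrated_bw = s_p_weighted * np.asarray(s_bw, dtype=float)
    return integrated_fw, integrated_bw

fw, bw = integrate_probabilities([1.0, 0.1, 0.5], [0.2, 0.4, 0.6], [0.8, 0.5, 0.2])
```

Where the boundary probability is small, both integrated values shrink regardless of how well the spectral evidence fits.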

【０１０１】[0101]

【数２１】スポッテイング手段１８は、前記音声のスペクトル特徴
ベクトルの時系列と前期統合化前向き確率と統合化後ろ
向き確率とを入力とし、文節モデルを用いて（２２）式
により、各スポッティング対象文節毎に、各文節のスポ
ッティングスコアであるＦ⁽ⁿ⁾（ｔ）（ｔ＝１〜Ｔ）を
計算する。ここで肩の添字（ｎ）は文節モデルの番号で
あり、ｎ＝１，２，３，．．．Ｎ（Ｎ：スポッティング
対象文節総数）である。[Equation 21] The spotting means 18 receives the time series of spectral feature vectors of the input speech, the integrated forward probability, and the integrated backward probability and, using the bunsetsu models, calculates by equation (22) the spotting score F⁽ⁿ⁾(t) (t = 1 to T) of each bunsetsu to be spotted. The superscript (n) is the bunsetsu model number, n = 1, 2, 3, ..., N, where N is the total number of bunsetsu to be spotted.

【０１０２】そして、前記スポティングスコアＦ
⁽ⁿ⁾（ｔ）が予め定められた閾値以上である場合に、そ
の時刻ｔと、文節モデル番号ｎと、スポッティングスコ
アＦ⁽ⁿ⁾（ｔ）をスポッティング結果として出力する。
スポティングスコアＦ⁽ⁿ⁾（ｔ）が予め定められた閾値
を越えた時刻ｔが文節の境界である確率が高い時刻であ
る。When the spotting score F⁽ⁿ⁾(t) is equal to or greater than a predetermined threshold, the time t, the bunsetsu model number n, and the spotting score F⁽ⁿ⁾(t) are output as the spotting result. A time t at which the spotting score F⁽ⁿ⁾(t) exceeds the predetermined threshold is a time that is highly likely to be a bunsetsu boundary.
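The thresholding step can be sketched as below. The container layout (a dictionary from bunsetsu model number n to the score sequence F⁽ⁿ⁾(1), ..., F⁽ⁿ⁾(T)) and the function name are assumptions made for illustration; the scores themselves would come from equation (22).

```python
def spotting_results(scores, threshold):
    """Collect (t, n, F) triples whose spotting score reaches the threshold.

    scores: dict mapping bunsetsu model number n to a sequence of scores,
            where position 0 holds the score for time t = 1.
    """
    results = []
    for n, sequence in scores.items():
        for t, value in enumerate(sequence, start=1):
            if value >= threshold:
                results.append((t, n, value))
    return results

hits = spotting_results({1: [0.1, 0.7], 2: [0.9, 0.2]}, threshold=0.5)
```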

【０１０３】[0103]

【数２２】但し、［初期値設定］[Equation 22] However, [Initial value setting]

【数２３】［ｔ＝１〜Ｔ，Ｓ_i（ｉ＝１）についての漸化式計算］[Equation 23] [Recursion formula calculation for t = 1 to T, S _i (i = 1)]

【数２４】［ｔ＝１〜Ｔ，Ｓ_i（ｉ＝２〜Ｊ，Ｓ_J：最終状態）に
ついての漸化式計算］[Equation 24] [Recursion formula calculation for t = 1 to T, S_i (i = 2 to J, S_J: final state)]

【数２５】 [Equation 25]

【数２６】（２２）式および（２４）式から明かなように、前記統
合化前向き確率と前記統合化後ろ向き確率が、それぞれ
大きい区間で、スポッティングスコアが大きくなる。ゆ
えにバックグランドモデルを用いて計算されたスペクト
ル特徴前向き確率と、スペクトル特徴後ろ向き確率に、
重み付き句境界確率を乗じることにより、句境界以外の
時刻での、スペクトル特徴前向き確率と、スペクトル特
徴後ろ向き確率を小さく抑えることが可能となり、句境
界以外の時刻で誤って文節がスポッティングされること
を抑制することができる。句境界は文節境界と一致して
いる場合が殆どであり、結局、文節境界以外の時刻で誤
って文節がスポッティングされることを抑制することが
できる。[Equation 26] As is clear from equations (22) and (24), the spotting score becomes large in sections where both the integrated forward probability and the integrated backward probability are large. Therefore, multiplying the spectral feature forward and backward probabilities calculated with the background model by the weighted phrase boundary probability keeps those probabilities small at times other than phrase boundaries, which suppresses erroneous spotting of bunsetsu at such times. Because phrase boundaries coincide with bunsetsu boundaries in most cases, the net effect is to suppress erroneous spotting of bunsetsu at times other than bunsetsu boundaries.

【０１０４】実施例４．実施例３で述べた句境界確率利
用音声認識装置において、バックグラントモデルとして
文節モデルの連鎖を用いる場合の実施例を説明する。Example 4. This example describes the case where a chain of bunsetsu models is used as the background model in the speech recognition apparatus using phrase boundary probabilities described in Example 3.

【０１０５】図７は、例えば行動予定管理を認識タスク
とした場合、すなわち行動予定に関する音声入力に対し
て専用化されたバックグランドモデルの一構成例が示さ
れる。FIG. 7 shows an example of the configuration of a background model dedicated to voice input regarding an action schedule, for example, when action schedule management is used as a recognition task.

【０１０６】図７において四角で囲まれた部分は単語モ
デルを表しており、単語モデルのネットワークとして文
節モデルが構成されている。認識タスクに現れる全ての
文節のモデルをバックグランドモデルに組み込むことに
より、認識タスク内の発声であれば、全ての発声に対す
るスペクトル特徴ベクトルの時系列を前記バックグラン
ドモデルによって表現することができる。In FIG. 7, a portion surrounded by a square represents a word model, and a bunsetsu model is constructed as a network of word models. By incorporating the models of all the clauses appearing in the recognition task into the background model, the time series of the spectral feature vectors for all the utterances can be represented by the background model if the utterance is in the recognition task.

【０１０７】バックグランドモデルとして文節モデルの
連鎖を用いること以外は、スポッティングの動作は、実
施例３で述べた句境界確率利用音声認識装置と全く同様
なので説明は省略する。Apart from using a chain of bunsetsu models as the background model, the spotting operation is exactly the same as in the speech recognition apparatus using phrase boundary probabilities described in Example 3, so its description is omitted.

【０１０８】バックグランドモデルとして文節モデルの
連鎖を用いることによって、実施例３で述べたバックグ
ランドモデルのように音節の連鎖を用いる場合と比較し
て、実施例３で述べたスペクトル特徴前向き確率Ｓ
_fw（ｔ）（ｔ＝ｌ〜Ｔ）とスペクトル特徴後ろ向き確率
であるＳ_bw（ｔ）（ｔ＝ｌ〜Ｔ）が文節境界時刻以外で
は更に小さな値に抑えられるので、文節境界以外の時刻
で誤って文節がスポッティングされることを、より抑制
することができる。When a chain of bunsetsu models is used as the background model, the spectral feature forward probability S_fw(t) (t = 1 to T) and the spectral feature backward probability S_bw(t) (t = 1 to T) described in Example 3 are held to even smaller values at times other than bunsetsu boundaries than when a chain of syllables is used, as in the background model of Example 3. Erroneous spotting of bunsetsu at times other than bunsetsu boundaries can therefore be suppressed further.

【０１０９】実施例５．図８は請求項７記載の発明に係
わる連続音声認識装置の一構成例を示す図である。本発
明に係わる連続音声認識装置は、句境界確率計算部ｌｌ
と連続音声認識部２３から構成される。Example 5. FIG. 8 is a diagram showing a configuration example of the continuous speech recognition apparatus according to the invention of claim 7. The apparatus comprises the phrase boundary probability calculation unit 11 and a continuous speech recognition unit 23.

【０１１０】本実施例における句境界確率計算部ｌｌ
は、実施例２で述べた句境界計算装置を用いるので、同
一の番号を付し、説明は省略する。尚、実施例１で述べ
た句境界計算装置を用いることもできる。The phrase boundary probability calculation unit 11 in this example uses the phrase boundary probability calculation device described in Example 2, so the same reference numerals are used and the description is omitted. The device described in Example 1 can also be used.

【０１１１】連続音声認識部２３は、入力音声のスペク
トル特徴ベクトルの時系列を算出するスペクトル分析手
段１２と、認識対象とする文音声のスペクトル特徴ベク
トルの時系列をモデル化した文モデルネットワークを記
憶する文モデルネットワークメモリ２０と、前記入力音
声のスペクトル特徴ベクトルの時系列を入力とし、文モ
デルネットワークを用いて前記入力音声の認識を行い、
複数の認識結果候補文と各認識結果候捕文のスペクトル
特徴認識スコアと、各認識結果候捕文毎にその文を構成
する文節の境界時刻とを出力する連続音声認識手段２ｌ
と、前記各認識結果候補文の文節の境界時刻における重
み付き句境界確率を用いて、スペクトル特徴認識スコア
を補正し、この補正した認識スコアに基づいて、最終的
な認識結果候補文を決定する確率統合手段２２から構成
される。The continuous speech recognition unit 23 comprises: spectrum analysis means 12, which calculates the time series of spectral feature vectors of the input speech; a sentence model network memory 20, which stores a sentence model network that models the time series of spectral feature vectors of the sentence speech to be recognized; continuous speech recognition means 21, which takes the time series of spectral feature vectors of the input speech, recognizes the input speech using the sentence model network, and outputs a plurality of recognition result candidate sentences, the spectral feature recognition score of each candidate sentence, and, for each candidate sentence, the boundary times of the bunsetsu that make up the sentence; and probability integrating means 22, which corrects the spectral feature recognition score using the weighted phrase boundary probabilities at the bunsetsu boundary times of each candidate sentence and determines the final recognition result candidate sentences on the basis of the corrected recognition scores.

【０１１２】本実施例では、文モデルネットワークとし
てＨＭＭを用いる。文モデルネットワークでモデル化す
る特徴量パラメータは、実施例３で述べた文節モデルと
同様にスペクトル特徴ベクトルの時系列である。連続音
声認識では、認識対象とする文の総数は非常に多いの
で、認識対象とする各文毎にモデルを用意するのではな
く、図９に示されるように単語モデルを接続して構成し
た文モデルネットワークを用いる。In this embodiment, an HMM is used as the sentence model network. The feature parameter modeled by the sentence model network is a time series of spectral feature vectors as in the phrase model described in the third embodiment. In continuous speech recognition, the total number of sentences to be recognized is very large. Therefore, instead of preparing a model for each sentence to be recognized, sentences formed by connecting word models as shown in FIG. Use a model network.

【０１１３】また図９に示されるように、文モデルネッ
トワーク中に文節区切り位置の情報を付与しておく。Further, as shown in FIG. 9, information on the bunsetsu delimiter position is added to the sentence model network.

【０１１４】次に動作について説明する。Next, the operation will be described.

【０１１５】音声信号の入力端ｌから入力された音声信
号２は、句境界確率計算部ｌｌと連続音声認識部２３に
入力される。The voice signal 2 input from the input end 1 of the voice signal is input to the phrase boundary probability calculation unit 11 and the continuous voice recognition unit 23.

【０１１６】句境界確率計算部ｌｌは実施例２と全く同
じ動作をし、時刻ｔにおける重み付き句境界確率Ｓ’_p
（ｔ）（ｔ＝１〜Ｔ）を出力する。The phrase boundary probability calculation unit 11 operates exactly as in Example 2 and outputs the weighted phrase boundary probability S'_p(t) (t = 1 to T) at each time t.

【０１１７】連続音声認識部２３に入力された音声信号
２は、スペクトル分折手段ｌ２によってスペクトル分折
され、スペクトル特徴ベクトルの時系列Ｘ₁Ｘ₂Ｘ₃，
・・・，Ｘ_Tに変換される。スペクトル特徴ベクトルＸ
は例えばＬＰＣケプストラムである。The speech signal 2 input to the continuous speech recognition unit 23 is spectrally analyzed by the spectrum analysis means 12 and converted into a time series of spectral feature vectors X₁ X₂ X₃, ..., X_T. The spectral feature vector X is, for example, an LPC cepstrum.

【０１１８】連続音声認識手段２ｌは、スペクトル分析
手段１２の出力であるスペクトル特徴ベクトルの時系列
を入力として、文モデルネットワークメモリ２０からの
文モデルネットワークを用いて、例えばＮ−ｂｅｓｔア
ルゴリズムによって連続音声認識を行い、Ｎ個の認識結
果候補文と各認識候補文のスペクトル特徴認識スコアＧ
⁽ⁿ⁾（ｎ＝１〜Ｎ）と、各認識結果候補文毎にその文を
構成する文節の境界時刻ｔ⁽ⁿ⁾ _k（ｎ＝１〜Ｎ、ｋ＝１
〜Ｋ⁽ⁿ⁾、但しＫ⁽ⁿ⁾：ｎ番目認識結果候補文に含まれ
る文節境界の数）とを出力する。The continuous speech recognition means 21 receives the time series of spectral feature vectors output by the spectrum analysis means 12 and performs continuous speech recognition with the sentence model network from the sentence model network memory 20, for example by an N-best algorithm. It outputs N recognition result candidate sentences, the spectral feature recognition score G⁽ⁿ⁾ (n = 1 to N) of each candidate sentence, and, for each candidate sentence, the boundary times t⁽ⁿ⁾_k (n = 1 to N, k = 1 to K⁽ⁿ⁾, where K⁽ⁿ⁾ is the number of bunsetsu boundaries in the n-th candidate sentence) of the bunsetsu that make up the sentence.

【０１１９】確率統合手段２２は、句境界確率計算部１
１の出力である重み付き句境界確率と連続音声認識手段
２ｌの出力である複数の各認識結果候補文のスペクトル
特徴認識スコアと各認識結果候捕文毎にその文を構成す
る文節の境界時刻とを入力として、（２７）式により各
認識結果候補文の補正認識スコアＧ⁽ⁿ⁾’（ｎ＝ｌ〜
Ｎ）を計算する。そして認識結果として、前記補正認識
スコアＧ⁽ⁿ⁾’の高い順に、認識結果候補文と、補正認
識スコアＧ⁽ⁿ⁾’を出力する。The probability integrating means 22 receives the weighted phrase boundary probability output by the phrase boundary probability calculation unit 11, together with the outputs of the continuous speech recognition means 21: the spectral feature recognition score of each recognition result candidate sentence and, for each candidate sentence, the boundary times of the bunsetsu that make up the sentence. It calculates the corrected recognition score G⁽ⁿ⁾' (n = 1 to N) of each candidate sentence by equation (27), and outputs as the recognition result the candidate sentences and their corrected recognition scores G⁽ⁿ⁾' in descending order of corrected score.

【０１２０】[0120]

【数２７】上記のごとく、スペクトル特徴認識スコアにくわえて、
ピッチパタンから計算される重み付き句境界確率を統合
することにより、句境界の誤った認識結果文候補の確率
が抑えられ、文音声認識の精度を向上させることができ
る。[Equation 27] As mentioned above, in addition to the spectral feature recognition score,
By integrating the weighted phrase boundary probabilities calculated from the pitch pattern, the probability of recognition result sentence candidates having incorrect phrase boundaries can be suppressed, and the accuracy of sentence speech recognition can be improved.
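The image of equation (27) is not reproduced in this text, so the exact correction is unknown. As an assumed example only, the sketch below multiplies each candidate's spectral score G⁽ⁿ⁾ by the weighted phrase boundary probabilities at that candidate's bunsetsu boundary times, working in the log domain; the function name, the tuple layout, and the small floor `eps` that guards against log(0) are all hypothetical.

```python
import math

def rescore_candidates(candidates, s_p_weighted, eps=1e-300):
    """Re-rank N-best candidates with boundary probabilities.

    candidates:   list of (sentence, G, boundary_times), times counted from 1.
    s_p_weighted: sequence where s_p_weighted[t - 1] holds S'_p(t).
    ASSUMED correction: log G' = log G + sum over k of log S'_p(t_k).
    """
    rescored = []
    for sentence, g, times in candidates:
        log_score = math.log(max(g, eps))
        for t in times:
            log_score += math.log(max(s_p_weighted[t - 1], eps))
        rescored.append((sentence, log_score))
    # Highest corrected score first, as in the described output order.
    return sorted(rescored, key=lambda item: item[1], reverse=True)

ranking = rescore_candidates(
    [("a", 0.5, [1, 3]), ("b", 0.4, [2])],
    s_p_weighted=[0.1, 0.9, 0.2],
)
```

In this toy input candidate "a" has the higher spectral score, but its bunsetsu boundaries fall at times with low boundary probability, so candidate "b" is ranked first after correction.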

【０１２１】[0121]

【発明の効果】以上述べたようにこの発明によれば、ピ
ッチパタンモデルによってアクセント句のピッチ特徴量
の時系列を統計的にモデル化し、句境界確率計算手段
は、前記ピッチパタンモデルを用いてピッチ特徴量の時
系列に対するピッチ前向き確率とピッチ後ろ向き確率を
計算し、このピッチ前向き確率とピッチ後ろ向き確率に
基づいて前記入力音声の各時刻におけるアクセント句境
界確率を計算するので、スペクトル特徴量から計算され
る音声認識の認識結果候補の認識スコアと、ピッチパタ
ンから計算されるアクセント句境界確率との統合が可能
となり、入力音声の各時刻が句境界であるか否かをｌ，
０で判定する従来技術と比較して、アクセント句境界情
報の音声認識への利用を容易にする効果がある。As described above, according to the present invention, the pitch pattern models statistically model the time series of pitch features of accent phrases. The phrase boundary probability calculation means uses the pitch pattern models to calculate a pitch forward probability and a pitch backward probability for the time series of pitch features, and from these calculates the accent phrase boundary probability at each time of the input speech. This makes it possible to integrate the recognition scores of speech recognition candidates, calculated from spectral features, with the accent phrase boundary probability, calculated from the pitch pattern. Compared with the prior art, which makes a binary 1/0 decision on whether each time of the input speech is a phrase boundary, this makes accent phrase boundary information easier to use in speech recognition.

【０１２２】またピッチパタンモデル作成のための、ピ
ッチパタンのクラスタリングに使用するデー夕は、時間
長の異なるピッチパタンをＨＭＭの学習という操作によ
って非線形に圧縮して、各ピッチパタンを同一のデータ
長に揃えたものであり、各ＨＭＭの平均ベクトルは最尤
推定によって求められているので、前記従来技術のよう
に線形伸縮によってデータ長を揃える場合よりも、デー
タ伸縮によるパタンの歪が小さく抑えられ、正確なクラ
スタリングが可能となる。Further, the data used for pitch pattern clustering for creating the pitch pattern model is such that pitch patterns having different time lengths are non-linearly compressed by an operation of learning HMM, and each pitch pattern has the same data length. Since the average vector of each HMM is obtained by the maximum likelihood estimation, the pattern distortion due to the data expansion and contraction can be suppressed to be smaller than that in the case where the data lengths are aligned by the linear expansion and contraction as in the above-mentioned conventional technique. , Accurate clustering is possible.

【０１２３】また、句境界確率重み付け手段は、アクセ
ント句境界確率に対して、重み付け係数を備え、アクセ
ント句境界確率に重みを付けを行うので、アクセント句
境界確率を考慮したことによる音声認識の精度向上を最
大にするようにアクセント句境界確率の寄与率を設定す
ることができる。Further, since the phrase boundary probability weighting means is provided with a weighting coefficient for the accent phrase boundary probability and weights the accent phrase boundary probability, the accuracy of speech recognition by considering the accent phrase boundary probability. The contribution rate of the accent phrase boundary probability can be set to maximize the improvement.

【０１２４】また、前向き確率統合手段は重み付き句境
界確率とスペクトル特徴前向き確率との積を求めること
により統合化前向き確率を算出し、後ろ向き確率統合手
段は重み付き句境界確率とスペクトル特徴後ろ向き確率
との積を求めることにより統合化後ろ向き確率を算出
し、スポッティング手段は、統合化前向き確率と統合化
後ろ向き確率とを用いて、スポッティングを行うので、
句境界以外の時刻での、スペクトル特徴前向き確率と、
スベクトル特徴後ろ向き確率を小さく抑えることが可能
となり、句境界以外の時刻で誤って文節がスポッティン
グされることを抑制することができる。The forward probability integrating means calculates the integrated forward probability as the product of the weighted phrase boundary probability and the spectral feature forward probability, the backward probability integrating means calculates the integrated backward probability as the product of the weighted phrase boundary probability and the spectral feature backward probability, and the spotting means performs spotting using both integrated probabilities. The spectral feature forward and backward probabilities can therefore be kept small at times other than phrase boundaries, which suppresses erroneous spotting of bunsetsu at such times.

【０１２５】また、バックグランドモデルとして文節モ
デルの連鎖を用いることによって、スペクトル特徴前向
き確率とスペクトル特徴後ろ向き確率が文節境界時刻以
外では更に小さな値に抑えられるので、文節境界以外の
時刻で誤って文節がスポッティングされることを、より
抑制することができる。Further, by using the bunsetsu model chain as the background model, the spectral feature forward probability and the spectral feature backward probability can be suppressed to a smaller value at times other than the bunsetsu boundary time. Can be further suppressed from being spotted.

[0126] Further, the continuous speech recognition means recognizes the input speech using the sentence model network and calculates a plurality of recognition result candidate sentences, the spectral feature recognition score of each recognition result candidate sentence, and, for each recognition result candidate sentence, the boundary times of the bunsetsu composing that sentence. The probability integration means corrects the spectral feature recognition score using the weighted phrase boundary probability at the bunsetsu boundary times of each recognition result candidate sentence. The scores of recognition result candidate sentences with incorrect phrase boundaries are thereby suppressed, and the accuracy of sentence speech recognition can be improved.
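The score correction described above can be sketched as a rescoring of candidate sentences. This is a minimal illustration, not the method of the specification: the log-domain combination, the probability floor, the dictionary layout of a candidate, and the names `corrected_score` and `pick_best` are all assumptions.

```python
import math


def corrected_score(log_spectral_score, boundary_times, boundary_probs,
                    weight, floor=1e-10):
    """Add the weighted log boundary probability at each bunsetsu
    boundary time of a candidate to its spectral log score.

    Assumption: scores are log probabilities, so a product of
    probabilities becomes a sum of logarithms.
    """
    bonus = sum(math.log(max(boundary_probs[t], floor))
                for t in boundary_times)
    return log_spectral_score + weight * bonus


def pick_best(candidates, boundary_probs, weight):
    """Return the candidate with the highest corrected score."""
    return max(
        candidates,
        key=lambda c: corrected_score(c["log_score"], c["boundaries"],
                                      boundary_probs, weight),
    )


# Hypothetical per-frame accent phrase boundary probabilities.
boundary_probs = [0.01, 0.02, 0.9, 0.05, 0.03, 0.85, 0.02]

# Candidate A has the better spectral score but places its bunsetsu
# boundaries at frames where a phrase boundary is unlikely.
candidates = [
    {"text": "candidate A", "log_score": -10.0, "boundaries": [1, 4]},
    {"text": "candidate B", "log_score": -10.5, "boundaries": [2, 5]},
]

best = pick_best(candidates, boundary_probs, weight=1.0)
```

With a weight of zero the selection falls back to the spectral score alone, so the weight controls how strongly the boundary information influences the result.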

[Brief description of drawings]

FIG. 1 is a block diagram showing a configuration example of a phrase boundary probability calculation device according to a first embodiment of the present invention.

FIG. 2 is a diagram for explaining the configuration of an HMM.

FIG. 3 is a diagram for explaining the configuration of a pitch pattern model network.

FIG. 4 is a block diagram showing a configuration example of a phrase boundary probability calculation device according to a second embodiment of the present invention.

FIG. 5 is a block diagram showing a configuration example of a continuous speech recognition apparatus using phrase boundary probabilities according to a third embodiment of the present invention.

FIG. 6 is a diagram for explaining the background model of the third embodiment, which is composed of a chain of syllables.

FIG. 7 is a diagram for explaining the background model of the fourth embodiment, which is composed of a chain of bunsetsu.

FIG. 8 is a block diagram showing a configuration example of a continuous speech recognition apparatus using phrase boundary probabilities according to a fifth embodiment of the present invention.

FIG. 9 is a diagram for explaining the configuration of a sentence model network.

FIG. 10 is a block diagram showing a configuration example of a conventional phrase boundary detection device.

[Explanation of symbols]

1 speech signal input terminal, 2 speech signal, 3 pitch analysis means, 4 pause detection means, 5 pitch pattern template, 6 phrase boundary detection means, 7 pitch pattern model memory, 8 pitch pattern model network memory, 9 phrase boundary probability calculation means, 10 phrase boundary probability weighting means, 11 phrase boundary probability calculation unit, 12 spectrum analysis means, 13 background model memory, 14 background model matching means, 15 forward probability integration means, 16 backward probability integration means, 17 bunsetsu model memory, 18 spotting means, 19 spotting unit, 20 sentence model network memory, 21 continuous speech recognition means, 22 probability integration means, 23 continuous speech recognition unit.

Continuation of the front page

(51) Int.Cl.⁷ Identification code FI G10L 15/14 G10L 3/00 531W 15/18 535Z 537E 537G 9/00 301A

(56) References
Japanese Patent No. 2793137 (JP, B2)
Japanese Patent No. 3061292 (JP, B2)
JP-A-4-66999 (JP, A)
JP-A-60-229099 (JP, A)
Hanazawa, Abe, Nakajima, "Examination of bunsetsu spotting for a semantics-driven speech understanding system," Proceedings of the Autumn Meeting of the Acoustical Society of Japan, Japan, October 31, 1994, 1-Q-14, pp. 169-170
Nakai, Shimodaira, Sagayama, "Phrase boundary detection in speaker-independent continuous speech based on clustering of pitch patterns," IEICE Journal A, Japan, February 25, 1994, Vol. J77-A, No. 2, pp. 206-214
Kitaoka, Kawahara, Doshita, "Phrase spotting for spontaneous speech recognition and understanding," IEICE Technical Report [Speech], Japan, December 10, 1993, SP93-116, NLC93-56, pp. 15-22

(58) Fields searched (Int.Cl.⁷, DB name): G10L 15/00 - 15/28; JICST file (JOIS)

Claims

(57) [Claims]

1. A phrase boundary probability calculation device comprising: pitch analysis means for calculating a time series of pitch feature amounts of input speech; a pitch pattern model memory that stores one or a plurality of pitch pattern models, each modeling the time series of pitch feature amounts of an accent phrase; and phrase boundary probability calculation means that takes the time series of pitch feature amounts as input, calculates a pitch forward probability, which is the probability that a chain of the pitch pattern models outputs the pitch feature amounts from the beginning of the time series up to a predetermined time, and a pitch backward probability, which is the probability that the chain of the pitch pattern models outputs the pitch feature amounts from the end of the time series back to the predetermined time, and, based on the pitch forward probability and the pitch backward probability, calculates an accent phrase boundary probability at each time of the input speech as the ratio between the total probability that the time series of pitch feature amounts is generated by the chain of the pitch pattern models and the probability that the time series is generated by way of a pitch pattern model boundary at the predetermined time.
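The ratio recited in claim 1 can be illustrated with a small forward-backward computation. This is a sketch under simplifying assumptions and is not taken from the specification: it uses a single pitch pattern model chained to copies of itself, one-dimensional pitch features, Gaussian output densities with unit variance, and a fixed stay probability. The names `gauss`, `boundary_probabilities`, `means`, and `p_stay` are hypothetical.

```python
import math


def gauss(x, mean, var=1.0):
    """One-dimensional Gaussian density (assumed output distribution)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def boundary_probabilities(obs, means, p_stay=0.5):
    """Probability, for each time t, that a model boundary lies between
    frames t and t+1, given the whole observation sequence.

    The model is a left-to-right HMM with one state per entry in `means`
    (at least two states). Leaving the last state re-enters the first
    state of the next copy of the model; that transition is the boundary.
    """
    S, T = len(means), len(obs)
    last = S - 1

    def trans(i, j):
        if j == i:
            return p_stay
        if j == (i + 1) % S:  # next state, or wrap to the next model copy
            return 1.0 - p_stay
        return 0.0

    emit = [[gauss(o, m) for m in means] for o in obs]

    # Forward pass: alpha[t][s] = P(o_0..o_t, state at t is s).
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = emit[0][0]  # the chain starts in the first state
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = emit[t][j] * sum(
                alpha[t - 1][i] * trans(i, j) for i in range(S))

    # Backward pass: beta[t][s] = P(o_{t+1}..o_{T-1} | state at t is s),
    # with the chain required to end in the last state of a model.
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][last] = 1.0
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(
                trans(i, j) * emit[t + 1][j] * beta[t + 1][j] for j in range(S))

    total = alpha[T - 1][last]  # probability of the whole sequence
    return [
        alpha[t][last] * trans(last, 0) * emit[t + 1][0] * beta[t + 1][0] / total
        for t in range(T - 1)
    ]


# Hypothetical pitch contour: two high-then-low accent phrases.
probs = boundary_probabilities([3, 3, 0, 0, 3, 3, 0, 0], means=[3.0, 0.0])
```

Each returned value is the probability mass of the paths that cross a model boundary at that time divided by the total probability of the sequence, so it stays between 0 and 1. A device with several pitch pattern models would sum over all model-to-model transitions instead of the single wrap-around transition used here.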

2. A clustering method for sequences of feature vectors having different sequence lengths, characterized by training an HMM of identical structure on each of a plurality of feature vector sequences having different sequence lengths, and using the mean vectors of each HMM obtained after training as the data for clustering.
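The idea behind claim 2 is that a model with a fixed number of states turns sequences of any length into vectors of one fixed length, which can then be clustered with ordinary distance measures. The sketch below illustrates only that mapping. It does not train an HMM: as a plainly named stand-in, it splits each sequence uniformly into `n_states` segments and averages each segment, which resembles a typical initialization of a left-to-right HMM rather than the trained result. Scalar features and the names `state_means` and `sq_dist` are assumptions.

```python
def state_means(seq, n_states):
    """Map a scalar sequence of any length >= n_states to n_states means.

    Stand-in for HMM training: uniform segmentation, one mean per segment.
    """
    T = len(seq)
    if T < n_states:
        raise ValueError("sequence is shorter than the number of states")
    means = []
    for s in range(n_states):
        start = s * T // n_states
        end = (s + 1) * T // n_states
        segment = seq[start:end]
        means.append(sum(segment) / len(segment))
    return means


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical pitch patterns of different lengths.
short_rise_fall = [1, 3, 1]
long_rise_fall = [1, 1, 3, 3, 1, 1]
flat = [2, 2, 2, 2]

va = state_means(short_rise_fall, 3)
vb = state_means(long_rise_fall, 3)
vc = state_means(flat, 3)
```

Here the two rise-fall patterns map to the same three-element vector despite their different lengths, while the flat pattern maps to a distant one, so any fixed-dimension clustering procedure can group them.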

3. The phrase boundary probability calculation device according to claim 1, wherein pitch patterns of accent phrases are clustered by the clustering method according to claim 2, and pitch pattern models are created based on the clustering result and used.

4. The phrase boundary probability calculation device according to claim 1, further comprising phrase boundary probability weighting means that holds a weighting coefficient for the accent phrase boundary probability and weights the accent phrase boundary probability with that coefficient.

5. A continuous speech recognition apparatus using phrase boundary probabilities, comprising: spectrum analysis means for calculating a time series of spectral feature vectors of input speech; a background model memory storing a background model that models the time series of spectral feature vectors of speech; background model matching means that takes the time series of spectral feature vectors of the input speech as input and calculates a spectral feature forward probability, which is the probability that a chain of the background models outputs the spectral feature vectors from the beginning of the time series up to a predetermined time, and a spectral feature backward probability, which is the probability that the chain of the background models outputs the spectral feature vectors from the end of the time series back to the predetermined time; forward probability integration means for calculating an integrated forward probability that is the product of the spectral feature forward probability and the phrase boundary probability output from the phrase boundary probability calculation device according to claim 1 or claim 4; backward probability integration means for calculating an integrated backward probability that is the product of the spectral feature backward probability and the phrase boundary probability output from the phrase boundary probability calculation device according to claim 1 or claim 4; a bunsetsu model memory storing bunsetsu models that model the time series of spectral feature vectors of the bunsetsu speech to be spotted; and spotting means that takes as input the time series of spectral feature vectors of the speech, the integrated forward probability, and the integrated backward probability, and spots bunsetsu using the bunsetsu models.

6. The continuous speech recognition apparatus using phrase boundary probabilities according to claim 5, wherein a chain of bunsetsu models is used as the background model.

7. A continuous speech recognition apparatus using phrase boundary probabilities, comprising: spectrum analysis means for calculating a time series of spectral feature vectors of input speech; a sentence model network that models the time series of spectral feature vectors of the sentence speech to be recognized; continuous speech recognition means that takes the time series of spectral feature vectors of the input speech as input, recognizes the input speech, and outputs a plurality of recognition result candidate sentences, the spectral feature recognition score of each recognition result candidate sentence, and, for each recognition result candidate sentence, the boundary times of the bunsetsu composing that sentence; and probability integration means that takes as input the spectral feature recognition score of each of the plurality of recognition result candidate sentences, the boundary times of the bunsetsu composing each recognition result candidate sentence, and the phrase boundary probability output from the phrase boundary probability calculation device according to claim 4, corrects the spectral feature recognition score using the phrase boundary probability at the bunsetsu boundary times of each recognition result candidate sentence, and selects a recognition result candidate sentence based on the corrected recognition score.