JPH0981185A - Continuous voice recognition device - Google Patents

Continuous voice recognition device

Info

Publication number
JPH0981185A
JPH0981185A JP7234043A JP23404395A
Authority
JP
Japan
Prior art keywords
word
hypothesis
phoneme
speech recognition
likelihood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP7234043A
Other languages
Japanese (ja)
Other versions
JP2731133B2 (en)
Inventor
Toru Shimizu
Shoichi Matsunaga
Yoshinori Kosaka
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK
ATR Interpreting Telecommunications Research Laboratories
Original Assignee
ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK
ATR Interpreting Telecommunications Research Laboratories
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK, ATR Interpreting Telecommunications Research Laboratories filed Critical ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK
Priority to JP7234043A priority Critical patent/JP2731133B2/en
Publication of JPH0981185A publication Critical patent/JPH0981185A/en
Application granted granted Critical
Publication of JP2731133B2 publication Critical patent/JP2731133B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Abstract

PROBLEM TO BE SOLVED: To provide a continuous speech recognition device that performs continuous speech recognition of natural utterances at a smaller computational cost than one using a conventional method. SOLUTION: A word collating section 4 detects word hypotheses of an uttered sentence based on the feature parameters of the input speech signal using, for example, a one-pass Viterbi decoding method, computes their likelihoods, and outputs them. A word-hypothesis narrowing section 6 narrows down the word hypotheses output from the word collating section 4 through a buffer memory 5 so that hypotheses of the same word sharing an end time but differing in start time are represented, for each leading phoneme environment of the word, by the single hypothesis having the highest of the total likelihoods computed from the start of the utterance to the end time of the word.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a continuous speech recognition device that continuously recognizes speech based on the input speech signal of an uttered sentence.

[0002]

2. Description of the Related Art
The present applicant has been developing a continuous speech recognition system (hereinafter referred to as the first conventional example) for the purpose of spontaneous speech recognition (see, for example, conventional document 1, "Nagai, Takami, Sagayama, 'The SSS-LR Continuous Speech Recognition System: Integrating SSS-Derived Allophone Models and a Phoneme-Context-Dependent LR Parser', Proc. of ICSLP92, pp. 1511-1514, 1992," and conventional document 2, "Shimizu, Monzen, Singer, Matsunaga, 'Time-Synchronous Continuous Speech Recognizer Driven by a Context-Free Grammar', Proc. of ICASSP95, pp. 584-587, 1995"). In the first conventional example, speech recognition is performed on the input speech signal of an uttered sentence using a phoneme hidden Markov model (hereinafter, hidden Markov model is abbreviated HMM) and a word dictionary, while managing the word history and grammar state from the start of the utterance.

[0003] On the other hand, a speech recognition method using a word graph (hereinafter referred to as the second conventional example) has been proposed in conventional document 3, "Ney, Aubert, 'A Word Graph Algorithm for Large Vocabulary, Continuous Speech Recognition', Proc. of ICSLP94, pp. 1355-1358, 1994," and conventional document 4, "Woodland, Leggetter, Odell, Valtchev, Young, 'The 1994 HTK Large Vocabulary Speech Recognition System', Proc. of ICASSP95, pp. 73-76, 1995."

[0004] The main idea of the word graph in the second conventional example is to process word-hypothesis candidates in regions of the speech signal where the ambiguity of speech recognition is relatively high. The advantages are that pure acoustic recognition is decoupled from the application of the language model, and that a complex language model can be applied in a later step following the word currently being recognized. The number of word-hypothesis candidates must vary with the level of ambiguity in recognition. The difficulty in efficiently constructing a good word graph is as follows. The start time of a word generally depends on the preceding words. In a first approximation, this dependence is restricted to the immediately preceding word, yielding the so-called word-pair approximation: given a word pair and its end time, the boundary between the two words is independent of any earlier preceding word. The word-pair approximation was originally introduced to compute multiple sentences, that is, the n best sentences, efficiently. The word graph is expected to be more efficient than the approach of obtaining the n best sentences (hereinafter, the n-best method): with a word graph, multiple word hypotheses need only be generated locally, whereas in the n-best method each local word-hypothesis candidate requires a whole sentence to be added to the list of n best sentences.

[0005]

Problems to Be Solved by the Invention
In the first conventional example, however, the word history and grammar state from the start of the utterance must be managed, so when the system is used to recognize natural speech, in which interjections, hesitations, and restarts occur frequently, the computational cost of merging or splitting word hypotheses is extremely large. That is, the amount of processing required for recognition grows, a storage device with a comparatively large capacity is required, and the processing time becomes long.

[0006] In the word-pair approximation of the second conventional example, each preceding word is represented by one hypothesis, but the effect of the approximation is still relatively small. The same problems as in the first conventional example therefore arise.

[0007] An object of the present invention is to solve the above problems and to provide a continuous speech recognition device capable of recognizing natural speech continuously at a smaller computational cost than the conventional examples.

[0008]

Means for Solving the Problems
A continuous speech recognition device according to the present invention comprises speech recognition means that continuously recognizes speech by detecting word hypotheses of an uttered sentence from its input speech signal and computing their likelihoods. The speech recognition means narrows down word hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single word hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word.

[0009]

Description of the Embodiments
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a block diagram of a continuous speech recognition device according to one embodiment of the present invention. The device comprises a word collating section 4 that, using the known one-pass Viterbi decoding method, detects word hypotheses of the input uttered sentence from the feature parameters of its speech signal, computes their likelihoods, and outputs them; and a word-hypothesis narrowing section 6 that, for the word hypotheses output from the word collating section 4 through a buffer memory 5, narrows down hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word.

[0010] In FIG. 1, the phoneme HMM 11, which is connected to the word collating section 4 and stored in, for example, a hard disk memory, is represented in terms of states, each of which carries the following information:
(a) state number
(b) acceptable context classes
(c) lists of preceding and succeeding states
(d) parameters of the output probability density distribution
(e) self-transition probability and transition probabilities to succeeding states
Since the phoneme HMM used in this embodiment must identify which speaker each distribution originates from, it is created by converting a predetermined speaker-mixture HMM. The output probability density functions are Gaussian mixture distributions with 34-dimensional diagonal covariance matrices.
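The per-state information in items (a) through (e) can be pictured as a simple record type. The following is an illustrative sketch only; the field names and types are assumptions and are not part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class HmmState:
    """One state of the phoneme HMM, mirroring items (a)-(e).
    All field names here are illustrative assumptions."""
    state_number: int                                            # (a) state number
    context_classes: set = field(default_factory=set)            # (b) acceptable context classes
    predecessors: list = field(default_factory=list)             # (c) preceding states
    successors: list = field(default_factory=list)               # (c) succeeding states
    mixture_weights: list = field(default_factory=list)          # (d) mixture weights
    means: list = field(default_factory=list)                    # (d) 34-dim mean vectors
    diag_covs: list = field(default_factory=list)                # (d) diagonal covariances
    self_loop_prob: float = 0.0                                  # (e) self-transition probability
    trans_probs: dict = field(default_factory=dict)              # (e) probabilities to successors
```

A concrete HMM would hold one such record per state, with the Gaussian-mixture parameters of (d) filled in from training.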

[0011] The word dictionary 12, which is connected to the word collating section 4 and stored in, for example, a hard disk, stores for each word of the phoneme HMM 11 a symbol string representing its pronunciation.

[0012] In FIG. 1, the speaker's utterance is input to a microphone 1 and converted into a speech signal, which is then input to a feature extraction section 2. After A/D conversion of the input speech signal, the feature extraction section 2 performs, for example, LPC analysis and extracts a 34-dimensional feature vector comprising the log power, 16 cepstrum coefficients, the delta log power, and 16 delta cepstrum coefficients. The time series of extracted feature parameters is input to the word collating section 4 through a buffer memory 3.
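As a minimal sketch, the 34-dimensional vector (log power, 16 cepstra, and their deltas) could be assembled per frame as below. The simple central-difference delta over adjacent frames is an assumption; the patent does not specify the delta computation or window.

```python
def assemble_features(log_power, cepstra):
    """Build 34-dim vectors: log power, 16 cepstrum coefficients,
    delta log power, 16 delta cepstrum coefficients.
    Deltas are central differences over adjacent frames (an assumption)."""
    T = len(log_power)
    frames = []
    for t in range(T):
        prev, nxt = max(t - 1, 0), min(t + 1, T - 1)
        d_pow = (log_power[nxt] - log_power[prev]) / 2.0
        d_cep = [(cepstra[nxt][k] - cepstra[prev][k]) / 2.0 for k in range(16)]
        vec = [log_power[t]] + list(cepstra[t]) + [d_pow] + d_cep
        frames.append(vec)  # 1 + 16 + 1 + 16 = 34 dimensions
    return frames
```

In a real front end the cepstra would come from the LPC analysis mentioned above; here they are simply taken as given inputs.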

[0013] Using the one-pass Viterbi decoding method, the word collating section 4 detects word hypotheses and computes their likelihoods from the feature-parameter data input through the buffer memory 3, using the phoneme HMM 11 and the word dictionary 12, and outputs the results. Here, for each HMM state at each time, the word collating section 4 computes the within-word likelihood and the likelihood from the start of the utterance. Likelihoods are kept separately for each word identifier, word start time, and preceding word. To reduce the amount of computation, grid hypotheses whose total likelihood, computed from the phoneme HMM 11 and the word dictionary 12, is low are pruned. The word collating section 4 outputs the resulting word hypotheses and likelihood information, together with time information measured from the utterance start time (concretely, for example, frame numbers), to the word-hypothesis narrowing section 6 through a buffer memory 5.
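The pruning of low-likelihood grid hypotheses can be sketched as a standard beam cut against the best hypothesis at the current frame. This is an assumption about the pruning criterion; the patent only states that low-likelihood grid hypotheses are discarded, and the dictionary representation and beam width below are invented for illustration.

```python
def prune_grid_hypotheses(hypotheses, beam=200.0):
    """Keep only grid hypotheses whose total log-likelihood lies within
    `beam` of the best hypothesis at this frame (beam width is illustrative)."""
    if not hypotheses:
        return []
    best = max(h["total_loglik"] for h in hypotheses)
    return [h for h in hypotheses if h["total_loglik"] >= best - beam]
```

Each surviving hypothesis would then be extended to the next frame by the Viterbi recursion, so the beam directly bounds the per-frame work.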

[0014] Based on the word hypotheses output from the word collating section 4 through the buffer memory 5, the word-hypothesis narrowing section 6 narrows down hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word; it then outputs, as the recognition result, the word string of the hypothesis with the maximum total likelihood among all hypotheses remaining after narrowing. In this embodiment, the leading phoneme environment of a word to be processed is preferably the sequence of three phonemes comprising the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word itself.

[0015] For example, as shown in FIG. 2, suppose that the i-th word Wi, consisting of the phoneme string a1, a2, ..., an, follows the (i-1)-th word Wi-1, and that six hypotheses Wa, Wb, Wc, Wd, We, Wf exist as word hypotheses for Wi-1. Suppose the final phoneme of the first three hypotheses Wa, Wb, Wc is /x/ and the final phoneme of the latter three hypotheses Wd, We, Wf is /y/. Among the hypotheses with equal end time te and equal leading phoneme environment (in FIG. 2, the top three hypotheses, whose leading phoneme environment is "x/a1/a2"), all but the hypothesis with the highest total likelihood (for example, the topmost one in FIG. 2) are deleted. The fourth hypothesis from the top is not deleted, because its leading phoneme environment differs: the final phoneme of its preceding word hypothesis is y rather than x. That is, only one hypothesis is retained for each final phoneme of the preceding word hypothesis; in the example of FIG. 2, one hypothesis is retained for the final phoneme /x/ and one for the final phoneme /y/.
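The narrowing rule illustrated by FIG. 2 can be sketched as follows. The dictionary-based hypothesis representation, the field names, and the numeric likelihood values are all assumptions made for illustration; only the grouping key (word, end time, leading phoneme environment) and the keep-the-best rule come from the text above.

```python
def narrow_word_hypotheses(hyps):
    """For hypotheses of the same word with the same end time, keep, per leading
    phoneme environment (final phoneme of the preceding hypothesis plus the
    first two phonemes of the word), only the highest-likelihood hypothesis."""
    best = {}
    for h in hyps:
        env = (h["prev_final"],) + tuple(h["phonemes"][:2])
        key = (h["word"], h["end_time"], env)
        if key not in best or h["loglik"] > best[key]["loglik"]:
            best[key] = h
    return list(best.values())

# The six FIG. 2 hypotheses of word Wi ending at time te: three preceded by a
# hypothesis ending in /x/, three by one ending in /y/ (likelihoods invented).
hyps = [
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "x", "loglik": -10.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "x", "loglik": -12.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "x", "loglik": -15.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "y", "loglik": -11.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "y", "loglik": -13.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "y", "loglik": -14.0},
]
survivors = narrow_word_hypotheses(hyps)
```

Two hypotheses survive, one per final phoneme of the preceding word hypothesis, matching the FIG. 2 example.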

[0016] In the above embodiment, the leading phoneme environment of a word is defined as the sequence of three phonemes comprising the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word itself. The present invention is not limited to this: the leading phoneme environment may be any phoneme sequence comprising a phoneme string of the preceding word hypothesis that includes its final phoneme and at least one phoneme contiguous with that final phoneme, together with a phoneme string that includes the first phoneme of the word hypothesis of the word itself.

[0017]

Examples
To confirm the effectiveness of the continuous speech recognition device of FIG. 1, the inventor conducted a word-graph generation experiment using a spontaneous-speech database. The evaluation used the "hotel reservation" dialogues (utterances of five speakers on the applicant side: 5 dialogues, 56 utterances, 687 words) from the applicant's spoken-language database for the "travel planning" task (see, for example, conventional document 5, "Morimoto et al., 'A Speech and Language Database for Speech Translation Research', Proc. of ICSLP94, pp. 1791-1794, 1994"). Acoustic analysis was performed with a sampling frequency of 12 kHz, a frame interval of 5 msec, and a 20-msec Hamming window; the feature parameters were the 1st- to 16th-order LPC cepstrum, the 1st- to 16th-order delta LPC cepstrum, the log power, and the delta log power. The acoustic model (hidden Markov network: 401 states, 5 mixtures) was first trained on read speech (150 sentences) and then adapted to the speaking style using utterances (128 utterances) of nine speakers who do not appear in the test data of the database. The language model was trained on the whole "travel planning" corpus including "hotel reservation" (18,315 utterances, 229,159 words). The word perplexity was 55.9. The word dictionary (1,113 words) contains the entire vocabulary of the evaluation data, so there are no unknown words (also called unregistered words).

[0018] The effect of narrowing down word hypotheses with different start times is described next. FIG. 3 compares the distributions of the number of preceding words per word hypothesis with narrowing (this embodiment) and without narrowing. Narrowing reduced the average number of preceding words from 3.59 to 1.70. For the case without narrowing, the average number of preceding words computed while ignoring differences in start time was 1.36. These results suggest that the method of the present invention, which uses one representative hypothesis per leading phoneme environment of a word, achieves an effect quite close to that of the word-pair approximation of the second conventional example, which uses one representative hypothesis per preceding word, at a smaller computational cost.

[0019] As described above, according to this embodiment, word hypotheses of the same word having the same end time but different start times are narrowed down so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word. That is, compared with the word-pair approximation of the second conventional example, which represents the hypotheses by one hypothesis per preceding word, hypotheses whose phoneme preceding the first phoneme of the word (that is, the final phoneme of the preceding word) is the same are handled together, so the number of word hypotheses can be reduced and the approximation effect is large; the reduction is particularly large as the vocabulary grows. Therefore, even when the device is used to recognize natural speech, in which interjections, hesitations, and restarts occur frequently, the computational cost of merging or splitting word hypotheses is smaller than in the conventional examples. The amount of processing required for recognition is reduced, so the storage capacity required of the storage devices used for recognition, namely the working memory (not shown) of the word collating section 4, the buffer memory 5, and the working memory (not shown) of the word-hypothesis narrowing section 6, is smaller, and the processing time for recognition is shortened.

[0020]

Effects of the Invention
As described in detail above, according to the present invention, in a continuous speech recognition device comprising speech recognition means that continuously recognizes speech by detecting word hypotheses of an uttered sentence from its input speech signal and computing their likelihoods, the speech recognition means narrows down word hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word. That is, compared with the word-pair approximation of the second conventional example, which represents the hypotheses by one hypothesis per preceding word, hypotheses whose phoneme preceding the first phoneme of the word (that is, the final phoneme of the preceding word) is the same are handled together, so the number of word hypotheses can be reduced and the approximation effect is large, particularly as the vocabulary grows. Therefore, even when the device is used to recognize natural speech, in which interjections, hesitations, and restarts occur frequently, the computational cost of merging or splitting word hypotheses is smaller than in the conventional examples. The amount of processing required for recognition is reduced, so the storage capacity required of the storage device for recognition is smaller and the processing time for recognition is shortened.

Brief Description of the Drawings

FIG. 1 is a block diagram of a continuous speech recognition device according to one embodiment of the present invention.

FIG. 2 is a timing chart showing the processing of the word-hypothesis narrowing section 6 in the continuous speech recognition device of FIG. 1.

FIG. 3 is a graph of the number of nodes against the number of preceding words, showing the word-hypothesis narrowing effect at transitions between words in the experimental results of the continuous speech recognition device of FIG. 1.

Explanation of Reference Numerals

1: microphone; 2: feature extraction section; 3, 5: buffer memories; 4: word collating section; 6: word-hypothesis narrowing section; 11: phoneme HMM; 12: word dictionary.

Front page continued
(72) Inventor: Shoichi Matsunaga, ATR Interpreting Telecommunications Research Laboratories, 5 Mihiratani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Yoshinori Kosaka, ATR Interpreting Telecommunications Research Laboratories, 5 Mihiratani, Inuidani, Seika-cho, Soraku-gun, Kyoto

Claims (1)

1. A continuous speech recognition device comprising speech recognition means for continuously recognizing speech by detecting word hypotheses of an uttered sentence based on an input speech signal of the uttered sentence and computing their likelihoods, wherein the speech recognition means narrows down word hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single word hypothesis having the highest likelihood among the total likelihoods computed from the utterance start time to the end time of the word.
JP7234043A 1995-09-12 1995-09-12 Continuous speech recognition device Expired - Fee Related JP2731133B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP7234043A JP2731133B2 (en) 1995-09-12 1995-09-12 Continuous speech recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP7234043A JP2731133B2 (en) 1995-09-12 1995-09-12 Continuous speech recognition device

Publications (2)

Publication Number Publication Date
JPH0981185A true JPH0981185A (en) 1997-03-28
JP2731133B2 JP2731133B2 (en) 1998-03-25

Family

ID=16964682

Family Applications (1)

Application Number Title Priority Date Filing Date
JP7234043A Expired - Fee Related JP2731133B2 (en) 1995-09-12 1995-09-12 Continuous speech recognition device

Country Status (1)

Country Link
JP (1) JP2731133B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7072835B2 (en) 2001-01-23 2006-07-04 Matsushita Electric Industrial Co., Ltd. Method and apparatus for speech recognition

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2999726B2 (en) 1996-09-18 2000-01-17 株式会社エイ・ティ・アール音声翻訳通信研究所 Continuous speech recognition device


Also Published As

Publication number Publication date
JP2731133B2 (en) 1998-03-25

Similar Documents

Publication Publication Date Title
US9812122B2 (en) Speech recognition model construction method, speech recognition method, computer system, speech recognition apparatus, program, and recording medium
EP0533491B1 (en) Wordspotting using two hidden Markov models (HMM)
JP2963142B2 (en) Signal processing method
Lee et al. Improved acoustic modeling for large vocabulary continuous speech recognition
JP2006038895A (en) Device and method for speech processing, program, and recording medium
JP2003316386A (en) Method, device, and program for speech recognition
KR101014086B1 (en) Voice processing device and method, and recording medium
Schlüter et al. Interdependence of language models and discriminative training
Hieronymus et al. Spoken language identification using large vocabulary speech recognition
Boite et al. A new approach towards keyword spotting.
Hieronymus et al. Robust spoken language identification using large vocabulary speech recognition
Lee et al. Acoustic modeling of subword units for speech recognition
Mŭller et al. Design of speech recognition engine
JP2974621B2 (en) Speech recognition word dictionary creation device and continuous speech recognition device
JP2871420B2 (en) Spoken dialogue system
Rebai et al. Linto platform: A smart open voice assistant for business environments
JP2731133B2 (en) Continuous speech recognition device
JP3104900B2 (en) Voice recognition method
Jitsuhiro et al. Automatic generation of non-uniform context-dependent HMM topologies based on the MDL criterion.
JP2005250071A (en) Method and device for speech recognition, speech recognition program, and storage medium with speech recognition program stored therein
JPH0981182A (en) Learning device for hidden markov model(hmm) and voice recognition device
Wu et al. Application of simultaneous decoding algorithms to automatic transcription of known and unknown words
JP2999727B2 (en) Voice recognition device
Raj et al. Design and implementation of speech recognition systems
JPH09212190A (en) Speech recognition device and sentence recognition device

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees