JPH0981185A - Continuous voice recognition device - Google Patents

Continuous voice recognition device

Info

Publication number
JPH0981185A
JPH0981185A JP7234043A JP23404395A
Authority
JP
Japan
Prior art keywords
word
hypothesis
phoneme
speech recognition
likelihood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP7234043A
Other languages
Japanese (ja)
Other versions
JP2731133B2 (en)
Inventor
Toru Shimizu
Shoichi Matsunaga
Yoshinori Kosaka
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK
ATR Interpreting Telecommunications Research Laboratories
Original Assignee
ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK
ATR Interpreting Telecommunications Research Laboratories
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK, ATR Interpreting Telecommunications Research Laboratories filed Critical ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK
Priority to JP7234043A priority Critical patent/JP2731133B2/en
Publication of JPH0981185A publication Critical patent/JPH0981185A/en
Application granted granted Critical
Publication of JP2731133B2 publication Critical patent/JP2731133B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Abstract

PROBLEM TO BE SOLVED: To provide a continuous speech recognition device that performs continuous speech recognition of natural utterances at a smaller computational cost than one using a conventional method. SOLUTION: A word collating section 4 detects word hypotheses of an uttered sentence based on the feature parameters of the input speech signal using, for example, a one-pass Viterbi decoding method, computes their likelihoods, and outputs them. A word-hypothesis narrowing section 6 narrows down the word hypotheses output from the word collating section 4 through a buffer memory 5 so that hypotheses of the same word sharing an end time but differing in start time are represented, for each leading phoneme environment of the word, by the single hypothesis having the highest of the total likelihoods computed from the start of the utterance to the end time of the word.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a continuous speech recognition device that continuously recognizes speech based on the input speech signal of an uttered sentence.

[0002]

2. Description of the Related Art
The present applicant has been developing a continuous speech recognition system (hereinafter referred to as the first conventional example) for the purpose of spontaneous speech recognition (see, for example, conventional document 1, "Nagai, Takami, Sagayama, 'The SSS-LR Continuous Speech Recognition System: Integrating SSS-Derived Allophone Models and a Phoneme-Context-Dependent LR Parser', Proc. of ICSLP92, pp. 1511-1514, 1992," and conventional document 2, "Shimizu, Monzen, Singer, Matsunaga, 'Time-Synchronous Continuous Speech Recognizer Driven by a Context-Free Grammar', Proc. of ICASSP95, pp. 584-587, 1995"). In the first conventional example, speech recognition is performed on the input speech signal of an uttered sentence using a phoneme hidden Markov model (hereinafter, hidden Markov model is abbreviated HMM) and a word dictionary, while managing the word history and grammar state from the start of the utterance.

[0003] On the other hand, a speech recognition method using a word graph (hereinafter referred to as the second conventional example) has been proposed in conventional document 3, "Ney, Aubert, 'A Word Graph Algorithm for Large Vocabulary, Continuous Speech Recognition', Proc. of ICSLP94, pp. 1355-1358, 1994," and conventional document 4, "Woodland, Leggetter, Odell, Valtchev, Young, 'The 1994 HTK Large Vocabulary Speech Recognition System', Proc. of ICASSP95, pp. 73-76, 1995."

[0004] The main idea of the word graph in the second conventional example is to process word-hypothesis candidates in regions of the speech signal where the ambiguity of speech recognition is relatively high. The advantages are that pure acoustic recognition is decoupled from the application of the language model, and that a complex language model can be applied in a later step following the word currently being recognized. The number of word-hypothesis candidates must vary with the level of ambiguity in recognition. The difficulty in efficiently constructing a good word graph is as follows. The start time of a word generally depends on the preceding words. In a first approximation, this dependence is restricted to the immediately preceding word, yielding the so-called word-pair approximation: given a word pair and its end time, the boundary between the two words is independent of any earlier preceding word. The word-pair approximation was originally introduced to compute multiple sentences, that is, the n best sentences, efficiently. The word graph is expected to be more efficient than the approach of obtaining the n best sentences (hereinafter, the n-best method): with a word graph, multiple word hypotheses need only be generated locally, whereas in the n-best method each local word-hypothesis candidate requires a whole sentence to be added to the list of n best sentences.

[0005]

Problems to Be Solved by the Invention
In the first conventional example, however, the word history and grammar state from the start of the utterance must be managed, so when the system is used to recognize natural speech, in which interjections, hesitations, and restarts occur frequently, the computational cost of merging or splitting word hypotheses is extremely large. That is, the amount of processing required for recognition grows, a storage device with a comparatively large capacity is required, and the processing time becomes long.

[0006] In the word-pair approximation of the second conventional example, each preceding word is represented by one hypothesis, but the effect of the approximation is still relatively small. The same problems as in the first conventional example therefore arise.

[0007] An object of the present invention is to solve the above problems and to provide a continuous speech recognition device capable of recognizing natural speech continuously at a smaller computational cost than the conventional examples.

[0008]

Means for Solving the Problems
A continuous speech recognition device according to the present invention comprises speech recognition means that continuously recognizes speech by detecting word hypotheses of an uttered sentence from its input speech signal and computing their likelihoods. The speech recognition means narrows down word hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single word hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word.

[0009]

Description of the Embodiments
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a block diagram of a continuous speech recognition device according to one embodiment of the present invention. The device comprises a word collating section 4 that, using the known one-pass Viterbi decoding method, detects word hypotheses of the input uttered sentence from the feature parameters of its speech signal, computes their likelihoods, and outputs them; and a word-hypothesis narrowing section 6 that, for the word hypotheses output from the word collating section 4 through a buffer memory 5, narrows down hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word.

[0010] In FIG. 1, the phoneme HMM 11, which is connected to the word collating section 4 and stored in, for example, a hard disk memory, is represented in terms of states, each of which carries the following information:
(a) state number
(b) acceptable context classes
(c) lists of preceding and succeeding states
(d) parameters of the output probability density distribution
(e) self-transition probability and transition probabilities to succeeding states
Since the phoneme HMM used in this embodiment must identify which speaker each distribution originates from, it is created by converting a predetermined speaker-mixture HMM. The output probability density functions are Gaussian mixture distributions with 34-dimensional diagonal covariance matrices.
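The per-state information in items (a) through (e) can be pictured as a simple record type. The following is an illustrative sketch only; the field names and types are assumptions and are not part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class HmmState:
    """One state of the phoneme HMM, mirroring items (a)-(e).
    All field names here are illustrative assumptions."""
    state_number: int                                            # (a) state number
    context_classes: set = field(default_factory=set)            # (b) acceptable context classes
    predecessors: list = field(default_factory=list)             # (c) preceding states
    successors: list = field(default_factory=list)               # (c) succeeding states
    mixture_weights: list = field(default_factory=list)          # (d) mixture weights
    means: list = field(default_factory=list)                    # (d) 34-dim mean vectors
    diag_covs: list = field(default_factory=list)                # (d) diagonal covariances
    self_loop_prob: float = 0.0                                  # (e) self-transition probability
    trans_probs: dict = field(default_factory=dict)              # (e) probabilities to successors
```

A concrete HMM would hold one such record per state, with the Gaussian-mixture parameters of (d) filled in from training.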

[0011] The word dictionary 12, which is connected to the word collating section 4 and stored in, for example, a hard disk, stores for each word of the phoneme HMM 11 a symbol string representing its pronunciation.

[0012] In FIG. 1, the speaker's utterance is input to a microphone 1 and converted into a speech signal, which is then input to a feature extraction section 2. After A/D conversion of the input speech signal, the feature extraction section 2 performs, for example, LPC analysis and extracts a 34-dimensional feature vector comprising the log power, 16 cepstrum coefficients, the delta log power, and 16 delta cepstrum coefficients. The time series of extracted feature parameters is input to the word collating section 4 through a buffer memory 3.
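As a minimal sketch, the 34-dimensional vector (log power, 16 cepstra, and their deltas) could be assembled per frame as below. The simple central-difference delta over adjacent frames is an assumption; the patent does not specify the delta computation or window.

```python
def assemble_features(log_power, cepstra):
    """Build 34-dim vectors: log power, 16 cepstrum coefficients,
    delta log power, 16 delta cepstrum coefficients.
    Deltas are central differences over adjacent frames (an assumption)."""
    T = len(log_power)
    frames = []
    for t in range(T):
        prev, nxt = max(t - 1, 0), min(t + 1, T - 1)
        d_pow = (log_power[nxt] - log_power[prev]) / 2.0
        d_cep = [(cepstra[nxt][k] - cepstra[prev][k]) / 2.0 for k in range(16)]
        vec = [log_power[t]] + list(cepstra[t]) + [d_pow] + d_cep
        frames.append(vec)  # 1 + 16 + 1 + 16 = 34 dimensions
    return frames
```

In a real front end the cepstra would come from the LPC analysis mentioned above; here they are simply taken as given inputs.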

[0013] Using the one-pass Viterbi decoding method, the word collating section 4 detects word hypotheses and computes their likelihoods from the feature-parameter data input through the buffer memory 3, using the phoneme HMM 11 and the word dictionary 12, and outputs the results. Here, for each HMM state at each time, the word collating section 4 computes the within-word likelihood and the likelihood from the start of the utterance. Likelihoods are kept separately for each word identifier, word start time, and preceding word. To reduce the amount of computation, grid hypotheses whose total likelihood, computed from the phoneme HMM 11 and the word dictionary 12, is low are pruned. The word collating section 4 outputs the resulting word hypotheses and likelihood information, together with time information measured from the utterance start time (concretely, for example, frame numbers), to the word-hypothesis narrowing section 6 through a buffer memory 5.
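The pruning of low-likelihood grid hypotheses can be sketched as a standard beam cut against the best hypothesis at the current frame. This is an assumption about the pruning criterion; the patent only states that low-likelihood grid hypotheses are discarded, and the dictionary representation and beam width below are invented for illustration.

```python
def prune_grid_hypotheses(hypotheses, beam=200.0):
    """Keep only grid hypotheses whose total log-likelihood lies within
    `beam` of the best hypothesis at this frame (beam width is illustrative)."""
    if not hypotheses:
        return []
    best = max(h["total_loglik"] for h in hypotheses)
    return [h for h in hypotheses if h["total_loglik"] >= best - beam]
```

Each surviving hypothesis would then be extended to the next frame by the Viterbi recursion, so the beam directly bounds the per-frame work.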

[0014] Based on the word hypotheses output from the word collating section 4 through the buffer memory 5, the word-hypothesis narrowing section 6 narrows down hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word; it then outputs, as the recognition result, the word string of the hypothesis with the maximum total likelihood among all hypotheses remaining after narrowing. In this embodiment, the leading phoneme environment of a word to be processed is preferably the sequence of three phonemes comprising the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word itself.

[0015] For example, as shown in FIG. 2, suppose that the i-th word Wi, consisting of the phoneme string a1, a2, ..., an, follows the (i-1)-th word Wi-1, and that six hypotheses Wa, Wb, Wc, Wd, We, Wf exist as word hypotheses for Wi-1. Suppose the final phoneme of the first three hypotheses Wa, Wb, Wc is /x/ and the final phoneme of the latter three hypotheses Wd, We, Wf is /y/. Among the hypotheses with equal end time te and equal leading phoneme environment (in FIG. 2, the top three hypotheses, whose leading phoneme environment is "x/a1/a2"), all but the hypothesis with the highest total likelihood (for example, the topmost one in FIG. 2) are deleted. The fourth hypothesis from the top is not deleted, because its leading phoneme environment differs: the final phoneme of its preceding word hypothesis is y rather than x. That is, only one hypothesis is retained for each final phoneme of the preceding word hypothesis; in the example of FIG. 2, one hypothesis is retained for the final phoneme /x/ and one for the final phoneme /y/.
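The narrowing rule illustrated by FIG. 2 can be sketched as follows. The dictionary-based hypothesis representation, the field names, and the numeric likelihood values are all assumptions made for illustration; only the grouping key (word, end time, leading phoneme environment) and the keep-the-best rule come from the text above.

```python
def narrow_word_hypotheses(hyps):
    """For hypotheses of the same word with the same end time, keep, per leading
    phoneme environment (final phoneme of the preceding hypothesis plus the
    first two phonemes of the word), only the highest-likelihood hypothesis."""
    best = {}
    for h in hyps:
        env = (h["prev_final"],) + tuple(h["phonemes"][:2])
        key = (h["word"], h["end_time"], env)
        if key not in best or h["loglik"] > best[key]["loglik"]:
            best[key] = h
    return list(best.values())

# The six FIG. 2 hypotheses of word Wi ending at time te: three preceded by a
# hypothesis ending in /x/, three by one ending in /y/ (likelihoods invented).
hyps = [
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "x", "loglik": -10.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "x", "loglik": -12.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "x", "loglik": -15.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "y", "loglik": -11.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "y", "loglik": -13.0},
    {"word": "Wi", "end_time": 42, "phonemes": ["a1", "a2"], "prev_final": "y", "loglik": -14.0},
]
survivors = narrow_word_hypotheses(hyps)
```

Two hypotheses survive, one per final phoneme of the preceding word hypothesis, matching the FIG. 2 example.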

[0016] In the above embodiment, the leading phoneme environment of a word is defined as the sequence of three phonemes comprising the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word itself. The present invention is not limited to this: the leading phoneme environment may be any phoneme sequence comprising a phoneme string of the preceding word hypothesis that includes its final phoneme and at least one phoneme contiguous with that final phoneme, together with a phoneme string that includes the first phoneme of the word hypothesis of the word itself.

[0017]

Examples
To confirm the effectiveness of the continuous speech recognition device of FIG. 1, the inventor conducted a word-graph generation experiment using a spontaneous-speech database. The evaluation used the "hotel reservation" dialogues (utterances of five speakers on the applicant side: 5 dialogues, 56 utterances, 687 words) from the applicant's spoken-language database for the "travel planning" task (see, for example, conventional document 5, "Morimoto et al., 'A Speech and Language Database for Speech Translation Research', Proc. of ICSLP94, pp. 1791-1794, 1994"). Acoustic analysis was performed with a sampling frequency of 12 kHz, a frame interval of 5 msec, and a 20-msec Hamming window; the feature parameters were the 1st- to 16th-order LPC cepstrum, the 1st- to 16th-order delta LPC cepstrum, the log power, and the delta log power. The acoustic model (hidden Markov network: 401 states, 5 mixtures) was first trained on read speech (150 sentences) and then adapted to the speaking style using utterances (128 utterances) of nine speakers who do not appear in the test data of the database. The language model was trained on the whole "travel planning" corpus including "hotel reservation" (18,315 utterances, 229,159 words). The word perplexity was 55.9. The word dictionary (1,113 words) contains the entire vocabulary of the evaluation data, so there are no unknown words (also called unregistered words).

[0018] The effect of narrowing down word hypotheses with different start times is described next. FIG. 3 compares the distributions of the number of preceding words per word hypothesis with narrowing (this embodiment) and without narrowing. Narrowing reduced the average number of preceding words from 3.59 to 1.70. For the case without narrowing, the average number of preceding words computed while ignoring differences in start time was 1.36. These results suggest that the method of the present invention, which uses one representative hypothesis per leading phoneme environment of a word, achieves an effect quite close to that of the word-pair approximation of the second conventional example, which uses one representative hypothesis per preceding word, at a smaller computational cost.

[0019] As described above, according to this embodiment, word hypotheses of the same word having the same end time but different start times are narrowed down so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word. That is, compared with the word-pair approximation of the second conventional example, which represents the hypotheses by one hypothesis per preceding word, hypotheses whose phoneme preceding the first phoneme of the word (that is, the final phoneme of the preceding word) is the same are handled together, so the number of word hypotheses can be reduced and the approximation effect is large; the reduction is particularly large as the vocabulary grows. Therefore, even when the device is used to recognize natural speech, in which interjections, hesitations, and restarts occur frequently, the computational cost of merging or splitting word hypotheses is smaller than in the conventional examples. The amount of processing required for recognition is reduced, so the storage capacity required of the storage devices used for recognition, namely the working memory (not shown) of the word collating section 4, the buffer memory 5, and the working memory (not shown) of the word-hypothesis narrowing section 6, is smaller, and the processing time for recognition is shortened.

[0020]

Effects of the Invention
As described in detail above, according to the present invention, in a continuous speech recognition device comprising speech recognition means that continuously recognizes speech by detecting word hypotheses of an uttered sentence from its input speech signal and computing their likelihoods, the speech recognition means narrows down word hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single hypothesis having the highest of the total likelihoods computed from the utterance start time to the end time of the word. That is, compared with the word-pair approximation of the second conventional example, which represents the hypotheses by one hypothesis per preceding word, hypotheses whose phoneme preceding the first phoneme of the word (that is, the final phoneme of the preceding word) is the same are handled together, so the number of word hypotheses can be reduced and the approximation effect is large, particularly as the vocabulary grows. Therefore, even when the device is used to recognize natural speech, in which interjections, hesitations, and restarts occur frequently, the computational cost of merging or splitting word hypotheses is smaller than in the conventional examples. The amount of processing required for recognition is reduced, so the storage capacity required of the storage device for recognition is smaller and the processing time for recognition is shortened.

Brief Description of the Drawings

FIG. 1 is a block diagram of a continuous speech recognition device according to one embodiment of the present invention.

FIG. 2 is a timing chart showing the processing of the word-hypothesis narrowing section 6 in the continuous speech recognition device of FIG. 1.

FIG. 3 is a graph of the number of nodes against the number of preceding words, showing the word-hypothesis narrowing effect at transitions between words in the experimental results of the continuous speech recognition device of FIG. 1.

Explanation of Reference Numerals

1: microphone; 2: feature extraction section; 3, 5: buffer memories; 4: word collating section; 6: word-hypothesis narrowing section; 11: phoneme HMM; 12: word dictionary.

Front page continued
(72) Inventor: Shoichi Matsunaga, ATR Interpreting Telecommunications Research Laboratories, 5 Mihiratani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Yoshinori Kosaka, ATR Interpreting Telecommunications Research Laboratories, 5 Mihiratani, Inuidani, Seika-cho, Soraku-gun, Kyoto

Claims (1)

1. A continuous speech recognition device comprising speech recognition means for continuously recognizing speech by detecting word hypotheses of an uttered sentence based on an input speech signal of the uttered sentence and computing their likelihoods, wherein the speech recognition means narrows down word hypotheses of the same word having the same end time but different start times so that, for each leading phoneme environment of the word, they are represented by the single word hypothesis having the highest likelihood among the total likelihoods computed from the utterance start time to the end time of the word.
JP7234043A 1995-09-12 1995-09-12 Continuous speech recognition device Expired - Fee Related JP2731133B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP7234043A JP2731133B2 (en) 1995-09-12 1995-09-12 Continuous speech recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP7234043A JP2731133B2 (en) 1995-09-12 1995-09-12 Continuous speech recognition device

Publications (2)

Publication Number Publication Date
JPH0981185A true JPH0981185A (en) 1997-03-28
JP2731133B2 JP2731133B2 (en) 1998-03-25

Family

ID=16964682

Family Applications (1)

Application Number Title Priority Date Filing Date
JP7234043A Expired - Fee Related JP2731133B2 (en) 1995-09-12 1995-09-12 Continuous speech recognition device

Country Status (1)

Country Link
JP (1) JP2731133B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7072835B2 (en) 2001-01-23 2006-07-04 Matsushita Electric Industrial Co., Ltd. Method and apparatus for speech recognition

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2999726B2 (en) 1996-09-18 2000-01-17 株式会社エイ・ティ・アール音声翻訳通信研究所 Continuous speech recognition device


Also Published As

Publication number Publication date
JP2731133B2 (en) 1998-03-25

Similar Documents

Publication Publication Date Title
US9812122B2 (en) Speech recognition model construction method, speech recognition method, computer system, speech recognition apparatus, program, and recording medium
EP0533491B1 (en) Wordspotting using two hidden Markov models (HMM)
JP2963142B2 (en) Signal processing method
Lee et al. Improved acoustic modeling for large vocabulary continuous speech recognition
JP2006038895A (en) Device and method for speech processing, program, and recording medium
JP2003316386A (en) Method, device, and program for speech recognition
KR101014086B1 (en) Voice processing device and method, and recording medium
Schlüter et al. Interdependence of language models and discriminative training
Hieronymus et al. Spoken language identification using large vocabulary speech recognition
Boite et al. A new approach towards keyword spotting.
Hieronymus et al. Robust spoken language identification using large vocabulary speech recognition
Lee et al. Acoustic modeling of subword units for speech recognition
Mŭller et al. Design of speech recognition engine
JP2974621B2 (en) Speech recognition word dictionary creation device and continuous speech recognition device
JP2871420B2 (en) Spoken dialogue system
Rebai et al. Linto platform: A smart open voice assistant for business environments
JP2731133B2 (en) Continuous speech recognition device
JP3104900B2 (en) Voice recognition method
Jitsuhiro et al. Automatic generation of non-uniform context-dependent HMM topologies based on the MDL criterion.
JP2005250071A (en) Method and device for speech recognition, speech recognition program, and storage medium with speech recognition program stored therein
JPH0981182A (en) Learning device for hidden markov model(hmm) and voice recognition device
Wu et al. Application of simultaneous decoding algorithms to automatic transcription of known and unknown words
JP2999727B2 (en) Voice recognition device
Raj et al. Design and implementation of speech recognition systems
JPH09212190A (en) Speech recognition device and sentence recognition device

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees