JPH11237893A - Speech recognition system and phoneme recognizing method

Info

Publication number: JPH11237893A
Application number: JP3836598A
Authority: JP
Inventors: Shintaro Murakami; 伸太郎村上
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1998-02-20
Filing date: 1998-02-20
Publication date: 1999-08-31

Abstract

PROBLEM TO BE SOLVED: To improve the phoneme recognition rate by, when matching input phonemes against a dictionary template and setting the optimal start at the first phoneme of the dictionary template in the word spot calculation, using results up to a specific frame to extend the range in which the optimal start may lie. SOLUTION: A feature extraction part 13 performs frequency analysis on voice data input to a voice input part 12 to obtain a spectrum sequence, which is input to a phoneme recognition part 14. The resulting phoneme sequence is supplied to a matching part 16, which, when matching the phoneme sequence against the templates in the dictionary, finds the cumulative distance of the matching distances between the input phonemes and the dictionary template up to the n-th frame of the input phonemes. The most similar word or word sequence is then output as the recognition result. In this process, the DP path in the word spot calculation part is changed so that the optimal start can be selected even from frames far from the end.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a phoneme recognition method in a continuous word speech recognition system that has a phoneme recognition unit as its front-end stage.

[0002]

2. Description of the Related Art

One example of a speech recognition apparatus is the discrete word speech recognition system shown in FIG. 4. In this system, as shown in FIG. 4, voice data is input to a voice input unit 12 from a voice input device 11 such as a telephone or a microphone. The voice data input to the voice input unit 12 is supplied to a feature extraction unit 13, where it undergoes frequency analysis. A spectrum sequence is obtained from the result of the frequency analysis and is input to a phoneme recognition unit 14. The phoneme recognition unit 14 consists of a neural network (not shown) with duplicated outputs.

[0003] The neural network consists of an input layer, a hidden layer, and an output layer. At each time step the input layer receives, for example, five frames of spectra, and the values of the output-layer units indicate which phoneme corresponds to the central spectrum. Because the output units are duplicated, two units are assigned to each phoneme category. From the result, the two units with the largest output values are selected, and the phonemes they correspond to are taken as the first and second phoneme candidates.
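The candidate selection described in paragraph [0003] can be sketched in Python as follows. This is only an illustration under assumptions: the function name `top_two_phonemes`, the list `unit_to_phoneme` (which output unit belongs to which phoneme), and the example values are hypothetical and do not come from the patent, which does not show the network itself.

```python
def top_two_phonemes(outputs, unit_to_phoneme):
    """Return the phonemes of the two output units with the largest values.

    outputs[u] is the activation of output unit u (hypothetical layout);
    unit_to_phoneme[u] is the phoneme category assigned to unit u.
    With duplicated outputs, each phoneme appears twice in unit_to_phoneme.
    """
    if len(outputs) != len(unit_to_phoneme) or len(outputs) < 2:
        raise ValueError("need matching lists with at least two units")
    # Sort unit indices by activation, largest first.
    order = sorted(range(len(outputs)), key=lambda u: outputs[u], reverse=True)
    first, second = order[0], order[1]
    return unit_to_phoneme[first], unit_to_phoneme[second]


# Three phoneme categories, two units each (duplicated outputs).
units = ["a", "a", "i", "i", "u", "u"]
candidates = top_two_phonemes([0.1, 0.2, 0.9, 0.3, 0.7, 0.05], units)
```

Here the two largest activations belong to a unit of "i" and a unit of "u", so those become the first and second candidates.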

[0004] The matching unit 16 uses the DP (Dynamic Programming) method to evaluate the similarity between the recognized phoneme candidate sequence and the templates 15 in the dictionary, which hold the phoneme patterns of the vocabulary to be recognized. The matching unit 16 then outputs the most similar word or word string as the recognition result.

[0005] An outline of a general continuous word recognition algorithm follows. Assume that the word connection conditions (grammar) are described by the finite state automaton shown in FIG. 5, which restricts the number of recognized words, the connections between words, and so on. Let T = {a(1), a(2), a(3), ..., a(t)} (t frames) be the result of phoneme recognition of the input pattern. The DP matching distance between frames i through j of T, {a(i), ..., a(j)}, and the dictionary word template n = {b(1), b(2), ..., b(N(n))} is written dist[n][i][j], where N(n) is the length of template n. dist[n][i][j] is obtained by DP matching or a similar method and is also called the word spot value. Here i is called the word spot start and j the word spot end.

[0006] FIG. 5 shows an example of an automaton (three states) that handles two-digit numbers. For a two-digit number, the second (tens) digit is output on the transition into state 1, and the first (ones) digit is output on the transition from state 1 to state 2. No other transition (for example, from state 0 to state 2) outputs a two-digit number.

[0007] Now consider an automaton with stat_num states, an input of len_obj phoneme frames (that is, t = len_obj), and a dictionary of word_num words. The cumulative matching distance frm_scr[stat][k] for automaton state stat up to the k-th frame of the input phonemes (0 <= stat < last_stat, 0 <= k < last_frm) is given by the following equation (1).

[0008]

frm_scr[stat][k] = min{ frm_scr[from_stat][m] + dist[n][m+1][k] } ... (1)

Here, the fact that word n can be generated from state p with a transition to state q is written

f(p, n) = q ... (2)

from_stat and n are those that satisfy 0 < n < word_num and f(from_stat, n) = stat, and m satisfies 0 < m < k (in practice the range of m is restricted further for reasons such as computational cost). min denotes the minimum taken as from_stat, n, and m vary over these ranges.

[0009] Let frm_stt[stat][k], frm_tpl[stat][k], and frm_frm[stat][k] denote the from_stat, n, and m, respectively, that satisfy equations (1) and (2) (stt: state, tpl: template, frm: frame). These quantities are computed for 0 <= k < len_obj and 0 <= stat < stat_num. In an actual program, the following processing is generally performed.

[0010] First, the algorithm for the cumulative distance at the k-th frame is described.

1. For every state stat (0 <= stat < stat_num), perform step 2 and the steps that follow.
2. For every dictionary word n (0 <= n < word_num), perform step 3 and the steps that follow.
3. Compute the cumulative distance scr = min{ frm_scr[from_stat][m] + dist[n][m+1][k] }, where min is the minimum taken as only m and from_stat vary, and from_stat satisfies equation (2).
4. If frm_scr[stat][k] > scr, perform step 5.
5. Set frm_scr[stat][k] = scr, frm_tpl[stat][k] = n, frm_frm[stat][k] = (the m attaining the minimum in step 3), and frm_stt[stat][k] = (the from_stat attaining the minimum in step 3).

A backtrace is then performed as described below to obtain the recognized word string.

[0011] FIG. 6 is a flowchart of the above cumulative distance algorithm (for the j-th input frame). In FIG. 6, step S1 starts the loop over all automaton states, and step S2 then starts the loop over all dictionary words. Once step S2 has been executed, the cumulative distance scr is computed in step S3. Step S4 then tests whether the stored cumulative distance exceeds scr; if "yes", the update is carried out in step S5, and if "no", processing proceeds to step S6. When step S5 is finished, step S6 determines whether the dictionary word template n is greater than the number of dictionary words word_num. If the result is "yes", step S7 determines whether the automaton state stat is greater than the number of states stat_num, and if that is also "yes", the cumulative distance calculation ends. A "no" in step S6 returns to step S3, and a "no" in step S7 returns to step S2.

[0012] The backtrace algorithm proceeds as follows.

1. Set k = len_obj and stat = stat_num (len_obj: number of input frames; stat_num: number of the final state of the finite state automaton).
2. Output frm_tpl[stat][k] as a recognition result, then replace (k, stat) by (frm_frm[stat][k], frm_stt[stat][k]).
3. If k = 0, stop; otherwise return to step 2.
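The cumulative distance recursion of equation (1) and the backtrace can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's program: the function names, the dictionary `transitions` (encoding f(p, n) = q), and the callable `dist` are hypothetical. The sketch also uses its own index convention: column k means "k frames consumed", `dist(n, m, k)` is the word spot value of word n on frames m through k-1, state 0 is assumed to be the start state, and no restriction is placed on the range of m.

```python
import math


def cumulative_distance(num_frames, num_states, transitions, dist):
    """Fill the cumulative distance table and back-pointers.

    transitions: dict mapping (from_stat, n) -> stat, i.e. f(from_stat, n) = stat.
    dist(n, m, k): word spot value of word n on frames m..k-1 (assumed given).
    """
    scr = [[math.inf] * (num_frames + 1) for _ in range(num_states)]
    back = [[None] * (num_frames + 1) for _ in range(num_states)]
    scr[0][0] = 0.0  # assumption: state 0 is the start state
    for k in range(1, num_frames + 1):
        for (from_stat, n), stat in transitions.items():
            for m in range(k):
                if scr[from_stat][m] == math.inf:
                    continue  # from_stat is not reachable after m frames
                cand = scr[from_stat][m] + dist(n, m, k)
                if cand < scr[stat][k]:
                    scr[stat][k] = cand
                    # Counterparts of frm_stt, frm_tpl, frm_frm.
                    back[stat][k] = (from_stat, n, m)
    return scr, back


def backtrace(back, final_stat, num_frames):
    """Follow the back-pointers from the final state and return the word list."""
    words = []
    stat, k = final_stat, num_frames
    while k > 0:
        if back[stat][k] is None:
            raise ValueError("no path reaches this state and frame")
        from_stat, n, m = back[stat][k]
        words.append(n)
        stat, k = from_stat, m  # both updated from the old (stat, k)
    words.reverse()  # the backtrace visits the last word first
    return words
```

A small usage example, with a toy `dist` that counts frames whose label differs from the word index, and a three-state automaton that accepts any two words in sequence:

```python
frames = [0, 0, 1, 1]
toy_dist = lambda n, i, j: sum(1 for f in frames[i:j] if f != n)
toy_transitions = {(0, 0): 1, (0, 1): 1, (1, 0): 2, (1, 1): 2}
scr, back = cumulative_distance(len(frames), 3, toy_transitions, toy_dist)
words = backtrace(back, final_stat=2, num_frames=len(frames))
```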

[0013] Next, the algorithm of extended continuous DP, which is used as a continuous word speech recognition algorithm, is described. Extended continuous DP does not compute the word spot value dist[n][i][j] for every (i, j). Instead, given (j, n), it computes only the start i that minimizes dist[n][i][j], denoted i_min, together with dist[n][i_min][j]. In other words, for each end j, only the word spot value for the optimal start i_min is used.

[0014] Let aug_dist[n][j] denote dist[n][i_min][j], and let aug_ini[n][j] denote the optimal start i_min (aug: augmented; ini: initial, a variable for computing the start). The algorithm for obtaining aug_dist[n][j] and aug_ini[n][j] is as follows. For simplicity, the DP matching calculation path, which specifies the allowed transitions of the DP score, is taken here to be the path without slope constraints shown in FIG. 7(a); in general, however, paths with slope constraints such as the one shown in FIG. 7(b) are more widely used.

[0015] Next, the algorithm for calculating the extended continuous DP matching distance (word spot) is described.

1. For every input frame j (0 <= j < len_obj), perform step 2 and the steps that follow.
2. For every dictionary template n (0 <= n < word_num), perform step 3 and the steps that follow.
3. For every template phoneme tpl (0 <= tpl < N(n)) of n, perform step 4.
4. When tpl = 0:
   dist_tmp[tpl][j] = value(tpl, j) (tmp: abbreviation of temporary)
   ini_tmp[tpl][j] = j
   When tpl >= 1:
   dist_tmp[tpl][j] = min{ dist_tmp[tpl-1][k] } + value(tpl, j) (j-2 <= k <= j)
   With min_k denoting the k that attains this minimum,
   ini_tmp[tpl][j] = ini_tmp[tpl-1][min_k]
   Here value(tpl, j) is the inter-phoneme distance between the phoneme of the tpl-th frame of template n and the phoneme of the j-th input frame. For example, if the score is set to "0" when the phonemes match and "1" when they do not, the inter-frame distance between the j-th input phoneme a(j) and the i-th phoneme b(i) of template n is defined accordingly (0 on a match, 1 otherwise).
5. aug_dist[n][j] = dist_tmp[N(n)-1][j], aug_ini[n][j] = ini_tmp[N(n)-1][j]

The above aug_dist[][] and aug_ini[][] are used to obtain frm_scr[][] and the related tables. However, because only the optimal start is obtained for each frame j, the word spot scores around the optimal start are obtained approximately when computing the minimum value scr described in the "algorithm for the cumulative distance at the k-th frame" above.
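The word spot calculation of paragraph [0015] can be sketched in Python for a single template as follows. This is an illustrative sketch under assumptions: the names `word_spot` and `zero_one` are hypothetical, indices are 0-based, the window j-2 <= k <= j is clipped at frame 0, ties are broken toward the smallest k, and `value` receives the two phonemes directly rather than their indices.

```python
import math


def zero_one(template_phoneme, input_phoneme):
    """Example inter-phoneme distance: 0 on a match, 1 otherwise."""
    return 0 if template_phoneme == input_phoneme else 1


def word_spot(template, frames, value=zero_one):
    """Return (aug_dist, aug_ini) for one template, indexed by end frame j."""
    n_tpl, n_frm = len(template), len(frames)
    dist_tmp = [[math.inf] * n_frm for _ in range(n_tpl)]
    ini_tmp = [[None] * n_frm for _ in range(n_tpl)]
    for j in range(n_frm):
        for tpl in range(n_tpl):
            local = value(template[tpl], frames[j])
            if tpl == 0:
                # A match may start at any input frame.
                dist_tmp[0][j] = local
                ini_tmp[0][j] = j
            else:
                # Best predecessor among frames j-2..j of the previous phoneme.
                window = range(max(0, j - 2), j + 1)
                min_k = min(window, key=lambda k: dist_tmp[tpl - 1][k])
                dist_tmp[tpl][j] = dist_tmp[tpl - 1][min_k] + local
                ini_tmp[tpl][j] = ini_tmp[tpl - 1][min_k]
    # The last template phoneme gives the word spot value and its start.
    return dist_tmp[-1], ini_tmp[-1]


aug_dist, aug_ini = word_spot(["a", "b", "c"], ["x", "a", "b", "c", "x"])
```

In this example the template matches frames 1 through 3 exactly, so the entry for end frame 3 has distance 0 and start 1.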

[0016] FIG. 8 is a flowchart of the above extended continuous DP matching distance (word spot) calculation algorithm. In this flowchart, the initial values of j, n, and tpl are first set to "0" in steps S1 to S3 so that the loops over input frames, dictionary templates, and template phonemes can be executed. Next, step S4 determines whether the template phoneme index tpl is "0", and the corresponding update of dist_tmp and ini_tmp is carried out in step S5 (tpl = 0) or step S6 (tpl >= 1). Step S7 then determines whether tpl >= N(n); if "n", the processing of step S5 is repeated, and if "y", aug_dist[n][j] and aug_ini[n][j] are set in step S8. After step S8, the dictionary template n is checked in step S9 and the input frame j in step S10, and the word spot algorithm ends.

[0017] Next, the cumulative distance algorithm of extended continuous DP is described.

1. For every j (0 <= j < len_obj), perform step 2 and the steps that follow.
2. For every state stat (0 <= stat < stat_num), perform step 3 and the steps that follow.
3. For every n (0 <= n < word_num), perform step 4 and the steps that follow.
4. Compute scr = min{ frm_scr[from_stat][aug_ini[n][j]+m-1] + apx_scr[aug_ini[n][j]+m] }, where min is the minimum taken as only m and from_stat vary. m ranges over the predetermined interval APX_MIN to APX_MAX (apx: abbreviation of approximate), and from_stat must also satisfy equation (2).

[0018] apx_scr[aug_ini[n][j]+m] is obtained as follows.

apx_scr[aug_ini[n][j]+m] = aug_dist[n][j] × (j - (aug_ini[n][j]+m)) / (j - aug_ini[n][j])

This value approximates the word spot value for the start (aug_ini[n][j]+m) by multiplying the word spot value for the start (aug_ini[n][j]) by a coefficient proportional to the frame length.

5. If frm_scr[stat][j] > scr, perform step 6.
6. Set frm_scr[stat][j] = scr, frm_tpl[stat][j] = n, frm_frm[stat][j] = aug_ini[n][j] + (the m attaining the minimum in step 4), and frm_stt[stat][j] = (the from_stat attaining the minimum in step 4).

The recognized word string is then obtained by backtrace.
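The approximation of the word spot value around the optimal start can be sketched in Python as follows. The function name `apx_scr` and its argument names are hypothetical, and the explicit error for a zero-length span is an assumption added here, since the source does not say how the case j = aug_ini[n][j] is handled.

```python
def apx_scr(aug_dist_nj, aug_ini_nj, j, m):
    """Approximate word spot value for start aug_ini_nj + m and end j.

    aug_dist_nj: word spot value for the optimal start (aug_dist[n][j]).
    aug_ini_nj:  optimal start frame (aug_ini[n][j]).
    m:           offset of the candidate start from the optimal start.
    """
    span = j - aug_ini_nj
    if span <= 0:
        # Assumption: the source does not define this case.
        raise ValueError("end frame must lie after the optimal start")
    # Scale the optimal score by the ratio of the two segment lengths.
    return aug_dist_nj * (j - (aug_ini_nj + m)) / span


same_start = apx_scr(6.0, 2, 8, 0)      # m = 0 reproduces aug_dist
later_start = apx_scr(6.0, 2, 8, 3)     # shorter segment, smaller score
earlier_start = apx_scr(6.0, 2, 8, -2)  # longer segment, larger score
```

With m = 0 the formula returns aug_dist[n][j] unchanged, which is a quick consistency check on the parenthesization used above.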

[0019] FIG. 9 is a flowchart of the cumulative calculation algorithm of extended continuous DP. In this flowchart, as in FIG. 8, the initial values of j, stat, and n are first set to "0" in steps S1 to S3 so that the loops over input frames, states, and dictionary templates can be executed. Next, scr is computed in step S4. Step S5 tests whether frm_scr[stat][j] > scr; if so, the update is performed in step S6. Steps S7 to S9 are then performed and the processing ends.

[0020]

Problems to be Solved by the Invention

(1) Even with the word spot algorithm of extended continuous DP described above, problems remain. Consider, for example, recognizing "ohayou" ("good morning"). If phoneme recognition succeeds throughout, as in FIG. 10(a), accurate word spotting is possible. If, however, a misrecognized phoneme occurs partway through, as in FIG. 10(b), only the underlined portion in the figure is word-spotted; part of the word cannot be recognized and is missed. When a misrecognized phoneme occurs partway through in this way, and particularly when the slope-constrained path of FIG. 7(b) is used, there is a strong tendency to select a frame closer to the end as the optimal start. As a result, even slight phoneme misrecognition greatly degrades word spot accuracy.

[0021] (2) The cumulative distance calculation of extended continuous DP is performed only over the range APX_MIN <= m <= APX_MAX. Consequently, when, for example, no word spot fits well, frm_scr[stat][j] may for some stat and j never be updated and keep its initial value. If frm_scr[stat_num-1][len_obj-1] is left at its initial value, no recognition result can be obtained.

[0022] The present invention has been made in view of the above circumstances. Its object is to provide a phoneme recognition method in a speech recognition system that improves the phoneme recognition rate by suppressing the loss of word spot accuracy and by greatly reducing the cases in which no phoneme recognition result can be output.

[0023]

Means for Solving the Problems

To achieve the above object, the first invention concerns a speech recognition system in which voice data input to a voice input unit is frequency-analyzed by a feature extraction unit to obtain a spectrum sequence; the spectrum sequence is input to a phoneme recognition unit, which outputs a phoneme sequence; the phoneme sequence is supplied to a matching unit and matched against templates in a dictionary; the cumulative distance, up to the n-th frame of the input phonemes, of the matching distances between the input phonemes and the dictionary templates is obtained; and the most similar word or word string is then output as the recognition result. The first invention is characterized in that, when the matching unit matches the input phonemes against a dictionary template and sets the optimal start at the first phoneme of the dictionary template in the word spot calculation, results from up to two frames earlier are used so as to extend the range that the optimal start can take.

[0024] The second invention is characterized in that the matching unit uses extended continuous DP matching and that, only for matching against the first frame of the template and only under specified conditions, the DP path is changed.

[0025] The third invention is characterized in that the matching unit matches the input phonemes against the dictionary templates and, when the cumulative distance calculation for each input frame is finished, checks the cumulative distance value and, if it still holds its initial value, sets an appropriate value.

[0026]

Embodiments of the Invention

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a flowchart showing a first embodiment of the present invention. The first embodiment addresses the problem described in item (1) of "Problems to be Solved by the Invention" by changing the DP path in the word spot calculation unit so that the optimal start can be selected even from frames farther from the end. The processing shown in FIG. 1 is as follows.

[0027] For every input frame j (0 <= j < len_obj), the following is performed (step S1). For every dictionary template n (0 <= n < word_num), the following is performed (step S2). For every template phoneme tpl (0 <= tpl < N(n)) of n, the following is performed (step S3). When tpl = 0 in step S4, the following processing is performed in step S5.

dist_tmp[tpl][j] = value(tpl, j)
ini_tmp[tpl][j] = j

After step S5, if j >= 2 (step S11), a DP1 path such as that shown in FIG. 2(b) is also allowed, only for matching against the first frame of the template and only under specified conditions. Step S12 then tests whether dist_tmp[tpl][j-2] < OTHER_VAL (the maximum inter-phoneme distance). If "y", the processing of step S13,

dist_tmp[tpl][j] = dist_tmp[tpl][j] + dist_tmp[tpl][j-2]
ini_tmp[tpl][j] = ini_tmp[tpl][j-2]

is performed. FIG. 2 shows the processing at the first frame of the template: FIG. 2(a) is the ordinary DP path, and FIG. 2(b) is the improved DP path used in the first embodiment.

[0028] If, on the other hand, the test in step S12 finds dist_tmp[tpl][j-2] >= OTHER_VAL and the test in step S14 finds dist_tmp[tpl][j-1] < OTHER_VAL, the processing of step S15,

dist_tmp[tpl][j] = dist_tmp[tpl][j] + dist_tmp[tpl][j-1]
ini_tmp[tpl][j] = ini_tmp[tpl][j-1]

is performed. After step S15 or step S13, step S7 shown in FIG. 8 determines whether tpl >= N(n), and depending on the result the processing from step S8 onward is performed.

[0029] When tpl >= 1 in step S4, the following processing of step S6 is performed.

dist_tmp[tpl][j] = min{ dist_tmp[tpl-1][k] } + value(tpl, j) (j-2 <= k <= j)

With min_k denoting the k that attains this minimum, ini_tmp[tpl][j] = ini_tmp[tpl-1][min_k]. Here value(tpl, j) is the inter-phoneme distance between the phoneme of the tpl-th frame of template n and the phoneme of the j-th input frame. For example, if the score is set to "0" when the phonemes match and "1" when they do not, the inter-frame distance between the j-th input phoneme a(j) and the i-th phoneme b(i) of template n is defined accordingly (0 on a match, 1 otherwise). After step S6, the determination of step S7 is made, and if "y", the following processing of step S8 is performed.

aug_dist[n][j] = dist_tmp[N(n)-1][j]
aug_ini[n][j] = ini_tmp[N(n)-1][j]

After step S8, the dictionary template n is checked in step S9 and the input frame j in step S10, and the word spot algorithm ends.
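The modified processing at the first template phoneme (steps S5 and S11 to S15) can be sketched in Python as follows. This is an illustration under assumptions: the function name `first_phoneme_row` and the 0/1 distance `zero_one` are hypothetical, indices are 0-based, and only the tpl = 0 row is computed; the rows for tpl >= 1 would follow step S6 unchanged.

```python
def zero_one(template_phoneme, input_phoneme):
    """Example inter-phoneme distance: 0 on a match, 1 otherwise."""
    return 0 if template_phoneme == input_phoneme else 1


def first_phoneme_row(first_phoneme, frames, other_val, value=zero_one):
    """Return (dist_tmp[0], ini_tmp[0]) using the extended first-frame path.

    other_val plays the role of OTHER_VAL, the maximum inter-phoneme distance.
    """
    dist0, ini0 = [], []
    for j, frame in enumerate(frames):
        d, ini = value(first_phoneme, frame), j      # step S5
        if j >= 2:                                   # step S11
            if dist0[j - 2] < other_val:             # step S12
                d += dist0[j - 2]                    # step S13
                ini = ini0[j - 2]
            elif dist0[j - 1] < other_val:           # step S14
                d += dist0[j - 1]                    # step S15
                ini = ini0[j - 1]
        dist0.append(d)
        ini0.append(ini)
    return dist0, ini0


dist0, ini0 = first_phoneme_row("a", ["a", "x", "a", "a", "x"], other_val=1)
```

In this example frame 1 is a misrecognized phoneme, yet the start recorded for frames 2 through 4 is frame 0 rather than the frame itself, which is the extension of the optimal start range that the first embodiment aims for.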

[0030] FIG. 3 is a flowchart showing a second embodiment of the present invention. The second embodiment avoids the problem described in item (2) of "Problems to be Solved by the Invention" by substituting the value considered most appropriate even when the cumulative score has not been updated. Specifically, in each frame, after the cumulative distance calculation is finished, a value is set if the cumulative distance has not been updated. The algorithm of the cumulative distance calculation unit is described below with reference to the flowchart of FIG. 3; parts identical to FIG. 9 are given the same reference symbols.

[0031] For every j (0 <= j < len_obj), the following is performed (step S1). For every state stat (0 <= stat < stat_num), the following is performed (step S2). For every dictionary template n (0 <= n < word_num), the following is performed (step S3). Compute

scr = min{ frm_scr[from_stat][aug_ini[n][j]+m-1] + apx_scr[aug_ini[n][j]+m] }

where min is the minimum taken as only m and from_stat vary. m ranges over the predetermined interval APX_MIN to APX_MAX, and from_stat must also satisfy equation (2).

[0032] apx_scr[aug_ini[n][j]+m] is obtained as follows.

apx_scr[aug_ini[n][j]+m] = aug_dist[n][j] × (j - (aug_ini[n][j]+m)) / (j - aug_ini[n][j])

This value approximates the word spot value for the start (aug_ini[n][j]+m) by multiplying the word spot value for the start (aug_ini[n][j]) by a coefficient proportional to the frame length (step S4). If frm_scr[stat][j] > scr, the following is performed (step S5).

frm_scr[stat][j] = scr, frm_tpl[stat][j] = n,
frm_frm[stat][j] = aug_ini[n][j] + (the m attaining the minimum),
frm_stt[stat][j] = (the from_stat attaining the minimum) (step S6)

After step S6, step S7 determines whether the dictionary template n is greater than word_num. If "y", the determination of step S10 is made; if "n", processing returns to step S4.

[0033] In step S10, the following determination is made. If frm_scr[stat][j] still holds its initial value MAX_VAL and frm_scr[stat][j-1] < MAX_VAL - OTHER_VAL, that is, if the result is "y", the following processing (step S11) is performed. Here OTHER_VAL is the maximum inter-phoneme distance (the maximum value that value(a, b) can take).

[0034]

frm_scr[stat][j] = frm_scr[stat][j-1] + OTHER_VAL,
frm_frm[stat][j] = frm_frm[stat][j-1],
frm_tpl[stat][j] = frm_tpl[stat][j-1],
frm_stt[stat][j] = frm_stt[stat][j-1]

When the result of step S10 is "n", and after step S11 has been performed, all states stat are checked in step S8 and then all input frames j in step S9, and the processing ends.
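The fallback of steps S10 and S11 can be sketched in Python as follows. The function name `fill_unreached`, its argument order, and the guard `j >= 1` are assumptions added for illustration; the four tables are assumed to be lists of lists indexed as [stat][j], as in the text.

```python
def fill_unreached(frm_scr, frm_frm, frm_tpl, frm_stt, stat, j, max_val, other_val):
    """If frm_scr[stat][j] was never updated, extend the entry for frame j-1.

    Returns True when the fallback was applied (the "y" branch of step S10).
    """
    if j < 1:
        return False  # assumption: there is no previous frame to copy from
    untouched = frm_scr[stat][j] == max_val
    previous_ok = frm_scr[stat][j - 1] < max_val - other_val
    if untouched and previous_ok:
        # Step S11: carry the previous result forward with a penalty.
        frm_scr[stat][j] = frm_scr[stat][j - 1] + other_val
        frm_frm[stat][j] = frm_frm[stat][j - 1]
        frm_tpl[stat][j] = frm_tpl[stat][j - 1]
        frm_stt[stat][j] = frm_stt[stat][j - 1]
        return True
    return False


MAX_VAL, OTHER_VAL = 10**6, 1
scr_tab = [[3, MAX_VAL, MAX_VAL]]
frm_tab, tpl_tab, stt_tab = [[0, None, None]], [[7, None, None]], [[0, None, None]]
for frame in (1, 2):
    fill_unreached(scr_tab, frm_tab, tpl_tab, stt_tab, 0, frame, MAX_VAL, OTHER_VAL)
```

After the loop the two untouched entries hold 4 and 5, and the template, frame, and state back-pointers are copied from frame 0, so a backtrace from the last frame still finds a result.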

[0035] Table 1 below gives experimentally obtained results: the phoneme recognition rate when the processing of the first and second embodiments is used together, and the word recognition rate when conventional extended continuous DP is used. The experimental conditions were as follows. A: the phoneme training data consisted of 101 words uttered twice by each of three speakers. B: the test data consisted of 101 words uttered once by each of the three training speakers and six additional evaluation speakers; the phoneme recognition unit was the one trained on A.

【００３６】[0036]

【表１】 [Table 1]

[0037] In the above experiment, the DP path shown in FIG. 7(b) was used. Table 1 shows that the phoneme recognition rate is significantly improved for both the training speakers and the evaluation speakers.

【００３８】[0038]

[Effects of the Invention] As described above, according to the present invention, the loss of word-spotting accuracy caused by phoneme misrecognition and the like is suppressed, which improves the phoneme recognition rate. In addition, cases in which no phoneme recognition result can be output are greatly reduced, which further improves the phoneme recognition rate.

[Brief description of the drawings]

FIG. 1 is a flowchart showing a first embodiment of the present invention.

FIG. 2 is a characteristic diagram of the normal DP path and the improved DP path used in processing at the first frame of the template.

FIG. 3 is a flowchart showing a second embodiment of the present invention.

FIG. 4 is a block diagram of a discrete word speech recognition system.

FIG. 5 is an explanatory diagram of a finite state automaton.

FIG. 6 is a flowchart showing the algorithm of a general cumulative calculation unit.

FIG. 7 is a characteristic diagram of the paths used for DP matching calculation.

FIG. 8 is a flowchart of the word spot algorithm of the extended continuous DP.

FIG. 9 is a flowchart of the cumulative calculation algorithm of the extended continuous DP.

FIG. 10 is an explanatory diagram of an example including a misrecognized phoneme.

[Explanation of symbols]

11 ... Voice input device
12 ... Voice input unit
13 ... Feature extraction unit
14 ... Phoneme recognition unit
15 ... Dictionary template
16 ... Matching unit

Claims

[Claims]

1. A phoneme recognition method in a speech recognition system in which a feature extraction unit frequency-analyzes voice data input to a voice input unit to obtain a spectrum sequence, the spectrum sequence is input to a phoneme recognition unit to obtain a phoneme sequence at its output, the phoneme sequence is supplied to a matching unit and matched against templates in a dictionary, the cumulative distance of the matching distances between the input phonemes and the dictionary template up to the n-th frame of the input phonemes is calculated, and the most similar word or word sequence is then output as a recognition result, the method being characterized in that the matching unit matches the input phonemes against the dictionary template and, when setting the optimum start point for the first phoneme of the dictionary template in the word spot calculation, uses the results up to two frames before.

2. The phoneme recognition method in a speech recognition system according to claim 1, wherein the matching unit uses extended continuous DP matching and changes the DP path, under designated conditions, only when matching against the first frame of the template.

3. A phoneme recognition method in a speech recognition system in which a feature extraction unit frequency-analyzes voice data input to a voice input unit to obtain a spectrum sequence, the spectrum sequence is input to a phoneme recognition unit to obtain a phoneme sequence at its output, the phoneme sequence is supplied to a matching unit and matched against templates in a dictionary, the cumulative distance of the matching distances between the input phonemes and the dictionary template up to the n-th frame of the input phonemes is calculated, and the most similar word or word sequence is then output as a recognition result, the method being characterized in that the matching unit matches the input phonemes against the dictionary template and, when the cumulative distance calculation for each input frame is completed, checks the value of the cumulative distance and sets an appropriate value if it still holds its initial value.