JPH0573036B2

Info

Publication number: JPH0573036B2
Application number: JP59130722A
Authority: JP
Inventors: Yoichiro Sako; Makoto Akaha; Atsunobu Hiraiwa; Masao Watari
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1984-06-25
Filing date: 1984-06-25
Publication date: 1993-10-13
Also published as: JPS619697A

Description

[Detailed description of the invention]

Industrial Field of Application

The present invention relates to a speech recognition device that recognizes speech.

Background Art and Its Problems

Speech recognition devices that cope with variations in speaking rate by means of DP matching have been proposed, for example in Japanese Unexamined Patent Publication No. 50-96104. A speech recognition device that recognizes speech by DP matching is described first.

In FIG. 1, reference numeral 1 denotes a microphone serving as the speech signal input section. The speech signal from the microphone 1 is supplied to an acoustic analysis section 2, which produces an acoustic parameter time series Pi(n). In the acoustic analysis section 2, for example, the rectified and smoothed outputs of a bandpass filter bank are obtained as the acoustic parameter time series Pi(n) (i = 1, ..., I, where I is the number of channels of the bandpass filter bank; n = 1, ..., N, where N is the number of frames cut out by speech-interval determination).

Through a mode changeover switch 3, the acoustic parameter time series Pi(n) from the acoustic analysis section 2 is stored in a standard pattern memory 4 for each word to be recognized in the registration mode, and is supplied to one input of a DP matching distance calculation section 5 in the recognition mode. In the recognition mode, the standard patterns stored in the standard pattern memory 4 are supplied to the other input of the DP matching distance calculation section 5.

The DP matching distance calculation section 5 calculates the DP matching distance between the input pattern, which consists of the acoustic parameter time series Pi(n) of the speech currently being input, and each standard pattern in the standard pattern memory 4. A distance signal indicating this DP matching distance is supplied to a minimum distance determination section 6, which determines the standard pattern whose DP matching distance to the input pattern is smallest. The recognition result indicating the input speech is obtained at an output terminal 7 from this determination.

In general, the number of frames N of the standard patterns stored in the standard pattern memory 4 differs from pattern to pattern because of variations in speaking rate and differences in word length. DP matching performs the time-axis normalization needed to cope with these variations and differences.

The DP matching process is as follows. For simplicity, the dimension corresponding to the frequency-axis index i of the acoustic parameter time series Pi(n) is omitted. The parameter time series of the standard pattern is written b_1, ..., b_N, that of the input pattern is written a_1, ..., a_M, and DP matching with an endpoint-fixed DP path is described.

FIG. 2 is a conceptual diagram of DP matching. The input parameters (M = 19) are arranged on the horizontal axis and the standard parameters (N = 12) on the vertical axis. The (M, N) lattice plane of FIG. 2 contains M × N lattice points, and one distance corresponds to each point. For example, the distance between a_3 and b_5 corresponds to the lattice point at the intersection of the vertical line drawn from a_3 and the horizontal line drawn from b_5. If the Chebyshev distance is used as the distance, for example, the Chebyshev distance d(3, 5) between a_3 and b_5 is

$$d(3,5) = \sum_{i=1}^{I} \left| P_i^{a}(3) - P_i^{b}(5) \right|$$

(here I = 1, because the dimension corresponding to the frequency-axis index i is omitted).

As the endpoint-fixed DP path, suppose that the only predecessors allowed for a lattice point (m, n) are the left lattice point (m−1, n), the lower-left diagonal lattice point (m−1, n−1), and the lower lattice point (m, n−1). Starting from the start point, that is, the point representing the Chebyshev distance D₁₁ between a_1 and b_1, and choosing among these three directions, the path that reaches the end point, that is, the point representing the Chebyshev distance d(M, N) between a_M and b_N, is sought such that the sum of the distances at the lattice points it passes through is smallest. This sum, divided by (M + N − 1), the sum of the number of input parameters M and the number of standard parameters N minus 1, is taken as the DP matching distance between the input time series a_1, ..., a_M and the standard time series b_1, ..., b_N. The initial condition and recurrence for this process are

$$g(1,1) = d(1,1)$$

$$g(m,n) = \min \begin{cases} d(m,n) + g(m-1,n) \\ 2\,d(m,n) + g(m-1,n-1) \\ d(m,n) + g(m,n-1) \end{cases}$$

and the DP matching distance D(A, B) is

$$D(A,B) = \frac{g(M,N)}{M+N-1}$$

(g(M, N) is divided by (M + N − 1) to correct for differences in the distance value caused by differences in the number of frames N of the standard patterns.)

With this process, when there are L standard patterns, L DP matching distances to the input pattern are obtained, and the standard pattern giving the smallest of these L distances is taken as the recognition result. A speech recognition device based on DP matching can therefore cope with variations in speaking rate and differences in word length.
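The recurrence above can be sketched in a few lines of Python. This is a minimal illustration for one-dimensional parameter sequences (I = 1), not the patent's hardware implementation; the function names `dp_matching_distance` and `recognize` and the dictionary of templates are hypothetical names introduced here.

```python
def dp_matching_distance(a, b):
    """DP matching distance D(A, B) for 1-D sequences a (input) and b (standard).

    Uses d(m, n) = |a_m - b_n|, the recurrence
    g(m, n) = min(d + g(m-1, n), 2*d + g(m-1, n-1), d + g(m, n-1)),
    and the normalization g(M, N) / (M + N - 1).
    """
    m_len, n_len = len(a), len(b)
    inf = float("inf")
    g = [[inf] * n_len for _ in range(m_len)]
    for m in range(m_len):
        for n in range(n_len):
            d = abs(a[m] - b[n])
            if m == 0 and n == 0:
                g[m][n] = d  # initial condition g(1, 1) = d(1, 1)
                continue
            best = inf
            if m > 0:
                best = min(best, d + g[m - 1][n])          # from the left
            if m > 0 and n > 0:
                best = min(best, 2 * d + g[m - 1][n - 1])  # from the lower left
            if n > 0:
                best = min(best, d + g[m][n - 1])          # from below
            g[m][n] = best
    return g[m_len - 1][n_len - 1] / (m_len + n_len - 1)


def recognize(a, templates):
    """Return the label of the standard pattern with the smallest distance."""
    return min(templates, key=lambda label: dp_matching_distance(a, templates[label]))
```

Each of the L standard patterns costs one M × N table, which is why the computation grows with the vocabulary size.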
That is, speech recognition with time-axis normalization can be performed. However, it has become clear that in speech recognition by DP matching the stationary parts of the speech weigh heavily in the DP matching distance, so that misrecognition readily occurs between words that are partially similar.

The acoustic parameter time series Pi(n) can be regarded as tracing a trajectory in its parameter space. Strictly speaking, the parameters of each frame n correspond to one point in the parameter space, so the series is a sequence of points; but joining the points by a curve in time order gives a single trajectory from the start point to the end point. For example, when the two words "SAN" and "HAI" are registered, their standard patterns A′ and B′ trace trajectories that pass through the phoneme regions "S", "A", "N" and "H", "A", "I", respectively, as shown in FIG. 3. When "SAN" is uttered in the recognition mode, the input pattern A as a whole has very little in common with standard pattern B′. Nevertheless, the "A" portion of "SAN" in input pattern A may be more similar to the "A" portion of "HAI" in standard pattern B′ than to the "A" portion of "SAN" in standard pattern A′, and that portion (a quasi-stationary part) may contain many points.

The case in which DP matching leads to misrecognition, where the parameters of input pattern A resemble those of standard pattern A′ overall but those of standard pattern B′ in part, as in FIG. 3, is explained using one-dimensional parameters. As one-dimensional parameter time series with the same relationship as the partially similar words of FIG. 3, consider:

input pattern A (FIG. 4): 2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8
standard pattern A′ (FIG. 5): 3, 5, 7, 9, 9, 9, 9, 7, 5, 5, 7, 9
standard pattern B′ (FIG. 6): 7, 6, 6, 8, 8, 8, 8, 6, 4, 4, 4

As is clear from the patterns of FIGS. 4 to 6, input pattern A is a pattern that should be recognized as standard pattern A′.
However, when the DP matching distances from input pattern A to standard patterns A′ and B′ are calculated, input pattern A turns out to be closer to standard pattern B′.

For the DP matching of standard pattern A′ against input pattern A, as in FIG. 2, the parameter time series of input pattern A (2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8) is arranged on the horizontal axis and that of standard pattern A′ (3, 5, 7, 9, 9, 9, 9, 7, 5, 5, 7, 9) on the vertical axis, as shown in FIG. 7, and the Chebyshev distance between each parameter of input pattern A and each parameter of standard pattern A′ is obtained at the corresponding lattice point. The start point is the point d(1, 1) = 1, the Chebyshev distance between the first parameter 2 of input pattern A and the first parameter 3 of standard pattern A′. The end point is the point d(13, 12) = 1, the Chebyshev distance between the 13th parameter 8 of input pattern A and the 12th parameter 9 of standard pattern A′. As in FIG. 2, the DP path is allowed to reach any point from the point to its left, the point below it, or the point diagonally to its lower left (the path is shown by solid arrows). The points on the path are d(1, 1), d(2, 2), d(3, 3), d(4, 4), d(5, 5), d(6, 6), d(7, 7), d(8, 8), d(9, 9), d(10, 10), d(11, 10), d(12, 10), d(13, 11), d(13, 12), 14 points in all. The sum of the distances is 24, so the DP matching distance D(A, A′) is 24 / (13 + 12 − 1) = 1.

The DP matching of standard pattern B′ against input pattern A is performed in the same way, as shown in FIG. 8. The Chebyshev distances between the parameters of input pattern A (2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8) and those of standard pattern B′ (7, 6, 6, 8, 8, 8, 8, 6, 4, 4, 4) are obtained, and the DP path is again allowed to reach any point from the point to its left, the point below it, or the point diagonally to its lower left (the path is shown by solid arrows). The points on the path are d(1, 1), d(2, 2), d(3, 3), d(4, 4), d(5, 5), d(6, 6), d(7, 7), d(8, 8), d(9, 9), d(10, 10), d(11, 11), d(12, 11), d(13, 11), 13 points in all. The sum of the distances is 15, so the DP matching distance D(A, B′) is 15 / (13 + 11 − 1) ≈ 0.65.

As these results with the three-direction DP path show, input pattern A is judged to be standard pattern B′, whose DP matching distance is smaller, and the desired result is not obtained. DP matching is thus prone to misrecognition between words that are partially similar.

Furthermore, in DP matching the number of frames N of the standard patterns is not fixed, as noted above, and every standard pattern must be DP-matched against the input pattern. As the vocabulary grows, the amount of computation increases sharply, which poses problems in terms of the storage capacity of the standard pattern memory 4 and the amount of computation.

For this reason, the device shown in FIG. 9 has been considered as a speech recognition device that misrecognizes relatively rarely even between partially similar words and that requires relatively little storage capacity in the standard pattern memory 4 and relatively little computation.

In FIG. 9, reference numeral 1 denotes a microphone serving as the speech signal input section. The speech signal from the microphone 1 is supplied to an amplifier 8 of the acoustic analysis section 2. The output of the amplifier 8 is supplied through a low-pass filter 9 with a cutoff frequency of 5.5 kHz to a 12-bit A/D converter 10 with a sampling frequency of 12.5 kHz, and the digital speech signal from the A/D converter 10 is supplied to a 15-channel digital bandpass filter bank 11A, 11B, ..., 11O. The 15 digital bandpass filters 11A, 11B, ..., 11O are, for example, fourth-order Butterworth digital filters, allocated so that the bands from 250 Hz to 5.5 kHz are equally spaced on a logarithmic axis. The output signals of the digital bandpass filters 11A, 11B, ..., 11O are supplied to 15 rectifiers 12A, 12B, ..., 12O, respectively, and the squared outputs of the rectifiers 12A, 12B, ..., 12O are supplied to 15 digital low-pass filters 13A, 13B, ..., 13O, respectively. The digital low-pass filters 13A, 13B, ..., 13O are FIR (finite impulse response) low-pass filters with a cutoff frequency of 52.8 Hz.

The output signals of the digital low-pass filters 13A, 13B, ..., 13O are supplied to a sampler 14 with a sampling period of 5.12 ms. The sampler 14 samples these output signals every frame period of 5.12 ms, and its sampled signal is supplied to a sound source information normalizer 15. The sound source information normalizer 15 removes speaker-dependent differences in the glottal source characteristics of the speech to be recognized.

Specifically, the sampled signal Ai(n) (i = 1, ..., 15; n: frame number) supplied from the sampler 14 every frame period is log-transformed as

$$A'_i(n) = \log\left(A_i(n) + B\right) \tag{1}$$

In equation (1), B is a bias set to a value large enough to mask the noise level. The glottal source characteristic is then approximated by y_i = a·i + b. The coefficients a and b are determined by the following equations.

[Equations (2) and (3), which determine a and b: formula images not reproduced in this text]
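As a rough illustration of this normalization step, the sketch below applies the log transform of equation (1) and then the two cases that the text gives next as equations (4) and (5). The coefficients a and b are taken as inputs, because their defining equations are not reproduced here. The function names are hypothetical, and the natural logarithm is an assumption (the text does not state the base).

```python
import math


def log_transform(frame, bias):
    """Equation (1): A'_i(n) = log(A_i(n) + B) for one frame of channel outputs.

    Assumption: natural logarithm; `bias` plays the role of B.
    """
    return [math.log(x + bias) for x in frame]


def normalize_source(log_frame, a, b):
    """Remove the source tilt from one log-transformed frame.

    a < 0  : subtract the line a*i + b (equation (4)), with i = 1, ..., I.
    a >= 0 : level normalization only, subtract the channel mean (equation (5)).
    a and b are supplied by the caller; how they are computed is not shown here.
    """
    if a < 0:
        return [v - (a * i + b) for i, v in enumerate(log_frame, start=1)]
    mean = sum(log_frame) / len(log_frame)
    return [v - mean for v in log_frame]
```

For example, `normalize_source([5.0, 4.0, 3.0], -1.0, 6.0)` removes a falling tilt exactly and leaves a flat frame, while a non-negative slope only shifts the frame to zero mean.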

Let Pi(n) denote the source-normalized parameter. When a(n) < 0, the parameter Pi(n) is

$$P_i(n) = A'_i(n) - \{a(n)\cdot i + b(n)\} \tag{4}$$

When a(n) ≥ 0, only level normalization is performed, and the parameter Pi(n) is

$$P_i(n) = A'_i(n) - \frac{1}{I}\sum_{i'=1}^{I} A'_{i'}(n) \quad (I = 15) \tag{5}$$

The parameters Pi(n) whose glottal source characteristics have been normalized in this way are supplied to an intra-interval parameter memory 16. The intra-interval parameter memory 16 receives a speech-interval determination signal from a speech-interval determination section 17, described below, and stores the source-normalized parameters Pi(n) for each speech interval.

Meanwhile, the digital speech signal from the A/D converter 10 is supplied to a zero-cross counter 18 and a power calculator 19 of the speech-interval determination section 17. Every 5.12 ms, the zero-cross counter 18 counts the number of zero crossings in the 64 samples of the digital speech signal in that interval and supplies the count to the first input of a speech-interval determiner 20. Every 5.12 ms, the power calculator 19 obtains the power of the digital speech signal in that interval, that is, the sum of squares, and supplies a power signal indicating the power within the interval to the second input of the speech-interval determiner 20. In addition, the source normalization information a(n) and b(n) from the sound source information normalizer 15 is supplied to the third input of the speech-interval determiner 20. The speech-interval determiner 20 processes the zero-crossing count, the interval power, and the source normalization information a(n), b(n) jointly, classifies each interval as silence, unvoiced sound, or voiced sound, and determines the speech interval. The speech-interval determination signal indicating this speech interval is supplied, as the output of the speech-interval determination section 17, to the intra-interval parameter memory 16.

The source-normalized acoustic parameters Pi(n) stored for each speech interval in the intra-interval parameter memory 16 are supplied in time-series order to an NAT (Normalization Along Trajectory) processing section 21. As NAT processing, the NAT processing section 21 estimates the trajectory in parameter space from the acoustic parameter time series Pi(n) by linear approximation and forms a new acoustic parameter time series Qi(m) by linear interpolation along this trajectory.

The NAT processing section 21 is now described further. The acoustic parameter time series Pi(n) (i = 1, ..., I; n = 1, ..., N) draws a sequence of points in its parameter space. FIG. 10 shows an example of a point sequence distributed in a two-dimensional parameter space. As FIG. 10 shows, the points of the non-stationary parts of the speech are sparsely distributed and those of the quasi-stationary parts are densely distributed. This is evident from the fact that, if the speech were perfectly stationary, the parameters would not change and the points would stay at one place in the parameter space.

FIG. 11 shows an example in which a trajectory consisting of a smooth curve is estimated and drawn over the point sequence of FIG. 10. If a trajectory can be estimated for the point sequence as in FIG. 11, the trajectory can be regarded as almost invariant to variations in speaking rate. This is because differences in duration caused by variations in speaking rate arise mostly from temporal stretching and shrinking of the quasi-stationary parts (which corresponds to differences in point density in the quasi-stationary parts of a point sequence such as that of FIG. 10), while the effect on the duration of the non-stationary parts is considered to be small. The NAT processing section 21 performs time-axis normalization by exploiting this invariance of the trajectory to variations in speaking rate.

First, for the acoustic parameter time series Pi(n), a trajectory drawn as a continuous curve from the start point Pi(1) to the end point Pi(N) is estimated, and the curve representing this trajectory is written $\hat{P}_i(s)$ (0 ≤ s ≤ S). It is not strictly necessary that $\hat{P}_i(0)$ = Pi(1) and $\hat{P}_i(S)$ = Pi(N); basically it suffices that $\hat{P}_i(s)$ passes approximately through the whole point sequence.

Second, the trajectory length SL is obtained from the estimated $\hat{P}_i(s)$, and a new point sequence is resampled at a constant length along the trajectory, as shown by the circles in FIG. 12. For example, when resampling to M points, the trajectory is resampled at the constant resampling interval T = SL / (M − 1). If the resampled point sequence is written Qi(m) (i = 1, ..., I; m = 1, ..., M), then Qi(1) = $\hat{P}_i(0)$ and Qi(M) = $\hat{P}_i(S)$.

The new parameter time series Qi(m) obtained in this way carries the basic information of the trajectory and is almost invariant to variations in speaking rate. In other words, the new parameter time series Qi(m) is a time-normalized parameter time series.

For this processing, the acoustic parameter time series Pi(n) from the intra-interval parameter memory 16 is supplied to a trajectory length calculator 22. The trajectory length calculator 22 calculates the length of the trajectory that the acoustic parameter time series Pi(n) draws in its parameter space under linear approximation, that is, the trajectory length. Here, the distance between I-dimensional vectors a_i and b_i is, for example, the Euclidean distance D(a_i, b_i):

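$$D(a_i, b_i) = \sqrt{\sum_{i=1}^{I} (a_i - b_i)^2} \tag{6}$$

The Chebyshev distance, the squared distance, or the like may also be used as this distance. When the trajectory is estimated by linear approximation from the I-dimensional acoustic parameter time series Pi(n) (i = 1, ..., I; n = 1, ..., N), the distance S(n) between parameters adjacent in the time-series direction is

$$S(n) = D\left(P_i(n+1), P_i(n)\right) \quad (n = 1, \ldots, N-1) \tag{7}$$

The distance SL(n) from the first parameter Pi(1) to the n-th parameter Pi(n) in the time-series direction is

$$SL(n) = \sum_{n'=1}^{n-1} S(n') \tag{8}$$

where SL(1) = 0. The trajectory length SL is

$$SL = SL(N) = \sum_{n'=1}^{N-1} S(n') \tag{9}$$

The trajectory length calculator 22 performs the signal processing of equations (7), (8), and (9).

The trajectory length signal indicating the trajectory length SL is supplied from the trajectory length calculator 22 to an interpolation interval calculator 23. The interpolation interval calculator 23 calculates the constant resampling interval T used to resample a new point sequence by linear interpolation along the trajectory. When resampling to M points, the resampling interval T is

$$T = \frac{SL}{M-1} \tag{10}$$

The interpolation interval calculator 23 performs the signal processing of equation (10).

The resampling interval signal indicating T is supplied from the interpolation interval calculator 23 to one input of an interpolation point extractor 24, and the acoustic parameter time series Pi(n) from the intra-interval parameter memory 16 is supplied to its other input. The interpolation point extractor 24 resamples a new point sequence at the resampling interval T along the trajectory of Pi(n) in its parameter space, for example the trajectory obtained by linear approximation between the parameters, and forms the new acoustic parameter time series Qi(m) from this new point sequence.

The signal processing in the interpolation point extractor 24 is described with reference to the flowchart of FIG. 13.

- Block 24a: the variable J, the index of the resampling point in the time-series direction, is set to 1, and the variable IC, the index into the acoustic parameter time series Pi(n) in the time-series direction, is set to 1.
- Block 24b: J is incremented.
- Block 24c: whether J is at most (M − 1) is tested, to decide whether the resampling index has passed the last index that needs resampling. If it has, the processing of the interpolation point extractor 24 ends.
- Block 24d: otherwise, the resample distance DL from the first resampling point to the J-th resampling point is calculated.
- Block 24e: IC is incremented.
- Block 24f: whether DL is smaller than SL(IC), the distance from the first parameter Pi(1) to the IC-th parameter Pi(IC), is tested, that is, whether the resampling point lies on the start side of Pi(IC) on the trajectory. If it does not, IC is incremented again in block 24e and the comparison in block 24f is repeated.
- Block 24g: once the resampling point is judged to lie on the start side of Pi(IC), the new acoustic parameter Qi(J) along the trajectory is formed by resampling.

In block 24g, the distance SL(IC − 1) to the (IC − 1)-th parameter Pi(IC − 1), which lies on the start side of the J-th resampling point, is first subtracted from the resample distance DL of the J-th resampling point to obtain the distance SS from Pi(IC − 1) to the J-th resampling point. Next, SS is divided by the distance S(IC − 1) between the parameters Pi(IC − 1) and Pi(IC) that lie on either side of the J-th resampling point on the trajectory (S(IC − 1) is obtained by the signal processing of equation (7)). The quotient SS / S(IC − 1) is multiplied by the difference (Pi(IC) − Pi(IC − 1)) to give the interpolation amount from Pi(IC − 1), and Pi(IC − 1) is added to this interpolation amount to form the new acoustic parameter along the trajectory:

$$Q_i(J) = P_i(IC-1) + \left(P_i(IC) - P_i(IC-1)\right)\cdot\frac{SS}{S(IC-1)}, \qquad SS = DL - SL(IC-1)$$

FIG. 14 shows an example in which a trajectory is estimated for a two-dimensional acoustic parameter time series P(1), P(2), ..., P(8) by linear approximation between the parameters, and a new acoustic parameter time series of six points Q(1), Q(2), ..., Q(6) is formed by linear interpolation along this trajectory. In block 24g the signal processing is performed for all I dimensions in the frequency-series direction (i = 1, ..., I).

In this way, blocks 24b to 24g resample the (M − 2) points other than the start point and the end point (these are Qi(1) = $\hat{P}_i(0)$ and Qi(M) = $\hat{P}_i(S)$), and the new acoustic parameter time series Qi(m) is formed.

Through the mode changeover switch 3, the new acoustic parameter time series Qi(m) from the NAT processing section 21 is stored in the standard pattern memory 4 for each word to be recognized in the registration mode, and is supplied to one input of a Chebyshev distance calculation section 25 in the recognition mode. In the recognition mode, the standard patterns stored in the standard pattern memory 4 are supplied to the other input of the Chebyshev distance calculation section 25. The Chebyshev distance calculation section 25 calculates the Chebyshev distance between the input pattern, which consists of the time-normalized new acoustic parameter time series Qi(m) of the speech currently being input, and each standard pattern in the standard pattern memory 4.

The distance signal indicating this Chebyshev distance is supplied to the minimum distance determination section 6, which determines the standard pattern whose Chebyshev distance to the input pattern is smallest, and the recognition result indicating the input speech is supplied to the output terminal 7 from this determination.

The operation of the speech recognition device configured in this way is as follows.

The speech signal from the microphone 1 is converted by the acoustic analysis section 2 into the source-normalized acoustic parameter time series Pi(n) for each speech interval. This time series is supplied to the NAT processing section 21, which estimates its trajectory in parameter space by linear approximation and forms, by linear interpolation along the trajectory, the time-normalized new acoustic parameter time series Qi(m). In the registration mode, Qi(m) is stored in the standard pattern memory 4 through the mode changeover switch 3.

In the recognition mode, Qi(m) from the NAT processing section 21 is supplied through the mode changeover switch 3 to the Chebyshev distance calculation section 25, together with the standard patterns from the standard pattern memory 4. FIGS. 15 to 17 show the one-dimensional patterns of FIGS. 4 to 6 after the NAT processing section 21 has estimated their trajectories by linear approximation and resampled them to 8 points:

Pattern | Original time series (FIGS. 4 to 6) | After NAT processing, 8 points (FIGS. 15 to 17)
Input pattern A | 2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8 | 2, 4, 6, 8, 6, 4, 6, 8
Standard pattern A′ | 3, 5, 7, 9, 9, 9, 9, 7, 5, 5, 7, 9 | 3, 5, 7, 9, 7, 5, 7, 9
Standard pattern B′ | 7, 6, 6, 8, 8, 8, 8, 6, 4, 4, 4 | 7, 6, 7, 8, 7, 6, 5, 4

Because the trajectory in parameter space is estimated from the acoustic parameter time series Pi(n) and the new time series Qi(m) is formed along it, the time-axis normalization is carried out by the acoustic parameter time series Pi(n) of the input speech itself. The Chebyshev distance calculation section 25 calculates a Chebyshev distance of 8 between input pattern A and standard pattern A′ and a Chebyshev distance of 16 between input pattern A and standard pattern B′. Distance signals indicating 8 and 16 are supplied to the minimum distance determination section 6. Because 8 is smaller than 16, input pattern A is judged to be standard pattern A′, and a recognition result indicating that the input speech is standard pattern A′ is obtained at the output terminal 7. Speech recognition with relatively few misrecognitions is therefore possible even between words that are partially similar.

The difference in the amount of computation between a speech recognition device using NAT processing and one using DP matching is as follows.

Let α be the average amount of computation per standard pattern in the DP matching distance calculation section 5, β the average amount of computation per standard pattern in the Chebyshev distance calculation section 25, and γ the average amount of computation of the NAT processing section 21. The amount of computation C₁ for DP matching against J standard patterns is

$$C_1 = \alpha \cdot J \tag{11}$$

and the amount of computation C₂ with NAT processing for J standard patterns is

$$C_2 = \beta \cdot J + \gamma \tag{12}$$

In general α ≫ β. Consequently, as soon as

$$J \gg \frac{\gamma}{\alpha - \beta} \tag{13}$$

holds, that is, as the number of words to be recognized increases, C₁ ≫ C₂. A speech recognition device using NAT processing can thus greatly reduce the amount of computation.

In addition, the new acoustic parameter time series Qi(m) obtained from the NAT processing section 21 can be set to a fixed number of parameters in the time-series direction, so the storage area of the standard pattern memory 4 can be used efficiently and its storage capacity can be kept relatively small.

In general, however, the end of an utterance contains a fluctuation y, as shown in FIG. 18. This fluctuation y is difficult to detect and remove by speech-interval determination. Matching has therefore conventionally been performed with the fluctuation y included, and it has become clear that this fluctuation lowers the recognition rate.

Object of the Invention

In view of the above, the object of the present invention is to obtain a device with a relatively high recognition rate by making the influence of the fluctuation at the end of an utterance relatively small.

Summary of the Invention

The present invention is a speech recognition device that has a speech signal input section, supplies the speech signal from this input section to an acoustic analysis section, supplies the acoustic parameter time series from the acoustic analysis section to an NAT processing section, estimates in the NAT processing section the trajectory in parameter space from the acoustic parameter time series, forms a new acoustic parameter time series along this trajectory, and recognizes speech by processing this new acoustic parameter time series, wherein the NAT processing section removes a predetermined amount of the portion corresponding to the end of the utterance when forming the new acoustic parameter time series. The speech recognition device of the invention has the advantage that the influence of the fluctuation at the end of an utterance is relatively small and a relatively high recognition rate is obtained.

Embodiment

An embodiment of the speech recognition device of the present invention is described below with reference to FIGS. 19 and 20. In FIGS. 19 and 20, parts corresponding to those of FIGS. 1 to 18 are given the same reference numerals and their detailed description is omitted.

In this example, as shown in FIG. 19, the interpolation point extractor 24 of the NAT processing section 21 resamples a new point sequence at the resampling interval T along the trajectory of the acoustic parameter time series Pi(n) in its parameter space, for example the trajectory obtained by linear approximation between the parameters. It then forms the new acoustic parameter time series Qi(m) (i = 1, ..., I; m = 1, ..., M − 2) by excluding from this new point sequence the points corresponding to the end of the utterance, for example two points (the number is determined from statistics of the fluctuation at the end of utterances).

The signal processing in the interpolation point extractor 24 is described with reference to the flowchart of FIG. 19.

- Block 24h: the variable J, the index of the resampling point in the time-series direction, is set to 1, and the variable IC, the index into the acoustic parameter time series Pi(n) in the time-series direction, is set to 1.
- Block 24i: J is incremented.
- Block 24j: whether J is at most (M − 2) is tested, to decide whether the resampling index has passed the last index that needs resampling. If it has, the processing of the interpolation point extractor 24 ends.
- Block 24k: otherwise, the resample distance DL from the first resampling point to the J-th resampling point is calculated.
- Block 24l: IC is incremented.
- Block 24m: whether DL is smaller than SL(IC), the distance from the first parameter Pi(1) to the IC-th parameter Pi(IC), is tested, that is, whether the resampling point lies on the start side of Pi(IC) on the trajectory. If it does not, IC is incremented again in block 24l and the comparison in block 24m is repeated.
- Block 24n: once the resampling point is judged to lie on the start side of Pi(IC), the new acoustic parameter Qi(J) along the trajectory is formed by resampling.

In block 24n, the distance SL(IC − 1) to the (IC − 1)-th parameter Pi(IC − 1), which lies on the start side of the J-th resampling point, is first subtracted from the resample distance DL of the J-th resampling point to obtain the distance SS from Pi(IC − 1) to the J-th resampling point. Next, SS is divided by the distance S(IC − 1) between the parameters Pi(IC − 1) and Pi(IC) that lie on either side of the J-th resampling point on the trajectory (S(IC − 1) is obtained by the signal processing of equation (7)). The quotient SS / S(IC − 1) is multiplied by the difference (Pi(IC) − Pi(IC − 1)) to give the interpolation amount from Pi(IC − 1), and Pi(IC − 1) is added to this interpolation amount to form the new acoustic parameter Qi(J) along the trajectory:

$$Q_i(J) = P_i(IC-1) + \left(P_i(IC) - P_i(IC-1)\right)\cdot\frac{SS}{S(IC-1)}, \qquad SS = DL - SL(IC-1)$$

Thereafter, blocks 24i to 24n perform the calculation for all points other than the start point (this is Qi(1) = $\hat{P}_i(0)$, which is obtained by substitution and needs no calculation) and the predetermined number of new acoustic parameters corresponding to the end of the utterance, for example the two parameters Qi(M − 1) and Qi(M), and the new acoustic parameter time series Qi(m) of (M − 2) points (i = 1, ..., I; m = 1, ..., M − 2) is formed.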
尚、このブロツク２４ｎにおいては周波数系列方
向にＩ次元分の信号処理が行なわれる。このNAT処理部２１の音声の語尾に対応する
パラメータを除いた新たな音響パラメータ時系列
Qi(m)（ｉ＝１，…，Ｉ；ｍ＝１，…，Ｍ−２）
をモード切換スイツチ３に供給する。その他音響
分析部２、標準パターンメモリ４、チエビシエフ
距離算出部２５、最小距離判定部６等は上述第９
図に示す音声認識装置と同様に構成する。この様にしてなる音声認識装置の動作について
説明する。マイクロホン１の音声信号が音響分析部２にて
音声区間毎に声帯音源特性の正規化された音響パ
ラメータ時系列Pi(n)に変換され、この音響パラメ
ータ時系列Pi(n)がNAT処理部２１に供給され、
このNAT処理部２１にて音響パラメータ時系列
Pi(n)からそのパラメータ空間における直線近似に
よる軌跡が推定され、この軌跡に沿つて音声の語
尾を対応する音響パラメータQi（Ｍ−１）及びQi
（Ｍ）を除く新たな音響パラメータ時系列Qi(m)
（ｉ＝１，…，Ｉ；ｍ＝１，…，Ｍ−２）が形成
される。音声の語尾のゆらぎは軌跡上において後
方に比較的短い軌跡長となつて表われ、この軌跡
長は発声の違いによる変動は極めて小さく、軌跡
の後方の所定の軌跡長（これは統計的に得られ
る。）を除くことにより音声の語尾のゆらぎだけ
を除くことができる。従つて、音声の語尾に対応
する音響パラメータQi（Ｍ−１）及びQi（Ｍ）を
除く新たな音響パラメータ時系列Qi(m)（ｉ＝１，
…，Ｉ；ｍ＝１，…，Ｍ−２）は音声の語尾のゆ
らぎの影響が比較的少ないものとなる。第２０図
に２次元にてこの様な状態を示す。この第２０図
において、11点の・印により音響パラメータ時系
列Ｐ(1)，…，Ｐ(11)を示し、これら・印間の実線
により直線近似による軌跡を示し、９点の○印に
より新たな音響パラメータ時系列Ｑ(1)，…，Ｑ(9)
を示す。この第２０図より明らかな様に軌跡の後
方に位置する音響パラメータＰ(9)，Ｐ(10)，Ｐ(11)
は軌跡上において比較的短い軌跡長をとり、この
部分は音声の語尾のゆらぎを表わし、この分に対
応する新たな音響パラメータＱ(9)を除く新たな音
響パラメータ時系列Ｑ(1)，…，Ｑ(8)は音声の語尾
のゆらぎの影響の比較的少ないものとなる。このNAT処理部２１の新たな音響パラメータ
時系列Qi(m)（ｉ＝１，…，Ｉ；ｍ＝１，…，Ｍ
−２）が標準パターンとして登録モードにおいて
はモード切換スイツチ３を介して標準パターンメ
モリ４に格納される。又、認識モードにおいては、NAT処理部２１
の新たな音響パラメータ時系列Qi(m)（ｉ＝１，
…，Ｉ；ｍ＝１，…，Ｍ−２）が入力パターンと
してモード切換スイツチ３を介してチエビシエフ
距離算出部２５に供給されると共に標準パターン
メモリ４の標準パターンがチエビシエフ距離算出
部２５に供給され、入力パターンと標準パターン
とのチエビシエフ距離が算出され、最小距離判定
部６にてこのチエビシエフ距離が最小となる標準
パターンが判定され、この標準パターンが入力音
声を示す認識結果として出力端子７に得られる。
この場合、標準パターン及び入力パターンが音声
の語尾のゆらぎの影響の比較的少ない新たな音響
パラメータ時系列Qi(m)（ｉ＝１，…，Ｉ；ｍ＝
１，…，Ｍ−２）よりなるので、音声の語尾のゆ
らぎによる影響が比較的少なく、比較的高い認識
率が得られる。以上述べた如く本例に依れば音声信号入力部と
してのマイクロホン１を有し、この音声信号入力
部１の音声信号を音響分析部２に供給し、この音
響分析部の音響パラメータ時系列Pi(n)をNAT処
理部２１に供給し、このNAT処理部２１にて音
響パラメータ時系列Pi(n)からそのパラメータ空間
における軌跡を推定し、この軌跡に沿つて新たな
音響パラメータ時系列Qi(m)（ｉ＝１，…，Ｉ；
ｍ＝１，…，Ｍ）を形成し、NAT処理部２１の
この新たな音響パラメータ時系列Qi(m)を処理す
ることにより音声を認識するようにした音声認識
装置において、NAT処理部２１にて音声の語尾
に対応する２つの新たな音響パラメータQi（Ｍ−
１）及びQi（Ｍ）を除去して新たな音響パラメー
タ時系列Qi(m)（ｉ＝１，…，Ｉ；ｍ＝１，…，
Ｍ−２）を形成するようにした為、音声の語尾を
ゆらぎによる影響が比較的少なく、認識率の比較
的高いものを得ることができる利益がある。又、NAT処理部２１にて形成する新たな音響
パラメータ時系列Qi(m)（ｉ＝１，…，Ｉ；ｍ＝
１，…，Ｍ−２）が音声の語尾に対応するパラメ
ータ分だけ少ないパラメータ数にできるので、そ
の分だけ標準パターンメモリ４の記憶容量を少な
くできる利益がある。尚、上述実施例においては音声の語尾に対応し
て２つのリサンプリング点を除いて新たな音響パ
ラメータ時系列Qi(m)（ｉ＝１，…，Ｉ；ｍ＝１，
…，Ｍ−２）を形成した場合について述べたけれ
ども、音声の語尾のゆらぎの統計に応じて１つ等
適宜な数のリサンプリング点を除いて新たな音響
パラメータ時系列を形成しても上述実施例と同様
の作用効果を得ることができることは容易に理解
できよう。又、上述実施例においては音響パラメ
ータ時系列Pi(n)からそのパラメータ空間における
軌跡を推定し、この軌跡に沿つて新たな点列をリ
サンプリングし、この新たな点列のうちの音声の
語尾に対応する点列を除く点列より新たな音響パ
ラメータ時系列を形成するようにした場合につい
て述べたけれども、音響分析部２の音響パラメー
タ時系列Pi(n)からそのパラメータ空間における軌
跡を推定し、この軌跡上における音声の語尾のゆ
らぎに対応する部分の統計により得られる軌跡長
を軌跡の終端側から除き、この音声の語尾のゆら
ぎに対応する軌跡長を除いた軌跡に沿つて新たな
音響パラメータ時系列を形成するようにしても上
述実施例と同様の作用効果を得ることができるこ
とは容易に理解できよう。又、上述実施例におい
ては音響パラメータ時系列からそのパラメータ空
間における直線近似による軌跡の軌跡長を算出す
るようにした場合について述べたけれども、円弧
近似、スプライン近似等による軌跡の軌跡長を算
出するようにしても上述実施例と同様の作用効果
を得ることができることは容易に理解できよう。
更に、本発明は上述実施例に限らず本発明の要旨
を逸脱することなくその他種々の構成を取り得る
ことは勿論である。発明の効果本発明音声認識装置に依れば、音声信号入力部
を有し、この音声信号入力部の音声信号を音響分
析部に供給し、この音響分析部の音響パラメータ
時系列をNAT処理部に供給し、このNAT処理
部にて音響パラメータ時系列からそのパラメータ
空間における軌跡を推定し、この軌跡に沿つて新
たな音響パラメータ時系列を形成し、NAT処理
部のこの新たな音響パラメータ時系列を処理する
ことにより音声を認識するようにした音声認識装
置において、NAT処理部にて音声の語尾に対応
する部分を所定量除去して新たな音響パラメータ
時系列を形成するようにした為、音声の語尾のゆ
らぎによる影響が比較的少なく認識率の比較的高
いものを得ることができる利益がある。

It is [ ]. Incidentally, the Chebyshev distance, the squared distance, or the like can be used as this distance. Therefore, for the I-dimensional acoustic parameter time series Pi(n) (i=1, …, I; n=1, …, N), the distance S(n) between parameters adjacent in the time-series direction, when the trajectory is estimated by linear approximation, is

$$S(n) = D\bigl(P_i(n+1),\ P_i(n)\bigr) \qquad (n = 1, \dots, N-1) \tag{7}$$

The distance SL(n) from the first parameter Pi(1) to the n-th parameter Pi(n) in the time-series direction is then

$$SL(n) = \sum_{n'=1}^{n-1} S(n') \tag{8}$$

where SL(1)=0. Furthermore, the trajectory length SL is

$$SL = SL(N) = \sum_{n'=1}^{N-1} S(n') \tag{9}$$

The trajectory length calculator 22 performs the signal processing of equations (7), (8) and (9). A trajectory length signal indicating the trajectory length SL is supplied from the trajectory length calculator 22 to the interpolation interval calculator 23, which calculates the constant resampling interval T used to resample a new point sequence by linear interpolation along the trajectory. When M points are resampled, the resampling interval T is

$$T = \frac{SL}{M-1} \tag{10}$$

The interpolation interval calculator 23 performs the signal processing of equation (10). A resampling interval signal indicating T is supplied from the interpolation interval calculator 23 to one input of the interpolation point extractor 24, and the acoustic parameter time series Pi(n) from the in-speech-interval parameter memory 16 is supplied to its other input. The interpolation point extractor 24 resamples a new point sequence at the resampling interval T along the trajectory of Pi(n) in its parameter space, for example the trajectory obtained by linear approximation between parameters, and forms the new acoustic parameter time series Qi(m) from this point sequence.

The signal processing in the interpolation point extractor 24 is explained with reference to the flowchart of FIG. 13. First, in block 24a, the variable J, which indicates the number of the resampling point in the time-series direction, is set to 1, and the variable IC, which indicates the number of the parameter of Pi(n) in the time-series direction, is also set to 1. In block 24b the variable J is incremented, and in block 24c it is tested whether the variable J is at most (M-1), that is, whether the number of the current resampling point is still within the range that needs resampling.
If the last number that needs resampling has been passed, the signal processing of the interpolation point extractor 24 ends. Otherwise, in block 24d, the resampling distance DL from the first resampling point to the J-th resampling point is calculated, and in block 24e the variable IC is incremented. In block 24f, the resampling distance DL is compared with the distance SL(IC) from the first parameter Pi(1) to the IC-th parameter Pi(IC) of the acoustic parameter time series Pi(n); DL being smaller than SL(IC) means that the current resampling point lies on the trajectory on the start side of the parameter Pi(IC). If it does not, the variable IC is incremented again in block 24e and the comparison in block 24f is repeated. When the resampling point is found to lie on the start side of Pi(IC), a new acoustic parameter Qi(J) along the trajectory is formed by resampling in block 24g.

That is, first, the distance SL(IC-1) to the (IC-1)-th parameter Pi(IC-1), which lies on the start side of the J-th resampling point, is subtracted from the resampling distance DL of the J-th resampling point, giving the distance SS from Pi(IC-1) to the J-th resampling point. Next, SS is divided by the distance S(IC-1) between the parameters Pi(IC-1) and Pi(IC) that lie on either side of the J-th resampling point on the trajectory (S(IC-1) is obtained by the signal processing of equation (7)). The quotient SS/S(IC-1) is multiplied by the difference (Pi(IC)-Pi(IC-1)) between those two parameters, giving the interpolation amount (Pi(IC)-Pi(IC-1))*SS/S(IC-1) measured from Pi(IC-1). Adding this interpolation amount to Pi(IC-1) yields the new acoustic parameter Qi(J) along the trajectory.

FIG. 14 shows an example in which a trajectory is estimated by linear approximation between the parameters of a two-dimensional acoustic parameter time series P(1), P(2), …, P(8), and a new six-point acoustic parameter time series Q(1), Q(2), …, Q(6) is formed by linear interpolation along this trajectory. In block 24g, the signal processing is performed for all I dimensions (i=1, …, I) in the frequency direction.

In this way, blocks 24b to 24g resample the (M-2) points other than the start point and the end point (these are Qi(1)=Pi(1) and Qi(M)=Pi(N)), and the new acoustic parameter time series Qi(m) is formed.

The new acoustic parameter time series Qi(m) from the NAT processing unit 21 is routed by the mode changeover switch 3: in the registration mode it is stored in the standard pattern memory 4 for each word to be recognized, and in the recognition mode it is supplied to one input of the Chebyshev distance calculation unit 25. In the recognition mode, the standard patterns stored in the standard pattern memory 4 are supplied to the other input of the Chebyshev distance calculation unit 25. The Chebyshev distance calculation unit 25 calculates the Chebyshev distance between each standard pattern in the standard pattern memory 4 and the input pattern, which consists of the time-normalized new acoustic parameter time series Qi(m) of the speech currently being input. A distance signal indicating this Chebyshev distance is supplied to the minimum distance determination unit 6, which determines the standard pattern whose Chebyshev distance to the input pattern is smallest; from this determination, a recognition result indicating the input speech is obtained at the output terminal 7.

The operation of the speech recognition device constructed in this way is as follows. The speech signal from the microphone 1 is converted in the acoustic analysis unit 2, for each speech section, into the acoustic parameter time series Pi(n) in which the vocal cord sound source characteristics are normalized, and Pi(n) is supplied to the NAT processing unit 21. In the NAT processing unit 21, a trajectory by linear approximation is estimated from Pi(n) in its parameter space, and the time-normalized new acoustic parameter time series Qi(m) is formed by linear interpolation along this trajectory. In the registration mode, Qi(m) is stored in the standard pattern memory 4 via the mode changeover switch 3.
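As an illustration, the NAT processing of equations (7) to (10) and the resampling loop of blocks 24a to 24g could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the device: the names `frame_distance` and `nat_resample` are hypothetical, each frame is a plain list of I values, M is assumed to be at least 2, and the distance D is assumed to be the sum of absolute differences that this description calls the Chebyshev distance.

```python
def frame_distance(a, b):
    """D in equation (7): sum of absolute differences over the I dimensions (assumed)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nat_resample(p, m):
    """Resample the frame sequence p (N frames of I values each) to m points
    spaced evenly along its piecewise-linear trajectory."""
    s = [frame_distance(p[n + 1], p[n]) for n in range(len(p) - 1)]  # eq. (7)
    sl = [0.0]                                   # eq. (8), with SL(1) = 0
    for d in s:
        sl.append(sl[-1] + d)
    t = sl[-1] / (m - 1)                         # eq. (10); sl[-1] is SL of eq. (9)

    q = [list(p[0])]                             # Q(1) = P(1), by substitution
    ic = 1                                       # 0-based index of the frame ahead of the point
    for j in range(1, m - 1):                    # interior points Q(2) .. Q(M-1)
        dl = t * j                               # resampling distance DL
        while ic < len(p) - 1 and dl >= sl[ic]:  # advance IC until DL < SL(IC)
            ic += 1
        ss = dl - sl[ic - 1]                     # distance past frame IC-1
        ratio = ss / s[ic - 1] if s[ic - 1] > 0 else 0.0
        q.append([a + (b - a) * ratio for a, b in zip(p[ic - 1], p[ic])])
    q.append(list(p[-1]))                        # Q(M) = P(N), by substitution
    return q


# One-dimensional input pattern A used in this description, resampled at 8 points.
pattern_a = [[v] for v in (2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8)]
print([frame[0] for frame in nat_resample(pattern_a, 8)])
```

With M=8 the sketch yields the values 2, 4, 6, 8, 6, 4, 6, 8 for input pattern A, which agrees with the resampled series given for that pattern in this description. The guard on zero-length segments is an addition of the sketch; the flowchart itself does not mention it.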
In the recognition mode, the new acoustic parameter time series Qi(m) from the NAT processing unit 21 is supplied to the Chebyshev distance calculation unit 25 via the mode changeover switch 3, and the standard patterns in the standard pattern memory 4 are also supplied to the Chebyshev distance calculation unit 25.

FIGS. 15 to 17 show the one-dimensional patterns of FIGS. 4 to 6 after the NAT processing unit 21 has estimated their trajectories by linear approximation and resampled them at 8 points. Before processing, the parameter time series are:

input pattern A: 2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8
standard pattern A': 3, 5, 7, 9, 9, 9, 9, 7, 5, 5, 7, 9
standard pattern B': 7, 6, 6, 8, 8, 8, 8, 6, 4, 4, 4

After processing they are:

input pattern A: 2, 4, 6, 8, 6, 4, 6, 8
standard pattern A': 3, 5, 7, 9, 7, 5, 7, 9
standard pattern B': 7, 6, 7, 8, 7, 6, 5, 4

In this case, the trajectory in the parameter space is estimated from the acoustic parameter time series Pi(n), and the new acoustic parameter time series Qi(m) is formed along this trajectory, so the time axis is normalized by the acoustic parameter time series Pi(n) of the input speech itself. The Chebyshev distance calculation unit 25 calculates a Chebyshev distance of 8 between the input pattern A and the standard pattern A', and a Chebyshev distance of 16 between the input pattern A and the standard pattern B'. Distance signals indicating the distances 8 and 16 are supplied to the minimum distance determination unit 6. Because the distance 8 is smaller than the distance 16, the minimum distance determination unit 6 determines that the input pattern A is the standard pattern A', and a recognition result indicating that the input speech is the standard pattern A' is obtained at the output terminal 7. Speech recognition can therefore be performed with relatively few recognition errors even between words that are partially similar.

The difference in the amount of computation between a speech recognition device that performs NAT processing and one that performs DP matching processing is as follows. Let α be the average amount of computation in the DP matching distance calculation unit 5 per standard pattern for one input pattern, β the average amount of computation in the Chebyshev distance calculation unit 25, and γ the average amount of computation in the NAT processing unit 21. The amount of computation C₁ for DP matching processing against J standard patterns is

$$C_1 = \alpha \cdot J \tag{11}$$

and the amount of computation C₂ for NAT processing with J standard patterns is

$$C_2 = \beta \cdot J + \gamma \tag{12}$$

In general, α≫β. Since C₁>C₂ is equivalent to J>γ/(α-β), when the relation

$$J \gg \frac{\gamma}{\alpha - \beta} \tag{13}$$

holds, that is, as the number of words to be recognized increases, C₁≫C₂, and a speech recognition device that performs NAT processing greatly reduces the amount of computation.

Furthermore, the new acoustic parameter time series Qi(m) obtained from the NAT processing unit 21 can be set to a fixed number of parameters in the time-series direction, so the storage area of the standard pattern memory 4 can be used efficiently and its storage capacity can be kept relatively small.

In general, however, there is a fluctuation y at the end of a spoken word, as shown in FIG. 18. This word-ending fluctuation y is difficult to detect and remove by speech section determination. For this reason, matching has conventionally been performed with the word-ending fluctuation y included.
It has become clear that this word-ending fluctuation y lowers the recognition rate.

[Object of the invention]

In view of the above, an object of the present invention is to provide a speech recognition device in which the influence of word-ending fluctuation is relatively small and the recognition rate is relatively high.

[Summary of the invention]

The present invention is a speech recognition device that has a speech signal input unit, supplies the speech signal from the speech signal input unit to an acoustic analysis unit, supplies the acoustic parameter time series from the acoustic analysis unit to a NAT processing unit, estimates in the NAT processing unit a trajectory in the parameter space from the acoustic parameter time series, forms a new acoustic parameter time series along this trajectory, and recognizes speech by processing this new acoustic parameter time series, wherein the NAT processing unit forms the new acoustic parameter time series with a predetermined amount of the portion corresponding to the end of the speech removed. According to this speech recognition device, the influence of word-ending fluctuation is relatively small and a relatively high recognition rate can be obtained.

[Embodiment]

An embodiment of the speech recognition device of the present invention is described below with reference to FIGS. 19 and 20. In FIGS. 19 and 20, parts corresponding to those in FIGS. 1 to 18 are given the same reference numerals, and their detailed description is omitted.

In this example, as shown in FIG. 19, the interpolation point extractor 24 of the NAT processing unit 21 resamples a new point sequence at the resampling interval T along the trajectory of the acoustic parameter time series Pi(n) in its parameter space, for example the trajectory obtained by linear approximation between parameters. From this new point sequence, the points corresponding to the end of the speech, for example two points (the number is determined from statistics of word-ending fluctuation), are removed to form the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2).

The signal processing in the interpolation point extractor 24 is explained with reference to the flowchart of FIG. 19. First, in block 24h, the variable J, which indicates the number of the resampling point in the time-series direction, is set to 1, and the variable IC, which indicates the number of the parameter of Pi(n) in the time-series direction, is also set to 1. In block 24i the variable J is incremented, and in block 24j it is tested whether the variable J is at most (M-2), that is, whether the number of the current resampling point is still within the range that needs resampling. If the last such number has been passed, the signal processing of the interpolation point extractor 24 ends. Otherwise, in block 24k, the resampling distance DL from the first resampling point to the J-th resampling point is calculated, and in block 24l the variable IC is incremented. In block 24m, the resampling distance DL is compared with the distance SL(IC) from the first parameter Pi(1) to the IC-th parameter Pi(IC) of the acoustic parameter time series Pi(n); DL being smaller than SL(IC) means that the current resampling point lies on the trajectory on the start side of the parameter Pi(IC). If it does not, the variable IC is incremented again in block 24l and the comparison in block 24m is repeated. When the resampling point is found to lie on the start side of Pi(IC), a new acoustic parameter Qi(J) along the trajectory is formed by resampling in block 24n.

That is, first, the distance SL(IC-1) to the (IC-1)-th parameter Pi(IC-1), which lies on the start side of the J-th resampling point, is subtracted from the resampling distance DL of the J-th resampling point, giving the distance SS from Pi(IC-1) to the J-th resampling point. Next, SS is divided by the distance S(IC-1) between the parameters Pi(IC-1) and Pi(IC) that lie on either side of the J-th resampling point on the trajectory (S(IC-1) is obtained by the signal processing of equation (7)). The quotient SS/S(IC-1) is multiplied by the difference (Pi(IC)-Pi(IC-1)) between those two parameters, giving the interpolation amount (Pi(IC)-Pi(IC-1))*SS/S(IC-1) measured from Pi(IC-1). Adding this interpolation amount to Pi(IC-1) yields the new acoustic parameter Qi(J) along the trajectory.

Thereafter, blocks 24i to 24n perform this calculation for every point except the start point (Qi(1)=Pi(1), which is obtained by substitution and needs no calculation) and a predetermined number of points corresponding to the end of the speech, for example the two new acoustic parameters Qi(M-1) and Qi(M), so that the (M-2)-point new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2) is formed.
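The modified loop of FIG. 19 described above could be sketched in Python as follows. Again this is only a sketch under assumptions: the function name `nat_resample_trimmed` and the parameter `tail_points` are hypothetical, frames are plain lists of I values, M is assumed to be at least 2, and the distance is assumed to be the sum of absolute differences. The only change from ordinary NAT resampling is that the loop stops early, so the last `tail_points` resampling points are never computed.

```python
def nat_resample_trimmed(p, m, tail_points=2):
    """Place m evenly spaced resampling positions on the trajectory of p but
    form only the first m - tail_points of them (Q(1) .. Q(M-2) by default)."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    s = [dist(p[n + 1], p[n]) for n in range(len(p) - 1)]  # segment lengths S(n)
    sl = [0.0]                                             # cumulative lengths SL(n)
    for d in s:
        sl.append(sl[-1] + d)
    t = sl[-1] / (m - 1)             # interval T is still based on all M points

    q = [list(p[0])]                 # Q(1) = P(1), by substitution
    ic = 1
    for j in range(1, m - tail_points):          # stop after Q(M - tail_points)
        dl = t * j                               # resampling distance DL
        while ic < len(p) - 1 and dl >= sl[ic]:  # advance IC until DL < SL(IC)
            ic += 1
        ratio = (dl - sl[ic - 1]) / s[ic - 1] if s[ic - 1] > 0 else 0.0
        q.append([a + (b - a) * ratio for a, b in zip(p[ic - 1], p[ic])])
    return q


pattern_a = [[v] for v in (2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8)]
print([frame[0] for frame in nat_resample_trimmed(pattern_a, 8)])
```

For the one-dimensional pattern 2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8 with M=8, the sketch returns the six values 2, 4, 6, 8, 6, 4. That pattern is used here only to exercise the code; this description does not give trimmed values for it.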
In block 24n, the signal processing is performed for all I dimensions in the frequency direction.

The new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2) from the NAT processing unit 21, from which the parameters corresponding to the end of the speech have been removed, is supplied to the mode changeover switch 3. Otherwise, the acoustic analysis unit 2, the standard pattern memory 4, the Chebyshev distance calculation unit 25, the minimum distance determination unit 6 and so on are configured in the same way as in the speech recognition device of FIG. 9 described above.

The operation of the speech recognition device constructed in this way is as follows. The speech signal from the microphone 1 is converted in the acoustic analysis unit 2, for each speech section, into the acoustic parameter time series Pi(n) in which the vocal cord sound source characteristics are normalized, and Pi(n) is supplied to the NAT processing unit 21. In the NAT processing unit 21, a trajectory by linear approximation is estimated from Pi(n) in its parameter space, and along this trajectory the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2) is formed, excluding the acoustic parameters Qi(M-1) and Qi(M) that correspond to the end of the speech. Word-ending fluctuation appears as a relatively short trajectory length at the rear of the trajectory, and this trajectory length varies very little from one utterance to another, so removing a predetermined trajectory length (obtained statistically) at the rear of the trajectory removes only the word-ending fluctuation. Accordingly, the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2), which excludes Qi(M-1) and Qi(M), is relatively little affected by word-ending fluctuation.

FIG. 20 shows this situation in two dimensions. In FIG. 20, 11 dots (・) indicate the acoustic parameter time series P(1), …, P(11), the solid lines between the dots indicate the trajectory by linear approximation, and 9 circles (○) indicate the new acoustic parameter time series Q(1), …, Q(9). As FIG. 20 makes clear, the acoustic parameters P(9), P(10) and P(11) at the rear of the trajectory cover a relatively short trajectory length. This part represents the word-ending fluctuation, and the new acoustic parameter time series Q(1), …, Q(8), which excludes the corresponding new acoustic parameter Q(9), is relatively little affected by word-ending fluctuation.

In the registration mode, the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2) from the NAT processing unit 21 is stored as a standard pattern in the standard pattern memory 4 via the mode changeover switch 3. In the recognition mode, the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2) from the NAT processing unit 21 is supplied as the input pattern to the Chebyshev distance calculation unit 25 via the mode changeover switch 3, and the standard patterns in the standard pattern memory 4 are also supplied to the Chebyshev distance calculation unit 25. The Chebyshev distance between the input pattern and each standard pattern is calculated, the minimum distance determination unit 6 determines the standard pattern for which this Chebyshev distance is smallest, and this standard pattern is obtained at the output terminal 7 as the recognition result indicating the input speech.
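The matching step just described (Chebyshev distance calculation unit 25 followed by minimum distance determination unit 6) could be sketched in Python as follows. The names `pattern_distance`, `recognize` and `frames` are hypothetical, and the distance is assumed to be the sum of absolute differences over all frames and dimensions. For data, the sketch uses the eight-point one-dimensional patterns A, A' and B' given for FIGS. 15 to 17 in this description, not patterns with the word-ending points removed.

```python
def pattern_distance(x, y):
    """Sum of |x - y| over all frames and dimensions; this is the distance the
    description calls the Chebyshev distance (assumed form)."""
    return sum(abs(a - b) for fx, fy in zip(x, y) for a, b in zip(fx, fy))


def recognize(input_pattern, standard_patterns):
    """Return (label, distance) of the standard pattern closest to the input."""
    label = min(
        standard_patterns,
        key=lambda k: pattern_distance(input_pattern, standard_patterns[k]),
    )
    return label, pattern_distance(input_pattern, standard_patterns[label])


def frames(values):
    """Wrap scalar values as one-dimensional frames."""
    return [[v] for v in values]


standards = {
    "A'": frames((3, 5, 7, 9, 7, 5, 7, 9)),
    "B'": frames((7, 6, 7, 8, 7, 6, 5, 4)),
}
print(recognize(frames((2, 4, 6, 8, 6, 4, 6, 8)), standards))
```

The distances computed for A' and B' are 8 and 16, so the sketch selects A'. Because every pattern has the same number of frames after NAT processing, the comparison is a single pass with no time warping, which is where the saving over DP matching comes from.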
In this case, both the standard patterns and the input pattern consist of new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2) that are relatively little affected by word-ending fluctuation, so the influence of word-ending fluctuation is relatively small and a relatively high recognition rate is obtained.

As described above, this example is a speech recognition device that has the microphone 1 as a speech signal input unit, supplies the speech signal from this speech signal input unit 1 to the acoustic analysis unit 2, supplies the acoustic parameter time series Pi(n) from the acoustic analysis unit to the NAT processing unit 21, estimates in the NAT processing unit 21 a trajectory in the parameter space from Pi(n), forms a new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M) along this trajectory, and recognizes speech by processing this new acoustic parameter time series Qi(m). Because the NAT processing unit 21 removes the two new acoustic parameters Qi(M-1) and Qi(M) corresponding to the end of the speech and forms the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2), the influence of word-ending fluctuation is relatively small and a relatively high recognition rate can be obtained. In addition, the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2) formed by the NAT processing unit 21 has fewer parameters by the number corresponding to the end of the speech, so the storage capacity of the standard pattern memory 4 can be reduced accordingly.

In the above embodiment, two resampling points corresponding to the end of the speech are removed to form the new acoustic parameter time series Qi(m) (i=1, …, I; m=1, …, M-2). It is easy to see, however, that the same effects are obtained when an appropriate number of resampling points, such as one, is removed according to the statistics of word-ending fluctuation. Also, in the above embodiment, a trajectory in the parameter space is estimated from the acoustic parameter time series Pi(n), a new point sequence is resampled along this trajectory, and the new acoustic parameter time series is formed from the point sequence that remains after the points corresponding to the end of the speech are excluded. The same effects are obtained when the trajectory is estimated from the acoustic parameter time series Pi(n) of the acoustic analysis unit 2, a trajectory length obtained from statistics of the part corresponding to word-ending fluctuation is removed from the end of the trajectory, and the new acoustic parameter time series is formed along the trajectory with that length removed. Furthermore, in the above embodiment the trajectory length is calculated for a trajectory by linear approximation in the parameter space; the same effects are obtained when the trajectory length is calculated for a trajectory by circular-arc approximation, spline approximation, or the like.
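The second variation above, in which a statistically determined trajectory length is cut from the end of the trajectory before resampling, could be sketched in Python as follows. This is one possible reading, not a procedure spelled out in this description: the name `nat_resample_cut_length` and the parameter `tail_length` are hypothetical, M is assumed to be at least 2, the trajectory is assumed to be piecewise linear, and the distance is assumed to be the sum of absolute differences.

```python
def nat_resample_cut_length(p, m, tail_length):
    """Remove tail_length from the end of the trajectory of p, then take m
    points evenly spaced along the part that remains."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    s = [dist(p[n + 1], p[n]) for n in range(len(p) - 1)]  # segment lengths
    sl = [0.0]                                             # cumulative lengths
    for d in s:
        sl.append(sl[-1] + d)
    usable = max(sl[-1] - tail_length, 0.0)  # trajectory length after trimming
    t = usable / (m - 1)                     # resampling interval on the trimmed trajectory

    q = []
    ic = 1
    for j in range(m):
        dl = t * j
        while ic < len(p) - 1 and dl >= sl[ic]:  # find the segment containing dl
            ic += 1
        ratio = (dl - sl[ic - 1]) / s[ic - 1] if s[ic - 1] > 0 else 0.0
        q.append([a + (b - a) * ratio for a, b in zip(p[ic - 1], p[ic])])
    return q


# Made-up one-dimensional sequence whose last two steps wobble (9, then back to 8).
wobbly = [[v] for v in (0, 2, 4, 6, 8, 9, 8)]
print([frame[0] for frame in nat_resample_cut_length(wobbly, 5, tail_length=2)])
```

In the made-up sequence the trajectory length is 10, of which the final 2 is treated as word-ending fluctuation, so the five output points are spread over the first 8 units of the trajectory and take the values 0, 2, 4, 6, 8. Unlike the embodiment, this variation keeps M output points, so pattern length is unchanged.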
Furthermore, the present invention is of course not limited to the above embodiment and can take various other configurations without departing from the gist of the invention.

[Effects of the invention]

The speech recognition device of the present invention has a speech signal input unit, supplies the speech signal from the speech signal input unit to an acoustic analysis unit, supplies the acoustic parameter time series from the acoustic analysis unit to a NAT processing unit, estimates in the NAT processing unit a trajectory in the parameter space from the acoustic parameter time series, forms a new acoustic parameter time series along this trajectory, and recognizes speech by processing this new acoustic parameter time series. Because the NAT processing unit forms the new acoustic parameter time series with a predetermined amount of the portion corresponding to the end of the speech removed, the influence of word-ending fluctuation is relatively small and a relatively high recognition rate can be obtained.

[Brief explanation of the drawing]

第１図はDPマツチング処理により音声認識を
行なうようにした音声認識装置の例を示す構成
図、第２図はDPマツチング処理の説明に供する
概念図、第３図は音響パラメータ空間における軌
跡の説明に供する線図、第４図、第５図及び第６
図は夫々１次元の入力パターンＡ、標準パターン
A′及び標準パターンB′の例を示す線図、第７図
は入力パターンＡのパラメータ時系列と標準パタ
ーンA′のパラメータ時系列とのDPマツチング処
理による時間軸正規化の説明に供する線図、第８
図は入力パターンＡのパラメータ時系列と標準パ
ターンB′のパラメータ時系列とのDPマツチング
処理による時間軸正規化の説明に供する線図、第
９図はNAT処理をして音声認識を行なうように
した音声認識装置の例を示す構成図、第１０図、
第１１図、第１２図及び第１４図は夫々NAT処
理部の説明に供する線図、第１３図は補間点抽出
器の説明に供する流れ図、第１５図、第１６図及
び第１７図は夫々NAT処理部にてNAT処理し
た入力パターンＡ、標準パターンA′及び標準パ
ターンB′の１次元の音響パラメータ時系列を示
す線図、第１８図は音声波形の例を示す線図、第
１９図は本発明音声認識装置の要部の一実施例を
示す流れ図、第２０図は第１９図の説明に供する
線図である。１は音声信号入力部としてのマイクロホン、２
は音響分析部、３はモード切換スイツチ、４は標
準パターンメモリ、６は最小距離判定部、１１_A，
１１_B，…，１１_Oは15チヤンネルのデジタルバン
ドパスフイルタバンク、１６は音声区間内パラメ
ータメモリ、２１はNAT処理部、２２は軌跡長
算出器、２３は補間間隔算出器、２４は補間点抽
出器、２５はチエビシエフ距離算出部、２６は軌
跡長信号付加器、２７は標準パターン選択部、ｙ
は音声の語尾のゆらぎである。

FIG. 1 is a block diagram showing an example of a speech recognition device that performs speech recognition by DP matching processing. FIG. 2 is a conceptual diagram for explaining DP matching processing. FIG. 3 is a diagram for explaining a trajectory in the acoustic parameter space. FIGS. 4, 5 and 6 are diagrams showing examples of the one-dimensional input pattern A, standard pattern A' and standard pattern B', respectively. FIG. 7 is a diagram for explaining time axis normalization by DP matching processing between the parameter time series of input pattern A and that of standard pattern A'. FIG. 8 is a diagram for explaining time axis normalization by DP matching processing between the parameter time series of input pattern A and that of standard pattern B'. FIG. 9 is a block diagram showing an example of a speech recognition device that performs speech recognition with NAT processing. FIGS. 10, 11, 12 and 14 are diagrams for explaining the NAT processing unit. FIG. 13 is a flowchart for explaining the interpolation point extractor. FIGS. 15, 16 and 17 are diagrams showing the one-dimensional acoustic parameter time series of input pattern A, standard pattern A' and standard pattern B', respectively, after NAT processing in the NAT processing unit. FIG. 18 is a diagram showing an example of a speech waveform. FIG. 19 is a flowchart showing an embodiment of the main part of the speech recognition device of the present invention. FIG. 20 is a diagram for explaining FIG. 19.

1: microphone serving as the speech signal input unit; 2: acoustic analysis unit; 3: mode changeover switch; 4: standard pattern memory; 6: minimum distance determination unit; 11A, 11B, …, 11O: 15-channel digital band-pass filter bank; 16: in-speech-interval parameter memory; 21: NAT processing unit; 22: trajectory length calculator; 23: interpolation interval calculator; 24: interpolation point extractor; 25: Chebyshev distance calculation unit; 26: trajectory length signal adder; 27: standard pattern selection unit; y: word-ending fluctuation.

Claims

[Claims]

1. A speech recognition device which has a speech signal input unit, supplies the speech signal from the speech signal input unit to an acoustic analysis unit, supplies the acoustic parameter time series from the acoustic analysis unit to a NAT processing unit, estimates in the NAT processing unit a trajectory in the parameter space from the acoustic parameter time series, forms a new acoustic parameter time series along the trajectory, and recognizes speech by processing the new acoustic parameter time series from the NAT processing unit, characterized in that the NAT processing unit forms the new acoustic parameter time series with a predetermined amount of the portion corresponding to the end of the speech removed.