JPH0634183B2

JPH0634183B2 - Voice recognizer

Info

Publication number: JPH0634183B2
Application number: JP59111707A
Authority: JP
Inventors: 曜一郎佐古; 篤信平岩; 誠赤羽; 雅男渡
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1984-05-31
Filing date: 1984-05-31
Publication date: 1994-05-02
Anticipated expiration: 2009-05-02
Also published as: JPS60254198A

Description

DETAILED DESCRIPTION OF THE INVENTION

TECHNICAL FIELD

The present invention relates to a voice recognition device that recognizes voice.

BACKGROUND ART AND ITS PROBLEMS

Conventionally, as a voice recognition device that copes with variation in the utterance speed of voice, a device that performs DP matching processing, such as the one disclosed in Japanese Patent Laid-Open No. 50-96104, has been proposed.

First, a voice recognition device that performs voice recognition by this DP matching processing will be described.

In FIG. 1, (1) denotes a microphone serving as an audio signal input section. The audio signal from the microphone (1) is supplied to an acoustic analysis section (2), which produces an acoustic parameter time series Pi(n). In the acoustic analysis section (2), for example, the rectified and smoothed outputs of a bandpass filter bank are obtained as the acoustic parameter time series Pi(n) (i = 1, ..., I, where I is the number of channels of the bandpass filter bank; n = 1, ..., N, where N is the number of frames cut out by voice section determination).

Through a mode changeover switch (3), the acoustic parameter time series Pi(n) from the acoustic analysis section (2) is stored in a standard pattern memory (4) for each recognition target word in the registration mode, and is supplied to one input of a DP matching distance calculation section (5) in the recognition mode. In the recognition mode, the standard patterns stored in the standard pattern memory (4) are supplied to the other input of the DP matching distance calculation section (5).

The DP matching distance calculation section (5) calculates the DP matching distance between the input pattern, which consists of the acoustic parameter time series Pi(n) of the voice currently being input, and each standard pattern in the standard pattern memory (4). A distance signal indicating this DP matching distance is supplied to a minimum distance determination section (6), which determines the standard pattern whose DP matching distance to the input pattern is smallest. From this determination, a recognition result indicating the input voice is obtained at an output terminal (7).

In general, the number of frames N of the standard patterns stored in the standard pattern memory (4) differs from pattern to pattern because of variation in utterance speed and differences in word length. The DP matching processing performs time axis normalization to cope with this variation in utterance speed and these differences in word length.

The DP matching processing is described below. For simplicity, the dimension corresponding to the frequency axis direction i of the acoustic parameter time series Pi(n) is omitted, the parameter time series of the standard pattern is written b_1, ..., b_N, and that of the input pattern is written a_1, ..., a_M. DP matching with an endpoint-fixed DP path is described.

FIG. 2 is a conceptual diagram of the DP matching processing. The input parameters (M = 19) are arranged along the horizontal axis and the standard parameters (N = 12) along the vertical axis. The (M, N) grid plane of FIG. 2 contains M × N points, and one distance corresponds to each point. For example, the distance between a_3 and b_5 corresponds to the point at the intersection of the vertical line drawn from a_3 and the horizontal line drawn from b_5. If the Chebyshev distance is used as the distance, then with the dimension corresponding to the frequency axis direction i omitted (so that I = 1), the Chebyshev distance d(3, 5) between a_3 and b_5 is

$$d(3, 5) = |a_3 - b_5|$$

As the endpoint-fixed DP path, only three predecessors are allowed for a grid point (m, n): the grid point to its left (m - 1, n), the grid point diagonally below left (m - 1, n - 1), and the grid point below (m, n - 1). Starting from the start point, which represents the Chebyshev distance d(1, 1) between a_1 and b_1, and choosing among these three directions, the path is found that reaches the end point, which represents the Chebyshev distance d(M, N) between a_M and b_N, with the minimum total distance over the grid points it passes. This total is divided by (M + N - 1), that is, the sum of the number of input parameters M and the number of standard parameters N minus 1, and the result is taken as the DP matching distance between the input pattern a_1, ..., a_M and the standard pattern b_1, ..., b_N. The initial condition and the recurrence for this processing are

$$g(1, 1) = d(1, 1)$$

$$g(m, n) = \min \begin{cases} g(m, n-1) + d(m, n) \\ g(m-1, n-1) + 2\,d(m, n) \\ g(m-1, n) + d(m, n) \end{cases}$$

from which the DP matching distance DM(A, B) is

$$DM(A, B) = \frac{g(M, N)}{M + N - 1}$$

(g(M, N) is divided by (M + N - 1) to correct for differences in the distance value caused by differences in the number of frames N of the standard patterns. Because a diagonal step counts d(m, n) twice, every admissible path carries the same total weight M + N - 1.) When there are L standard patterns, L DP matching distances are obtained for the input pattern, and the standard pattern that gives the smallest of these L distances is taken as the recognition result.

A voice recognition device based on this DP matching processing can cope with variation in utterance speed and differences in word length; that is, it performs time-axis-normalized voice recognition.

However, it has become clear that in voice recognition by such DP matching processing, the stationary parts of the voice are strongly reflected in the DP matching distance, so that misrecognition is likely between words that are partially similar.

That is, the acoustic parameter time series Pi(n) can be regarded as drawing a trajectory in its parameter space. Strictly speaking it is a sequence of points, because the parameters of each frame n correspond to one point in the parameter space, but joining the points with a curve in the time series direction gives a single trajectory from the start point to the end point. For example, when the two words "SAN" and "HAI" are registered, their standard patterns A' and B' draw trajectories passing through the phoneme regions "S", "A", "N" and "H", "A", "I", respectively, as shown in FIG. 3. When "SAN" is uttered in the recognition mode, the input pattern A has, taken as a whole, very little in common with the standard pattern B'. Nevertheless, the "A" portion of the input pattern A ("SAN") may be more similar to the "A" portion of "HAI" in the standard pattern B' than to the "A" portion of "SAN" in the standard pattern A', and that portion (a quasi-stationary part) may contain many points.

A case in which DP matching leads to misrecognition is now described using one-dimensional parameters as an example: the case, shown in FIG. 3, in which the parameters of the input pattern A are similar to those of the standard pattern A' overall but similar to those of the standard pattern B' in part. As one-dimensional parameter time series that reproduce the situation of FIG. 3, that is, the relationship between partially similar words, consider the following:

Input pattern A (FIG. 4): 2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8
Standard pattern A' (FIG. 5): 3, 5, 7, 9, 9, 9, 9, 7, 5, 5, 7, 9
Standard pattern B' (FIG. 6): 7, 6, 6, 8, 8, 8, 8, 6, 4, 4, 4

As is apparent from the patterns of FIGS. 4 to 6, the input pattern A should be judged to be the standard pattern A'. However, when the DP matching distances of the standard patterns A' and B' with respect to the input pattern A are calculated, the input pattern A turns out to be closer to the standard pattern B'.

That is, DP matching of the standard pattern A' against the input pattern A is carried out as in FIG. 2. As shown in FIG. 7, the parameter time series of the input pattern A (2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8) is arranged along the horizontal axis and that of the standard pattern A' (3, 5, 7, 9, 9, 9, 9, 7, 5, 5, 7, 9) along the vertical axis, and the Chebyshev distance between each parameter of the input pattern A and each parameter of the standard pattern A' is obtained at the corresponding intersection of the grid plane. The start point is the point of the Chebyshev distance d(1, 1) = 1 between the first parameter, 2, of the input pattern A and the first parameter, 3, of the standard pattern A'. The end point is the point of the Chebyshev distance d(13, 12) = 1 between the 13th parameter, 8, of the input pattern A and the 12th parameter, 9, of the standard pattern A'. When, as in FIG. 2, the DP path may reach any point only from the point to its left, the point below it, or the point diagonally below left (the path is indicated by solid arrows), the points on the path are

d(1, 1) - d(2, 2) - d(3, 3) - d(4, 4) - d(5, 5) - d(6, 6) - d(7, 7) - d(8, 8) - d(9, 9) - d(10, 10) - d(11, 10) - d(12, 10) - d(13, 11) - d(13, 12)

which is 14 points. The total distance is 24 (every point on the path has distance 1, and the ten points reached by a diagonal step are counted twice: 14 + 10 = 24), so the DP matching distance DM(A, A') is 24 / (13 + 12 - 1) = 1.

DP matching of the standard pattern B' against the input pattern A is carried out in the same way as in FIG. 7, as shown in FIG. 8. That is, the Chebyshev distances between the parameters of the input pattern A (2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8) and the parameters of the standard pattern B' (7, 6, 6, 8, 8, 8, 8, 6, 4, 4, 4) are obtained. When the DP path may reach any point only from the point to its left, the point below it, or the point diagonally below left (the path is indicated by solid arrows), the points on the path are

d(1, 1) - d(2, 2) - d(3, 3) - d(4, 4) - d(5, 5) - d(6, 6) - d(7, 7) - d(8, 8) - d(9, 9) - d(10, 10) - d(11, 11) - d(12, 11) - d(13, 11)

which is 13 points. The total distance is 15 (d(1, 1) = 5, d(2, 2) = 2 counted twice for the diagonal step, d(12, 11) = 2, d(13, 11) = 4, and 0 elsewhere), so the DP matching distance DM(A, B') is 15 / (13 + 11 - 1) ≈ 0.65.
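The two worked comparisons above can be checked with a short Python sketch. It assumes a recurrence in which a diagonal step counts the local distance twice, since that is the weighting that reproduces the reported totals of 24 and 15, and it uses the absolute difference as the local distance for I = 1. The function name `dp_matching_distance` and the variable names are illustrative and do not come from the patent.

```python
def dp_matching_distance(a, b):
    """Endpoint-fixed DP matching distance between 1-D sequences.

    a: input pattern (length M), b: standard pattern (length N).
    ASSUMPTION: diagonal steps add 2 * d(m, n); left/below steps add d(m, n).
    """
    M, N = len(a), len(b)
    inf = float("inf")
    # g[m][n] is the minimum accumulated distance to grid point (m + 1, n + 1).
    g = [[inf] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            d = abs(a[m] - b[n])  # local distance with the frequency axis omitted
            if m == 0 and n == 0:
                g[m][n] = d  # initial condition g(1, 1) = d(1, 1)
                continue
            left = g[m - 1][n] + d if m > 0 else inf
            below = g[m][n - 1] + d if n > 0 else inf
            diag = g[m - 1][n - 1] + 2 * d if m > 0 and n > 0 else inf
            g[m][n] = min(left, below, diag)
    # Normalize by M + N - 1, the total path weight under this weighting.
    return g[M - 1][N - 1] / (M + N - 1)


A = [2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8]      # input pattern A
A_REF = [3, 5, 7, 9, 9, 9, 9, 7, 5, 5, 7, 9]     # standard pattern A'
B_REF = [7, 6, 6, 8, 8, 8, 8, 6, 4, 4, 4]        # standard pattern B'

dm_a = dp_matching_distance(A, A_REF)
dm_b = dp_matching_distance(A, B_REF)
print(dm_a, dm_b, "closer to B'" if dm_b < dm_a else "closer to A'")
```

Because the sketch minimizes over all admissible paths, the value it finds for B' can come out somewhat below the 0.65 obtained along the path traced in FIG. 8. The ordering DM(A, B') < DM(A, A') is the same either way.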

As is clear from these results with three DP path directions, the input pattern A is judged to be the standard pattern B', whose DP matching distance is smaller, and the result that should be obtained is not obtained. DP matching processing is thus prone to misrecognition between words that are partially similar.

Furthermore, in DP matching processing the number of frames N of the standard patterns is not fixed, as described above, and every standard pattern must be DP-matched against the input pattern. As the vocabulary grows, the amount of computation increases dramatically, which causes problems in the storage capacity of the standard pattern memory (4) and in the amount of computation.

For this reason, a device such as that shown in FIG. 9 has been considered as a voice recognition device that misrecognizes relatively rarely even between partially similar words and that needs relatively little standard pattern memory (4) capacity and relatively little computation.

In FIG. 9, (1) denotes a microphone serving as an audio signal input section. The audio signal from the microphone (1) is supplied to an amplifier (8) of an acoustic analysis section (2). The audio signal from the amplifier (8) is supplied through a low-pass filter (9) with a cutoff frequency of 5.5 kHz to a 12-bit A/D converter (10) with a sampling frequency of 12.5 kHz, and the digital audio signal from the A/D converter (10) is supplied to a 15-channel digital bandpass filter bank (11A), (11B), ..., (11O). The 15-channel digital bandpass filter bank (11A), (11B), ..., (11O) consists, for example, of fourth-order Butterworth digital filters, with the band from 250 Hz to 5.5 kHz divided so that the channels are equally spaced on a logarithmic axis. The output signals of the digital bandpass filters (11A), (11B), ..., (11O) are supplied to 15 channels of rectifiers (12A), (12B), ..., (12O), respectively, and the squared outputs of the rectifiers (12A), (12B), ..., (12O) are supplied to 15 channels of digital low-pass filters (13A), (13B), ..., (13O), respectively. The digital low-pass filters (13A), (13B), ..., (13O) are FIR (finite impulse response) low-pass filters with a cutoff frequency of 52.8 Hz.
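As a rough illustration of the channel allocation described above, the following Python sketch divides 250 Hz to 5.5 kHz into 15 contiguous bands whose edges are equally spaced on a logarithmic axis. The text states only that the channels are equally spaced on a logarithmic axis; treating the spacing as applying to contiguous band edges is an assumption of this sketch, and the function name is illustrative.

```python
def log_spaced_band_edges(f_low=250.0, f_high=5500.0, channels=15):
    """Return channels + 1 band edges in Hz, equally spaced on a log axis.

    ASSUMPTION: the bands are contiguous and the equal log spacing applies
    to the band edges; the patent does not give the individual pass bands.
    """
    ratio = (f_high / f_low) ** (1.0 / channels)  # constant edge-to-edge ratio
    return [f_low * ratio ** k for k in range(channels + 1)]


edges = log_spaced_band_edges()
bands = list(zip(edges[:-1], edges[1:]))  # (lower edge, upper edge) per channel
for index, (lo, hi) in enumerate(bands, start=1):
    print(f"channel {index:2d}: {lo:7.1f} Hz to {hi:7.1f} Hz")
```

Under this assumption each band is about 23 % wider than the one below it, so the low-frequency channels are narrow and the high-frequency channels are wide.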

The output signals of the digital low-pass filters (13A), (13B), ..., (13O) are supplied to a sampler (14) with a sampling period of 5.12 ms. The sampler (14) samples the output signals of the digital low-pass filters (13A), (13B), ..., (13O) every frame period of 5.12 ms, and the sampled signal is supplied to a sound source information normalizer (15). The sound source information normalizer (15) removes speaker-dependent differences in the vocal cord sound source characteristics of the voice to be recognized.

That is, the sampled signal Ai(n) (i = 1, ..., 15; n: frame number) supplied from the sampler (14) every frame period is logarithmically converted as

$$A'_i(n) = \log(A_i(n) + B) \qquad (1)$$

In equation (1), B is a bias, set to a value large enough to hide the noise level. The vocal cord sound source characteristic is then approximated by the line y_i = a·i + b. The coefficients a and b are determined by the following equations.

(Equations (2) and (3) are not reproduced in this text.)

Writing the parameter with the sound source normalized as Pi(n), when a(n) < 0 the parameter Pi(n) is expressed as

$$P_i(n) = A'_i(n) - \{a(n) \cdot i + b(n)\} \qquad (4)$$

When a(n) ≥ 0, only level normalization is performed, and the parameter Pi(n) is expressed by a separate equation (not reproduced in this text).
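The normalization steps above can be sketched in Python as follows. Two points are assumptions rather than the patent's stated method, because the corresponding equations are not legible in this text: the coefficients a and b are obtained here by an ordinary least-squares line fit over the channel index i, and for a ≥ 0 the level normalization is taken to be subtraction of the mean log level. The natural logarithm and the function name `normalize_source` are also illustrative choices.

```python
import math


def normalize_source(frame, bias=1.0):
    """Normalize one frame of channel outputs A_i(n), i = 1..I.

    Returns (P, a, b), where P is the normalized parameter list.
    """
    I = len(frame)
    log_spec = [math.log(x + bias) for x in frame]  # equation (1)
    idx = list(range(1, I + 1))
    mean_i = sum(idx) / I
    mean_y = sum(log_spec) / I
    # ASSUMPTION: a and b come from a least-squares fit of y_i = a * i + b.
    sxx = sum((i - mean_i) ** 2 for i in idx)
    sxy = sum((i - mean_i) * (y - mean_y) for i, y in zip(idx, log_spec))
    a = sxy / sxx
    b = mean_y - a * mean_i
    if a < 0:
        # Equation (4): remove the fitted spectral tilt.
        p = [y - (a * i + b) for i, y in zip(idx, log_spec)]
    else:
        # ASSUMPTION: "level normalization only" = subtract the mean level.
        p = [y - mean_y for y in log_spec]
    return p, a, b


# A spectrum that falls off exactly linearly in the log domain.
falling = [math.exp(3.0 - 0.2 * i) for i in range(1, 16)]
p_fall, a_fall, b_fall = normalize_source(falling, bias=0.0)
print(round(a_fall, 6), round(b_fall, 6))
```

For a frame whose log spectrum is exactly a falling line, the fitted tilt is removed completely and the normalized parameters are all close to zero.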

The parameters Pi(n), whose sound source characteristics have been normalized by this processing, are supplied to an intra-voice-section parameter memory (16). On receiving a voice section determination signal from a voice section determination unit (17), described later, the intra-voice-section parameter memory (16) stores the normalized parameters Pi(n) for each voice section.

Meanwhile, the digital audio signal from the A/D converter (10) is supplied to a zero-cross counter (18) and a power calculator (19) of the voice section determination unit (17). Every 5.12 ms, the zero-cross counter (18) counts the number of zero crossings of the 64 samples of the digital audio signal in that interval and supplies the count to a first input of a voice section determiner (20). Every 5.12 ms, the power calculator (19) obtains the power of the digital audio signal in that interval, that is, the sum of squares, and supplies a power signal indicating the intra-interval power to a second input of the voice section determiner (20). In addition, the sound source normalization information a(n) and b(n) from the sound source information normalizer (15) is supplied to a third input of the voice section determiner (20). The voice section determiner (20) processes the zero-cross count, the intra-interval power, and the sound source normalization information a(n), b(n) in combination, discriminates silence, unvoiced sound, and voiced sound, and determines the voice section. The voice section determination signal indicating this voice section is supplied to the intra-voice-section parameter memory (16) as the output of the voice section determination unit (17).
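The two per-frame measurements described above can be sketched in Python as follows. The frame length of 64 samples follows from the 12.5 kHz sampling frequency and the 5.12 ms interval. Counting a zero crossing whenever consecutive samples differ in sign is an assumption of this sketch, and the rule that combines these values with a(n) and b(n) is not given in the text, so it is not modeled. The names are illustrative.

```python
FRAME_LEN = 64  # 12.5 kHz sampling frequency * 5.12 ms interval


def frame_features(samples, frame_len=FRAME_LEN):
    """Return a (zero_crossings, power) pair for each complete frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # ASSUMPTION: a zero crossing is a sign change between neighbours.
        zero_crossings = sum(
            1 for x, y in zip(frame, frame[1:]) if (x >= 0) != (y >= 0)
        )
        power = sum(x * x for x in frame)  # sum of squares over the frame
        features.append((zero_crossings, power))
    return features


# One frame alternating in sign followed by one frame of silence.
signal = [1, -1] * 32 + [0] * 64
features = frame_features(signal)
print(features)
```

A frame with many zero crossings and low power is typical of unvoiced sound, while low zero-cross count with high power is typical of voiced sound, which is why the two values are useful together.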

The acoustic parameters Pi(n) stored in the intra-voice-section parameter memory (16) for each voice section, with their sound source characteristics normalized, are supplied in time series order to a NAT processing section (21). (NAT is an acronym for Normalization Along Trajectory.) As NAT processing, the NAT processing section (21) estimates, from the acoustic parameter time series Pi(n), its trajectory in the parameter space and forms a new acoustic parameter time series Qi(m) on the basis of this trajectory.

The NAT processing section (21) is now described further. The acoustic parameter time series Pi(n) (i = 1, ..., I; n = 1, ..., N) draws a sequence of points in its parameter space. FIG. 10 shows an example of a point sequence distributed in a two-dimensional parameter space. As shown in FIG. 10, the points of the non-stationary parts of the voice are sparsely distributed and those of the quasi-stationary parts are densely distributed. This is clear from the fact that if the voice were completely stationary the parameters would not change, in which case the point sequence would stay at one place in the parameter space.

FIG. 11 shows an example in which a trajectory is drawn as a smooth curve through a point sequence such as that of FIG. 10. If a trajectory can be estimated for the point sequence as shown in FIG. 11, the trajectory can be regarded as almost invariant to variation in the utterance speed of the voice. This is because most of the difference in duration caused by variation in utterance speed comes from temporal expansion and contraction of the quasi-stationary parts (which corresponds to a difference in the point density of the quasi-stationary parts in a point sequence such as that of FIG. 10), while the duration of the non-stationary parts has little influence.

The NAT processing section (21) performs time axis normalization by exploiting this invariance of the trajectory to variation in utterance speed.

That is, first, a trajectory drawn as a continuous curve from the start Pi(1) to the end Pi(N) is estimated for the acoustic parameter time series Pi(n), and the curve representing this trajectory is expressed as a function of a parameter s (0 ≤ s ≤ S). (The symbol for the curve and a condition on it are given as equations that are not reproduced in this text.) That condition need not necessarily hold; basically, it is sufficient that the curve passes approximately through the entire point sequence.

Second, the length SL of the trajectory is obtained from the estimated curve, and a new point sequence is resampled at a constant length along the trajectory, as indicated by the circles in FIG. 12. For example, when resampling to M points, the trajectory is resampled using the constant length, that is, the resampling interval T = SL / (M - 1), as the unit. The resampled point sequence is written Qi(m) (i = 1, ..., I; m = 1, ..., M); the equation that defines Qi(m) in terms of the curve is not reproduced in this text.

The new parameter time series Qi(m) obtained in this way retains the basic information of the trajectory and is almost invariant to variation in the utterance speed of the voice. That is, the new parameter time series Qi(m) is a time-axis-normalized parameter time series.

For this processing, the acoustic parameter time series Pi(n) in the intra-voice-section parameter memory (16) is supplied to a trajectory length calculator (22). The trajectory length calculator (22) calculates the length of the trajectory that the acoustic parameter time series Pi(n) draws in its parameter space under linear approximation, that is, the trajectory length. The Euclidean distance D(a_i, b_i) between I-dimensional vectors a_i and b_i is

$$D(a_i, b_i) = \sqrt{\sum_{i=1}^{I} (a_i - b_i)^2}$$

Therefore, for the I-dimensional acoustic parameter time series Pi(n) (i = 1, ..., I; n = 1, ..., N), the distance S(n) between parameters adjacent in the time series direction, when the trajectory is estimated by linear approximation, is

$$S(n) = D(P_i(n+1), P_i(n)) \quad (n = 1, \ldots, N-1) \qquad (7)$$

The distance SL(n) from the first parameter Pi(1) to the n-th parameter Pi(n) in the time series direction is

$$SL(n) = \sum_{n'=1}^{n-1} S(n') \quad (n = 2, \ldots, N) \qquad (8)$$

with SL(1) = 0. The trajectory length SL is

$$SL = \sum_{n=1}^{N-1} S(n) \qquad (9)$$

The trajectory length calculator (22) is configured to perform the signal processing of equations (7), (8), and (9).
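A minimal Python sketch of the trajectory length calculation follows. It computes the adjacent-frame Euclidean distances S(n), the cumulative distances SL(n) with SL(1) = 0, and the total trajectory length SL. The function names and the list-of-lists data layout (one list of I parameters per frame) are assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two I-dimensional parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def trajectory_lengths(points):
    """points: N frames, each a list of I parameters.

    Returns (S, SL_cum, SL):
      S[k]      distance between frames k + 1 and k + 2 (equation (7))
      SL_cum[k] distance from the first frame to frame k + 1 (equation (8))
      SL        total trajectory length (equation (9))
    """
    S = [euclidean(points[n + 1], points[n]) for n in range(len(points) - 1)]
    SL_cum = [0.0]  # SL(1) = 0
    for s in S:
        SL_cum.append(SL_cum[-1] + s)
    return S, SL_cum, SL_cum[-1]


# Four frames in a 2-D parameter space; the middle two coincide,
# as they would in a perfectly stationary stretch of voice.
frames = [[0, 0], [3, 4], [3, 4], [6, 8]]
S, SL_cum, SL = trajectory_lengths(frames)
print(S, SL_cum, SL)
```

The repeated frame contributes a zero-length segment, so a stationary stretch adds frames without adding trajectory length.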

A trajectory length signal indicating the trajectory length SL from the trajectory length calculator (22) is supplied to an interpolation interval calculator (23). The interpolation interval calculator (23) calculates the constant resampling interval T used to resample a new point sequence along the trajectory by linear interpolation. When resampling to M points, the resampling interval T is

$$T = \frac{SL}{M - 1} \qquad (10)$$

The interpolation interval calculator (23) is configured to perform the signal processing of equation (10).

A resampling interval signal indicating the resampling interval T from the interpolation interval calculator (23) is supplied to one input of an interpolation point extractor (24), and the acoustic parameter time series Pi(n) from the intra-voice-section parameter memory (16) is supplied to the other input of the interpolation point extractor (24). The interpolation point extractor (24) resamples a new point sequence at the resampling interval T along the trajectory of the acoustic parameter time series Pi(n) in its parameter space, for example the trajectory obtained by linear approximation between parameters, and forms a new acoustic parameter time series Qi(m) from this new point sequence.
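The resampling performed by the interpolation point extractor can be sketched in Python as follows, using a piecewise-linear trajectory and linear interpolation at multiples of T = SL / (M - 1). This is a sketch under stated assumptions rather than the patent's circuit: the first and last output points are simply set to the first and last input points, and the function and variable names are illustrative.

```python
import math


def _dist(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def nat_resample(points, M):
    """Resample a piecewise-linear trajectory at M equally spaced positions.

    points: N frames (N >= 2), each a list of I parameters; M >= 2.
    """
    seg = [_dist(points[n], points[n + 1]) for n in range(len(points) - 1)]
    cum = [0.0]  # cumulative distance to each original point
    for s in seg:
        cum.append(cum[-1] + s)
    total = cum[-1]
    if total == 0.0:
        # Degenerate case: the trajectory is a single point.
        return [list(points[0]) for _ in range(M)]
    T = total / (M - 1)  # resampling interval, equation (10)
    out = [list(points[0])]  # ASSUMPTION: the first output is the first frame
    ic = 1  # index of the first original point at or beyond the target
    for j in range(1, M - 1):
        dl = T * j  # distance of this resampling point from the start
        while cum[ic] < dl:
            ic += 1
        ss = dl - cum[ic - 1]  # distance past the preceding original point
        frac = ss / seg[ic - 1]
        out.append([p0 + (p1 - p0) * frac
                    for p0, p1 in zip(points[ic - 1], points[ic])])
    out.append(list(points[-1]))  # ASSUMPTION: the last output is the last frame
    return out


fast = [[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]]
# Same trajectory with stationary frames repeated, as in slower speech.
slow = [[0, 0], [1, 0], [1, 0], [1, 0], [2, 0], [2, 1], [2, 1], [2, 2]]
q_fast = nat_resample(fast, 5)
q_slow = nat_resample(slow, 5)
print(q_fast == q_slow)
```

The two inputs differ only in how long they dwell at certain points, so they trace the same trajectory and resample to the same sequence. This is the invariance to utterance speed that the NAT processing relies on.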

ここで、この補間点抽出器(24)における信号処理を第13
図に示す流れ図に沿って説明する。先ず、ブロック(24
a)にてリサンプリング点の時系列方向における番号を示
す変数Ｊに値１が設定されると共に音響パラメータ時系
列Pi(n)の時系列方向における番号を示す変数ICに値１
が設定される。そして、ブロック(24b)にて変数Ｊがイ
ンクリメントされ、ブロック(24c)にてそのときの変数
Ｊが（Ｍ−１）以下であるかどうかにより、そのときの
リサンプリング点の時系列方向における番号がリサンプ
リングする必要のある最後の番号になっていないかどう
かを判断し、なっていればこの補間点抽出器(24)の信号
処理を終了し、なっていなければブロック(24d)にて第
１番目のリサンプリング点から第Ｊ番目のリサンプリン
グ点までのリサンプル距離DL（＝Ｔ（Ｊ−１））が算出
され、ブロック(24e)にて変数ICがインクリメントさ
れ、ブロック(24f)にてリサンプル距離DLが音響パラメ
ータ時系列Pi(n)の第１番目のパラメータPi(1)から第IC
番目のパラメータPi(IC)までの距離SL(IC)よりも小さい
かどうかにより、そのときのリサンプリング点が軌跡上
においてそのときのパラメータPi(IC)よりも軌跡の始端
側に位置するかどうかを判断し、位置していなければブ
ロック(24e)にて変数ICをインクリメントした後再びブ
ロック(24f)にてリサンプリング点とパラメータPi(IC)
との軌跡上における位置の比較をし、リサンプリング点
が軌跡上においてパラメータPi(IC)よりも始端側に位置
すると判断されたとき、ブロック(24g)にてリサンプリ
ングにより軌跡に沿う新たな音響パラメータQi(J)が形
成される。即ち、先ず第Ｊ番目のリサンプリング点によ
るリサンプル距離DLからこの第Ｊ番目のリサンプリング
点よりも始端側に位置する第（ＩＣ−１）番目のパラメ
ータPi(IC-1)による距離SL(IC-1)を減算して第（ＩＣ−
１）番目のパラメータPi(IC-1)から第Ｊ番目のリサンプ
リング点迄の距離ＳＳを求める。次に、軌跡上において
この第Ｊ番目のリサンプリング点の両側に位置するパラ
メータPi(IC-1)及びパラメータPi(IC)間の距離Ｓ(n)
（この距離Ｓ(n)は(7)式にて示される信号処理にて得ら
れる。）にてこの距離SSを除算SS／Ｓ(IC-1)し、この除
算結果SS／Ｓ(IC-1)に軌跡上において第Ｊ番目のリサン
プリング点の両側に位置するパラメータPi(IC)とPi(IC-
1)との差Pi(IC)−Pi(IC-1)）を掛算（Pi(IC)−Pi(IC-
1)）＊SS／Ｓ(IC-1)して、軌跡上において第Ｊ番目のリ
サンプリング点のこのリサンプリング点よりも始端側に
隣接して位置する第(IC-1)番目のパラメータPi(IC-1)か
らの補間量を算出し、この補間量と第Ｊ番目のリサンプ
リング点よりも始端側に隣接して位置する第（IC−１）
番目のパラメータPi(IC-1)とを加算して、軌跡に沿う新
たな音響パラメータQi(J)が形成される。第14図に２次
元の音響パラメータ時系列Ｐ(1)，Ｐ(2)，……，Ｐ(8)
に対してパラメータ間を直線近似して軌跡を推定し、こ
の軌跡に沿って直線補間により６点の新たな音響パラメ
ータ時系列Ｑ(1)，Ｑ(2)，……，Ｑ(6)を形成した例を
示す。又、このブロック(24g)においては周波数系列方
向に１次元分（ｉ＝１，……，Ｉ）の信号処理が行なわ
れる。Here, the signal processing in the interpolation point extractor (24) is described with reference to the flowchart of FIG. 13. First, in block (24a), the variable J, which indicates the number of the resampling point in the time-series direction, is set to 1, and the variable IC, which indicates the number of the acoustic parameter time series Pi(n) in the time-series direction, is also set to 1. Next, J is incremented in block (24b), and block (24c) checks whether J is (M-1) or less, that is, whether the resampling point number has passed the last number that needs to be resampled. If it has, the processing of the interpolation point extractor (24) ends. If it has not, block (24d) calculates the resample distance DL (= T(J-1)) from the first resampling point to the J-th resampling point, and block (24e) increments IC. Block (24f) then checks whether DL is smaller than the distance SL(IC) from the first parameter Pi(1) of the acoustic parameter time series Pi(n) to the IC-th parameter Pi(IC), that is, whether the resampling point lies on the start side of Pi(IC) on the trajectory. If it does not, IC is incremented again in block (24e) and the comparison in block (24f) is repeated. When the resampling point is judged to lie on the start side of Pi(IC), block (24g) forms a new acoustic parameter Qi(J) on the trajectory by resampling. Specifically, the distance SL(IC-1) to the (IC-1)-th parameter Pi(IC-1), which lies on the start side of the J-th resampling point, is first subtracted from the resample distance DL to give the distance SS from Pi(IC-1) to the J-th resampling point. Next, SS is divided by the distance S(n) between the parameters Pi(IC-1) and Pi(IC) that lie on either side of the J-th resampling point (this distance S(n) is obtained by the signal processing of equation (7)), giving SS/S(IC-1). This quotient is multiplied by the difference Pi(IC) - Pi(IC-1) between those two parameters, giving (Pi(IC) - Pi(IC-1)) * SS/S(IC-1), which is the interpolation amount measured from Pi(IC-1), the parameter adjacent to the J-th resampling point on its start side. Adding this interpolation amount to Pi(IC-1) yields the new acoustic parameter Qi(J) on the trajectory. FIG. 14 shows an example in which a trajectory is estimated by linear approximation between the parameters of a two-dimensional acoustic parameter time series P(1), P(2), ..., P(8), and a new six-point acoustic parameter time series Q(1), Q(2), ..., Q(6) is formed by linear interpolation along this trajectory. In block (24g), this signal processing is carried out along the frequency-series direction (i = 1, ..., I).
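
The flow of blocks (24a) to (24g) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `resample_by_interpolation` is hypothetical, the segment length S(n) is assumed to be the Euclidean distance between consecutive frames, and the resampling interval is assumed to be T = SL/(M-1), which agrees with the one-dimensional patterns quoted elsewhere in this description.

```python
import math


def resample_by_interpolation(p, m):
    """Resample the polyline through the frames in p at m equally spaced points.

    p is a list of frames, each a list of parameter values; m must be >= 2.
    """
    # Segment lengths S(n) between consecutive frames (Euclidean norm assumed).
    seg = [math.dist(p[n], p[n + 1]) for n in range(len(p) - 1)]
    # Cumulative distance SL(n) from the first frame to each frame.
    cum = [0.0]
    for s in seg:
        cum.append(cum[-1] + s)
    t = cum[-1] / (m - 1)  # resampling interval T (assumed form)

    q = [list(p[0])]  # start point: Q(1) = P(1)
    ic = 1            # index of the frame just past the resampling point
    for j in range(1, m - 1):  # the (M - 2) interior points
        dl = t * j             # distance DL from the first resampling point
        # Blocks (24e)/(24f): advance until DL < SL(IC).
        while ic < len(p) - 1 and dl >= cum[ic]:
            ic += 1
        # Block (24g): linear interpolation between frames ic-1 and ic.
        ss = dl - cum[ic - 1]
        ratio = ss / seg[ic - 1] if seg[ic - 1] > 0 else 0.0
        q.append([a + (b - a) * ratio for a, b in zip(p[ic - 1], p[ic])])
    q.append(list(p[-1]))  # end point: Q(M) = P(N)
    return q


pattern_a = [[v] for v in [2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8]]
print([frame[0] for frame in resample_by_interpolation(pattern_a, 8)])
```

Applied to the one-dimensional input pattern A of FIG. 4 with eight resampling points, the sketch returns the values 2, 4, 6, 8, 6, 4, 6, 8 that the description gives for FIG. 15.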

この様にしてブロック(24b)乃至(24g)にて始点及び終点
（これらはQi(1)＝Pi(1)，Qi(M)＝Pi(N)である。）を除く（Ｍ−２）点のリサンプリングにより
新たな音響パラメータ時系列Qi(m)が形成される。In this way, blocks (24b) to (24g) form the new acoustic parameter time series Qi(m) by resampling (M-2) points, excluding the start point and the end point (these are Qi(1) = Pi(1) and Qi(M) = Pi(N)).

このＮＡＴ処理部(21)の新たな音響パラメータ時系列Qi
(m)をモード切換スイッチ(3)により、登録モードにおい
ては認識対象語毎に標準パターンメモリ(4)に格納し、
認識モードにおいてはチェビシェフ距離算出部(25)の一
端に供給する。又、この認識モードにおいては標準パタ
ーンメモリ(4)に格納されている標準パターンをチェビ
シェフ距離算出部(25)の他端に供給する。このチェビシ
ェフ距離算出部(25)においてはその時入力されている音
声の時間軸の正規化された新たな音響パラメータ時系列
Qi(m)よりなる入力パターンと、標準パターンメモリ(4)
の標準パターンとのチェビシェフ距離算出処理がなされ
る。The new acoustic parameter time series Qi(m) from the NAT processing unit (21) is routed by the mode changeover switch (3): in the registration mode it is stored in the standard pattern memory (4) for each recognition target word, and in the recognition mode it is supplied to one input of the Chebyshev distance calculation unit (25). In the recognition mode, the standard patterns stored in the standard pattern memory (4) are supplied to the other input of the Chebyshev distance calculation unit (25). The Chebyshev distance calculation unit (25) calculates the Chebyshev distance between the input pattern, which consists of the time-normalized new acoustic parameter time series Qi(m) of the voice being input at that time, and each standard pattern in the standard pattern memory (4).

そして、このチェビシェフ距離を示す距離信号を最小距
離判定部(6)に供給し、この最小距離判定部(6)にて入力
パターンに対するチェビシェフ距離が最小となる標準パ
ターンが判定され、この判定結果より入力音声を示す認
識結果を出力端子(7)に供給する。Then, the distance signal indicating the Chebyshev distance is supplied to the minimum distance determination unit (6), and the minimum distance determination unit (6) determines the standard pattern that minimizes the Chebyshev distance with respect to the input pattern. The recognition result indicating the input voice is supplied to the output terminal (7).

この様にしてなる音声認識装置の動作について説明す
る。The operation of the speech recognition apparatus thus configured will be described.

マイクロホン(1)の音声信号が音響分析部(2)にて音声区
間毎に音源特性の正規化された音響パラメータ時系列Pi
(n)に変換され、この音響パラメータ時系列Pi(n)がＮＡ
Ｔ処理部(21)に供給され、このＮＡＴ処理部(21)にて音
響パラメータ時系列Pi(n)からそのパラメータ空間にお
ける直線近似による軌跡が推定され、この軌跡に基づい
て時間軸正規化のなされた新たな音響パラメータ時系列
Qi(m)が形成され、登録モードにおいてはこの新たな音
響パラメータ時系列Qi(m)がモード切換スイッチ(3)を介
して標準パターンメモリ(4)に格納される。The voice signal from the microphone (1) is converted by the acoustic analysis unit (2) into an acoustic parameter time series Pi(n) whose sound source characteristics are normalized for each voice section. This time series Pi(n) is supplied to the NAT processing unit (21), which estimates from it a trajectory in the parameter space by linear approximation and, based on this trajectory, forms a new time-normalized acoustic parameter time series Qi(m). In the registration mode, this new acoustic parameter time series Qi(m) is stored in the standard pattern memory (4) via the mode changeover switch (3).

又、認識モードにおいては、ＮＡＴ処理部(21)の新たな
音響パラメータ時系列Qi(m)がモード切換スイッチ(3)を
介してチェビシェフ距離算出部(25)に供給されると共に
標準パターンメモリ(4)の標準パターンがチェビシェフ
距離算出部(25)に供給される。第15図乃至第17図に第４
図乃至第６図に示す１次元の入力パターンＡのパラメー
タ時系列；2,4,6,8,8,8,8,6,4,4,4,6,8、標準パターン
Ａ′のパラメータ時系列；3,5,7,9,9,9,9,7,5,5,7,9、
標準パターンＢ′のパラメータ時系列；7,6,6,8,8,8,8,
6,4,4,4をＮＡＴ処理部(21)にて直線近似にて軌跡を推
定し、リサンプリング点を８点とする処理をした１次元
の入力パターンＡのパラメータ時系列；2,4,6,8,6,4,6,
8、標準パターンＡ′のパラメータ時系列；3,5,7,9,7,
5,7,9、標準パターンＢ′のパラメータ時系列；7,6,7,
8,7,6,5,4を夫々示す。この場合、音響パラメータ時系
列Pi(n)からそのパラメータ空間における軌跡を推定
し、この軌跡に沿って新たな音響パラメータ時系列Qi
(m)が形成されるので、入力音声を変換した音響パラメ
ータ時系列Pi(n)自身により時間軸正規化がなされる。
そして、チェビシェフ距離算出部(25)において入力パタ
ーンＡと標準パターンＡ′との間のチェビシェフ距離８
が算出されると共に入力パターンＡと標準パターンＢ′
との間のチェビシェフ距離16が算出され、これら距離８
及び距離16を夫々示す距離信号が最小距離判定部(6)に
供給され、この最小距離判定部(6)にて距離８が距離16
よりも小さいことから標準パターンＡが入力パターン
Ａ′であると判定され、この判定結果より入力音声が標
準パターンＡであることを示す認識結果が出力端子(7)
に得られる。従って、部分的に類似しているような語い
間に於いても誤認識することが比較的少ない音声認識を
行なうことができる。In the recognition mode, the new acoustic parameter time series Qi(m) from the NAT processing unit (21) is supplied to the Chebyshev distance calculation unit (25) via the mode changeover switch (3), and the standard patterns in the standard pattern memory (4) are also supplied to the Chebyshev distance calculation unit (25). FIGS. 15 to 17 show the one-dimensional patterns of FIGS. 4 to 6 after processing in the NAT processing unit (21), with the trajectory estimated by linear approximation and eight resampling points. The original parameter time series are: input pattern A: 2,4,6,8,8,8,8,6,4,4,4,6,8; standard pattern A′: 3,5,7,9,9,9,9,7,5,5,7,9; standard pattern B′: 7,6,6,8,8,8,8,6,4,4,4. After processing they become: input pattern A: 2,4,6,8,6,4,6,8; standard pattern A′: 3,5,7,9,7,5,7,9; standard pattern B′: 7,6,7,8,7,6,5,4. Because the trajectory in the parameter space is estimated from the acoustic parameter time series Pi(n) and the new acoustic parameter time series Qi(m) is formed along this trajectory, the time axis is normalized by the acoustic parameter time series Pi(n) of the input voice itself. The Chebyshev distance calculation unit (25) then calculates a Chebyshev distance of 8 between input pattern A and standard pattern A′ and a Chebyshev distance of 16 between input pattern A and standard pattern B′. Distance signals indicating the distances 8 and 16 are supplied to the minimum distance determination unit (6), which, because 8 is smaller than 16, determines that input pattern A corresponds to standard pattern A′, and a recognition result indicating that the input voice is standard pattern A′ is obtained at the output terminal (7). Voice recognition with relatively few recognition errors is therefore possible even between words that are partially similar.
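
The distance calculation and minimum-distance decision for this example can be reproduced with a short Python sketch. The function names are hypothetical, and the definition of the distance is an assumption: the quoted values 8 and 16 are obtained when the "Chebyshev distance" is computed as the sum of absolute differences over all frames and channels, so that is what the sketch uses.

```python
def pattern_distance(x, y):
    """Sum of absolute differences over all frames and channels (assumed definition)."""
    return sum(abs(a - b) for fx, fy in zip(x, y) for a, b in zip(fx, fy))


def recognize(input_pattern, standard_patterns):
    """Return the label of the standard pattern closest to the input pattern."""
    return min(
        standard_patterns,
        key=lambda label: pattern_distance(input_pattern, standard_patterns[label]),
    )


# Resampled one-dimensional patterns quoted in the description (one channel per frame).
input_a = [[v] for v in [2, 4, 6, 8, 6, 4, 6, 8]]
standards = {
    "A'": [[v] for v in [3, 5, 7, 9, 7, 5, 7, 9]],
    "B'": [[v] for v in [7, 6, 7, 8, 7, 6, 5, 4]],
}

print(pattern_distance(input_a, standards["A'"]))  # 8
print(pattern_distance(input_a, standards["B'"]))  # 16
print(recognize(input_a, standards))               # A'
```

Because every pattern has the same number of frames after NAT processing, the comparison is a single pass over frame pairs, with no time warping.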

ここで、ＮＡＴ処理を行なうようにした音声認識装置と
ＤＰマッチング処理を行なうようにした音声認識装置と
の演算量における差異について説明する。Here, the difference in the amount of calculation between the voice recognition device that performs the NAT process and the voice recognition device that performs the DP matching process will be described.

入力パターンに対する標準パターン１個当たりのＤＰマ
ッチング距離計算部(5)における平均演算量をαとし、
チェビシェフ距離算出部(25)における平均演算量をβと
し、ＮＡＴ処理部(21)の平均の演算量をγとしたとき、
Ｊ個の標準パターンに対するＤＰマッチング処理による
演算量Ｃ_１はＣ_１＝α・Ｊ ……(11) である。又、Ｊ個の標準パターンに対するＮＡＴ処理し
た場合の演算量Ｃ_２はＣ_２＝β・Ｊ＋γ ……(12) である。一般に、平均演算量αは平均演算量βに対して
α≫βなる関係がある。従って、なる関係が成り立つ、即ち認識対象語い数が増加するに
従って演算量Ｃ_１は演算量Ｃ_２に対してＣ_１≫Ｃ_２なる
関係となり、ＮＡＴ処理を行なうようにした音声認識装
置に依れば、処理の為の演算量を大幅に低減できる。Let α be the average amount of computation in the DP matching distance calculation unit (5) per standard pattern for one input pattern, β the average amount of computation in the Chebyshev distance calculation unit (25), and γ the average amount of computation in the NAT processing unit (21). The amount of computation C₁ for DP matching against J standard patterns is then C₁ = α·J (11), and the amount of computation C₂ when NAT processing is used with J standard patterns is C₂ = β·J + γ (12). In general, α ≫ β. Consequently, as the number of recognition target words increases, C₁ ≫ C₂, so a voice recognition device that performs NAT processing can greatly reduce the amount of computation needed for processing.
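
One way to read equations (11) and (12) together is to look at their ratio as the number of standard patterns J grows. The following is a derivation from those two equations only, offered as an illustration rather than as a formula recovered from the original text:

```latex
\frac{C_1}{C_2} = \frac{\alpha J}{\beta J + \gamma}
\;\longrightarrow\; \frac{\alpha}{\beta} \gg 1
\qquad (J \to \infty)
```

The NAT cost γ is paid once per input pattern, so its share of C₂ shrinks as J increases and the ratio approaches α/β.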

又、ＮＡＴ処理部(21)より得られる新たな音響パラメー
タ時系列Qi(m)はその時系列方向において一定のパラメ
ータ数に設定できるので、標準パターンメモリ(4)の記
憶領域を有効に利用でき、その記憶容量を比較的少なく
できる。Furthermore, because the new acoustic parameter time series Qi(m) obtained from the NAT processing unit (21) can be set to a fixed number of parameters in the time-series direction, the storage area of the standard pattern memory (4) can be used efficiently and its storage capacity can be kept relatively small.

ところで、この様な音声認識装置におけるＮＡＴ処理部
(21)においてはＮＡＴ処理として音響パラメータ時系列
Pi(n)からそのパラメータ空間における直線近似による
軌跡を推定し、この軌跡に沿って直線補間により新たな
音響パラメータ時系列Qi(m)を形成しているのである
が、音声の発声がゆらぎの少ない理想状態に近い場合等
においては、元も音響パラメータ時系列Pi(n)に有効な
情報が含まれており、補間により新たな音響パラメータ
時系列Qi(m)を形成すると認識率が低下するということ
が明らかとなった。The NAT processing unit (21) in such a voice recognition device estimates, from the acoustic parameter time series Pi(n), a trajectory in the parameter space by linear approximation and forms the new acoustic parameter time series Qi(m) by linear interpolation along this trajectory. It has become clear, however, that in cases such as an utterance close to an ideal state with little fluctuation, the original acoustic parameter time series Pi(n) already contains the useful information, and forming a new acoustic parameter time series Qi(m) by interpolation lowers the recognition rate.

又、新たな音響パラメータ時系列Qi(m)を補間により形
成する場合、その処理の為に比較的多くの演算処理を必
要とする不都合があった。Further, when a new acoustic parameter time series Qi (m) is formed by interpolation, there is a disadvantage that a relatively large amount of arithmetic processing is required for the processing.

発明の目的本発明は斯かる点に鑑み元の音響パラメータ時系列の有
する有効な情報を活かした認識率の比較的高い音声認識
ができ、且つ処理の為の演算量が比較的少ないものを得
ることを目的とする。OBJECT OF THE INVENTION In view of the above points, an object of the present invention is to provide a voice recognition device that achieves a relatively high recognition rate by making use of the useful information in the original acoustic parameter time series and that requires a relatively small amount of computation for processing.

発明の概要本発明音声認識装置は例えば第18図に示す如く音声信号
入力部(1)と、この音声信号入力部(1)から音声信号の供
給を受ける音響分析部(2)と、この音響分析部(2)から第
１の音響パラメータ系列の供給を受ける音響パラメータ
処理部(26)と、この音響パラメータ処理部(26)により形
成される第２の音響パラメータ系列に基づいてこの音声
信号に最も近い音声を判定する音声判定部(25)(6)と、
を備える音声認識装置において、この音響パラメータ処
理部(26)が、この第１の音響パラメータ系列からパラメ
ータ空間における軌跡を推定し、この軌跡に基づいてこ
の第１の音響パラメータ系列の部分集合である第２の音
響パラメータ系列を形成するものであり、斯かる本発明
音声認識装置に依れば、部分的に類似しているような語
い間に於いても誤認識することが比較的少なく、元も音
響パラメータ時系列の有する有効な情報を活かした認識
率の比較的高い音声認識ができ、且つ処理の為の演算量
を比較的少なくできると共に標準パターンメモリの記憶
容量を比較的少なくできる利益がある。SUMMARY OF THE INVENTION The voice recognition device of the present invention comprises, for example as shown in FIG. 18, a voice signal input unit (1); an acoustic analysis unit (2) that receives the voice signal from the voice signal input unit (1); an acoustic parameter processing unit (26) that receives a first acoustic parameter sequence from the acoustic analysis unit (2); and a voice determination unit (25), (6) that determines the voice closest to the voice signal based on a second acoustic parameter sequence formed by the acoustic parameter processing unit (26). The acoustic parameter processing unit (26) estimates a trajectory in the parameter space from the first acoustic parameter sequence and, based on this trajectory, forms the second acoustic parameter sequence as a subset of the first acoustic parameter sequence. With this voice recognition device, erroneous recognition is relatively rare even between words that are partially similar, a relatively high recognition rate is obtained by making use of the useful information in the original acoustic parameter time series, the amount of computation for processing can be kept relatively small, and the storage capacity of the standard pattern memory can be kept relatively small.

実施例以下、第18図乃至第20図を参照しながら本発明音声認識
装置の一実施例について説明しよう。この第18図乃至第
20図において第１図乃至第18図と対応する部分に同一符
号を付してその詳細な説明は省略する。EMBODIMENT An embodiment of the voice recognition device of the present invention is described below with reference to FIGS. 18 to 20. In FIGS. 18 to 20, parts corresponding to those in FIGS. 1 to 18 are given the same reference numerals, and their detailed description is omitted.

本例においては第18図に示す如く音声信号入力部として
のマイクロホン(1)の音声信号を音響分析部(2)に供給
し、この音響分析部(2)の音響パラメータ時系列Pi(n)を
ＮＡＴ処理部(26)に供給する。このＮＡＴ処理部(26)は
音響パラメータ時系列からそのパラメータ空間における
軌跡を例えば直線近似により推定し、この軌跡に基づい
て元も音響パラメータ時系列Pi(n)を活かした新たな音
響パラメータ時系列Qi(m)を形成するものである。In this example, as shown in FIG. 18, the voice signal from the microphone (1) serving as the voice signal input unit is supplied to the acoustic analysis unit (2), and the acoustic parameter time series Pi(n) from the acoustic analysis unit (2) is supplied to the NAT processing unit (26). The NAT processing unit (26) estimates a trajectory in the parameter space from the acoustic parameter time series, for example by linear approximation, and based on this trajectory forms a new acoustic parameter time series Qi(m) that makes use of the original acoustic parameter time series Pi(n).

この様な処理の為に、音響分析部(2)の音響パラメータ
時系列Pi(n)を軌跡長算出器(22)に供給する。この軌跡
長算出器(22)は(6)式、(7)式、(8)式及び(9)式にて示さ
れる信号処理を行ない、音響パラメータ時系列Pi(n)が
そのパラメータ空間において描く直線近似による軌跡の
長さ、即ち軌跡長SLを算出するものである。For this processing, the acoustic parameter time series Pi(n) from the acoustic analysis unit (2) is supplied to the trajectory length calculator (22). The trajectory length calculator (22) performs the signal processing of equations (6), (7), (8) and (9) and calculates the length of the linearly approximated trajectory that the acoustic parameter time series Pi(n) describes in its parameter space, that is, the trajectory length SL.

この軌跡長算出器(22)の軌跡長SLを示す軌跡長信号を補
間間隔算出器(23)に供給する。この補間間隔算出器(23)
は(10)式にて示される信号処理を行ない、直線近似され
た軌跡に沿って等間隔でＭ点のリサンプリングを行なう
場合のリサンプリング間隔Ｔを算出する。A trajectory length signal indicating the trajectory length SL from the trajectory length calculator (22) is supplied to the interpolation interval calculator (23). The interpolation interval calculator (23) performs the signal processing of equation (10) and calculates the resampling interval T for resampling M points at equal intervals along the linearly approximated trajectory.
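
The roles of the trajectory length calculator (22) and the interpolation interval calculator (23) can be illustrated with the sketch below. Equations (6) to (10) are not reproduced in this part of the text, so two assumptions are made and should be read as such: the distance between adjacent frames is taken as the Euclidean distance, and the interval is taken as T = SL/(M-1). Both function names are hypothetical.

```python
import math


def trajectory_length(p):
    """Length SL of the polyline through consecutive frames (Euclidean norm assumed)."""
    return sum(math.dist(p[n], p[n + 1]) for n in range(len(p) - 1))


def resampling_interval(p, m):
    """Interval T that places m equally spaced points on the trajectory (assumed T = SL/(M-1))."""
    return trajectory_length(p) / (m - 1)


# One-dimensional input pattern A from the description, one channel per frame.
pattern_a = [[v] for v in [2, 4, 6, 8, 8, 8, 8, 6, 4, 4, 4, 6, 8]]
print(trajectory_length(pattern_a))       # 14.0
print(resampling_interval(pattern_a, 8))  # 2.0
```

Repeated frames contribute zero length, which is why stationary stretches of an utterance do not receive extra resampling points.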

この補間間隔算出器(23)のリサンプリング間隔Ｔを示す
リサンプリング間隔信号を補間点置換器(27)の一端に供
給すると共に音響分析部(2)の音響パラメータ時系列Pi
(n)を補間点置換器(27)の他端に供給する。この補間点
置換器(27)は音響パラメータ時系列Pi(n)のそのパラメ
ータ空間における直線近似による軌跡に沿ってリサンプ
リング間隔Ｔにて直線補間される補間点に最も近い元の
音響パラメータ時系列Pi(n)を新たな音響パラメータ時
系列Qi(m)として形成するものである。A resampling interval signal indicating the resampling interval T from the interpolation interval calculator (23) is supplied to one input of the interpolation point replacer (27), and the acoustic parameter time series Pi(n) from the acoustic analysis unit (2) is supplied to its other input. The interpolation point replacer (27) forms the new acoustic parameter time series Qi(m) from the parameters of the original acoustic parameter time series Pi(n) that are closest to the points that would be obtained by linear interpolation at the resampling interval T along the linearly approximated trajectory of Pi(n) in its parameter space.

ここで、この補間点置換器(27)における信号処理を第19
図に示す流れ図に沿って説明する。先ず、ブロック(27
a)にてリンサプリング点の時系列方向における番号を示
す変数Ｊに値１が設定されると共に音響パラメータ時系
列Pi(n)の時系列方向における番号を示す変数ICに値１
が設定される。そして、ブロック(27b)にて変数Ｊがイ
ンクリメントされ、ブロック(27c)にてそのときの変数
Ｊが（Ｍ−１）以下であるかどうかにより、そのときの
リサンプリング点の時系列方向における番号がリサンプ
リングする必要のある最後の番号になっていないかどう
かを判断し、なっていればこの補間点置換器(27)の信号
処理を終了し、なっていなければブロック(27d)にて第
１番目のリサンプリング点から第Ｊ番目のリサンプリン
グ点までのリサンプル距離DL（＝Ｔ（Ｊ−１））が算出
され、ブロック(27e)にて変数ICがインクリメントさ
れ、ブロック(27f)にてリサンプリング距離DLが音響パ
ラメータ時系列Pi(n)の第１番目のパラメータPi(1)から
第IC番目のパラメータPi(IC)までの距離SL(IC)よりも小
さいかどうかにより、そのときのリサンプリング点が軌
跡上においてそのときのパラメータPi(IC)よりも軌跡の
始端側に位置するかどうかを判断し、位置していなけれ
ばブロック(27e)にて変数ICをインクリメントした後再
びブロック(27f)にてリサンプリング点とパラメータPi
(IC)との軌跡上における位置の比較をし、リサンプリン
グ点が軌跡上においてパラメータPi(IC)よりも始端側に
位置すると判断されたとき、ブロック(27g)にて第１番
目のリサンプリング点から第Ｊ番目のリサンプリング点
までのリサンプル距離DLから、第１番目のパラメータPi
(1)から第（IC−１）番目のパラメータPi(IC-1)までの
距離SL(IC-1)を演算(DL-SL(IC-1)したものが、距離SL(I
C)からリサンプル距離DLを減算したものよりも小さいか
どうかにより、直線近似による軌跡上において第Ｊ番目
のリサンプリング点に対して元も音響パラメータ時系列
Pi(n)の内、最も近いパラメータがその第（IC−１）番
目であるか第IC番目であるかを判断し、小さければブロ
ック(27h)にて軌跡に沿う第Ｊ番目の新たな音響パラメ
ータQi(J)として第（IC−１）番目の元も音響パラメー
タPi(IC-1)が置き換えられる（Qi(J)＝Pi(IC-1)）。
又、ブロック(27g)にて小さくないと判断された場合、
ブロック(27i)にて軌跡に沿う第Ｊ番目の新たな音響パ
ラメータQi(J)として第IC番目の元も音響パラメータPi
(IC)が置き換えられる（Qi(J)＝Pi(IC)）。尚、これら
ブロック(27h)及び(27i)においては周波数系列方向に１
次元分の信号処理が行なわれる。Here, the signal processing in the interpolation point replacer (27) is described with reference to the flowchart of FIG. 19. First, in block (27a), the variable J, which indicates the number of the resampling point in the time-series direction, is set to 1, and the variable IC, which indicates the number of the acoustic parameter time series Pi(n) in the time-series direction, is also set to 1. Next, J is incremented in block (27b), and block (27c) checks whether J is (M-1) or less, that is, whether the resampling point number has passed the last number that needs to be resampled. If it has, the processing of the interpolation point replacer (27) ends. If it has not, block (27d) calculates the resample distance DL (= T(J-1)) from the first resampling point to the J-th resampling point, and block (27e) increments IC. Block (27f) then checks whether DL is smaller than the distance SL(IC) from the first parameter Pi(1) of the acoustic parameter time series Pi(n) to the IC-th parameter Pi(IC), that is, whether the resampling point lies on the start side of Pi(IC) on the trajectory. If it does not, IC is incremented again in block (27e) and the comparison in block (27f) is repeated. When the resampling point is judged to lie on the start side of Pi(IC), block (27g) checks whether DL - SL(IC-1), the resample distance DL minus the distance SL(IC-1) from the first parameter Pi(1) to the (IC-1)-th parameter Pi(IC-1), is smaller than SL(IC) - DL. This determines whether the parameter of the original acoustic parameter time series Pi(n) that is closest to the J-th resampling point on the linearly approximated trajectory is the (IC-1)-th or the IC-th. If it is smaller, block (27h) takes the (IC-1)-th original acoustic parameter Pi(IC-1) as the J-th new acoustic parameter on the trajectory (Qi(J) = Pi(IC-1)). If block (27g) finds that it is not smaller, block (27i) takes the IC-th original acoustic parameter Pi(IC) as the J-th new acoustic parameter (Qi(J) = Pi(IC)). In blocks (27h) and (27i), this signal processing is carried out along the frequency-series direction.
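
The replacement flow of blocks (27a) to (27i) can be sketched in Python as follows. As with the other sketches, this is an illustration under stated assumptions rather than the patented implementation: the name `resample_by_replacement` is hypothetical, distances between frames are assumed Euclidean, and the interval is assumed to be T = SL/(M-1). The five two-dimensional frames in the usage example are invented to mimic the situation of FIG. 20.

```python
import math


def resample_by_replacement(p, m):
    """Pick, for each of m equally spaced trajectory points, the nearest original frame.

    p is a list of frames, each a list of parameter values; m must be >= 2.
    """
    seg = [math.dist(p[n], p[n + 1]) for n in range(len(p) - 1)]
    cum = [0.0]  # cumulative distance SL(n) from the first frame
    for s in seg:
        cum.append(cum[-1] + s)
    t = cum[-1] / (m - 1)  # resampling interval T (assumed form)

    q = [p[0]]  # start point: Q(1) = P(1)
    ic = 1
    for j in range(1, m - 1):
        dl = t * j
        # Blocks (27e)/(27f): advance until DL < SL(IC).
        while ic < len(p) - 1 and dl >= cum[ic]:
            ic += 1
        # Block (27g): is the point closer to frame ic-1 or to frame ic?
        if dl - cum[ic - 1] < cum[ic] - dl:
            q.append(p[ic - 1])  # block (27h)
        else:
            q.append(p[ic])      # block (27i)
    q.append(p[-1])  # end point: Q(M) = P(N)
    return q


# Hypothetical frames P1..P5; the trajectory midpoint falls closest to P3.
frames = [[0, 0], [1, 0], [1, 2], [4, 2], [4, 3]]
print(resample_by_replacement(frames, 3))  # [[0, 0], [1, 2], [4, 3]]
```

No multiplication or division per channel is needed inside the loop, only two subtractions and a comparison on scalar distances, and every output frame is an unmodified input frame.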

この様にしてブロック(27b)乃至ブロック(27i)にて始点
及び終点（これらはQi(1)＝Pi(1)，Qi(M)＝Pi(N)であ
る。）を除く（Ｍ−２）点の新たな音響パラメータ時系
列Qi(m)が元の音響パラメータ時系列Pi(n)に基づいて置
き換えにより形成される。In this way, blocks (27b) to (27i) form, by replacement based on the original acoustic parameter time series Pi(n), the (M-2) points of the new acoustic parameter time series Qi(m) other than the start point and the end point (these are Qi(1) = Pi(1) and Qi(M) = Pi(N)).

その他音響分析部(2)、モード切換スイッチ(3)、標準パ
ターンメモリ(4)、チェビシェフ距離算出部(25)、最小
距離判定部(6)等は第９図に示す音声認識装置と同様に
構成する。The remaining parts, namely the acoustic analysis unit (2), the mode changeover switch (3), the standard pattern memory (4), the Chebyshev distance calculation unit (25), the minimum distance determination unit (6) and so on, are configured in the same way as in the voice recognition device shown in FIG. 9.

マイクロホン(1)の音声信号が音響分析部(2)にて音声区
間毎に音源特性の正規化された音響パラメータ時系列Pi
(n)に変換され、この音響パラメータ時系列Pi(n)にＮＡ
Ｔ処理部(26)に供給され、このＮＡＴ処理部(26)にて音
響パラメータ時系列Pi(n)からそのパラメータ空間にお
ける直線近似による軌跡が推定され、この軌跡に沿って
リサンプリング間隔Ｔでリサンプリングされるリサンプ
リング点に最も近い元の音響パラメータ時系列Pi(n)中
のパラメータが新たな音響パラメータ時系列Qi(m)とし
て置き換えにより形成される。The voice signal from the microphone (1) is converted by the acoustic analysis unit (2) into an acoustic parameter time series Pi(n) whose sound source characteristics are normalized for each voice section. This time series Pi(n) is supplied to the NAT processing unit (26), which estimates from it a trajectory in the parameter space by linear approximation. The parameters of the original acoustic parameter time series Pi(n) that are closest to the resampling points, taken at the resampling interval T along this trajectory, are then used by replacement to form the new acoustic parameter time series Qi(m).

第20図にこのＮＡＴ処理部(26)のＮＡＴ処理を簡略化し
て示す。この第20図において、５個の×印Ｐ_１，……，
Ｐ_５は元の音響パラメータ時系列を示し、この元の音響
パラメータ時系列Ｐ_１，……，Ｐ_５からそのパラメータ
空間における直線近似による軌跡（×印間の実線にて示
す。）に沿って３個の新たな音響パラメータ時系列
Ｑ_１，Ｑ_２，Ｑ_３が形成される。この場合、軌跡長SLと
して隣接するパラメータ間距離SL(1),SL(2),SL(3),SL
(4)の和SL(1)+SL(2)+SL(3)+SL(4)が算出され、この軌跡
長SLの中点（破線○印にて示す。）が始点及び終点を除
くリサンプリング点となされ、軌跡上においてこの破線
○印にて示すリサンプリング点に最も近い元の音響パラ
メータＰ_３がＮＡＴ処理にて置き換えにより形成される
新たな音響パラメータＱ_２となされる。従って、ＮＡＴ
処理により形成される新たな音響パラメータ時系列
Ｑ_１，Ｑ_２，Ｑ_３は元の音響パラメータ時系列Ｐ_１，Ｐ
_３，Ｐ_５が置き換えられて形成される。この様にして得
られる新たな音響パラメータ時系列Qi(m)は元の音響パ
ラメータ時系列Pi(n)に含まれる有効な情報を有するパ
ラメータとなる。FIG. 20 shows the NAT processing of the NAT processing unit (26) in simplified form. In FIG. 20, the five × marks P₁, ..., P₅ represent the original acoustic parameter time series, and three new acoustic parameters Q₁, Q₂, Q₃ are formed along the trajectory obtained from P₁, ..., P₅ by linear approximation in the parameter space (shown by the solid line between the × marks). In this case, the trajectory length SL is calculated as the sum SL(1)+SL(2)+SL(3)+SL(4) of the distances between adjacent parameters, and the midpoint of the trajectory length SL (shown by the broken-line circle) becomes the resampling point other than the start and end points. The original acoustic parameter P₃, which is closest on the trajectory to this resampling point, becomes the new acoustic parameter Q₂ formed by replacement in the NAT processing. The new acoustic parameter time series Q₁, Q₂, Q₃ formed by the NAT processing therefore consists of the original acoustic parameters P₁, P₃, P₅. The new acoustic parameter time series Qi(m) obtained in this way consists of parameters that carry the useful information contained in the original acoustic parameter time series Pi(n).

そして、この新たな音響パラメータ時系列Qi(m)が、登
録モードにおいては標準パターンとしてモード切換スイ
ッチ(3)を介して標準パターンメモリ(4)に格納される。
又、認識モードにおいてはＮＡＴ処理部(26)の新たな音
響パラメータ時系列Qi(m)が入力パターンとしてモード
切換スイッチ(3)を介してチェビシェフ距離算出部(25)
に供給されると共に標準パターンメモリ(4)の標準パタ
ーンがチェビシェフ距離算出部(25)に供給され、このチ
ェビシェフ距離算出部(25)にて入力パターンと標準パタ
ーンとのチェビシェフ距離が算出され、このチェビシェ
フ距離を示す距離信号が最小距離判定部(6)にて判定さ
れ、入力パターンがどの標準パターンであるか、即ち入
力音声が如何なる標準パターンであるかを示す認識結果
が出力端子(7)に得られる。この場合、ＮＡＴ処理部(2
6)にて形成された新たな音響パラメータ時系列Qi(m)が
元の音響パラメータ時系列Pi(n)に含まれる有効な情報
を有しているのでその分だけ認識率の高い音声認識がな
される。In the registration mode, this new acoustic parameter time series Qi(m) is stored as a standard pattern in the standard pattern memory (4) via the mode changeover switch (3). In the recognition mode, the new acoustic parameter time series Qi(m) from the NAT processing unit (26) is supplied as the input pattern to the Chebyshev distance calculation unit (25) via the mode changeover switch (3), and the standard patterns in the standard pattern memory (4) are also supplied to the Chebyshev distance calculation unit (25). The Chebyshev distance calculation unit (25) calculates the Chebyshev distance between the input pattern and each standard pattern, the distance signal indicating this distance is evaluated by the minimum distance determination unit (6), and a recognition result indicating which standard pattern the input pattern matches, that is, which standard pattern the input voice corresponds to, is obtained at the output terminal (7). Because the new acoustic parameter time series Qi(m) formed by the NAT processing unit (26) retains the useful information contained in the original acoustic parameter time series Pi(n), the recognition rate is correspondingly higher.

以上述べた如く本例の音声認識装置に依れば、音声信号
入力部としてのマイクロホン(1)を有し、この音声信号
入力部(1)からの音声信号を音響分析部(2)に供給し、こ
の音響分析部(2)の音響パラメータ時系列Pi(n)をＮＡＴ
処理部(26)に供給し、このＮＡＴ処理部(26)により音響
パラメータ時系列Pi(n)からそのパラメータ空間におけ
る直線近似による軌跡を推定し、この軌跡に基づいて元
の音響パラメータ時系列Pi(n)の部分集合である新たな
音響パラメータ時系列Qi(m)を形成し、この新たな音響
パラメータ時系列Qi(m)に基づいて音声信号を認識する
ようにした為、元の音響パラメータ時系列Pi(n)の有す
る有効な情報を活かした認識率の比較的高いものを得る
ことができる利益がある。又、新たな音響パラメータ時
系列Qi(m)を形成するのに、補間でなく元の音響パラメ
ータ時系列Pi(n)を選択して置き換えるようにした為、
演算量が比較的多く必要な補間のための処理を必要とせ
ず、その分だけ処理の為の演算量を少なくできる利益が
ある。更に、上述第９図に示す音声認識装置と同様、部
分的に類似しているような語い間に於いても誤認識する
ことが比較的少なく、且つ標準パターンメモリ(4)の記
憶容量を比較的少なくできる利益がある。As described above, the voice recognition device of this example has the microphone (1) as a voice signal input unit, supplies the voice signal from the voice signal input unit (1) to the acoustic analysis unit (2), and supplies the acoustic parameter time series Pi(n) from the acoustic analysis unit (2) to the NAT processing unit (26). The NAT processing unit (26) estimates from Pi(n) a trajectory in the parameter space by linear approximation and, based on this trajectory, forms a new acoustic parameter time series Qi(m) that is a subset of the original acoustic parameter time series Pi(n), and the voice signal is recognized based on this new time series Qi(m). This gives the advantage of a relatively high recognition rate that makes use of the useful information in the original acoustic parameter time series Pi(n). In addition, because the new acoustic parameter time series Qi(m) is formed by selecting and substituting parameters of the original time series Pi(n) rather than by interpolation, the comparatively expensive interpolation processing is not needed, and the amount of computation for processing is reduced accordingly. Furthermore, as with the voice recognition device shown in FIG. 9, erroneous recognition is relatively rare even between words that are partially similar, and the storage capacity of the standard pattern memory (4) can be kept relatively small.

尚、上述実施例においては音響パラメータ時系列Pi(n)
からそのパラメータ空間における軌跡を直線近似にて推
定するようにした場合について述べたけれども、円弧近
似、スプライン近似等により軌跡を推定するようにして
も上述実施例と同様の作用効果を得ることができること
は容易に理解できよう。又、上述実施例においては音響
パラメータ時系列Pi(n)からそのパラメータ空間におけ
る軌跡を推定し、この軌跡に基づいて元の音響パラメー
タ時系列Pi(n)の部分集合である新たな音響パラメータ
時系列Qi(m)を形成するようにした場合について述べた
けれども、音響パラメータ周波数系列からそのパラメー
タ空間における軌跡を推定し、この軌跡に基づいて元の
音響パラメータ周波数系列の部分集合である新たな音響
パラメータ周波数系列を形成することにより、元の音響
パラメータ周波数系列の有する有効な情報を活かして音
声信号の周波数特性の正規化を行なうことができる。更
に、本発明は上述実施例に限らず本発明の要旨を逸脱す
ることなくその他種々の構成を取り得ることは勿論であ
る。The above embodiment describes the case in which the trajectory in the parameter space is estimated from the acoustic parameter time series Pi(n) by linear approximation, but it will be readily understood that the same effects can be obtained when the trajectory is estimated by arc approximation, spline approximation, or the like. The above embodiment also describes the case in which a trajectory in the parameter space is estimated from the acoustic parameter time series Pi(n) and a new acoustic parameter time series Qi(m) that is a subset of the original time series Pi(n) is formed based on this trajectory. Alternatively, a trajectory in the parameter space can be estimated from an acoustic parameter frequency series, and a new acoustic parameter frequency series that is a subset of the original frequency series can be formed based on that trajectory, so that the frequency characteristics of the voice signal are normalized while making use of the useful information in the original acoustic parameter frequency series. The present invention is of course not limited to the above embodiment, and various other configurations can be adopted without departing from the gist of the invention.

EFFECTS OF THE INVENTION The voice recognition device of the present invention comprises a voice signal input unit, an acoustic analysis unit supplied with a voice signal from the voice signal input unit, an acoustic parameter processing unit supplied with a first acoustic parameter sequence from the acoustic analysis unit, and a voice determination unit that determines the voice closest to the voice signal based on a second acoustic parameter sequence formed by the acoustic parameter processing unit. In this device, the acoustic parameter processing unit estimates a trajectory in the parameter space from the first acoustic parameter sequence and, based on this trajectory, forms the second acoustic parameter sequence as a subset of the first acoustic parameter sequence. As a result, misrecognition is relatively rare even between partially similar words, voice recognition with a relatively high recognition rate that makes use of the effective information in the original acoustic parameters is possible, the amount of computation required for processing can be kept relatively small, and the storage capacity of the standard pattern memory can be relatively small.
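As a rough illustration of why a fixed-length second acoustic parameter sequence keeps the computation small, the sketch below compares an input sequence with each stored standard pattern frame by frame and selects the word with the smallest total distance. The distance used here, a sum of absolute differences, is an assumption of this sketch; the names `sequence_distance` and `recognize` are hypothetical, and the distance formula of the embodiment is not restated in this section.

```python
def sequence_distance(seq_a, seq_b):
    """Sum of absolute differences over all frames and channels (assumed measure)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same number of frames")
    return sum(
        abs(x - y)
        for frame_a, frame_b in zip(seq_a, seq_b)
        for x, y in zip(frame_a, frame_b)
    )


def recognize(input_seq, standard_patterns):
    """Return the word whose standard pattern is closest to input_seq.

    standard_patterns : dict mapping a word label to its stored sequence
    """
    return min(
        standard_patterns,
        key=lambda word: sequence_distance(input_seq, standard_patterns[word]),
    )


if __name__ == "__main__":
    patterns = {
        "A": [(0.0,), (1.0,), (2.0,)],
        "B": [(0.0,), (5.0,), (0.0,)],
    }
    print(recognize([(0.0,), (1.0,), (3.0,)], patterns))
```

Each comparison costs one pass over the frames, with no time-axis search, and each standard pattern occupies only as many frames as the fixed output length.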

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of a voice recognition device that performs voice recognition by DP matching processing. FIG. 2 is a conceptual diagram for explaining the DP matching processing. FIG. 3 is a diagram for explaining a trajectory in the acoustic parameter space. FIGS. 4, 5, and 6 are diagrams showing examples of a one-dimensional input pattern A, standard pattern A′, and standard pattern B′, respectively. FIG. 7 is a diagram for explaining time-axis normalization by DP matching processing between the parameter time series of input pattern A and that of standard pattern A′. FIG. 8 is a diagram for explaining time-axis normalization by DP matching processing between the parameter time series of input pattern A and that of standard pattern B′. FIG. 9 is a block diagram showing an example of a voice recognition device that performs NAT processing. FIGS. 10, 11, 12, and 14 are diagrams for explaining the NAT processing unit. FIG. 13 is a flowchart for explaining the interpolation point extractor. FIGS. 15, 16, and 17 are diagrams showing the one-dimensional acoustic parameter time series of input pattern A, standard pattern A′, and standard pattern B′, respectively, after processing by the NAT processing unit. FIG. 18 is a block diagram showing an embodiment of the voice recognition device of the present invention. FIG. 19 is a flowchart for explaining FIG. 18. FIG. 20 is a diagram for explaining the operation of FIG. 18.

(1) is a microphone serving as the voice signal input unit, (2) is an acoustic analysis unit, (3) is a mode changeover switch, (4) is a standard pattern memory, (6) is a minimum distance determiner, (11_A), (11_B), ..., (11_O) are a 15-channel digital bandpass filter bank, (16) is an in-voice-section parameter memory, (21) and (26) are each NAT processing units, (22) is a trajectory length calculator, (23) is an interpolation interval calculator, (24) is an interpolation point extractor, (25) is a Chebyshev distance calculation unit, and (27) is an interpolation point replacer.

Continuation of the front page (72) Inventor: Masao Watari (渡 雅男), c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo (56) References: Proceedings of the Acoustical Society of Japan, October 1984, 1-9-9, pp. 17-18

Claims

1. A voice recognition device comprising: a voice signal input unit; an acoustic analysis unit supplied with a voice signal from the voice signal input unit; an acoustic parameter processing unit supplied with a first acoustic parameter sequence from the acoustic analysis unit; and a voice determination unit that determines the voice closest to the voice signal based on a second acoustic parameter sequence formed by the acoustic parameter processing unit, characterized in that the acoustic parameter processing unit estimates a trajectory in a parameter space from the first acoustic parameter sequence and, based on the trajectory, forms the second acoustic parameter sequence as a subset of the first acoustic parameter sequence.