JPH0632011B2

JPH0632011B2 - Voice recognizer

Info

Publication number: JPH0632011B2
Application number: JP60047952A
Authority: JP
Inventors: 震一田村; 曜一郎佐古; 篤信平岩; 誠赤羽; 雅男渡
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1985-03-11
Filing date: 1985-03-11
Publication date: 1994-04-27
Anticipated expiration: 2009-04-27
Also published as: JPS61208097A

Description

DETAILED DESCRIPTION OF THE INVENTION

The invention is described in the following order.

A. Industrial Field of Application
B. Summary of the Invention
C. Prior Art
D. Problems to Be Solved by the Invention
E. Means for Solving the Problems (Fig. 1)
F. Operation
G. Embodiment
G1. Description of the acoustic analysis circuit (Fig. 2)
G2. Description of the time normalization processing (Figs. 2 and 3)
G3. Description of the pattern matching processing (Fig. 2)
G4. Description of the digital filter (8) (Figs. 4 to 9)
H. Effect of the Invention

A. Industrial Field of Application
The present invention relates to a device that performs speech recognition by pattern matching between standard patterns of the words to be recognized, created and stored in advance, and the input pattern of the word to be recognized.

B. Summary of the Invention
In a device that performs speech recognition by pattern matching, the invention uses as the matching pattern a new set of recognition parameters. These are obtained by estimating the trajectory that the acoustic parameter time series, produced by acoustic analysis of the speech interval of the input speech signal, traces in its parameter space, and then resampling that trajectory at a predetermined interval. In addition, the acoustic parameter time series signal is passed through a low-pass filter, so that fluctuations in the stationary parts of the input speech have almost no effect on the recognition parameter time series.

C. Prior Art
Speech is a phenomenon that changes along the time axis: distinct words are produced by uttering sounds whose spectrum pattern changes from moment to moment. Speech recognition is the technology for automatically recognizing the words a person utters, but achieving recognition comparable to the human auditory function is at present extremely difficult. For this reason, most speech recognition currently in practical use works, under fixed conditions of use, by pattern matching between standard patterns of the words to be recognized and an input pattern.

Fig. 10 outlines such a speech recognition device. Speech input from the microphone (1) is supplied to the acoustic analysis circuit (2), which extracts acoustic parameters representing the features of the input speech pattern. Various acoustic analysis methods are conceivable. In one known method, a band-pass filter and a rectifier circuit form one channel, a number of such channels with different passbands are arranged in parallel, and the time variation of the spectrum pattern is extracted as the output of this band-pass filter group. In this case the acoustic parameters can be expressed as the time series Pi(n) (i = 1, 2, ..., I, where I is, for example, the number of band-pass filter channels; n = 1, 2, ..., N, where N is the number of frames used for recognition within the interval found by speech interval determination).

The acoustic parameter time series Pi(n) from the acoustic analysis circuit (2) is supplied to a mode switching circuit (3) consisting of, for example, a switch. When the switch of this circuit (3) is set to terminal A, the device is in registration mode, and the acoustic parameter time series Pi(n) is stored in the standard pattern memory (4) as recognition parameters. In other words, the speaker's speech patterns are stored in this memory (4) as standard patterns before recognition takes place. Because of variations in speaking rate and differences in word length, the registered standard patterns generally differ in their number of frames.

When the switch (3) is set to terminal B, the device is in recognition mode. In this mode, the acoustic parameter time series of the current input speech from the acoustic analysis circuit (2) is supplied to the input speech pattern memory (5) and stored temporarily. The distance calculation circuit (6) then computes the degree of difference between this input pattern and each of the standard patterns of the words to be recognized, read from the standard pattern memory (4). The minimum value determination circuit (7) detects the word whose standard pattern differs least from the input pattern, and the input word is thereby recognized.

Input speech is thus recognized by pattern matching between the registered standard patterns and the input pattern. It must be taken into account, however, that even when the same word is uttered in the same way, its spectrum pattern shifts, stretches, and contracts along the time axis. For example, when recognizing the word "hai" ("yes"), if the standard pattern was registered as "hai" and the input speech is drawn out along the time axis as "haai", the distance becomes large, the input is treated as an entirely different word, and correct recognition fails. Pattern matching for speech recognition therefore requires time normalization, which corrects these shifts and distortions along the time axis, and this time normalization is an important step in improving recognition accuracy.

One method of time normalization is a technique called DP (dynamic programming) matching (see, for example, Japanese Patent Laid-Open No. 50-96104).

DP matching does not prepare in advance a large number of standard patterns that allow for time-axis shifts. Instead, it uses a warping function to generate many time-normalized versions of each standard pattern, obtains the distance between each of these and the input pattern, and performs recognition by detecting the one with the minimum distance.
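The idea behind DP matching can be sketched as follows. This is a minimal textbook dynamic-time-warping sketch in Python, not the specific method of Japanese Patent Laid-Open No. 50-96104; the function names `frame_distance` and `dp_matching_distance`, the city-block frame distance, and the three-way step pattern are all assumptions made for illustration.

```python
def frame_distance(a, b):
    """City-block distance between two frames (lists of channel values)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_matching_distance(reference, inp):
    """Accumulated distance along the best time-warping path.

    reference, inp: lists of frames; their frame counts may differ.
    """
    n, m = len(reference), len(inp)
    inf = float("inf")
    # g[i][j] = minimum accumulated distance when the first i reference
    # frames are aligned with the first j input frames.
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(reference[i - 1], inp[j - 1])
            g[i][j] = d + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]
```

Each comparison fills an N-by-M table, and the table has to be filled once per registered word, which is consistent with the computational cost discussed in this section.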

When DP matching is used, however, the number of frames in each registered standard pattern is not fixed, and DP matching must be carried out between the input pattern and every registered standard pattern. The amount of computation therefore increases dramatically as the vocabulary grows.

Furthermore, DP matching is a matching scheme that gives most weight to the stationary parts of speech (the parts where the spectrum pattern does not change with time), so it could cause misrecognition between patterns that are partially similar.

The present applicant has previously proposed a time normalization method that is free of these drawbacks (for example, Japanese Patent Application No. 59-106177).

The acoustic parameter time series Pi(n) traces a sequence of points in its parameter space. For example, if the word to be recognized is "HAI" and there are two band-pass filters for acoustic analysis, so that Pi(n) = (P₁, P₂), the acoustic parameter time series of the input speech traces a point sequence in its two-dimensional parameter space as shown in Fig. 11. As the figure shows, the points are sparsely distributed in the non-stationary parts of the speech and densely distributed in the quasi-stationary parts. If the speech were perfectly stationary, the parameters would not change and the point sequence would stay at a single point in the parameter space. However, even when a person produces the same sound, fluctuations in the voice keep it from being perfectly stationary, and the effect of this fluctuation appears in the quasi-stationary parts as shown in the figure.

It follows that the time-axis shifts caused by variations in speaking rate are due almost entirely to differences in the point density of the quasi-stationary parts, and that the duration of the non-stationary parts has little influence. Therefore, if a trajectory drawn as a continuous curve passing approximately through the whole point sequence of the input parameter time series Pi(n) is estimated, as shown in Fig. 12, this trajectory is almost invariant to variations in speaking rate.

On this basis, the applicant proposed the following time-axis normalization method. First, a trajectory drawn as a continuous curve from the start point Pi(1) to the end point Pi(N) of the input parameter time series Pi(n) is estimated. The trajectory is estimated, for example, by linear approximation of the acoustic parameter time series, as shown in Fig. 13. The length S of the trajectory is obtained from the estimated curve. The trajectory is then resampled along its length at a predetermined interval T, as indicated by the circles in Fig. 13. For example, to resample to M points, the trajectory is resampled using as the unit the length

$$T = \frac{S}{M - 1} \qquad (1)$$

If the parameter time series describing the resampled point sequence is Qi(m) (i = 1, 2, ..., I; m = 1, 2, ..., M), then Qi(m) carries the basic information of the trajectory and is almost invariant to variations in speaking rate. In other words, it is a recognition parameter time series whose time axis has been normalized.
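The resampling step described above can be sketched in Python as follows. This is a minimal sketch, assuming the linear-approximation trajectory of Fig. 13; the function name `nat_resample` and its argument layout (a list of parameter vectors) are hypothetical and do not come from the patent.

```python
import math


def nat_resample(points, m):
    """Resample a piecewise-linear trajectory at M equally spaced points.

    points: list of parameter vectors Pi(n), n = 1..N (N >= 2)
    m:      number of output points M (M >= 2)
    """
    # Length of each straight segment between consecutive points.
    seg = [math.dist(points[k], points[k + 1]) for k in range(len(points) - 1)]
    total = sum(seg)                 # trajectory length S
    step = total / (m - 1)           # equation (1): T = S / (M - 1)

    out = [list(points[0])]
    k = 0                            # index of the current segment
    walked = 0.0                     # trajectory length before segment k
    for j in range(1, m - 1):
        target = j * step
        # Advance to the segment that contains the target arc length.
        while k < len(seg) - 1 and walked + seg[k] < target:
            walked += seg[k]
            k += 1
        frac = (target - walked) / seg[k] if seg[k] > 0 else 0.0
        frac = min(max(frac, 0.0), 1.0)
        a, b = points[k], points[k + 1]
        out.append([x + frac * (y - x) for x, y in zip(a, b)])
    out.append(list(points[-1]))
    return out
```

Two utterances that follow the same path but dwell for different numbers of frames in the quasi-stationary parts give the same output, which is the rate invariance the text claims for Qi(m).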

Accordingly, if this parameter time series Qi(m) is registered as the standard pattern, the input pattern is likewise obtained as a parameter time series Qi(m), the distance between the two patterns is computed from these time series, and recognition is performed by detecting the pattern with the minimum distance, then recognition is always carried out with time-axis shifts normalized away.

With this processing method, the number of frames of the recognition parameter time series Qi(m) is always M, regardless of variations in speaking rate at registration or differences in word length. Moreover, because Qi(m) is time-normalized, good results can be expected even when the distance between the input pattern and a registered standard pattern is computed with the simplest operation, the Chebyshev distance.
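Because every pattern has the same M frames, matching reduces to a frame-by-frame comparison. The sketch below is illustrative only: this section does not define the Chebyshev distance, so the textbook definition (largest absolute coordinate difference) is assumed, and the names `chebyshev_distance`, `recognize`, and the `templates` dictionary are hypothetical.

```python
def chebyshev_distance(p, q):
    """Largest absolute coordinate difference between two patterns.

    p, q: lists of M frames, each a list of I channel values.
    """
    return max(abs(x - y) for fp, fq in zip(p, q) for x, y in zip(fp, fq))


def recognize(input_pattern, templates):
    """Return the label of the registered pattern closest to the input.

    templates: dict mapping a word label to its registered pattern.
    """
    return min(templates,
               key=lambda w: chebyshev_distance(input_pattern, templates[w]))
```

The cost per registered word is proportional to M times I, with no warping table to fill.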

This method is also a time normalization technique that gives greater weight to the non-stationary parts of speech, so misrecognition between partially similar patterns, as occurs with DP matching, is reduced.

Furthermore, the normalized parameter time series Qi(m) contains no information about variations in speaking rate. This makes it easier to handle global features of the parameter transition structure laid out in the parameter space, and allows various methods that are also effective for speaker-independent recognition to be applied.

Hereinafter, the time normalization processing described above is called NAT (Normalization Along Trajectory) processing.

D. Problems to Be Solved by the Invention
As described above, even when the input speech is a monotone, the acoustic parameters Pi(n) from the acoustic analysis circuit do not settle to a steady state but fluctuate as in Fig. 11. Trajectory estimation in NAT processing is therefore affected by this fluctuation. For example, with linear interpolation as in Fig. 13, the magnitude of the fluctuation contributes directly to the trajectory length. As a result, errors due to the fluctuation arise in the normalized recognition parameter time series Qi(m), which lowers the recognition rate.

E. Means for Solving the Problems
Fig. 1 shows an example of the basic configuration of a speech recognition device according to the invention. Parts corresponding to those in Fig. 10 carry the same reference numerals.

In this example, the acoustic analysis circuit (2) uses a band-pass filter group. The speech signal from the microphone (1) is supplied to the A/D converter (21) and converted into a digital signal, and this digital signal is supplied to the digital band-pass filter group (22), where it is converted into signals of a plurality of frequency components. The output of the band-pass filter group (22) is supplied to the feature extraction circuit (23). The digital speech signal from the A/D converter (21) is also supplied to the speech interval determination circuit (25), which determines the interval during which speech was input to the microphone (1) and supplies its determination output to the feature extraction circuit (23).

Within the determined speech interval, the feature extraction circuit (23) forms the acoustic parameter time series Pi(n) from the output of the band-pass filter group (22), and this becomes the output of the acoustic analysis circuit (2).

The acoustic parameter time series Pi(n) within the determined speech interval is supplied to a digital filter (8) having a low-pass characteristic. In principle, it is sufficient for this digital filter (8) to pass up to 0.3 times the output band of the band-pass filter group of the acoustic analysis circuit (2). The output of the digital filter (8) is an acoustic parameter time series Pi(n)′ in which the fluctuation of the stationary parts is suppressed, and this is supplied to the NAT processing circuit (9).
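The effect of low-pass filtering on the trajectory length can be illustrated with a small Python sketch. It is only an illustration: a 5-tap moving average stands in for the digital filter (8), whose actual design belongs to section G4, and the names `moving_average`, `trajectory_length`, and the synthetic `jittery` data are assumptions.

```python
import math


def moving_average(frames, taps=5):
    """Smooth each channel with a symmetric FIR moving average.

    Near the ends the window is shortened so the output keeps N frames.
    """
    half = taps // 2
    out = []
    for n in range(len(frames)):
        window = frames[max(0, n - half): n + half + 1]
        out.append([sum(ch) / len(window) for ch in zip(*window)])
    return out


def trajectory_length(frames):
    """Length of the piecewise-linear trajectory through the frames."""
    return sum(math.dist(a, b) for a, b in zip(frames, frames[1:]))


# A "stationary" two-channel sound whose parameters jitter by +/-0.1.
jittery = [[1.0 + 0.1 * (-1) ** n, 2.0 - 0.1 * (-1) ** n] for n in range(20)]
smoothed = moving_average(jittery)
```

Comparing `trajectory_length(jittery)` with `trajectory_length(smoothed)` shows how much of the measured length of a nominally stationary sound comes from fluctuation alone.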

In the NAT processing circuit (9), the trajectory in the acoustic parameter space is estimated from the acoustic parameter time series Pi(n)′ as described above, and a new recognition parameter time series Qi(m) is formed from this trajectory.

This parameter time series Qi(m) passes through the mode switching circuit (3). In registration mode it is stored in the standard pattern memory (4). In recognition mode it is supplied to the distance calculation circuit (6), which computes its distance from each of the registered standard patterns read from the standard pattern memory (4). The minimum value determination circuit (7) determines the standard pattern giving the smallest result, and its determination output is the recognition output.

In practice, NAT processing is performed by a microcomputer. When the trajectory is estimated from the acoustic parameter time series Pi(n)′ within the determined speech interval, the start of the trajectory need not be the first point Pi(1)′ of the time series; the trajectory may instead always be estimated with silence as its start point, as in the example in the figure. The end point Pi(N)′ can be treated in the same way.

F. Operation
The acoustic parameter time series Pi(n) from the acoustic analysis circuit (2) passes through the digital filter (8) with its low-pass characteristic, which reduces the fluctuation of the stationary parts. Because the filtered series is what is supplied to the NAT processing circuit (9) for trajectory estimation, the influence of the fluctuation on the trajectory is suppressed, and the resulting recognition parameter time series Qi(m) can be expected to improve the speech recognition rate.

G. Embodiment
Fig. 2 shows an embodiment of the speech recognition device according to the invention, in which a 16-channel band-pass filter group is used for acoustic analysis.

G1. Description of the Acoustic Analysis Circuit (2)
In the acoustic analysis circuit (2), the speech signal from the microphone (1) is supplied through an amplifier (211) and a band-limiting low-pass filter (212) to an A/D converter (213), where it is converted into a 12-bit digital speech signal at a sampling frequency of, for example, 12.5 kHz. This digital speech signal is supplied to the digital band-pass filters (221₁), (221₂), ..., (221₁₆) of the individual channels of the 16-channel band-pass filter bank (22). The digital band-pass filters (221₁), (221₂), ..., (221₁₆) are, for example, fourth-order Butterworth digital filters, and their passbands are the bands obtained by dividing the range from 250 Hz to 5.5 kHz at equal intervals on a logarithmic axis. The output signals of the digital band-pass filters (221₁), (221₂), ..., (221₁₆) are supplied to rectifier circuits (222₁), (222₂), ..., (222₁₆), respectively, and the outputs of these rectifier circuits are supplied to digital low-pass filters (223₁), (223₂), ..., (223₁₆), respectively. The digital low-pass filters (223₁), (223₂), ..., (223₁₆) are, for example, FIR low-pass filters with a cutoff frequency of 52.8 Hz.
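The division of 250 Hz to 5.5 kHz into 16 bands of equal width on a logarithmic axis can be sketched as below. This is one reading of the text, assuming contiguous bands whose edges form a geometric progression; the function name `log_spaced_band_edges` and its defaults are hypothetical, and the filter design itself (fourth-order Butterworth) is not attempted here.

```python
def log_spaced_band_edges(f_low=250.0, f_high=5500.0, channels=16):
    """Return channels + 1 edge frequencies equally spaced on a log axis."""
    ratio = (f_high / f_low) ** (1.0 / channels)
    return [f_low * ratio ** k for k in range(channels + 1)]


edges = log_spaced_band_edges()
bands = list(zip(edges[:-1], edges[1:]))   # (lower, upper) edge per channel
```

Under this assumption each band's upper edge is roughly 1.21 times its lower edge, so the bands widen with frequency.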

The output signals of the digital low-pass filters (223₁), (223₂), ..., (223₁₆), which form the output of the acoustic analysis circuit (2), are supplied to a sampler (231) in the feature extraction circuit (23). The sampler (231) samples these output signals once every frame period of 5.12 msec. It therefore yields the sample time series Ai(n) (i = 1, 2, ..., 16; n is the frame number, n = 1, 2, ..., N).

The output of the sampler (231), that is, the sample time series Ai(n), is supplied to the sound source information normalization circuit (232), which removes speaker-dependent differences in glottal source characteristics from the speech to be recognized.

That is, the sample time series Ai(n) supplied from the sampler (231) every frame period undergoes the logarithmic transformation

$$A'_i(n) = \log\bigl(A_i(n) + B\bigr) \qquad (2)$$

In equation (2), B is a bias, set to a value large enough to mask the noise level.

The glottal source characteristic is then approximated by the expression yi = a·i + b, and the coefficients a and b are determined for each frame as a(n) and b(n) by equations (3) and (4). (Equations (3) and (4) are images in the original publication and are not reproduced in this text.)

Letting Pi(n) be the source-normalized parameter, when a(n) < 0 the parameter Pi(n) is expressed as

$$P_i(n) = A'_i(n) - \{a(n) \cdot i + b(n)\} \qquad (5)$$

When a(n) ≥ 0, only level normalization is performed, and the parameter Pi(n) is given by equation (6). (Equation (6) is likewise an image in the original publication and is not reproduced in this text.)

The sound source information normalization circuit (232) thus outputs an acoustic parameter time series Pi(n) from which differences in glottal source characteristics have been normalized away.
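The log transformation and the removal of the fitted spectral tilt for the case a(n) < 0 can be sketched as follows. Several things here are assumptions rather than statements of the patent: the coefficients a and b are obtained by an ordinary least-squares line fit over the channel index, because the patent's own equations for them are not available in this text; the natural logarithm is used; and the names `log_spectrum`, `fit_tilt`, and `remove_tilt` are hypothetical. The a(n) ≥ 0 branch is left out because its equation is also unavailable.

```python
import math


def log_spectrum(frame, bias):
    """Equation (2): log-compress each channel value after adding bias B."""
    return [math.log(a + bias) for a in frame]


def fit_tilt(log_frame):
    """Least-squares fit of y_i = a*i + b over channels i = 1..I (assumed)."""
    n = len(log_frame)
    idx = range(1, n + 1)
    mean_i = sum(idx) / n
    mean_y = sum(log_frame) / n
    sxy = sum((i - mean_i) * (y - mean_y) for i, y in zip(idx, log_frame))
    sxx = sum((i - mean_i) ** 2 for i in idx)
    a = sxy / sxx
    return a, mean_y - a * mean_i


def remove_tilt(log_frame, a, b):
    """Equation (5): subtract the fitted line a*i + b from each channel."""
    return [y - (a * i + b) for i, y in enumerate(log_frame, start=1)]
```

A frame whose log spectrum is an exact falling line is reduced to zeros by `remove_tilt`, which is the sense in which the source tilt is normalized away.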

This acoustic parameter time series Pi(n) is supplied to the digital filter (8). As described later, the digital filter (8) is an interpolation filter with a low-pass characteristic. Its low-pass characteristic removes the fluctuation in the stationary parts of the trajectory estimated by the NAT processing circuit (9) in the following stage, and it also interpolates the acoustic parameters Pi(n) so that the trajectory can be estimated more accurately. Interpolation is carried out by inserting (P - 1) zero-valued data between successive data of the acoustic parameters Pi(n) and then applying FIR filtering. This yields acoustic parameters Pi(n)′ with P times as many data.
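Zero insertion followed by FIR filtering can be sketched for a single parameter channel as below. The triangular kernel, which reduces to linear interpolation, is chosen only because it is the simplest low-pass FIR for this purpose; it is an assumption, not the filter of the patent, and the names `zero_stuff`, `triangular_fir`, and `interpolate` are hypothetical.

```python
def zero_stuff(seq, p):
    """Insert (P - 1) zeros after each sample of one parameter channel."""
    out = []
    for x in seq:
        out.append(x)
        out.extend([0.0] * (p - 1))
    return out


def triangular_fir(p):
    """Length 2P - 1 low-pass kernel that yields linear interpolation."""
    return [1.0 - abs(k) / p for k in range(-(p - 1), p)]


def interpolate(seq, p):
    """Raise the sample count P-fold: zero insertion, then FIR filtering."""
    stuffed = zero_stuff(seq, p)
    h = triangular_fir(p)
    half = len(h) // 2
    out = []
    for n in range(len(stuffed)):
        acc = 0.0
        for k, coeff in enumerate(h):
            idx = n + k - half
            if 0 <= idx < len(stuffed):
                acc += coeff * stuffed[idx]
        out.append(acc)
    return out
```

With this kernel the original samples are preserved and the inserted positions take intermediate values; the last P - 1 outputs taper toward zero because no following sample exists.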

The acoustic parameters Pi(n)′, whose number of data samples has been increased P-fold and whose stationary-part fluctuation has been removed by the low-pass characteristic, are supplied to the in-interval parameter memory (200). On receiving the speech interval determination signal from the speech interval determination circuit (24), the parameter memory (200) stores the parameters Pi(n)′ for each determined speech interval.

The speech interval determination circuit (24) consists of a zero-cross counter (241), a power calculation circuit (242), and a speech interval decision circuit (243). The digital speech signal from the A/D converter (213) is supplied to the zero-cross counter (241) and the power calculation circuit (242). Every frame period of 5.12 msec, the zero-cross counter (241) counts the number of zero crossings of the 64 samples of the digital speech signal within that frame period, and the count is supplied to the first input of the speech interval decision circuit (243). Every frame period, the power calculation circuit (242) obtains the power of the digital speech signal within that frame period, that is, the sum of squares, and its output power signal is supplied to the second input of the speech interval decision circuit (243). The third input of the speech interval decision circuit (243) receives the source normalization information from the sound source information normalization circuit (232). The speech interval decision circuit (243) processes the zero-cross count, the in-frame power, and the source normalization information together, discriminates between silence, unvoiced sound, and voiced sound, and decides the speech interval.
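The two per-frame measurements described above can be sketched as follows, using 64-sample frames (12.5 kHz times 5.12 msec). Only the measurements are shown; the text does not specify how the decision circuit combines them with the source normalization information, so no decision rule is attempted. The function names and the convention of treating a sample value of zero as non-negative are assumptions.

```python
def zero_cross_count(frame):
    """Number of sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def frame_power(frame):
    """Power of one frame, taken as the sum of squared sample values."""
    return sum(x * x for x in frame)


def frame_features(samples, frame_len=64):
    """Split a signal into full frames and return (zero crossings, power)."""
    frames = [samples[k:k + frame_len]
              for k in range(0, len(samples) - frame_len + 1, frame_len)]
    return [(zero_cross_count(f), frame_power(f)) for f in frames]
```

Broadly, a high zero-cross count with low power is typical of unvoiced sounds and a low count with high power of voiced sounds, which is why the two quantities are useful together.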

The speech interval determination signal from the speech interval decision circuit (243), which indicates the determined speech interval, is supplied to the in-interval parameter memory (200) as the output of the speech interval determination circuit (24).

The acoustic parameter time series Pi(n)′ stored in the memory (200) for the determined speech interval is then supplied to the NAT processing circuit (9).

G2. Description of the Time Normalization Processing
The NAT processing circuit (9) consists of a trajectory length calculation circuit (91), an interpolation interval calculation circuit (92), and an interpolation point extraction circuit (93).

The parameter time series Pi(n)′ (i = 1, 2, ..., 16; n = 1, 2, ..., N) from the parameter memory (200) is supplied to the trajectory length calculation circuit (91). This circuit calculates the length of the linearly approximated trajectory that the acoustic parameter time series Pi(n)′ traces in its parameter space, as shown in Fig. 13.

The Euclidean distance D(a_i, b_i) between two I-dimensional vectors a_i and b_i is

$$D(a_i, b_i) = \sqrt{\sum_{i=1}^{I} (a_i - b_i)^2} \qquad (7)$$

Accordingly, when the trajectory is estimated by linear approximation from the I-dimensional acoustic parameter time series Pi(n)′, the distance S(n) between parameters adjacent in the time-series direction is

$$S(n) = D\bigl(P_i(n+1)',\, P_i(n)'\bigr) \quad (n = 1, \ldots, N-1) \qquad (8)$$

The distance SL(n) from the first parameter Pi(1)′ to the n-th parameter Pi(n)′ in the time-series direction is

$$SL(n) = \sum_{n'=1}^{n-1} S(n') \quad (n = 2, \ldots, N) \qquad (9)$$

with SL(1) = 0.

The total trajectory length SL is

$$SL = \sum_{n=1}^{N-1} S(n) \qquad (10)$$

The trajectory length calculation circuit (91) performs the signal processing expressed by equations (8), (9), and (10).

A signal indicating the trajectory length SL obtained by the trajectory length calculation circuit (91) is supplied to the interpolation interval calculation circuit (92), which calculates the resampling interval T used when resampling along the trajectory.

If the trajectory is to be resampled to M points, the resampling interval T is obtained as

$$T = \frac{SL}{M - 1} \qquad (11)$$
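The quantities computed by circuits (91) and (92) can be sketched directly in Python: the distance between adjacent frames, the cumulative length SL(n), and the interval T = SL / (M - 1). The function names are hypothetical, and `math.dist` is used for the Euclidean distance between two frames.

```python
import math


def segment_lengths(frames):
    """S(n): Euclidean distance between each pair of adjacent frames."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def cumulative_lengths(frames):
    """SL(n): trajectory length from the first frame up to frame n."""
    sl = [0.0]                       # SL(1) = 0
    for s in segment_lengths(frames):
        sl.append(sl[-1] + s)
    return sl


def resampling_interval(frames, m):
    """T = SL / (M - 1), where SL is the total trajectory length."""
    return cumulative_lengths(frames)[-1] / (m - 1)
```

For the three two-channel frames (0, 0), (3, 4), (3, 10), the segment lengths are 5 and 6, the total length SL is 11, and resampling to M = 12 points gives T = 1.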

A signal indicating the resampling interval T from the interpolation interval calculation circuit (92) is supplied to the interpolation point extraction circuit (93), which also receives the acoustic parameter time series Pi(n)′ from the parameter memory (200). The interpolation point extraction circuit (93) resamples at the interval T along the trajectory of the acoustic parameter time series Pi(n)′ in its parameter space, for example the trajectory obtained by linear approximation between parameters, as indicated by the circles in Fig. 13. It forms the recognition parameter time series Qi(m) from the new point sequence obtained by this resampling.

The interpolation point extraction circuit (93) forms the recognition parameter time series Qi(m) by processing according to the flowchart shown in FIG. 3.

First, in step [101], the variable J, which indicates the number of the resampling point in the time series direction, is set to 1, and the variable IC, which indicates the frame number of the acoustic parameter time series Pi(n)′, is also set to 1. This initializes the process. Next, in step [102], the variable J is incremented, and in step [103] it is determined whether J is equal to or less than (M − 1), that is, whether the number of the current resampling point has reached the last number that needs to be resampled. If it has, the process proceeds to step [104] and resampling ends.

If it is not the last number, in step [105] the resampling distance DL from the first resampling point (which is always a silent portion) to the J-th resampling point is calculated. Next, in step [106], the variable IC is incremented. Then, in step [107], the resampling distance DL is compared with the distance SL(IC) from the first parameter Pi(1)′ of the acoustic parameter time series Pi(n)′ to the IC-th parameter Pi(IC)′. If DL is smaller than SL(IC), the current resampling point lies on the start-point side of the parameter Pi(IC)′ on the trajectory. If it does not lie on the start-point side, the process returns to step [106], IC is incremented, and the positions of the resampling point and the parameter Pi(IC)′ on the trajectory are compared again in step [107]. When the resampling point is judged to lie on the start-point side of Pi(IC)′, the process proceeds to step [108] and the recognition parameter Qi(J) is formed.

Specifically, the distance SL(IC−1) to the (IC−1)-th parameter Pi(IC−1)′, which lies on the start-point side of the J-th resampling point, is subtracted from the resampling distance DL of the J-th resampling point to obtain the distance SS from Pi(IC−1)′ to the J-th resampling point. Next, SS is divided by the distance S(IC−1) between the parameters Pi(IC−1)′ and Pi(IC)′ located on either side of the J-th resampling point on the trajectory (this distance is obtained by the signal processing of equation (8)). The quotient SS/S(IC−1) is multiplied by the difference (Pi(IC)′ − Pi(IC−1)′) between these two parameters, which gives the interpolation amount of the J-th resampling point measured from Pi(IC−1)′, the parameter adjacent to it on the start-point side. Adding this interpolation amount to Pi(IC−1)′ forms the new recognition parameter Qi(J) along the trajectory.
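The interpolation step described above can be condensed into a single expression. This is a restatement of the prose rather than an equation taken from the source; it writes S(IC−1) for the length of the segment between Pi(IC−1)′ and Pi(IC)′, and it assumes that the resampling distance is DL = (J − 1)·T, which follows from equal spacing along the trajectory but is not stated explicitly in this passage.

$$SS = DL - SL(IC-1), \qquad Q_i(J) = P_i(IC-1)' + \frac{SS}{S(IC-1)}\,\bigl(P_i(IC)' - P_i(IC-1)'\bigr)$$

When SS = 0 the resampling point coincides with Pi(IC−1)′, and as SS approaches S(IC−1) it approaches Pi(IC)′, so Qi(J) always lies on the straight segment joining the two neighbouring parameters.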

In this way, the recognition parameter time series Qi(m) is formed by resampling the (M − 2) points other than the start point and the end point (when each of these is silent, they are given by [expression missing in source text]).
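The resampling procedure as a whole can be sketched as follows. This is a minimal illustration, not the circuit of the embodiment: the name `nat_resample` is hypothetical, indices are 0-based, the segment search advances only when needed instead of following the exact step order of the flowchart, and the start and end points are simply copied from the first and last frames, which is an assumption because the source expression for the end points is not available.

```python
import math

def nat_resample(frames, M):
    """Resample a piecewise-linear trajectory at M points equally spaced in arc length."""
    if M < 2 or len(frames) < 2:
        raise ValueError("need at least two frames and M >= 2")
    # Segment lengths S and cumulative lengths SL along the trajectory.
    S = [math.sqrt(sum((b - a) ** 2 for a, b in zip(frames[k], frames[k + 1])))
         for k in range(len(frames) - 1)]
    SL = [0.0]
    for s in S:
        SL.append(SL[-1] + s)
    T = SL[-1] / (M - 1)              # resampling interval, T = SL / (M - 1)
    out = [tuple(frames[0])]          # assumption: start point copied as-is
    ic = 1                            # index of the frame just past the current point
    for j in range(1, M - 1):         # the (M - 2) interior resampling points
        DL = j * T                    # distance from the start point
        while ic < len(frames) - 1 and SL[ic] <= DL:
            ic += 1                   # move on until frame ic lies beyond DL
        SS = DL - SL[ic - 1]          # distance past frame ic - 1
        ratio = SS / S[ic - 1] if S[ic - 1] > 0 else 0.0
        a, b = frames[ic - 1], frames[ic]
        out.append(tuple(pa + ratio * (pb - pa) for pa, pb in zip(a, b)))
    out.append(tuple(frames[-1]))     # assumption: end point copied as-is
    return out

# Hypothetical L-shaped trajectory of total length 8, resampled at 5 points.
resampled = nat_resample([(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)], 5)
print(resampled)
```

Because the output points are equally spaced along the trajectory rather than in time, a slowly spoken and a quickly spoken utterance of the same word that trace the same trajectory would yield the same M points, which is the time normalization the text describes.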

G3 Description of the pattern matching process

In the registration mode, the recognition parameter time series Qi(m) from the NAT processing circuit (9) is routed by the mode changeover switch (3) to the standard pattern memory (4), where it is stored for each recognition target word. In the recognition mode, it is supplied to the distance calculation circuit (6), which calculates its distance from the parameter time series of each standard pattern read from the standard pattern memory (4). This distance is calculated, for example, as a simple Chebyshev distance. The calculated distances between the input pattern and each standard pattern are supplied from the distance calculation circuit (6) to the minimum value determination circuit (7), which determines the standard pattern with the smallest distance. The recognition result for the input speech based on this determination is obtained at the output terminal (70).
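The matching stage amounts to a nearest-template search, which can be sketched as below. All names here (`abs_diff_distance`, `recognize`, the example words and values) are hypothetical. The distance used is a sum of absolute differences, chosen only as a simple stand-in for the distance the text calls a Chebyshev distance, whose exact definition is not given in this passage; any other function of two patterns, such as the textbook Chebyshev distance (largest absolute difference), can be passed in instead.

```python
def abs_diff_distance(q, r):
    """Sum of absolute element-wise differences between two patterns.

    Each pattern is a list of M frames, each frame a tuple of I values.
    This is an assumed stand-in for the distance used in the text.
    """
    return sum(abs(x - y) for fq, fr in zip(q, r) for x, y in zip(fq, fr))

def recognize(input_pattern, templates, distance=abs_diff_distance):
    """Return (word, distance) for the stored template closest to the input."""
    best_word, best_d = None, float("inf")
    for word, ref in templates.items():
        d = distance(input_pattern, ref)
        if d < best_d:               # keep the minimum, like circuit (7)
            best_word, best_d = word, d
    return best_word, best_d

# Hypothetical templates with M = 2 points and I = 2 channels.
templates = {
    "yes": [(0.0, 0.0), (1.0, 1.0)],
    "no": [(5.0, 5.0), (6.0, 6.0)],
}
word, d = recognize([(0.1, 0.0), (1.0, 1.2)], templates)
print(word, d)
```

Since every pattern has already been normalized to the same number of points M, the comparison is a fixed-length element-wise operation and needs no time alignment at this stage.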

G4 Description of the interpolation filter (8)

FIG. 4 shows an example of the configuration of the interpolation filter (8). For each of the channel parameters P1(n), P2(n), ..., P16(n), there are provided zero data insertion circuits (81_1), (81_2), ..., (81_16), which insert zero-valued data between samples, and FIR filters (82_1), (82_2), ..., (82_16).

In the zero data insertion circuits (81_1) to (81_16), P − 1 zero-valued samples, for example three, are inserted between adjacent parameters Pi(k) and Pi(k+1), as indicated by the open circles in FIG. 5A.

Therefore, the zero data insertion circuit (81_i) outputs the parameter time series

Pi(n)′ = [Pi(0), 0, 0, 0, Pi(1), 0, 0, 0, Pi(2), ...]

That is, the parameter time series Pi(n)′ contains P = 4 times as many samples as the parameter time series Pi(n).
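Zero insertion is simple enough to show directly. The sketch below uses a hypothetical helper name, `insert_zeros`, and follows the sample layout given in the text, with P − 1 zeros after every input sample so that the output has exactly P times as many samples.

```python
def insert_zeros(series, P):
    """Insert P - 1 zero-valued samples after every sample of one channel."""
    out = []
    for x in series:
        out.append(x)
        out.extend([0.0] * (P - 1))
    return out

# Hypothetical three-sample channel, upsampled by P = 4 as in the embodiment.
stuffed = insert_zeros([1.0, 2.0, 3.0], 4)
print(stuffed)
```

On its own this step only changes the sampling rate; the inserted zeros take on interpolated values once the sequence has passed through the FIR filter that follows.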

This new parameter time series Pi(n)′ is supplied to the corresponding one of the FIR filters (82_1) to (82_16).

Each of the FIR filters (82_1) to (82_16) is configured, for example, as shown in FIG. 6.

In FIG. 6, (820) is an input terminal, and (821_1), (821_2), ..., (821_J) are delay elements, each providing a delay of one unit time. In this example the unit time is 1/4 (= 1/P) of the frame period. (822_0), (822_1), (822_2), ..., (822_J) are multipliers, which multiply the data at the input terminal (820) and the data at the outputs of the delay elements (821_1), (821_2), ..., (821_J), respectively, by the filter coefficients αj (j = 0, 1, 2, ..., J). The outputs of the multipliers (822_0) to (822_J) are supplied to the adder circuit (823), and the sum is obtained at the output terminal (824).

As a result of this FIR filtering, the output terminal (824) delivers an output in which three sample values derived from the input parameters Pi(n) have been interpolated between each pair of parameter data Pi(k) and Pi(k+1), as indicated by the filled circles in FIG. 5B. That is, parameter time series P1(n)′, P2(n)′, ..., P16(n)′ with four times as many parameters are obtained. These are stored in the parameter memory (200), and the NAT processing is performed on them.
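The tapped-delay-line structure of FIG. 6 corresponds to the direct-form sum y(n) = Σ αj·x(n − j). The following sketch illustrates it with a hypothetical function name, `fir_filter`, and with a short triangular coefficient set chosen only because its output is easy to check by hand. It is not the filter of the embodiment, which uses a much longer low-pass design.

```python
def fir_filter(x, alpha):
    """Direct-form FIR filter: y[n] = sum over j of alpha[j] * x[n - j].

    Samples before the start of x are treated as zero, which corresponds
    to delay elements that initially hold zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for j, a in enumerate(alpha):
            if n - j >= 0:
                acc += a * x[n - j]   # multiplier output fed to the adder
        y.append(acc)
    return y

# Assumed illustrative coefficients: a symmetric triangle for P = 4,
# which performs linear interpolation between the original samples.
alpha = [0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25]
stuffed = [4.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0]  # samples 4 and 8, zero-inserted
y = fir_filter(stuffed, alpha)
print(y)
```

In this example the original samples 4 and 8 appear at output positions 3 and 7 (a delay of three unit times), and the three positions between them hold the interpolated values 5, 6 and 7.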

By suitably selecting the multiplication coefficients αj of the multipliers (822_0) to (822_J), each of the FIR filters (82_1) to (82_16) is given a low-pass filter characteristic. In addition, the coefficients are chosen so that α_0 = α_J, α_1 = α_{J−1}, α_2 = α_{J−2}, and so on, which makes the phase characteristic linear and the group delay constant at all frequencies.
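The link between symmetric coefficients and linear phase can be made explicit with a short derivation. This is the standard textbook argument, not text from the source; ω denotes the normalized angular frequency and the imaginary unit is written as a roman i to keep it distinct from the channel index i. With α_j = α_{J−j}, the frequency response is

$$H(\omega) = \sum_{j=0}^{J} \alpha_j\, e^{-\mathrm{i}\omega j} = e^{-\mathrm{i}\omega J/2} \sum_{j=0}^{J} \alpha_j \cos\!\Bigl(\omega\bigl(j - \tfrac{J}{2}\bigr)\Bigr)$$

because the sine terms for j and J − j cancel in pairs. The remaining sum is real, so the phase is −ωJ/2 (apart from sign changes of the real factor) and the group delay is

$$-\frac{d}{d\omega}\Bigl(-\frac{\omega J}{2}\Bigr) = \frac{J}{2}$$

unit delays, the same for every frequency. All channels are therefore delayed by the same amount and the shape of the trajectory is not distorted by frequency-dependent delay.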

The low-pass filter characteristic in this case is as shown in FIG. 7, and the cutoff frequency is set to π/2P × β (0 < β ≦ 1). When β = 1 there is no low-pass filtering effect, and the low-pass characteristic becomes stronger as β approaches 0. In other words, the passband of the low-pass filter is [0, Ω/2 × β], where Ω is the sampling frequency of the original parameter time series Pi(n).
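One common way to obtain symmetric low-pass coefficients of this kind is a windowed-sinc design, sketched below. The source does not say how its coefficients were designed, so the design method, the Hamming window, the gain normalization and the name `lowpass_taps` are all assumptions. The cutoff is derived from the stated passband [0, Ω/2 × β]: relative to the upsampled rate P·Ω this is β/(2P) cycles per sample.

```python
import math

def lowpass_taps(num_taps, P, beta):
    """Hamming-windowed sinc low-pass coefficients (symmetric, linear phase).

    The passband edge is beta times the Nyquist frequency of the original
    series, i.e. fc = beta / (2 * P) cycles per sample at the upsampled rate.
    """
    if num_taps < 2 or not 0.0 < beta <= 1.0:
        raise ValueError("need num_taps >= 2 and 0 < beta <= 1")
    fc = beta / (2.0 * P)
    J = num_taps - 1
    taps = []
    for j in range(num_taps):
        t = j - J / 2.0                       # offset from the filter centre
        if t == 0:
            ideal = 2.0 * fc                  # limit of the sinc at its centre
        else:
            ideal = math.sin(2.0 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * j / J)
        taps.append(ideal * window)
    gain = sum(taps)
    # Scale to a DC gain of P so that amplitude is preserved after
    # P - 1 zeros have been inserted between samples.
    return [P * h / gain for h in taps]

# Parameter values quoted in the text: P = 4, beta = 0.4; 129 taps assumed.
taps = lowpass_taps(129, 4, 0.4)
print(len(taps), sum(taps))
```

Whether "order 129" in the text means 129 coefficients or 130 is not clear from this passage, so the tap count used here is itself an assumption.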

As described above, the interpolation filter (8) processes the parameter time series Pi(n) so that the number of data points is increased and, because the series is passed through a low-pass filter, the fluctuation in the stationary parts is removed.

An improvement in the speech recognition rate can therefore be expected.

For example, FIG. 8 shows the difference in recognition rate between the case where I = 16, P = 4, the low-pass filter order is 129 and β = 0.4 (solid line in the figure) and the case where β = 1, that is, with the low-pass filter bypassed (broken line in the figure). FIGS. 9A and 9B show the acoustic parameter time series Pi(n) of the first channel without and with this low-pass filter, respectively.

As is clear from FIG. 8, with 9 registered speakers and 3 utterances (registrations) each, the highest recognition rate was 97.45%, compared with 96.24% when only NAT processing was performed, an improvement of 1.21 percentage points.

This improvement is due not only to the low-pass filter characteristic removing the effect of fluctuation in the trajectory, but also to the fact that interpolation increases the number of input data points to the NAT processing circuit (9), which reduces the interpolation error in the NAT processing.

H Effect of the invention

As described above, according to the present invention, a low-pass filter is provided before the NAT processing to cut the high-frequency components of the acoustic parameter output from the acoustic analysis circuit, so that fluctuation in the stationary parts of the acoustic parameter output is removed. This reduces the error in the trajectory estimation in the NAT processing circuit, which in turn reduces the error in the resulting recognition parameter time series and improves the speech recognition rate.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the apparatus of the present invention. FIG. 2 is a block diagram of a specific embodiment of the apparatus. FIG. 3 is a flowchart for explaining the operation of its main part. FIGS. 4 and 6 are block diagrams showing an example of the configuration of its main circuits. FIG. 5 and FIGS. 7 to 9 are diagrams for explaining them. FIG. 10 is a block diagram showing the basic configuration of a speech recognition apparatus. FIGS. 11 to 13 are diagrams for explaining the NAT processing.

(2) is an acoustic analysis circuit, (4) is a standard pattern memory, (6) is a circuit for calculating the distance between a standard pattern and an input pattern, (7) is a minimum value determination circuit, (8) is a digital filter, and (9) is a NAT processing circuit.

Continuation of front page

(72) Inventor: Makoto Akabane, c/o Sony Corporation, 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo
(72) Inventor: Masao Watari, c/o Sony Corporation, 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo
(56) References: Proceedings of the Acoustical Society of Japan, October 1984 (Showa 59), 1-9-9, pp. 17-18

Claims

[Claims]

1. A speech recognition apparatus comprising: acoustic analysis means for obtaining an acoustic parameter time series of an input speech signal; time normalization means for estimating the trajectory drawn in a parameter space by the acoustic parameter time series from the acoustic analysis means and resampling along the trajectory to obtain a time-normalized recognition parameter time series; a standard pattern memory in which recognition parameter time series of standard patterns of recognition target words are stored; distance calculation means for calculating the difference between the recognition parameter time series of an input pattern from the time normalization means and the recognition parameter time series of a standard pattern from the standard pattern memory; and minimum value determination means for detecting the minimum of the values calculated by the distance calculation means to obtain a recognition output; wherein the acoustic parameter time series from the acoustic analysis means is supplied to the time normalization means through a low-pass filter, so that the influence of fluctuation in the quasi-stationary parts of the trajectory is removed.