JPH0632009B2

JPH0632009B2 - Voice recognizer

Info

Publication number: JPH0632009B2
Application number: JP60017133A
Authority: JP
Inventors: 曜一郎佐古; 雅男渡; 篤信平岩; 誠赤羽
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1985-01-31
Filing date: 1985-01-31
Publication date: 1994-04-27
Anticipated expiration: 2009-04-27
Also published as: JPS61176996A

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention will be described in the following order.

A. Field of Industrial Application
B. Summary of the Invention
C. Prior Art
D. Problems to Be Solved by the Invention
E. Means for Solving the Problems
F. Operation
G. Embodiment
G1. Description of the acoustic analysis circuit (FIG. 1)
G2. Description of the time normalization process (FIGS. 1, 2, and 3)
G3. Description of the pattern matching process (FIG. 1)
H. Effect of the Invention

A. Field of Industrial Application

This invention relates to a speech recognition device that recognizes speech by pattern matching between standard patterns of the recognition target words, prepared and stored in advance, and the input pattern of the word to be recognized.

B. Summary of the Invention

This invention concerns a device that performs speech recognition by pattern matching. As the pattern to be matched, it uses a new recognition parameter obtained by estimating the trajectory that the acoustic pattern time series, obtained by acoustic analysis within the speech interval of the input speech signal, traces in its parameter space, and by resampling that trajectory at a predetermined interval. The resampling interval used to obtain this recognition parameter is changed according to the estimated trajectory length, so that the number of sample points varies with the trajectory length. Sufficient information reflecting the changes of the trajectory is thereby obtained, and recognition accuracy is improved.

C. Prior Art

Speech is a phenomenon that changes along the time axis; distinct words are produced by uttering sounds so that the spectrum pattern changes from moment to moment. Speech recognition is the technology that automatically recognizes the words a person utters, but realizing recognition comparable to the human auditory function is extremely difficult at present. For this reason, most speech recognition in practical use today works, under fixed conditions of use, by pattern matching between standard patterns of the recognition target words and the input pattern.

FIG. 4 outlines such a speech recognition device. Speech input from a microphone (1) is supplied to an acoustic analysis circuit (2), which extracts acoustic parameters representing the features of the input speech pattern. Various acoustic analysis methods can be used to extract these parameters. In one known method, a band-pass filter and a rectifier circuit form one channel, several such channels with different passbands are arranged in parallel, and the time variation of the spectrum pattern is extracted as the output of this band-pass filter group. In this case the acoustic parameters can be expressed as a time series Pi(n) (i = 1, 2, ..., I, where I is, for example, the number of band-pass filter channels; n = 1, 2, ..., N, where N is the number of frames used for recognition within the interval found by speech interval determination).

The acoustic parameter time series Pi(n) from the acoustic analysis circuit (2) is supplied to a mode switching circuit (3) consisting of, for example, a switch. When the switch of this circuit (3) is set to terminal A, the device is in registration mode, and the acoustic parameter time series Pi(n) is stored in a standard pattern memory (4) as a recognition parameter. In other words, the speaker's speech patterns are stored in this memory (4) as standard patterns before recognition takes place. Because of variations in speaking rate and differences in word length, the registered standard patterns generally differ in their number of frames.

When the switch (3) is set to terminal B, the device is in recognition mode. In this mode the acoustic parameter time series of the current input speech from the acoustic analysis circuit (2) is supplied to an input speech pattern memory (5) and stored temporarily. A distance calculation circuit (6) then computes the magnitude of the difference between this input pattern and each of the standard patterns of the recognition target words read from the standard pattern memory (4). A minimum value determination circuit (7) detects the recognition target word whose standard pattern differs least from the input pattern, and the input word is thereby recognized.

Input speech can thus be recognized by pattern matching between registered standard patterns and the input pattern. In doing so, however, it must be taken into account that even when the same word is uttered in the same way, its spectrum pattern shifts or stretches and contracts along the time axis. For example, when recognizing the word "hai" ("yes"), if the standard pattern was registered as "hai" and the input speech is drawn out in time as "haai", the distance becomes large, the input is treated as an entirely different word, and correct recognition fails. Pattern matching for speech recognition therefore requires time normalization, which corrects such shifts and stretching along the time axis, and this time normalization is an important step for improving recognition accuracy.

One method of time normalization is known as DP (dynamic programming) matching (see, for example, Japanese Patent Laid-Open No. 50-96104).

DP matching can be explained as follows.

The input pattern A is expressed as follows.

$$A = a_1 a_2 \cdots a_k \cdots a_K \tag{1}$$

Here $a_k$ is a quantity representing the features of the speech at time $k$, called a feature vector, and is written

$$a_k = (a_{k1}, a_{k2}, \ldots, a_{kq}, \ldots, a_{kQ}) \tag{2}$$

Q is the dimension of the vector; when a band-pass filter group is used for acoustic analysis, it corresponds to the number of channels.

Similarly, the standard pattern of a particular word is denoted B and expressed as follows.

$$B = b_1 b_2 \cdots b_l \cdots b_L \tag{3}$$

$$b_l = (b_{l1}, b_{l2}, \ldots, b_{lq}, \ldots, b_{lQ}) \tag{4}$$

As shown in FIG. 5, time normalization of speech patterns can be viewed as a mapping between the time axis k of the input pattern A and the time axis l of the standard pattern B.

This mapping is expressed as the function

$$l = l(k) \tag{5}$$

and is called the distortion function. If the distortion function is known, the time axis of the standard pattern B can be transformed with it and aligned with the time axis k of the input pattern A. In other words, the distortion function converts pattern B into a pattern B′ aligned with the time axis k of the input pattern A.

Here,

$$B' = b_{l(1)} b_{l(2)} \cdots b_{l(k)} \cdots b_{l(K)} \tag{6}$$

The distortion function is unknown, but it can be found from an optimality condition. That is, if one pattern, for example the standard pattern, is artificially distorted so that it becomes most similar to the other pattern (the input pattern), that is, so that the distance is minimized, the original distortion is removed; this yields the optimal distortion function and hence the mapped pattern B′.

DP matching is a technique for carrying out this principle. It obtains the mapped pattern B′ by imposing the following constraints on the distortion function.

(i) l(k) is approximately a monotonically increasing function.
(ii) l(k) is approximately a continuous function.
(iii) l(k) takes values in the neighborhood of k.

What the matching process must deliver is the distance between the standard pattern and the input pattern, which is expressed as

$$\| A - B' \| = \sum_{k=1}^{K} \| a_k - b_{l(k)} \| \tag{7}$$

where $\| \cdot \|$ denotes the distance between two vectors. The minimum of this distance is the quantity D(A, B) that represents the difference between the standard pattern B and the input pattern A after optimal time normalization has removed the time distortion, and it can be defined as

$$D(A, B) = \min_{l(k)} \sum_{k=1}^{K} \| a_k - b_{l(k)} \| \tag{8}$$

Therefore, when several standard patterns are registered, the quantity D(A, B) is computed between each standard pattern and the input pattern, and the input is judged to match the standard pattern for which D(A, B) is smallest.
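The minimization over the distortion function can be illustrated with a short dynamic-programming sketch. It is a generic dynamic time warping recursion, not the procedure of the cited publication: the step pattern (repeat a frame of either pattern, or advance both) is a common textbook choice, the Euclidean frame distance is an assumption, and constraint (iii) is not enforced. The names `frame_distance` and `dp_distance` are hypothetical.

```python
import math

def frame_distance(a, b):
    # Euclidean distance between two feature vectors (assumed metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dp_distance(A, B):
    """Smallest accumulated frame distance over monotonic, continuous alignments."""
    K, L = len(A), len(B)
    inf = float("inf")
    g = [[inf] * L for _ in range(K)]
    g[0][0] = frame_distance(A[0], B[0])
    for k in range(K):
        for l in range(L):
            if k == 0 and l == 0:
                continue
            best = inf
            if k > 0:
                best = min(best, g[k - 1][l])      # repeat frame l of B
            if l > 0:
                best = min(best, g[k][l - 1])      # repeat frame k of A
            if k > 0 and l > 0:
                best = min(best, g[k - 1][l - 1])  # advance both patterns
            g[k][l] = best + frame_distance(A[k], B[l])
    return g[K - 1][L - 1]
```

Because the table has K × L cells and must be filled once per registered standard pattern, the cost grows with both the pattern lengths and the vocabulary size.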

As described above, DP matching does not prepare many standard patterns in advance to allow for shifts along the time axis. Instead, it uses the distortion function to generate many time-normalized versions of each standard pattern, computes the distance between these and the input pattern, and recognizes speech by detecting the minimum.

With DP matching, however, the number of frames in the registered standard patterns is not fixed, and DP matching must be carried out between the input pattern and every registered standard pattern. The amount of computation therefore grows sharply as the vocabulary increases.

In addition, DP matching is a matching scheme that gives weight to stationary portions (portions where the spectrum pattern does not change over time), so misrecognition could occur between partially similar patterns.

The present applicant has previously proposed a time normalization method free of these drawbacks (for example, Japanese Patent Application No. 59-106177).

The acoustic parameter time series Pi(n) traces a sequence of points when viewed in its parameter space. For example, if the recognition target word is "HAI", the acoustic analysis uses two band-pass filters, and

Pi(n) = (P1, P2),

then the acoustic parameter time series of the input speech traces, in the two-dimensional parameter space, a sequence of points such as that shown in FIG. 6. As the figure makes clear, the points are sparse in the non-stationary portions of the speech and dense in the quasi-stationary portions. This follows from the fact that if speech were completely stationary, the parameters would not change, and the point sequence would stay at a single point in the parameter space.

It follows that most of the time-axis variation caused by changes in speaking rate stems from differences in the point density of the quasi-stationary portions, and that the duration of the non-stationary portions has little influence. Therefore, if a trajectory is estimated from the point sequence of the input parameter time series Pi(n) as a continuous curve that passes approximately through the whole sequence, as shown in FIG. 7, this trajectory is almost invariant to changes in speaking rate.

On this basis, the applicant proposed the following time-axis normalization method. First, a trajectory is estimated by drawing a continuous curve Pi(s) from the start point Pi(1) to the end point Pi(N) of the input parameter time series Pi(n), and the trajectory length S is obtained from the estimated curve Pi(s). Then, as shown in FIG. 8, the trajectory is resampled along its length at a predetermined interval T. For example, to resample to M points, the trajectory is resampled using the length

$$T = S / (M - 1) \tag{9}$$

as the unit. If the parameter time series formed by the resampled points is written Qi(m) (i = 1, 2, ..., I; m = 1, 2, ..., M), then Qi(m) retains the basic information of the trajectory and is almost invariant to changes in speaking rate. In other words, it is a recognition parameter time series whose time axis has been normalized.
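The resampling step can be sketched in a few lines. This is a minimal illustration that assumes the trajectory is approximated by straight segments between consecutive frames (the text above speaks only of a continuous curve); the function name `resample_along_trajectory` is hypothetical, and the input needs at least two frames and M of at least 2.

```python
import math

def resample_along_trajectory(points, m):
    """Return m points equally spaced in arc length along the polyline through `points`."""
    # Lengths of the straight segments between consecutive frames.
    seg = [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]
    total = sum(seg)            # trajectory length S
    step = total / (m - 1)      # resampling interval T = S / (M - 1)
    out = []
    i, walked = 0, 0.0          # current segment index, length walked before it
    for j in range(m):
        target = min(j * step, total)
        # Move forward to the segment that contains the target arc length.
        while i < len(seg) - 1 and walked + seg[i] < target:
            walked += seg[i]
            i += 1
        frac = 0.0 if seg[i] == 0 else min((target - walked) / seg[i], 1.0)
        a, b = points[i], points[i + 1]
        out.append(tuple(x + frac * (y - x) for x, y in zip(a, b)))
    return out
```

Repeating frames where the parameters do not change (a slower utterance dwelling in quasi-stationary portions) adds zero-length segments only, so the resampled points are unchanged; this is the sense in which the result is insensitive to speaking rate.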

Therefore, if this parameter time series Qi(m) is registered as the standard pattern, the input pattern is also obtained as a parameter time series Qi(m), the distance between the two patterns is computed from these series, and recognition is performed by detecting the pattern with the smallest distance, then speech is always recognized with time-axis variation normalized away.

With this processing method, the number of frames of the recognition parameter time series Qi(m) is always M, regardless of variations in speaking rate or differences in word length at registration. Moreover, because Qi(m) is time-normalized, good results can be expected even when the distance between the input pattern and a registered standard pattern is computed with the simplest operation, the Chebyshev distance.
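Because every pattern has the same number of frames, matching reduces to a frame-by-frame comparison with no alignment search. The sketch below assumes the textbook definition of Chebyshev distance (the largest absolute difference over all frames and channels); the text does not state the formula it has in mind, so this is an assumption. The names `chebyshev_distance` and `recognize` and the word labels are hypothetical.

```python
def chebyshev_distance(p, q):
    """Largest absolute component difference between two equal-length patterns."""
    return max(abs(x - y) for fp, fq in zip(p, q) for x, y in zip(fp, fq))

def recognize(input_pattern, templates):
    """Return the label of the stored pattern with the smallest distance to the input."""
    return min(templates, key=lambda w: chebyshev_distance(input_pattern, templates[w]))

# Two stored patterns, each with M = 2 frames of 2 channels (illustrative values).
templates = {
    "hai": [(0.0, 0.0), (1.0, 1.0)],
    "iie": [(5.0, 5.0), (6.0, 6.0)],
}
result = recognize([(0.1, 0.0), (1.0, 0.9)], templates)
```

Each comparison costs M × I operations, compared with a table of K × L cells per word for DP matching.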

This method is also a time normalization technique that gives more weight to the non-stationary portions of speech, so misrecognition between partially similar patterns, as occurs with DP matching, is reduced.

Furthermore, the normalized parameter time series Qi(m) carries no information about variations in speaking rate. This makes it easy to handle, for example, the global features of the parameter transition structure laid out in the parameter space, and allows various methods that are also effective for speaker-independent recognition to be applied.

This time normalization process is hereinafter called NAT (Normalization Along Trajectory) processing.

D. Problems to Be Solved by the Invention

In the NAT processing described above, when the recognition parameter Qi(m) is formed, the number of frames is held constant at M, and resampling is therefore carried out at an interval T obtained by dividing the estimated trajectory length S by the number of frames M.

However, when the number of frames is fixed in this way and the number of resampling points is constant regardless of trajectory length, a problem arises. Compare the simple trajectory of a single syllable such as "a", as in FIG. 9, with the complex trajectory of a word with many syllables such as "Hokkaido", as in FIG. 10. With a small number of frames, parameters that represent a single-syllable trajectory like that of FIG. 9 can be extracted, but for a multi-syllable trajectory like that of FIG. 10 the number of frames, that is, the number of samples, is too small to represent the features of the trajectory adequately. Conversely, a large number of frames suits the multi-syllable case but is needlessly large for a single syllable.

E. Means for Solving the Problems

The invention comprises: a speech interval determination circuit (24) that determines the speech interval of the input speech signal; a feature extraction circuit (23) that obtains an acoustic parameter time series within the speech interval determined by the speech interval determination circuit (24); a computing means (81) that estimates the trajectory traced in the parameter space by the acoustic parameter time series from the feature extraction circuit (23) and obtains its length; processing means (82), (83) that obtain a recognition parameter time series by resampling at a sample interval that depends on the trajectory length obtained by the computing means; a standard pattern memory (4) in which the recognition parameter time series of the standard patterns of the recognition target words are stored; a distance calculation circuit (6) that calculates the difference between the recognition parameter time series of the input pattern from the processing means (82), (83) and the recognition parameter time series of a standard pattern from the standard pattern memory (4); and a minimum value determination circuit (7) that detects the smallest of the values calculated by the distance calculation circuit (6) and produces the recognition output.

F. Operation

In NAT processing, the resampling interval is changed according to the trajectory length. The number of resampled points therefore varies with the trajectory length, giving a number of samples suited to simple and complex trajectories alike, so the information is never insufficient to reproduce the trajectory.

G. Embodiment

FIG. 1 shows one embodiment of a speech recognition device according to the invention. In this example a 15-channel band-pass filter group is used for acoustic analysis.

G1. Description of the Acoustic Analysis Circuit (2)

In the acoustic analysis circuit (2), the speech signal from the microphone (1) is supplied through an amplifier (211) and a band-limiting low-pass filter (212) to an A/D converter (213), where it is converted into a 12-bit digital speech signal at a sampling frequency of, for example, 12.5 kHz. This digital speech signal is supplied to the digital band-pass filters (221₀), (221₁), ..., (221₁₄) of the respective channels of a 15-channel band-pass filter bank (22). The digital band-pass filters (221₀), (221₁), ..., (221₁₄) are, for example, fourth-order Butterworth digital filters, and their passbands are the bands obtained by dividing the range from 250 Hz to 5.5 kHz into equal intervals on a logarithmic axis. The output signals of the digital band-pass filters (221₀), (221₁), ..., (221₁₄) are supplied to rectifier circuits (222₀), (222₁), ..., (222₁₄), respectively, and the outputs of these rectifier circuits are supplied to digital low-pass filters (223₀), (223₁), ..., (223₁₄), respectively. These digital low-pass filters are, for example, FIR low-pass filters with a cutoff frequency of 52.8 Hz.
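The band layout can be made concrete with a short calculation. The sketch assumes the upper limit of the analysis range is 5.5 kHz (an upper limit of 5.5 Hz would lie below the 250 Hz lower limit, while 5.5 kHz sits under half the 12.5 kHz sampling frequency) and that "equal intervals on a logarithmic axis" means each band edge is a constant multiple of the previous one. The function name `log_band_edges` is hypothetical.

```python
def log_band_edges(f_low, f_high, channels):
    """Edges of `channels` adjacent bands, equally wide on a logarithmic frequency axis."""
    ratio = (f_high / f_low) ** (1.0 / channels)   # constant edge-to-edge ratio
    return [f_low * ratio ** c for c in range(channels + 1)]

edges = log_band_edges(250.0, 5500.0, 15)
# Channel c would pass frequencies between edges[c] and edges[c + 1].
bands = list(zip(edges, edges[1:]))
```

With 15 channels spanning a factor of 22 in frequency, each band is wider than the previous one by the same ratio, roughly 1.23.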

The output signals of the digital low-pass filters (223₀), (223₁), ..., (223₁₄), which are the outputs of the acoustic analysis circuit (2), are supplied to a sampler (231) forming part of the feature extraction circuit (23). The sampler (231) samples the output signals of the digital low-pass filters at every frame period of 5.12 msec. A sample time series Ai(n) (i = 1, 2, ..., 15; n is the frame number, n = 1, 2, ..., N) is thus obtained.

The output of the sampler (231), that is, the sample time series Ai(n), is supplied to a sound source information normalization circuit (232), which removes differences in glottal source characteristics between speakers of the speech to be recognized. The acoustic parameter time series Pi(n), with source differences normalized away, is obtained from this circuit (232) and supplied to an in-interval parameter memory (233). On receiving the speech interval determination signal from the speech interval determination circuit (24), the memory (233) stores the source-normalized parameters Pi(n) for each determined speech interval.

The speech interval determination circuit (24) consists of a zero-cross counter (241), a power calculation circuit (242), and a speech interval decision circuit (243). The digital speech signal from the A/D converter (213) is supplied to the zero-cross counter (241) and the power calculation circuit (242). Every frame period of 5.12 msec, the zero-cross counter (241) counts the zero crossings of the 64 samples of the digital speech signal within that frame period, and the count is supplied to the first input of the speech interval decision circuit (243). The power calculation circuit (242) computes, every frame period, the power of the digital speech signal within that frame period, that is, the sum of squares, and its output power signal is supplied to the second input of the speech interval decision circuit (243). The third input of the speech interval decision circuit (243) receives source normalization information from the sound source information normalization circuit (232). The speech interval decision circuit (243) processes the zero-cross count, the in-frame power, and the source normalization information in combination, discriminates between silence, unvoiced sound, and voiced sound, and determines the speech interval.
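The two per-frame measurements can be sketched as follows. The frame length of 64 samples follows from the 5.12 msec frame period at 12.5 kHz; counting a zero crossing as a sign change between adjacent samples is an assumption, and the function name `frame_features` is hypothetical. How the decision circuit combines these values with the source normalization information is not specified in the text, so no decision rule is shown.

```python
def frame_features(samples, frame_len=64):
    """Return (zero_crossings, power) for each complete frame of `frame_len` samples."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # A zero crossing is counted when adjacent samples differ in sign.
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        # Power is taken as the sum of squared sample values within the frame.
        power = sum(s * s for s in frame)
        features.append((zero_crossings, power))
    return features
```

In general terms, a silent frame gives low values for both measurements, while a noise-like frame tends to give a high zero-crossing count.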

The speech interval determination signal indicating the speech interval determined by the speech interval decision circuit (243) is supplied, as the output of the speech interval determination circuit (24), to the in-interval parameter memory (233).

The acoustic parameter time series Pi(n) stored in the memory (233) for the determined speech interval is then read out and supplied to a first NAT processing circuit (8).

G2. Description of the Time Normalization Process

The first NAT processing circuit (8) consists of a trajectory length calculation circuit (81), an interpolation interval calculation circuit (82), and an interpolation point extraction circuit (83).

The parameter time series Pi(n) read from the memory (233) is supplied to the trajectory length calculation circuit (81). This circuit calculates the length of the trajectory that the acoustic parameter time series Pi(n) traces in its parameter space under straight-line approximation, as shown in FIG. 2; that is, the trajectory length.

In this case, the Euclidean distance D(a_i, b_i) between I-dimensional vectors a_i and b_i is

$$D(a_i, b_i) = \sqrt{\sum_{i=1}^{I} (a_i - b_i)^2} \tag{10}$$

Accordingly, when the trajectory is estimated by straight-line approximation from the I-dimensional acoustic parameter time series Pi(n), the distance S(n) between parameters adjacent in the time-series direction is

$$S(n) = D(P_i(n+1), P_i(n)) \quad (n = 1, \ldots, N-1) \tag{11}$$

and the distance SL(n) from the first parameter Pi(1) to the n-th parameter Pi(n) in the time-series direction is

$$SL(n) = \sum_{n'=1}^{n-1} S(n') \tag{12}$$

where SL(1) = 0.

The total trajectory length SL₁ is

$$SL_1 = \sum_{n=1}^{N-1} S(n) \tag{13}$$

The trajectory length calculation circuit (81) performs the signal processing given by equations (11), (12), and (13).
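The computation attributed to the trajectory length calculation circuit can be sketched as a running sum of distances between consecutive frames. The sketch assumes each frame is a tuple of channel values and uses the Euclidean distance named in the text; the function name `cumulative_lengths` is hypothetical.

```python
import math

def cumulative_lengths(frames):
    """Return [SL(1), ..., SL(N)]: path length from the first frame to each frame."""
    sl = [0.0]                                   # SL(1) = 0
    for prev, cur in zip(frames, frames[1:]):
        # S(n): Euclidean distance between frames adjacent in time.
        sl.append(sl[-1] + math.dist(prev, cur))
    return sl

frames = [(0, 0), (3, 4), (3, 4), (6, 8)]
lengths = cumulative_lengths(frames)
total_length = lengths[-1]                       # total trajectory length SL1
```

The repeated frame (3, 4) contributes a zero-length segment, so a stretch where the parameters stay constant leaves the total length unchanged.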

A signal indicating the trajectory length SL₁ obtained by the trajectory length calculation circuit (81) is supplied to the interpolation interval calculation circuit (82), which calculates the resampling interval T₁ used when resampling along the trajectory.

Here the sampling interval T₁ is changed according to the trajectory length d calculated in the trajectory length calculation circuit (81). In this example, T₁ is determined as follows.

First, the number of resampled points p is determined for the trajectory length d, such that d and p vary together. The optimum value of p for each value of d is determined by experiment. For example, for a word length of about 0.5 seconds, p = 30 when d = 100, p = 45 when d = 200, and p = 20 when d = 50.

The sampling interval T₁ is then obtained as

$$T_1 = \frac{SL_1}{p - 1} \tag{14}$$
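As a rough sketch of how p and T₁ could be derived from the trajectory length, the snippet below stores the three example pairs from the text in a table and applies equation (14). The linear interpolation between table entries, the clamping outside the table, and the names `choose_sample_count` and `resampling_interval` are assumptions made for illustration; the text only states that p is fixed experimentally for each d.

```python
# (d, p) pairs taken from the example values given in the text.
EXAMPLE_TABLE = ((50.0, 20), (100.0, 30), (200.0, 45))


def choose_sample_count(d, table=EXAMPLE_TABLE):
    """Pick the number of resampling points p for trajectory length d.

    Assumption: values between table entries are linearly interpolated
    and values outside the table are clamped to the nearest entry.
    """
    if d <= table[0][0]:
        return table[0][1]
    if d >= table[-1][0]:
        return table[-1][1]
    for (d0, p0), (d1, p1) in zip(table, table[1:]):
        if d0 <= d <= d1:
            return round(p0 + (p1 - p0) * (d - d0) / (d1 - d0))
    raise ValueError("table must be sorted by d")


def resampling_interval(total_length, p):
    """Equation (14): interval that places p points on the trajectory."""
    if p < 2:
        raise ValueError("at least two resampling points are required")
    return total_length / (p - 1)


d = 100.0
p = choose_sample_count(d)
print(p, resampling_interval(d, p))
```

Dividing by p − 1 rather than p places the first and last resampling points exactly on the start and end of the trajectory.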

A signal indicating the sampling interval T₁ from the interpolation interval calculation circuit (82) is supplied to the interpolation point extraction circuit (83), which also receives the acoustic parameter time series Pi(n) from the memory (233). The interpolation point extraction circuit (83) resamples at the sampling interval T₁, as indicated by the circles in FIG. 2, along the trajectory obtained by linear approximation between the parameters of the acoustic parameter time series Pi(n) in its parameter space. A new acoustic parameter time series Ri(p) is formed from the resulting point sequence.

As described above, the number of frames p of this parameter time series Ri(p) varies with the trajectory length, so the series can represent the features of the trajectory sufficiently, and it is also largely normalized in the time-axis direction.

This acoustic parameter time series Ri(p) may of course be registered in the standard pattern memory (4) and used directly for pattern matching. In that case, the time series Ri(p) may be subjected to DP matching.

However, using DP matching halves the benefit of the NAT processing. In this example, therefore, the new acoustic parameter time series Ri(p) is supplied to a second NAT processing circuit (9) so that the advantages of NAT processing are retained.

That is, the second NAT processing circuit (9) consists of a trajectory length calculation circuit (91), an interpolation interval calculation circuit (92), and an interpolation point extraction circuit (93), and the acoustic parameter time series Ri(p) is supplied to the trajectory length calculation circuit (91). As in circuit (81), the trajectory length calculation circuit (91) calculates the length SL₂ of the linearly approximated trajectory that the acoustic parameter time series Ri(p) draws in its parameter space.

A signal indicating the trajectory length SL₂ obtained by the trajectory length calculation circuit (91) is supplied to the interpolation interval calculation circuit (92), which calculates the resampling interval T₂. In this second NAT processing the number of frames is constant regardless of the word length, that is, of the trajectory length. If the trajectory is resampled at M points, for example, the resampling interval T₂ is obtained as

$$T_2 = \frac{SL_2}{M - 1} \tag{15}$$

A signal indicating the resampling interval T₂ from the interpolation interval calculation circuit (92) is supplied to the interpolation point extraction circuit (93), which also receives the acoustic parameter time series Ri(p) from the interpolation point extraction circuit (83). The interpolation point extraction circuit (93) resamples at the resampling interval T₂ along the trajectory of the acoustic parameter time series Ri(p) in its parameter space, for example a trajectory obtained by linear approximation between the parameters, and forms the recognition parameter time series Qi(m) from the new point sequence obtained by this sampling.

The interpolation point extraction circuits (83) and (93) carry out processing according to the flowchart shown in FIG. 3 to form the parameter time series Ri(p) and Qi(m), respectively.

FIG. 3 is described for the case of forming the new acoustic parameter time series Ri(p) from the acoustic parameter time series Pi(n), but the recognition parameter time series Qi(m) is obtained from Ri(p) in exactly the same way.

First, in step [101], the variable J, which indicates the number of the resampling point in the time-series direction, and the variable IC, which indicates the frame number of the acoustic parameter time series Pi(n), are both initialized to 1. Next, in step [102], the variable J is incremented. In step [103], it is determined whether J is (P−1) or less, that is, whether the current resampling point number is still short of the last number that needs to be resampled. If it is the last number, the process proceeds to step [104] and the resampling ends.

If it is not the last number, the resampling distance DL from the first resampling point to the J-th resampling point is calculated in step [105]. The process then proceeds to step [106], where the variable IC is incremented. In step [107], whether the J-th resampling point lies on the start-point side of the parameter Pi[IC] on the trajectory is judged by checking whether the resampling distance DL is smaller than the distance SL[IC] from the first parameter Pi(1) of the acoustic parameter time series Pi(n) to the IC-th parameter Pi[IC]. If the point is not on the start-point side, the process returns to step [106], IC is incremented, and the comparison in step [107] is repeated. When the resampling point is judged to lie on the start-point side of the parameter Pi[IC], the process proceeds to step [108], where the new acoustic parameter Ri(J) is formed.

That is, the distance SL[IC−1] to the (IC−1)-th parameter Pi[IC−1], which lies on the start-point side of the J-th resampling point, is subtracted from the resampling distance DL of the J-th resampling point, giving the distance SS from the (IC−1)-th parameter Pi[IC−1] to the J-th resampling point. Next, this distance SS is divided by the distance S[IC−1] between the parameters Pi[IC−1] and Pi[IC] located on either side of the J-th resampling point on the trajectory (this distance is obtained by the signal processing of equation (11)). The quotient SS / S[IC−1] is multiplied by the difference between Pi[IC] and Pi[IC−1], which gives the amount of interpolation from the (IC−1)-th parameter Pi[IC−1], the parameter adjacent to the J-th resampling point on the start-point side. Adding this interpolation amount to Pi[IC−1] forms the new acoustic parameter along the trajectory: Ri(J) = Pi[IC−1] + (SS / S[IC−1]) (Pi[IC] − Pi[IC−1]).

In this way, the parameter time series Ri(p) is formed by resampling the (P−2) points other than the start point and the end point (these are Ri(1) = Pi(o) = O and Ri(p) = Pi(s)).
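The resampling described by the flowchart can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a transcription of FIG. 3: the function name `resample_trajectory` is hypothetical, indices are 0-based, the J-th resampling distance is taken to be DL = T × J for J counted from 0, and the frame index is advanced only while the target distance has not yet been reached, so that several resampling points may fall in the same segment.

```python
import math


def resample_trajectory(frames, count):
    """Resample a piecewise-linear trajectory at `count` points that are
    equally spaced along its length. The end points are copied unchanged."""
    if count < 2 or len(frames) < 2:
        raise ValueError("need at least two frames and two output points")

    # Segment lengths S and cumulative distances SL along the trajectory.
    seg = [math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
           for a, b in zip(frames, frames[1:])]
    cum = [0.0]
    for s in seg:
        cum.append(cum[-1] + s)

    interval = cum[-1] / (count - 1)  # resampling interval T
    out = [tuple(frames[0])]          # start point
    ic = 1                            # index of the frame just past the target
    for j in range(1, count - 1):
        dl = interval * j             # distance DL of the j-th point from the start
        # Advance until the target lies on the start side of frames[ic].
        while ic < len(frames) - 1 and cum[ic] <= dl:
            ic += 1
        ss = dl - cum[ic - 1]         # distance SS past frames[ic - 1]
        ratio = ss / seg[ic - 1] if seg[ic - 1] > 0 else 0.0
        out.append(tuple(a + ratio * (b - a)
                         for a, b in zip(frames[ic - 1], frames[ic])))
    out.append(tuple(frames[-1]))     # end point
    return out


# Illustrative data only: an L-shaped trajectory of total length 7.
points = resample_trajectory([(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)], 8)
print(points)
```

Because the output count is a parameter, the same routine serves both stages: the first stage would pass the length-dependent p, and the second stage a fixed M.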

When the trajectory is estimated and resampled, always starting from silence means that a shift in the section determined by the voice section determination circuit (24) has almost no effect on the trajectory or on the resampling. In this case, the end point of the trajectory and the end point of the resampling may also be placed in a silent portion.

G3 Description of pattern matching processing

In the registration mode, the recognition parameter time series Qi(m) from the second NAT processing circuit (9) is routed by the mode changeover switch (3) to the standard pattern memory (4) and stored there for each recognition target word. In the recognition mode, it is supplied to the distance calculation circuit (6), which calculates its distance from the parameter time series of each standard pattern read from the standard pattern memory (4). The distance in this case is calculated, for example, as a simple Chebyshev distance. The distances between each standard pattern and the input pattern output by the distance calculation circuit (6) are supplied to the minimum value determination circuit (7), which determines the standard pattern with the smallest distance. The recognition result for the input voice is obtained at the output terminal (70) from this determination.
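The matching stage can be illustrated with a small sketch. The names `chebyshev` and `recognize` are hypothetical, and the distance uses the textbook Chebyshev definition (largest absolute coordinate difference over the whole pattern), which is an assumption: this section names the distance but does not give its formula. Patterns are assumed to have already been normalized to the same number of frames.

```python
def chebyshev(pattern_a, pattern_b):
    """Largest absolute difference between two patterns of equal shape.

    Assumption: textbook Chebyshev (L-infinity) distance, treating the
    whole pattern as a single vector.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same number of frames")
    return max(abs(x - y)
               for frame_a, frame_b in zip(pattern_a, pattern_b)
               for x, y in zip(frame_a, frame_b))


def recognize(input_pattern, templates):
    """Return the word whose stored pattern is closest to the input."""
    return min(templates, key=lambda word: chebyshev(input_pattern, templates[word]))


# Illustrative data only: two stored words with two frames each.
templates = {
    "yes": [(0.0, 0.0), (1.0, 1.0)],
    "no": [(5.0, 5.0), (6.0, 6.0)],
}
print(recognize([(0.1, 0.0), (1.0, 0.9)], templates))
```

Since every pattern has the same fixed length after the second normalization, the comparison is a single pass per stored word, with no time alignment step.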

H Effect of the invention

In this invention, the sample interval used for resampling in the NAT processing is changed according to the length of the trajectory drawn by the acoustic parameter time series. This prevents the recognition rate from deteriorating because of differences in word length, that is, in the number of syllables.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the device of the present invention, FIG. 2 is a diagram for explaining it, FIG. 3 is a flowchart for explaining the operation of its main part, FIG. 4 is a block diagram showing the basic configuration of a voice recognition device, FIG. 5 is a diagram for explaining DP matching, FIGS. 6 to 8 are diagrams for explaining NAT processing, and FIGS. 9 and 10 are diagrams showing examples of the trajectories drawn by the parameter time series for a monosyllabic word and a polysyllabic word, respectively.

(2) is the acoustic analysis circuit, (4) is the standard pattern memory, (6) is the circuit that calculates the distance between the standard pattern and the input pattern, (7) is the minimum value determination circuit, (8) is the first NAT processing circuit, and (9) is the second NAT processing circuit.

Continuation of the front page

(72) Inventor: Makoto Akabane, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo

(56) References: Proceedings of the Acoustical Society of Japan, October 1984 (Showa 59), 1-9-9, pp. 17-18

Claims

[Claims]

1. A voice recognition device comprising:
(a) voice section determining means for determining a voice section of an input voice signal;
(b) feature extracting means for obtaining an acoustic parameter time series within the voice section determined by the voice section determining means;
(c) calculating means for estimating the trajectory that the acoustic parameter time series from the feature extracting means draws in its parameter space, and for calculating the length of that trajectory;
(d) processing means for obtaining a recognition parameter time series by resampling the trajectory at a sample interval that depends on the trajectory length obtained by the calculating means;
(e) a standard pattern memory in which the recognition parameter time series of the standard pattern of each recognition target word is stored;
(f) distance calculating means for calculating the difference between the recognition parameter time series of the input pattern from the processing means and the recognition parameter time series of the standard pattern from the standard pattern memory; and
(g) minimum value determining means for detecting the smallest of the values calculated by the distance calculating means and obtaining a recognition output.