JPS634198B2

Info

Publication number: JPS634198B2
Application number: JP56041735A
Authority: JP
Inventors: Yorio Iio
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1981-03-24
Filing date: 1981-03-24
Publication date: 1988-01-27
Also published as: JPS57157298A

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a speech recognition device that takes expansion and contraction of the speaking rate into account, and in particular to a speaker-independent speech recognition device that has a function for learning standard speech information and can adapt to the speaker.

The device recognizes input speech by measuring the similarity between the feature vector series of the input speech and the feature vector series of standard speech stored in advance.

In this specification, a feature vector means the set of speech features in one time window, a feature vector series means a time series of feature vectors over a certain duration, and the term sample position denotes the order of a feature vector along the time axis. As a representative example, when speech is analyzed by L band-pass filters (BPFs) and features are extracted by sampling at a fixed time interval, the analysis outputs of the L BPFs at one sample position form a feature vector, and a time series of consecutive feature vectors is a feature vector series.

One factor that degrades the recognition rate in speech recognition is variation in speaking rate, so similarity measurement generally requires some form of time-axis normalization. Dynamic programming is a well-known example. Although that method restricts the computation path by a fixed recurrence formula, it essentially handles speaking-rate variation by associating each sample position of the standard speech with a relatively large number (for example, 30%) of the sample positions of the input speech when measuring similarity, and it therefore requires an enormous amount of computation time. Moreover, because its time-axis normalization is driven by the similarity between feature vectors, the time axis may be warped nonlinearly beyond the range of ordinary speaking-rate variation. This can raise the similarity to standard speech other than the intended one and so actually lower the recognition rate.

The present invention is based on the recognition that nonlinear variation in speaking rate arises because parts in which the speaking rate varies very little, such as coarticulation parts, are mixed with parts in which it varies greatly, such as steady vowel parts. In the former parts similarity is measured with a correspondence on the same time scale, and in the latter parts it is measured with a linear expansion/contraction correspondence, so that overall the similarity measurement accommodates nonlinear variation in speaking rate. To this end, the invention stores the sample positions that identify the subsequences of the standard speech feature vector series in which the speaking rate varies little (specific subsequences). The sample positions of the standard speech and the input speech are put into linear correspondence, candidate feature vector series of the input speech that correspond in a fixed form to a specific subsequence of the standard speech are determined, the similarity between the specific subsequence and each candidate is measured, the candidate giving the maximum similarity is taken as the corresponding subsequence, and that maximum similarity is taken as the similarity of the specific subsequence. Furthermore, the similarity between the residual sequence obtained by removing the specific subsequences from the standard speech feature vector series and the residual sequence obtained by removing the corresponding subsequences from the input speech feature vector series is measured with a linear expansion/contraction correspondence. The similarity between the standard speech and the input speech is the combination of the similarities of the specific subsequences and the residual sequences, and the input speech is recognized from this combined similarity.

A speech recognition device based on such a piecewise-linear combination is well suited to speaker-independent recognition, where it is easy to impose various restrictions on the stored patterns of standard speech. However, for a speaker-independent recognizer that aims to raise the recognition rate by learning the feature vectors that vary from speaker to speaker and thereby adapting, the learning method requires some ingenuity in how the standard speech and the input speech are put into correspondence.

Conventionally, the learning problem that arises when the standard speech and the input speech have different numbers of feature vectors has been handled by converting the number of feature vectors to match that of either the standard speech or the input speech. However, this can make the standard speech deviate from the length of each speaker's natural utterance and can degrade the quality of the feature vectors, so the feature vectors of the standard speech lack stability.

The object of the present invention is to overcome these drawbacks by performing learning in which the residual sequences of the standard speech and the input speech are put into correspondence by linear interpolation, and both the number of feature vectors of the standard speech and the values of the feature vector elements are updated. The invention is described in detail below.

For example, standard speech is prepared in advance as information comprising the following feature vector series and the sample positions of the specific subsequence.

Feature vector series X: x_1, x_2, …, x_i, x_{i+1}, …, x_{i+k}, …, x_n
Sample position of each feature vector: T_1, T_2, …, T_i, T_{i+1}, …, T_{i+k}, …, T_n
Number of feature vectors in the feature vector series X: m
First sample position of the specific subsequence x_i to x_{i+k}: T_i
Number of consecutive feature vectors in the specific subsequence x_i to x_{i+k}: k+1

If there are several specific subsequences, this information is prepared for each of them. A specific subsequence is a part in which the speaking rate varies little.

In general, when a word is uttered, coarticulation occurs between adjacent phonemes. These coarticulation parts carry information that is important for speech recognition, and the speaking rate varies very little within them.

Coarticulation parts are easy to locate, for example by visually inspecting a sonogram or by detecting formant transitions. One or more coarticulation parts are designated as specific subsequences for each standard speech, but the designation need not follow a single uniform method.

For example, if a relatively large number of feature vectors are obtained in a coarticulation part, it suffices to take that part as the specific subsequence. By contrast, for the word "nana" sampled at a 10 ms period, only one or two feature vectors can be extracted from the first coarticulation part, and, as described later, the similarity measurement then lacks stability, so it is better to designate a specific subsequence that also includes part of the steady portion. Naturally, for standard speech in which no part with small speaking-rate variation can be identified, no specific subsequence is designated and similarity is measured by some other method not directly related to the present invention.

To determine the corresponding subsequence in the feature vector series of the input speech, the sample positions are put into correspondence by linear expansion/contraction, several corresponding-subsequence candidates are set, and the similarity between each candidate and the specific subsequence is measured on the same time scale. The corresponding subsequence consists of the same number of consecutive feature vectors as the specific subsequence. Even when the corresponding subsequence is a coarticulation part, its position generally cannot be pinned down exactly, because the steady vowel parts and pauses before and after it expand and contract in time.

However, unless the word lasts several seconds, the position can be estimated within a fairly narrow range. For example, words such as "ichi", "nii", "san", "Tokyo", and "Yokohama" last at most 400 ms to 500 ms, and when such short words are sampled at a 10 ms period, the sample position of the first feature vector of a coarticulation part varies by only about 4 to 10 samples.

Accordingly, the sample positions T_1 to T_n of the standard speech are put into linear correspondence with the sample positions T_1 to T_o of the input speech, and the position T_{je} corresponding to the first sample position T_i of the specific subsequence x_i to x_{i+k}, together with several sample positions T_{ja}, …, T_{je}, …, T_{jj} around and including it, is obtained in a fixed form among the input sample positions T_1 to T_o. The candidates that start at these positions allow the corresponding subsequence to be estimated well. About 4 to 10 candidates are needed. The number could be set for each standard speech, but since it is at most about 10, a uniform value is also acceptable.

Let the input speech feature vector series be Y: y_1, y_2, …, y_{je}, …, y_o and the input speech sample positions be T_1, T_2, …, T_{je}, …, T_o. The sample positions T_{ja}, …, T_{je}, …, T_{jj} of the following equation (1) show how the first sample position of each candidate is obtained under the first form, for the case of ten candidates.

Here T_n is the last sample position in the standard speech feature vector series X, T_o is the last sample position in the input speech feature vector series Y, and T_i is the sample position of the first feature vector x_i of the specific subsequence.

Equation (1) sets the range of corresponding-subsequence candidates by linear expansion/contraction. The corresponding subsequence can also be estimated well with the second form, given by the following equation (2), in which the sample positions T_{je}−5 to T_{je}+4 around and including T_{je} are taken as the first sample positions of the candidates in the input speech.
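Equations (1) and (2) are not reproduced in this text, so the following Python sketch only illustrates the idea described above. It assumes that T_{je} is obtained by scaling T_i with the ratio T_o / T_n, that the result is rounded to the nearest sample, and that the ten candidates run from T_{je}−5 to T_{je}+4. The function name `candidate_starts` and its parameters are hypothetical.

```python
def candidate_starts(t_i, t_n, t_o, k, before=5, after=4):
    """Return candidate first-sample positions in the input speech.

    t_i: first sample position of the specific subsequence (standard speech)
    t_n: last sample position of the standard speech
    t_o: last sample position of the input speech
    k:   the specific subsequence holds k + 1 feature vectors
    """
    # Assumed linear correspondence of sample positions; the exact form of
    # equation (1) is not given in the text.
    t_je = round(t_i * t_o / t_n)
    starts = range(t_je - before, t_je + after + 1)
    # Keep only candidates whose k + 1 vectors fit inside positions 1..t_o.
    return [t for t in starts if t >= 1 and t + k <= t_o]
```

Candidates that would run past either end of the input are dropped here; the source does not say how such cases are handled, so that choice is also an assumption.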

To determine the corresponding subsequence, the ten candidates that start at T_{ja}, …, T_{je}, …, T_{jj} and consist of the same number of consecutive feature vectors as the specific subsequence are each compared with the specific subsequence on the same time axis, and the one giving the maximum similarity is taken as the corresponding subsequence. Similarity can be measured with a feature distance such as the absolute-value distance or the squared distance. Writing that feature distance as d( ), the distance D_{je} between the specific subsequence and, for example, the candidate whose first sample position is T_{je} is as follows.

$$D_{je} = d(x_i, y_{je}) + d(x_{i+1}, y_{je+1}) + \cdots + d(x_{i+k}, y_{je+k}) \qquad (3)$$

where y_{je} to y_{je+k} are the feature vectors of the corresponding-subsequence candidate whose first position is the sample position T_{je}.

Let the distances for the candidates be D_{ja} to D_{jj}. If the smallest of them, that is, the candidate with the maximum similarity, is y_j, y_{j+1}, …, y_{j+k}, this candidate is taken as the corresponding subsequence and its distance D_j is used as the measure of similarity of the specific subsequence. To normalize the similarity with respect to speech length, the normalized distance \bar{D}_j, obtained by dividing D_j by the number of feature vectors (k+1), may be computed at this point.
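The candidate search can be sketched in Python as follows. This is a minimal illustration, not the circuit of the embodiment: it uses 0-based list indices instead of sample positions, takes the absolute-value distance as d( ), and the names `city_block`, `subsequence_distance`, and `best_candidate` are hypothetical.

```python
def city_block(x, y):
    """Absolute-value distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def subsequence_distance(X, Y, i, j, k, d=city_block):
    """Equation (3): sum of k + 1 vector distances on the same time scale.

    i and j are the 0-based start indices in X and Y.
    """
    return sum(d(X[i + p], Y[j + p]) for p in range(k + 1))


def best_candidate(X, Y, i, k, starts, d=city_block):
    """Return (j, D_j) for the candidate start with the smallest distance."""
    scored = ((j, subsequence_distance(X, Y, i, j, k, d)) for j in starts)
    return min(scored, key=lambda pair: pair[1])
```

Dividing the returned distance by k + 1 gives the normalized distance described in the text.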

$$\bar{D}_j = \frac{1}{k+1} D_j \qquad (4)$$

If the specific subsequence contains extremely few feature vectors, the determination of the corresponding subsequence becomes unstable, so it is better to lengthen the specific subsequence somewhat by including feature vectors from, for example, a steady vowel part. Corresponding subsequences are computed for all of the specific subsequences. Once they have all been determined, the similarity of the residual feature vector series is measured.

FIG. 1 shows the linear correspondence of the residual feature vector series. Whereas the similarity of a specific subsequence is measured on the same time scale, for the residual sequences the sample positions of each residual sequence of the standard speech (the parts other than the specific subsequences) are put into linear expansion/contraction correspondence with those of the input speech before the similarity is computed.

For example, in FIG. 1, in matching the last residual sequence x_{i+k+1} to x_n of the standard speech against the last residual sequence y_{j+k+1} to y_o of the input speech, the feature vectors x_u and y_v whose sample positions T_u and T_v satisfy the following relation are paired.

$$T_v = \frac{T_o - T_{j+k+1}}{T_n - T_{i+k+1}} \left(T_u - T_{i+k+1}\right) + T_{j+k+1} \qquad (5)$$

where T_u = T_{i+k+1}, T_{i+k+2}, …, T_n. The distance d(x_u, y_v) is computed for each feature vector pair x_u, y_v put into correspondence by this piecewise-linear expansion/contraction, and their sum is the distance D_2, that is, the similarity, of the last residual sequence.

$$D_2 = \sum_{u=i+k+1}^{n} d(x_u, y_v)$$

To normalize the similarity with respect to speech length, the normalized distance \bar{D}_2, obtained by dividing D_2 by the number of residual feature vectors m − (i + k), may be computed at this point.
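As a hedged illustration of the residual matching, the Python sketch below maps each index u of the last residual sequence of the standard speech linearly onto an index v of the input speech and sums the distances. It uses 0-based indices, and it rounds the mapped position to the nearest sample, which is an assumption because the text does not say how fractional positions are treated. The names `city_block` and `residual_distance` are hypothetical.

```python
def city_block(x, y):
    """Absolute-value distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def residual_distance(X, Y, x_start, y_start, d=city_block):
    """Distance between the residual tails X[x_start:] and Y[y_start:].

    Each index u in X is mapped linearly onto an index v in Y so that the
    first and last vectors of the two tails coincide.
    """
    x_end, y_end = len(X) - 1, len(Y) - 1
    if x_end == x_start:
        scale = 0.0  # single-vector tail: pair it with Y[y_start]
    else:
        scale = (y_end - y_start) / (x_end - x_start)
    total = 0.0
    for u in range(x_start, x_end + 1):
        # Rounding to the nearest sample is an assumption of this sketch.
        v = round(scale * (u - x_start)) + y_start
        total += d(X[u], Y[v])
    return total
```

The same function applies to a leading or intermediate residual sequence if the slices are cut accordingly before the call.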

$$\bar{D}_2 = \frac{1}{m-(i+k)} D_2 \qquad (6)$$

If normalization by the number of samples has not been performed beforehand, the similarity between the standard speech and the unknown speech is obtained by dividing the sum of the similarities of the specific subsequences and of the residual sequences by the number of samples m of the standard speech. If instead normalized distances are computed for each specific subsequence and residual sequence, the similarity between the standard speech and the input speech is the sum of those similarities. For recognition, the similarity to the input speech is computed for each required standard speech, and the standard speech with the highest similarity (smallest distance) is selected.
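The combination and decision steps can be sketched as below for the case in which the partial distances have not been normalized beforehand. The function names and the dictionary of per-word scores are hypothetical and serve only to illustrate the rule stated in the text.

```python
def total_distance(specific_distances, residual_distances, m):
    """Sum all partial distances and divide by the sample count m."""
    return (sum(specific_distances) + sum(residual_distances)) / m


def recognize(scores):
    """Return the word whose standard speech has the smallest distance.

    scores maps each word label to its total distance from the input.
    """
    return min(scores, key=scores.get)
```

For example, `recognize({"ichi": 0.8, "nii": 0.3, "san": 0.5})` selects "nii", the entry with the smallest distance.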

To perform learning in such a speech recognition device, the similarity of the input speech to a particular standard speech is computed, registration is permitted if the distance is at or below a fixed value, and the registered speech is computed by linear interpolation between the standard speech and the input speech.

In FIG. 1, let the registered speech feature vector series be Z_1, …, Z_h, …, Z_{h+k}, …, Z_s, …, Z_l, and the corresponding registered speech sample positions be T_1, …, T_h, …, T_{h+k}, …, T_s, …, T_l. The number l of feature vectors in the registered speech feature vector series Z is computed by interpolating between the numbers of feature vectors of the standard speech and the input speech.

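$$l = (1-\alpha)\,m + \alpha\,n, \quad 0 < \alpha < 1 \qquad (8)$$

where α is the time interpolation coefficient, and m and n are the numbers of feature vectors in the standard speech and the input speech, respectively.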
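A small Python sketch of this length interpolation follows. It assumes that the second term uses the number of feature vectors of the input speech (written n here) and that the result is rounded to a whole number of vectors; neither detail is stated explicitly in the text. The function name is hypothetical.

```python
def registered_length(m, n, alpha):
    """Interpolated number of feature vectors for the registered speech.

    m: number of feature vectors in the standard speech
    n: number of feature vectors in the input speech (assumed meaning)
    alpha: time interpolation coefficient, 0 < alpha < 1
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Rounding to an integer count is an assumption of this sketch.
    return round((1.0 - alpha) * m + alpha * n)
```

With m = 40, n = 50, and α = 0.2, the registered speech would hold 42 feature vectors, so each learning step moves the stored length only part of the way toward the input.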

To obtain the registered speech feature vectors Z_h to Z_{h+k} of the specific subsequence, the corresponding sample positions T_h to T_{h+k} are computed by equations (9) and (10).

$$T_h = (1-\alpha)\,T_i + \alpha\,T_j, \quad 0 < \alpha < 1 \qquad (9)$$

$$T_{h+p} = T_h + p, \quad p = 0, 1, \ldots, k \qquad (10)$$

The value Z_{h+p} of each registered speech feature vector is computed by the following equation (11).

$$Z_{h+p} = (1-\beta)\,x_{i+p} + \beta\,y_{j+p}, \quad p = 0, 1, \ldots, k \qquad (11)$$

where β is the amplitude interpolation coefficient, 0 < β < 1.
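The update of a specific subsequence can be sketched in Python as follows. The sketch uses 0-based indices, reads the start position as (1−α)·i + α·j with positions h + p for p = 0…k, pairs x_{i+p} with y_{j+p}, and rounds the start position to the nearest sample, which is an assumption. The function name `register_specific` is hypothetical.

```python
def register_specific(X, Y, i, j, k, alpha, beta):
    """Return (h, Z) for one specific subsequence of the registered speech.

    X, Y: feature vector series of the standard and input speech
    i, j: 0-based starts of the specific and corresponding subsequences
    k:    the subsequence holds k + 1 feature vectors
    """
    # Start position interpolated in time; rounding is an assumption.
    h = round((1.0 - alpha) * i + alpha * j)
    # Vectors are paired on the same time scale and blended in amplitude.
    Z = [
        [(1.0 - beta) * a + beta * b for a, b in zip(X[i + p], Y[j + p])]
        for p in range(k + 1)
    ]
    return h, Z
```

Because the subsequence keeps its k + 1 vectors, only its start position shifts in time, while the vector values move toward the input speech by the fraction β.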

Unlike the specific subsequences, the residual sequences of the registered speech have all of their sample positions computed by linear interpolation between the standard speech and the input speech. For example, to compute the feature vectors Z_{h+k+1} to Z_l of the last residual sequence of the registered speech in FIG. 1, the sample positions T_{h+k+1} and T_l are computed by the following equations (12) and (13) from the first and last sample positions of the last residual sequences of the standard speech and the input speech.

$$T_{h+k+1} = (1-\alpha)\,T_{i+k+1} + \alpha\,T_{j+k+1} \qquad (12)$$

$$T_l = (1-\alpha)\,T_n + \alpha\,T_o \qquad (13)$$

The last residual sequence is used here as the example, but the same applies to the first residual sequence and to residual sequences lying between specific subsequences.

An arbitrary feature vector Z_s (s = h+k+1, …, l) of the last residual sequence of the registered speech is computed by equation (16), with the sample position T_u of the standard speech and the sample position T_v of the unknown speech determined by the following equations (14) and (15).

$$T_u = (T_n - T_{i+k+1})\,\frac{T_s - T_{h+k+1}}{T_l - T_{h+k+1}} + T_{i+k+1} \qquad (14)$$

$$T_v = (T_o - T_{j+k+1})\,\frac{T_s - T_{h+k+1}}{T_l - T_{h+k+1}} + T_{j+k+1} \qquad (15)$$

$$Z_s = (1-\beta)\,x_u + \beta\,y_v \qquad (16)$$

In this way all of the registered speech feature vectors Z_1, …, Z_l are computed and stored as the standard speech updated by learning. The time interpolation coefficient α and the amplitude interpolation coefficient β are determined experimentally, taking into account the reliability of the feature vectors of the standard speech and the input speech and the speed of adaptation by learning; values of about 0.1 to 0.25 are appropriate for both.
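The residual update can be sketched in Python as below, again with 0-based indices. The end points of the registered tail are interpolated in time, each registered position s is mapped back linearly onto the standard and input tails, and the paired vectors are blended with β. Rounding mapped positions to the nearest sample and the name `register_residual` are assumptions of this sketch.

```python
def register_residual(X, Y, x_start, y_start, alpha, beta):
    """Return (z_start, Z) for the last residual sequence.

    X[x_start:] and Y[y_start:] are the residual tails of the standard
    and input speech; all indices are 0-based.
    """
    x_end, y_end = len(X) - 1, len(Y) - 1
    # End points of the registered tail, interpolated in time.
    z_start = round((1.0 - alpha) * x_start + alpha * y_start)
    z_end = round((1.0 - alpha) * x_end + alpha * y_end)
    span = max(z_end - z_start, 1)  # guard against a one-vector tail
    Z = []
    for s in range(z_start, z_end + 1):
        frac = (s - z_start) / span
        # Linear mapping back to each tail; rounding is an assumption.
        u = round((x_end - x_start) * frac) + x_start
        v = round((y_end - y_start) * frac) + y_start
        Z.append([(1.0 - beta) * a + beta * b for a, b in zip(X[u], Y[v])])
    return z_start, Z
```

The number of vectors in the returned tail follows from the interpolated end points, so the registered speech changes length gradually instead of being forced to the length of either source.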

As is clear from the above, the magnitude of each registered feature vector is a linear combination of the standard speech feature vector and the known input speech feature vector, weighted by the amplitude interpolation coefficient. Likewise, the sample position of the first feature vector of each specific subsequence and the sample positions of all feature vectors of the residual sequences of the registered speech are linear combinations of the corresponding positions in the standard speech and the known input speech, weighted by the time interpolation coefficient. The device therefore adapts to the speaker and achieves reliable speech recognition.

FIG. 2 shows a first embodiment of the present invention, in which 1 is a mode switching terminal, 2 is an address control unit, 3 is an input speech input terminal, 4 is an input speech feature vector series storage unit, 5 is an input speech sample count storage unit, 6 is a standard speech input terminal, 7 is a standard speech temporary memory, 8 is a standard speech feature vector series storage unit, 9 is a registered word number input terminal, 10 is a standard speech sample information storage unit, 11 is a specific subsequence corresponding position calculation unit, 12 is a distance calculation unit, 13 is an addition unit, 14 is a minimum value calculation unit, 15 is a minimum value signal line, 16 is an optimum matching position signal line, 17 is an optimum matching position storage unit, 18 is an addition unit, 19 is an input signal line of addition unit 18, 20 is a residual sequence corresponding position calculation unit, 21 is an input signal line of addition unit 18, 22 is a normalization unit, 23 is a registration determination unit, 24 is a registered speech corresponding position calculation circuit, 25 is a registered speech interpolation value calculation circuit, 26 is an optimum pattern detection unit, and 27 is a result output terminal.

To operate the device, a learning instruction signal is first input through mode switching terminal 1 to put address control unit 2 into registration mode. Next, the input speech feature vector series y_1 to y_o is stored in storage unit 4 through input terminal 3, and its number of samples n is stored in storage unit 5. The standard speech information for all required standard speech is stored in standard speech temporary memory 7 through input terminal 6. The standard speech feature vector series x_1 to x_n of the word number specified by the input signal at registered word number input terminal 9 is transferred from temporary memory 7 to storage unit 8, and its number of samples m, the specific subsequence sample position T_i, and the count k+1 are stored in storage unit 10.

When a specific subsequence exists, specific subsequence corresponding position calculation unit 11 evaluates equation (1) to obtain the corresponding positions and sends the position information T_{ja} to T_{jj} to address control unit 2. The feature vectors y_{ja} to y_{ja+k} of the first corresponding-subsequence candidate and the feature vectors x_i to x_{i+k} of the specific subsequence are read from the storage locations specified by address control unit 2 and sent to distance calculation unit 12, which computes the feature distances and sends the results to addition unit 13. Addition unit 13 evaluates equation (3) and sends the distance D_{ja} to minimum value calculation unit 14. The same computation is carried out for each of the candidates whose first sample positions are T_{ja} to T_{jj}, the several sample positions around and including the position T_{je} that corresponds to the first sample position T_i of the specific subsequence x_i to x_{i+k}, and each resulting distance D_{ja} to D_{jj} is sent to minimum value calculation unit 14. When a newly input distance is smaller than the current minimum, minimum value calculation unit 14 updates the minimum with that distance sum and notifies address control unit 2 over signal line 15. Address control unit 2 then sends the position information of the current candidate in the input speech over signal line 16 to optimum matching position storage unit 17, updating the stored sample position. When all operations for one specific subsequence x_i to x_{i+k} are complete, the minimum value calculation unit sends the minimum value D_j at that time to addition unit 18 over signal line 19. If there is only one specific subsequence, the computation for specific subsequences ends here; if there are more, the above process is repeated. If there is no specific subsequence at all, the above process is skipped entirely.

Next, the similarity calculation is performed on the remaining sequences. This process is almost the same as that for the specific subsequence, except that the remaining sequence corresponding position calculation unit 20 determines the corresponding positions by linear expansion and contraction using equations (4) and (5), based on the sample position information stored in storage units 5 and 10 and the sample position information stored in storage unit 17. The feature vectors of the remaining sequences of the standard speech, x_1 to x_{i-1} and x_{i+k+1} to x_o, and of the remaining sequences of the input speech, y_1 to y_{i-1} and y_{i+k+1} to y_o, are then read out. The output of the adder 13 is sent directly to the adder 18 through signal line 21, and the minimum value calculation unit 14 does not operate. When the calculations for all remaining sequences are completed, the normalization unit 22 receives the number of samples m from storage unit 10, divides the distance sum by m, and sends the result to the registration determination unit 23.
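The remaining-sequence step can be illustrated with a short sketch. Equations (4) and (5) are not reproduced in this passage, so the linear mapping with rounding to the nearest sample below is an assumption, as are the city-block distance, the 0-based indices, the use of `j` for the matched start position in the input, and all function names.

```python
def city_block(u, v):
    """City-block distance between two feature vectors (assumed measure)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def linear_map(u, u0, u1, v0, v1):
    """Map position u in [u0, u1] linearly onto [v0, v1], rounded to a sample."""
    if u1 == u0:
        return v0
    return v0 + round((u - u0) * (v1 - v0) / (u1 - u0))


def segment_distance(standard, inp, u0, u1, v0, v1):
    """Distance sum over one remaining segment; 0 if either side is empty."""
    if u0 > u1 or v0 > v1:
        return 0
    return sum(city_block(standard[u], inp[linear_map(u, u0, u1, v0, v1)])
               for u in range(u0, u1 + 1))


def residual_distance(standard, inp, i, k, j):
    """Sum distances over the segments before and after the specific subsequence."""
    m, n = len(standard), len(inp)
    head = segment_distance(standard, inp, 0, i - 1, 0, j - 1)
    tail = segment_distance(standard, inp, i + k + 1, m - 1, j + k + 1, n - 1)
    return head + tail


def normalized_distance(specific_sum, residual_sum, m):
    """Divide the total distance by the number of standard samples m."""
    return (specific_sum + residual_sum) / m
```

No minimum search appears here: each remaining segment has exactly one linear alignment, which matches the statement that the minimum value calculation unit does not operate in this step.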

The similarity calculation described above for a single standard speech is common to both registration and recognition, as described later.

The registration determination unit 23 does not register the input speech when the distance sum from the normalization unit 22 is greater than a fixed value; when it is less than or equal to that value, the process of creating the registered speech signal begins. This process uses the contents of the optimum matching position storage unit 17, the input speech sample number storage unit 5, and the standard speech sample information storage unit 10. For the specific subsequence correspondence, the registered speech corresponding position calculation circuit 24 calculates the corresponding sample positions of the feature vectors using equations (8), (9), and (10), and the registered speech interpolation value calculation circuit 25 calculates the corresponding registered speech vector values using equation (11). For the remaining sequence correspondence, the registered speech corresponding position calculation circuit 24 calculates the sample positions of the feature vectors using equations (12), (13), (14), and (15), and the registered speech interpolation value calculation circuit 25 calculates the registered speech vector values using equation (16). The sample positions and vector values of the registered speech are stored in the standard speech temporary memory 7, at the word number address corresponding to each specific subsequence and remaining sequence, and the learning of one standard speech is complete.
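The registration decision and the interpolation of one specific subsequence can be sketched as below. Equations (8) to (16) are not reproduced in this passage, so the weighted-average forms used here (weight 1 - α or 1 - β on the standard speech, α or β on the input speech) are assumptions based on the time interpolation coefficient α and amplitude interpolation coefficient β named in the claims; the function names, the `threshold` parameter, and the 0-based indices are hypothetical.

```python
def interpolate_position(t_std, t_in, alpha):
    """Assumed form: weighted average of the two sample positions, rounded."""
    return round((1 - alpha) * t_std + alpha * t_in)


def interpolate_vector(x, y, beta):
    """Assumed form: element-wise weighted average of two feature vectors."""
    return [(1 - beta) * a + beta * b for a, b in zip(x, y)]


def register_specific(standard, inp, i, k, j, alpha, beta,
                      normalized_dist, threshold):
    """Return (leading position, vectors) of the registered subsequence.

    Returns None when the normalized distance exceeds the threshold,
    in which case the input speech is not registered.
    """
    if normalized_dist > threshold:
        return None
    h = interpolate_position(i, j, alpha)
    vectors = [interpolate_vector(standard[i + t], inp[j + t], beta)
               for t in range(k + 1)]
    return h, vectors
```

With α = β = 0 this sketch reproduces the standard speech unchanged, and with α = β = 1 it reproduces the known input speech, so the two coefficients control how far the registered pattern moves toward the speaker.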

In the recognition stage, a recognition instruction signal is input from the mode switching terminal 1 to set the address control unit 2 to recognition mode. In recognition mode, the steps up to storing the input speech feature vector sequence and storing all required standard speech are the same as in registration. The process of calculating the similarity for one standard speech and obtaining the normalized output is also the same as in registration. In recognition, however, the normalization result is sent to the optimum pattern detection unit 26. The similarity calculation is executed for all standard speech, and the distance sum to the input speech is calculated for each standard speech. The optimum pattern detection unit 26 detects the standard speech with the minimum distance and sends the result through the output terminal 27 to a control unit (not shown) or the like.
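The recognition stage reduces to selecting the stored pattern with the smallest normalized distance. The sketch below assumes a caller-supplied `score(standard, inp)` function that returns the normalized distance for one standard speech; that function, the `templates` dictionary keyed by word number, and the name `recognize` are hypothetical and introduced only for illustration.

```python
def recognize(inp, templates, score):
    """Return (word number, distance) of the closest stored pattern.

    templates maps a word number to its stored feature vector sequence;
    score(standard, inp) returns the normalized distance for one pattern.
    Returns (None, None) when templates is empty.
    """
    best_word, best_dist = None, None
    for word, standard in templates.items():
        dist = score(standard, inp)
        # Keep the pattern with the smallest distance seen so far.
        if best_dist is None or dist < best_dist:
            best_word, best_dist = word, dist
    return best_word, best_dist
```

Because the same scoring routine serves registration and recognition, only the destination of the normalized result differs between the two modes.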

As is clear from the above description, the present invention has speaker adaptability and can therefore be used as a high-performance speaker-independent speech recognition device.

[Brief Description of the Drawings]

FIG. 1 is a diagram illustrating the nonlinear correspondence and linear interpolation of feature vectors in the present invention, and FIG. 2 is a block diagram showing an example of a speech recognition device according to the present invention.

1: mode switching terminal; 2: address control unit; 3: input speech input terminal; 4: input speech feature vector sequence storage unit; 5: input speech sample number storage unit; 6: standard speech input terminal; 7: standard speech temporary memory; 8: standard speech feature vector sequence storage unit; 9: registered word number input terminal; 10: standard speech sample information storage unit; 11: specific subsequence corresponding position calculation unit; 12: distance calculation unit; 13, 18: adders; 14: minimum value calculation unit; 15: minimum value signal line; 16: optimum matching position signal line; 17: optimum matching position storage unit; 19, 21: input signal lines of adder 18; 20: remaining sequence corresponding position calculation unit; 22: normalization unit; 23: registration determination unit; 24: registered speech corresponding position calculation circuit; 25: registered speech interpolation value calculation circuit; 26: optimum pattern detection unit; 27: result output terminal.

Claims

[Scope of Claims] 1. A speech recognition device in which a feature vector sequence of a standard speech and the sample position of a specific subsequence in that sequence are used as standard speech information, a feature vector sequence of a known input speech is used as known input speech information, registered speech information consisting of a feature vector sequence and the sample position of a specific subsequence in that sequence is created on the basis of both sets of speech information, and an unknown input speech is recognized by measuring, according to a predetermined form, the degree of similarity between unknown input speech information and the registered speech information, the device comprising:
means for determining a corresponding subsequence (y_j to y_{j+k}) consisting of consecutive feature vectors in the known input speech information, equal in number to the number of feature vectors (k+1) of a specific subsequence (x_i to x_{i+k}) of the standard speech;
means for determining the number (l) of feature vectors of the registered speech from the number (m) of feature vectors of the standard speech, the number (n) of feature vectors of the known input speech, and a predetermined time interpolation coefficient (α);
means for creating a specific subsequence in the registered speech, in which the magnitudes of the feature vectors (Z_h, Z_{h+1}, ..., Z_{h+k}) are linearly proportional, with the weight of an amplitude interpolation coefficient (β), to the feature vectors (x_i, x_{i+1}, ..., x_{i+k}) of the standard speech and the feature vectors (y_j, y_{j+1}, ..., y_{j+k}) of the known input speech, and in which the sample position (T_h) of the leading feature vector is linearly proportional, with the weight of the time interpolation coefficient (α), to the sample position (T_i) of the standard speech and the sample position (T_j) of the known input speech; and
means for creating a remaining sequence (Z_{h+k+1} to Z_l) of the registered speech excluding the specific subsequence, in which the sample position (T_s) of each feature vector (Z_s) is linearly proportional to the sample position (T_u) of the remaining sequence of the standard speech and the sample position (T_v) of the remaining sequence of the known input speech, and in which the magnitude of the feature vector Z_s is linearly proportional, with the weight of the amplitude interpolation coefficient (β), to the feature vector (x_u) of the remaining sequence of the standard speech and the feature vector (y_v) of the remaining sequence of the known input speech.