JPS63236000A

JPS63236000A - Voice recognition

Info

Publication number: JPS63236000A
Application number: JP62069344A
Authority: JP
Inventors: 達也木村; 泰助渡辺
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1987-03-24
Filing date: 1987-03-24
Publication date: 1988-09-30

Abstract

(57) [Abstract] This publication contains application data filed before the introduction of electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Field of Industrial Application

The present invention relates to a speech recognition method by which a machine recognizes the human voice.

Prior Art

Speech recognition technology has been actively developed and commercialized in recent years, but most of the resulting products are speaker-dependent: they recognize only the speakers who have registered their voices.

A speaker-dependent device requires the user to register the words to be recognized in advance. Unless the device is then used continuously for a long time, this registration is a heavy burden on the user. For this reason, research on easy-to-use speaker-independent recognition, which requires no voice registration, has recently become very active.

In general terms, speech recognition performs pattern matching between the input speech and the reference speech stored in a dictionary (both in parameterized form) and outputs the dictionary entry with the highest similarity as the recognition result. If the input speech and the dictionary speech were physically identical there would be no problem, but in general the same utterance varies with the speaker and with the manner of speaking. Physically, these differences appear as differences in spectral features and in temporal features. The shape of the articulatory organs (mouth, tongue, throat and so on) differs from person to person, so the spectral shape of the same word differs between speakers.

The temporal features, in turn, differ depending on whether the word is spoken quickly or slowly.

Speaker-independent recognition therefore has to normalize the spectrum and its temporal variation before comparing the input with the standard patterns.

As a method effective for speaker-independent speech recognition, the present applicants have already proposed a method that combines the time-series information of the parameters with a statistical distance measure (Niyada et al., "A simple speaker-independent speech recognition method", Proceedings of the Acoustical Society of Japan, 1-1-4, March 1986). One of the applicants has also filed an application for a technique that further improves the computational efficiency of recognition (Japanese Patent Application No. 61-272475). These are described below as examples of the prior art.

These methods use pattern matching to spot speech within noise, so that the speech is recognized and the speech interval is detected at the same time.

First, the distance measure (a statistical distance measure) used for pattern matching is described.

The input word is linearly expanded or compressed to a length of J frames. Letting $x_j$ denote the parameter vector of frame j, the input vector X is written as follows.

$$X = (x_1, x_2, \ldots, x_J)$$

Here each $x_j$ is a p-dimensional vector.
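The linear expansion or compression to J frames can be sketched as follows. The source does not give the exact mapping, so the nearest-frame rule with aligned end points and the name `linear_stretch` are assumptions made for illustration only.

```python
def linear_stretch(frames, J):
    """Map a list of n feature vectors onto exactly J frames.

    Assumption: output frame j takes the nearest input frame on a
    linear scale whose first and last points are aligned.
    """
    n = len(frames)
    if n == 0 or J <= 0:
        raise ValueError("need at least one input frame and J >= 1")
    if J == 1:
        return [frames[-1]]
    out = []
    for j in range(J):
        # Position of output frame j on the input time axis.
        pos = j * (n - 1) / (J - 1)
        out.append(frames[int(pos + 0.5)])
    return out
```

Both a long utterance (n > J) and a short one (n < J) end up with J frames, which is what lets a single standard pattern of fixed length be compared with words spoken at different speeds.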

Let the standard pattern of word $\omega_k$ (k = 1, 2, ..., K) consist of a mean vector $\mu_k$ and a covariance matrix $W_k$. The recognition result is then the word that maximizes the posterior probability $P(\omega_k \mid X)$.

By Bayes' theorem,

$$P(\omega_k \mid X) = \frac{P(\omega_k)\, P(X \mid \omega_k)}{P(X)} \tag{1}$$

The first factor on the right-hand side, $P(\omega_k)$, can be regarded as a constant. Assuming a normal distribution, the second factor is

$$P(X \mid \omega_k) = (2\pi)^{-d/2}\, |W_k|^{-1/2} \exp\left\{ -\tfrac{1}{2} (X - \mu_k)^t\, W_k^{-1}\, (X - \mu_k) \right\} \tag{2}$$

where d is the dimension of X. The denominator $P(X)$ can be regarded as a constant when the input parameters are identical, but not when different inputs are compared with one another. Here $P(X)$ is assumed to follow a normal distribution with mean $\mu_x$ and covariance matrix $W_x$:

$$P(X) = (2\pi)^{-d/2}\, |W_x|^{-1/2} \exp\left\{ -\tfrac{1}{2} (X - \mu_x)^t\, W_x^{-1}\, (X - \mu_x) \right\} \tag{3}$$

Taking the logarithm of (1), omitting the constant terms, and writing the result as $L_k$ (with the sign chosen so that a smaller $L_k$ means a better match) gives

$$L_k = (X - \mu_k)^t\, W_k^{-1}\, (X - \mu_k) - (X - \mu_x)^t\, W_x^{-1}\, (X - \mu_x) + \log \frac{|W_k|}{|W_x|} \tag{4}$$

Now all the $W_k$ and $W_x$ are replaced by a common matrix W, that is,

$$W = \frac{W_1 + W_2 + \cdots + W_K + W_x}{K + 1} \tag{5}$$

Expanding (4) then gives

$$L_k = B_k - A_k^t X \tag{6}$$

where

$$A_k = 2\left( W^{-1} \mu_k - W^{-1} \mu_x \right) \tag{7}$$

$$B_k = \mu_k^t\, W^{-1} \mu_k - \mu_x^t\, W^{-1} \mu_x \tag{8}$$

Equation (6) is a linear discriminant function that requires little computation. Equation (6) is now rewritten as follows:

$$L_k = B_k - \sum_{j=1}^{J} a_j^{(k)t}\, x_j = B_k - \sum_{j=1}^{J} d_j^{(k)} \tag{9}$$

where $A_k = (a_1^{(k)}, a_2^{(k)}, \ldots, a_J^{(k)})$ and $d_j^{(k)} = a_j^{(k)t} x_j$ is the partial similarity of frame j. Given the partial similarities, $L_k$ is obtained with only additions and a single subtraction.
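The step from (4) to (6) can be spelled out as a short worked expansion. It assumes, as is usual for a covariance matrix, that the common matrix W is symmetric; with $W_k = W_x = W$ the logarithmic term of (4) is zero.

$$
\begin{aligned}
L_k &= (X - \mu_k)^t W^{-1} (X - \mu_k) - (X - \mu_x)^t W^{-1} (X - \mu_x) \\
    &= \left( X^t W^{-1} X - 2\mu_k^t W^{-1} X + \mu_k^t W^{-1} \mu_k \right)
     - \left( X^t W^{-1} X - 2\mu_x^t W^{-1} X + \mu_x^t W^{-1} \mu_x \right) \\
    &= \underbrace{\mu_k^t W^{-1} \mu_k - \mu_x^t W^{-1} \mu_x}_{B_k}
     \; - \; \underbrace{2\left( W^{-1}\mu_k - W^{-1}\mu_x \right)^t}_{A_k^t} X
\end{aligned}
$$

The quadratic term $X^t W^{-1} X$ cancels, which is why only a term linear in X remains.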

Next, the method of spotting and recognizing speech within noise using the above distance measure is described, followed by the method of reducing the amount of computation.

Consider a sufficiently long interval that is certain to contain the speech. Various subintervals are set within it, the similarity to each word is computed for each subinterval using equation (9), and the word whose similarity is greatest over all subintervals is taken as the recognition result. Performed literally, this similarity computation is enormous. The amount of computation can be reduced greatly, however, by limiting the subinterval length according to the possible duration of each word and by sharing the partial similarities $d_j^{(k)}$ among the computations. FIG. 3 illustrates this method.

FIG. 3 shows how, when the input is matched against word k, a subinterval of length n ($n_s^{(k)} \le n \le n_e^{(k)}$) is linearly expanded or compressed to the standard pattern length J, and how the similarity is computed frame by frame with the end point fixed. The similarity is computed by equation (9) along a route that starts at a point T on QR and ends at P. All similarity computations for one frame are therefore carried out within △PQR. Since $x_j$ in equation (9) is the j-th frame component after the interval of length n has been expanded or compressed, a corresponding input frame i' exists.

Using the input vectors, $d^{(k)}$ can therefore be expressed as follows.

$$d^{(k)}(i', j) = a_j^{(k)t}\, x_{i'} \tag{10}$$

where

$$i' = i - r_n(j) + 1 \tag{11}$$

Here $r_n(j)$ is the function that relates the word length n to J under linear expansion and compression. Consequently, if the partial similarity between each input frame and $a_j^{(k)}$ has been computed in advance, equation (9) can be evaluated simply by selecting and adding the partial similarities that satisfy the relation for i'. Because △PQR moves one frame to the right at every frame, the partial similarities between $a_j^{(k)}$ and $x_i$ are computed on PS, stored in memory in the amount corresponding to △PQS, and shifted at every frame. All the required similarities are then in memory, so duplicated partial-similarity computations are avoided and the amount of computation falls. The computation can be reduced further still by using, in place of $L_k$, the value $L'_k$ given by the sequential computation of equations (12) and (13).
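The direct evaluation over all linear paths that end at a given frame can be sketched as below. This is an illustration only: the table layout `d[t][j]`, the function name `direct_best_sum` and the nearest-frame linear mapping stand in for the source's $r_n(j)$, whose exact form is not given.

```python
def direct_best_sum(d, i, J, n_min, n_max):
    """Best sum of partial similarities over linear paths ending at frame i.

    d[t][j] is the precomputed partial similarity between input frame t
    and standard-pattern frame j (both 0-based). Returns (best_sum, best_n).
    """
    best_sum, best_n = float("-inf"), None
    for n in range(n_min, n_max + 1):
        start = i - n + 1
        if start < 0:
            break  # not enough input history for this word length
        total = 0.0
        for j in range(J):
            if J > 1:
                # Assumed linear map of pattern frame j onto the n input frames.
                t = start + int(j * (n - 1) / (J - 1) + 0.5)
            else:
                t = i
            total += d[t][j]
        if total > best_sum:
            best_sum, best_n = total, n
    return best_sum, best_n
```

Every candidate length n costs J table look-ups and additions, which is the cost that the sequential computation introduced next is meant to avoid.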

$$L'_k = B_k - R_J^{(k)}(i) \tag{12}$$

$$R_j^{(k)}(i) = d^{(k)}(i, j) + \max\left( R_{j-1}^{(k)}(i - l),\; R_{j-1}^{(k)}(i - l - 1),\; \ldots,\; R_{j-1}^{(k)}(i - m) \right) \tag{13}$$

Here $R_j^{(k)}(i)$ is the partial cumulative value for pattern frame j at input frame i. These equations mean that the optimal matching path is sought not only over the set of straight lines PT within △PQR of FIG. 3 but over the whole set of monotonically increasing polygonal lines, including those straight lines, whose slopes lie in the range 1/m to 1/l. The value $L'_k$ obtained in this way is always equal to or smaller than $L_k$ and so differs from it slightly, but experiments have confirmed that this causes no practical problem and in fact gives better results. Moreover, with this method the computation and memory required are only those needed for the sequential computation of equation (13), a small fraction of what the direct computation of $L_k$ requires.
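A minimal batch sketch of the sequential computation of equations (12) and (13) follows. It assumes that the stage before the first pattern frame contributes a constant 0 and that frames before the start of the input are unavailable; the table layout and the name `recursive_scores` are hypothetical.

```python
NEG_INF = float("-inf")

def recursive_scores(d, J, l=1, m=4):
    """Return R_J(i) for every input frame i.

    d[i][j-1] is the partial similarity of input frame i with pattern
    frame j (j = 1..J). R_j(i) = d(i, j) + max over n in l..m of R_{j-1}(i-n).
    """
    T = len(d)
    # Stage 1: the preceding stage is taken as the constant 0 (assumption).
    prev = [d[i][0] for i in range(T)]
    for j in range(2, J + 1):
        cur = [NEG_INF] * T
        for i in range(T):
            cands = [prev[i - n] for n in range(l, m + 1) if i - n >= 0]
            best = max(cands, default=NEG_INF)
            if best > NEG_INF:
                cur[i] = d[i][j - 1] + best
        prev = cur
    return prev
```

The distance for a word with constant `B_k` ending at frame `i` is then `B_k - recursive_scores(d, J)[i]`. Because the maximum is taken over more paths than the straight lines alone, the accumulated similarity can only be equal or larger, which matches the statement that $L'_k$ is equal to or smaller than $L_k$.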

This method is described below with reference to FIGS. 4 and 5. FIG. 4 is its functional block diagram.

The unknown input speech signal is sampled at 8 kHz by the AD converter 41 and converted into a 12-bit digital signal. The acoustic analysis unit 42 performs LPC analysis of the input signal every 10 msec (one frame) and obtains the 10th-order linear prediction coefficients and the residual power. The feature parameter extraction unit 43 uses the linear prediction coefficients and the residual power to obtain the LPC cepstrum coefficients C1 to C5 and the power term C0 as feature parameters.
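One common way to obtain LPC cepstrum coefficients from linear prediction coefficients is the recursion sketched below. The source does not say which recursion or sign convention units 42 and 43 use, so the name `lpc_to_cepstrum` and the all-pole convention in the docstring are assumptions, and the power term C0 is not computed here.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 10
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 80 samples per frame

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of the all-pole model 1 / (1 - sum_k a_k z^-k).

    a[0] is a_1, a[1] is a_2, and so on. Analyses that define the
    predictor polynomial as 1 + sum_k a_k z^-k need the signs flipped first.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                # c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For the analysis described above, a call such as `lpc_to_cepstrum(coeffs, 5)` with the ten prediction coefficients of a frame would give C1 to C5 under this convention.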

The feature vector X of each frame is therefore

$$X^t = (C_0, C_1, \ldots, C_5) \tag{14}$$

The frame synchronization signal generator 44 generates a timing signal (frame signal) every 10 msec, and the recognition processing is performed in synchronization with this frame signal.

During each frame period, the standard pattern selection unit 45 selects the word numbers k = 1, 2, ..., K stored in the standard pattern storage unit 46 one after another.

The partial similarity calculation unit 47 calculates the partial similarity $d^{(k)}(i, j)$ between the selected standard pattern $a_j^{(k)}$ and the feature vector $x_i$ of the i-th frame.

$$d^{(k)}(i, j) = a_j^{(k)t}\, x_i \qquad (j = 1, 2, \ldots, J) \tag{15}$$

From these values $d^{(k)}(i, j)$, the similarity calculation unit 47 obtains $L'_k$ by the method described later.

The similarity comparison unit 410 compares the obtained $L'_k$ with the contents of the temporary memory 411 and records the one with the greater similarity (smaller distance) in the temporary memory 411.

In this way, starting from frame $i = i_0$, the maximum similarity $L'_1(\max)$ is first found for standard pattern k = 1 in the range $n_s^{(1)} \le n \le n_e^{(1)}$. Next, for k = 2, the value $L'_2$ found in the range $n_s^{(2)} \le n \le n_e^{(2)}$ is compared with $L'_1(\max)$ to obtain the maximum similarity so far. The same procedure is repeated up to k = K, and the maximum similarity $L'_{k'}(\max)$ and the corresponding word number k' are stored in the temporary memory 411. The procedure is then repeated for $i = i_0 + \Delta i$, and when the final frame i = I is reached, the word number $k = k_m$ remaining in the temporary memory is the recognition result. If the frame number $i = i_m$ and the word length $n = n_m$ at which the maximum similarity was obtained are also stored and updated in the temporary memory 411, the speech interval is obtained together with the recognition result. The speech interval runs from frame $i_m - n_m$ to frame $i_m$.
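The comparison loop over frames and words can be sketched as follows, with a single tuple playing the role of the temporary memory. The inputs `scores_by_word` and `bias_by_word` and the name `spot_best` are hypothetical; the sketch tracks only the word number and end frame, not the word length.

```python
def spot_best(scores_by_word, bias_by_word):
    """Find the word and end frame with the smallest distance B_k - R_J(i).

    scores_by_word[k][i] is the accumulated similarity of word k ending
    at frame i, and bias_by_word[k] is the constant B_k.
    Returns (distance, word_index, end_frame).
    """
    best = None  # contents of the "temporary memory"
    for i in range(len(scores_by_word[0])):          # frame loop
        for k, scores in enumerate(scores_by_word):  # word loop within a frame
            dist = bias_by_word[k] - scores[i]
            if best is None or dist < best[0]:
                best = (dist, k, i)                  # keep the better match
    return best
```

After the last frame, the stored word index is the recognition result and the stored frame marks the end of the detected speech interval.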

This completes the speech recognition processing.

Next, the method by which the similarity calculation unit 47 in FIG. 4 calculates $L'_k$ is described in detail, taking J = 16, l = 1 and m = 4 as an example.

FIG. 5 shows the actual computing mechanism of the similarity calculation unit for one frame.

Before operation starts, the contents of all storage elements are set to −∞. After 64 frame synchronization signals have been input, the partial phoneme sequence obtained from the unknown input speech is input. Each time one frame synchronization signal is input, a delay element transfers the signal on its input side to its output side. The figure consists of 16 identical stages (the number of stages equals J), which shows that $L'_k$ is calculated by sequential computation.

That is, letting the partial cumulative values $R_1, R_2, \ldots, R_{16}$ of the stages be the values at the positions shown in the figure, the value of the j-th stage at input frame i is

$$R_j(i) = d^{(k)}(i, j) + \max\left( R_{j-1}(i-1),\; R_{j-1}(i-2),\; R_{j-1}(i-3),\; R_{j-1}(i-4) \right) \qquad (j = 1, 2, \ldots, 16) \tag{16}$$

The value of the final stage is therefore

$$R_{16}(i) = \max_{\{\xi(j)\}} \sum_{j=1}^{16} d^{(k)}(\xi(j), j) \tag{17}$$

where $\{\xi(j)\}$ is the set of all functions that simultaneously satisfy

$$\xi(16) = i, \qquad 1 \le \xi(j) - \xi(j-1) \le 4 \quad (j = 2, 3, \ldots, 16) \tag{18}$$

Subtracting the $R_{16}$ obtained in this way from the constant $B_k$ finally gives $L'_k$. The mechanism described above thus realizes the sequential computation of equation (13) and yields the desired $L'_k$.
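The chain of identical stages with delay elements can be imitated in software by one short delay line per stage, advanced once per frame. This is a sketch under stated assumptions: the class name `StageChain` is hypothetical, the stage before the first is taken as the constant 0, and the initial wait of 64 frame signals is not modeled.

```python
from collections import deque

NEG_INF = float("-inf")

class StageChain:
    """Frame-synchronous evaluation of R_j(i) = d(i, j) + max_{1<=n<=m} R_{j-1}(i-n)."""

    def __init__(self, J=16, m=4):
        self.J = J
        # delays[j] holds the last m outputs of stage j, newest first.
        # Stage 0 is the constant 0; all other storage starts at -infinity.
        self.delays = [deque([0.0 if j == 0 else NEG_INF] * m, maxlen=m)
                       for j in range(J)]

    def step(self, d_row):
        """Consume d(i, 1..J) for one frame and return R_J(i)."""
        # Each stage adds its partial similarity to the max of its delay line.
        new = [max(self.delays[j]) + d_row[j] for j in range(self.J)]
        # Shift every delay line by one frame.
        self.delays[0].appendleft(0.0)
        for j in range(1, self.J):
            self.delays[j].appendleft(new[j - 1])
        return new[self.J - 1]
```

Each call to `step` performs J maximum operations and J additions, and the only state kept is J delay lines of length m.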

According to FIG. 5, the computation this scheme requires per frame and per word amounts to only about 16 operations that each find the maximum of four values and about 16 additions. By contrast, computing $L_k$ directly from equations (9) to (11) requires about as many comparisons as there are candidate straight lines PT within △PQR of FIG. 3, that is, the number of frames in the range of start points (about 50 in this example), plus about 16 times that number of further operations. The direct method also needs memory corresponding to △PQS, about 500 in this example. The method described later is thus considerably more advantageous in terms of computation.
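The counts quoted above can be tallied in a few lines. The figures J = 16 and about 50 start frames come from the text; treating every operation as one unit, and the name `op_counts`, are simplifying assumptions. One possible reading of "about 50" is the number of word lengths from J·l = 16 to J·m = 64, which is 49, but the source does not state this.

```python
def op_counts(J=16, start_frames=50):
    """Operations per frame and per word for the two methods (rough tally)."""
    recursive = {"max_of_four": J, "additions": J}
    direct = {"comparisons": start_frames, "other_operations": start_frames * J}
    return recursive, direct

recursive, direct = op_counts()
recursive_total = sum(recursive.values())  # 16 + 16
direct_total = sum(direct.values())        # 50 + 50 * 16
```

Under these assumptions the sequential method needs 32 operations against 850 for the direct method.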

Problems to Be Solved by the Invention

In the prior art described above, l and m in equation (13) are constant over all frames (l = 1 and m = 4 in the specific example). Although this secures a certain level of recognition performance, it leaves too much freedom in the time-axis alignment of pattern matching. The resulting mismatches are a factor that degrades recognition performance.

The object of the present invention is to solve this problem of the prior art by performing fine-grained matching and thereby improving the recognition rate.

Means for Solving the Problems

The present invention achieves the above object by the following technical means. For each word to be recognized, a sequence of parameters $P_1, P_2, \ldots, P_L$ of length L that expresses the standard pattern of its speech is created in advance, using the data of all the speech to be recognized and the information surrounding that speech. For an unknown input containing the speech to be recognized and its surrounding information, a parameter $Q_n$ (where n is the frame number) expressing the data of the frame is calculated at each unit time interval (called a frame). At each frame, the distance or similarity $R_{n,i}$ (i = 1, 2, 3, ..., L) between $P_i$ and $Q_n$ is obtained for each element $P_i$ (i = 1, 2, ..., L) of the parameter sequence. The following recurrence is then used:

$$S_{n,0} = \text{constant}$$

$$S_{n,j} = R_{n,j} + \operatorname{opt}\left( S_{n-k,\,j-1},\; S_{n-k-1,\,j-1},\; \ldots,\; S_{n-k-m,\,j-1} \right) \qquad (j = 1, 2, \ldots, L)$$

where k and m are predetermined positive constants and opt means adopting the optimal one of the values in parentheses. While k and m of the recurrence are changed dynamically with the transition of n or for each target word, $S_{n,L}$ is calculated for all words to be recognized or for all target frames n, and the word that gives the optimal value of $S_{n,L}$ is obtained as the recognition result.

Operation

With the above configuration, the present invention changes l and m of equation (13) dynamically as j changes, thereby controlling the degree of freedom of the time-axis alignment in matching and improving recognition performance.

Embodiment

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a functional block diagram for implementing the speech recognition method of an embodiment of the present invention.

The unknown input speech signal is sampled at 8 kHz by the AD converter 11 and converted into a 12-bit digital signal. The acoustic analysis unit 12 performs LPC analysis of the input signal every 10 msec (one frame) and obtains the 10th-order linear prediction coefficients and the residual power. The feature parameter extraction unit 13 uses the linear prediction coefficients and the residual power to obtain the LPC cepstrum coefficients C1 to C5 and the power term C0 as feature parameters.

The feature vector X of each frame is therefore

$$X^t = (C_0, C_1, \ldots, C_5) \tag{19}$$
C1-Cs) (19).

The frame synchronization signal generator 14 generates a timing signal (frame signal) every 10 msec, and the recognition processing is performed in synchronization with this frame signal.

During each frame period, the standard pattern selection unit 15 selects the word numbers k = 1, 2, ..., K stored in the standard pattern storage unit 16 one after another.

The partial similarity calculation unit 17 calculates the partial similarity $d^{(k)}(i, j)$ between the selected standard pattern $a_j^{(k)}$ and the feature vector $x_i$ of the i-th frame.

$$d^{(k)}(i, j) = a_j^{(k)t}\, x_i \qquad (j = 1, 2, \ldots, J)$$

From these values $d^{(k)}(i, j)$, the similarity calculation unit 17 obtains $L'_k$ not by the conventional method but by the method described later.

The similarity comparison unit 110 compares the obtained $L'_k$ with the contents of the temporary memory 111 and records the one with the greater similarity (smaller distance) in the temporary memory 111.

In this way, starting from frame $i = i_0$, the maximum similarity $L'_1(\max)$ is first found for standard pattern k = 1 in the range $n_s^{(1)} \le n \le n_e^{(1)}$. Next, for k = 2, the value $L'_2$ found in the range $n_s^{(2)} \le n \le n_e^{(2)}$ is compared with $L'_1(\max)$ to obtain the maximum similarity so far. The same procedure is repeated up to k = K, and the maximum similarity $L'_{k'}(\max)$ and the corresponding word number k' are stored in the temporary memory 111. The procedure is then repeated for $i = i_0 + \Delta i$, and when the final frame i = I is reached, the word number $k = k_m$ remaining in the temporary memory is the recognition result. If the frame number $i = i_m$ and the word length $n = n_m$ at which the maximum similarity was obtained are also stored and updated in the temporary memory 111, the speech interval is obtained together with the recognition result. The speech interval runs from frame $i_m - n_m$ to frame $i_m$.

Finally, the configuration and operation of the similarity calculation unit, the part in which the present invention is implemented, are described with reference to FIG. 2. FIG. 2 operates in basically the same way as the mechanism of FIG. 5 of the conventional example, which computes equation (13). The improvement is a newly provided mechanism that controls the degree of freedom of the time-axis alignment during pattern matching.

Before operation starts, the contents of all storage elements are set to −∞. After 64 frame synchronization signals have been input, the partial phoneme sequence obtained from the unknown input speech is input. Each time one frame synchronization signal is input, a delay element transfers the signal on its input side to its output side. Each square box labeled max in the figure is an element that detects the maximum of the numerical values input to it. Each element that receives a control signal $S_j$ and several numerical values and outputs several numerical values is a data selection element, which enables only the set of input values designated by $S_j$. With the data selection elements added to the conventional configuration, the partial cumulative values $R_1, R_2, \ldots, R_{16}$ of the stages are given, in place of equation (16), by

$$R_j(i) = d^{(k)}(i, j) + \max S_j\left\{ R_{j-1}(i-1),\; R_{j-1}(i-2),\; R_{j-1}(i-3),\; R_{j-1}(i-4) \right\} \tag{20}$$

Here the symbol $S_j$ is an operator that designates one of the subsets of the set formed by the elements in the braces, and $\max S_j\{\cdot\}$ means the maximum over the elements of the subset designated for matching stage j. Finally, $L'_k$ is obtained as

$$L'_k = B_k - R_{16} \tag{21}$$

and this value is the final similarity.
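The effect of the data selection elements can be sketched by giving each stage its own set of permitted delays. The table `allowed` below is purely hypothetical, since the source does not list which subsets $S_j$ designates for which stage or word; the stage before the first is again taken as the constant 0.

```python
NEG_INF = float("-inf")

def constrained_scores(d, J, allowed):
    """Return R_J(i) when stage j may only use the delays in allowed[j-1].

    d[i][j-1] is the partial similarity of input frame i with pattern
    frame j. allowed[j-1] is a subset of {1, 2, 3, 4}; the entry for
    stage 1 is unused because its predecessor is the constant 0.
    """
    T = len(d)
    prev = [d[i][0] for i in range(T)]
    for j in range(2, J + 1):
        cur = [NEG_INF] * T
        for i in range(T):
            # Only the delays enabled by the selector for this stage compete.
            cands = [prev[i - n] for n in allowed[j - 1] if i - n >= 0]
            best = max(cands, default=NEG_INF)
            if best > NEG_INF:
                cur[i] = d[i][j - 1] + best
        prev = cur
    return prev
```

Passing `{1, 2, 3, 4}` for every stage reproduces the conventional computation, while narrower sets remove paths from the search, so the accumulated similarity can only stay the same or decrease.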

To state the effect of this embodiment in comparison with the conventional method: the conventional method corresponds to removing the function of the symbol $S_j$ from equation (20), so that all the elements $R_{j-1}(i-n)$ (n = 1, 2, 3, 4) in the braces after max are candidates for the maximum.

When the pattern matching computation follows the conventional scheme (equation (16)), then in an extreme case the partial cumulative value $R_{j-1}(i-1)$ may be adopted at one frame j and $R_j(i-4)$ at the next frame j+1. Matching that does not correspond to reality can thus occur, an adverse effect of having too much freedom in matching. In the method of this embodiment, by contrast, the freedom in adopting partial similarities in the matching computation is controlled, as shown in equation (20), according to the word k to be recognized and the transition of the frame number j. This solves the problem of the conventional method and makes a higher recognition rate achievable.

Effects of the Invention

In short, the present invention controls the freedom in adopting partial similarities in the matching computation according to the word to be recognized and the transition of the frame number. This enables fine-grained matching and has the advantage that a higher recognition rate can be achieved.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram for implementing the speech recognition method of an embodiment of the present invention. FIG. 2 is a detailed configuration diagram of the similarity calculation unit of the embodiment. FIG. 3 is a conceptual diagram explaining the matching method of the conventional method. FIG. 4 is a functional block diagram explaining a configuration example of a speech recognition method based on the conventional method. FIG. 5 is a detailed configuration diagram of the similarity calculation unit in the configuration example based on the conventional method.

11: AD converter; 12: acoustic analysis unit; 13: feature parameter extraction unit; 14: frame synchronization signal generator; 15: standard pattern selection unit; 16: standard pattern storage unit; 17: partial similarity calculation unit; 18: interval candidate setting unit; 19: similarity calculation unit; 110: similarity comparison unit; 111: temporary memory.

Claims

[Claims]

(1) A speech recognition method characterized in that: a sequence of parameters $P_1, P_2, \ldots, P_L$ of length L expressing the standard pattern of the speech of each word to be recognized is created in advance using the data of all the speech to be recognized and the information surrounding that speech; for an unknown input containing the speech to be recognized and its surrounding information, a parameter $Q_n$ (where n is the frame number) expressing the data of the frame is calculated at each unit time interval (called a frame); at each frame, the distance or similarity $R_{n,i}$ (i = 1, 2, 3, ..., L) between $P_i$ and $Q_n$ is obtained for each element $P_i$ (i = 1, 2, ..., L) of the parameter sequence; and, using the recurrence

$$S_{n,0} = \text{constant}$$

$$S_{n,j} = R_{n,j} + \operatorname{opt}\left( S_{n-k,\,j-1},\; S_{n-k-1,\,j-1},\; \ldots,\; S_{n-k-m,\,j-1} \right) \qquad (j = 1, 2, \ldots, L)$$

(where k and m are predetermined positive constants and opt means adopting the optimal one of the values in parentheses), $S_{n,L}$ is calculated for all words to be recognized or for all target frames n while k and m of the recurrence are changed dynamically with the transition of n or for each target word, and the word giving the optimal value of $S_{n,L}$ is obtained as the recognition result.

(2) The speech recognition method according to claim 1, characterized in that the degree of similarity or distance between the characteristic parameters of the unknown input signal and the standard pattern of each speech is calculated using a statistical distance measure.

(3) The speech recognition method according to claim 2, characterized in that the statistical distance measure is any one of a measure based on posterior probability, a linear discriminant function, a quadratic discriminant function, the Mahalanobis distance, Bayes decision, and a measure based on composite similarity.