JPH0634186B2

JPH0634186B2 - Speech recognition method and apparatus thereof

Info

Publication number: JPH0634186B2
Application number: JP59093571A
Authority: JP
Inventors: 聖一中川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1984-05-10
Filing date: 1984-05-10
Publication date: 1994-05-02
Anticipated expiration: 2009-05-02
Also published as: JPS60237496A

Description

Technical Field

The present invention relates to a large-vocabulary speech recognition method using vector quantization, and to an apparatus therefor.

Prior Art

FIG. 1 is a basic circuit diagram of a speech recognition apparatus. In the figure, 1 is a microphone, 2 is an analysis unit, 3 is a changeover switch, 4 is a standard pattern unit, 5 is an input speech pattern unit, 6 is a distance calculation unit, 7 is a minimum value detection unit, and 8 is a recognition result unit; the distance calculation unit 6 and the minimum value detection unit 7 together form a pattern matching unit. In FIG. 1, the speech entering through the microphone 1 is first analyzed to extract a pattern that characterizes it. In a speaker-dependent system, the analysis result of each recognition target word spoken by that speaker is registered in advance as a standard pattern; at recognition time, the parameters of the input speech pattern are compared with the standard pattern of each recognition target word, and the closest word, that is, the one with the smallest distance, is selected. For speaker-independent recognition, standard patterns that can absorb individual differences are used.
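The selection step of FIG. 1 can be sketched in Python as follows. This is a minimal illustration rather than the circuit of the patent: the function names, the city-block frame distance, the toy data, and the assumption that all patterns have the same number of frames are simplifications introduced here.

```python
def frame_distance(a, b):
    """City-block distance between two feature vectors (an assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pattern_distance(A, B):
    """Frame-by-frame distance; assumes A and B have the same number of frames."""
    return sum(frame_distance(a, b) for a, b in zip(A, B))


def recognize(A, standard_patterns):
    """Return the registered word whose standard pattern is closest to A."""
    return min(standard_patterns,
               key=lambda word: pattern_distance(A, standard_patterns[word]))


# Toy two-channel patterns; the values are made up for illustration.
standard_patterns = {
    "ichi": [[1, 0], [2, 1], [0, 3]],
    "san": [[3, 3], [2, 2], [1, 0]],
}
print(recognize([[3, 2], [2, 2], [1, 1]], standard_patterns))
```

Real utterances differ in length, which is why the description goes on to time normalization.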

FIG. 2 shows an example of analysis using a band-pass filter (BPF) bank. It plots the time variation of the spectrum pattern obtained by analyzing the utterance "3" (/san/) with a 16-channel band-pass filter bank covering 200 to 6000 Hz (BPF analysis). One unit on the time axis is 18 ms, and a cross section at a given time is the spectrum at that time. The actual recognition processing is entirely digital: the spectral intensity values in the row at time $i$ form the feature vector $a_i = (a_{i1}, a_{i2}, a_{i3}, \ldots, a_{i8}, \ldots, a_{i16})$, and the input speech pattern (here, the pattern for "3") is $A = a_1 a_2 \ldots a_i \ldots a_I$ with $I = 32$.

The speech pattern is therefore expressed as

$$A = a_1 a_2 \ldots a_i \ldots a_I \qquad (1)$$

where $a_i$ is a quantity representing the features of the speech at time $i$ and is generally vector-valued. $A$ is the time series of these feature vectors $a_i$ ($i = 1$ to $32$ when $I = 32$), and $I$ corresponds to the length of the speech pattern $A$.

The vector $a_i$ is called a feature vector and is written

$$a_i = (a_{i1}, a_{i2}, \ldots, a_{iq}, \ldots, a_{iQ}) \qquad (2)$$

where $Q$ is the dimension of the vector; in the example of FIG. 2 it equals the number of channels of the band-pass filter bank, 16.

Similarly, the standard pattern of word $n$ is denoted $B^n$ and written

$$B^n = b_1^n b_2^n \ldots b_j^n \ldots b_{J^n}^n \qquad (3)$$

Here $b_j^n$ is the feature vector at time $j$ of the standard pattern of word $n$ and has the same dimension as the feature vector $a_i$ of the input pattern $A$. $J^n$ is the length of the standard pattern of word $n$, and $n$ is a serial number identifying the word. Denoting the recognition vocabulary of $N$ words by $\Sigma$,

$$\Sigma = \{\, n \mid n = 1, 2, \ldots, N \,\} \qquad (4)$$

When no particular word needs to be specified, the index $n$ is omitted:

$$B = b_1 b_2 \ldots b_j \ldots b_J \qquad (5)$$

$$b_j = (b_{j1}, b_{j2}, \ldots, b_{j8}, \ldots, b_{jQ}) \qquad (6)$$

In the recognition process, the input pattern $A$ is pattern-matched against the standard patterns $B^n$ of all words in the recognition vocabulary, with time normalization, and the word $n$ closest to the input pattern $A$ is found among the $N$ words.

FIG. 3 shows a mapping model for time normalization. In terms of the example above, the standard pattern $B$ of the word "3" is aligned to the time axis of the input pattern by a mapping function. This mapping function is usually written

$$j = j(i) \qquad (7)$$

and is called the warping function.

If this warping function were known, the time axis of the standard pattern $B$ could be converted by equation (7) and aligned to the time axis $i$ of the input pattern $A$. In practice the warping function is unknown, so one pattern is artificially warped until it is most similar to the other, that is, until the distance is minimized, and the warping function that achieves this is taken as the optimum.

FIG. 4 illustrates an example of the DP matching method that implements this principle. Taking the warping function $j(i)$ as the function that warps the time axis of the standard pattern $B$, the pattern $B$ is converted by $j(i)$ into the following pattern $B'$.

$$B' = b_{j(1)} b_{j(2)} \ldots b_{j(i)} \ldots b_{j(I)} \qquad (8)$$

In view of how time distortion occurs in real speech patterns, conditions such as the following are imposed on the warping function:

(a) $j(i)$ is (approximately) a monotonically increasing function;
(b) $j(i)$ is (approximately) a continuous function;
(c) $j(i)$ takes a value in the neighborhood of $i$.

Almost infinitely many warping functions satisfy these conditions. Among them, the one for which $B'$ is most similar to the input pattern $A$, that is, for which the distance is smallest, is selected. To do this, the time axis of the standard pattern $B$ is first mapped by $j(i)$ onto the $i$ axis of the input pattern $A$ to obtain the pattern $B'$; the warping function $j(i)$ that minimizes the distance between $A$ and $B'$ is the optimum warping function. The distance between the input pattern $A$ and the mapped pattern $B'$ is

$$\| A - B' \| = \sum_{i=1}^{I} \| a_i - b_{j(i)} \| \qquad (9)$$

where $\| \cdot \|$ denotes the distance between two vectors. The problem of minimizing the distance in equation (9) is defined by

$$D(A, B) = \min_{j(i)} \left[ \sum_{i=1}^{I} d(i, j(i)) \right] \qquad (10)$$

$$d(i, j) = \| a_i - b_j \| \qquad (11)$$

In general, $D(A, B)$ is called the time-normalized distance or inter-pattern distance, and $d(i, j)$, the distance between the vectors $a_i$ and $b_j$, is usually called the inter-vector distance.
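As a small sketch of the quantity being minimized, the following Python code sums the inter-vector distances $d(i, j(i))$ along one given warping function. The names, the city-block vector distance, and the 0-based list `warp` (so that `warp[i]` plays the role of $j(i)$) are assumptions made for illustration.

```python
def vector_distance(a, b):
    """City-block distance between two feature vectors (an assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def warped_distance(A, B, warp):
    """Sum of d(i, j(i)) over all input frames for one warping function.

    warp[i] is the (0-based) frame of B aligned with frame i of A.
    """
    return sum(vector_distance(A[i], B[warp[i]]) for i in range(len(A)))


A = [[0], [0], [1], [2]]   # input pattern, 4 frames
B = [[0], [1], [2]]        # standard pattern, 3 frames

print(warped_distance(A, B, [0, 0, 1, 2]))  # a warping that aligns the patterns exactly
print(warped_distance(A, B, [0, 1, 2, 2]))  # a worse warping of the same patterns
```

The time-normalized distance is the smallest such sum over all admissible warping functions; trying them one by one is impractical, which motivates the dynamic-programming formulation that follows.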

FIG. 5 abstracts the $(i, j)$ plane of FIG. 4 into a lattice; at each lattice point the inter-vector distance $d(i, j)$ for its coordinates $(i, j)$ is computed. Viewed on this plane, equation (10) amounts to searching for the optimum path from $(1, 1)$ to $(I, J)$. The paths from stage $i-1$ to stage $i$ are often restricted to the three shown in the figure. The matching window prevents extreme time distortion, and it is through this window that the three time-normalization conditions (a) to (c) above are satisfied.

Now consider optimally controlling, for each $i = 1, 2, \ldots, I$, which state $j$ to move to next, so as to minimize the evaluation function of equation (10). The initial condition is

$$g(1, 1) = d(1, 1) \qquad (12)$$

the recurrence is

$$g(i, j) = d(i, j) + \min \{\, g(i-1, j),\; g(i-1, j-1),\; g(i-1, j-2) \,\} \qquad (13)$$

and the inter-pattern distance is

$$D(A, B) = g(I, J) \qquad (14)$$

Equation (13) is evaluated by following the lattice points of FIG. 5 in the direction of increasing $(i, j)$. That is, $g(i, j)$ is the minimum sum of distances from the point $(1, 1)$ to the point $(i, j)$, and equation (13) obtains $g(i, j)$ for state $j$ at stage $i$ from the values $g(i-1, j)$, $g(i-1, j-1)$, and $g(i-1, j-2)$ already found for $j$, $j-1$, and $j-2$ at stage $i-1$.
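The lattice computation can be sketched in Python as below. It assumes the recurrence has the form g(i, j) = d(i, j) + min{g(i-1, j), g(i-1, j-1), g(i-1, j-2)} with g(1, 1) = d(1, 1), as described in the text, uses 0-based indices, and omits the matching window. The function names and the city-block distance are illustrative choices, not taken from the patent.

```python
INF = float("inf")


def vector_distance(a, b):
    """City-block distance between two feature vectors (an assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_distance(A, B):
    """Time-normalized distance D(A, B) = g(I, J), computed over the full lattice."""
    I, J = len(A), len(B)
    g = [[INF] * J for _ in range(I)]
    g[0][0] = vector_distance(A[0], B[0])            # initial condition
    for i in range(1, I):
        for j in range(J):
            # The three permitted predecessors at stage i-1.
            best = g[i - 1][j]
            if j >= 1:
                best = min(best, g[i - 1][j - 1])
            if j >= 2:
                best = min(best, g[i - 1][j - 2])
            if best < INF:                           # unreachable points stay at INF
                g[i][j] = vector_distance(A[i], B[j]) + best
    return g[I - 1][J - 1]


print(dp_distance([[0], [0], [1], [2]], [[0], [1], [2]]))
print(dp_distance([[0], [1], [3]], [[0], [2], [3]]))
```

Because every step advances $i$ by one and $j$ by at most two, a standard pattern more than roughly twice as long as the input cannot be reached, and the function then returns infinity.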

FIG. 6 is a block diagram of a processor that executes the DP (dynamic programming) matching described above. In the figure, 11 is an A memory, 12 is a B memory, 13 is a $d(i, j)$ calculation unit, 14 is a $g(i, j)$ calculation unit, 15 is a $G(j)$ memory, and 16 is a control unit. The $d(i, j)$ calculation unit 13 computes the inter-vector distance between $a_i$ and $b_j$, and the $g(i, j)$ calculation unit 14 computes the shortest distance $g(i, j)$ to the point $(i, j)$; the two operate in parallel. While $g(i, j)$, $j = 1$ to $J$, is being computed, the $G(j)$ memory 15 holds $g(i-1, j)$, $j = 1$ to $J$. The min operation detects the smaller of $g_1$ and $g_2$ and stores that value in $g$.
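The role of the G(j) memory, which holds only the previous stage, can be sketched in Python with a single array of length J. The recurrence is assumed to be g(i, j) = d(i, j) + min{g(i-1, j), g(i-1, j-1), g(i-1, j-2)}. Updating the array in place in descending order of j is a software convenience assumed here so that the stage i-1 values are still available when needed; it is not a description of how the processor of FIG. 6 schedules its operations.

```python
INF = float("inf")


def vector_distance(a, b):
    """City-block distance between two feature vectors (an assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_distance_rolling(A, B):
    """DP matching that keeps only one column G(j), like the G(j) memory."""
    J = len(B)
    G = [INF] * J
    G[0] = vector_distance(A[0], B[0])
    for i in range(1, len(A)):
        # Descending j: G[j-1] and G[j-2] still hold stage i-1 when G[j] is updated.
        for j in range(J - 1, -1, -1):
            best = G[j]
            if j >= 1:
                best = min(best, G[j - 1])   # pairwise min, smaller value kept
            if j >= 2:
                best = min(best, G[j - 2])
            G[j] = vector_distance(A[i], B[j]) + best if best < INF else INF
    return G[J - 1]


print(dp_distance_rolling([[0], [1], [3]], [[0], [2], [3]]))
```

The storage requirement drops from I × J values to J values, while the number of $d(i, j)$ evaluations is unchanged.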

With the DP matching method above, as is clear from the first term on the right-hand side of equation (13), at least $I \times J \times N$ distance computations are required if no matching window is used, where $N$ is the number of registered words.
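To give a feel for the size of this count, here is a short Python calculation. The value I = 32 comes from the example of FIG. 2; J = 32 and N = 1000 are hypothetical figures chosen only for illustration.

```python
def dp_distance_evaluations(I, J, N):
    """Number of d(i, j) evaluations without a matching window: I x J x N."""
    return I * J * N


# I = 32 as in the FIG. 2 example; J and N are assumed values.
print(dp_distance_evaluations(32, 32, 1000))
```

Each of these evaluations is itself a distance between two Q-dimensional vectors, so the cost grows further with the vector dimension.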

To reduce the amount of distance computation in the DP method, the split method, which works in units of pseudo-phonemes, has been proposed. In the split method, the distance for each frame of the input speech is computed in advance only against a finite number ($K$) of pseudo-phonemes (a codebook) and stored as a matrix, so that DP matching only needs to look up the matrix; this reduces the amount of distance computation. In the split method only the word standard patterns are vector-quantized; vector quantization is not applied to the input speech. A distance matrix is built between the analysis frames of the input speech and the pre-stored pseudo-phonemes (vectors), with the input frame number on the horizontal axis and the pseudo-phoneme (vector) number on the vertical axis. DP matching between the input speech and the standard patterns, which are stored as sequences of vector numbers, is performed by referring to this distance matrix.
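A minimal Python sketch of the split method follows. It builds the K × I distance matrix once per input utterance and then performs DP matching by table lookup only. The function names, the city-block distance, the toy codebook, and the recurrence (previous stage at j, j-1, or j-2, no matching window) are assumptions made for illustration.

```python
INF = float("inf")


def frame_distance(a, b):
    """City-block distance between two feature vectors (an assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def split_method_distance(A, codebook, template_ids):
    """DP distance between input frames A and a template of codebook numbers."""
    # Distance matrix: one row per pseudo-phoneme, one column per input frame.
    table = [[frame_distance(a, c) for a in A] for c in codebook]
    J = len(template_ids)
    G = [INF] * J
    G[0] = table[template_ids[0]][0]
    for i in range(1, len(A)):
        for j in range(J - 1, -1, -1):   # descending j keeps stage i-1 values intact
            best = min(G[max(j - 2, 0):j + 1])
            # Lookup only: no vector distance is computed inside the DP loop.
            G[j] = table[template_ids[j]][i] + best if best < INF else INF
    return G[J - 1]


codebook = [[0, 0], [5, 5], [9, 0]]      # K = 3 pseudo-phonemes (toy values)
template_ids = [0, 1, 1, 2]              # standard pattern as vector numbers
A = [[0, 1], [5, 5], [8, 0]]             # unquantized input frames
print(split_method_distance(A, codebook, template_ids))
```

In a full recognizer the matrix would be built once and shared across all N word templates, so the vector distance computations number I × K rather than I × J × N.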

As a further improvement on the split method, the double split method has been proposed, in which not only the standard patterns but also the input speech is vector-quantized.

FIG. 7 is a block diagram illustrating an example of the double split method. In the figure, 20 is an input unit, 21 is an analysis unit, 22 is a vector quantization unit, 23 is a vector number generation unit, 24 is a standard vector storage unit, 25 is an inter-vector distance matrix table, 26 is a DP matching unit, 29 is a minimum-distance word identification unit, and 30 is a recognition result output unit. In the vector quantization unit 22 the input speech is quantized by conversion to the standard feature vectors held in the standard vector storage unit 24; the vector number generation unit 23 converts the result into a vector number sequence, which is sent to the DP matching unit 26. The DP matching unit 26 performs DP matching between the vector number sequence of the input speech and the vector number sequences of the word standard patterns stored in advance in the vector storage unit 24, referring to the inter-vector distance matrix table 25 for the distances between vectors. The word with the minimum distance is determined by the word identification unit 29 and output to the recognition result output unit 30.

In more detail, the feature vector $a_i$ of the $i$-th frame of the input speech $A$ is

$$a_i = (a_{i1}, a_{i2}, \ldots, a_{iP})$$

where $i = 1, 2, \ldots, I$ ($I$: number of input frames) and $P$ is the number of dimensions of the feature parameters. The $k$-th feature vector $b_k$ of the standard vector pattern $B$ is

$$b_k = (b_{k1}, b_{k2}, \ldots, b_{kP})$$

where $k = 1, 2, \ldots, K$ ($K$: number of quantized standard vectors). These feature vectors $b_k$ are vector-quantized.

To vector-quantize the feature vector $a_i$ of the input speech, it is matched against the standard patterns in the standard vector storage unit and converted into a feature vector $\hat{a}_i$, where

$$\hat{a}_i = b_{\hat{k}}$$

and $\hat{k}$ is the value of $k$ that minimizes $|a_i - b_k|$. That is, the distances between $a_i$ and all $b_k$ are computed, and the feature vector $b_{\hat{k}}$ closest to $a_i$ is used in place of $a_i$. If the inter-frame distances between $b_k$ and $b_j$ ($k \neq j$) are computed beforehand and stored in the vector distance matrix table 25, the inter-frame distance between the feature vector of each input frame and the vector-quantized word standard pattern $b_k$ is obtained simply by referring to table 25. The standard pattern of each registered word is given as a sequence of vector numbers of the standard vectors.
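The two ingredients of the double split method, nearest-vector quantization of the input and a precomputed table of distances between standard vectors, can be sketched in Python as follows. The names, the city-block distance, and the toy codebook are assumptions for illustration only.

```python
def frame_distance(a, b):
    """City-block distance between two feature vectors (an assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def quantize(frame, codebook):
    """Return the number k of the standard vector closest to the frame."""
    return min(range(len(codebook)),
               key=lambda k: frame_distance(frame, codebook[k]))


def build_distance_table(codebook):
    """Precompute the K x K distances between all pairs of standard vectors."""
    return [[frame_distance(p, q) for q in codebook] for p in codebook]


codebook = [[0, 0], [5, 5], [9, 0]]          # K = 3 standard vectors (toy values)
table = build_distance_table(codebook)       # built once, before recognition

# Each input frame is replaced by the number of its nearest standard vector.
input_ids = [quantize(f, codebook) for f in [[0, 1], [4, 6], [8, 0]]]
print(input_ids)

# The inter-frame distance is then a lookup indexed by two vector numbers.
print(table[input_ids[1]][2])
```

Since the table depends only on the codebook, it is independent of the input utterance, unlike the matrix of the split method, which must be rebuilt for every input.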

The double split method has the following advantages:

(1) The amount of distance computation can be reduced below that of the split method.

(2) Because the distance matrix can be set in advance, it can be designed with care.

(3) By using a simple measure for the input speech, the computation of feature parameters in LPC analysis can be omitted.

It is particularly useful for speaker-dependent recognition. However, even feature vectors corresponding to the same phoneme in the same word differ greatly from speaker to speaker. For this reason, speaker-independent recognition often uses multiple standard patterns (multi-template method, KNN method, and so on), but the amount of computation and memory then grows in proportion to the number of standard patterns, which is a problem in practice.

Purpose

The present invention has been made in view of the circumstances described above, and in particular aims to improve recognition accuracy and recognition speed in a speech recognition method and apparatus based on the double split method.

Configuration

To achieve the above object, the present invention is characterized by either of the following:

(1) The input speech is quantized and the quantized input speech is indexed; the inter-vector distance matrix table of the class corresponding to a standard pattern is selected according to the class number sequence of that standard pattern; and DP matching with the input speech pattern is performed with reference to the vector number sequence of the standard pattern.

(2) The apparatus comprises: quantizing means for quantizing the input speech; indexing means for indexing the input speech quantized by the quantizing means; standard patterns that are quantized and each consist of a vector number sequence and a class number sequence obtained by classification; inter-vector distance matrix tables in which the inter-vector distances between the standard patterns and the input speech quantized by the quantizing means are divided into classes in advance according to the type of phoneme; and DP matching means that selects the inter-vector distance matrix table corresponding to the class number sequence attached to a standard pattern and performs matching using the vector number sequence of the standard pattern and the index of the input speech.

The configuration of the present invention is described below on the basis of an embodiment.

FIG. 8 is a block diagram illustrating an embodiment of the speech recognition apparatus according to the present invention. In the figure, 27 is a vector number sequence and class number sequence unit, and 28 is the set of inter-vector distance matrix tables divided by class; parts that operate in the same way as in FIG. 7 carry the same reference numerals as in FIG. 7. As shown in FIG. 9, a standard pattern consists of the vector number sequence $b_j^n$ of word $n$ and the class number sequence $c_j^n$ of word $n$, in which each frame is assigned to a class. As shown in FIG. 10, the inter-vector distance matrix table 28 holds $M$ tables, one per class; the element $d_{ij}^m$ of the matrix for class $m$ is the inter-vector distance when the input vector is pseudo-phoneme $i$ and the vector of a class-$m$ standard pattern frame is pseudo-phoneme $j$. In the DP matching unit 26, the distance matrix table of the class corresponding to the standard pattern is selected from 28 according to the class number sequence from 27, DP matching with the input pattern is performed with reference to the vector number sequence of that pattern, the word with the minimum distance is determined by the word identification unit 29, and the result is output by the recognition result output unit 30.

In other words, the inter-vector distance matrix tables used in recognition hold precomputed inter-frame distances between the quantized vectors of the standard patterns and of the input speech. During DP matching, the vector number generated (converted) for each input vector by the vector number generation unit 23 and the vector number and class number of the standard pattern, which consists of a vector number sequence and a class number sequence, are used as indices: the class number selects the corresponding inter-vector distance matrix table, the inter-vector distance registered in that table is read out, and the word (standard pattern) with the minimum distance is found.
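The class-dependent lookup described for FIG. 8 to FIG. 10 can be sketched in Python as follows. Here `tables[m][i][j]` stands for the element $d_{ij}^m$, and each standard pattern is a pair (vector number sequence, class number sequence). The table values, the two hypothetical words, the function names, and the DP recurrence (previous stage at j, j-1, or j-2, no matching window) are assumptions made for illustration; the patent does not give concrete table contents.

```python
INF = float("inf")


def class_dp_distance(input_ids, vector_seq, class_seq, tables):
    """DP matching where the class number of each template frame picks the table."""
    def cost(i, j):
        # Class number selects the table; the two vector numbers index into it.
        return tables[class_seq[j]][input_ids[i]][vector_seq[j]]

    J = len(vector_seq)
    G = [INF] * J
    G[0] = cost(0, 0)
    for i in range(1, len(input_ids)):
        for j in range(J - 1, -1, -1):   # descending j keeps stage i-1 values intact
            best = min(G[max(j - 2, 0):j + 1])
            G[j] = cost(i, j) + best if best < INF else INF
    return G[J - 1]


def recognize(input_ids, templates, tables):
    """Return the word whose standard pattern has the minimum DP distance."""
    return min(templates,
               key=lambda w: class_dp_distance(input_ids, *templates[w], tables))


# M = 2 classes, K = 2 pseudo-phonemes; class 0 penalizes a mismatch more
# heavily than class 1 (made-up values).
tables = [
    [[0, 4], [4, 0]],   # class 0
    [[0, 1], [1, 0]],   # class 1
]
templates = {
    "word_a": ([0, 1, 1], [0, 1, 1]),   # (vector numbers, class numbers)
    "word_b": ([1, 0, 0], [0, 1, 1]),
}
print(recognize([0, 1, 0], templates, tables))
```

Because the class number only changes which precomputed table is read, the cost per lattice point stays a single lookup, as in the double split method.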

Effect

According to the present invention, therefore, the input speech is encoded as a vector number sequence, the inter-vector distance matrix tables are divided into classes in advance according to the type of phoneme, and DP matching is performed after selecting the desired matrix table on the basis of the class index (given by the class number sequence), so that speech recognition can be performed more quickly and more accurately.

[Brief description of drawings]

FIG. 1 is a basic configuration diagram of a speech recognition apparatus. FIG. 2 shows an example of speech analysis. FIG. 3 shows a mapping model for time normalization. FIG. 4 illustrates time normalization by a warping function. FIG. 5 is a lattice plane for performing time normalization. FIG. 6 is a block diagram of a processor that performs DP matching. FIG. 7 is a block diagram illustrating an example of the double split method. FIG. 8 illustrates an embodiment of the speech recognition apparatus according to the present invention. FIG. 9 shows an example of the structure of a standard pattern used in practicing the invention. FIG. 10 shows an example of the inter-vector distance matrix tables used in practicing the invention.

20: input unit; 21: analysis unit; 22: vector quantization unit; 23: template (vector) unit; 24: vector sequencer; 25: inter-vector distance matrix table; 26: DP matching unit; 27: vector number sequence and class number sequence unit; 28: inter-vector distance matrix tables divided by class; 29: word identification unit; 30: recognition result output unit.

Claims

1. A speech recognition method characterized by quantizing input speech, indexing the quantized input speech, selecting, according to the class number sequence of a standard pattern, the inter-vector distance matrix table of the class corresponding to that standard pattern, and performing DP matching with the input speech pattern with reference to the vector number sequence of the standard pattern.

2. A speech recognition apparatus characterized by comprising: quantizing means for quantizing input speech; indexing means for indexing the input speech quantized by the quantizing means; a standard pattern that is quantized and consists of a vector number sequence and a class number sequence obtained by classification; an inter-vector distance matrix table in which the inter-vector distances between the standard pattern and the input speech quantized by the quantizing means are divided into classes in advance according to the type of phoneme; and DP matching means for selecting the inter-vector distance matrix table corresponding to the class number sequence attached to the standard pattern and performing matching using the vector number sequence of the standard pattern and the index of the input speech.