JPS592100A - Voice recognition system

Info

Publication number: JPS592100A
Application number: JP57112185A
Authority: JP
Inventors: 充宏斗谷; 西岡　芳樹; 岩橋　弘幸
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1982-06-28
Filing date: 1982-06-28
Publication date: 1984-01-07

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to an improvement in speech recognition methods in which the similarity between the feature pattern of the speech to be recognized and the feature patterns of a plurality of pre-registered speech types is computed in sequence to reach a recognition decision, and in particular to a speech recognition method characterized by its pattern matching scheme.

In speech recognition devices that compute, in sequence, the similarity between the feature pattern of the speech to be recognized and the feature patterns of a plurality of pre-registered speech types and then make a recognition decision, pattern matching based on dynamic programming has long been known as a powerful means of coping with variation along the time axis of an utterance.

However, with this conventional pattern matching technique, the input speech and the registered speech must each be processed with a number of frames that depends on its duration, which demands both a long processing time and a large amount of memory for the standard patterns.

The present invention has been made in view of the above points, and its object is to provide a speech recognition method that can greatly reduce both the processing time and the amount of memory needed for the standard patterns.

To achieve this object, the present invention provides a speech recognition device that makes a recognition decision by computing, in sequence, the similarity between the feature pattern of the speech to be recognized and the feature patterns of a plurality of pre-registered speech types. The device is configured so that features are extracted from both the input speech and the registered speech in analysis frames of fixed duration, the number of frames, which depends on the utterance length, is linearly expanded or compressed to a fixed number of frames, and the recognition decision is then made by pattern matching using dynamic programming.

That is, the invention is characterized in that both the input speech and the registered speech are linearly expanded or compressed to equal length, and the dynamic-programming matching is performed without any step such as restoring the original number of frames. With this configuration, the recognition processing time and the memory required to store the standard patterns are greatly reduced.

An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a schematic block diagram showing an embodiment of a speech recognition device implementing the present invention, and FIG. 2 is a flow diagram of the recognition operation of the speech recognition method of the present invention.

In FIG. 1, 1 is a microphone that picks up the speech input, 2 is an amplifier, 3 is an A/D converter that converts the input speech into a digital signal, 4 is a feature extraction circuit that extracts the feature pattern (vectors) of the input speech, 5 is a speech interval detection circuit that detects the speech interval of the input speech, 6 is a microprocessor (CPU) that performs the various computations and control operations, 7 is an input pattern memory, 8 is a standard pattern memory, 9 is a dynamic-programming matching (DP matching) circuit, and 10 is an output circuit that outputs the recognition result and the like.

One approach to speech recognition is to register in advance the feature patterns of the speech to be recognized, match each of them against the feature pattern of the input speech, and output the most similar pattern as the recognition result. In this approach, the speech is generally analyzed with a given analysis frame length and a feature vector is extracted for each frame; this operation is repeated over the length of the speech, and the resulting time series of feature vectors serves as the feature pattern of that speech.

For example, FIG. 3 shows the speech waveform a for the utterance "Mishima", together with the analysis frame width b, the frame numbers c, and the feature vector of each frame d (autocorrelation coefficients in the figure).

In FIG. 1, the uttered speech waveform a is picked up by microphone 1 and is high-frequency emphasized and amplified by amplifier 2, which has AGC. The conversion of the analog speech signal into a digital signal is performed by A/D converter 3. In speech recognition, the signal is typically sampled at 8 to 24 kHz (for example 8 kHz), and each sample value is represented as a digital signal of 6 to 12 bits (for example 8 bits).

Meanwhile, the speech interval detection circuit 5 detects the length of the input speech. The speech within this interval is analyzed by the feature extraction circuit 4 in units of analysis frames, and a feature vector is obtained for each frame. In this way a time series of feature vectors over multiple frames is formed for the input speech (step n2 in FIG. 2). For example, when autocorrelation coefficients are used as the feature vector, the signal sampled at 8 kHz and quantized to 8 bits as described above is processed with 128 samples as one frame; the autocorrelation function R(m) is computed up to order 15 with 24-bit precision, and then, for example, orders 1 to 8 are divided by the order-0 value to obtain autocorrelation coefficients with 4-bit precision, which serve as the feature parameters.
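
The per-frame feature extraction described above can be sketched as follows. This is a minimal illustration, not the circuit of the embodiment: the function names are hypothetical, and the mapping of each coefficient to a signed 4-bit integer in the range -7 to 7 is an assumption, since the text only states "4-bit precision". At 8 kHz, a 128-sample frame covers 16 ms.

```python
def autocorrelation(frame, max_lag=15):
    """Return R(0)..R(max_lag), with R(m) = sum over t of x[t] * x[t + m]."""
    n = len(frame)
    return [sum(frame[t] * frame[t + m] for t in range(n - m))
            for m in range(max_lag + 1)]


def frame_features(frame, max_lag=15, n_coeff=8, bits=4):
    """Autocorrelation coefficients R(m)/R(0) for m = 1..n_coeff, quantized.

    Assumption: coefficients lie in [-1, 1] and are scaled to signed
    integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1].
    """
    r = autocorrelation(frame, max_lag)
    if r[0] == 0:                      # silent frame: no energy to normalize by
        return [0] * n_coeff
    scale = 2 ** (bits - 1) - 1        # 7 for 4 bits
    return [round(r[m] / r[0] * scale) for m in range(1, n_coeff + 1)]


def split_frames(samples, frame_len=128):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

Applying `frame_features` to each frame returned by `split_frames` yields the time series of feature vectors that the text calls the feature pattern.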

The feature pattern of these multiple frames is converted by linear expansion or compression into a fixed number of frames, for example 16 (step n3); at registration it is stored in standard pattern memory 8, and at recognition it is stored in input pattern memory 7.
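
As a rough check of the memory claim, the figures quoted in the text (16 frames, 8 coefficients per frame, 4 bits per coefficient) fix the size of one stored pattern. The helper name and the 50-frame comparison length below are hypothetical, chosen only for illustration.

```python
def pattern_bits(frames, coeffs=8, bits=4):
    """Storage needed for one feature pattern, in bits."""
    return frames * coeffs * bits


fixed = pattern_bits(16)       # pattern normalized to 16 frames
variable = pattern_bits(50)    # hypothetical unnormalized 50-frame utterance

print(fixed // 8, "bytes per normalized pattern")
print(variable // 8, "bytes for the hypothetical 50-frame pattern")
```

Under these assumptions every registered pattern occupies the same 64 bytes regardless of how long the word took to say.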

Various methods of linear expansion and compression are conceivable; as an embodiment of the present invention, an example of normalization to 16 frames is shown below.

Suppose that the interval judged to be speech by the speech interval detection circuit 5 contains n frames. If the feature vector of each frame is denoted a(i) and the expanded or compressed feature vector is denoted b(i), the linear expansion and compression is carried out by the following procedure.

(a) The first frame is adopted unconditionally.

b(1) = a(1)

(b) The remaining (n - 1) frames are divided into 16 sections.

([ ] denotes the Gauss symbol.) That is, by steps (a) and (b) above, the leading frame of each of 15 of the 16 sections is taken as a feature vector, and together with the first frame the n frames are expanded or compressed to 16 frames.

For example, in FIG. 3 the frame numbers marked with a circle indicate the frames that are adopted when the frames are expanded or compressed to 16 frames.

FIG. 4(a) and (b) show the linear expansion and compression for speech intervals of 20 frames and 13 frames, respectively; F1 denotes the original frame numbers and F2 the frame numbers after normalization. Frames marked with an X among the frame numbers are frames that are not adopted, and when there are fewer than 16 frames the same frame may be adopted twice.
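
The normalization to 16 frames can be sketched as below. The exact index formula of the embodiment did not survive in the text, so the mapping here is an assumption: the first frame is kept, the remaining n - 1 frames are split into 15 equal sections using floor division (the Gauss symbol), and the leading frame of each section is taken. The function name `linear_normalize` is hypothetical, and the frames selected may differ from those marked in FIG. 3 and FIG. 4.

```python
def linear_normalize(frames, target=16):
    """Linearly expand or compress a list of frames to `target` frames.

    Assumed mapping: b(1) = a(1); the other target - 1 outputs are the
    leading frames of equal sections of the remaining n - 1 frames.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("need at least one frame")
    if n == 1:
        return [frames[0]] * target
    sections = target - 1
    out = [frames[0]]                        # first frame, unconditionally
    for k in range(sections):
        offset = (k * (n - 1)) // sections   # floor, i.e. the Gauss symbol
        out.append(frames[1 + offset])
    return out


# Frame numbers stand in for feature vectors, as in FIG. 4.
print(linear_normalize(list(range(1, 21))))  # 20 frames: some are dropped
print(linear_normalize(list(range(1, 14))))  # 13 frames: some are repeated
```

With 20 input frames four frames are skipped, and with 13 input frames three frames are used twice, which matches the behavior the text describes for inputs longer and shorter than 16 frames.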

After the feature pattern of n frames extracted as described above has been linearly expanded or compressed to 16 frames, recognition by matching using dynamic programming, so-called DP matching, is performed on the 16-frame feature pattern by DP matching circuit 9 (step n4), and the result is output from output device 10 (step n5).

Various recurrence formulas exist for DP matching; for example, it can be carried out with the following formula, which uses the path shown in FIG. 5. Other formulas can of course be used as well, and the present invention is not limited to the following formula.

(a) g(1, 1) = 2d(1, 1)

|i - j| ≤ r, i = 2 to 16

Here d(i, j) denotes the distance between the feature vector a(i) of the i-th frame of the input speech and the feature vector b(j) of the j-th frame of one of the standard patterns.

r denotes the size of the matching window. When the patterns have been linearly normalized to about 16 frames, r can be made as small as about 1 to 2, so the amount of computation can be reduced. FIG. 6 shows an example of the optimal path.
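
DP matching inside a matching window of size r can be sketched as below. Only the initial condition g(1, 1) = 2d(1, 1) and the window constraint |i - j| ≤ r are legible in the text; the symmetric three-way recurrence, the city-block frame distance, the normalization by the path length, and all function names are assumptions made for illustration and may differ from the path of FIG. 5.

```python
import math


def frame_distance(u, v):
    """City-block distance between two feature vectors (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(u, v))


def dp_match(a, b, r=2):
    """Accumulated DP distance between patterns a and b within |i - j| <= r."""
    n, m = len(a), len(b)
    g = [[math.inf] * m for _ in range(n)]
    g[0][0] = 2 * frame_distance(a[0], b[0])        # g(1,1) = 2 d(1,1)
    for i in range(n):
        for j in range(max(0, i - r), min(m, i + r + 1)):
            if i == 0 and j == 0:
                continue
            d = frame_distance(a[i], b[j])
            best = math.inf
            if i > 0:
                best = min(best, g[i - 1][j] + d)           # vertical step
            if j > 0:
                best = min(best, g[i][j - 1] + d)           # horizontal step
            if i > 0 and j > 0:
                best = min(best, g[i - 1][j - 1] + 2 * d)   # diagonal step
            g[i][j] = best
    return g[n - 1][m - 1] / (n + m)                # normalize by path weight


def recognize(input_pattern, templates, r=2):
    """Return the label of the registered pattern with the smallest distance."""
    return min(templates, key=lambda k: dp_match(input_pattern, templates[k], r))
```

With 16-frame patterns and r = 2, only 74 of the 256 grid cells fall inside the window (16 on the diagonal, 15 on each side at offset 1, and 14 on each side at offset 2), which is where the reduction in computation comes from.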

Recognition by DP matching as described above is thus carried out by DP matching circuit 9, and the recognition result is output from output device 10.

As described above, according to the present invention, features are extracted from both the input speech and the registered speech in analysis frames of fixed duration, the number of frames, which depends on the utterance length, is linearly expanded or compressed to a fixed number of frames, and recognition is then performed by pattern matching using dynamic programming. Compared with recognition by conventional pattern matching, the recognition processing time can therefore be greatly shortened, and the memory required to store the feature patterns can be greatly reduced.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of a speech recognition device implementing the present invention, FIG. 2 is a flow diagram of the recognition operation of the recognition method of the present invention, FIG. 3 shows an example of a speech waveform and its analysis, FIG. 4 illustrates the linear expansion and compression of the analysis frames, FIG. 5 shows an example of a path in DP matching, and FIG. 6 shows an example of an optimal path obtained by DP matching.

Claims

[Claims] 1. A speech recognition method for a speech recognition device that makes a recognition decision by computing, in sequence, the similarity between the feature pattern of the speech to be recognized and the feature patterns of a plurality of pre-registered speech types, characterized in that features are extracted from both the input speech and the registered speech in analysis frames of fixed duration, the number of frames, which depends on the utterance length, is linearly expanded or compressed to a fixed number of frames, and the recognition decision is then made by pattern matching using dynamic programming.