JP2507308B2

JP2507308B2 - Pattern recognition learning method

Info

Publication number: JP2507308B2
Application number: JP60254093A
Authority: JP
Inventors: 博松浦; 洋一竹林
Original assignee: Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1985-11-13
Filing date: 1985-11-13
Publication date: 1996-06-12
Anticipated expiration: 2011-06-12
Also published as: JPS62114082A

Description

[Technical Field of the Invention]

The present invention relates to a highly practical pattern recognition learning method capable of recognizing, for example, input speech with high accuracy.

[Technical background of the invention and its problems]

In recent years, pattern recognition technology for speech has advanced and is now widely applied to product management in factories, various telephone services, and the like. Application to speech word processors is also under way, and demands for a wider range of recognition targets and for higher recognition performance are growing ever stronger.

In a speaker-dependent recognition device, in which the speakers to be recognized are restricted, creating the standard patterns (speech recognition dictionary) from speech uttered by the specific speaker makes it relatively easy to achieve large-vocabulary recognition and to improve recognition performance for continuously uttered speech. In other words, because the speaker is known, the speech recognition dictionary can readily be tuned to that speaker, which raises recognition performance. Specifically, when input speech is recognized by, for example, DP matching or the simple similarity method, the dictionary can be improved by taking the average of the speech patterns input for dictionary learning as the standard pattern. In the extreme case, a standard pattern can be created from a single learning pattern and used to recognize input speech.

However, when a speech pattern that deviates slightly from the learning patterns because of pattern deformation is input, its similarity to the standard pattern, for example, drops, so the recognition rate does not improve much. Increasing the number of learning patterns does improve the standard pattern obtained as their average, and the recognition rate rises accordingly. The problem of input speech patterns deviating from that average pattern nevertheless remains. Consequently, no matter how many learning patterns are added, the recognition rate cannot be expected to improve beyond a certain level.

Recognition methods that use a covariance matrix, such as the composite similarity method and quadratic discriminant functions, have been proposed for effectively recognizing input speech patterns that deviate from the average pattern. In these methods the covariance matrix reflects, for example, how the deviations from the average pattern are distributed, so increasing the number of learning patterns yields a far higher recognition rate than the simple similarity method and similar methods described above.

On the other hand, obtaining a covariance matrix that correctly reflects the distribution of deviations in the patterns to be recognized, and thus creating a high-performance recognition dictionary, has required collecting and learning an enormous number of learning patterns.

[Object of the Invention]

The present invention has been made in view of these circumstances. Its object is to provide a highly versatile pattern recognition learning method that can recognize an input pattern with high accuracy in accordance with how many learning patterns have been collected, or in accordance with the nature of the feature vector extracted from the input pattern.

[Outline of Invention]

The present invention concerns a pattern recognition device that recognizes an input pattern by collating the feature vector obtained by analyzing the input pattern with a pre-registered recognition dictionary. The number of axes of the recognition dictionary used for collation with the feature vector is controlled according to the number of learning patterns used to create the recognition dictionary, or according to the number of dimensions of the feature vector extracted from the input pattern submitted for recognition, so that recognition is carried out efficiently while recognition performance is kept sufficiently high.

[Effects of the Invention]

According to the present invention, the number of axes of the recognition dictionary used for collation with the feature vector is controlled according to the number of learning patterns used to create the recognition dictionary, or according to the number of dimensions of the feature vector extracted from the input pattern. The input pattern can therefore be recognized with high accuracy by a recognition procedure suited to, for example, the degree of learning of the recognition dictionary or the quality of the input pattern (the number of dimensions of its feature vector). As a result, even without great effort spent on dictionary learning, recognition suited to the degree of learning improves both the recognition rate and the efficiency of recognition processing. In addition, because the number of dictionary axes is set according to the number of dimensions of the feature vector extracted from the input pattern, the versatility of the recognition device is increased. These are substantial practical benefits.

[Embodiment of the Invention]

An embodiment of the method of the present invention is described below with reference to the drawings.

Input speech uttered word by word is used here as the pattern to be recognized, but the method applies equally to pattern recognition devices that recognize characters, figures, and the like. The unit of recognition may also be chosen to suit the specification: for example phonemes, syllables, or continuously uttered speech.

Fig. 1 is a schematic configuration diagram of a speech recognition device to which the method of the embodiment is applied.

Input speech entering through speech input unit 1, which consists of a microphone or the like, is passed to feature extraction unit 2, where the feature vector of the input speech pattern is extracted. Feature extraction unit 2 passes the input speech through a low-pass filter with a cutoff frequency of, for example, 5.6 kHz and then quantizes it into a 12-bit digital signal at a sampling frequency of 12 kHz. The digitized speech data is analyzed by, for example, a bank of 8 band-pass filters, then smoothed and logarithmically converted, and output as a time series of feature parameters of the input speech every 10 ms, for example.

Instead of this digital filter analysis, linear predictive analysis (LPC analysis) or the like may be used to obtain the time series of feature parameters.

From the time series of feature parameters obtained by this analysis, the feature vector of the input speech pattern is derived by, for example, detecting the speech interval and cutting out feature parameters on the basis of the detection result. Specifically, the speech interval is divided into seven equal parts, and the feature parameters (8 points along the frequency axis) at each division point (8 points along the time axis) are extracted from the time series and output as the feature vector of the input speech pattern (for example, a spoken word). In this case the feature vector of the input speech pattern is therefore a 64-dimensional vector.
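The sampling step described above can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the function name `extract_feature_vector`, the nearest-frame rounding, and the dummy frame data are assumptions introduced here, and speech interval detection is taken as already done (`start` and `end` are given).

```python
def extract_feature_vector(frames, start, end, n_points=8):
    """Sample n_points frames at the equal-division points of [start, end]
    and concatenate their channel values into one feature vector.

    frames: list of per-frame channel outputs (e.g. one frame every 10 ms)
    start, end: first and last frame index of the detected speech interval
    """
    if end <= start:
        raise ValueError("speech interval must span at least two frames")
    vector = []
    for k in range(n_points):
        # Dividing the interval into n_points - 1 equal parts yields
        # n_points division points, including both endpoints.
        # Rounding to the nearest frame is an assumption of this sketch.
        idx = start + round(k * (end - start) / (n_points - 1))
        vector.extend(frames[idx])
    return vector


# 50 dummy frames of 8 band-pass channel outputs each.
frames = [[float(t + c) for c in range(8)] for t in range(50)]
vec = extract_feature_vector(frames, start=5, end=40)
print(len(vec))  # 8 time points x 8 channels = 64
```

With `n_points=16` and 16-channel frames, the same function would produce a 256-dimensional vector.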

The feature vector of the input speech pattern obtained by feature extraction unit 2 is passed to recognition unit 3 during recognition and to learning unit 4 during learning. In a speaker-dependent word recognition device, speech patterns for learning are input before any speech is uttered for recognition. Under the control of control unit 5, learning unit 4 then creates the recognition dictionary (standard patterns) from these learning patterns. The recognition dictionary (standard patterns) created by learning unit 4 is stored in standard pattern dictionary memory 6 and used by recognition unit 3 to recognize input speech patterns.

Reference numeral 7 in the figure denotes a display unit that displays, among other things, the results of recognition of input speech patterns by recognition unit 3.

Control unit 5 controls the dictionary learning method used in learning unit 4 on the basis of the number of input speech patterns that learning unit 4 has used to create the recognition dictionary. Control unit 5 also controls the number of axes of the recognition dictionary used by recognition unit 3 in recognizing input speech patterns. It does so by looking up a preset table or the like using at least one of the following: the number of dimensions of the feature vector of the input speech extracted by feature extraction unit 2, or the number of feature vectors used by learning unit 4 to create the recognition dictionary.

Learning unit 4 creates the recognition dictionary (standard patterns) as follows. When, for example, only one learning speech pattern has been input for a given recognition target category, that pattern is taken as the first-axis recognition dictionary φ1. When the second and third learning patterns are input, they are orthogonalized against the first-axis dictionary φ1 by Schmidt orthogonalization to give the second-axis dictionary φ2 and the third-axis dictionary φ3. In the same way, each time a learning pattern is input, the M-th axis dictionary φM is created in sequence by Schmidt orthogonalization.

In this dictionary creation by Schmidt orthogonalization, when, for example, the M-th new learning pattern fM is input, the standard pattern φM of the M-th axis is calculated by orthogonalizing fM against the existing axes (the equation is not reproduced in this text).
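Because the equation itself does not appear in this text, the following is only a sketch of the textbook Gram-Schmidt step that the description suggests: subtract from the new pattern its projections onto the existing axes, then normalize the residual to unit length. The helper names `dot` and `add_axis`, the threshold `eps`, and the sample patterns are assumptions introduced for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def add_axis(axes, f, eps=1e-10):
    """Orthogonalize learning pattern f against the existing axes and
    append the normalized residual as a new axis.

    Returns True if a new axis was added, False if f lies (numerically)
    in the span of the existing axes. eps is an assumed threshold.
    """
    residual = list(f)
    for phi in axes:
        # Modified Gram-Schmidt: project the running residual, which is
        # numerically more stable than projecting the original f.
        c = dot(residual, phi)
        residual = [r - c * p for r, p in zip(residual, phi)]
    norm = math.sqrt(dot(residual, residual))
    if norm < eps:
        return False
    axes.append([r / norm for r in residual])
    return True


axes = []
for pattern in ([3.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 5.0]):
    add_axis(axes, pattern)

for phi in axes:
    print([round(x, 6) for x in phi])
```

Each learning pattern adds at most one axis, so after M linearly independent patterns the dictionary holds M mutually orthogonal unit vectors.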

The standard patterns for the first through M-th axes, obtained in this way according to the number of input speech patterns submitted for learning, are stored as the recognition dictionary in standard pattern dictionary memory 6.

In a speaker-independent word speech recognition device, input speech patterns vary more than in the speaker-dependent case, so the feature vector is extracted, for example, as a 256-dimensional vector consisting of 16 points along the time axis by 16 points along the frequency axis.

When the number of learning patterns available for creating the recognition dictionary is 1 to 10, the covariance matrix of these learning patterns is calculated and KL-expanded to obtain its eigenvalues and eigenvectors, and standard patterns (recognition dictionaries) φ1, φ2, ..., φ4 are obtained for a preset number of axes, for example the first through fourth axes. When further learning patterns arrive and their number reaches 11 to 30, the covariance matrix is calculated in the same way over all learning patterns, including those input earlier, and KL-expanded. This time standard patterns (recognition dictionaries) φ1, φ2, ..., φ6 are obtained for a new preset number of axes, for example the first through sixth axes, and the previously obtained recognition dictionary is updated.
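A minimal sketch of this step, under stated assumptions, is given below. The text does not define the covariance matrix precisely or say how the KL expansion is computed; here the mean is subtracted, and the leading eigenpairs are found by power iteration with deflation, which is a technique swapped in for illustration. The function names `covariance` and `kl_axes`, the iteration count, and the sample patterns are all hypothetical.

```python
import math


def covariance(patterns):
    """Covariance matrix of equal-length pattern vectors.
    Subtracting the mean is an assumption of this sketch."""
    n, d = len(patterns), len(patterns[0])
    mean = [sum(p[j] for p in patterns) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for p in patterns:
        dev = [p[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += dev[a] * dev[b] / n
    return cov


def kl_axes(matrix, n_axes, iters=500):
    """Leading (eigenvalue, eigenvector) pairs of a symmetric positive
    semidefinite matrix, largest eigenvalue first, by power iteration
    with deflation. The start vector must not be orthogonal to the
    eigenvector being sought."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for k in range(n_axes):
        v = [1.0 / (i + k + 1) for i in range(d)]  # arbitrary non-zero start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining matrix is numerically zero
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * m[a][b] * v[b] for a in range(d) for b in range(d))
        pairs.append((lam, v))
        for a in range(d):  # deflate: remove the component just found
            for b in range(d):
                m[a][b] -= lam * v[a] * v[b]
    return pairs


patterns = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
pairs = kl_axes(covariance(patterns), n_axes=2)
for lam, v in pairs:
    print(round(lam, 6), [round(x, 6) for x in v])
```

Raising `n_axes` from 4 to 6 to 10 as learning patterns accumulate would correspond to the stepwise dictionary growth described in the text. For 64- or 256-dimensional vectors a library eigensolver would be the practical choice.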

Alternatively, the characteristic kernel of the previously obtained recognition dictionary may be updated using the characteristic kernel of the newly input learning patterns, and the recognition dictionary updated by KL expansion of the result.

Likewise, when the number of learning patterns exceeds 30, recognition dictionaries (standard patterns) φ1, φ2, ..., φ10 for the first through tenth axes are obtained by KL expansion of the covariance matrix.

In this way, a recognition dictionary with up to a predetermined number of axes is created according to the number of learning patterns used, and is stored in standard pattern dictionary memory 6.

Recognition unit 3 uses the standard patterns φi (i = 1, 2, ..., M) stored in standard pattern dictionary memory 6 to collate them with the input speech pattern f submitted for recognition, by calculating the composite similarity S (the equation is not reproduced in this text).

Here ‖φi‖ is normalized to 1, and λi is a coefficient.
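Since the equation is missing from this text, the sketch below assumes the commonly cited form of the composite (multiple) similarity, S = Σ λi (f, φi)² / ‖f‖², summed over the first M axes with unit-norm φi. That form, the function name `composite_similarity`, and the sample values are assumptions, not a reconstruction of the patent's own formula.

```python
def composite_similarity(f, axes, coeffs):
    """Assumed form: S = sum_i coeffs[i] * (f . axes[i])**2 / ||f||**2.

    axes: unit-norm standard patterns phi_1..phi_M
    coeffs: one coefficient lambda_i per axis
    """
    f_norm_sq = sum(x * x for x in f)
    if f_norm_sq == 0.0:
        raise ValueError("input pattern must be non-zero")
    s = 0.0
    for phi, lam in zip(axes, coeffs):
        proj = sum(x * y for x, y in zip(f, phi))
        s += lam * proj * proj
    return s / f_norm_sq


axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
coeffs = [1.0, 0.5]
f = [3.0, 4.0, 0.0]

s_two_axes = composite_similarity(f, axes, coeffs)
s_one_axis = composite_similarity(f, axes[:1], coeffs[:1])
print(s_two_axes, s_one_axis)
```

Truncating `axes` and `coeffs` to the first M entries is how the number of axes used in recognition would be varied under this assumed form.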

The number of axes M of the recognition dictionary (standard patterns) used in this recognition process is controlled by control unit 5 according to the number of learning patterns used to create the recognition dictionary, or according to the number of dimensions of the feature vector extracted by feature extraction unit 2.

For example, when 1 to 10 learning patterns were used to create the recognition dictionary, only the standard patterns up to the fourth axis have been obtained in the example above, so control unit 5 instructs recognition unit 3 to calculate the similarity between the input speech feature vector and the standard patterns for the first through fourth axes. When 11 to 30 learning patterns were used, it instructs recognition unit 3 to calculate the composite similarity using the recognition dictionary up to the sixth axis, and when 31 or more were used, using the recognition dictionary up to the tenth axis.
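The table lookup in this example can be sketched as below. The thresholds and axis counts (1 to 10 patterns: 4 axes, 11 to 30: 6 axes, 31 or more: 10 axes) come from the example in the text; the names `AXIS_TABLE`, `MAX_AXES`, and `axes_for_pattern_count` are hypothetical.

```python
# (largest pattern count covered, number of axes), in ascending order.
AXIS_TABLE = [(10, 4), (30, 6)]
MAX_AXES = 10  # used once the pattern count exceeds the last table entry


def axes_for_pattern_count(n_patterns):
    """Number of dictionary axes to use for a given learning pattern count."""
    if n_patterns < 1:
        raise ValueError("at least one learning pattern is required")
    for upper, n_axes in AXIS_TABLE:
        if n_patterns <= upper:
            return n_axes
    return MAX_AXES


for n in (1, 10, 11, 30, 31):
    print(n, axes_for_pattern_count(n))
```

Keeping the thresholds in a table rather than in code makes it straightforward to hold separate tables, for example one per feature vector dimension.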

In this way, control unit 5 controls the number of dictionary axes used in calculating the composite similarity with the input speech feature vector according to the number of learning patterns used to learn (create) the recognition dictionary, and thereby eliminates wasted computation in recognition.

The number of dictionary axes used in recognition may be controlled individually for each recognition target category, or it may be unified across all categories by matching the smallest number of axes among the recognition dictionaries obtained for the categories.

Fig. 5 shows, for word categories, an example of the number of learning patterns learned to create each recognition dictionary and the number of axes of the dictionary created by that learning. In this case the number of dictionary axes differs from one recognition target category to another.

Specifically, for the category "Akita" only a recognition dictionary up to the fourth axis has been obtained, whereas for the categories "Tokyo" and "Osaka" recognition dictionaries up to the tenth axis have been obtained.

The number of dictionary axes used in recognition could be varied category by category, but this risks needlessly complicating the control. Instead, the recognition dictionary with the smallest number of axes may be identified and the composite similarity for every recognition target category calculated with that number of axes. This makes the conditions under which similarity values are evaluated uniform across categories, so recognition can be performed with high accuracy.

In this example, that means recognition is performed using only the recognition dictionaries up to the fourth axis.
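The unification described for this example can be sketched as follows. The axis counts (Akita 4, Tokyo 10, Osaka 10) are taken from the text; the function name `unify_axes` is hypothetical, and the string labels stand in for real standard pattern vectors.

```python
def unify_axes(dictionaries):
    """Truncate every category's dictionary to the smallest axis count
    found among the categories. Returns (common count, truncated dicts)."""
    common = min(len(axes) for axes in dictionaries.values())
    unified = {name: axes[:common] for name, axes in dictionaries.items()}
    return common, unified


# Placeholder axes: each entry is only a label, not a real vector.
dictionaries = {
    "Akita": [f"phi{i}" for i in range(1, 5)],
    "Tokyo": [f"phi{i}" for i in range(1, 11)],
    "Osaka": [f"phi{i}" for i in range(1, 11)],
}
common, unified = unify_axes(dictionaries)
print(common)  # 4
```

The full ten-axis dictionaries for Tokyo and Osaka stay in storage; only the axes used at recognition time are limited.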

In speech recognition, the feature vector of the vowel component and the feature vector of the consonant component of an input speech pattern are sometimes extracted separately and each recognized on its own.

For example, as shown in Fig. 2, segmentation processing unit 8 cuts out each monosyllable of the input speech from the time series of feature parameters (for example, 16 channels) extracted by feature extraction unit 2. For the vowel component, one frame of the 16-channel analysis output in the vowel portion is cut out as the feature pattern for vowel recognition. For the consonant component, the 16-channel analysis output is reduced to 8-channel feature parameters by combining adjacent channels in pairs, and 8 frames, including the transition from consonant to vowel, are cut out as the consonant pattern.

In this case, the feature vector of the vowel pattern submitted for recognition has 16 dimensions, and the feature vector of the consonant pattern has 64 dimensions.
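The two dimension counts follow from the cut-out rules: 1 frame of 16 channels for a vowel, and 8 frames of 8 paired channels for a consonant. A minimal sketch is given below. The text does not say how adjacent channels are combined, so averaging each pair is an assumption, as are the function names and the dummy data.

```python
def vowel_vector(frame16):
    """One frame of the 16-channel analysis output -> 16 dimensions."""
    if len(frame16) != 16:
        raise ValueError("expected a 16-channel frame")
    return list(frame16)


def consonant_vector(frames16):
    """8 frames of 16-channel output -> 8 frames x 8 channels = 64 dims.
    Adjacent channel pairs are averaged (an assumption of this sketch)."""
    if len(frames16) != 8:
        raise ValueError("expected 8 frames")
    vec = []
    for frame in frames16:
        vec.extend((frame[c] + frame[c + 1]) / 2 for c in range(0, 16, 2))
    return vec


# Dummy data: every frame holds channel values 0.0 .. 15.0.
frames16 = [[float(c) for c in range(16)] for _ in range(8)]
v_vec = vowel_vector(frames16[0])
c_vec = consonant_vector(frames16)
print(len(v_vec), len(c_vec))  # 16 64
```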

Control unit 5 controls learning unit 4 independently for the creation of the recognition dictionaries for vowel patterns and for consonant patterns. When a recognition dictionary for vowel patterns is created, the number of axes created is controlled according to the number of learning patterns, for example as shown in Fig. 3. When a recognition dictionary for consonant patterns is created, the number of axes created is controlled according to the number of learning patterns, for example as shown in Fig. 4. These recognition dictionaries may be created by Schmidt orthogonalization as described above, but it is preferable to KL-expand the covariance matrix of the learning patterns, obtain its eigenvalues and eigenvectors, and use these as the recognition dictionary. When the recognition dictionary is updated, a covariance matrix for unspecified speakers may, for example, be updated with the learning patterns of a specific speaker, and the updated covariance matrix KL-expanded, so that the recognition dictionary is adapted to that speaker.

When input speech is recognized, the number of dimensions of the feature vector differs depending on whether the feature pattern is a vowel or a consonant feature vector. The number of dictionary axes used in calculating the composite similarity with the feature vector is therefore controlled according to the number of dimensions of the feature vector submitted to recognition unit 3.

As a result, recognition using a dictionary with the optimum number of axes for the feature vector of the input pattern is performed efficiently and accurately.

Even when the number of dictionary axes is selected according to the number of dimensions of the feature vector, the number of axes used in recognition may additionally be controlled according to the number of learning patterns used to create the recognition dictionary. In other words, the number of dictionary axes used in recognition may be controlled on the basis of both the number of dimensions of the feature vector and the number of learning patterns used to create the dictionary.

As described above, in this device the number of dictionary axes used for collation with the feature vector of the input pattern is controlled according to the number of learning patterns used to create the recognition dictionary, or according to the number of dimensions of the feature vector of the input pattern submitted for recognition, so that recognition is performed with the optimum number of axes. Effective recognition with a dictionary of an appropriate number of axes is therefore possible without placing a burden on the user of the recognition device. In addition, whether many or few learning patterns were used to create the recognition dictionary, recognition appropriate to the amount of dictionary learning is performed.

Pattern recognition can therefore be performed simply and effectively without great effort spent collecting learning data, and in accordance with the number of dimensions of the feature vector of the input pattern to be recognized. These are substantial practical benefits.

The present invention is not limited to the embodiment described above. For example, the criteria for selecting the number of dictionary axes may be set according to the method used for dictionary learning. Here the standard patterns were created using only the user's learning patterns, but a standard pattern for unspecified speakers may instead be prepared in advance and used as the first-axis standard pattern, with the standard patterns for the second and subsequent axes created in sequence as learning patterns are input. Standard patterns that have already been created may also be updated so as to adapt them to a specific speaker. Dictionary learning may further be extended to build a high-performance recognition dictionary for unspecified speakers. Finally, although the embodiment was described using speech recognition as the example, the invention applies equally to the recognition of characters and figures. In short, the present invention can be practiced with various modifications that do not depart from its gist.

[Brief description of drawings]

Fig. 1 is a schematic configuration diagram of a speech recognition device to which a method according to one embodiment of the present invention is applied; Fig. 2 shows the schematic configuration of a device according to another embodiment; Figs. 3 and 4 each show the relationship between the number of learning patterns used to create a recognition dictionary and the number of axes of the dictionary created; and Fig. 5 shows an example of the number of learning patterns and the number of dictionary axes, which differ by recognition target category.

1: speech input unit, 2: feature extraction unit, 3: recognition unit, 4: learning unit, 5: control unit, 6: standard pattern dictionary memory, 7: display unit, 8: segmentation processing unit.

Claims

(57) [Claims]

1. A pattern recognition learning method in which an input pattern is analyzed to extract a feature vector of a predetermined number of dimensions, and the feature vector is collated with a pre-registered recognition dictionary to recognize the input pattern, characterized in that the number of axes of the recognition dictionary used for collation with the feature vector is controlled according to the number of dimensions of the feature vector submitted for collation with the recognition dictionary, or according to the number of learning patterns used to create the recognition dictionary.

2. The pattern recognition learning method according to claim 1, wherein the input pattern is recognized by the composite similarity method, and the recognition dictionary is created by KL expansion of a characteristic kernel or by Schmidt orthogonalization.