JPS6130280B2

Info

Publication number: JPS6130280B2
Application number: JP54142772A
Authority: JP
Inventors: Isamu Nose; Yorio Iio; Juhei Izawa
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1979-11-06
Filing date: 1979-11-06
Publication date: 1986-07-12
Also published as: JPS5666900A

Description

[Detailed description of the invention]

The present invention relates to a speech recognition method for a specific speaker (speaker-dependent recognition). The type of speech recognition device furthest along in commercialization is the so-called speaker-dependent word recognition device. With such a device, the speaker first utters every word to be recognized about one to ten times so that the device stores the features of that speaker's words; recognition is performed afterwards. This approach avoids one of the major factors that make recognition difficult, namely the variation in speech patterns caused by individual differences (mainly related to frequency structure).

In this specification, a word uttered in advance by the speaker is called a registered word, each individual element of the features of a registered word is called a registered feature element, and the set of registered feature elements for a registered word is called a registered feature set. The corresponding terms for a newly uttered word that is to be recognized are recognized word, recognized feature element, and recognized feature set, respectively.

FIG. 1 shows an example of such a speaker-dependent recognition device. In FIG. 1, 1 is a microphone, 2 is a preamplifier, 3 is a band-pass filter group, 4 is a rectifier and low-pass filter group, 5 is a multiplexer, 6 is an AD converter, 7 is a control unit, 8 is a multiplexer switching signal line, 9 is a control line to the AD converter, 10 and 11 are response signal lines between the control unit 7 and the computer 12, 12 is a microcomputer and its peripherals, and 13 is an identification result output line.

The device operates as follows. The speech signal converted into an electric signal by the microphone 1 is amplified by the preamplifier 2 and decomposed into a spectrum by the band-pass filter group 3. The band-pass filter group 3 typically divides the speech band from about 150 Hz to about 5 kHz into 3 to 15 bands. The outputs of the band-pass filter group 3 are averaged over time by the rectifier and low-pass filter group 4 and become the input signals of the multiplexer 5. The output of the multiplexer 5 selected by the switching signal line 8 is converted from an analog signal to a digital signal by the AD converter 6 at the period triggered via the control line 9, and is transferred to the computer 12 under the control of the response signal lines 10 and 11. This transfer period is the sampling period of the speech data, and the control unit 7 controls the timing of the AD conversion. In many devices, all outputs of the band-pass filter group 3 are sampled at a period of several ms to several tens of ms.

In most speech recognition devices, everything up to the AD converter is implemented in hardware, and the subsequent processing is executed by a minicomputer or microcomputer. Because the data input rate is relatively low, everything except the part unsuited to computer processing (the analog signal processing section) can be handled without dedicated hardware, which makes it easier to reduce size and cost.

FIG. 2 is a flowchart showing the basic configuration of the computer processing section. 14 is speech data input processing, 15 is speech extraction processing, 16 is feature extraction processing, 17 is a decision on whether the data is training data or data to be recognized, 18 is registration processing, 19 is identification processing, and 20 is result output processing. The operation is described below. The speech data input processing 14 loads the speech data into the computer's internal memory.
When a speech data input command is given in some form, this processing synchronizes with the control unit 7 and stores the output data of the AD converter 6 (hereinafter called sample data) in memory. Word recognition devices generally capture about 1.5 seconds of speech. Because the sample data also contains extraneous data before and after the utterance, the speech extraction processing 15 then detects the utterance interval. Unless the speaker is in a very noisy environment, the sample values before and after the utterance are smaller than those during it, so a simple method is to set a threshold and compare the sample values against it. Left as it is, the sample data of the detected speech interval is large and requires a device with a large storage capacity. For example, with an utterance time of 1.5 seconds, a sampling period of 10 ms, and 8-bit AD conversion, each word needs 150 bytes × the number of channels, and this much memory must be reserved in advance for every word to be identified.
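The threshold comparison and the storage estimate described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and using the sum of the channel values of a frame as the quantity compared against the threshold is a choice made for this sketch, not something the specification prescribes.

```python
from typing import List, Optional, Tuple


def detect_utterance(frames: List[List[int]], threshold: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the utterance, end exclusive.

    Each frame holds one sample per filter channel. A frame counts as speech
    when the sum of its channel values reaches the threshold (assumption of
    this sketch; any other level measure could be substituted).
    """
    active = [i for i, frame in enumerate(frames) if sum(frame) >= threshold]
    if not active:
        return None  # no frame reached the threshold
    return active[0], active[-1] + 1


def raw_bytes_per_word(duration_s: float, period_ms: float, channels: int,
                       bytes_per_sample: int = 1) -> int:
    """Memory needed to keep the unprocessed sample data of one word."""
    frames = round(duration_s * 1000 / period_ms)
    return frames * channels * bytes_per_sample


# Two-channel toy recording: quiet, speech, quiet.
frames = [[1, 0], [2, 1], [30, 25], [40, 38], [28, 31], [1, 2], [0, 1]]
print(detect_utterance(frames, threshold=20))   # (2, 5)

# 1.5 s at a 10 ms period with 8-bit samples: 150 bytes per channel.
print(raw_bytes_per_word(1.5, 10, channels=8))  # 1200
```

With 8 channels, the raw data for a single word already takes 1200 bytes, which is the storage pressure that motivates compressing the samples into features.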
To reduce this storage requirement, the sample data is usually transformed in some way so that less data has to be stored. This is the job of the feature extraction processing 16. A simple feature extraction method is linear compression of the sample data.
In this method, the extracted speech interval is divided into equal segments (usually 16 to 32), the average of each channel's data is computed within each segment, and these averages are used as the features. Because loudness varies even for the same speaker, it is better to normalize the magnitude, either per sample or at the feature level, using a quantity such as the sum of the filter outputs.

The feature registration processing 18 obtains these features for all registered words to be identified and stores them in the recognition device (in this example, in the computer memory). When registration is complete, the decision processing 17 routes execution to the identification processing 19. In short, the identification processing 19 measures, according to a fixed rule, the dissimilarity between the registered feature set of each registered word, stored in advance from utterances of the same speaker, and the recognized feature set of the newly uttered recognized word. The result output processing 20 outputs the registered word with the minimum dissimilarity (as a code or the like), or outputs a reject if the minimum dissimilarity does not satisfy a certain condition. Some methods define the identification processing in terms of similarity instead, but the two are essentially the same.

A calculation example for the identification processing 19 follows. Let N be the number of stored registered words, f_i(l, m) a registered feature element, and g(l, m) a recognized feature element. The subscript i is the number assigned to the registered word and runs from 1 to N; l is the number of the filter output and runs from 1 to the number of filters L; m is the number of the time segment and runs from 1 to the number of segments M. Taking penalty points as the dissimilarity, the total penalty S_i between the registered feature set F_i of a stored registered word and the recognized feature set G of the newly uttered recognized word is

$$S_i = \sum_{l=1}^{L} \sum_{m=1}^{M} \left| f_i(l, m) - g(l, m) \right|$$

The identification result is the registered word that attains MIN(S_1, S_2, ..., S_N), where MIN(S_1, S_2, ..., S_N) denotes selecting the smallest total penalty among S_1, S_2, ..., S_N.

This method is simple, but it has the drawback that when the vocabulary contains similar-sounding words (for example, "Nakano" and "Nagano"), the difference in penalty becomes small and the words are hard to tell apart. It is extremely difficult for a person to produce exactly the same utterance twice (in loudness, speaking rate, accent, clarity, and so on), so the penalty between two utterances of the same word turns out to be about as large as the penalty between utterances of two similar words. The object of the present invention is to overcome this drawback.
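The linear compression and the conventional total-penalty identification described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the list-of-lists layout (`features[m][l]`, segment first, channel second), and normalizing by the sum of all feature values are assumptions made for the example.

```python
from typing import List, Sequence

Frames = Sequence[Sequence[float]]  # frames[t][l]: sample t, channel l
Features = List[List[float]]        # features[m][l]: segment m, channel l


def linear_compress(frames: Frames, segments: int) -> Features:
    """Split the utterance into equal segments and average each channel."""
    n = len(frames)
    channels = len(frames[0])
    features = []
    for m in range(segments):
        start = m * n // segments
        end = max((m + 1) * n // segments, start + 1)  # keep at least one frame
        chunk = frames[start:end]
        features.append([sum(f[c] for f in chunk) / len(chunk) for c in range(channels)])
    return features


def normalize(features: Features) -> Features:
    """Scale the features so they sum to 1 (one possible loudness normalization)."""
    total = sum(sum(row) for row in features)
    if total == 0:
        return [list(row) for row in features]
    return [[v / total for v in row] for row in features]


def total_penalty(f: Features, g: Features) -> float:
    """S_i: sum over all segments and channels of |f_i(l, m) - g(l, m)|."""
    return sum(abs(a - b) for fr, gr in zip(f, g) for a, b in zip(fr, gr))


def identify(registered: Sequence[Features], g: Features) -> int:
    """Index of the registered word with the smallest total penalty."""
    scores = [total_penalty(f, g) for f in registered]
    return min(range(len(scores)), key=scores.__getitem__)


frames = [[2, 4], [4, 6], [10, 0], [12, 2]]
print(linear_compress(frames, 2))          # [[3.0, 5.0], [11.0, 1.0]]

registered = [[[8], [15], [4], [32]], [[8], [18], [7], [25]]]
g = [[8], [16], [5], [28]]
print(total_penalty(registered[0], g))     # 6
print(identify(registered, g))             # 0
```

The sketch omits the reject decision; a real device would also compare the minimum penalty against a limit before reporting a word.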
The invention weights the registered feature elements so that even words with similar features can be clearly distinguished. It is described in detail below.

In the present invention, the weighting coefficients of all corresponding registered feature element pairs in all registered feature set pairs are separately determined and stored, whether they are stored individually for each feature element pair or representatively for each group of feature element pairs. The coefficients are set so that, within a pair of similar registered words, the weighting coefficient of a registered feature element pair with a large dissimilarity is larger than that of the remaining registered feature element pairs. When recognition using the registered feature sets alone is not possible, the invention introduces the weighting coefficients and performs recognition again.

FIG. 3 is a flowchart of the first embodiment of the present invention. The first stage of the computer processing is described here. 21 is a decision on whether to process training data, calculate weights, or process data to be recognized; 22 and 27 are speech data input processing; 23 and 28 are speech extraction processing; 24 and 29 are feature extraction processing; 25 is feature registration processing; 26 is weight calculation processing; 30 is identification processing; and 31 is result output processing. The operation is described below. The decision processing 21 can be implemented in various ways depending on the device specification; for simplicity, the operator here indicates the mode with a keyboard or the like attached to the device, either every time or whenever the mode changes. Processing 22 to 25 in training data processing (which stores the speaker's registered features in advance) and processing 27 to 29 and 31 in identification data processing (which performs the actual recognition) are basically the same as in the conventional method described above, so their description is omitted.
When training data processing has finished for all registered words, the decision processing 21 switches to the weight calculation processing 26, which is described next.
As in the conventional method, let f_i(l, m) and f_j(l, m) be the registered feature elements of any two registered words, and let F_i and F_j be their registered feature sets. The difference d_ij(l, m) between individual registered feature elements and the difference D_ij between the registered feature sets of any registered word pair are defined as follows.

$$d_{ij}(l, m) = \left| f_i(l, m) - f_j(l, m) \right|, \quad i \neq j \qquad (1)$$

$$D_{ij} = \sum_{l=1}^{L} \sum_{m=1}^{M} d_{ij}(l, m)$$

For each registered feature set F_i (i = 1, 2, ..., N), D_ij is computed against every other registered feature set F_j (j = 1, 2, ..., N, j ≠ i, where N is the number of registered words), and the weighting coefficients are determined as follows.

(1) When D_ij ≥ K1 (K1 is a predetermined constant): F_i and F_j differ sufficiently as features, and the difference in penalty at identification time is expected to be large. All weighting coefficients w_ij(l, m) between F_i and F_j, that is, those of the corresponding registered feature element pairs f_i(l, m), f_j(l, m), are therefore set to 1.

(2) When D_ij < K1: for f_i(l, m) and f_j(l, m) satisfying d_ij(l, m) ≥ K2 (K2 is a predetermined constant), w_ij(l, m) = K3, where K3 is a predetermined constant weighting value with K3 > 1. For d_ij(l, m) < K2, w_ij(l, m) = 1.

The weighting coefficient w_ij(l, m) thus takes the value 1 or K3 according to the conditions above. This calculation is performed for all pairs of registered feature sets, and the resulting weighting coefficients w_ij are stored. FIGS. 4 and 5 show how the weighting coefficients are stored in memory. FIG. 4 shows the set of weighting coefficients for registered feature sets F_i and F_j as a capital W_ij so that the relationships are easy to see; the hatched cells do not actually exist as storage. FIG. 5 shows how the weighting coefficients w_ij(l, m) that make up one W_ij (l is the channel number, m is the segment number) are actually laid out in memory.

When the weight calculation processing 26 finishes in this way, control passes to identification data processing, of which only the identification processing 30 is described here. In the conventional example the total penalty S_i was defined as

$$S_i = \sum_{l=1}^{L} \sum_{m=1}^{M} \left| f_i(l, m) - g(l, m) \right|$$

In the present invention, S_i is first computed in the same conventional way. Let S_a be the smallest total penalty and S_b the next smallest. If S_b - S_a ≥ K4 (K4 is a predetermined constant), the registered word a corresponding to S_a is output as the identification result. If S_b - S_a < K4, registered words a and b are similar, so the following weighted total penalties SW_a and SW_b are computed for a and b.

$$SW_a = \sum_{l=1}^{L} \sum_{m=1}^{M} w_{ab}(l, m) \left| f_a(l, m) - g(l, m) \right|, \qquad SW_b = \sum_{l=1}^{L} \sum_{m=1}^{M} w_{ab}(l, m) \left| f_b(l, m) - g(l, m) \right|$$

Here the weighting coefficient w_ab(l, m) takes the value 1 or K3 according to the conditions above. If |SW_a - SW_b| ≥ K4 (K4 is a predetermined constant) holds this time, the registered word a or b that attains Min(SW_a, SW_b) is taken as the identification result.
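The weight calculation rules (1) and (2) above can be sketched in Python as follows. The function names, the `features[m][l]` layout, and the dictionary keyed by word-index pairs (i, j) with i < j are assumptions of this sketch; the specification only fixes the rules themselves, not how the coefficients are organized in software.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Features = Sequence[Sequence[float]]  # features[m][l]: segment m, channel l
Weights = List[List[float]]           # same shape as a feature set


def element_diffs(fi: Features, fj: Features) -> List[List[float]]:
    """d_ij(l, m) = |f_i(l, m) - f_j(l, m)| for every element pair."""
    return [[abs(a - b) for a, b in zip(ri, rj)] for ri, rj in zip(fi, fj)]


def pair_weights(fi: Features, fj: Features,
                 k1: float, k2: float, k3: float) -> Weights:
    """Weighting coefficients w_ij(l, m) for one pair of registered words."""
    d = element_diffs(fi, fj)
    d_total = sum(sum(row) for row in d)  # D_ij
    if d_total >= k1:
        # Rule (1): the words are already well separated, all weights are 1.
        return [[1.0] * len(row) for row in d]
    # Rule (2): similar words, emphasize the elements that differ by K2 or more.
    return [[float(k3) if v >= k2 else 1.0 for v in row] for row in d]


def all_weights(registered: Sequence[Features],
                k1: float, k2: float, k3: float) -> Dict[Tuple[int, int], Weights]:
    """Compute and store the weights for every pair (i, j) with i < j."""
    return {(i, j): pair_weights(registered[i], registered[j], k1, k2, k3)
            for i, j in combinations(range(len(registered)), 2)}


registered = [
    [[8], [15], [4], [32]],   # word 0
    [[8], [18], [7], [25]],   # word 1, similar to word 0 (D = 13)
    [[40], [2], [30], [5]],   # word 2, clearly different from both
]
table = all_weights(registered, k1=15, k2=5, k3=5)
print(table[(0, 1)])  # [[1.0], [1.0], [1.0], [5.0]]
print(table[(0, 2)])  # [[1.0], [1.0], [1.0], [1.0]]
```

Only the pair of similar words receives a coefficient larger than 1, and only on the element where the two registered patterns actually differ.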
If |SW_a - SW_b| ≥ K4 is not satisfied, the input is rejected (unidentifiable). In addition, to reduce identification errors, it is better to reject when the minimum penalty (minimum dissimilarity) S_a is too large or when Min(SW_a, SW_b) is too large.

An example makes this more concrete. For simplicity, let the number of filters be L = 1 and the number of segments be M = 4, let Fa and Fb be the registered feature sets of identification candidates a and b for the recognized feature set G of the recognized word, and let G, Fa, and Fb take the following values.

G = (8, 16, 5, 28)
Fa = (8, 15, 4, 32)
Fb = (8, 18, 7, 25)

With K1 = 15, K2 = 5, K3 = 5, and K4 = 3,

Dab = |8 - 8| + |15 - 18| + |4 - 7| + |32 - 25| = 13

so Dab < K1. For the set Wab of weighting coefficients between Fa and Fb, the weighting coefficient is wab(l, m) = K3 (= 5) for each registered feature element whose individual difference satisfies dab(l, m) ≥ K2 (= 5), and wab(l, m) = 1 for every other element, which gives Wab = (1, 1, 1, 5).

The total penalties Sa and Sb between the recognized feature set G and the registered feature sets Fa and Fb are

Sa = |8 - 8| + |15 - 16| + |4 - 5| + |32 - 28| = 6
Sb = |8 - 8| + |18 - 16| + |7 - 5| + |25 - 28| = 7

so on the basis of Sa and Sb, registered word a is the first candidate. However, Sb - Sa = 1 < K4 (= 3), so the total penalties SWa and SWb weighted by Wab are computed:

SWa = (0 × 1) + (1 × 1) + (1 × 1) + (4 × 5) = 22
SWb = (0 × 1) + (2 × 1) + (2 × 1) + (3 × 5) = 19

Since |SWa - SWb| = 3 ≥ K4 (= 3), registered word b is the final recognition result.

The case of two candidate registered words has been described, but three or more candidates can be handled in the same way by considering them in pairs. For example, if S_a, S_b, and S_c show similar penalties, the penalties are computed as above for the registered word pairs (a, b), (a, c), and (b, c), the average penalties SW_a, SW_b, and SW_c are taken for registered words a, b, and c, and Min(SW_a, SW_b, SW_c) is found.

For word pairs whose weighting coefficients all have the same value, the memory for storing weighting coefficients can be reduced by storing only one representative coefficient. For example, rather than referring directly to the memory that holds the coefficients, a table is first looked up using the numbers assigned to the two words; each table entry holds either the start address where the coefficients are stored or the representative coefficient, together with an indication of which of the two it holds.

As described above, the first embodiment computes the differences between the registered feature sets of the registered words and, for similar registered words, gives a large weight to the feature elements in which they clearly differ. Similar words that are hard to distinguish with the conventional uniformly weighted method can therefore be distinguished, so the vocabulary does not need to be restricted, and focusing on the points where the features differ raises the recognition rate.
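The two-stage decision of the first embodiment, including the numerical example with G, Fa, and Fb, can be reproduced with the following Python sketch. The function names and the flat-list representation (one value per (l, m) element) are assumptions made for illustration, and the additional reject on an overly large minimum penalty is left out to keep the sketch short. At least two registered words are assumed.

```python
from typing import List, Optional, Sequence

Vec = Sequence[float]  # flattened feature set, one value per (l, m)


def penalty(f: Vec, g: Vec, w: Optional[Vec] = None) -> float:
    """Total penalty; unweighted (S) when w is None, weighted (SW) otherwise."""
    if w is None:
        w = [1] * len(f)
    return sum(wi * abs(a - b) for wi, a, b in zip(w, f, g))


def weights(fa: Vec, fb: Vec, k1: float, k2: float, k3: float) -> List[float]:
    """Weighting coefficients for one registered pair, rules (1) and (2)."""
    d = [abs(a - b) for a, b in zip(fa, fb)]
    if sum(d) >= k1:
        return [1] * len(d)
    return [k3 if v >= k2 else 1 for v in d]


def recognize(registered: Sequence[Vec], g: Vec,
              k1: float, k2: float, k3: float, k4: float) -> Optional[int]:
    """Return the index of the recognized word, or None for a reject."""
    order = sorted(range(len(registered)), key=lambda i: penalty(registered[i], g))
    a, b = order[0], order[1]                  # best and second-best candidates
    sa, sb = penalty(registered[a], g), penalty(registered[b], g)
    if sb - sa >= k4:
        return a                               # clear winner in the first pass
    w = weights(registered[a], registered[b], k1, k2, k3)
    swa = penalty(registered[a], g, w)
    swb = penalty(registered[b], g, w)
    if abs(swa - swb) >= k4:
        return a if swa < swb else b           # decided by the weighted pass
    return None                                # still too close: reject


G = (8, 16, 5, 28)
Fa = (8, 15, 4, 32)
Fb = (8, 18, 7, 25)
print(penalty(Fa, G), penalty(Fb, G))            # 6 7
w = weights(Fa, Fb, 15, 5, 5)
print(w)                                         # [1, 1, 1, 5]
print(penalty(Fa, G, w), penalty(Fb, G, w))      # 22 19
print(recognize([Fa, Fb], G, 15, 5, 5, 3))       # 1, i.e. registered word b
```

The first pass alone would have chosen word a by a margin of 1; the weighted pass reverses the decision because the only element that separates the two registered patterns is the one where G lies closer to Fb.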
In the first embodiment, a different weighting coefficient can be given to every registered feature element pair. To reduce the memory needed to store the coefficients, however, a single weighting coefficient can be provided per time segment, with all channel (filter) data belonging to that segment sharing the same coefficient; the effect is still sufficient. In this case, equations (1) and (2) of the first embodiment are replaced by [Formula], and the weighting coefficient is written w_ij(l) = K3 instead of w_ij(l, m) = K3 as in the first embodiment. Weighting coefficients may also be assigned per frequency band instead of per time segment, although the effect is then somewhat reduced.

Because the present invention automatically computes the feature differences between the target words and assigns weights once feature registration of all target words uttered in advance by the speaker is complete, even similar words can be distinguished accurately, and the invention can be used in speech recognition devices.
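The memory-saving variant with one weighting coefficient per time segment might look like the following Python sketch. The formulas for this variant are not legible in the text, so the sketch rests on an explicit assumption: the difference for a segment is taken as the sum over channels of |f_i - f_j| within that segment, and it is compared with K2 in the same way as in the first embodiment. The function names and the `features[m][l]` layout are likewise hypothetical.

```python
from typing import List, Sequence

Features = Sequence[Sequence[float]]  # features[m][l]: segment m, channel l


def segment_diffs(fi: Features, fj: Features) -> List[float]:
    """Per-segment difference: sum over channels (assumption of this sketch)."""
    return [sum(abs(a - b) for a, b in zip(ri, rj)) for ri, rj in zip(fi, fj)]


def segment_weights(fi: Features, fj: Features,
                    k1: float, k2: float, k3: float) -> List[float]:
    """One weighting coefficient per time segment, shared by all channels."""
    d = segment_diffs(fi, fj)
    if sum(d) >= k1:
        return [1] * len(d)      # words already well separated
    return [k3 if v >= k2 else 1 for v in d]


def weighted_penalty(f: Features, g: Features, seg_w: Sequence[float]) -> float:
    """Weighted total penalty using the per-segment coefficients."""
    return sum(w * sum(abs(a - b) for a, b in zip(fr, gr))
               for w, fr, gr in zip(seg_w, f, g))


fi = [[8, 2], [15, 3], [4, 4], [32, 1]]
fj = [[8, 2], [18, 3], [7, 5], [25, 3]]
g = [[8, 2], [16, 3], [5, 4], [28, 2]]

w = segment_weights(fi, fj, k1=20, k2=5, k3=5)
print(w)                           # [1, 1, 1, 5]
print(weighted_penalty(fi, g, w))  # 27
```

For L channels and M segments, each word pair then needs M stored coefficients instead of L × M.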

[Brief explanation of the drawing]

FIG. 1 is a block diagram of an example speech recognition device, FIG. 2 is a flowchart of the computer processing section of a conventional speech recognition device, FIG. 3 is a flowchart of the computer processing section of an embodiment of the present invention, FIG. 4 is a schematic diagram of the weighting coefficients stored in memory, and FIG. 5 is a detailed diagram of one block thereof.

1 ... microphone, 2 ... preamplifier, 3 ... band-pass filter group, 4 ... low-pass filter group, 5 ... multiplexer, 6 ... AD converter, 7 ... control unit, 8 ... multiplexer switching signal line, 9 ... AD converter control line, 10, 11 ... interface signal lines between control unit 7 and computer 12, 12 ... small computer or microcomputer and its peripherals, 13 ... identification result output line, 14 ... speech data input processing, 15 ... speech extraction processing, 16 ... feature extraction processing, 17 ... decision between training and identification, 18 ... feature registration processing, 19 ... identification processing, 20 ... result output processing, 21 ... decision between training, weight calculation, and identification, 22, 27 ... speech data input processing, 23, 28 ... speech extraction processing, 24, 29 ... feature extraction processing, 25 ... feature registration processing, 26 ... weight calculation processing, 30 ... identification processing, 31 ... result output processing, D_ij ... difference between a registered feature set pair, d_ij(l, m) ... difference between a registered feature element pair, F_i, F_j ... registered feature sets, f_i(l, m), f_j(l, m) ... registered feature elements, g(l, m) ... recognized feature element, W_ij ... set of weighting coefficients for a registered feature set pair, w_ij(l, m) ... weighting coefficient of a registered feature element pair, S_i, SW_a, SW_b ... total penalties.

Claims

[Claims]

1. A speech recognition method in which the dissimilarity between a registered feature set, consisting of the registered feature elements of each registered word based on utterances of the same speaker, and a recognized feature set, consisting of the recognized feature elements of a recognized word, is measured according to a fixed rule and the registered word corresponding to the registered feature set with the minimum dissimilarity is output, the method comprising: weighting coefficient setting and storing means which measures, according to a fixed rule, the dissimilarity of each corresponding registered feature element pair in a registered feature set pair, measures, according to a fixed rule, the dissimilarity between the registered feature sets of the pair, and stores the weighting coefficient of each corresponding registered feature element pair for all combinations of registered feature sets, wherein, for a registered feature set pair whose dissimilarity is smaller than a first fixed value, the weighting coefficient of a registered feature element pair whose dissimilarity is larger than a second fixed value is set larger than the weighting coefficients of the remaining registered feature element pairs; a step of detecting a small number of registered words with small dissimilarity when the minimum dissimilarity between each registered feature set and the recognized feature set does not satisfy a certain condition; a step of measuring, for the detected registered words, the dissimilarity between each registered feature set, whose new registered feature elements are obtained by multiplying the registered feature elements by the weighting coefficients, and the recognized feature set, whose new recognized feature elements are obtained by multiplying the recognized feature elements by the weighting coefficients; and a step of outputting the registered word corresponding to a dissimilarity that satisfies a certain condition.