JPS6053998A

JPS6053998A - Voice recognition equipment

Info

Publication number: JPS6053998A
Application number: JP58163537A
Authority: JP
Inventors: 藤井　諭; 二矢田　勝行
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-09-05
Filing date: 1983-09-05
Publication date: 1985-03-28
Also published as: JPH042197B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Field of Industrial Application

The present invention relates to a speech recognition device that automatically recognizes speech signals uttered by a human voice.

Configuration of Conventional Examples and Their Problems

A speech recognition device that automatically recognizes speech is considered a very effective means for a person to give data and commands to computers and other machines.

Most speech recognition devices researched or announced so far adopt the pattern matching method as their operating principle.

In this method, a standard pattern is stored in advance for every word that must be recognized. The degree of match (hereinafter called similarity) between each standard pattern and the unknown input pattern is calculated, and the input is judged to be the word whose standard pattern gives the best match. Because a standard pattern must be prepared for every word to be recognized, new standard patterns have to be input and stored whenever the speaker changes. Consequently, when several hundred or more words are to be recognized, uttering and registering all of them takes time and effort, and the memory capacity needed for registration is expected to be enormous. A further drawback is that the time required to match the input pattern against the standard patterns grows as the number of words increases.
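The whole-word pattern matching described above can be sketched as a nearest-template search. This is a minimal illustration, not the method of any particular prior device: the fixed-length feature vectors, the negative Euclidean distance used as the similarity, and the example words are all assumptions, and time alignment between input and template is omitted.

```python
import math


def similarity(a, b):
    """Negative Euclidean distance: larger means a closer match (assumed measure)."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize_word(input_pattern, templates):
    """Return the word whose stored standard pattern best matches the input."""
    return max(templates, key=lambda word: similarity(input_pattern, templates[word]))


# Hypothetical standard patterns, one per word, registered by one speaker.
templates = {
    "akai": [0.9, 0.1, 0.4],
    "aoi": [0.2, 0.8, 0.5],
}

print(recognize_word([0.85, 0.2, 0.35], templates))
```

The sketch makes the drawbacks visible: `templates` needs one entry per word per speaker, and `recognize_word` compares the input against every entry, so both memory and matching time grow with the vocabulary.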

In contrast, there is a method that divides the input speech into phoneme units, recognizes it as a combination of phonemes (hereinafter called phoneme recognition), and computes its similarity to a word dictionary written in phoneme units. This method needs far less memory for the word dictionary, takes less time for pattern matching, and makes it easy to change the dictionary contents. For example, the utterance "akai" ("red") can be expressed in the very simple form AKAI by combining the three phonemes /a/, /k/, and /i/, so it is easy to handle a large vocabulary spoken by unspecified speakers.

Fig. 1 shows a block diagram of a speech recognition system based on phoneme recognition. Speech input through a microphone or the like is analyzed by acoustic analysis unit 1.

The analysis uses a bank of band-pass filters or linear predictive analysis, and yields spectral information for every frame period (about 10 ms). Phoneme discrimination unit 2 uses the spectral information from acoustic analysis unit 1 and the data in standard pattern storage unit 3 to discriminate the phoneme of each frame. The standard patterns stored in standard pattern storage unit 3 are obtained in advance, phoneme by phoneme, from the speech of many speakers. Segmentation unit 4 uses the analysis output of acoustic analysis unit 1 to detect speech sections and determine the boundaries between phonemes (hereinafter called segmentation). Phoneme recognition unit 5 uses the results of segmentation unit 4 and phoneme discrimination unit 2 to decide which phoneme each phoneme section contains. The result is a completed phoneme sequence.

Word recognition unit 6 matches this phoneme sequence against word dictionary 7, which is likewise written in phoneme sequences, and outputs the word with the highest similarity as the recognition result.
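The dictionary matching performed by word recognition unit 6 can be illustrated with a small sketch. The source does not say how similarity between phoneme sequences is computed, so the Levenshtein (edit) distance used here is an assumption, as are the dictionary entries and function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete pa
                           cur[j - 1] + 1,              # insert pb
                           prev[j - 1] + (pa != pb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


def recognize(phonemes, dictionary):
    """Return the dictionary word whose phoneme spelling is closest to the input."""
    return min(dictionary, key=lambda word: edit_distance(phonemes, dictionary[word]))


# Hypothetical word dictionary written in phoneme sequences.
dictionary = {
    "akai": ["a", "k", "a", "i"],
    "aoi": ["a", "o", "i"],
    "kuroi": ["k", "u", "r", "o", "i"],
}

# A phoneme sequence containing one recognition error (/e/ instead of /a/).
print(recognize(["a", "k", "e", "i"], dictionary))
```

Because the dictionary holds only short symbol strings, adding or changing a word means editing one entry rather than recording new speech.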

When this method is applied to unspecified speakers, the most important point is to obtain high recognition accuracy stably, regardless of speaker or environment. Achieving this must not place an excessive burden on the speaker, nor require expensive components when the method is built into a speech recognition device.

However, the speech recognition devices announced or prototyped so far have had the drawback of not adequately meeting these conditions.

As a conventional example, consider the method based on prediction residuals (see Shikano and Kohda, "Evaluation of LPC distance measures for vowel recognition in conversational speech", Journal of the Institute of Electronics and Communication Engineers, '80/5, Vol. J63-D, No. 5). In this method the maximum parameters A_ij (j = 1, 2, ..., p; p is the analysis order) of phoneme i are obtained in advance by linear predictive analysis of the speech of many speakers, and the prediction residual N_i is computed from them by an equation that is not reproduced in this text, in which S_j is the autocorrelation coefficient obtained from the unknown input speech. The prediction residual N_i is computed for each candidate phoneme and used as a distance measure, and the phoneme with the smallest N_i is taken as the discrimination result.
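Since the residual equation itself is missing from this text, the following sketch uses the textbook form of the LPC prediction residual rather than the patent's own formula: for an inverse filter with coefficients c_0..c_p and input autocorrelation S_0..S_p, the residual energy is A_0 S_0 + 2 Σ A_j S_j, where A_j is the autocorrelation of the filter coefficients. Treating the patent's A_ij as these filter autocorrelations is an assumption, and the phoneme labels and filter values are hypothetical.

```python
def filter_autocorrelation(c):
    """Autocorrelation A_0..A_p of inverse-filter coefficients c_0..c_p."""
    p = len(c) - 1
    return [sum(c[k] * c[k + j] for k in range(p + 1 - j)) for j in range(p + 1)]


def prediction_residual(A, S):
    """Residual energy of the filter (given as A) on a signal with autocorrelation S."""
    return A[0] * S[0] + 2.0 * sum(a * s for a, s in zip(A[1:], S[1:]))


def classify(S, phoneme_params):
    """Pick the phoneme whose stored parameters give the smallest residual."""
    return min(phoneme_params, key=lambda ph: prediction_residual(phoneme_params[ph], S))


# Hypothetical first-order inverse filters standing in for two phoneme models.
phoneme_params = {
    "a": filter_autocorrelation([1.0, -0.9]),
    "i": filter_autocorrelation([1.0, 0.9]),
}

S = [1.0, 0.8]  # autocorrelation of an unknown, positively correlated input
print(classify(S, phoneme_params))
```

Each phoneme is represented by a single averaged parameter set, which is the limitation the next paragraph of the description criticizes.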

However, in this method the maximum parameters A_ij, which correspond to the standard pattern of a phoneme, are mere averages. Even if a learning function were added to re-create A_ij for each user, the method could not cope with variations in utterance caused by coarticulation, and the recognition rate would remain low.

Object of the Invention

The present invention eliminates the above drawbacks. Its object is to provide a speech recognition device that can handle unspecified speakers and stably achieve high recognition accuracy without being affected by differences in speaker, environment, or words.

Structure of the Invention

To achieve the above object, the present invention comprises: an acoustic analysis unit that calculates a spectrum or similar information (hereinafter called spectral information) from a speech signal segmented phoneme by phoneme; a coefficient storage unit that stores in advance standard patterns obtained from standard speech signals of many speakers; a discrimination filter unit that obtains a filter output for each phoneme using the spectral information and the standard patterns; a word dictionary storage unit that stores a word dictionary written in similarities or phoneme sequences; an output unit that matches the similarities or phoneme sequence produced via the discrimination filter unit against the word dictionary and outputs the word with the highest similarity as the recognition result; and a learning unit that creates new standard patterns from the result of the output unit and the spectral information of the acoustic analysis unit and rewrites the contents of the coefficient storage unit accordingly.

Description of Embodiments

Fig. 2 shows one embodiment of the configuration of the speech recognition device of the present invention. The speech signal from microphone 31 is converted to 12 bits by AD converter 21 at a sampling rate of 12 kHz. Signal processing circuit 22 applies pre-emphasis and a 20 ms Hamming window, and linear predictive analysis processor 23 calculates LPC cepstrum coefficients every 10 ms. These LPC cepstrum coefficients are passed through discrimination filter 24, which calculates the filter output for each phoneme frame by frame and transfers it to main memory 27. Coefficient memory 25 stores the filter coefficients for each phoneme.
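The front-end timing above can be made concrete with a short sketch: at 12 kHz a 20 ms window is 240 samples and a 10 ms step is 120 samples. The pre-emphasis coefficient 0.97 is an assumption (the source gives no value), the function names are illustrative, and the LPC cepstrum computation that follows in the device is not shown.

```python
import math

SAMPLE_RATE = 12_000                       # 12 kHz sampling, as in the embodiment
FRAME_LEN = SAMPLE_RATE * 20 // 1000       # 20 ms Hamming window -> 240 samples
HOP_LEN = SAMPLE_RATE * 10 // 1000         # one analysis every 10 ms -> 120 samples


def pre_emphasize(samples, alpha=0.97):
    """First-order high-pass y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]


def hamming(n_points):
    """Symmetric Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def windowed_frames(samples):
    """Pre-emphasize, then cut overlapping Hamming-windowed frames."""
    window = hamming(FRAME_LEN)
    emphasized = pre_emphasize(samples)
    frames = []
    for start in range(0, len(emphasized) - FRAME_LEN + 1, HOP_LEN):
        chunk = emphasized[start:start + FRAME_LEN]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# 100 ms of a 440 Hz tone as stand-in input.
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]
frames = windowed_frames(signal)
print(len(frames), len(frames[0]))
```

With these settings consecutive frames overlap by half, so 100 ms of input (1200 samples) yields 9 frames of 240 samples each.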

Meanwhile, band-pass filter 26 calculates the band power of about three channels and the total power, and transfers them to main memory 27 as data for phoneme segmentation. Main processor 28 uses the results of discrimination filter 24 and band-pass filter 26 to detect speech sections and segment them phoneme by phoneme. It then determines, for each section, the phoneme with the highest similarity from the per-phoneme outputs of discrimination filter 24, and creates a phoneme sequence. This phoneme sequence is matched against word dictionary memory 29, which is likewise written in phoneme sequences, and the word with the highest similarity is output to output unit 30 as the recognition result.

Used as is, the device can serve unspecified speakers, but because the contents of coefficient memory 25, which correspond to the standard patterns, are fixed, recognition performance varies widely from speaker to speaker and the recognition rate can become quite low. Learning unit 32 is therefore provided to add a learning function. Learning unit 32 receives the LPC cepstrum coefficients obtained by linear predictive analysis processor 23, creates learning data by referring to the results obtained from output unit 30, recalculates the per-phoneme discriminant coefficients best suited to the speaker on the basis of a variance-covariance matrix obtained in advance, and transfers them to coefficient memory 25.

Next, the operation of the phoneme recognition device according to the present invention is described in detail with reference to Fig. 2.

First, the vowels /a/, /o/, /u/, /i/, /e/ and the nasals (denoted /N/) are cut out from a large number of word utterances spoken by many speakers and input in advance from microphone 31 through AD converter 21.

Using this speech data, signal processing circuit 22 and linear predictive analysis processor 23 perform linear predictive analysis for each 10 ms analysis interval and calculate p-dimensional LPC cepstrum coefficients. From these LPC cepstrum coefficients, the covariance matrix W over all phonemes and the mean m_i of each phoneme (i denotes the phoneme type) are obtained. The discriminant coefficients a_ij (j = 1, 2, ..., p) for phoneme i are then expressed in terms of δ_jj', the (j, j') element of the inverse matrix W^-1 of the covariance matrix W (the equation is not reproduced in this text).

For each phoneme, a_ij, m_ij', δ_jj', and m_i^t W^-1 m_i (described later) are obtained and stored in coefficient memory 25 as the standard pattern.

Next, the user is asked to utter speech whose content is known in advance (for example /a/, /i/, /u/, /e/, /o/). The LPC cepstrum coefficients for each analysis interval within the speech section are obtained by linear predictive analysis processor 23 and transferred to learning unit 32. Meanwhile, discrimination filter 24 computes similarities using the standard patterns already stored in coefficient memory 25. In discrimination filter 24, the Mahalanobis distance D_i^2 for the LPC cepstrum coefficients x of the input signal can be written out (t denotes transposition; the equation is not reproduced in this text). Because its first term does not depend on phoneme i, the similarity L_i is expressed in simplified form by equation (4), likewise not reproduced here, and the similarity is calculated with equation (4). The result is transferred to main memory 27, and main processor 28 creates the phoneme sequence. Next, a value indicating the position on the time axis of each phoneme to be learned is returned from output unit 30 to learning unit 32, and the mean of the LPC cepstrum coefficients of that phoneme is obtained. These steps are repeated as many times as necessary while varying the speech.
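The equations referred to in this passage are missing from the text. The following is a hedged reconstruction from the standard definition of the Mahalanobis distance and the symbols the description uses (x, m_i, W, δ_jj', a_ij). The placement of the factor 2 inside a_ij and the sign convention of L_i are assumptions; only the overall structure (a term independent of i, a term linear in x, and a constant term m_i^t W^-1 m_i) follows from the description.

```latex
\begin{align}
D_i^2 &= (x - m_i)^t \, W^{-1} (x - m_i) \\
      &= x^t W^{-1} x \;-\; 2\, m_i^t W^{-1} x \;+\; m_i^t W^{-1} m_i , \\
a_{ij} &= 2 \sum_{j'=1}^{p} \delta_{jj'} \, m_{ij'} , \\
L_i &= \sum_{j=1}^{p} a_{ij} \, x_j \;-\; m_i^t W^{-1} m_i
     \;=\; x^t W^{-1} x - D_i^2 .
\end{align}
```

Under this convention the first term x^t W^-1 x is the same for every phoneme, so the phoneme with the largest L_i is the one with the smallest Mahalanobis distance, and evaluating L_i needs only p multiply-adds and one stored constant per phoneme.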

The per-phoneme mean obtained in this way is given a suitable weight and added to the original mean (m_ij') used when no learning is done. This produces a new mean for each phoneme, which replaces the mean m_ij' in coefficient memory 25. Using this new mean, the discriminant coefficients a_ij and the constant term (second term) of equation (4) are corrected for each phoneme, and they are transferred to coefficient memory 25 as the new standard pattern, rewriting the stored standard pattern.
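The learning step can be sketched as follows. The source gives neither the weight nor any normalization, so the update m_new = (m_old + w * m_speaker) / (1 + w) is an assumption, as is the convention a_i = 2 W^-1 m_i with constant term m_i^t W^-1 m_i. The two-dimensional vectors and the identity W^-1 are illustrative values only.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adapt_mean(original_mean, speaker_mean, weight=1.0):
    """Add the weighted speaker mean to the original mean and renormalize (assumed form)."""
    return [(m + weight * s) / (1.0 + weight)
            for m, s in zip(original_mean, speaker_mean)]


def standard_pattern(mean, w_inv):
    """Return (a_i, c_i): linear coefficients 2 W^-1 m_i and constant m_i^t W^-1 m_i."""
    wm = mat_vec(w_inv, mean)
    return [2.0 * v for v in wm], dot(mean, wm)


def similarity(x, pattern):
    """L_i = sum_j a_ij x_j - c_i (larger is closer)."""
    coeffs, const = pattern
    return dot(coeffs, x) - const


def classify(x, patterns):
    return max(patterns, key=lambda ph: similarity(x, patterns[ph]))


w_inv = [[1.0, 0.0], [0.0, 1.0]]           # inverse covariance, fixed in advance
means = {"a": [1.0, 0.0], "i": [0.0, 1.0]}  # multi-speaker means (hypothetical)
patterns = {ph: standard_pattern(m, w_inv) for ph, m in means.items()}

frame = [0.4, 0.6]                          # this speaker's /a/ sits far from the stored mean
before = classify(frame, patterns)

means["a"] = adapt_mean(means["a"], speaker_mean=[0.4, 0.6])
patterns["a"] = standard_pattern(means["a"], w_inv)   # rewrite only what depends on m_i
after = classify(frame, patterns)
print(before, after)
```

Only the mean-dependent quantities are recomputed; W^-1 stays fixed, which is consistent with the description's later remark that the learning calculation is simple and needs no high-precision arithmetic circuit.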

Next, actual speech recognition is described. For an unknown speech signal input from microphone 31, signal processing circuit 22 and linear predictive analysis processor 23 obtain the LPC cepstrum coefficients x (x_1, x_2, ..., x_p) and transfer them to discrimination filter 24. Using the standard patterns obtained beforehand and stored in coefficient memory 25, the similarity L_i of phoneme i is calculated from equation (4).

This is done for every phoneme (i = 1, 2, ..., n; n is the number of phonemes), and the results are transferred to main memory 27.

Main processor 28 performs phoneme recognition by combining these similarities with the result of segmentation based on the output of band-pass filter 26, and creates a phoneme sequence.
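How frame-wise similarities and segment boundaries might be combined into a phoneme sequence can be sketched as below. The source only states that the most similar phoneme is chosen per section; summing the frame similarities within each section is an assumption, and the scores and boundaries are made-up values.

```python
def label_segments(frame_scores, boundaries):
    """Choose, for each (start, end) frame range, the phoneme with the largest
    summed similarity. Each range is assumed to contain at least one frame."""
    sequence = []
    for start, end in boundaries:
        totals = {}
        for scores in frame_scores[start:end]:
            for phoneme, value in scores.items():
                totals[phoneme] = totals.get(phoneme, 0.0) + value
        sequence.append(max(totals, key=totals.get))
    return sequence


# Hypothetical per-frame similarities L_i for two phonemes.
frame_scores = [
    {"a": 0.9, "i": 0.1},
    {"a": 0.8, "i": 0.3},
    {"a": 0.4, "i": 0.5},   # a frame near the boundary that favors /i/
    {"a": 0.2, "i": 0.9},
    {"a": 0.1, "i": 0.8},
]
boundaries = [(0, 3), (3, 5)]   # frame ranges from segmentation (end exclusive)

print(label_segments(frame_scores, boundaries))
```

Pooling over a section keeps a single ambiguous frame, such as the third one here, from changing the label of the whole section.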

Finally, the phoneme sequence is matched against word dictionary memory 29, and the word with the highest similarity is transferred to output unit 30 as the recognition result.

The above embodiment describes the case in which speech of known content is input before recognition and the standard patterns in coefficient memory 25 are corrected according to the result. The standard patterns in coefficient memory 25 may of course also be corrected during recognition, on the basis of the recognition results for unknown speech.

In that case there is no need to learn speech of known content beforehand, and the device can automatically follow changes in the environment, changes in the speaker's voice, and so on.

Thus, this embodiment is a speech recognition device based on phoneme recognition that has a learning function for creating, through simple learning in advance, a standard pattern of each phoneme suited to the user, and it can thereby achieve high recognition performance. The calculation needed for learning is also very simple, so new standard patterns can be created immediately without a calculation circuit of especially high arithmetic precision.

Fig. 3 compares phoneme recognition rates with and without learning for ten adult male speakers.

It shows the case where learning used all the evaluation words (34) and the case where learning used a small set of about 20 words (36). In both cases the phoneme recognition rate improves over the case without learning (33), and the effect is especially large for speakers whose recognition rates had been extremely low (NS, KS, SM, and others).

Fig. 4 shows the standard deviation of the recognition rate for each phoneme. Compared with the case without learning (41), the variation decreases both when learning uses all words (42) and when it uses a small number of words (43), which has a favorable effect on the word matching in the subsequent stage.

This embodiment has the following effects.

■ Giving the speech recognition device a learning function allows standard patterns suited to the user to be created automatically, yielding good recognition accuracy with little variation due to changes in the environment or individual differences between speakers.

■ Learning can be performed automatically by uttering a small number of speech samples before or during use, and the standard patterns can be created very simply and quickly without special equipment.

Effects of the Invention

In short, the present invention provides a speech recognition device characterized by comprising: an acoustic analysis unit that calculates a spectrum or similar information (hereinafter called spectral information) from a speech signal segmented phoneme by phoneme; a coefficient storage unit that stores in advance standard patterns obtained from standard speech signals of many speakers; a discrimination filter unit that obtains a filter output for each phoneme using the spectral information and the standard patterns; a word dictionary storage unit that stores a word dictionary written in similarities or phoneme sequences; an output unit that matches the similarities or phoneme sequence produced via the discrimination filter unit against the word dictionary and outputs the word with the highest similarity as the recognition result; and a learning unit that creates new standard patterns from the result of the output unit and the spectral information of the acoustic analysis unit and rewrites the contents of the coefficient storage unit accordingly. The device greatly reduces the variation in recognition accuracy between speakers and has the advantage that it can be used stably by unspecified speakers.

[Brief explanation of the drawing]

第１図は音素認識を基本とする従来の音声認識装置のブ
ロック図、第２図は本発明の一実施例における音声認識
装置のブロック図、第３図は本発明の音声認識装置の効
果を話者毎に示した図、第４図は本発明の音声認識装置
の効果を音素毎の標準偏差として表わした図である。２１・・・・・・ＡＤ変換器、２２・・・・・・信号処
理回路、２３・・・・・・線形予測分析プロセッサ、２
４・旧・・４’ｌＪ別フイルタ、２５・・・・・・係数
メモリ、２７・・・・・メインメモリ、２８・・・・・
・メインプロセッサ、２９・・・・・単語辞書メモリ、
３０・・・・・・出力部、３２・旧・・学習部。代理人の氏名　弁理士　中　尾　敏　男　ほか１名第４
図手続補正書特許庁長官殿 ■事件の表示昭和５８年特許願第１６３５３７号３補正をする者事件との関係　特　許　出　願　人任　所　大阪府門真市大字門真１００６番地名　称　（
５８２）松下電器産業株式会社代表者　山　下　俊　彦４代理人　〒５７１住　所　大阪府門真市大字門真１００６番地松下電器産
業株式会社内明　細　書１、発明の名称音声認識装置２、特許請求の範囲（１）音声信号からスペクトルまたはそれに類似する情
報（以下スペクトル情報と記す）を算出する音響分析部
と、多数話者からなる標準音声信号から得られた標準パ
ターンを予め格納する係数記憶部と、前記スペクトル情
報と標準パターンとを用いて音素毎の類似度をめる類似
度計算部と、類似度または音素系列で表記された単語辞
書を格納する単語辞書記憶部と、前記類似度計算部を経
て作成された類似度または音素系列を単語辞書と照合し
最も類似度の高い単語を認識結果として出方する出力部
と、前記出力部の結果と前記音響分析部のスペクトル情
報とから新しい標準パターンを作成しその結果に基づき
前記係数記憶部の内容を書き替える学習部とを具備する
ことを特徴とする音声認識装置。（功　標準パターンとしてスペクトル情報の分散共分散
行列および平均値を少なくとも含むことを特徴とする特
許請求の範囲第１項記載の音声認識装置。（′４　人力音声の認識結果に基づいて係数記憶部の内
容を修正するようにしたことを特徴とする特許請求の範
囲第１項記載の音声認識装置。３、発明の詳細な説明産業上の利用分野本発明は人間の声によって発声された音声信号を自動的
に認識するだめの、音声認識装置に関するものである。従来例の構成とその問題点音声を自動的に認識する音声認識装置は人間から電子計
算機や各種機械へデータや命令を与える手段として非常
に有効と考えられる。従来研究あるいは発表されている音声認識装置の動作原
理としてはバタンマツチング法が多く採用されている。この方法は認識される必要がある全種類の単語に対して
標準パターンをあらかじめ記憶しておき、入力される未
知の入力バタンと比較することによって一致の度合（以
下類似度と呼ぶ）を計算し、最大一致が得られる標準バ
タンと同一の単語であると判定するものである。このバ
タンマツチング法では認識されるべき全ての単語に対し
て標準バタンを用意しなければならないため、発声者が
変った場合には新しく標準バタンを人力して記憶させる
必要がある。したがって数百種類以上の単語を認識対象
とするような場合、全種類の単語を発声して登録するに
は時間と労力を必要とし、また登録に要するメモリー容
量も膨大になる仁とが予想される。さらに入力バタンと
標準パタンのバタンマツチングに要する時間も単語数が
多くなると長くなってしまう欠点がある。これに対して、入力音声を音素単位に分けて音素の組合
せとして認識しく以下音素認識と呼ぶ）音素単位で表記
された単語辞書との類似度をめる方法は単語辞書に要す
るメモリー容量が大幅に少なくて済み、バタンマツチン
グに要する時間が短く、辞書の内容変更も容易であると
いう特長を持っている。例えば「赤い」という発声は／
ａ／。／に／＋／ｌｉ　という三つの音素を組合せてＡＫＡＩ
という極めて簡単な形式で表現することができるため、
不特定話者で多数語の音声に対処することが容易である
。第１図に音素認識を行うことを特徴とする音声認識方式
のブロック図を示す。マイク等で入力された音声は音響
分析部１によって分析を行なう。分析方法としては帯域フィルタ群や線形予測分析を用い
、フレーム周期（１０ｍＳ程度）毎にスペクトル情報を
得る。音素判別部２では、音響分析部１で得たスペクト
ル情報を用い、標準パターン格納部３のデータによって
フレーム毎の音素判別を行なう。標準パターン格納部３
に格納された標準パターンは、あらかじめ多数話者の音
声より音素毎にめておく。セグメンテーション部４では
音響分析部１の分析出力をもとに音声区間の検出と音素
毎の境界決定（以下セグメンテーションと呼ぶ）を行う
。音素認識部６ではセグメンテーション部４と音素判別
部２の結果をもとに１つの音素区間毎に何という音素で
あるかを決定する作業を行なう。この結果として音素の
系列が完成する。単語認識部６では、この音素系列を面様に音素系列で表
記された単語辞書７と照合し、最も類似度の高い単語を
認識結果として出力する。前記方法で不特定話者を対象とする場合に最も重要な点
は、高い音声認識精度を、どういう話者環境に対しても
安定して得ることである。また、そのために話者に負担
をかけすぎたり音声認識装置にした場合に高価な部分を
要するようであってはならない。しかし従来発表または試作されている音声認識装置は前
記条件が不十分であるという欠点があった。従来例として、予測残差を対象とする方式（鹿野、好用
「会話音声中の母音認識を目的としたＬＰＧ距離尺度の
評価」電子通信学会誌８０　／　５　。ＶＯＬ　Ｔ−６３Ｄ　、爲６参照）では、あらかじめ多
数話者の音声より線形予測分析によって音素ｉの最大パ
ラメータＡｍ　ｒ　（）−１ｔ　２　ｔ・・・・・・、
Ｐ）（Ｐは分析次数）をめておき、予測残差を次式ここ
でＳｊ　は未知な入力音声からめた自己相関係数である
。この予測残差Ｎｉ　を、対象とする音素毎にめこれを
距離尺度として、Ｎｉ　が最少となる音素を判別結果と
する。しかしこの方法は音素の標準パタンに相当する最大パラ
メータＡｉ５が単なる平均値であるため、たとえ使用者
にあわせてＡ、を作り直すという学習機能を設けたとし
ても、調音結合による発声の変動に対処することができ
ず、認識率が低いという欠点があった。発明の目的本発明は前記欠点を解消し、不特定話者に対処できると
ともに話者、環境、言葉のちがいに影響されることなく
安定に高い音声認識精度を得ることのできる音声認識装
置を提供することを目的とする。発明の構成本発明は上記目的を達成するためになされたもので、音
声信号からスペクトルまたはそれに類似する情報（以下
スペクトル情報と記す）を算出する音響分析部と、多数
話者からなる標準音声信号から得られた標準パターンを
予め格納する係数記憶部と、前記スペクトル情報と標準
パターンとを用いて音素毎の類似度をめる類似度計算部
と、類似度または音素系列で表記された単語辞書を格納
する単語辞書記憶部と、前記類似度計算部を経て作成さ
れた類似度または音素系列を単語辞書と照合し最も類似
度の高い単語を認識結果として出力する出力部と、前記
出力部の結果と前記音響分析部のスペクトル情報とから
新しい標準パターンを作成しその結果に基づき前記係数
記憶部の内容を書き替える学習部とを具備するものであ
る。実施例の説明第２図に本発明の音声認識装置の構成の一実施例を示す
。マイク３１から入った音声信号はＡＤ変換器２１で、
１２曲サンプリングで１２ビツトに変換する。これを信
号処理回路でプリエンファシスおよび２Ｑｍ３のハミン
グ窓をかけ、１０ｍＳ毎に線形予測分析プロセッサ２３
にてＬＰＣケプストラム係数を算出する。このＬＰＣケ
プストラム係数を類似度計算部２４に通し、各音素に対
する類似度をフレーム毎に算出し、メインメモリ２７に
転送する。係数メモリ２６は各音素毎のフィルタ係数を
格納している。一方、帯域フィルタ２６では３チャネル程度の帯域パワ
ーおよび全パワーを算出し、音素のセグメンテーション
用のデータとしてメインメモリ２７に転送する。メイン
プロセッサ２８では類似度計算部２４および帯域フィル
タ２６の結果を用いて音声区間の検出と音素毎のセグメ
ンテーションを行った後、類似度計算部２４の音素毎の
類似度から類似度の最も高い音素を区間毎に決定し、音
素系列を作成する。この音素系列を同様に音素系列で表
記された単語辞書メモリ２９と照合することによって最
も類似度の大きい単語名を認識結果として出力部３ｏに
出力する。しかし、これだけでは不特定話者に対して使用は可能で
あるが、標準パターンに相幽する係数メモリ２６が固定
されるため、話者による認識性能のバラツキが大きく、
認識率がかなり低くなってしまう場合が生ずる。そこで
、新しく学習機能をもたせるために学習部３２を設ける
。この学習部３２は線形予測分析プロセッサ２３で得た
ＬＰＣケプストラム係数を受け、出力部３ｏから得た結
果を参照に学習データを作成し、あらかじめめておいた
分散、共分散行列をもとにその話者に最もふされしい音
素毎の判別係数を計算し直し、係数メモ！７２５に転送
するための動作を行う。次に本発明に係る音素認識装置の動作について第２図を
参照にしながら詳しく説明する。あらかじめマイク３１から入力された多数話者の発声し
た多数の単語音声からＡＤ変換器２１を介して母音／ａ
／、１０／、／ｕ／、／Ｖ、／ｅ／と鼻音の切出しを行
っておく。この音声データを用いて信号処理回路２２お
よび線形予測分析プロセッサ２３により１０ｍ５の分析
区間毎に線形予測分析を行い、ｐ次元のＬＰＣケブヌト
ラム係数を算出する。仁のＬＰＧケプストラム係数を用
いて全音素を対象とした共分散行列Ｗと、各音素毎の平
均値ｔｎｔ　（ｉは音素の種類を表わす）をめる。この
結果より、音素１に対する判別係数ａｉｊ（ｊ＝１．２
．・・・・・・ｔｐ）は共分散行列Ｗの逆行列Ｗ−１の
（ｉ　、　ｊ’）要素を６月とすると、で表わすことが
できる。各音素毎にａｉ　５　、ｍ、　、’、δ目、ｍ１ｔＷ−
’ｍ、（抜道）をめ標準パターンとして係数メモリ２５
に格納しておく。次に使用者に内容のあらがじめわがっている音声（たと
えば／ａ／、／Ｖ、／ｕ／、／＠／、１０／　）を発声
させ、音声区間中の分析区間毎のＬＰＣケプストラム係
数を線形予測分析プロセッサ２３でめ、学習部３２に転
送する。一方予め格納されている係数メモリ２６の標準
パターンを用いて、判別フィルタ２４で類似度をめる。類似度計算部２４では入力信号のＬＰＣケプストラム係
数Ｘに対するマハラノビス距離り、　ハ（ｔは転置行列を示す）で表わすことができるが、第１項は音素の種類に依存し
５ないため、類似度Ｌｉ　を簡易的にで表わし、（４式
を用いて類似度を計算する０その結果をメインメモリ２
７に転送し、メインプロセッサ２８を通して音素系列を
作成する。次に、学習すべき音素の時間軸上の位置を示
す値を出力部３ｏより学習部３２にもどし、学習すべき
音素のＬＰＣケプストラム係数の平均値をめる。以上を
音声の種類を変えながら必要な回数くり返す。各音素毎の平均値に適度な重み付けをしたものを学習し
ない場合のもとの平均値（ｍ、′）に加え、新しい音素
毎の平均値を作成し係数メモリ２６の平均値ｍ１５７番
置き換える。さらにこの平均値を使用して判別係数ａｉ
ｌおよび（４式の定数項（第２項）を音素ごとに修正し
、仁れらを新しい標準パターンとして係数メモリ２５に
転送し、標準パターンの書替えを行う。次に実際に音声認識を行う場合について説明する。マイ
ク１０から入力された未知な音声信号について、信号処
理回路２２および線形予測分析プロセッサ２３を使用し
てＬＰＣケプストラム係数ｘ（ｘｌ、ｘ２．・・・・・
・ｔ　Ｘｐ　）をめ、類似度計算部２４に転送し、予め
めて係数メモリ２５に収納しである標準パターンを用い
て（４式より音素ｌの類似度り、をめる。これを音素毎（ｌ＝１，２．・・・・・・、ｎ）（ｎは
音素数）にめ、メインメモリ２７に転送する。メインプロセッサ２８ではこの類似度と帯域フィルタ２
６の出力をもとにセグメンテーションを行った結果とを
組合わせることに占り音素認識を行い音素系列を作成す
る。最後に音素系列を単語辞書メモリ２９と照合し、最も類
似度の高い単語を認識結果として出力部３０に転送する
。上記実施例は音声認識を行う前に、内容の予めわかって
いる音声を入力し、その結果に基づいて係数メモリ２５
内の標準パターンの修正を行う場合について述べたが、
音声認識の途中に音声の認識結果に基づいて係数メモリ
２５内の標準パターンの修正を行っても良いことはもち
ろんである。この場合には内容のわがっている音声を予め学習しなく
ても良く、環境の変化、入力者の音声の変化等に対して
自動的に追随することができる。このように、本実施例は音素認識を基本とする音声認識
装置において、各音素の標準パタンをあらかじめ簡単な
学習によって使用者に合うように作成する学習機能を持
つことを特徴とし、高い音声認識性能を持たせることが
できる。また、学習のための計算は極めて簡単であり、
特別な高い演算精度を持つ計算回路を要することなく、
すぐに新しい標準バタンを作成することができる。第３図は成人男子１０人を対象として、学習のない場合
と行った場合の音素認識率の比較を行ったものである。学習は評価用の全単語で行った場合３４と、２０語程度
の少数語で行った場合３６を示した。いずれも、学習の
ない場合３３に比して音素認識率は向上し、特に従来極
端に認識率の低かった話者（ＮＳ　、ＫＳ　、　ＳＭな
ど）に対して大きな効果のあることを示している。第４図は音素毎の認識率の標準偏差を示したもので、学
習のない場合４１に比して学習を全単語で行った場合４
２、少数語で行った場合４３ともにバラツキが減少し、
後段の単語マツチングの性能を向上させる効果を与える
ことを示している。本実施例は以下に示すような効果を有する。 ■　音声認識装置に学習機能を持たせることにより、使
用者に適合した標準バタンを自動作成し、環境の変化や
話者の個人差によるバラツキの少ない良好な音声認識精
度を持たせることができる。 ■　学習は使用前あるいは使用途中に、少数の音声を発
声することによって自動的に行うことができ、標準パタ
ンの作成も特別な装置を要することなく極めて簡単、高
速に行うことができる。発明の効果以上要するに本発明は音声信号からスペクトルる標準音
声信号から得られた標準パターンを予め格納する係数記
憶部と、前記スペクトル情報と標準パターンとを用いて
音素毎の類似度をめる類似度計算部と、類似度または音
素系列で表記された単語辞書を格納する単語辞書記憶部
と、前記類似度計算部を経て作成された類似度または音
素系列を単語辞書と照合し最も類似度の高い単語を認識
結果として出力する出力部と、前記出力部の結果と前記
音響分析部のスペクトル情報とから新しい標準パターン
を作成しその結果に基づき前記係数記憶部の内容を書き
替える学習部とを具備することを特徴とする音声認識装
置を提供するもので、話者による音声認識精度のバラツ
キを大幅に改善し、不特定話者に対して安定して使うこ
とができる利点を有する。４、図面の簡単な説明第１図は音素認識を基本とする従来の音声認識装置のブ
ロック図、第２図は本発明の一実施例における音声認識
装置のブロック図、第３図は本発明の音声認識装置の効
果を話者毎に示した図、第４図は本発明の音声認識装置
の効果を音素毎の標準偏差として表わした図である。２１・・・・・・ＡＤ変換器、２２・・・・・・信号処
理回路、２３・・・・・・線形予測分析プロセッサ、２
４・・・・・・類似度計算部、２６・・・・・・係数メ
モリ、２７・・・・・メインメモリ、２８・パ・・・メ
インプロセッサ、２９・・・・・・単語辞書メモリ、３
０曲・・出方部、３２・・・・・・学習部。代理人の氏名　弁理士　中　尾　敏　男　はが１名第２
図第乎図Fig. 1 is a block diagram of a conventional speech recognition device based on phoneme recognition, Fig. 2 is a block diagram of a speech recognition device according to an embodiment of the present invention, and Fig. 3 shows the effects of the speech recognition device of the present invention. FIG. 4 is a diagram showing the effect of the speech recognition device of the present invention as a standard deviation for each phoneme. 21...AD converter, 22...signal processing circuit, 23...linear predictive analysis processor, 2
4. Old... 4'lJ separate filter, 25... Coefficient memory, 27... Main memory, 28...
・Main processor, 29...Word dictionary memory,
30... Output section, 32. Old... Learning section. Name of agent: Patent attorney Toshio Nakao and 1 other person No. 4
Letter of amendment to figure procedure Dear Commissioner of the Japan Patent Office ■ Display of the case 1982 Patent Application No. 163537 3 Person making the amendment Relationship with the case Patent application Person Address 1006 Kadoma, Kadoma City, Osaka Name Name (
582) Matsushita Electric Industrial Co., Ltd. Representative Toshihiko Yamashita 4 Agent 571 Address 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture Matsushita Electric Industrial Co., Ltd. Specification 1, Name of the invention Voice recognition device 2, Patent claim Scope (1) An acoustic analysis unit that calculates a spectrum or information similar to it (hereinafter referred to as spectral information) from an audio signal, and a coefficient storage unit that stores in advance a standard pattern obtained from a standard audio signal composed of multiple speakers. , a similarity calculation unit that calculates the similarity of each phoneme using the spectral information and the standard pattern; a word dictionary storage unit that stores a word dictionary expressed in similarities or phoneme sequences; and the similarity calculation unit. an output unit that compares the similarity or phoneme sequence created through the process with a word dictionary and outputs the word with the highest degree of similarity as a recognition result; and a new standard from the results of the output unit and the spectrum information of the acoustic analysis unit. A speech recognition device comprising: a learning section that creates a pattern and rewrites the contents of the coefficient storage section based on the result. ('4) The speech recognition device according to claim 1, characterized in that the standard pattern includes at least a variance-covariance matrix and an average value of spectral information. The speech recognition device according to claim 1, characterized in that the content of the speech recognition device is modified. 3. 
Detailed Description of the Invention

Industrial Field of Application
The present invention relates to a speech recognition device for automatically recognizing speech signals uttered by a human voice.

Conventional configurations and their problems
A speech recognition device that automatically recognizes speech is considered to be very effective as a means of giving data and commands from humans to electronic computers and various machines. The pattern matching method is the operating principle most often adopted in speech recognition devices that have been researched or published so far. In this method, standard patterns are stored in advance for all the words that need to be recognized, the degree of matching (hereinafter referred to as similarity) is calculated by comparing them with the unknown input pattern, and the input is judged to be the same word as the standard pattern that gives the maximum match. Because a standard pattern must be prepared for every word to be recognized, new standard patterns have to be input and stored whenever the speaker changes. Therefore, when several hundred or more words are to be recognized, uttering and registering all of them takes time and effort, and the memory capacity required for registration is expected to be enormous. Furthermore, the time required to match the input pattern against the standard patterns grows as the number of words increases.

On the other hand, a method that divides the input speech into phoneme units, recognizes it as a combination of phonemes (hereinafter referred to as phoneme recognition), and obtains the similarity with a word dictionary written in phoneme units requires significantly less memory capacity for the word dictionary, shortens the time required for pattern matching, and makes the contents of the dictionary easy to change.
For example, the utterance "akai" (red) can be expressed in the extremely simple form AKAI by combining the three phonemes /a/, /k/, and /i/, so it is easy to handle speech with a large vocabulary from unspecified speakers.

FIG. 1 shows a block diagram of a speech recognition system characterized by performing phoneme recognition. Speech input through a microphone or the like is analyzed by an acoustic analysis section 1. As the analysis method, a bank of bandpass filters, linear predictive analysis, or the like is used to obtain spectral information at every frame period (about 10 ms). The phoneme discrimination unit 2 uses the spectral information obtained by the acoustic analysis unit 1 to discriminate the phoneme of each frame based on the data in the standard pattern storage unit 3. The standard pattern storage unit 3
holds standard patterns that are prepared in advance for each phoneme from the voices of many speakers. The segmentation unit 4 detects speech sections and determines the boundaries of each phoneme (hereinafter referred to as segmentation) based on the analysis output of the acoustic analysis unit 1. The phoneme recognition section 5 determines which phoneme each phoneme section contains, based on the results of the segmentation unit 4 and the phoneme discrimination unit 2. As a result, a phoneme sequence is completed. The word recognition unit 6 compares this phoneme sequence with a word dictionary 7 that is likewise written in phoneme sequences, and outputs the word with the highest similarity as the recognition result.

The most important point when using the above method for unspecified speakers is to obtain high speech recognition accuracy stably for any speaker and any environment. Furthermore, this should not place much burden on the speaker or require expensive components in the speech recognition device. However, the speech recognition devices announced or prototyped so far have had the drawback of not meeting these conditions. One conventional example uses the prediction residual as the distance measure (see Kano, "Evaluation of LPC distance scale for vowel recognition in conversational speech", Journal of the Institute of Electronics and Communication Engineers 80/5, VOL T-63D, No. 6). In this method, the maximum likelihood parameters $A_{ij}$ ($j = 1, 2, \ldots, p$; $p$ is the analysis order) of phoneme $i$ are determined in advance by linear predictive analysis from the voices of multiple speakers, and the prediction residual $N_i$ is calculated from $A_{ij}$ and $S_j$, where $S_j$ is an autocorrelation coefficient calculated from the unknown input speech. This prediction residual $N_i$ is used as the distance measure for each target phoneme, and the phoneme with the minimum $N_i$ is taken as the discrimination result. However, in this method the maximum likelihood parameters $A_{ij}$, which correspond to the standard pattern of a phoneme, are only average values. Even if a learning function is provided to recreate $A_{ij}$ to suit the user, the method cannot cope with variations in pronunciation caused by coarticulation, and the recognition rate is therefore low.

Objects of the Invention
The present invention solves the above-mentioned drawbacks, and its purpose is to provide a speech recognition device that can deal with unspecified speakers and stably obtain high speech recognition accuracy without being affected by differences in speakers, environments, and language.

Structure of the Invention
The present invention has been made to achieve the above object, and includes an acoustic analysis section that calculates a spectrum or information similar to it (hereinafter referred to as spectral information) from a speech signal, a coefficient storage unit that stores in advance standard patterns obtained from standard speech signals of multiple speakers, a similarity calculation unit that calculates the similarity to each phoneme using the spectral information and the standard patterns, a word dictionary storage unit that stores a word dictionary expressed in similarities or phoneme sequences,
an output unit that compares the similarity or phoneme sequence created through the similarity calculation unit with the word dictionary and outputs the word with the highest similarity as the recognition result, and a learning section that creates a new standard pattern from that result and the spectral information from the acoustic analysis section and rewrites the contents of the coefficient storage section based on the result.

Description of Embodiments
FIG. 2 shows an embodiment of the configuration of a speech recognition apparatus according to the present invention. The speech signal received from the microphone 31 is sent to the AD converter 21, where it is sampled at 12 kHz and converted to 12 bits. The signal processing circuit 22 applies pre-emphasis and a 20 ms Hamming window to this signal, and the linear predictive analysis processor 23 calculates the LPC cepstral coefficients every 10 ms. The LPC cepstral coefficients are passed through the similarity calculation unit 24, which calculates the similarity to each phoneme for every frame, and the results are transferred to the main memory 27. The coefficient memory 25 stores the filter coefficients for each phoneme. Meanwhile, the bandpass filter 26 calculates the band powers of about three channels and the total power, and transfers them to the main memory 27 as data for phoneme segmentation. The main processor 28 uses the results of the similarity calculation unit 24 and the bandpass filter 26 to detect speech intervals and perform segmentation into phonemes, then determines, for each section, the phoneme with the highest similarity among the per-phoneme similarities from the similarity calculation unit 24, and creates a phoneme sequence. This phoneme sequence is compared with the word dictionary memory 29, which is likewise written in phoneme sequences, and the word with the highest similarity is output to the output unit 30 as the recognition result. This alone is enough for use with unspecified speakers, but because the coefficient memory 25, which corresponds to the standard patterns, is fixed, the recognition performance varies greatly depending on the speaker.
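The recognition path just described (per-frame similarities, one phoneme decided per segment, then lookup in a phoneme-sequence dictionary) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the text does not say how frame similarities are pooled within a segment or how a phoneme sequence is scored against a dictionary entry, so summing similarities per segment and using edit distance are assumptions, and the names `decide_phonemes`, `edit_distance`, and `recognize_word` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def decide_phonemes(frame_sims: Sequence[Dict[str, float]],
                    segments: Sequence[Tuple[int, int]]) -> List[str]:
    """Pick one phoneme per segment from per-frame similarities.

    frame_sims[t] maps phoneme -> similarity for frame t.
    segments holds (start, end) frame indices, end exclusive, non-empty.
    Assumption: similarities are summed over the frames of a segment.
    """
    sequence = []
    for start, end in segments:
        totals: Dict[str, float] = {}
        for frame in frame_sims[start:end]:
            for phoneme, sim in frame.items():
                totals[phoneme] = totals.get(phoneme, 0.0) + sim
        sequence.append(max(totals, key=totals.get))
    return sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (pa != pb)))  # substitution
        prev = cur
    return prev[-1]


def recognize_word(phonemes: Sequence[str],
                   dictionary: Dict[str, List[str]]) -> str:
    """Return the dictionary word whose phoneme sequence is closest."""
    return min(dictionary,
               key=lambda word: edit_distance(phonemes, dictionary[word]))


if __name__ == "__main__":
    word_dictionary = {"akai": ["a", "k", "a", "i"], "aoi": ["a", "o", "i"]}
    print(recognize_word(["a", "k", "a", "e"], word_dictionary))
```

Because the dictionary holds only short phoneme strings, adding or changing a word means editing one entry, which is the property the text credits to phoneme-based recognition.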
In some cases the recognition rate becomes considerably low. Therefore, a learning section 32 is newly provided to add a learning function. The learning section 32 receives the LPC cepstral coefficients obtained by the linear predictive analysis processor 23, creates learning data with reference to the results obtained from the output section 30, recalculates, based on the variance-covariance matrix prepared in advance, the discriminant coefficients for each phoneme that best suit the speaker, and stores them in the coefficient memory 25.

Next, the operation of the phoneme recognition device according to the present invention will be explained in detail with reference to the drawing. Many words uttered by many speakers are input in advance from the microphone 31 through the AD converter 21.
The vowels /a/, /o/, /u/, /i/, /e/ and the nasal sounds are cut out of this speech in advance. Using this speech data, the signal processing circuit 22 and the linear predictive analysis processor 23 perform linear predictive analysis on each 10 ms analysis interval to calculate $p$-dimensional LPC cepstral coefficients. From these LPC cepstral coefficients, a covariance matrix $W$ common to all phonemes and an average value $m_i$ for each phoneme ($i$ denotes the type of phoneme) are calculated. From this result, the discriminant coefficients $a_{ij}$ ($j = 1, 2, \ldots, p$) can be expressed as follows, where $\sigma^{jj'}$ is the $(j, j')$ element of the inverse matrix $W^{-1}$ of the covariance matrix $W$:

$$a_{ij} = 2 \sum_{j'=1}^{p} \sigma^{jj'} m_{ij'} \qquad (1)$$

For each phoneme, $a_{ij}$, $m_i$, and $m_i^{t} W^{-1} m_i$ (a scalar) are stored in the coefficient memory 25 as the standard pattern.

Next, the user is asked to utter speech whose content is known in advance (e.g. /a/, /i/, /u/, /e/, /o/), and the LPC cepstral coefficients for each analysis interval in the speech section are calculated by the linear predictive analysis processor 23 and transferred to the learning section 32. Meanwhile, the similarity calculation unit 24 obtains the similarity using the standard patterns stored in advance in the coefficient memory 25. For LPC cepstral coefficients $x$, the Mahalanobis distance to phoneme $i$ is

$$D_i^2 = (x - m_i)^{t} W^{-1} (x - m_i) = x^{t} W^{-1} x - 2\, m_i^{t} W^{-1} x + m_i^{t} W^{-1} m_i .$$

Since the first term is common to all phonemes, the similarity $L_i$ can be expressed simply as

$$L_i = \sum_{j=1}^{p} a_{ij} x_j - m_i^{t} W^{-1} m_i \qquad (4)$$

The similarity calculation unit 24 calculates the similarity using equation (4), the result is stored in the main memory 27, and a phoneme sequence is created through the main processor 28. Next, a value indicating the position on the time axis of the phoneme to be learned is returned from the output unit 30 to the learning unit 32, and the average value of the LPC cepstral coefficients of that phoneme is calculated. This is repeated as many times as necessary while changing the type of speech. The average value obtained for each phoneme and the original average value $m_i$ used when no learning is performed are combined with appropriate weights to create a new average value for each phoneme, which replaces the average value $m_i$ in the coefficient memory 25. Furthermore, using this new average value, the discriminant coefficients $a_{ij}$ of equation (1) and the constant term (second term) of equation (4) are corrected for each phoneme, these are transferred to the coefficient memory 25 as the new standard pattern, and the standard pattern is rewritten.

Next, the case of actually performing speech recognition will be described. For an unknown speech signal input from the microphone 10, the signal processing circuit 22 and the linear predictive analysis processor 23 calculate the LPC cepstral coefficients $x = (x_1, x_2, \ldots, x_p)$, and the similarity $L_i$ ($i = 1, 2, \ldots, n$; $n$ is the number of phonemes) is obtained for every frame and transferred to the main memory 27. The main processor 28 combines this similarity with the result of segmentation based on the output of the bandpass filter 26 to perform phoneme recognition and create a phoneme sequence. Finally, the phoneme sequence is compared with the word dictionary memory 29, and the word with the highest similarity is transferred to the output unit 30 as the recognition result.

In the above embodiment, speech whose content is known in advance is input before speech recognition is performed, and based on the result the standard pattern in the coefficient memory 25
is modified. Of course, the standard pattern in the coefficient memory 25 may instead be modified during speech recognition, based on the recognition results. In this case there is no need to learn speech of known content beforehand, and the device can automatically follow changes in the environment, changes in the speaker's voice, and so on.

As described above, this embodiment is a speech recognition device based on phoneme recognition. It is characterized by a learning function that creates a standard pattern for each phoneme suited to the user through simple prior learning, and it can achieve high speech recognition performance. In addition, the calculations for learning are extremely simple, so new standard patterns can be created quickly without requiring a calculation circuit of especially high accuracy.

Figure 3 compares phoneme recognition rates with and without learning for 10 adult male speakers. Result 34 is for learning with all of the evaluation words, and result 36 is for learning with a small set of about 20 words. In both cases the phoneme recognition rate improved compared with the case without learning (33), and learning is particularly effective for speakers whose recognition rate was previously extremely low (NS, KS, SM, etc.). Figure 4 shows the standard deviation of the recognition rate for each phoneme. Compared with the case without learning (41), the variation is reduced both when learning is performed with all words (42) and when it is performed with a small number of words (43), which shows that learning also improves the performance of the word matching in the subsequent stage.

This embodiment has the following effects.
(1) By equipping the speech recognition device with a learning function, standard patterns suited to the user can be created automatically, and good speech recognition accuracy can be achieved with little variation due to changes in the environment or individual differences among speakers.
(2) Learning is performed automatically from a small number of utterances before or during use, and standard patterns can be created extremely easily and quickly without the need for special equipment.

Effects of the Invention
In short, the present invention includes an acoustic analysis unit that calculates spectral information from a speech signal; a coefficient storage section that stores in advance standard patterns obtained from standard speech signals; a similarity calculation unit that calculates the similarity to each phoneme using the spectral information and the standard patterns; a word dictionary storage unit that stores a word dictionary expressed in similarities or phoneme sequences; an output unit that compares the similarity or phoneme sequence created through the similarity calculation unit with the word dictionary and outputs the word with the highest similarity as the recognition result; and a learning unit that creates a new standard pattern from the result of the output unit and the spectral information of the acoustic analysis unit and rewrites the contents of the coefficient storage unit based on the result.
The speech recognition device thus provided has the advantage that variation in speech recognition accuracy between speakers is greatly reduced and that it can be used stably for unspecified speakers.

4. Brief description of the drawings
Fig. 1 is a block diagram of a conventional speech recognition device based on phoneme recognition, Fig. 2 is a block diagram of a speech recognition device according to an embodiment of the present invention, Fig. 3 is a diagram showing the effectiveness of the speech recognition device of the present invention for each speaker, and Fig. 4 is a diagram showing the effectiveness of the speech recognition device of the present invention as a standard deviation for each phoneme.
21... AD converter, 22... signal processing circuit, 23... linear predictive analysis processor, 24... similarity calculation unit, 25... coefficient memory, 27... main memory, 28... main processor, 29... word dictionary memory, 30... output section, 32... learning section.
Name of agent: Patent attorney Toshio Nakao and one other
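As an illustration of the standard-pattern computation and the learning step described in the embodiment, the following Python sketch builds, for one phoneme, discriminant coefficients from a shared covariance matrix W and a mean vector, evaluates the linear similarity, and then re-estimates the pattern after blending the mean with a speaker's frames. It is a minimal sketch under assumptions, not the patented circuit: the factor of 2 in the coefficients follows from expanding the Mahalanobis distance, the blending `weight` stands in for the "appropriate weighting" that the text leaves unspecified, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Pattern = Tuple[Vector, float]  # (discriminant coefficients, constant term)


def solve(W: Matrix, b: Sequence[float]) -> Vector:
    """Solve W y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[k]] for k, row in enumerate(W)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def make_pattern(W: Matrix, mean: Sequence[float]) -> Pattern:
    """Return (a, c) with a = 2 W^-1 m and c = m^T W^-1 m."""
    w_inv_m = solve(W, mean)
    a = [2.0 * v for v in w_inv_m]
    c = sum(m * v for m, v in zip(mean, w_inv_m))
    return a, c


def similarity(pattern: Pattern, x: Sequence[float]) -> float:
    """Linear similarity: sum_j a_j x_j minus the constant term."""
    a, c = pattern
    return sum(aj * xj for aj, xj in zip(a, x)) - c


def learn(W: Matrix, old_mean: Sequence[float],
          speaker_frames: Sequence[Sequence[float]],
          weight: float = 0.5) -> Tuple[Vector, Pattern]:
    """Blend the stored mean with the speaker's mean and rebuild the pattern.

    weight is an assumed parameter: 0 keeps the old mean, 1 uses only
    the speaker's frames. W itself is left unchanged.
    """
    p, n = len(old_mean), len(speaker_frames)
    spk_mean = [sum(f[j] for f in speaker_frames) / n for j in range(p)]
    new_mean = [(1.0 - weight) * o + weight * s
                for o, s in zip(old_mean, spk_mean)]
    return new_mean, make_pattern(W, new_mean)


if __name__ == "__main__":
    W = [[1.0, 0.2], [0.2, 1.0]]          # toy shared covariance
    pattern_a = make_pattern(W, [1.0, 0.0])
    frame = [0.4, 0.6]                     # toy speaker frame for /a/
    _, adapted = learn(W, [1.0, 0.0], [frame, frame])
    print(similarity(pattern_a, frame), similarity(adapted, frame))
```

Only the mean and the quantities derived from it change during learning, which is why the text can describe the learning computation as simple: no new covariance matrix has to be estimated or inverted for each speaker.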

Claims

[Scope of Claims]
(1) A speech recognition device comprising: an acoustic analysis unit that calculates a spectrum or information similar to it (hereinafter referred to as spectral information) from a speech signal divided into phonemes; a coefficient storage section that stores in advance a standard pattern based on the spectral information; a discrimination filter section that uses the spectral information and the standard pattern to obtain a filter output for each phoneme; a word dictionary storage unit that stores a word dictionary expressed in similarities or phoneme sequences; an output unit that compares the similarity or phoneme sequence created through the discrimination filter section with the word dictionary and outputs the word with the highest similarity as the recognition result; and a learning section that creates a new standard pattern from the result of the output unit and the spectral information of the acoustic analysis unit and rewrites the contents of the coefficient storage section based on the result.
(2) The speech recognition device according to claim 1, characterized in that the standard pattern includes at least a variance-covariance matrix and an average value of the spectral information.
(3) The speech recognition device according to claim 1, wherein the contents of the coefficient storage section are modified based on the recognition result of unknown input speech.