JPH06289899A

JPH06289899A - Speech recognition device

Info

Publication number: JPH06289899A
Application number: JP5074107A
Authority: JP
Inventors: Tadamichi Tokuda; 肇道徳田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-03-31
Filing date: 1993-03-31
Publication date: 1994-10-18

Abstract

PURPOSE: To adapt a speech recognition device that uses a neural network as its recognition part to changes over time in a speaker's pronunciation, by having the neural network part learn the speech feature data that caused a misrecognition whenever one occurs. CONSTITUTION: The neural network 3 built into the speech recognition device automatically learns, on each occasion, the speech feature data that caused a misrecognition, under the control of a speech recognition and learning control part 7. To prevent defective data from being learned, learning is performed only when the speech feature data match the already-learned data to some extent. Therefore, even if the speaker's pronunciation changes over time, the speech feature data are learned each time, so the device adapts to the change and maintains high recognition accuracy.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes a word uttered by a speaker and outputs the result.

[0002]

[Prior Art] In a conventional speech recognition device that recognizes spoken words using a neural network unit, the speech feature data of the words are learned by the neural network unit in advance. At recognition time, the feature data of the input speech are passed to the neural network unit, and the degree of match with each learned word is output as the recognition result.

[0003]

[Problems to Be Solved by the Invention] In the conventional speech recognition device, word recognition errors tended to occur when the speaker's pronunciation changed over time after the neural network unit had been trained. The recognition errors did not improve unless the learning data of the neural network unit were input again and the training was redone. Moreover, if defective data containing noise or peculiar pronunciation were learned at that time, recognition accuracy could instead decrease.

[0004] An object of the present invention is to solve these conventional problems and to provide a speech recognition device capable of maintaining high recognition accuracy in response to changes in a speaker's pronunciation.

[0005]

[Means for Solving the Problems] To achieve the above object, the present invention provides a speech recognition device using a neural network to recognize words uttered by specific or unspecified speakers, characterized by comprising: a speech input unit for inputting a speech signal; a speech signal feature extraction unit that extracts features from the speech signal; a learning data storage unit for the neural network; an external interface unit that displays the recognition result and receives from the user whether it is correct; a neural network unit that outputs, as a numerical value, the degree of match between the extracted speech feature data and the speech feature data of each learned word, and that learns given speech feature data; a recognition result determination unit that determines the recognition result from the degrees of match output by the neural network unit; and a speech recognition and learning control unit that controls the flow of the above data and causes the neural network unit to perform learning.

[0006]

[Operation] According to the present invention, the neural network unit built into the speech recognition device is made to learn automatically, each time, the speech feature data that were misrecognized. To prevent defective data from being learned, learning is executed only when the speech feature data match the already-learned data to some extent.

[0007] Therefore, even if the speaker's pronunciation changes over time, the device learns the speech feature data each time, follows the change in pronunciation, and continues to maintain high recognition accuracy.

[0008]

[Embodiment] FIG. 1 is a functional block diagram of a speech recognition device according to an embodiment of the present invention. In FIG. 1, reference numeral 1 denotes a speech input unit for inputting speech uttered by a speaker; 2 denotes a speech signal feature extraction unit that calculates speech feature data from the speech signal input through the speech input unit 1; 3 denotes a neural network unit that takes speech feature data as input and outputs the degree of match with the speech feature data of each learned word; 4 denotes a recognition result determination unit that receives the recognition result from the neural network unit 3 and calculates the three words with the highest degrees of match; 5 denotes a learning data storage unit that stores the speech feature data of the words learned by the neural network unit 3; 6 denotes an external interface unit that receives the recognition result from the learning data storage unit 5, displays it, and has the user input whether the result is correct; and 7 denotes a speech recognition and learning control unit that receives the correct/incorrect information on the recognition result from the external interface unit 6 and decides whether to have the neural network unit 3 learn.

[0009] FIG. 2 is a circuit block diagram of the speech recognition device of FIG. 1, in which 8 denotes a microphone, 9 a read-only memory (hereinafter abbreviated as ROM), 10 a central processing unit (hereinafter abbreviated as CPU), 11 a random access memory (hereinafter abbreviated as RAM), 12 a monitor, and 13 a keyboard.

[0010] Here, the speech input unit 1 shown in FIG. 1 is realized by the microphone 8, and the learning data storage unit 5 by the RAM 11. The speech signal feature extraction unit 2, the neural network unit 3, the recognition result determination unit 4, and the speech recognition and learning control unit 7 are realized by the CPU 10 executing a program stored in the ROM 9 while exchanging data with the ROM 9 and the RAM 11. The external interface unit 6 is realized by the monitor 12 and the keyboard 13.

[0011] The operation of the speech recognition device of this embodiment, configured as described above, is explained below with reference to the flowchart of FIG. 3, for the case where the word "tanaka" is spoken with a pronunciation different from that at the time of initial learning. It is assumed that the neural network unit 3 has already learned the learning data shown in Table 1. There are two data items per word, and each data item is 270 bytes in size and consists of 45 numerical values.

[0012]

[Table 1]

[0013] In step (S1), the speech signal feature extraction unit 2 extracts speech features from the input speech signal received from the speech input unit 1.

[0014] In step (S2), the extracted speech feature data are input to the neural network unit 3, which outputs the degree of match with each learned word. The recognition result determination unit 4 then calculates the three words with the highest degrees of match, in descending order. In this example, as shown in Table 2, "tanaka" is ranked second and "tokuda" is ranked first.

[0015]

[Table 2]
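
The ranking performed in step (S2) can be illustrated with a minimal Python sketch. It assumes the neural network unit's output is available as a mapping from word to match score. The function name `top_candidates`, the words other than "tanaka" and "tokuda", and every numeric value are hypothetical, since the contents of Table 2 are not reproduced here.

```python
from typing import Dict, List, Tuple


def top_candidates(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (word, score) pairs with the highest match scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical scores: only the relative order of "tokuda" and "tanaka" comes
# from the description; the other words and all values are invented.
example_scores = {"tanaka": 0.55, "tokuda": 0.62, "takeda": 0.21, "nakata": 0.08}
candidates = top_candidates(example_scores)
print([word for word, _ in candidates])  # ['tokuda', 'tanaka', 'takeda']
```

Keeping the scores alongside the words would let a later stage display them to the user, although the description only requires the ranked words.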

[0016] In step (S3), the external interface unit 6 outputs the recognition result to the monitor 12. The user looks at it and enters "yes" on the keyboard 13 if the recognition result is correct, or the correct answer if the recognition result is wrong. In this example the recognition result is wrong (no), so the user enters "tanaka".

[0017] In step (S4), the speech recognition and learning control unit 7 receives the user's input from the external interface unit 6 and ends the process if the input is yes. If the input is no, the process proceeds to step (S5) when the correct word entered by the user is among the top three recognition results, and ends otherwise. This prevents defective data mixed with noise, or data with badly distorted pronunciation, from entering the learning data. In this example, "tanaka" is ranked second, so the data are regarded as normal and the process proceeds to step (S6).
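
The decision made in step (S4) amounts to a small gate in front of the learning step. The following Python sketch shows one way to express it, assuming the user's input arrives as a string that is either "yes" or the correct word. The function name `word_to_learn` and the third candidate word are hypothetical.

```python
from typing import List, Optional


def word_to_learn(user_input: str, ranked_words: List[str], max_rank: int = 3) -> Optional[str]:
    """Return the word whose features should be learned, or None to skip learning.

    user_input is "yes" when the top result was correct, otherwise the correct word.
    ranked_words lists the recognized words, best match first.
    """
    if user_input == "yes":
        return None  # recognition was correct, nothing to learn
    if user_input in ranked_words[:max_rank]:
        return user_input  # correct word is a top candidate: treat as normal data
    return None  # correct word not among the top candidates: likely noisy or distorted


ranking = ["tokuda", "tanaka", "takeda"]  # "takeda" is an invented third candidate
print(word_to_learn("tanaka", ranking))  # tanaka
print(word_to_learn("suzuki", ranking))  # None
```

Using the rank of the correct word as the filter means no separate noise detector is needed: an utterance that the network places far from the intended word is simply not learned.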

[0018] In step (S5), the old speech feature data in the learning data held in the learning data storage unit 5 are replaced with the speech feature data that were misrecognized this time, and the result is stored in the RAM 11. In this example, the older of the "tanaka" speech feature data in the learning data of Table 1 (No. 3) is erased and the "tanaka" data misrecognized this time are recorded. Table 3 shows the learning data after the change: the new data are inserted at No. 3, and the original No. 3 becomes No. 4.

[0019]

[Table 3]
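
The update of step (S5) can be sketched as a fixed-size, newest-first buffer per word. This is an illustrative Python sketch, not the stored format used by the device: the dictionary layout, the name `replace_oldest`, and the feature values are assumptions. Only the figures of two data items per word and 45 values per item come from the description.

```python
from collections import deque
from typing import Deque, Dict, List, Sequence

SAMPLES_PER_WORD = 2      # the description keeps two data items per word
VALUES_PER_SAMPLE = 45    # each data item consists of 45 numerical values

LearningData = Dict[str, Deque[List[float]]]


def replace_oldest(store: LearningData, word: str, features: Sequence[float]) -> None:
    """Insert the new sample first and drop the oldest sample of the same word."""
    if len(features) != VALUES_PER_SAMPLE:
        raise ValueError("expected %d feature values" % VALUES_PER_SAMPLE)
    # A deque with maxlen discards from the opposite end when full, so
    # appendleft keeps the newest sample first and removes the oldest one.
    store[word].appendleft(list(features))


# Hypothetical feature values; real data would come from the feature extractor.
store: LearningData = {
    "tanaka": deque([[0.1] * VALUES_PER_SAMPLE, [0.2] * VALUES_PER_SAMPLE],
                    maxlen=SAMPLES_PER_WORD),
}
replace_oldest(store, "tanaka", [0.9] * VALUES_PER_SAMPLE)
print([sample[0] for sample in store["tanaka"]])  # [0.9, 0.1]
```

The newest-first order mirrors the statement that the new data take the first position for the word and the previously first item moves down by one.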

[0020] In step (S6), the neural network unit 3 learns the updated learning data, and the trained neural network is saved. The initial training of the neural network unit requires several thousand training iterations, but experiments have shown that when only part of the learning data is updated, about one hundred iterations are enough for the training to converge, the training time stays within a practical range, and the learned data can then be recognized accurately. In this example, the neural network unit learns the change in the pronunciation of "tanaka", so it can recognize the word correctly if "tanaka" is later pronounced with the same change.
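
The idea of step (S6), continuing training from the already-learned weights for a short run instead of starting over, can be sketched in Python as follows. The description does not specify the network architecture, so a single-layer softmax classifier trained by full-batch gradient descent is swapped in here purely as a stand-in. The feature values are random, and the function names, learning rate, and epoch counts are assumptions chosen for illustration. The sketch does not reproduce the experimental result reported above.

```python
import math
import random
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, index of the word)


def predict(weights: List[List[float]], x: List[float]) -> List[float]:
    """Softmax over one linear score per word."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_loss(weights: List[List[float]], data: List[Sample]) -> float:
    """Average cross-entropy of the correct word over the data."""
    return -sum(math.log(predict(weights, x)[y]) for x, y in data) / len(data)


def train(weights: List[List[float]], data: List[Sample], epochs: int, lr: float = 0.05) -> None:
    """Full-batch gradient descent that updates the weights in place."""
    for _ in range(epochs):
        grads = [[0.0] * len(row) for row in weights]
        for x, y in data:
            probs = predict(weights, x)
            for k, p in enumerate(probs):
                err = p - (1.0 if k == y else 0.0)
                for j, v in enumerate(x):
                    grads[k][j] += err * v / len(data)
        for row, grad in zip(weights, grads):
            for j, g in enumerate(grad):
                row[j] -= lr * g


rng = random.Random(0)
n_words, n_values = 3, 45
# Hypothetical learning data: two random samples per word.
data = [([rng.random() for _ in range(n_values)], w) for w in range(n_words) for _ in range(2)]
weights = [[0.0] * n_values for _ in range(n_words)]
train(weights, data, epochs=2000)  # initial learning from scratch

# Replace one sample of word 1 with a new utterance, then continue training
# from the existing weights for a much shorter run.
data[2] = ([rng.random() for _ in range(n_values)], 1)
loss_before = mean_loss(weights, data)
train(weights, data, epochs=100)
loss_after = mean_loss(weights, data)
print(loss_after < loss_before)
```

Because the weights are reused, the short run only has to account for the one replaced sample, which is the intuition behind needing far fewer iterations than the initial training.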

[0021] As described above, by adding a learning function for the neural network unit and a function for identifying normal data to the conventional speech recognition device, the device automatically learns only normal data whenever the user simply points out an error, and high recognition accuracy can be maintained even if the speaker's pronunciation changes over time.

[0022]

[Effects of the Invention] As described above, the speech recognition device of the present invention achieves high recognition accuracy by automatically learning changes in the speaker's pronunciation without requiring any special operation.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a circuit block diagram of the device of FIG. 1.

FIG. 3 is a flowchart illustrating the operation of the device of FIG. 1.

[Explanation of symbols]

1 ... speech input unit, 2 ... speech signal feature extraction unit, 3 ... neural network unit, 4 ... recognition result determination unit, 5 ... learning data storage unit, 6 ... external interface unit, 7 ... speech recognition and learning control unit, 8 ... microphone, 9 ... ROM, 10 ... CPU, 11 ... RAM, 12 ... monitor, 13 ... keyboard.

Claims

[Claims]

1. A speech recognition device using a neural network to recognize words uttered by specific or unspecified speakers, comprising: a speech input unit for inputting a speech signal; a speech signal feature extraction unit that extracts features from the speech signal; a learning data storage unit for the neural network; an external interface unit that displays the recognition result and receives from the user whether it is correct; a neural network unit that outputs, as a numerical value, the degree of match between the extracted speech feature data and the speech feature data of each learned word, and that learns given speech feature data; a recognition result determination unit that determines the recognition result from the degrees of match output by the neural network unit; and a speech recognition and learning control unit that controls the flow of the data and causes the neural network unit to perform learning.