JPH04280299A - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
JPH04280299A
JPH04280299A, JP3068855A, JP6885591A
Authority
JP
Japan
Prior art keywords
dictionary
stored
buffer
word number
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP3068855A
Other languages
Japanese (ja)
Inventor
Keiichi Miyamoto
恵一 宮本
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to JP3068855A priority Critical patent/JPH04280299A/en
Publication of JPH04280299A publication Critical patent/JPH04280299A/en
Pending legal-status Critical Current

Abstract

PURPOSE: To lighten the user's burden of registering a specific dictionary by performing the registration automatically during the recognition operation.
CONSTITUTION: The feature quantities of the speech input during a recognition operation are temporarily stored in a dictionary buffer 10, and, when the recognition result for that speech is correct, its word number is stored in a word number buffer 11 in association with the speech stored in the dictionary buffer 10 by pressing a key on a recognition result input unit 12. When the same word number has been stored in the word number buffer 11 a predetermined number of times (Th1) or more, the similarities among the feature quantities of the speech stored in the dictionary buffer 10 for that word number are calculated, and when the similarity is equal to or greater than a predetermined threshold (Th2), one of those feature quantities stored in the dictionary buffer 10 is stored in a specific dictionary unit 4 as a specific dictionary.

Description

[Detailed Description of the Invention]

[0001]

[Technical Field] The present invention relates to a speech recognition device, and more particularly to improving the recognition rate of a speaker-independent speech recognition device.

[0002]

[Prior Art] Speech recognition devices are coming into practical use as input devices for various control equipment, personal computers, word processors, and the like. Their recognition targets are generally discretely uttered words representing control commands or objects of control. Speech recognition devices realizing these functions can be broadly divided into speaker-dependent and speaker-independent systems. In the speaker-dependent method, the user of the recognition device registers the word utterances to be recognized in his or her own voice, and the vocabulary can be chosen with some freedom. Because the recognition dictionary is built from speech nearly identical to the speech input at recognition time, a relatively high recognition rate can be obtained. However, this method requires the user to register speech, and the burden grows as the number of words increases. In the speaker-independent method, by contrast, the dictionary of word utterances to be recognized is prepared in advance by the provider (manufacturer) of the recognition device. Since the user performs recognition with this dictionary, no speech registration is needed. The manufacturer records a large number of voices and builds the dictionary from averaged speech. With this method, however, it is difficult to obtain a relatively high recognition rate because of speaker-specific habits and dialects. As a technique combining the convenience of the speaker-independent device, which requires no registration effort, with the high recognition rate of the speaker-dependent device, a method of holding both an unspecified (speaker-independent) dictionary and a specific (speaker-dependent) dictionary has been proposed. Such a device is basically a speaker-independent speech recognition device, but words that prove hard to recognize during use are selected and registered in a speaker-dependent dictionary, thereby raising the recognition rate for those words.

[0003] FIG. 2 is a block diagram for explaining an example of such a conventional speech recognition device. At recognition time, the feature quantities of the input speech are extracted by a feature extraction unit 1 and sent to a similarity calculation unit 2. Meanwhile, the feature quantities registered in an unspecified dictionary 3 and a specific dictionary 4 are sent in turn to the similarity calculation unit 2, where their similarity to the input speech is calculated. Once all similarity calculations are complete, the entry with the highest similarity is output from a result output unit 5 as the recognition result. When the device is first put to use, nothing is registered in the specific dictionary 4, and the user can register words that are difficult to recognize in it. This can be arranged so that the key switch 6 puts the control unit 7 into registration mode, the user specifies the number of the word to be registered and then utters it, and the word is registered in the specific dictionary 4. Concrete examples are described in detail in, for example, Japanese Patent Application Nos. 1-261097, 1-261098, and 1-286790, so the explanation is omitted here.

[0004] As described above, the recognition rate for speech registered in the specific dictionary is improved. If the word number and other attributes (such as the action triggered by recognition) are inherited from the corresponding word in the unspecified dictionary, the word can be used without any problem. This method yields a better speech recognition device that combines the convenience of the speaker-independent device, which requires no registration effort, with the high recognition rate of the speaker-dependent device. However, as noted above, registering a word in the specific dictionary still requires an explicit "registration" operation, and the user must keep track of the hard-to-recognize words, for example by writing them down; the device therefore still falls short of the ease of use inherent in a speaker-independent speech recognition device.

[0005]

[Object] The present invention has been made in view of the circumstances described above, and its particular object is to reduce the user's burden of registering the specific dictionary by performing such registration automatically during the recognition operation.

[0006]

[Constitution] To achieve the above object, the present invention provides (1) a speech recognition device comprising a feature extraction unit that extracts features of input speech, a specific dictionary unit that stores a plurality of extracted speech feature quantities as a speaker-dependent dictionary, an unspecified dictionary unit that stores speech feature quantities extracted separately in advance as a speaker-independent dictionary, a similarity calculation unit that calculates the similarity between the extracted speech feature quantities and the plurality of dictionary entries stored in the specific dictionary unit and the unspecified dictionary unit, and a result selection unit that selects one or more dictionary entries as recognition results in descending order of the calculated similarity, the device further comprising a recognition result input unit through which it is input whether or not a recognition result is correct, a dictionary buffer that temporarily stores a plurality of feature quantities of the speech input during the recognition operation, and a word number buffer that, when the recognition result for that speech is correct, stores a plurality of word numbers in association with the speech stored in the dictionary buffer, wherein, when the same word number has been stored in the word number buffer a predetermined number of times (Th1) or more, the similarities among the speech feature quantities stored in the dictionary buffer for that word number are calculated, and when the similarity is equal to or greater than a predetermined threshold (Th2), one of the feature quantities stored in the plurality of dictionary buffers is stored in the specific dictionary unit as a specific dictionary. Further, (2) in (1) above, the average or median of the feature quantities stored in the plurality of dictionary buffers is stored in the specific dictionary unit as a specific dictionary. Further, in (1) or (2) above, (3) if the similarity when the input speech is correct is higher than a predetermined threshold (Th3), the word number is not stored in the word number buffer and the contents of the dictionary buffer stored at that time are cleared; or (4) if the similarity when the input speech is correct is higher than a predetermined threshold (Th4), the word number is not stored in the word number buffer and the contents of the single or multiple dictionary buffers corresponding to that word number are cleared. An embodiment of the present invention is described below.
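A minimal sketch of the registration condition in items (1) and (2) follows, reusing the same assumed cosine-similarity measure. The function name maybe_register and the buffer layout are hypothetical; only the Th1 count test and the Th2 mutual-similarity test come from the text above.

```python
import numpy as np
from itertools import combinations

def cosine_similarity(a, b):
    # Assumed similarity measure; the patent does not fix a particular metric.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def maybe_register(word_number, word_number_buffer, dictionary_buffer,
                   th1, th2, use_average=False):
    """Decide whether the buffered utterances of word_number should be
    promoted to the specific dictionary; return a template or None.

    word_number_buffer: list of word numbers, one per buffered utterance
                        (word number buffer 11).
    dictionary_buffer:  list of feature vectors, aligned with the list above
                        (dictionary buffer 10).
    """
    # Condition 1 (Th1): the same word number has been stored Th1 times or more.
    indices = [i for i, w in enumerate(word_number_buffer) if w == word_number]
    if len(indices) < th1:
        return None

    # Condition 2 (Th2): the buffered utterances must agree with one another.
    feats = [dictionary_buffer[i] for i in indices]
    scores = [cosine_similarity(a, b) for a, b in combinations(feats, 2)]
    if scores and min(scores) < th2:
        return None

    # Item (1): register any one of the buffered feature quantities;
    # item (2): register their average (a median would work the same way).
    return np.mean(feats, axis=0) if use_average else feats[0]
```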

[0007] FIG. 1 is a block diagram for explaining one embodiment of the present invention; parts in the figure that operate in the same way as in the prior art of FIG. 2 are given the same reference numerals as in FIG. 2. In FIG. 1, in the initial state the specific dictionary 4, the dictionary buffer 10, and the word number buffer 11 are all empty. At this point the device behaves purely as a speaker-independent speech recognition device. When speech to be recognized is input, its feature quantities are extracted by the feature extraction unit 1 and sent to the similarity calculation unit 2; the feature quantities are also stored in the dictionary buffer 10. The similarity calculation unit 2 calculates the similarity between the entries in the unspecified dictionary 3 and these feature quantities, and a recognition result corresponding to the outcome (a word number or the like) is output by the result output unit 5. A correct/incorrect response to this recognition result is then entered by the user via the recognition result input unit 12, which has correct/incorrect keys (a key switch is used in this example). If the recognition result is correct and the similarity at that time is at or below a predetermined threshold, the correct word number is stored in the word number buffer 11 in association with the entry in the dictionary buffer 10. Even when the result is correct, if the similarity is sufficiently high (above the predetermined threshold) the unspecified dictionary 3 recognizes the word well enough, so the buffer for this utterance is cleared. In this way, feature quantities accumulate in the dictionary buffer 10 each time recognition is performed. Thereafter, when the same word number has appeared in the word number buffer 11 a predetermined number of times (Th1), the similarities among the speech feature quantities stored in the dictionary buffer 10 for that word number are calculated, and when the similarity is equal to or greater than a predetermined threshold (Th2), one of the feature quantities stored in the dictionary buffers (or their average, median, or the like) is stored in the specific dictionary unit 4 as a specific dictionary. Once a feature quantity has been registered in the specific dictionary 4, it becomes a recognition target from the next time on. By repeating recognition in this way, only words that are hard to recognize with the unspecified dictionary 3 (words whose similarity is not high) are registered automatically in the specific dictionary 4. The reason for calculating the similarity across multiple utterances and adopting only those at or above a certain threshold is to prevent sudden noise, or speech produced when the speaker is in poor condition, from being registered in the specific dictionary. One embodiment of the present invention has been described above; the operation of the control system, the recognition unit, and other parts not directly related to the invention are described in detail in known examples and are therefore omitted. As for the user's correct/incorrect input, various schemes are conceivable, such as treating the absence of an incorrect input as correct, or treating cancellation of the action triggered after recognition as incorrect; these do not limit the present invention. Furthermore, neither the speech feature quantities nor the recognition algorithm is limited in any way.
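One possible way to assemble the per-utterance flow of this embodiment is sketched below, reusing the recognize() and maybe_register() helpers from the earlier sketches. It is an interpretation only: buffering is simplified so that an utterance is stored only after the user confirms the result, and the class layout and threshold attribute names are assumptions, not the patent's implementation.

```python
class AutoRegisteringRecognizer:
    """Sketch of the FIG. 1 flow: recognize, take correct/incorrect feedback,
    and automatically promote hard-to-recognize words to the specific dictionary."""

    def __init__(self, unspecified_dict, th1, th2, th3):
        self.unspecified_dict = list(unspecified_dict)  # unspecified dictionary 3
        self.specific_dict = []                         # specific dictionary 4
        self.dictionary_buffer = []                     # dictionary buffer 10
        self.word_number_buffer = []                    # word number buffer 11
        self.th1, self.th2, self.th3 = th1, th2, th3    # assumed threshold names

    def process_utterance(self, feature, is_correct_fn):
        # Recognition over both dictionaries (recognize() from the sketch above).
        word, score = recognize(feature, self.unspecified_dict, self.specific_dict)
        if not is_correct_fn(word):        # recognition result input unit 12
            return word
        if score > self.th3:
            # Correct and already well recognized: the unspecified dictionary
            # suffices, so nothing is buffered (the claim-3 behaviour; claim 4
            # would additionally drop earlier buffered entries for this word).
            return word
        # Correct but low similarity: buffer the utterance as a candidate.
        self.dictionary_buffer.append(feature)
        self.word_number_buffer.append(word)
        # Promote to the specific dictionary once the Th1/Th2 conditions hold
        # (maybe_register() from the sketch above; averaging as in item (2)).
        template = maybe_register(word, self.word_number_buffer,
                                  self.dictionary_buffer,
                                  self.th1, self.th2, use_average=True)
        if template is not None:
            self.specific_dict.append((word, template))
        return word
```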

[0008]

[Effect] As is clear from the above description, according to the inventions of claims 1 and 2, the speech input at recognition time can be registered automatically as a specific dictionary, so the user's burden of registering the specific dictionary is reduced. According to the inventions of claims 3 and 4, registration in the specific dictionary is performed only for words that are hard to recognize with the unspecified dictionary, so unnecessary specific-dictionary entries are avoided and the dictionary is prevented from growing excessively large.

[Brief Description of the Drawings]

[FIG. 1] A block diagram for explaining an embodiment of the speech recognition device according to the present invention.

[FIG. 2] A block diagram for explaining an example of a conventional speech recognition device.

[Explanation of Reference Numerals]

1 ... feature extraction unit, 2 ... similarity calculation unit, 3 ... unspecified dictionary, 4 ... specific dictionary, 5 ... result output unit, 6 ... key switch, 7 ... control unit, 10 ... dictionary buffer, 11 ... word number buffer, 12 ... recognition result input unit.

Claims (4)

[Claims]

[Claim 1] A speech recognition device comprising a feature extraction unit that extracts features of input speech, a specific dictionary unit that stores a plurality of extracted speech feature quantities as a speaker-dependent dictionary, an unspecified dictionary unit that stores speech feature quantities extracted separately in advance as a speaker-independent dictionary, a similarity calculation unit that calculates the similarity between the extracted speech feature quantities and the plurality of dictionary entries stored in the specific dictionary unit and the unspecified dictionary unit, and a result selection unit that selects one or more dictionary entries as recognition results in descending order of the calculated similarity, the device further comprising: a recognition result input unit through which it is input whether or not a recognition result is correct; a dictionary buffer that temporarily stores a plurality of feature quantities of the speech input during a recognition operation; and a word number buffer that, when the recognition result for the speech is correct, stores a plurality of word numbers in association with the speech stored in the dictionary buffer; wherein, when the same word number has been stored in the word number buffer a predetermined number of times (Th1) or more, the similarities among the speech feature quantities stored in the dictionary buffer for that word number are calculated, and when the similarity is equal to or greater than a predetermined threshold (Th2), one of the feature quantities stored in the plurality of dictionary buffers is stored in the specific dictionary unit as a specific dictionary.
[Claim 2] The speech recognition device according to claim 1, wherein the average or median of the feature quantities stored in the plurality of dictionary buffers is stored in the specific dictionary unit as a specific dictionary.
[Claim 3] The speech recognition device according to claim 1 or 2, wherein, if the similarity when the input speech is correct is higher than a predetermined threshold (Th3), the word number is not stored in the word number buffer and the contents of the dictionary buffer stored at that time are cleared.
[Claim 4] The speech recognition device according to claim 1 or 2, wherein, if the similarity when the input speech is correct is higher than a predetermined threshold (Th4), the word number is not stored in the word number buffer and the contents of the single or multiple dictionary buffers corresponding to that word number are cleared.
JP3068855A 1991-03-08 1991-03-08 Speech recognition device Pending JPH04280299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP3068855A JPH04280299A (en) 1991-03-08 1991-03-08 Speech recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP3068855A JPH04280299A (en) 1991-03-08 1991-03-08 Speech recognition device

Publications (1)

Publication Number Publication Date
JPH04280299A true JPH04280299A (en) 1992-10-06

Family

ID=13385706

Family Applications (1)

Application Number Title Priority Date Filing Date
JP3068855A Pending JPH04280299A (en) 1991-03-08 1991-03-08 Speech recognition device

Country Status (1)

Country Link
JP (1) JPH04280299A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446039B1 (en) 1998-09-08 2002-09-03 Seiko Epson Corporation Speech recognition method, speech recognition device, and recording medium on which is recorded a speech recognition processing program
JP2008077099A (en) * 2001-03-28 2008-04-03 Qualcomm Inc Voice recognition system using implicit speaker adaption
JP2008203876A (en) * 2001-03-28 2008-09-04 Qualcomm Inc Voice recognition system using implicit speaker adaption
JP4546555B2 (en) * 2001-03-28 2010-09-15 クゥアルコム・インコーポレイテッド Speech recognition system using technology that implicitly adapts to the speaker
JP4546512B2 (en) * 2001-03-28 2010-09-15 クゥアルコム・インコーポレイテッド Speech recognition system using technology that implicitly adapts to the speaker

Similar Documents

Publication Publication Date Title
US7058573B1 (en) Speech recognition system to selectively utilize different speech recognition techniques over multiple speech recognition passes
JP3968133B2 (en) Speech recognition dialogue processing method and speech recognition dialogue apparatus
US6192337B1 (en) Apparatus and methods for rejecting confusible words during training associated with a speech recognition system
US6076054A (en) Methods and apparatus for generating and using out of vocabulary word models for speaker dependent speech recognition
JPH0883091A (en) Voice recognition device
JPH096389A (en) Voice recognition interactive processing method and voice recognition interactive device
JP2000122691A (en) Automatic recognizing method for spelling reading type speech speaking
US6246980B1 (en) Method of speech recognition
JPH0372998B2 (en)
EP1525577B1 (en) Method for automatic speech recognition
EP1225567B1 (en) Method and apparatus for speech recognition
CN109688271A (en) The method, apparatus and terminal device of contact information input
JPS6326700A (en) Voice recognition system
US7177806B2 (en) Sound signal recognition system and sound signal recognition method, and dialog control system and dialog control method using sound signal recognition system
JP3876703B2 (en) Speaker learning apparatus and method for speech recognition
JPH04280299A (en) Speech recognition device
JP2980382B2 (en) Speaker adaptive speech recognition method and apparatus
JP2656234B2 (en) Conversation voice understanding method
US8688452B2 (en) Automatic generation of distractors for special-purpose speech recognition grammars
JP3302923B2 (en) Voice input device
JP2002196789A (en) Speech interactive device
JP3871774B2 (en) Voice recognition apparatus, voice recognition method, and recording medium recording voice recognition program
JPH0421880B2 (en)
Stemmer et al. The utility of semantic-pragmatic information and dialogue-state for speech recognition in spoken dialogue systems
JPH04167176A (en) Speech recognition device