JPH05224694A

JPH05224694A - Speech recognition device

Info

Publication number: JPH05224694A
Application number: JP4027906A
Authority: JP
Inventors: Shoji Kuriki; 章次栗木
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1992-02-14
Filing date: 1992-02-14
Publication date: 1993-09-03

Abstract

PURPOSE: To obtain the highest recognition rate regardless of whether a speaker's speech level is high or low, by judging the speech level of an input hot key and setting the gain of the whole system according to the judged level. CONSTITUTION: An input speech signal is amplified by a microphone amplifier, divided into frequency bands by a filter bank 4, and digitized, after which a processing part 6 extracts features. The feature-extracted speech pattern is divided into speech sections, which are stored in an input pattern storage part 8. When the stored speech signal is input to a recognition and collation part 9, it is judged whether hot key processing has been completed; when the hot key is input, a selector 15 selects among hot key dictionaries 14 provided for each gain, and the recognition and collation part 9 performs collation using these dictionaries to detect the dictionary that gives the maximum similarity. The speaker's speech level is estimated from the gain of this dictionary, and the speech level of the whole system is set so as to obtain the maximum recognition rate.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes speech input through a microphone.

[0002]

[Prior Art] One major obstacle to improving the recognition rate of this type of speech recognition device is the individual difference in speech level: measured by speech power, loud and quiet speakers differ by several tens of dB. As shown in FIG. 1, the relationship between speech level and recognition rate is such that a high recognition rate cannot be obtained in the low-level region, and an effective recognition rate is obtained only above a certain level. In other words, it is currently difficult to guarantee a high recognition rate over a wide range from low to high levels, so the device has to be set so that the maximum recognition rate is obtained at an average speech level. In that case, the required recognition rate cannot be obtained for low-level speech input.

[0003] For this reason, it has been proposed to add an AGC (automatic gain control circuit) to the microphone amplifier so that a substantially constant speech input is always obtained (see Japanese Utility Model Laid-Open No. 59-60700). However, such an AGC can handle only about 40 dB at most and, as noted above, cannot adequately cope with individual differences of several tens of dB.

[0004] It has also been proposed to inform the speaker of his or her speech level by suitable means and to prompt the speaker to speak at the optimum level (see Japanese Utility Model Laid-Open No. 01-137497 and Japanese Patent Laid-Open No. 63-014200). However, speech level is peculiar to each individual. Forcing a speaker to change it makes the utterance itself unnatural and may actually lower the recognition rate. Moreover, even within one person's utterance, consonants and vowels differ in power by 10 dB or more, so setting an optimum level is difficult in practice.

[0005]

[Problem to Be Solved by the Invention] Accordingly, the technical problem addressed by the present invention is to obtain the maximum recognition rate automatically in accordance with the speaker's speech level.

[0006]

[Means for Solving the Problem] To this end, the present invention provides a speech recognition device comprising: a microphone; a microphone amplifier that amplifies the speech signal output of the microphone; an A/D converter that converts the output signal of the microphone amplifier into a digital signal; a feature extraction unit that extracts features of the speech signal output from the A/D converter; a speech section detection unit that detects speech sections from the speech signal; an input pattern storage unit that stores the feature-extracted pattern of each detected speech section as an input pattern; and a speech recognition unit that identifies the input pattern by comparing it with dictionary patterns. The device is characterized in that a plurality of gain-specific hot key dictionaries are provided, together with means which, when a hot key is input, computes the similarity against each of these hot key dictionaries and sets the gain of the speech recognition device for subsequently input speech according to the gain assigned to the hot key dictionary that yielded the maximum similarity.

[0007] That is, the present invention uses the hot key input and, according to the speech level of the input hot key, sets the gain of the device as a whole so that the maximum recognition rate is obtained. Here, the gain of the device as a whole is not limited to the gain of the microphone amplifier, which directly amplifies the speech input. It also covers means that indirectly control the gain applied to the speech input, for example the reference voltage applied to the A/D converter, or a dictionary with which the maximum recognition rate is obtained in the low-level region.

[0008] More specifically, a plurality of gains may be made selectively settable for the microphone amplifier, and the optimum gain may be selected for the microphone amplifier according to the gain assigned to the hot key dictionary that yielded the maximum similarity. Alternatively, the reference voltage of the A/D converter may be set optimally.

[0009] Alternatively, a plurality of gain-specific dictionaries may be prepared, and subsequent recognition may be performed with the dictionary selected according to the gain of the gain-specific hot key dictionary that yielded the maximum similarity.

[0010]

[Operation and Effect] According to the present invention, the hot key input is used to set the gain of the speech recognition device as a whole optimally for the speaker's speech level, so that speech recognition can always be performed at the maximum recognition rate without forcing the speaker into unnatural utterance.

[0011] According to the invention of claim 2, the maximum recognition rate can be guaranteed by preparing a plurality of gain-specific microphone amplifiers and selecting the one best suited to the speaker's speech level. According to the invention of claim 3, the maximum recognition rate can be guaranteed by setting the reference voltage of the A/D converter optimally for the speaker's speech level.

[0012] Furthermore, according to the invention of claim 4, the maximum recognition rate can be secured by preparing a plurality of gain-specific dictionaries and selecting the dictionary that matches the speaker's speech level.

[0013]

[Embodiments] Embodiments of the present invention are described in detail below.

(Basic system) FIG. 2 shows the basic system of the speech recognition device according to the present invention. The speech signal input from microphone 1 is pre-processed by pre-processing unit 2, amplified by the automatic gain control circuit (AGC) that constitutes microphone amplifier 3, separated into frequency bands by filter bank 4, converted into a digital signal by A/D converter 5, and input to processing unit 6, where feature extraction is performed. The speech signal of each frequency band is also input to speech section detection circuit 7, which detects speech sections. The feature-extracted speech pattern, delimited by the detected speech sections, is stored in input pattern storage unit 8 as an input pattern.

[0014] The stored input pattern is matched against each template of dictionary 10 in recognition and collation unit 9, and similarity calculation unit 11 computes the similarity for each. Similarity calculation unit 11 detects the template that gives the maximum similarity among those computed and outputs it to control unit 12, which controls the entire system. Control unit 12 outputs the recognition result corresponding to that template to result output unit 13, which displays the result.
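
The matching step of paragraph [0014] can be sketched as follows. This is a minimal illustration only: the source does not specify the similarity measure or the feature representation, so cosine similarity over plain feature vectors, the names `cosine_similarity` and `best_template`, and the sample templates are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_template(input_pattern, dictionary):
    """Return (label, similarity) of the template most similar to the input pattern."""
    scored = (
        (label, cosine_similarity(input_pattern, template))
        for label, template in dictionary.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical two-word dictionary standing in for dictionary 10.
dictionary_10 = {"start": [1.0, 0.0, 0.5], "stop": [0.0, 1.0, 0.5]}
label, score = best_template([0.9, 0.1, 0.4], dictionary_10)
print(label)
```

Here `best_template` plays the role of similarity calculation unit 11, and the returned label is what control unit 12 would forward to result output unit 13.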

[0015] The speech recognition device described above begins operation upon a so-called hot key input. When a hot key is input, selector 15 selects among the plurality of hot key dictionaries 14 provided for each gain, and recognition and collation unit 9 performs matching using the hot key dictionaries 14 to detect the hot key dictionary that gives the maximum similarity.

[0016] FIG. 3 shows the hot key input processing routine executed by control unit 12, which sets the volume level of the system optimally.

[0017] When the speech signal input stored in input pattern storage unit 8 is passed to recognition and collation unit 9, step S1 first checks a flag to determine whether hot key processing has been completed; if it has, normal recognition processing is executed. If the flag is not set, that is, if hot key processing has not been completed, step S2 determines whether the speech input is the hot key. If it is, step S3 compares similarities using the hot key dictionaries 14 prepared in advance for each gain. The gain-specific hot key dictionaries are created, for example, at speech levels of ±6 dB and ±12 dB centered on the standard speech level (0 dB), and the hot key dictionary that gives the maximum similarity to the input hot key is detected. Step S4 estimates the speaker's speech level from the gain of that hot key dictionary. If, for example, the -6 dB hot key dictionary gives the maximum similarity, step S5 sets the speech level of the system as a whole so that the recognition rate of the system is maximized at a speech level 6 dB below the standard level. Once this setting is complete, step S6 sets the flag indicating the end of hot key processing, and the routine returns.
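
The flow of steps S1 to S6 can be sketched as below. This is an illustrative sketch under stated assumptions, not the routine of FIG. 3 itself: the class name `HotKeyController`, the callables `similarity` and `is_hot_key`, and the return strings are hypothetical, and the source does not say what happens at S2 when the input is not the hot key (here the routine simply returns without changing anything).

```python
# Level offsets of the gain-specific hot key dictionaries, as given in the example.
HOTKEY_DICTIONARY_OFFSETS_DB = (-12, -6, 0, 6, 12)


class HotKeyController:
    def __init__(self, similarity, is_hot_key):
        self.similarity = similarity    # hypothetical: (pattern, offset_db) -> float
        self.is_hot_key = is_hot_key    # hypothetical: pattern -> bool
        self.hot_key_done = False       # flag checked in S1 and set in S6
        self.level_offset_db = 0        # system-wide level setting

    def handle(self, pattern):
        if self.hot_key_done:                       # S1: already configured
            return "recognize"
        if not self.is_hot_key(pattern):            # S2: not the hot key (assumed: ignore)
            return "ignore"
        best_offset = max(                          # S3: compare all hot key dictionaries
            HOTKEY_DICTIONARY_OFFSETS_DB,
            key=lambda offset: self.similarity(pattern, offset),
        )
        self.level_offset_db = best_offset          # S4 and S5: estimate level, set system
        self.hot_key_done = True                    # S6: mark hot key processing finished
        return "configured"


# Toy stand-ins: a "pattern" is just its level in dB relative to the standard.
controller = HotKeyController(
    similarity=lambda pattern, offset: -abs(pattern - offset),
    is_hot_key=lambda pattern: True,
)
first = controller.handle(-5)
second = controller.handle(3)
```

With the toy similarity above, a hot key spoken about 5 dB below the standard level matches the -6 dB dictionary best, so the system is configured for -6 dB and later inputs go straight to normal recognition.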

[0018] (Speech level setting methods for the system)

(Method 1) FIG. 4 shows one example of a speech level setting method. As illustrated, this example provides gain-specific microphone amplifiers 3-1, ..., 3-5 corresponding to the five gain-specific hot key dictionaries 14-1, ..., 14-5, and selects the microphone amplifier whose gain corresponds to the input speech level.

[0019] That is, relative to a standard speech level of K dB, speech levels of ±12 dB and ±6 dB are anticipated, and microphone amplifier selector 20 selects among a total of five microphone amplifiers 3-1, 3-2, ..., 3-5, each given the optimum gain for its speech level. In the hot key processing shown in FIG. 2, when, for example, the +6 dB hot key dictionary 14-2 is determined to give the maximum similarity, this result is temporarily stored in most-similar dictionary storage unit 21. Microphone amplifier selector 20 reads the type of hot key dictionary stored in dictionary storage unit 21 and, if it is the +6 dB hot key dictionary 14-2, selects the corresponding (K+6) dB microphone amplifier 3-2.
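
The selection performed by microphone amplifier selector 20 amounts to a lookup keyed by the stored dictionary offset, as in the sketch below. Only the pairing of the +6 dB dictionary 14-2 with amplifier 3-2 comes from the text; the ordering of the other four amplifiers, the value used for K, and the names `MicAmp`, `build_amp_bank`, and `select_amp` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicAmp:
    name: str               # reference number in FIG. 4, e.g. "3-2"
    target_level_db: float  # speech level the amplifier is tuned for


def build_amp_bank(standard_level_db):
    """Map each dictionary offset (dB) to its amplifier; the ordering is assumed."""
    offsets = (12, 6, 0, -6, -12)
    return {
        offset: MicAmp(f"3-{index}", standard_level_db + offset)
        for index, offset in enumerate(offsets, start=1)
    }


def select_amp(amp_bank, stored_dictionary_offset_db):
    """Pick the amplifier matching the offset held in dictionary storage unit 21."""
    return amp_bank[stored_dictionary_offset_db]


K = 60.0  # hypothetical standard speech level in dB; the source leaves K unspecified
bank = build_amp_bank(K)
amp = select_amp(bank, 6)
print(amp.name, amp.target_level_db)
```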

[0020] Although FIG. 4 does not correspond exactly to FIG. 2, parts identical to those shown in FIG. 2 bear the same reference numbers and are not described further.

[0021] (Method 2) FIG. 5 shows another example of a speech level setting method. In this case, the reference voltage applied to A/D converter 5 is selected according to the speech level. That is, relative to the reference voltage corresponding to the standard speech level (labeled 0 dB), reference voltages of ±12 dB and ±6 dB are made selectable in advance, and selector 22 selects the reference voltage corresponding to the gain of the hot key dictionary stored in most-similar dictionary storage unit 21. In FIG. 5, as in FIG. 4, parts identical to those in FIG. 2 bear the same reference numbers and are not described further.
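
As a rough numerical illustration of Method 2, a level offset in dB can be converted to a voltage ratio with the standard amplitude relation 10^(dB/20). The sketch below assumes that the reference voltage is raised for louder speakers so that the signal occupies the same fraction of the converter's range; the source states neither the direction of the adjustment nor any voltage values, so the 1.0 V standard reference and the function name `reference_voltage` are hypothetical.

```python
def reference_voltage(standard_vref, offset_db):
    """Scale the standard reference voltage by an amplitude ratio of 10^(dB/20)."""
    return standard_vref * 10 ** (offset_db / 20)


STANDARD_VREF = 1.0  # hypothetical reference voltage for the 0 dB setting, in volts
selectable = {
    offset: round(reference_voltage(STANDARD_VREF, offset), 3)
    for offset in (-12, -6, 0, 6, 12)
}
print(selectable)
```

Under these assumptions each 6 dB step roughly doubles or halves the reference voltage, which is why five taps spaced 6 dB apart span a 16-to-1 voltage range.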

[0022] (Method 3) FIG. 6 shows a further example of a speech level setting method. As illustrated, in this example the dictionary used for normal speech recognition is provided for each gain. Specifically, in addition to the dictionary created at the standard speech level (labeled the 0 dB dictionary), dictionaries created at speech levels differing from the standard by ±6 dB and ±12 dB are prepared in advance, giving a total of five dictionaries 10-1, 10-2, ..., 10-5 that correspond to the gain-specific hot key dictionaries 14-1, 14-2, ..., 14-5. If the hot key dictionary stored in most-similar dictionary storage unit 21 is, for example, the +6 dB dictionary 14-2, selector 23 selects the corresponding +6 dB dictionary 10-2. In FIG. 6, as in FIG. 4, parts identical to those in FIG. 2 bear the same reference numbers and are not described further.

[Brief description of drawings]

FIG. 1 is a graph showing the relationship between input speech level and recognition rate.

FIG. 2 is a schematic system diagram of the speech recognition device according to the present invention.

FIG. 3 is a flowchart of the hot key input processing executed in the present invention.

FIG. 4 is a schematic explanatory diagram showing one embodiment of the present invention.

FIG. 5 is a schematic explanatory diagram showing another embodiment of the present invention.

FIG. 6 is a schematic explanatory diagram showing yet another embodiment of the present invention.

[Explanation of symbols]

1 ... Microphone
3 ... Microphone amplifier
3-1, 3-2, ..., 3-5 ... Gain-specific microphone amplifiers
5 ... A/D converter
6 ... Processing unit
7 ... Speech section detection unit
9 ... Recognition and collation unit
10 ... Dictionary
11 ... Similarity calculation unit
12 ... Control unit
14 ... Hot key dictionary
10-1, 10-2, ..., 10-5 ... Gain-specific dictionaries
14-1, 14-2, ..., 14-5 ... Gain-specific hot key dictionaries

Claims

1. A speech recognition device comprising: a microphone; a microphone amplifier that amplifies the speech signal output of the microphone; an A/D converter that converts the output signal of the microphone amplifier into a digital signal; a feature extraction unit that extracts features of the speech signal output from the A/D converter; a speech section detection unit that detects speech sections from the speech signal; an input pattern storage unit that stores the feature-extracted pattern of each detected speech section as an input pattern; and a speech recognition unit that identifies the input pattern by comparing it with dictionary patterns, the device being characterized in that a plurality of gain-specific hot key dictionaries are provided, together with means which, when a hot key is input, computes the similarity using each of the plurality of hot key dictionaries and sets the gain of the speech recognition device for subsequently input speech according to the gain assigned to the hot key dictionary for which the maximum similarity is obtained.

2. The speech recognition device according to claim 1, wherein a plurality of gains are selectively settable for the microphone amplifier, and the gain of the microphone amplifier is selected according to the gain assigned to the hot key dictionary for which the maximum similarity is obtained.

3. The speech recognition device according to claim 1, wherein the reference voltage of the A/D converter is set according to the gain assigned to the hot key dictionary for which the maximum similarity is obtained.

4. The speech recognition device according to claim 1, wherein a plurality of gain-specific dictionaries are provided, and the dictionary corresponding to the gain assigned to the hot key dictionary for which the maximum similarity is obtained is selected.