JPS62267800A

JPS62267800A - Voice recognition control system

Info

Publication number: JPS62267800A
Application number: JP61110715A
Authority: JP
Inventors: 義注太田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1986-05-16
Filing date: 1986-05-16
Publication date: 1987-11-20

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a speech recognition device, and in particular to a speech recognition control method that raises the recognition rate by reliably and efficiently correcting the misrecognitions and rejections that occur when ambient noise is loud.

[Conventional technology]

A speech recognition device compares the registered speech patterns with the input speech pattern by pattern matching and computes a numerical value, the distance, that indicates how similar the patterns are. If the distance is smaller than a preset fixed threshold D_TH, the device outputs the candidate with the smallest distance as the recognition result; if it is larger, the device judges the result too doubtful to output and rejects it. Input speech, however, is subject to various sources of variation, such as changes in the speaker's physical condition or the intrusion of ambient noise, and with current technology it is difficult to achieve the 100% recognition rate that a switch, for example, provides. Correction processing for misrecognition and the like is therefore unavoidable.
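The accept-or-reject decision described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `decide`, the dictionary of distances keyed by candidate label, and the treatment of a distance exactly equal to the threshold as a rejection are not taken from the source.

```python
from typing import Dict, Optional


def decide(distances: Dict[str, float], d_th: float) -> Optional[str]:
    """Return the closest candidate, or None when the input is rejected.

    distances maps a candidate label to its pattern distance
    (hypothetical layout; a smaller distance means a closer match).
    """
    if not distances:
        return None
    best = min(distances, key=distances.get)
    # Assumption: a distance equal to the threshold counts as doubtful.
    if distances[best] < d_th:
        return best
    return None
```

For example, `decide({"a": 0.2, "b": 0.5}, 0.4)` returns `"a"`, while `decide({"a": 0.6, "b": 0.9}, 0.4)` returns `None`, which corresponds to a rejection.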

As described in Japanese Unexamined Patent Publication No. 53-77402, conventional devices displayed one or more candidates for the input speech and had the speaker select among them.

In addition, as described in Japanese Unexamined Patent Publication No. 59-185394, the rejection threshold D_TH was made variable according to the ambient noise level so that misrecognitions and rejections were kept to a minimum regardless of the ambient noise level.

[Problem that the invention seeks to solve]

The prior art described above has the following problems.

(1) Because the operator must look at a display and select from it, an operator who cannot look away, for example a driver at the wheel of a car, cannot perform the correction.

(2) The number of candidates displayed is fixed at one or at some set number regardless of ambient noise, so when the ambient noise becomes loud the correct answer may be absent from the displayed candidates, making correction impossible.

(3) Even if the rejection threshold D_TH is varied according to ambient noise, misrecognitions and rejections do not fall to zero, so a correction procedure is still needed.

The prior art thus had problems such as these.

An object of the present invention is to provide a speech recognition control method that, in a speech recognition device used by the driver of an automobile or the like, achieves safe operation and stable behavior regardless of ambient noise (road noise).

[Means for solving problems]

The above object is achieved as follows.

(1) Means for speaking the recognition result candidates aloud and means for selecting one of the candidates by voice are provided. The candidates are spoken one after another in descending order of similarity, with a silent interval inserted after each, and the speaker, who has been listening, utters a sound during the silent interval immediately after the correct candidate, thereby selecting it through the selection means.

(2) Means for detecting the ambient noise level is provided, and the number of candidates spoken is varied according to the noise level.

(3) Volume control means for the candidate announcements is provided, and the announcement volume is varied according to the level detected by the ambient noise level detection means.

(4) Speech rate control means for the candidate announcements is provided, and the speech rate is varied according to the level detected by the ambient noise level detection means.

(5) Means is provided for recording and reproducing the registered speech at the same time as the standard pattern for recognition is obtained from it, and the registered speech is spoken through the recognition result candidate announcement means.

(6) Pitch frequency conversion means is provided, and the pitch frequency of the latter part of each spoken candidate is converted so that the candidate sounds like a question asked back to the speaker.

[Operation]

According to the speech recognition control method of the present invention, the system operates as follows.

(1) Confirmation of the recognition result for the input speech and selection of the correct candidate are both done by voice. The operator can therefore carry out the correction without looking away and without using the hands.

(2) When the ambient noise level rises, the number of candidates spoken is automatically increased, widening the range of choices. This prevents the case in which the item to be selected is absent from the spoken candidates and the operator is forced to repeat the voice input.

(3) When the ambient noise level rises, the volume of the candidate announcements is automatically raised. The operator can therefore hear the candidates without interference from ambient noise and select correctly, which eliminates erroneous operation.

(4) When the ambient noise level rises, the number of candidates spoken for selection increases, but the speech rate is raised as well. The time needed to reach a correct recognition result can therefore be kept constant regardless of the ambient noise level, reducing operator frustration.

(5) The candidates are announced in the registered speech, so the operator hears and learns the registered speech at every voice input. The operator's input speech thereby comes to match the registered speech, and misrecognition is reduced.

(6) The pitch frequency of the latter half of each candidate announcement rises. The operator can therefore easily tell the announcement apart from the input speech, and the rise in pitch serves as a cue that the silent interval for selection is beginning, so erroneous operation is reduced. In addition, a rising pitch is natural as a questioning response to voice input, so the operator has the feeling of conversing with the device, which reduces frustration.

[Embodiments]

An embodiment of the present invention is described below with reference to FIG. 1.

FIG. 1 shows the following components.

1: microphone A for speech input
2: AD conversion section that adjusts the gain of the input speech signal, applies band limiting and other necessary preprocessing, and then converts the signal to digital values
3: speech interval detection section that detects the speech interval of the input speech from the digital signal according to a predetermined threshold
4: recognition parameter extraction section that analyzes the input speech on the basis of the digital speech signal and extracts the feature parameters used for recognition
5: standard pattern memory that stores the feature parameter patterns extracted by the recognition parameter extraction section at registration as the standard patterns used at recognition
6: standard pattern selection section that selects standard patterns stored in the standard pattern memory
7: pattern matching section that performs pattern matching (similarity calculation) between the input speech and each standard pattern
8: judgment section that ranks the similarities to the input speech obtained by pattern matching
9: microphone B for ambient noise
10: ambient noise detection section that detects the ambient noise level
11: fixed speech synthesis data memory that holds, fixed in advance, the speech data for announcing voice input prompts, operating instructions, rejections, recognition candidate names, and so on
12: fixed speech synthesis data selection section that selects fixed speech data
13: speech synthesis section that synthesizes and outputs speech from the speech synthesis data
14: speaker
15: control section that controls the above sections and performs other necessary processing
16: host device that performs the required service on the basis of the speech recognition result
17: console section for the required key input operations

The speech interval detection section 3 reports the start and end points of the speech to the recognition parameter extraction section 4 and the control section 15. The ambient noise detection section 10 reports the ambient noise level to the control section 15. The recognition parameter extraction section 4 cuts out the input speech and passes it on as a recognition parameter pattern: to the standard pattern memory 5 at registration, and to the pattern matching section 7 during speech recognition.
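As a rough illustration of the threshold-based speech interval detection performed by section 3, the sketch below reports the first and last frames whose energy exceeds a threshold. The per-frame energy input, the function name `detect_interval`, and the choice of frame energy as the measured quantity are assumptions; the source states only that detection follows a predetermined threshold.

```python
from typing import Optional, Sequence, Tuple


def detect_interval(frame_energies: Sequence[float],
                    threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the speech interval, or None.

    A frame is treated as speech when its energy is above the threshold
    (an assumed criterion for this sketch).
    """
    active = [i for i, energy in enumerate(frame_energies) if energy > threshold]
    if not active:
        return None
    # Start point and end point, as reported to sections 4 and 15.
    return active[0], active[-1]
```

With energies `[0.0, 0.1, 0.9, 0.8, 0.2, 0.0]` and a threshold of `0.5`, the function returns `(2, 3)`.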

FIGS. 2 and 3 are flowcharts of the operation during speech registration and speech recognition, respectively.

Before speech recognition, the operator carries out speech registration; this operation is described with reference to FIGS. 1 and 2. First, the operator indicates to the control section 15 and the host device 16, by key input or the like at the console section 17, that registration is to be performed. The control section 15 instructs each section to prepare for registration speech input (step 30 in FIG. 2). Next, the operator enters a service content code at the console section 17 (step 31) and, following the input prompt announcement (step 32), speaks the word corresponding to that content code. For example, the operator enters the code a at the console section 17 and then, after the input prompt, says "A".

When the operator speaks into microphone A 1 (step 33), the AD conversion section 2 converts the speech to a digital signal, the speech interval detection section 3 detects the speech in the digital signal, and the recognition parameter extraction section 4 extracts the feature parameters of the speech (step 34) and stores them in the standard pattern memory 5 (step 35).

The above processing is repeated to register the standard speech patterns for recognition. For example, for key inputs of the content codes b, c, d, e, ..., the operator says "B", "C", "D", "E", and so on. This completes speech registration.

Next, the speech recognition operation is described with reference to FIGS. 1 and 3.

First, the operator indicates from the console section 17 to the control section 15 and the host device 16 that recognition is to be performed. The control section 15 instructs each section to prepare for speech input in recognition processing (step 40 in FIG. 3). When these preparations are complete, an input prompt message urging the operator to speak is announced from the speaker 14 via the speech synthesis section 13. This is done by passing the corresponding message data from the fixed speech synthesis data memory 11 through the fixed speech synthesis data selection section 12 to the speech synthesis section 13 (step 41).

When the operator then speaks into microphone A 1 (step 42), the AD conversion section 2 converts the speech to a digital signal, the speech interval detection section 3 detects the speech in the digital signal, and the recognition parameter extraction section 4 extracts the feature parameters of the speech (step 43).

The pattern matching section 7 performs pattern matching between the feature parameter pattern of the input speech and each registered standard speech pattern stored in the standard pattern memory 5 and selected by the standard pattern selection section 6, computes the similarity of each standard speech pattern to the input speech, and passes the results to the judgment section 8 (step 44). The judgment section 8 ranks the standard patterns by similarity and reports the ranking, together with the similarities, to the control section 15. The pattern with the best similarity (the most likely one) is ranked first, the next is ranked second, and so on (step 45).
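The matching and ranking of steps 44 and 45 can be sketched as follows. The source does not say how the distance is computed, so the Euclidean distance between equal-length feature vectors used here is purely an assumption, as are the names `distance` and `rank_patterns`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_patterns(input_pattern: Sequence[float],
                  standard_patterns: Dict[str, Sequence[float]]
                  ) -> List[Tuple[str, float]]:
    """Return (content code, distance) pairs, most likely candidate first."""
    scored = [(code, distance(input_pattern, pattern))
              for code, pattern in standard_patterns.items()]
    # Smallest distance = best similarity = rank 1.
    return sorted(scored, key=lambda item: item[1])
```

The first element of the returned list plays the role of the first-ranked candidate, the second element the second-ranked candidate, and so on.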

If the similarity value of even the most likely candidate exceeds the predetermined threshold (the value is a distance, so a larger value means a poorer match), outputting it as the result would be doubtful and the input is rejected. In that case the control section 15 has a re-input prompt message announced from the speaker 14 via the speech synthesis section 13 to urge the operator to speak again. This is done by passing the corresponding message data from the fixed speech synthesis data memory 11 through the fixed speech synthesis data selection section 12 to the speech synthesis section 13 (step 46).

If the input is not rejected, the control section 15 confirms the recognition result as follows. In the order of the ranking decided by the judgment section 8, it takes the corresponding candidate name speech synthesis data from the recognition candidate name speech data stored in advance in the fixed speech synthesis data memory 11, passes it through the fixed speech synthesis data selection section 12 to the speech synthesis section 13, and announces the candidates one by one from the speaker 14 (step 47).

For example, if the first-ranked pattern is the standard pattern corresponding to code a, the candidate name speech synthesis data for recognition candidate name a is synthesized and "A" is announced. If the second-ranked pattern is the standard pattern corresponding to code b, the candidate name speech synthesis data for recognition candidate name b is synthesized next and "B" is announced, and so on.

These announcements are controlled by the control section 15 so that a suitable pause (silent interval) is inserted between one announcement and the next. How far down the ranking the announcements go is decided as follows: microphone B 9 picks up the ambient noise, and the control section 15 monitors the ambient noise level detected by the ambient noise detection section 10 just before and after the input utterance and sets the number accordingly (step 42).

FIG. 1A shows the cumulative recognition rate. In the figure, (a) is the case without ambient noise and (b) is the case with ambient noise. In (a) the recognition rate reaches 100% at the third-ranked candidate, which means that if the top three recognition result candidates are taken, the correct result is always among them. As the figure shows, when ambient noise is mixed in, the similarity between registered speech and input speech falls overall and the recognition rate drops. In (b) the recognition rate reaches 100% at the top five. The degree to which the recognition rate drops is proportional to the ambient noise level.

On the basis of this finding, the present embodiment varies the number of candidate names announced according to the ambient noise level.
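One way to picture this rule is a function that maps the measured noise level to the number of candidates to announce. The sketch below is an assumption-laden illustration: the source gives only the two data points from FIG. 1A (three candidates suffice without noise, five with noise) and says the degradation is proportional to the noise level. The level scale, the quiet reference level, the step size, and the upper limit are all hypothetical parameters.

```python
import math


def candidate_count(noise_level: float,
                    quiet_level: float = 40.0,
                    base_count: int = 3,
                    step: float = 10.0,
                    max_count: int = 8) -> int:
    """Number of candidates to announce for a given ambient noise level.

    One extra candidate is added for every `step` units of noise above
    `quiet_level` (all default values are illustrative assumptions).
    """
    excess = max(0.0, noise_level - quiet_level)
    extra = math.ceil(excess / step)
    return min(max_count, base_count + extra)
```

With these illustrative defaults, a level of 40 gives three candidates and a level of 60 gives five, mirroring the two cases of FIG. 1A.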

On hearing the correct candidate, the operator answers during the silent interval, for example with "hai" (yes). This utterance enters through microphone A 1, is converted to a digital signal by the AD conversion section 2, is detected by the speech interval detection section 3, and is reported to the control section 15 (step 48).

If any speech input occurs immediately after a candidate name is announced, that is, during the silent interval, the control section 15 judges that candidate to be the correct recognition result and passes its content code to the host device (step 49). The host device then performs the prescribed service according to the result.

Candidate name announcements are repeated under the control of the control section 15 up to the number determined earlier from the ambient noise level (step 50). If the operator gives no response by the time that number has been announced, the input is treated as a rejection and the re-input prompt is announced (step 46).
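The confirmation dialogue of steps 47 to 50 amounts to a short loop: announce a candidate, listen during the pause, and stop at the first response. The sketch below models the speaker and the speech detector as callables supplied by the caller; the names `confirm_by_voice`, `speak`, and `voice_in_pause` are hypothetical and stand in for sections 13/14 and section 3 respectively.

```python
from typing import Callable, Optional, Sequence


def confirm_by_voice(ranked_codes: Sequence[str],
                     count: int,
                     speak: Callable[[str], None],
                     voice_in_pause: Callable[[], bool]) -> Optional[str]:
    """Announce up to `count` candidates in rank order.

    Returns the code whose following silent interval contained speech,
    or None when no response was heard (treated as a rejection).
    """
    for code in ranked_codes[:count]:
        speak(code)              # candidate announcement (step 47)
        if voice_in_pause():     # any speech in the silent interval (step 48)
            return code          # judged to be the correct result (step 49)
    return None                  # no response after `count` announcements
```

A `None` result would lead to the re-input prompt; any other result is the content code passed to the host device.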

The control section 15 monitors the console section 17 for a key input ending speech recognition; if none has been made, it announces the input prompt again and repeats the same speech recognition processing (step 51).

The description above concerns a speaker-dependent recognition device in which the operator registers his or her own voice. If patterns created in advance from the speech patterns of many people by processing such as averaging (average patterns) are registered as the standard patterns, the device operates as a speaker-independent recognition device. In that case the registration procedure shown in FIG. 2 is unnecessary.

The similarity used for recognition was obtained by pattern matching, but other means such as discriminant functions may be used.

Microphone B 9, which picks up ambient noise, may be replaced by microphone A 1 used for speech input, and the input to the ambient noise detection section 10 may be the output of the AD conversion section 2.

The response to a candidate name announcement was judged by the presence or absence of speech, but this may be replaced with processing that recognizes "hai" (yes) and "iie" (no). The response may also be given by means other than voice, for example by key input at the console section 17.

Candidate names were announced by speech synthesis, but the invention is not limited to this; any device that can record and play back speech, such as a cassette tape recorder, may be used.

FIG. 4 shows another recognition processing flowchart for the embodiment of FIG. 1. In FIG. 4, the same reference numerals as in FIG. 3 denote the same steps.

In FIG. 3 the number of recognition candidate names announced is determined by the ambient noise level alone; in FIG. 4 a further limit based on the rejection threshold is added.

After the check of whether the set number of announcements has been completed (step 50), if further announcements remain, the next candidate is announced if its similarity value is at or below the rejection threshold. If the value is above the threshold, the announcements are cut off even though the set number has not been reached, the input is treated as a rejection (step 52), and the re-input prompt is announced.
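The additional limit of FIG. 4 can be expressed as a filter over the ranked list: stop at the set number of candidates or at the first candidate whose value exceeds the rejection threshold, whichever comes first. The function name `candidates_to_announce` and the list-of-pairs input are assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple


def candidates_to_announce(ranked: Sequence[Tuple[str, float]],
                           count: int,
                           d_th: float) -> List[str]:
    """Select the candidates that may be announced.

    ranked holds (content code, distance) pairs, best candidate first.
    """
    selected = []
    for code, dist in ranked[:count]:
        if dist > d_th:
            # Above the rejection threshold: cut the announcements short.
            break
        selected.append(code)
    return selected
```

For `ranked = [("a", 0.1), ("b", 0.3), ("c", 0.7)]`, a set number of 3, and a threshold of 0.5, only `"a"` and `"b"` are announced.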

FIG. 5 shows another embodiment of the present invention. In FIG. 5, the same reference numerals as in FIG. 1 denote the same components.

18 is a volume control section that controls the amplitude, that is, the volume, of the speech signal synthesized by the speech synthesis section 13.

The volume control section 18 controls the volume of the speech signal according to the ambient noise level detected by the ambient noise detection section 10, and the signal is output from the speaker 14. When the ambient noise becomes loud, the volume of the candidate name announcements is thus raised so that the operator can hear them easily. Registration and recognition processing in the embodiment of FIG. 5 are the same as in FIGS. 2, 3, and 4.
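A possible shape for the volume control law is a gain that grows with the noise level and saturates at a maximum. The source does not give the control characteristic, so the linear mapping, the decibel units, and every default value in the sketch below are assumptions.

```python
def playback_gain_db(noise_level_db: float,
                     quiet_level_db: float = 40.0,
                     base_gain_db: float = 0.0,
                     slope: float = 0.5,
                     max_gain_db: float = 20.0) -> float:
    """Gain applied to the candidate announcements, in dB.

    The gain rises linearly with the noise above `quiet_level_db`
    and is clamped at `max_gain_db` (illustrative values only).
    """
    excess = max(0.0, noise_level_db - quiet_level_db)
    return min(max_gain_db, base_gain_db + slope * excess)
```

Clamping the gain keeps the announcement from growing without bound in very loud conditions; whether the actual device does so is not stated in the source.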

FIG. 6 shows yet another embodiment of the present invention. In FIG. 6, the same reference numerals as in FIG. 1 denote the same components.

19 is a speech rate control section for the speech synthesized by the speech synthesis section 13. The speech rate control section 19 controls the rate of the candidate name speech according to the ambient noise level detected by the ambient noise detection section 10, and the speech is output from the speaker 14. As a result, even when the ambient noise becomes loud and the number of candidate names announced increases, the operator's listening time does not increase. The silent interval inserted between candidate name announcements may be of fixed length or may be varied according to the ambient noise level.
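To see what keeping the listening time constant implies, consider the case of a fixed-length silent interval. If the baseline is `base_count` candidates, each lasting `word_s` seconds and followed by a pause of `pause_s` seconds, then announcing `count` candidates in the same total time requires each word to be shortened accordingly. The sketch below computes the resulting speed-up factor; the function name and all timing values are assumptions, not figures from the source.

```python
def speech_rate_factor(count: int,
                       base_count: int = 3,
                       word_s: float = 0.5,
                       pause_s: float = 0.25) -> float:
    """Speed-up factor that keeps total announcement time constant.

    Assumes a fixed pause after every candidate. Raises ValueError when
    the pauses alone would already use up the available time.
    """
    budget = base_count * (word_s + pause_s)   # total time at the baseline
    per_word = budget / count - pause_s        # time left for each word
    if per_word <= 0:
        raise ValueError("too many candidates for a fixed pause length")
    return word_s / per_word
```

With these illustrative numbers, three candidates need no speed-up (factor 1.0) and five candidates need a factor of about 2.5. The error case suggests why the source also allows the pause length to vary with the noise level.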

Registration and recognition processing in the embodiment of FIG. 6 are the same as in FIGS. 2, 3, and 4.

FIG. 7 shows another embodiment of the present invention. In FIG. 7, the same reference numerals as in FIG. 1 denote the same components.

20 is a speech synthesis data extraction section that compresses the information in the input speech on the basis of the input digital speech signal and extracts the data used for synthesis; 21 is a standard speech synthesis data memory that stores the data extracted by the speech synthesis data extraction section 20 at registration; and 22 is a standard speech synthesis data selection section that selects standard speech synthesis data stored in the standard speech synthesis data memory 21.

In the embodiment of FIG. 1, recognition candidate names were announced using candidate name speech synthesis data stored in advance in the fixed speech synthesis data memory 11. In the present embodiment, at registration, the standard speech synthesis data for announcing the recognition candidate name is obtained at the same time as the corresponding standard speech pattern for recognition. The registered speech is thereby used for announcing recognition candidate names during recognition.

FIG. 8 shows the registration processing flowchart for this embodiment. In FIG. 8, the same reference numerals as in FIG. 2 denote the same steps. When the operator speaks into microphone A 1 (step 33), the AD conversion section 2 converts the input speech to a digital speech signal, and the recognition parameter extraction section 4 extracts the feature parameters of the speech (step 34) and stores them in the standard pattern memory 5 (step 35). At the same time, the speech synthesis data extraction section 20 compresses the speech, extracts the resulting data (step 36), and stores it in the standard speech synthesis data memory 21 (step 37).

The above processing is repeated to register the standard speech patterns for recognition and, at the same time, to store data from the same utterances as standard speech synthesis data.
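The dual registration of FIG. 8 can be pictured as two stores filled from the same utterance and keyed by the same content code. In the sketch below the class name `Registry`, its method names, and the two callables for feature extraction and compression are hypothetical stand-ins for sections 4 and 20 and memories 5 and 21; the source does not specify either algorithm.

```python
from typing import Callable, Dict, List, Sequence


class Registry:
    """Holds a recognition pattern and playback data for each content code."""

    def __init__(self) -> None:
        self.standard_patterns: Dict[str, List[float]] = {}   # memory 5
        self.synthesis_data: Dict[str, List[float]] = {}      # memory 21

    def register(self,
                 code: str,
                 samples: Sequence[float],
                 extract_features: Callable[[Sequence[float]], List[float]],
                 compress: Callable[[Sequence[float]], List[float]]) -> None:
        # Both entries come from the same utterance (steps 34 to 37).
        self.standard_patterns[code] = extract_features(samples)
        self.synthesis_data[code] = compress(samples)

    def playback_data(self, code: str) -> List[float]:
        """Data used to announce this candidate in the registered voice."""
        return self.synthesis_data[code]
```

Because both stores share the content code as a key, the ranking produced from `standard_patterns` can be used directly to look up the playback data for each candidate.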

Recognition processing is almost the same as in FIGS. 3 and 4 and differs only in the recognition candidate announcement of step 47. For the announcement, the standard speech synthesis data taken from the same utterance that was registered as the standard speech pattern is passed from the standard speech synthesis data memory 21 through the standard speech synthesis data selection section 22 to the speech synthesis section 13 and announced from the speaker 14. The operator thus hears his or her own registered speech as the recognition candidate announcement.

FIG. 9 shows another embodiment of the present invention. In FIG. 9, the same reference numerals as in FIG. 7 denote the same components.

23 is a pitch frequency conversion section that converts the pitch frequency of the speech signal synthesized by the speech synthesis section 13.

Registration processing in this embodiment is the same as in FIG. 8, and recognition processing is almost the same as in the embodiment of FIG. 7, differing only in the recognition candidate announcement of step 47. In this embodiment, when a recognition candidate is announced, the control section 15 controls the pitch frequency conversion section 23 to raise the pitch frequency of the latter half of the candidate speech, so that the operator hears it as a question asked back.
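The pitch rise can be described as a per-frame multiplier applied to the pitch frequency: 1.0 over the first half of the word and increasing over the second half. The source says only that the pitch of the latter half is raised; the linear ramp, the frame-based representation, and the default rise ratio in this sketch are assumptions.

```python
from typing import List


def pitch_contour(n_frames: int, rise_ratio: float = 1.3) -> List[float]:
    """Pitch multipliers for one candidate word, one value per frame.

    The first half is left unchanged; the second half ramps linearly
    up to `rise_ratio` at the final frame (assumed contour shape).
    """
    half = n_frames // 2
    contour = [1.0] * half
    tail = n_frames - half
    for i in range(tail):
        frac = (i + 1) / tail
        contour.append(1.0 + (rise_ratio - 1.0) * frac)
    return contour
```

For four frames and a rise ratio of 1.5, the multipliers are `[1.0, 1.0, 1.25, 1.5]`; the peak at the end of the word is what would mark the start of the silent interval for the operator.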

[Effect of the invention]

The present invention has the following effects.

(1) The recognition result for the input speech can be confirmed and the correct candidate selected by voice, so the operator can do so while carrying on with other work and without looking away.

(2) The range of candidates from which the correct recognition result can be chosen widens with the ambient noise level, so the operator is no longer forced to repeat the voice input because of a rejection.

(3) The volume at which the recognition result candidates are emitted increases with the ambient noise level, so the operator can hear the candidates without being hindered by the ambient noise, can select correctly, and makes fewer erroneous operations.

(4) Although the number of recognition result candidates emitted increases with the ambient noise level, the emission speed increases at the same time, so the time needed to obtain the correct recognition result does not change and operator dissatisfaction is avoided.

(5) The candidates are emitted in the registered voice, so the operator hears and learns the registered voice at every input and comes to match his or her input speech to it, which reduces misrecognition.

(6) Because the pitch frequency of the latter half of each candidate voice is raised, the candidate voice is easily distinguished from the input speech. The rise in pitch also signals the start of the silent interval and gives the candidate the intonation of a question asked back, so the operator has the feeling of conversing with the device, which reduces dissatisfaction.
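
Effects (2) to (4) all depend on the ambient noise level, and their relationship can be sketched in Python. This is only an illustration under stated assumptions: the decibel range (50 to 80 dB), the candidate counts (1 to 4), the maximum gain of 2.0, the linear mapping, and the names `PlaybackPlan` and `plan_playback` are hypothetical and do not appear in the specification. The one relationship taken from the text is that the emission speed grows in proportion to the number of candidates, so the total emission time stays constant.

```python
from dataclasses import dataclass


@dataclass
class PlaybackPlan:
    num_candidates: int   # how many recognition candidates to emit
    volume_gain: float    # multiplier applied to the emission volume
    speed_factor: float   # multiplier applied to the emission speed


def plan_playback(noise_db, quiet_db=50.0, loud_db=80.0,
                  base_candidates=1, max_candidates=4, max_gain=2.0):
    """Map an ambient noise level to emission settings (all values assumed)."""
    # Normalise the noise level to the range 0.0 (quiet) .. 1.0 (loud).
    t = (noise_db - quiet_db) / (loud_db - quiet_db)
    t = min(max(t, 0.0), 1.0)

    # Louder surroundings: offer more candidates and emit them louder.
    n = base_candidates + round(t * (max_candidates - base_candidates))
    gain = 1.0 + t * (max_gain - 1.0)

    # Speed up in proportion to the candidate count so that the total
    # emission time equals that of base_candidates at normal speed.
    speed = n / base_candidates
    return PlaybackPlan(n, gain, speed)


if __name__ == "__main__":
    for level in (45.0, 80.0):
        print(level, plan_playback(level))
```

With these assumed values, a quiet room yields one candidate at normal volume and speed, while the loudest setting yields four candidates emitted at four times the speed, so the operator waits no longer than in the quiet case.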

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention. FIG. 1A is a graph showing the recognition rate against candidate rank. FIG. 2 is a flowchart of the registration process of the embodiment of FIG. 1. FIG. 3 is a flowchart of the recognition process of the embodiment of FIG. 1. FIG. 4 is a flowchart of another recognition process of the embodiment of FIG. 1. FIG. 5 is a block diagram showing another embodiment of the present invention. FIG. 6 is a block diagram showing yet another embodiment of the present invention. FIG. 7 is a block diagram showing a different embodiment of the present invention. FIG. 8 is a flowchart of the recognition process of the embodiment of FIG. 7. FIG. 9 is a block diagram showing still another embodiment of the present invention.

1: microphone A; 4: recognition parameter extraction section; 5: standard pattern memory; 9: microphone B; 10: ambient noise detection section; 11: fixed speech synthesis data memory; 13: speech synthesis section; 14: speaker; 18: volume control section; 19: speech rate control section; 20: speech synthesis data extraction section; 21: standard speech synthesis data memory; 23: pitch frequency conversion section.

Claims

[Claims]

1. In a speech recognition device that performs recognition by comparing a speaker's speech input pattern with standard speech patterns, a voice recognition control system characterized by comprising: means for sequentially emitting as voice, in descending order of similarity, the standard speech patterns obtained as recognition result candidates from the comparison; means for detecting the speech uttered by the speaker who, having listened to the sequentially emitted standard speech patterns, selects one of them; means for detecting the ambient noise level; and means for variably controlling, according to the detected noise level, the number of standard speech patterns sequentially emitted from the voice emitting means.

2. The voice recognition control system according to claim 1, wherein the number of standard speech patterns sequentially emitted from the voice emitting means is variably controlled according to both the noise level detected by the noise level detection means and a preset threshold for the similarity obtained when the speech input pattern is compared with the standard speech patterns.

3. The voice recognition control system according to claim 1, wherein the volume of the voice emitted from the voice emitting means is variably controlled according to the noise level detected by the noise level detection means.

4. The voice recognition control system according to claim 1, wherein the emission speed of the voice emitted from the voice emitting means is variably controlled according to the noise level detected by the noise level detection means.