JP2007156076A

JP2007156076A - Voice input evaluation apparatus

Info

Publication number: JP2007156076A
Application number: JP2005350535A
Authority: JP
Inventors: Tsuneo Kato; 恒夫加藤; Masaki Naito; 正樹内藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2005-12-05
Filing date: 2005-12-05
Publication date: 2007-06-21
Anticipated expiration: 2025-12-05
Also published as: JP4678773B2

Abstract

PROBLEM TO BE SOLVED: To inform the user, when speech for recognition is input, whether the manner of speaking or the speaking environment is likely to cause misrecognition, distinguishing each cause of misrecognition.

SOLUTION: Speech is input through a microphone input unit 10 and recognized by a speech recognition unit 20. A background noise level evaluation unit 16 of the microphone input unit 10 determines whether the background noise is at or above a predetermined level. An overflow detection unit 17 determines whether temporally local overflow within the speech section is at or above a level that affects the recognition result. A speech-head truncation detection unit 18 determines whether the head of the speech signal stored in an input speech buffer 12 has been truncated. A message display unit 15 displays the determination results on a display 40 so that they can be distinguished from one another.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech input evaluation apparatus for speech recognition, and in particular to a speech input evaluation apparatus that can notify the user, when speech for recognition is input, whether the user's manner of speaking and speaking environment are appropriate.

The recognition performance of a speech recognition system varies with the user's manner of speaking and the speaking environment. A speech recognition system that merely outputs the most probable recognition result, regardless of how or where the user speaks, does not tell the user whether the manner of speaking is appropriate or whether the environment is quiet enough.

Patent Document 1 describes a speech recognition system that measures the temporal change in the sound pressure and frequency characteristics of environmental noise during periods with no speech input, determines on that basis whether speech recognition is possible, and notifies the user.

Patent Document 2 describes a speech detector that accurately detects consonants based on the average power and the power fluctuation of a speech signal input continuously from outside, and thereby detects the presence of speech (the speech section) without truncating the head of the utterance (speech-head truncation).

Patent Document 3 describes a speech detection device that takes the signal level during a period of small amplitude fluctuation in a continuously input speech signal as the background noise level, and treats the point at which the signal level rises relative to that background noise level as the start of a voiced period, thereby detecting the voiced period from the leading consonant without speech-head truncation.

Patent Document 4 describes a digital audio circuit that amplifies a speech signal input from a microphone with a plurality of amplifiers having different gains, determines whether each amplified signal has overflowed, and selects for A/D conversion the signal of the amplifier with the largest gain that has not overflowed, thereby obtaining a digital signal with low distortion.

Patent Document 1: JP 2004-271596 A
Patent Document 2: JP H11-68586 A
Patent Document 3: JP H06-197049 A
Patent Document 4: JP H07-15260 A

If a speech recognition system merely outputs the most probable recognition result regardless of the user's manner of speaking or speaking environment, the user cannot know whether the manner of speaking is appropriate or whether the environment is quiet enough. As a result, even when misrecognition is caused by an inappropriate manner of speaking or excessive background noise, the user does not understand the cause and may repeat the utterance, and the misrecognition, many times.

The speech recognition system of Patent Document 1 determines whether speech recognition is possible based on the temporal change in the sound pressure and frequency characteristics of environmental noise, so it can notify the user of the noise conditions. However, misrecognition is also caused by factors other than environmental noise, such as speech-head truncation and waveform overflow, and the system of Patent Document 1 gives no consideration to these. When misrecognition has several possible causes, it is desirable to notify the user in a way that identifies the cause.

The speech detector of Patent Document 2 and the speech detection device of Patent Document 3 detect consonants in a continuously input speech signal more accurately so as to detect the speech section without speech-head truncation. They are not intended to determine, and notify the user, that the input speech signal itself is already truncated at its head, as happens for example in a push-to-talk speech recognition system when the user starts speaking before the microphone is turned on.

The digital audio circuit of Patent Document 4 selects a non-overflowed signal from among the signals of amplifiers with different gains, and its application to speech recognition devices is also contemplated. However, it pays no attention to temporally local waveform overflow, which correlates strongly with recognition accuracy, and it does not disclose how to determine the temporally local waveform overflow that causes misrecognition.

An object of the present invention is to solve the above problems and to provide a speech input evaluation apparatus that, when speech for recognition is input, can notify the user whether the user's manner of speaking or speaking environment is likely to cause misrecognition, distinguishing each cause of misrecognition.

To solve the above problems, the present invention provides a speech input evaluation apparatus that evaluates whether speech input for speech recognition is appropriate, comprising at least two of a background noise level evaluation unit, an overflow detection unit, and a speech-head truncation detection unit. The background noise level evaluation unit measures background noise, determines whether the measured background noise is at or above a certain level, and, if so, sends a message that the background noise is too loud. The overflow detection unit detects overflow of the input speech, determines whether the detected overflow is at or above a level that affects the recognition result, and, if so, sends a message that the utterance is too loud. The speech-head truncation detection unit determines whether the head of the input speech has been truncated and, if so, sends a message to that effect. The messages sent from the at least two of these units are distinguishable from one another.

The present invention is also characterized in that the background noise level evaluation unit measures background noise from the input during a predetermined period before the user speaks or while the user is not speaking.

The present invention is also characterized in that the overflow detection unit detects the degree of temporally local overflow within the speech section and determines whether that degree is at a level that affects the recognition result.

The present invention is also characterized in that the overflow detection unit detects the degree of temporally local overflow within the speech section by counting the number of overflowed samples in the time-domain waveform of the speech signal, and determines whether the degree of overflow is at a level that affects the recognition result by thresholding whether this count is at or above a certain value.

The present invention is also characterized in that the overflow detection unit counts the number of overflowed samples in each acoustic analysis frame, which is the unit of speech recognition processing, and takes the maximum of the per-frame counts over the entire duration of the speech signal as the degree of temporally local overflow within the speech section.

The present invention is also characterized in that the overflow detection unit detects the degree of temporally local overflow within the speech section by counting the number of overflowed samples in a predetermined period from the beginning of the speech signal.

The present invention is also characterized in that the speech-head truncation detection unit compares the average power over a fixed period at the head of the speech signal with that at the tail, and determines whether speech-head truncation has occurred based on the difference between the two.

The present invention is also characterized in that the apparatus comprises a microphone for inputting speech for recognition, and the speech-head truncation detection unit obtains, as the average power over a fixed period at the head of the speech signal, the average power of the speech signal over a fixed period immediately after the microphone is turned on.

According to the present invention, when speech for recognition is input, the user is notified, separately for each cause, whether background noise level, overflow, or speech-head truncation is likely to cause misrecognition. The user can therefore respond appropriately instead of fruitlessly repeating the utterance, and is expected to adopt an appropriate manner of speaking in the long term, so an improvement in the recognition completion rate can be expected.

Furthermore, by judging overflow from the degree of temporally local overflow within the speech section, overflow of the input speech can be judged in a way that correlates strongly with recognition accuracy. The degree of temporally local overflow can be determined appropriately either by counting the overflowed samples in the time-domain waveform of the speech signal, or by counting the overflowed samples in each acoustic analysis frame and taking the maximum per-frame count over the entire duration of the speech signal.

Speech-head truncation of the input speech can be determined appropriately by comparing the average power over a fixed period at the head of the speech signal with that at the tail and detecting the difference between them.

When a microphone for inputting speech for recognition is provided, obtaining the average power of the speech signal over a fixed period immediately after the microphone is turned on gives an appropriate measure of the average power at the head of the speech signal, so that in a push-to-talk speech recognition system the user can be effectively informed of speech-head truncation caused by speaking before the microphone was on.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a functional block diagram showing a speech recognition system to which the present invention is applied. A speech recognition system in which a microphone input unit 10 and a speech recognition unit 20 communicate with each other is assumed here.

The microphone input unit 10 extracts features (speech recognition parameters) from the speech that the user inputs through the microphone 30, and transmits the extracted features to the speech recognition unit 20 together with a recognition request. The speech recognition unit 20 performs speech recognition on the features received from the microphone input unit 10 and returns the recognition result (text) to the microphone input unit 10 as the response to the recognition request. The microphone input unit 10 displays the returned recognition result on the display 40. In FIG. 1 the microphone 30 and the display 40 are built into the microphone input unit 10, but they may instead be attached externally.

Next, the microphone input unit 10 is described in detail. It comprises a background noise buffer 11, an input speech buffer 12, a background noise suppression unit 13, an acoustic feature extraction unit 14, a message display unit 15, a background noise level evaluation unit 16, an overflow detection unit 17, and a speech-head truncation detection unit 18.

The background noise buffer 11 stores, as background noise, the signal input from the microphone 30 during a predetermined period before the user speaks. The predetermined period is, for example, 0.2 to 1.0 seconds. If speech-head truncation has occurred and background noise cannot be captured before the utterance, the input signal during a period in which the amplitude of the input speech signal fluctuates little can be used as the background noise. The input speech buffer 12 stores the speech signal of the user's utterance input from the microphone 30 after the user has been prompted to speak.
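The fallback described above, taking a period of small amplitude fluctuation as the background noise, might be sketched as follows. This is only an illustration under stated assumptions: the function name `estimate_noise_segment`, the use of peak-to-peak range as the fluctuation measure, and the 3200-sample window (0.2 s at an assumed 16 kHz sampling rate) do not come from the source.

```python
def estimate_noise_segment(samples, window=3200):
    """Return the non-overlapping window whose amplitude fluctuates least.

    Assumption: fluctuation is measured as the peak-to-peak range of the
    window; the source does not specify the measure.
    """
    if len(samples) < window:
        raise ValueError("signal is shorter than one window")
    best_start = 0
    best_range = None
    for start in range(0, len(samples) - window + 1, window):
        segment = samples[start:start + window]
        fluctuation = max(segment) - min(segment)
        if best_range is None or fluctuation < best_range:
            best_start, best_range = start, fluctuation
    return samples[best_start:best_start + window]
```

The returned segment would then stand in for the pre-utterance recording normally held in the background noise buffer.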

The background noise suppression unit 13 suppresses the background noise contained in the speech signal from the input speech buffer 12 using the background noise stored in the background noise buffer 11, and sends the noise-suppressed speech signal to the acoustic feature extraction unit 14.

The acoustic feature extraction unit 14 performs acoustic analysis of the noise-suppressed speech signal and transmits the extracted features to the speech recognition unit 20.

The background noise level evaluation unit 16 determines whether the background noise stored in the background noise buffer 11 is at or above a certain level. This can be determined by thresholding, and the threshold can be set according to the recognition accuracy of the speech recognizer. If the unit determines that the background noise is at or above the level, it sends a message that the background noise is too loud to the message display unit 15, which displays the message on the display 40.
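A minimal sketch of this thresholding follows, assuming 16-bit PCM samples held in a Python list. The function names, the choice of RMS level in dB relative to full scale, the -40 dB threshold, and the message wording are illustrative assumptions; the source only states that a threshold is applied.

```python
import math


def rms_level_db(samples, full_scale=32768.0):
    """Return the RMS level of the samples in dB relative to full scale."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))


def evaluate_background_noise(noise_samples, threshold_db=-40.0):
    """Return a warning message if the noise level reaches the threshold."""
    # threshold_db is a hypothetical value; in practice it would be tuned
    # against the recognition accuracy of the recognizer in use.
    if rms_level_db(noise_samples) >= threshold_db:
        return "Background noise is too loud."
    return None
```

In such a design the returned string would be handed to the message display unit, and `None` would mean no warning.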

The overflow detection unit 17 detects the degree of temporally local (burst-like) overflow within the speech section and determines whether it is at or above a level that affects the recognition result (that is, leads to misrecognition). If it determines that overflow at or above that level has occurred, it sends a message that the utterance is too loud to the message display unit 15.

Because the degree of temporally local overflow within the speech section correlates strongly with recognition accuracy, overflow can be judged in a way that takes recognition accuracy into account. The relationship between the degree of overflow and the recognition result can be obtained, for example, by learning it in advance. The message display unit 15 displays this message on the display 40.

Whether overflow has occurred can be determined by whether the input speech signal is at or above the maximum value of the processing range, such as that of the A/D conversion. In addition, by setting a minimum value for the speech signal and determining whether the input speech signal is at or below it, a message that the utterance is too quiet can also be sent.
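The two checks above could look like the following sketch, assuming signed 16-bit PCM samples. Treating the negative rail as overflow, the `min_peak` value of 500, and both function names are assumptions made for illustration and are not specified in the source.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def is_overflowed(sample, max_value=INT16_MAX, min_value=INT16_MIN):
    """True if a sample sits at either end of the processing range."""
    # Assumption: clipping at the negative rail is also counted as overflow.
    return sample >= max_value or sample <= min_value


def is_too_quiet(samples, min_peak=500):
    """True if the peak amplitude never exceeds a hypothetical minimum."""
    return max(abs(s) for s in samples) <= min_peak
```

A caller could send a "too quiet" message when `is_too_quiet` is true, in the same way the "too loud" message is sent for overflow.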

The degree of temporally local overflow within the speech section can be detected by counting the number of overflowed samples in the time-domain waveform of the speech signal stored in the input speech buffer 12. Whether the overflow is at or above a level that affects the recognition result can then be determined by thresholding whether this count is at or above a certain value.
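As a rough illustration of the count-and-threshold step described above, assuming signed 16-bit PCM samples: the threshold of 20 samples, the function names, and the message wording are hypothetical, since the source leaves the actual value to be chosen against recognition results.

```python
def count_overflow_samples(samples, max_value=32767, min_value=-32768):
    """Count samples that sit at either end of the processing range."""
    return sum(1 for s in samples if s >= max_value or s <= min_value)


def evaluate_overflow(samples, count_threshold=20):
    """Return a warning message if enough samples have overflowed."""
    # count_threshold is an assumed value, not one given in the source.
    if count_overflow_samples(samples) >= count_threshold:
        return "Your voice is too loud."
    return None
```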

In speech recognition, the speech signal is generally divided into frames of fixed duration (acoustic analysis frames) for processing. The number of overflowed samples can therefore be counted for each frame over the entire duration of the speech signal, and the maximum of the per-frame counts can be used as the degree of temporally local overflow within the speech section.
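The per-frame maximum described above can be sketched as below. The frame length of 400 samples and shift of 160 samples (25 ms and 10 ms at an assumed 16 kHz sampling rate) are common choices in speech front ends but are assumptions here, as is the function name.

```python
def max_overflow_per_frame(samples, frame_length=400, frame_shift=160,
                           max_value=32767, min_value=-32768):
    """Return the largest number of overflowed samples in any one frame."""
    flags = [1 if (s >= max_value or s <= min_value) else 0 for s in samples]
    worst = 0
    # Frames overlap when frame_shift < frame_length; frames near the end
    # of the signal may be shorter than frame_length.
    for start in range(0, len(flags), frame_shift):
        worst = max(worst, sum(flags[start:start + frame_length]))
    return worst
```

Unlike a count over the whole signal, this measure stays small when a few clipped samples are scattered far apart and grows only when they cluster within one frame.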

Recognition accuracy is strongly affected by the signal level near the head of the utterance. The number of overflowed samples in a predetermined period from the beginning of the speech signal, for example 1 to 2 seconds, can therefore also be counted and used as the degree of temporally local overflow within the speech section. This makes it possible to send the message that the utterance is too loud at an early stage of speech input.
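Restricting the count to the head of the signal might be written as follows. The 16 kHz sampling rate, the 1000 ms window, and the function name are assumptions for the sketch; the source gives only the 1 to 2 second range as an example.

```python
def count_initial_overflow(samples, sample_rate=16000, window_ms=1000,
                           max_value=32767, min_value=-32768):
    """Count overflowed samples within the first window_ms of the signal."""
    n = sample_rate * window_ms // 1000
    return sum(1 for s in samples[:n] if s >= max_value or s <= min_value)
```

Because only the first window is inspected, the result is available as soon as that much audio has been buffered.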

The speech-head truncation detection unit 18 determines whether the speech signal stored in the input speech buffer 12 has a truncated head and, if so, sends a message that the user started speaking too early to the message display unit 15, which displays the message on the display 40.

Specifically, speech-head truncation can be determined by comparing the average power over a fixed period (for example, several hundred milliseconds) at the head of the speech signal stored in the input speech buffer 12 with the average power over a fixed period (for example, several hundred milliseconds) at its tail.

The head of the speech signal can be located by placing a speech input detector between the input speech buffer 12 and the background noise suppression unit 13 to detect the rise of the speech signal; in a push-to-talk speech recognition system, the moment the microphone 30 is turned on can also be used as the starting point. In a speech recognition system in which the speech recognition unit 20 prompts the user to speak, the moment of the prompt can be used as the starting point.

The tail of the speech signal can be located using the result of end-of-speech detection, which determines that the user has not spoken for a predetermined period or longer; the average power is then obtained over the period from that determination until the microphone 30 is turned off. The fixed periods over which the average power is computed need not be the same for the head and the tail of the speech signal.

When the head of the speech has been truncated, the average power at the head of the speech signal is large. The average power at the tail, by contrast, is that of the signal during a fixed period after the user has finished speaking, that is, of background noise alone, and is therefore small. Accordingly, the difference between the average power at the head and at the tail of the speech signal is computed, and if this difference is at or above a certain level, speech-head truncation can be judged to have occurred.
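The head-versus-tail power comparison above can be sketched as follows. The 300 ms windows, the 10 dB difference threshold, the 16 kHz sampling rate, the dB scale, and the function names are all assumptions for illustration; the source specifies only "several hundred milliseconds" and "a certain level".

```python
import math


def average_power_db(samples):
    """Average power of the samples in dB (arbitrary reference)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square + 1e-12)  # offset avoids log(0)


def detect_head_truncation(samples, sample_rate=16000, head_ms=300,
                           tail_ms=300, diff_threshold_db=10.0):
    """True if the head of the signal is much louder than the tail."""
    head_n = sample_rate * head_ms // 1000
    tail_n = sample_rate * tail_ms // 1000
    if head_n <= 0 or tail_n <= 0 or len(samples) < head_n + tail_n:
        raise ValueError("signal too short for the requested windows")
    head_db = average_power_db(samples[:head_n])
    tail_db = average_power_db(samples[-tail_n:])
    return head_db - tail_db >= diff_threshold_db
```

The sketch assumes the buffer starts when the microphone is turned on and ends after end-of-speech detection, so that the tail window contains background noise only.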

The embodiment has been described above, but the present invention is not limited to it. For example, the invention also covers apparatuses that perform at least two of the three determinations (background noise level, overflow, and speech-head truncation): overflow and speech-head truncation, background noise level and speech-head truncation, or background noise level and overflow. Whether the input speech is appropriate may also be reported not only on a display but also through the lighting pattern of a light-emitting element such as an LED, or by sound.

If the input speech is judged inappropriate, wasteful transmission and processing can be avoided by not sending the recognition request and acoustic features from the microphone input unit 10 to the speech recognition unit 20, or by not performing speech recognition in the speech recognition unit 20.
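One way to keep the messages distinguishable per cause and to skip the recognition request for inappropriate input is sketched below. The enum, the message strings, and the function names are hypothetical; the three boolean inputs stand for the results of the three determinations described in the embodiment.

```python
from enum import Enum


class InputProblem(Enum):
    """One member per cause, so the messages stay distinguishable."""
    NOISE = "Background noise is too loud."
    OVERFLOW = "Your voice is too loud."
    HEAD_TRUNCATED = "You started speaking too early."


def evaluate_input(noise_too_loud, overflowed, head_truncated):
    """Collect the problems found by the individual evaluators."""
    problems = []
    if noise_too_loud:
        problems.append(InputProblem.NOISE)
    if overflowed:
        problems.append(InputProblem.OVERFLOW)
    if head_truncated:
        problems.append(InputProblem.HEAD_TRUNCATED)
    return problems


def should_send_recognition_request(problems):
    """Skip the request to the recognizer when any problem was found."""
    return not problems
```

For example, `evaluate_input(False, True, True)` yields the overflow and head-truncation problems, each with its own message, and `should_send_recognition_request` then returns `False`.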

The present invention can be applied to speech recognition services on mobile phones.

FIG. 1 is a functional block diagram showing a speech recognition system to which the present invention is applied.

Explanation of symbols

10: microphone input unit; 11: background noise buffer; 12: input speech buffer; 13: background noise suppression unit; 14: acoustic feature extraction unit; 15: message display unit; 16: background noise level evaluation unit; 17: overflow detection unit; 18: speech-head truncation detection unit; 20: speech recognition unit; 30: microphone; 40: display

Claims

1. A speech input evaluation apparatus that evaluates whether speech input for speech recognition is appropriate, comprising at least two of a background noise level evaluation unit, an overflow detection unit, and a speech-head truncation detection unit, wherein
the background noise level evaluation unit measures background noise, determines whether the measured background noise is at or above a certain level, and, if so, sends a message that the background noise is too loud,
the overflow detection unit detects overflow of the input speech, determines whether the detected overflow is at or above a level that affects the recognition result of speech recognition, and, if so, sends a message that the utterance is too loud,
the speech-head truncation detection unit determines whether the head of the input speech has been truncated and, if so, sends a message to that effect, and
the messages sent from the at least two of the background noise level evaluation unit, the overflow detection unit, and the speech-head truncation detection unit are distinguishable from one another.

2. The speech input evaluation apparatus according to claim 1, wherein the background noise level evaluation unit measures background noise from the input during a predetermined period before the user speaks or while the user is not speaking.

3. The speech input evaluation apparatus according to claim 1 or 2, wherein the overflow detection unit detects the degree of temporally local overflow within the speech section and determines whether that degree is at a level that affects the recognition result.

4. The speech input evaluation apparatus according to claim 3, wherein the overflow detection unit detects the degree of temporally local overflow within the speech section by counting the number of overflowed samples in the time-domain waveform of the speech signal, and determines whether the degree of overflow is at a level that affects the recognition result by thresholding whether the count is at or above a certain value.

5. The speech input evaluation apparatus according to claim 4, wherein the overflow detection unit counts the number of overflowed samples in each acoustic analysis frame, which is the unit of speech recognition processing, and takes the maximum of the per-frame counts over the entire duration of the speech signal as the degree of temporally local overflow within the speech section.

5. The overflow detection unit detects a local (time local) degree of overflow in a speech section by counting the number of samples overflowed in a predetermined period from the beginning of the speech signal. The voice input evaluation apparatus according to 1.

7. The speech input evaluation apparatus according to claim 1, wherein the speech-head truncation detection unit compares the average power over a fixed period at the head of the speech signal with that at the tail, and determines whether speech-head truncation has occurred based on the difference between the average power at the head and the average power at the tail.

8. The speech input evaluation apparatus according to claim 7, further comprising a microphone for inputting speech for recognition, wherein the speech-head truncation detection unit obtains, as the average power over a fixed period at the head of the speech signal, the average power of the speech signal over a fixed period immediately after the microphone is turned on.