JPS60205600A

JPS60205600A - Voice recognition equipment

Info

Publication number: JPS60205600A
Application number: JP59062722A
Authority: JP
Inventors: 上原　堅助
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1984-03-30
Filing date: 1984-03-30
Publication date: 1985-10-17

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION [Technical Field of the Invention] The present invention relates to a voice recognition device used, for example, in a voice recognition response system.

[Technical Background of the Invention and Its Problems] A voice recognition response system is used, for example, in a system in which a user and a host computer exchange information over a telephone line and an information service is provided to the user.

Such a voice recognition response system uses a speaker-independent voice recognition device and is configured to recognize code numbers, consisting mainly of digits, uttered by the user.

In this case, the user utters a code number consisting of a series of digits one word at a time, in succession. The voice recognition device treats each digit uttered by the user as a word and recognizes the digits word by word.

When the voice recognition device recognizes a user's voice, it must detect the acoustic section to be recognized, that is, the voice section, in the voice signal input from the telephone line. One voice detection method for detecting this voice section, shown in FIG. 1, statistically determines a fixed detection level L for the voice signal U and, based on this detection level L, detects the voice section, that is, the start S and the end E of the voice. However, the level of a voice signal that passes through a telephone line can vary greatly depending on the line condition. With the voice detection method described above, therefore, when the level of the noise N is high, the noise N may exceed the detection level L as shown in FIG. 2(a), making it impossible to detect the voice section. Likewise, as shown in FIG. 2(b), the level of the voice signal U may fall below the detection level L, again making it impossible to detect the voice section.
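The fixed-level method and its two failure cases can be illustrated with a minimal Python sketch. The function name `detect_fixed_level`, the per-frame level lists, and the threshold value are illustrative assumptions, not taken from the patent, which describes the method only in terms of FIG. 1 and FIG. 2.

```python
from typing import List, Optional, Tuple


def detect_fixed_level(levels: List[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the span above a fixed level L.

    Hypothetical helper: the first frame above the threshold is taken as the
    start S and the last frame above it as the end E.
    """
    above = [i for i, v in enumerate(levels) if v > threshold]
    if not above:
        return None  # nothing exceeded L, so no voice section is found
    return above[0], above[-1]


L = 5  # fixed detection level (assumed value)
clean = [1, 1, 8, 9, 7, 1, 1]   # normal line: voice well above noise
noisy = [6, 6, 8, 9, 7, 6, 6]   # FIG. 2(a)-like case: noise exceeds L
quiet = [1, 1, 3, 4, 3, 1, 1]   # FIG. 2(b)-like case: voice stays below L

print(detect_fixed_level(clean, L))  # (2, 4)
print(detect_fixed_level(noisy, L))  # (0, 6)
print(detect_fixed_level(quiet, L))  # None
```

In the noisy case the whole trace is reported as voice, and in the quiet case no section is found at all, which corresponds to the two situations in which detection fails.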

To address this, a voice detection method has been developed that varies the detection level L according to the level of the background noise N. With this method, however, when the background noise level is low and the voice level is relatively high, as shown in FIG. 3, the detection level L that is set becomes relatively low.
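A small Python sketch can show why a noise-tracking level becomes oversensitive on a quiet line. The rule "mean noise plus a fixed margin" and the names `noise_tracking_level` and `frames_above` are assumptions chosen for illustration; the patent does not state how the prior-art method derives L from the noise.

```python
from statistics import mean
from typing import List


def noise_tracking_level(noise_frames: List[float], margin: float = 1.0) -> float:
    """Assumed rule: detection level L = mean background noise + fixed margin."""
    return mean(noise_frames) + margin


def frames_above(levels: List[float], threshold: float) -> List[int]:
    """Indices of frames whose level exceeds the threshold."""
    return [i for i, v in enumerate(levels) if v > threshold]


# Quiet line: noise at 0.5, a breath sound at 2.0, then loud speech around 9-10.
levels = [0.5, 0.5, 2.0, 0.5, 9.0, 10.0, 9.0, 0.5]
level = noise_tracking_level(levels[:2])  # estimate noise from the leading frames

print(level)                        # 1.5
print(frames_above(levels, level))  # [2, 4, 5, 6]
```

Because the level follows only the noise, frame 2 (the breath sound) is picked up even though the speech itself is far louder.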

As a result, this method has the drawback of sensitively picking up the speaker's (user's) breathing and small ambient noises. Furthermore, to detect the end E of the voice in the voice detection process, it is necessary, as shown in FIG. 1, to judge whether the voice is continuing or has ended according to the time T that elapses from when the voice level falls below the detection level L until it rises again. It is difficult to set this time T (hereinafter referred to as the voice duration) to an appropriate value in advance. For example, if the voice duration T is set relatively short, detection of the voice section for a word with a relatively long utterance may finish before the true end is detected. Conversely, if the voice duration T is set relatively long, detecting the voice section takes a long time and the efficiency of response processing deteriorates.
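The trade-off in choosing the voice duration T can be sketched in Python as follows. The function `detect_with_hold`, the frame-based representation, and the sample values are hypothetical; the sketch only assumes that the end is declared once the level has stayed below L for more than T frames.

```python
from typing import List, Optional, Tuple


def detect_with_hold(levels: List[float], threshold: float,
                     hold_frames: int) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, decided_at) for the first voice section.

    The end is declared only after the level has stayed below the threshold
    for more than hold_frames frames (the voice duration T, in frames).
    decided_at is the frame at which that decision is made.
    """
    start = last_above = None
    decided_at = len(levels) - 1  # fall back to the last frame if input runs out
    for i, v in enumerate(levels):
        if v > threshold:
            if start is None:
                start = i
            last_above = i
        elif start is not None and i - last_above > hold_frames:
            decided_at = i
            break
    if start is None:
        return None
    return start, last_above, decided_at


# One word with a two-frame dip in the middle (frames 3 and 4).
word = [1, 8, 9, 1, 1, 8, 9, 1, 1, 1, 1, 1]

print(detect_with_hold(word, 5, hold_frames=1))  # (1, 2, 4)
print(detect_with_hold(word, 5, hold_frames=3))  # (1, 6, 10)
```

With a short T the word is cut off at the dip (end at frame 2), while a longer T finds the true end (frame 6) but delays the decision until frame 10.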

[Object of the Invention] The present invention has been made in view of the above points. Its object is to provide a voice recognition device that can perform highly accurate voice recognition processing by setting an appropriate detection level and voice duration, so that the voice section is reliably detected even when the level of a voice signal input via a telephone line fluctuates.

[Summary of the Invention] In the present invention, voice detection means is provided that detects, at a predetermined detection level, a voice section consisting of a start and an end in a voice signal transmitted via a telephone line, and temporarily stores the waveform locus of that voice signal. The voice recognition device performs voice recognition processing based on the voice section detected by the voice detection means. In addition, voice detection control means detects the maximum level and the background noise level of the voice signal based on the waveform locus stored by the voice detection means and sets an appropriate detection level, and also detects the voice duration and controls the voice detection means so that the end of the voice section is determined. A voice recognition device configured in this way can reliably detect the voice section in a voice signal whose level is prone to fluctuation, and can perform highly accurate voice recognition processing.

[Embodiment of the Invention] An embodiment of the present invention will be described below with reference to the drawings. FIG. 4 is a block diagram showing the configuration of a voice recognition response system according to one embodiment. In FIG. 4, a telephone 10 converts the voice uttered by the user into an electrical signal and transmits it to a voice detection circuit 12 via a telephone line 11. The voice detection circuit 12 detects the voice section, that is, the start S and the end E of the voice, in the transmitted voice signal based on a predetermined detection level L.

The voice detection circuit 12 also temporarily stores, in a memory 13, the waveform locus of the initial voice signal (the voice signal including the background noise N before and after it). A voice recognition circuit 14 performs recognition processing on the voice in the voice section detected by the voice detection circuit 12. A control circuit 15 reads out the waveform locus information of the voice signal stored in the memory 13 by the voice detection circuit 12 and, based on this information, detects the maximum level of the voice signal and the background noise level. Based on this detection, the control circuit 15 sets the detection level L needed for detecting the voice section to an appropriate value in the voice detection circuit 12.

Furthermore, the control circuit 15 detects the voice duration from the waveform locus information and controls the operation of the voice detection circuit 12 so that the end E is determined when the voice detection circuit 12 detects the next voice section. A voice response device 16, under the control of the control circuit 15, performs response processing according to the recognition result of the voice recognition circuit 14.

The operation of the embodiment in the voice recognition response system configured as described above will now be explained. First, when the user operates the telephone 10 and a line to the system is connected, a predetermined first message is sent from the voice response device 16 to the user. This first message is, for example, "This is the XX service. If that is all right, please say 'yes'." When the user says "yes" in response, the uttered voice is transmitted to the voice detection circuit 12 via the telephone line 11.

As shown in FIG. 5(a), the voice detection circuit 12 detects the start S and the end E of the input voice signal U using the predetermined detection level L and reports them to the control circuit 15.

At this time, the voice detection circuit 12 stores the waveform locus of this first voice signal U (including the background noise N) in the memory 13. The voice detected by the voice detection circuit 12 undergoes voice recognition processing in the voice recognition circuit 14, and the recognition result is output to the control circuit 15.

If the voice detection circuit 12 judges the voice from the user to be noise, the voice response device 16, under the control of the control circuit 15, sends the user a message such as "We could not confirm your response. Please try again."

The control circuit 15 reads the waveform locus information of the voice signal U from the memory 13 through the voice detection circuit 12 and, based on that information, detects the maximum level of the voice and the background noise level. According to this detection result, the control circuit 15 determines an appropriate detection level L for detecting the voice section and sets it in the voice detection circuit 12. In addition, the control circuit 15 determines from the waveform locus information a voice duration T1 as shown in FIG. 5(a).
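The calibration performed by the control circuit 15 can be sketched in Python under explicit assumptions. The patent does not give formulas, so the rule used here (L placed a fixed fraction of the way from the noise level to the peak, and T1 taken as the longest dip below L inside the utterance) and the names `calibrate`, `level_ratio`, and `noise_frames` are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def calibrate(trace: List[float], level_ratio: float = 0.3,
              noise_frames: int = 3) -> Tuple[float, int]:
    """Derive (detection level L, voice duration T1 in frames) from a stored trace.

    Assumptions (not from the patent): the first and last noise_frames frames
    contain only background noise, L = noise + level_ratio * (peak - noise),
    and T1 is the longest below-L run between the first and last above-L frame.
    """
    peak = max(trace)
    noise = mean(trace[:noise_frames] + trace[-noise_frames:])
    level = noise + level_ratio * (peak - noise)

    above = [i for i, v in enumerate(trace) if v > level]
    if not above:
        raise ValueError("trace contains no frame above the derived level")

    longest = run = 0
    for v in trace[above[0]:above[-1] + 1]:
        if v > level:
            run = 0
        else:
            run += 1
            longest = max(longest, run)
    return level, longest


# Stored trace of the initial "yes": noise, speech with a two-frame dip, noise.
trace = [1, 1, 1, 8, 10, 2, 2, 9, 7, 1, 1, 1]
level, t1 = calibrate(trace)
print(round(level, 2), t1)  # 3.7 2
```

Placing L between the noise floor and the peak keeps it above breath-level sounds on a quiet line while still below the voice on a weak line, which is the behavior the description attributes to using both the maximum level and the background noise level.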

Next, when the user utters a series of code numbers, the uttered voice signal is input to the voice detection circuit 12 in the same manner as above. The voice detection circuit 12 now detects the voice section based on the appropriate detection level L set by the control circuit 15. At this time, the voice detection circuit 12 can reliably detect the end E of the voice section using the voice duration T1 determined by the control circuit 15. That is, even if the voice level fluctuates and falls below the detection level L, the voice is judged to be continuing as long as the duration of that drop is no longer than the voice duration T1. This voice duration T1 remains fixed until the series of information services ends, because the voice duration of utterances by the same user can be considered to be roughly the same.
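How the calibrated values might then be applied to a stream of spoken digits is sketched below in Python. The function `segment` and the sample levels are hypothetical; the only rule taken from the description is that a drop below L lasting no longer than T1 is treated as continuing voice.

```python
from typing import List, Tuple


def segment(levels: List[float], level: float, hold_frames: int) -> List[Tuple[int, int]]:
    """Split a level trace into voice sections (start, end).

    A drop below the detection level counts as the end of a section only when
    it lasts longer than hold_frames frames (the fixed voice duration T1).
    """
    sections = []
    start = last_above = None
    for i, v in enumerate(levels):
        if v > level:
            if start is None:
                start = i
            last_above = i
        elif start is not None and i - last_above > hold_frames:
            sections.append((start, last_above))
            start = last_above = None
    if start is not None:
        sections.append((start, last_above))  # input ended while voice was open
    return sections


# Two digits: the first has a one-frame dip, then a three-frame pause follows.
digits = [1, 8, 2, 9, 1, 1, 1, 7, 8, 1, 1, 1]
print(segment(digits, level=3.7, hold_frames=2))  # [(1, 3), (7, 8)]
```

The one-frame dip inside the first digit does not split it, while the longer pause between digits does, so each digit is passed on as its own voice section.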

Accordingly, if, depending on the user, the control circuit 15 detects a voice duration T2 as shown in FIG. 5(b) from the utterance made in response to the first message, it is this voice duration T2 that is fixed.

In this way, the appropriate detection level L and voice duration used for detecting the voice section are set from the user's initial utterance in response to the first message. Therefore, even if the level of the voice signal varies somewhat with the state of the telephone line, the voice detection circuit 12 reliably detects the voice section. Moreover, because the voice duration is set from the user's initial utterance, an appropriate voice duration can always be fixed and the end of the voice section can be reliably determined. As a result, the voice recognition circuit 14 can recognize with high accuracy the series of code numbers transmitted from the user via the telephone line.

[Effects of the Invention] As described in detail above, according to the present invention, the voice section can be reliably detected even when the level of a voice signal transmitted via a telephone line fluctuates. Therefore, a series of code numbers, for example, transmitted by a speaker via a telephone line can be recognized with high accuracy. Applying the present invention to a voice recognition response system thus makes it possible to realize a reliable information response service.

[Brief Description of the Drawings]

FIGS. 1 to 3 are waveform diagrams of voice signals for explaining the operation of conventional voice recognition methods. FIG. 4 is a block diagram showing the configuration of a voice recognition response system according to an embodiment of the present invention. FIGS. 5(a) and 5(b) are waveform diagrams of voice signals for explaining the operation of the system of FIG. 4. 11: telephone line, 12: voice detection circuit, 13: memory, 14: voice recognition circuit, 15: control circuit. Applicant's agent: Patent attorney Takeshi Suzue

Claims

[Scope of Claims] A voice recognition device characterized by comprising: voice detection means for detecting a voice section from a voice signal transmitted via a telephone line, using a predetermined detection level and voice duration, and for temporarily storing the waveform locus of the voice signal; voice recognition means for performing voice recognition processing based on the voice signal of the voice section detected by the voice detection means; and voice detection control means for detecting the maximum level, the background noise level, and the voice duration of the voice signal based on the voice signal stored by the voice detection means, setting an appropriate detection level and voice duration, and controlling the voice detection means so that it detects an appropriate voice section.