JP2013207508A

JP2013207508A - Automatic voice response device

Info

Publication number: JP2013207508A
Application number: JP2012073686A
Authority: JP
Inventors: Futoshi Kaneda (兼田 太)
Original assignee: Hitachi Information and Telecommunication Engineering Ltd
Current assignee: Hitachi Information and Telecommunication Engineering Ltd
Priority date: 2012-03-28
Filing date: 2012-03-28
Publication date: 2013-10-07

Abstract

PROBLEM TO BE SOLVED: To provide an automatic voice response device that makes automatic voice response less difficult for the user to use. SOLUTION: An automatic voice response device 10 includes: a splitter 11 that separates a voice signal from a user 1 into background sound and voice; a background state recognition unit 12 that recognizes, from the separated background sound, the background state in which the user 1 is placed; a user state recognition unit 13 that recognizes, from the separated voice, a user state, that is, the state of the individual user; a response content generation unit 14 that generates, from the recognized background state and user state, response content for responding to the user 1; and a voice synthesis unit 15 that synthesizes and outputs voice based on the generated response content.

Description

The present invention relates to an automatic voice response device that automatically answers a telephone call from a user with synthesized voice.

Automatic voice response devices have long been used in, for example, call centers, but at present they can respond only within a very limited set of options such as the caller number, the called number, and tone input. The response is therefore one-sided and does not take the user's state into account, which places an unnecessary burden on the user or fails to lead to the result the user wants. As a device of this type, Patent Document 1, for example, describes a telephone reception system. That system measures the degree of dissatisfaction by identifying the individual from a voice pattern, but it does not consider any state other than the voice pattern and provides no feedback to the system.

Patent Document 1: Japanese Patent No. 4067481

An object of the present invention is to provide an automatic voice response device that is less difficult for the user to use than conventional devices during automatic voice response.

To achieve the above object, the present invention provides the following automatic voice response device.
(1) An automatic voice response device comprising:
a splitter that separates a voice signal from a user into background sound and voice;
a background state recognition unit that recognizes, based on the background sound, the background state in which the user is placed;
a user state recognition unit that recognizes, based on the voice, a user state that is the state of the individual user;
a response content generation unit that generates, based on the background state and the user state, response content for responding to the user; and
a voice synthesis unit that synthesizes and outputs voice based on the response content.
(2) The automatic voice response device according to (1), wherein the background state recognition unit includes a background state database in which a plurality of background sounds, converted into frequency distributions, are stored in association with the background states to which they relate, obtains the correlation coefficient between the frequency distribution of each stored background sound and the frequency distribution of the separated background sound, and recognizes the background state corresponding to the background sound with the largest correlation coefficient as the background state in which the user is placed.
(3) The automatic voice response device according to (2), wherein the background state recognition unit further recognizes the background state in which the user is placed from the loudness of the separated background sound.
(4) The automatic voice response device according to any one of (1) to (3), wherein the user state recognition unit includes a user state database that stores, as scores, center frequency values and utterance pitch variation values obtained by converting voices corresponding to a plurality of user states into frequency distributions; obtains from the user state database the scores corresponding to the center frequency and the utterance pitch obtained by converting the separated voice into a frequency distribution; and recognizes, as the user state that is the state of the individual user, the user state corresponding to the voice for which the sum of the score for the center frequency and the score for the utterance pitch is largest.
(5) The automatic voice response device according to (4), wherein the user state database further stores, as scores, voice loudness values corresponding to the plurality of user states; a score corresponding to the loudness of the separated voice is obtained from the user state database; and the user state corresponding to the voice for which the sum of the score for the center frequency, the score for the utterance pitch, and the score for the voice loudness is largest is recognized as the user state that is the state of the individual user.
(6) The automatic voice response device according to any one of (1) to (5), wherein the response content generation unit can perform at least one of raising or lowering the volume and switching between a male voice and a female voice during automatic voice response.
(7) The automatic voice response device according to any one of (1) to (6), wherein the response content generation unit outputs, as the response content, any one of a standard response text, a simplified response text, and an instruction for response by an operator.

According to the invention of claim 1, it is possible to provide an automatic voice response device that is less difficult for the user to use than conventional devices during automatic voice response.
According to the invention of claim 2, the background state in which the user is placed can be recognized easily.
According to the invention of claim 3, a response suited to the background state in which the user is placed can be given.
According to the invention of claim 4, the user state, that is, the state of the individual user, can be recognized easily.
According to the invention of claim 5, the user state, that is, the state of the individual user, can be recognized even more easily.
According to the invention of claim 6, a response suited to the background state in which the user is placed and to the state of the individual user can be given.
According to the invention of claim 7, the device can respond flexibly to the user state, that is, the state of the individual user.

FIG. 1 is a diagram for explaining an embodiment of the automatic voice response device according to the present invention.
FIGS. 2(a) to 2(d) are diagrams for explaining a configuration example of the background state recognition unit.
FIGS. 3(a) to 3(d) are diagrams for explaining a configuration example of the user state recognition unit.
FIGS. 4(a) to 4(c) are diagrams for explaining a configuration example of the response content generation unit.
FIG. 5 is a diagram showing an example of the response flow in the example of FIG. 4.

FIG. 1 is a diagram for explaining an embodiment of the automatic voice response device according to the present invention. This example is described using a call center system, but the invention is not limited to this. As illustrated, when a user 1 calls the call center, the voice signal of the user 1 is received by a telephone line interface unit 3 of the call center via a PSTN (Public Switched Telephone Network) 2, which is an ordinary subscriber telephone network. In response to the utterance (voice signal) of the user 1 input from the telephone line interface unit 3, the automatic voice response device 10 answers the user 1, via the telephone line interface unit 3 and the PSTN 2, either with predetermined synthesized voice generated internally or with the voice of an operator 16.

The automatic voice response device 10 includes: a splitter 11 that separates the voice signal from the user 1 into background sound and voice; a background state recognition unit 12 that recognizes, from the separated background sound, the background state in which the user 1 is placed; a user state recognition unit 13 that recognizes, from the separated voice, the user state, that is, the state of the individual user; a response content generation unit 14 that generates, from the recognized background state and user state, response content for responding to the user 1; and a voice synthesis unit 15 that synthesizes and outputs voice based on the generated response content (text).
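The data flow between units 11 to 15 can be sketched as follows. This is only an illustrative wiring under stated assumptions: the class name, the callable signatures, and the placeholder lambdas are hypothetical and are not taken from the publication, which does not specify any software interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Samples = Sequence[float]

@dataclass
class AutomaticVoiceResponseDevice:
    # Each field stands in for one unit of device 10 (hypothetical interfaces).
    split: Callable[[Samples], Tuple[Samples, Samples]]  # splitter 11
    recognize_background: Callable[[Samples], str]       # unit 12
    recognize_user: Callable[[Samples], str]             # unit 13
    generate_response: Callable[[str, str], str]         # unit 14
    synthesize: Callable[[str], bytes]                   # unit 15

    def respond(self, signal: Samples) -> bytes:
        # Separate the signal, recognize both states, then build and voice the reply.
        background, voice = self.split(signal)
        background_state = self.recognize_background(background)
        user_state = self.recognize_user(voice)
        text = self.generate_response(background_state, user_state)
        return self.synthesize(text)

# Placeholder units: these only demonstrate the wiring, not real signal processing.
device = AutomaticVoiceResponseDevice(
    split=lambda s: (s[: len(s) // 2], s[len(s) // 2:]),
    recognize_background=lambda background: "quiet",
    recognize_user=lambda voice: "male",
    generate_response=lambda b, u: f"{b}/{u}",
    synthesize=lambda text: text.encode("utf-8"),
)
audio = device.respond([0.0, 0.0, 0.0, 0.0])
```

Passing the units in as callables keeps the sketch neutral about how each unit is actually realized.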

The automatic voice response device 10 operates, for example, roughly as follows. Based on the voice signal from the user 1, the automatic voice response device 10 recognizes the background state in which the user 1 is placed and the state of the individual user. From that background state and user state, it selects the response content that can be estimated to be most "suitable" and uses it as the output of the automatic voice response device. Response content that can be estimated to be most "suitable" means, for example, the following.
- If the environment has loud background sound, the volume of the synthesized voice is raised.
- If the user appears to be angry, the response flow is adjusted so that an operator responds instead of the synthesized voice.
- If the user appears to be in a hurry, the response flow is adjusted so that the synthesized-voice response is simplified.
The above response content is given by way of example, and the present invention is not limited to it.

Meanwhile, the operator 16 can monitor at least one of the user's voice and the synthesized voice as necessary. If the conversation with the user 1 goes in an unexpected direction, the operator 16 can control the operation of the response content generation unit 14 or the voice synthesis unit 15 to correct the flow of the conversation, or can stop the automatic response and take over the conversation personally. As described later, the response content generation unit 14 may output, as the response content, not synthesized voice but an instruction to switch the call to the operator 16; in this case the instruction is passed from the response content generation unit 14 to the operator 16 via the telephone line interface unit 3 and the voice synthesis unit 15.
The configuration and operation of each part of the automatic voice response device 10 are described below.

FIGS. 2(a) to 2(d) are diagrams for explaining a configuration example of the background state recognition unit. As shown in FIG. 2(a), the background state recognition unit 12 includes background state recognition means 121 and a background state database 122. The background state database 122 stores a plurality of background sounds, converted in advance into frequency distributions (frequency spectra), in association with the background states to which they relate. The background state recognition means 121 obtains the correlation coefficient between the frequency distribution of each stored background sound and the frequency distribution of the background sound separated by the splitter 11, and recognizes and outputs the background state corresponding to the background sound with the largest correlation coefficient as the background state in which the user is placed.

FIG. 2(b) is a conceptual diagram of an original background sound signal 123 and its frequency distribution 124. FIG. 2(c) shows the relationship between the frequency distributions of the background sounds stored in the background state database 122 and the corresponding background states. Frequency distributions 125, 126, and 127 in the figure correspond to the background states "in a car", "rain", and "downtown", respectively. The frequency distributions are normalized. FIG. 2(d) shows the correlation coefficients between the frequency distribution 124 of the background sound separated by the splitter 11 and the frequency distributions 125, 126, and 127 stored in the background state database 122. The largest correlation coefficient in the figure is 0.95, between frequency distributions 124 and 125. In this case, "in a car", which corresponds to frequency distribution 125 in FIG. 2(c), is recognized as the background state in which the user 1 is placed.
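The matching step can be illustrated with a short sketch. It assumes the correlation coefficient is the Pearson coefficient computed over equal-length spectra, which the publication does not state explicitly; the function names and the eight-bin spectra below are hypothetical values invented for illustration, not data from FIG. 2.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat spectrum carries no shape information
    return cov / sqrt(vx * vy)

def recognize_background(spectrum, database):
    """Return (state, coefficient) for the best-correlated stored spectrum."""
    best_state, best_r = None, float("-inf")
    for state, reference in database.items():
        r = pearson(spectrum, reference)
        if r > best_r:
            best_state, best_r = state, r
    return best_state, best_r

# Hypothetical normalized spectra standing in for distributions 125 to 127.
database = {
    "in a car": [0.9, 1.0, 0.7, 0.4, 0.2, 0.1, 0.05, 0.02],
    "rain":     [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 0.9, 0.7],
    "downtown": [0.3, 0.5, 1.0, 0.9, 0.5, 0.6, 0.3, 0.2],
}
# Hypothetical spectrum standing in for the separated background sound (124).
observed = [0.85, 1.0, 0.75, 0.35, 0.25, 0.1, 0.05, 0.05]
state, r = recognize_background(observed, database)
```

With these invented spectra the observed sound correlates most strongly with "in a car", mirroring the outcome of the worked example in the text.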

The background state recognition unit 12 can additionally recognize the background state in which the user is placed from another viewpoint, namely the loudness (level) of the background sound separated by the splitter 11. For example, if the background sound level is at or above a predetermined threshold, the background state in which the user is placed can be recognized as "noisy"; if the background sound level is below the threshold, it can be recognized as "quiet".
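A minimal sketch of this threshold test follows. The publication does not say how the level is measured or what the threshold is, so the RMS level in dB relative to full scale and the -30 dB default are assumptions made only for illustration.

```python
import math

def rms_level_db(samples):
    """RMS level in dB relative to full scale, for samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def classify_loudness(level_db, threshold_db=-30.0):
    """'noisy' at or above the threshold, 'quiet' below it (threshold is hypothetical)."""
    return "noisy" if level_db >= threshold_db else "quiet"

loud = classify_loudness(rms_level_db([0.5, -0.5, 0.5, -0.5]))
soft = classify_loudness(rms_level_db([0.001, -0.001, 0.001, -0.001]))
```

Because this label is derived independently of the spectral match, the unit can report both a loudness label and a scene label for the same call.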

FIGS. 3(a) to 3(d) are diagrams for explaining a configuration example of the user state recognition unit. As shown in FIG. 3(a), the user state recognition unit 13 includes user state recognition means 131 and a user state database 132. The user state database 132 stores, as scores, the center frequency values and utterance pitch variation values obtained by converting voices corresponding to a plurality of user states into frequency distributions. The user state recognition means 131 obtains from the user state database 132 the scores corresponding to the center frequency and the utterance pitch obtained by converting the voice separated by the splitter 11 into a frequency distribution, and recognizes and outputs, as the user state that is the state of the individual user, the user state corresponding to the voice for which the sum of the score for the center frequency and the score for the utterance pitch is largest.

The user state database 132 can further store, as scores, voice loudness (level) values corresponding to the plurality of user states. In this case, the user state recognition means 131 obtains from the user state database 132 the score corresponding to the loudness of the voice separated by the splitter 11, and can recognize and output, as the user state that is the state of the individual user, the user state corresponding to the voice for which the sum of the score for the center frequency and the score for the utterance pitch obtained above and the score for the voice loudness obtained here is largest.

FIG. 3(b) is a conceptual diagram of an original voice signal 133 and its frequency distribution 134. FIG. 3(c) shows, as scores, the center frequency (kHz) values, the utterance pitch variation (%) values, and the utterance level (dB) (voice loudness) obtained by converting voices corresponding to the plurality of user states (male, female, emergency, angry) stored in the user state database 132 into frequency distributions. For example, as indicated by the shaded cells in FIG. 3(c), when the center frequency obtained by converting the original voice signal 133 into the frequency distribution 134 is 1.8 kHz, the utterance pitch is 0%, and the utterance level is -20 dB, the score totals for the user states are as shown in FIG. 3(d): "male" totals 30, "female" totals 10, "emergency" totals 0, and "angry" totals 10. In this case "male", which has the largest score total, is recognized as the user state of the individual user. In this example, the center frequency, the utterance pitch, and the utterance level are obtained, for example, as follows. The center frequency is the peak frequency of the frequency distribution obtained by converting the voice. The utterance pitch is the frequency of transitions of the center frequency per fixed period of time. The utterance level is obtained by comparison with the volume of a time signal, which has a constant level.
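The score lookup and summation can be sketched as below. The per-cell scores of FIG. 3(c) are not reproduced in the text, so every range boundary and score in `SCORE_TABLE` is hypothetical; the values were chosen only so that the totals for the worked example (1.8 kHz, 0%, -20 dB) come out as 30, 10, 0, and 10.

```python
STATES = ("male", "female", "emergency", "angry")
INF = float("inf")

# feature -> rows of (lower bound, upper bound, scores in STATES order).
# All boundaries and scores are hypothetical placeholders for FIG. 3(c).
SCORE_TABLE = {
    "center_khz": [(0.0, 2.0, (10, 0, 0, 0)), (2.0, INF, (0, 10, 10, 0))],
    "pitch_pct":  [(-INF, 10.0, (10, 10, 0, 0)), (10.0, INF, (0, 0, 10, 10))],
    "level_db":   [(-INF, -10.0, (10, 0, 0, 10)), (-10.0, INF, (0, 0, 10, 10))],
}

def score_user_state(features):
    """Sum the per-feature scores and return (best state, totals per state)."""
    totals = dict.fromkeys(STATES, 0)
    for name, value in features.items():
        for low, high, scores in SCORE_TABLE[name]:
            if low <= value < high:
                for state, score in zip(STATES, scores):
                    totals[state] += score
                break  # only one row applies per feature
    # On a tie, max() returns the first state in STATES order (an arbitrary choice).
    best = max(totals, key=totals.get)
    return best, totals

best, totals = score_user_state(
    {"center_khz": 1.8, "pitch_pct": 0.0, "level_db": -20.0}
)
```

Omitting the `level_db` entry from the input gives the two-feature variant that uses only center frequency and utterance pitch.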

FIGS. 4(a) to 4(c) are diagrams for explaining a configuration example of the response content generation unit. As shown in FIG. 4(a), the response content generation unit 14 includes response content generation means 141 and a response direction database 142. As shown in FIG. 4(b), the response direction database 142 holds the following response directions: A1: volume (up), A2: volume (down), B1: speaker switching (male voice), B2: speaker switching (female voice), C1: content (operator), C2: content (standard), and C3: content (simplified). For each response direction, the background states (noisy, quiet, in a car, rain, downtown) and the user states (male, female, emergency, angry) are each given a score. The shaded cells in FIG. 4(b) indicate that the background states output from the background state recognition unit 12 are "noisy" and "downtown" and that the user state output from the user state recognition unit 13 is "emergency".

The response content generation means 141 generates the response content (text) as follows and outputs it to the voice synthesis unit. As shown in FIG. 4(c), the score totals for the response directions are 70 for "A1: volume (up)", 0 for "A2: volume (down)", 0 for "B1: speaker switching (male voice)", 0 for "B2: speaker switching (female voice)", 30 for "C1: content (operator)", 0 for "C2: content (standard)", and 100 for "C3: content (simplified)". Comparing A1 and A2, A1 has the larger total, so "A1: volume (up)" is selected as the response content. Comparing B1 and B2, the totals are equal, so no speaker switching is performed: a voice that was male remains male, and a voice that was female remains female. Comparing C1, C2, and C3, C3 has the largest total, so "C3: content (simplified)" is selected as the response content. When the totals for response directions A1 and A2 are equal, the volume can be left unchanged. Likewise, when the totals for response directions C1 to C3 are equal, the response direction can be left the same as the previous one.
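The group-wise comparison can be sketched as follows, using the score totals given for FIG. 4(c). The group names, the `previous` dictionary, and the rule that any tie for the top score keeps the previous setting are assumptions of this sketch; the text only describes the case in which the totals within a group are equal.

```python
GROUPS = {
    "volume":  ("A1", "A2"),
    "speaker": ("B1", "B2"),
    "content": ("C1", "C2", "C3"),
}

def select_directions(totals, previous):
    """Pick the top-scoring direction per group; keep the previous one on a tie."""
    selected = {}
    for group, codes in GROUPS.items():
        top = max(totals[code] for code in codes)
        winners = [code for code in codes if totals[code] == top]
        # A unique winner is selected; a tie means "no change" for that group.
        selected[group] = winners[0] if len(winners) == 1 else previous.get(group)
    return selected

# Score totals from the example of FIG. 4(c).
totals = {"A1": 70, "A2": 0, "B1": 0, "B2": 0, "C1": 30, "C2": 0, "C3": 100}
# Hypothetical settings carried over from the previous response.
previous = {"volume": None, "speaker": "B2", "content": "C2"}
result = select_directions(totals, previous)
```

Here `result` selects A1 and C3, and the speaker stays at whatever `previous` held because B1 and B2 tie at 0.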

In this way, the response content generation unit 14 can perform at least one of raising or lowering the volume of the synthesized voice and switching between a male voice and a female voice during automatic voice response. The response content generation unit 14 can also output, as the response content, any one of "C2: content (standard)" (standard response text), "C3: content (simplified)" (simplified response text), and "C1: content (operator)" (instruction for response by an operator).

FIG. 5 is a diagram showing an example of the response flow in the example of FIG. 4. First, in step 51, the response content generation means generates, as the response content, the text "Thank you for calling. This is the XXXX Center, which handles inquiries about XXXX. You may state your request by voice or by using your telephone." In this case, since "volume (up)" 56 of A1 has been selected for response directions A1 and A2 as described above, the response content includes an instruction to raise the volume of the synthesized voice. Also, since the scores for response directions B1 and B2 are equal as described above, resulting in "no speaker switching" 57, the response content includes an instruction that the synthesized voice remain male if it was previously male and remain female if it was previously female.

Next, in step 52, since "content (simplified)" 58 of C3 has been selected for response directions C1 to C3 as described above, the text of step 53, "1 for product information, 2 for inquiries about a product you have already purchased ...", is generated as the response content. If "content (standard)" of C2 had been selected, the text of step 54, "Please state your request. If you are using your telephone, 1 for product information, 2 for inquiries about a product you have already purchased ...", would be generated as the response content; if "content (operator)" of C1 had been selected, the instruction of step 55, "switch the call to an operator", would be output as the response content. Depending on its type, a response direction may also be applied partway through. For example, the volume can be raised while "Thank you for calling ..." is being spoken in step 51 above.
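The flow of steps 51 to 55 can be sketched as a function that turns the selected response directions into an ordered list of actions. The action tuples, the function name, and the fallback to the standard menu when no content direction is given are assumptions of this sketch; the prompt strings are English renderings of the example texts, with the "XXXX" placeholders and elisions kept as they appear.

```python
GREETING = (
    "Thank you for calling. This is the XXXX Center, which handles inquiries "
    "about XXXX. You may state your request by voice or by using your telephone."
)
SIMPLIFIED_MENU = (
    "1 for product information, 2 for inquiries about a product you have "
    "already purchased ..."
)
STANDARD_MENU = (
    "Please state your request. If you are using your telephone, "
    "1 for product information, 2 for inquiries about a product you have "
    "already purchased ..."
)

def build_response(selected):
    """Turn selected response directions into an ordered list of (action, value)."""
    actions = []
    volume = {"A1": "up", "A2": "down"}.get(selected.get("volume"))
    if volume:
        actions.append(("set_volume", volume))
    voice = {"B1": "male", "B2": "female"}.get(selected.get("speaker"))
    if voice:
        actions.append(("set_voice", voice))
    actions.append(("say", GREETING))                 # step 51
    content = selected.get("content")                 # step 52: branch on C1 to C3
    if content == "C1":
        actions.append(("transfer", "operator"))      # step 55
    elif content == "C3":
        actions.append(("say", SIMPLIFIED_MENU))      # step 53
    else:
        actions.append(("say", STANDARD_MENU))        # step 54 (assumed default)
    return actions

# Example of FIG. 5: volume up, no speaker switching, simplified content.
actions = build_response({"volume": "A1", "speaker": None, "content": "C3"})
```

Emitting the volume and voice settings ahead of the first prompt is one possible ordering; as the text notes, such a setting could equally be applied partway through a prompt.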

As described above, in this automatic voice response device, the voice from the user is received by the telephone line interface unit and then input to the splitter. The splitter separates the incoming signal into background sound and conversation content (voice) according to differences in frequency band and spectral distribution, and inputs them to the background state recognition unit and the user state recognition unit, respectively. The background state recognition unit estimates the situation in which the user is placed from the level, frequency distribution, and so on of the background sound, and inputs it to the response content generation unit as the background state. The user state recognition unit estimates changes in the user's state from the pitch, frequency distribution, utterance level, and so on of the user's voice, and inputs the result to the response content generation unit as the user state. The response content generation unit generates the response content judged most appropriate from the background state and the user state and inputs it to the voice synthesis unit. The voice synthesis unit synthesizes response voice from the given response content (text) and responds to the user through the telephone line interface unit.

As a result, an automatic voice response device that is easy for the user to use can be obtained. In addition, because there are fewer cases that the automatic voice response device cannot handle and that require an operator, the cost of using operators can be kept down. Furthermore, because the handling capability of the automatic voice response device is improved, convenience for the user is improved.

DESCRIPTION OF SYMBOLS
1 User
2 PSTN (public switched telephone network)
3 Telephone line interface unit
10 Automatic voice response device
11 Splitter
12 Background state recognition unit
13 User state recognition unit
14 Response content generation unit
15 Voice synthesis unit
16 Operator

Claims

An automatic voice response device comprising:
a splitter that separates a voice signal from a user into background sound and voice;
a background state recognition unit that recognizes, based on the background sound, the background state in which the user is placed;
a user state recognition unit that recognizes, based on the voice, a user state that is the state of the individual user;
a response content generation unit that generates, based on the background state and the user state, response content for responding to the user; and
a voice synthesis unit that synthesizes and outputs voice based on the response content.

The automatic voice response device according to claim 1, wherein the background state recognition unit includes a background state database in which a plurality of background sounds, converted into frequency distributions, are stored in association with the background states to which they relate, obtains the correlation coefficient between the frequency distribution of each stored background sound and the frequency distribution of the separated background sound, and recognizes the background state corresponding to the background sound with the largest correlation coefficient as the background state in which the user is placed.

The automatic voice response device according to claim 2, wherein the background state recognition unit further recognizes the background state in which the user is placed from the loudness of the separated background sound.

The automatic voice response device according to any one of claims 1 to 3, wherein the user state recognition unit includes a user state database that stores, as scores, center frequency values and utterance pitch variation values obtained by converting voices corresponding to a plurality of user states into frequency distributions; obtains from the user state database the scores corresponding to the center frequency and the utterance pitch obtained by converting the separated voice into a frequency distribution; and recognizes, as the user state that is the state of the individual user, the user state corresponding to the voice for which the sum of the score for the center frequency and the score for the utterance pitch is largest.

The automatic voice response device according to claim 4, wherein the user state database further stores, as scores, voice loudness values corresponding to the plurality of user states; a score corresponding to the loudness of the separated voice is obtained from the user state database; and the user state corresponding to the voice for which the sum of the score for the center frequency, the score for the utterance pitch, and the score for the voice loudness is largest is recognized as the user state that is the state of the individual user.

The automatic voice response device according to any one of claims 1 to 5, wherein the response content generation unit can perform at least one of raising or lowering the volume and switching between a male voice and a female voice during automatic voice response.

The automatic voice response device according to claim 1, wherein the response content generation unit outputs, as the response content, any one of a standard response text, a simplified response text, and an instruction for response by an operator.