JP6073649B2

JP6073649B2 - Automatic voice recognition / conversion system

Info

Publication number: JP6073649B2
Application number: JP2012245779A
Authority: JP
Inventors: 聡岩垣
Original assignee: Hitachi Systems Ltd
Current assignee: Hitachi Systems Ltd
Priority date: 2012-11-07
Filing date: 2012-11-07
Publication date: 2017-02-01
Anticipated expiration: 2032-11-07
Also published as: JP2014095753A

Description

The present invention relates to an automatic speech recognition / speech conversion system, for example one suitable for use in a call center.

Background art in this technical field includes JP 2005-12833 A (Patent Document 1) and JP 2011-9902 A (Patent Document 2).

The former publication describes "a voice response service device that performs processing interactively with a user by voice response, comprising an attribute storage unit that stores the user's attributes and a user sound quality changing unit that changes the sound quality of the voice responding to the user based on the user attribute information stored in the attribute storage unit, the device providing a highly comfortable service by changing the sound quality according to attributes such as the gender and age group of the telephone service user and the other party's operating environment" (see abstract).

The latter publication describes "a customer response device comprising: first voice acquisition means for acquiring the voice of a telephone call from a customer to a store; emotion recognition means for recognizing emotion from the voice acquired by the first voice acquisition means; complaint determination means for determining whether the voice content is a complaint based on whether the type of emotion recognized by the emotion recognition means represents at least one of 'anger' and 'excitement'; contact storage means for storing the contact information of the person in charge of handling the complaint; and first notification means for notifying the contact stored in the contact storage means when the complaint determination means determines that the voice content is a complaint, the device being able to identify customer complaints automatically and notify an appropriate responder" (see abstract).

Patent Document 1: JP 2005-12833 A
Patent Document 2: JP 2011-9902 A

Patent Document 1 describes a voice response service device that changes the volume and speech speed according to attributes such as the gender and age group of the telephone service user and the other party's operating environment; for example, it raises the volume and slows the speech speed when the user makes a push-button input error. The device of Patent Document 1 is effective for a telephone service that converts pre-registered narration. It cannot be used, however, where the conversational voice of a person in charge must be converted in real time during complex support handling, as in a call center.

Patent Document 2 describes a customer response device that recognizes emotion from the telephone voice of a customer during a call, automatically identifies complaints, analyzes the voiceprint, and assigns a complaint handler suited to the customer's age and gender. However, it gives no particular consideration to how the call is handled after the handler has been assigned.

In other words, neither patent document considers the following: when a change in the emotion of the customer (the other party) is perceived during a conversation, changing the voice of the person in charge (the customer responder) in real time to a voice considered appropriate for that change, so as to reduce the customer's stress.

The present invention therefore provides an automatic speech recognition / speech conversion system that can reduce customer stress by changing the voice of the person in charge in real time according to the customer's emotional state during the conversation.

To solve the above problem, the present invention has voice conversion means that, when an emotional change is perceived in the customer's voice during a conversation after telephone handling has begun, performs voice conversion tuning (reconstruction of the voice conversion) that changes the voice of the person in charge in real time in response to that emotional change.

An automatic speech recognition / speech conversion system used in a service providing system that conducts voice conversations with a call partner via voice communication means,
the automatic speech recognition / speech conversion system comprising:
voiceprint recognition means having a function of recognizing the voiceprint of the call partner's input voice;
emotion recognition means having a function of recognizing the call partner's emotion from the input voice;
a customer DB that stores customer information indicating the call partner's voiceprint, age, and gender;
a person-in-charge DB that stores voice parameters of the person in charge who handles the call partner;
a voice DB that stores voice parameters corresponding to age, gender, and emotion;
voice analysis means that receives the analysis results of the voiceprint recognition by the voiceprint recognition means and of the emotion recognition by the emotion recognition means, calculates the difference between the voice parameters in the person-in-charge DB and the parameters in the voice DB, and outputs the difference; and
voice conversion means that receives the difference extracted by the voice analysis means, converts the voice of the person in charge based on the difference, and outputs the converted voice to the communication means,
wherein, during a call between the call partner and the person in charge, when the emotion recognition means detects a change in the call partner's emotion and the voice analysis means calculates the difference between the voice parameters of the voice DB and the voice parameters of the person in charge, the voice conversion means converts the voice of the person in charge in real time based on the difference.

The automatic speech recognition / speech conversion system described above,
wherein the service providing system is a call center, and the call center has display means for displaying the analysis result of the voice analysis means.

The automatic speech recognition / speech conversion system described above,
further comprising keyword recognition means for recognizing, from the call partner's input voice, a keyword related to asking for something to be repeated,
wherein, when the keyword recognition means recognizes the keyword and the voice analysis means detects the keyword, the voice conversion means changes the parameter value of each attribute according to the keyword and, in accordance with those parameters, converts the volume and/or speed of the voice of the person in charge.

According to the present invention, the voice of the person in charge can be generated in real time according to the customer's emotion, and as a result an automatic speech recognition / speech conversion system that can reduce customer stress can be provided.

FIG. 1 is a block diagram showing the overall configuration when the automatic speech recognition / speech conversion system of the present invention is applied to a service providing apparatus (call center apparatus).
FIG. 2 is a flowchart explaining the processing of the automatic speech recognition / speech conversion system.
FIG. 3 is a diagram showing the flow of voice between the customer and the call center and an outline of the processing on the call center side.
FIG. 4 is a diagram showing an example of tuning when the automatic speech recognition / speech conversion system detects an emotional change.
FIG. 5 is a diagram showing an example of tuning when the automatic speech recognition / speech conversion system detects a keyword asking for something to be repeated.

Embodiments are described below with reference to the drawings.
In a contact center or call center reception system (hereinafter, call center), the quality of telephone handling strongly affects the impression of the product, the service, and the company itself. Voice handling that is easy for the customer to understand and does not cause stress is therefore required.
At present, however, the quality of telephone handling varies from one person in charge to another. Poor telephone handling alone can worsen the impression of the product, the service, and the company itself, and carries the risk of a large loss.

This embodiment describes an example that reduces the above risk in such a call center.

FIG. 1 is a block diagram showing the overall configuration when the automatic speech recognition / speech conversion system of the present invention is applied to a call center apparatus.

The call center 100 has a communication unit (communication means) 110, a display unit (display means) 120, and an automatic speech recognition / speech conversion system 130.

The communication means 110 transmits and receives voice to and from a communication terminal (communication means) 200, such as a telephone, on the customer side. The received customer voice (input voice) is supplied to the display means 120 and to the automatic speech recognition / speech conversion system 130.

The automatic speech recognition / speech conversion system 130 has a voiceprint recognition unit (voiceprint recognition means) 1301, an emotion recognition unit (emotion recognition means) 1302, a keyword recognition unit (keyword recognition means) 1303, a voice analysis unit (voice analysis means) 1304, a voice conversion unit (voice conversion means) 1305, a customer DB 1306, a person-in-charge DB 1307, and a voice DB 1308.

The voiceprint recognition means 1301 receives the telephone voice (input voice) from the communication means 110 and recognizes its voiceprint. From this voiceprint recognition it determines the age, gender, and points of caution (if there is a history).
The voiceprint recognition is performed, for example, by matching against the voiceprints registered in the customer DB 1306. If the voiceprint is not registered in the customer DB 1306, the voiceprint recognition means 1301 estimates the age, gender, and so on from the voiceprint of the input voice and registers the result in the customer DB 1306.
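The match-or-register behaviour described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `CustomerRecord` type, the use of a plain string as the voiceprint key, and the `estimate_attributes` callback are all hypothetical stand-ins for a real voiceprint matcher and attribute estimator.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerRecord:
    voiceprint: str            # hypothetical stand-in for a voiceprint feature vector
    age: int
    gender: str
    notes: list = field(default_factory=list)  # points of caution from past calls


def identify_customer(customer_db, voiceprint, estimate_attributes):
    """Return (record, is_new) for the caller's voiceprint.

    If the voiceprint is already in customer_db, the stored record is returned.
    Otherwise age and gender are estimated from the voiceprint and a new
    record is registered, mirroring the registration path in the text.
    """
    record = customer_db.get(voiceprint)
    if record is not None:
        return record, False
    age, gender = estimate_attributes(voiceprint)
    record = CustomerRecord(voiceprint, age, gender)
    customer_db[voiceprint] = record
    return record, True


if __name__ == "__main__":
    db = {}
    first, is_new = identify_customer(db, "vp-001", lambda vp: (50, "male"))
    print(is_new, first.age, first.gender)
```

An exact dictionary lookup is used only to keep the sketch short; a real system would compare voiceprints by similarity rather than equality.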

The emotion recognition means 1302 determines the target customer's emotion, such as joy, anger, sorrow, or pleasure, from the tone of the input voice.

The keyword recognition means 1303 detects keywords in the customer's voice, for example a phrase related to asking for repetition such as "your voice is quiet".

The voice analysis means 1304 searches the voice DB 1308 and the person-in-charge DB 1307 based on the age and gender data from the voiceprint recognition means 1301 and the emotion data from the emotion recognition means 1302.
It then extracts from the voice DB the voice information (voice parameters such as volume, speech speed, pitch, and sound quality) that is easy for the customer (the other party) to hear.

The voice analysis means 1304 also extracts the voice information of the person in charge (voice parameters such as volume, speech speed, pitch, and sound quality) from the person-in-charge DB 1307. In addition, when the keyword recognition means 1303 detects a keyword from the customer such as "your voice is quiet" or "I cannot hear you", the voice analysis means extracts the corresponding voice information (voice parameters such as volume and speech speed) from the voice DB 1308.

In short, the voice analysis means 1304 extracts voice information (attribute parameters) that allows conversion to a volume, speech speed, pitch, and sound quality that the customer can hear easily, according to the target customer's state, age, gender, and so on. The analysis result is displayed on the display means 120.

The voice conversion means 1305 converts the voice of the person in charge (output voice) into a voice (volume, speech speed, pitch, sound quality) that the customer (the other party) can hear easily, based on the voice analysis information from the voice analysis means 1304. It also tunes the voice conversion in real time, for example raising the volume or slowing the speech speed, in response to signs during the conversation that the customer is asking for repetition or to remarks such as "your voice is quiet" or "you are speaking fast".
In other words, it tunes the voice according to the state of the customer 20 in accordance with the attribute parameter values from the voice analysis means. The converted voice is transmitted to the communication means 200 on the customer side via the communication means 110.

The customer DB 1306 stores the customer's personal information as well as the voiceprint and information such as the age and gender determined from the voiceprint.

The person-in-charge DB 1307 stores the voice parameters of each person in charge, for example "volume: 50, speech speed: 70, pitch: 20, sound quality: -40" for person in charge A and "volume: 30, speech speed: 30, pitch: 70, sound quality: +20" for person in charge B.

The voice DB 1308 stores voice parameters corresponding to age, gender, and emotion. For example, for "age: 50, gender: male, emotion: normal" it stores "volume: 65, speech speed: 45, pitch: 60, sound quality: +30", and for "age: 20, gender: female, emotion: anger" it stores "volume: 45, speech speed: 55, pitch: 40, sound quality: +10". Each entry is a combination of parameters expected to produce the voice that is best for the customer, that is, most suitable to listen to.
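As a rough illustration of how the two databases might be represented, the sketch below stores the example values from the text in plain dictionaries. The `VoiceParams` type, the dictionary layout, and in particular the nearest-age matching rule are assumptions made for illustration; the specification does not say how a customer's age is matched to a voice DB entry.

```python
from typing import NamedTuple, Optional


class VoiceParams(NamedTuple):
    volume: int
    speed: int
    pitch: int
    quality: int


# Example values quoted in the description of the person-in-charge DB 1307.
STAFF_DB = {
    "A": VoiceParams(volume=50, speed=70, pitch=20, quality=-40),
    "B": VoiceParams(volume=30, speed=30, pitch=70, quality=20),
}

# Example values quoted in the description of the voice DB 1308,
# keyed by (age, gender, emotion).
VOICE_DB = {
    (50, "male", "normal"): VoiceParams(volume=65, speed=45, pitch=60, quality=30),
    (20, "female", "anger"): VoiceParams(volume=45, speed=55, pitch=40, quality=10),
}


def lookup_target(age: int, gender: str, emotion: str) -> Optional[VoiceParams]:
    """Return the entry with matching gender and emotion whose age is closest.

    Nearest-age matching is an assumption of this sketch, not a rule taken
    from the specification. Returns None when no entry matches.
    """
    candidates = [
        (abs(entry_age - age), params)
        for (entry_age, entry_gender, entry_emotion), params in VOICE_DB.items()
        if entry_gender == gender and entry_emotion == emotion
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[0])[1]


if __name__ == "__main__":
    print(lookup_target(52, "male", "normal"))
```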

FIG. 2 is a diagram showing the processing flow of the automatic speech recognition / speech conversion system of the present invention.

Telephone handling, voice conversion, and the updating of changes are described with reference to this figure.
First, in step S13001, the voice parameters of the person in charge whose voice is to be converted are registered in advance in the person-in-charge DB 1307.

Next, in step S13002, the voice 31 from the communication means 200 on the customer 20 side is received by the communication means 110 on the service provider side 100.

At this point the automatic speech recognition / speech conversion system 130 executes the following steps.
In step S13003, the voiceprint recognition means 1301 acquires voiceprint and voice tone information from the voice 31 of the customer 20.

In step S13004, the voiceprint recognition means 1301 searches the voiceprints in the customer DB 1306, and in step S13005 it determines whether a matching voiceprint is already registered in the customer DB 1306. That is, it determines whether the customer's voice 31 belongs to an existing user registered in the customer DB 1306.

If the voiceprint is not registered in the customer DB 1306 (No), the voiceprint recognition means 1301 estimates the age and gender from the voiceprint in step S13006 and registers the result in the customer DB 1306 in step S13007.

If the voiceprint is registered in the customer DB 1306 (Yes), the voiceprint recognition means 1301 identifies the individual (customer) from the customer DB 1306 in step S13008 and extracts the age, gender, points of caution, and so on.

Next, in step S13009, the emotion recognition means 1302 determines the customer's emotion (joy, anger, sorrow, or pleasure) from the voice tone.

In step S13010, the voice analysis means 1304 searches the voice DB 1308 based on the age, gender, and emotion data, and then in step S13011 it extracts from the voice DB the voice parameters considered optimal for the other party's age, gender, and emotion. In step S13012, it extracts the voice parameters of the person in charge (volume, speech speed, pitch, sound quality, and so on) from the person-in-charge DB 1307.

In step S13013, the voice analysis means 1304 takes the difference between the voice parameters of the person in charge 10 and the optimal voice parameters and extracts that difference (the change values).

In step S13014, the voice conversion means 1305 converts the voice of the person in charge 10 according to the attribute parameter values received from the voice analysis means 1304.

Next, in step S13015, the communication means 110 transmits the converted voice 33 produced by the voice conversion means 1305 to the communication means 200 on the customer 20 side.
In step S13016, information such as the age, gender, emotion, and points of caution (if there is a history) of the customer 20 is displayed on the display means 120.

While the above steps are being executed, the emotion recognition means 1302 monitors the customer's emotional changes during the conversation with the customer 20 in step S13017, and in step S13018 it determines whether an emotional change has occurred.

If an emotional change has occurred (Yes), the emotional change is detected in step S13019 and the points requiring caution are displayed on the display means 120. For example, when the customer 20 suddenly changes from a normal state to an angry state, a message to that effect is displayed on the display means 120 to prompt the person in charge 10 to take care in handling the customer.

In this case, in step S13020, the emotion recognition means 1302 searches the voice DB 1308 again based on the age, gender, and changed emotion, and again extracts the voice parameters considered optimal for the changed emotional state.

In step S13021, the voice analysis means 1304 takes the difference between the voice parameters of the person in charge and the optimal voice parameters and extracts that difference (the change values).

In step S13022, the voice conversion means 1305 converts the voice of the person in charge 10 according to the attribute parameter values received from the voice analysis means 1304. In step S13023, the converted voice 33 is transmitted to the communication means 200 on the customer 20 side via the communication means 110.

In step S13024, the voice analysis means 1304 reflects this change in the customer DB 1306.

If there is no emotional change in step S13018, the keyword recognition means 1303 at the next stage monitors the conversation for keywords in step S13025, and in step S13026 it determines whether a keyword asking for repetition is present.

If such a keyword is present (Yes), the keyword recognition means 1303 detects it in step S13027 and displays the point requiring caution on the display means 120, for example that the voice is too quiet.

In step S13028, the voice analysis means 1304 changes the parameter value of each attribute according to the keyword extracted by the keyword recognition means 1303.
For example, if the voice of the person in charge 10 is too quiet, the volume is raised (+10). If the person in charge speaks too fast, the speech speed is lowered (-15), and so on.
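The keyword-driven adjustment in step S13028 can be sketched as a small lookup from detected phrase to parameter delta. The +10 volume and -15 speed values come from the text; the English keyword strings, the substring matching, and the dictionary-based parameter set are assumptions made for illustration.

```python
# Hypothetical mapping from a detected phrase to parameter deltas.
# The deltas (+10 volume, -15 speed) are the example values in the text.
KEYWORD_ADJUSTMENTS = {
    "voice is quiet": {"volume": +10},
    "speaking too fast": {"speed": -15},
}


def apply_keyword_tuning(params, utterance):
    """Return a copy of params adjusted for every keyword found in utterance."""
    tuned = dict(params)
    text = utterance.lower()
    for keyword, deltas in KEYWORD_ADJUSTMENTS.items():
        if keyword in text:
            for name, delta in deltas.items():
                tuned[name] = tuned.get(name, 0) + delta
    return tuned


if __name__ == "__main__":
    current = {"volume": 50, "speed": 70}
    print(apply_keyword_tuning(current, "Sorry, your voice is quiet."))
```

Returning a new dictionary rather than mutating the input keeps the previous tuning available, which is convenient if the change is later written back to a customer record.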

In step S13029, the voice conversion means 1305 converts the voice of the person in charge 10 according to the attribute parameter values received from the voice analysis means 1304.

In step S13030, the converted voice 33 is transmitted to the communication means 200 on the customer 20 side via the communication means 110.

In step S13031, the voice analysis means 1304 reflects this change in the customer DB 1306.

Finally, in step S13032, it is determined whether telephone handling has ended. If it has not ended (No), the process returns to step S13017.
If telephone handling has ended (Yes), the call is terminated, that is, the telephone is hung up, in step S13033. Then, in step S13034, the customer DB 1306 is updated and the history is added.
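The monitoring loop of steps S13017 to S13034 can be summarized in a short sketch. It only records which branch would be taken for each observation; the event format, the action names, and the `REASK_KEYWORDS` tuple are hypothetical, and the actual recognition and conversion steps are omitted.

```python
# Hypothetical phrases that count as a request to repeat.
REASK_KEYWORDS = ("voice is quiet", "could not hear")


def run_call(events, initial_emotion="normal"):
    """Walk through (emotion, utterance) observations and list the actions taken.

    Mirrors the branch order in the flow: an emotional change is checked
    first, and keywords are checked only when no emotional change is found.
    """
    actions = []
    current_emotion = initial_emotion
    for emotion, utterance in events:                      # S13017: monitor
        if emotion != current_emotion:                     # S13018: change?
            current_emotion = emotion
            actions.append(("retune_for_emotion", emotion))        # S13019 to S13024
        elif any(k in utterance.lower() for k in REASK_KEYWORDS):  # S13025, S13026
            actions.append(("retune_for_keyword", utterance))      # S13027 to S13031
    actions.append(("update_history", current_emotion))    # S13033, S13034
    return actions


if __name__ == "__main__":
    call = [
        ("normal", "Hello, I have a question."),
        ("anger", "This still does not work!"),
        ("anger", "I could not hear that."),
    ]
    for action in run_call(call):
        print(action)
```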

FIGS. 3 to 5 schematically show an example of the voice exchange between the customer (customer terminal) and the person in charge (call center).

First, the state before any emotional change is described with reference to FIG. 3. FIG. 3 schematically shows how the person in charge responds, assuming that there is a problem with operations management software supplied to the customer by the service provider (call center) 100 and that the customer has contacted the call center.
When a customer telephones with an inquiry about a product, the automatic speech recognition / speech conversion system 130 checks whether the voiceprint of the customer's voice at that time (for example, "Hello...") is registered in the customer DB 1306 and, if so, identifies the customer by referring to that information. It also performs emotion recognition on the customer's voice.

These recognition results are displayed, for example, on the PC monitor of the display means 120 on the call center side. In this example the monitor shows "X Trading, YY; contract: J Software; age: 50; gender: male; emotion: normal". The monitor is placed where the person in charge at the call center can see it.

Next, based on this information, the parameters "volume: 65, speech speed: 45, pitch: 60, sound quality: +30", which are considered optimal for "X Trading, YY; contract: J Software; age: 50; gender: male; emotion: normal", are extracted from the voice DB 1308. The values for the person in charge, "person in charge A: volume: 50, speech speed: 70, pitch: 20, sound quality: -40", are extracted from the person-in-charge DB 1307.

The difference between these parameters is then calculated. In this example it is "volume: +15, speech speed: -25, pitch: +40, sound quality: +70". Based on this difference, the voice of person in charge A is converted into a voice suited to the customer. The customer is answered in this converted voice, for example: "You are YY from X Trading, aren't you? Thank you for your continued business. Are you calling about J Software?"
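The difference calculation is simple per-parameter subtraction (target minus current). The sketch below reproduces the arithmetic using the target values from the voice DB and the values for person in charge A from the person-in-charge DB description (sound quality taken as -40, as stated there). The dictionary keys and function name are illustrative only.

```python
def parameter_difference(target, current):
    """Return target minus current for every parameter in target."""
    return {name: target[name] - current[name] for name in target}


# Values for person in charge A, as listed for the person-in-charge DB.
staff_a = {"volume": 50, "speed": 70, "pitch": 20, "quality": -40}

# Target for "age: 50, gender: male, emotion: normal".
target_normal = {"volume": 65, "speed": 45, "pitch": 60, "quality": 30}

if __name__ == "__main__":
    print(parameter_difference(target_normal, staff_a))
    # -> {'volume': 15, 'speed': -25, 'pitch': 40, 'quality': 70}
```

The same function applied to the "anger" target used in the FIG. 4 example (volume 55, speech speed 35, pitch 40, sound quality +10) gives +5, -35, +20, and +50, matching the retuning values quoted for that case.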

Next, the case where a change in the customer's emotion is detected is described with reference to FIG. 4. FIG. 4 assumes, as an example, that emotion recognition finds the volume of the input voice 3% higher than normal and the pitch 5% higher than normal.
When, during a conversation like the one above, a change in emotion is perceived in the customer's voice, for example from "normal" to "anger", the parameters "volume: 55, speech speed: 35, pitch: 40, sound quality: +10" are extracted from the voice DB 1308 according to the customer's emotional state "anger", and the difference from the parameters of person in charge A, "volume: +5, speech speed: -35, pitch: +20, sound quality: +50", is calculated. Based on this difference, the voice conversion for person in charge A is reconstructed (tuned). This example also allows fine-grained tuning of the voice conversion.
A simple determination method may be used for emotion recognition.
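One possible "simple determination method" is a threshold on the relative rise in volume and pitch against the customer's normal levels, using the 3% and 5% figures from the FIG. 4 example. The sketch below is an assumption about how such a rule might look; the specification does not define the rule, and requiring both conditions at once is a choice made here for illustration.

```python
def emotion_changed(baseline_volume, baseline_pitch, volume, pitch,
                    volume_rise=0.03, pitch_rise=0.05):
    """Return True when both volume and pitch have risen past their thresholds.

    Thresholds default to the 3% (volume) and 5% (pitch) figures in the
    FIG. 4 example. Baselines must be positive.
    """
    volume_change = (volume - baseline_volume) / baseline_volume
    pitch_change = (pitch - baseline_pitch) / baseline_pitch
    return volume_change >= volume_rise and pitch_change >= pitch_rise


if __name__ == "__main__":
    # Normal level 100 for both; current reading 104 (volume) and 106 (pitch).
    print(emotion_changed(100, 100, 104, 106))
```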

Next, the case of a request to repeat will be described with reference to FIG. 5.
When the customer says something like "I couldn't catch that, so could you say it again? Your voice is a little quiet.", the keyword recognition function detects the phrase "voice is quiet", and the volume of person in charge A is adjusted by, for example, "volume: +10".
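The keyword-driven adjustment can be illustrated with a small lookup table. This is a sketch under stated assumptions: it operates on an already transcribed utterance, the phrase matching is a plain substring test, and the table itself is hypothetical. Only the "voice is quiet" (声が小さい) entry with "volume: +10" comes from the description; the speed entry is an invented example of the same pattern.

```python
# Hypothetical keyword table: phrase -> parameter adjustments.
KEYWORD_ADJUSTMENTS = {
    "voice is a little quiet": {"volume": 10},   # from the description
    "speaking too fast": {"speed": -10},         # invented example entry
}


def adjustments_for(utterance: str) -> dict[str, int]:
    """Collect the parameter adjustments for every known phrase that
    appears in the transcribed customer utterance."""
    text = utterance.lower()
    total: dict[str, int] = {}
    for phrase, changes in KEYWORD_ADJUSTMENTS.items():
        if phrase in text:
            for name, delta in changes.items():
                total[name] = total.get(name, 0) + delta
    return total


request = "I couldn't catch that, so could you say it again? Your voice is a little quiet."
print(adjustments_for(request))
```

The returned adjustments would be added to the current conversion parameters of the person in charge.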

The customer's change in emotion and the detected keyword at this time are displayed on the PC monitor so that person in charge A can notice the situation from the PC monitor.

According to the embodiment described above, the following effects can be expected.
(1)
The person in charge can converse in the optimal voice (volume, voice speed, pitch, sound quality) for the age, gender, and emotion of the customer, which improves the impression made by telephone support (and can be expected to improve the impression of the product, the service, and the company itself). In addition, stress-free telephone support reduces telephone support trouble (complaints).
(2)
Every person in charge becomes able to respond as described above, which corrects the variation in response among persons in charge.
(3)
Changes in the customer's emotion that the human ear cannot discern can be detected accurately, so that the voice conversion is tuned and the person in charge is made more conscious of the need to respond carefully.
(4)
Even for elderly customers and customers with handicaps, the voice of the person in charge can be converted into the optimal voice, which makes it possible to provide a service that does not stress the customer.
(5)
As an incidental effect, because an individual can be identified by voiceprint, monitoring information about that individual makes contract confirmation easier, for example in a support service, and reduces the customer's effort in confirming the contract. For example, the person in charge can respond with "You are Mr./Ms. ×× of Company ○○, aren't you? Thank you for your continued business. Are you calling about product △△?", and a reply such as "You knew right away! Easy contract confirmation is a big help." can be expected from the customer.
(6)
As a further incidental effect, because an individual can be identified by voiceprint, so-called "spoofing", in which contract information is obtained illegally in order to use the service, can be prevented.

The present invention is not limited to the embodiment described above and includes various modifications. For example, the embodiment has been described in detail to make the present invention easy to understand, and the invention is not necessarily limited to one having all of the configurations described.
A part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment.
Some or all of the above configurations, functions, processing units, processing means, and the like may be realized in hardware, for example by designing them as an integrated circuit.
The above configurations, functions, and the like may also be realized in software, with a processor interpreting and executing a program that realizes each function.
The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

The present invention is not limited to call centers and is applicable to any voice-based customer service system.

100 Service providing system (call center)
110 Communication means (call center side)
120 Display means
130 Automatic voice recognition / voice conversion system
1301 Voiceprint recognition means
1302 Emotion recognition means
1303 Keyword recognition means
1304 Voice analysis means
1305 Voice conversion means
1306 Customer DB
1307 Person-in-charge DB
1308 Voice DB
200 Communication means (customer side)

Claims

An automatic voice recognition / voice conversion system used in a service providing system in which a voice conversation is carried out with a call partner via voice communication means,
the automatic voice recognition / voice conversion system comprising:
voiceprint recognition means having a function of recognizing the voiceprint of the input voice of the call partner;
emotion recognition means having a function of recognizing the emotion of the call partner from the input voice;
a customer DB that stores customer information indicating the voiceprint, age, and gender of the call partner;
a person-in-charge DB that stores voice parameters of the person in charge who handles the call partner;
a voice DB that stores voice parameters corresponding to age, gender, and emotion;
voice analysis means for receiving the voiceprint recognition result of the voiceprint recognition means and the emotion recognition result of the emotion recognition means, calculating the difference between the voice parameters of the person-in-charge DB and the voice parameters of the voice DB, and outputting the difference; and
voice conversion means for receiving the difference extracted by the voice analysis means, converting the voice of the person in charge based on the difference, and outputting the converted voice to the communication means,
wherein, when the emotion recognition means detects a change in the emotion of the call partner during a call between the call partner and the person in charge, and the voice analysis means calculates the difference between the voice parameters of the voice DB and the voice parameters of the person in charge, the voice conversion means converts the voice of the person in charge in real time based on the difference.

The automatic voice recognition / voice conversion system according to claim 1,
wherein the service providing system includes a call center, and the call center includes display means for displaying the analysis result of the voice analysis means.

The automatic voice recognition / voice conversion system according to claim 1 or 2,
further comprising keyword recognition means for recognizing, from the input voice of the call partner, a keyword related to a request to repeat,
wherein, when the keyword recognition means recognizes the keyword and the voice analysis means detects the keyword, the voice conversion means changes the parameter value of each attribute according to the keyword and converts the volume and/or speed of the voice of the person in charge according to the parameter.