JP2021107873A

JP2021107873A - Voice characteristic change system and voice characteristic change method

Info

Publication number: JP2021107873A
Application number: JP2019239264A
Authority: JP
Inventors: John Lawrenson Matthew; John Wright Christopher; Michael Duffy David; Akitoshi Izumi; Takeshi Yoshihara
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2021-07-29

Abstract

To efficiently support the realization of smooth and comfortable dialogue between an operator and a customer according to the customer's emotions. SOLUTION: In a voice characteristic change system, a server and a receiver that receives and outputs video and an operator's spoken voice from an operator terminal are communicably connected. The receiver is connected to a camera that captures the customer viewing the video and listening to the spoken voice, acquires the image of the customer captured by the camera, and sends it to the server. The server derives emotion data indicating the customer's emotion toward the video and the spoken voice based on the captured image of the customer sent from the receiver, and generates and sends a processing instruction for changing the characteristics of the operator's spoken voice based on the derived emotion data. The receiver changes the characteristics of the operator's spoken voice based on the processing instruction sent from the server and outputs the result. SELECTED DRAWING: Figure 2

Description

The present disclosure relates to a voice characteristic change system and a voice characteristic change method.

Patent Document 1 discloses an empathy/antipathy location detection device that stores a set of emotion models, each modeling one of a plurality of emotion scores, extracts acoustic features frame by frame from the input speech signal of an interlocutor, and calculates an emotion score for each frame from the acoustic features using the emotion model set. The device calculates an empathy/antipathy location estimation score for each frame based on the calculated emotion scores, and estimates the interlocutor's empathy/antipathy locations based on that estimation score.

Japanese Unexamined Patent Publication No. 2015-99304

According to Patent Document 1, it is possible to detect the points at which the emotional state of the interlocutor has changed. However, the technique of Patent Document 1 focuses on estimating the customer's emotional state when the customer speaks in a setting such as a call center staffed by operators. It does not consider changing the voice characteristics of the information provider, such as an operator, in accordance with the estimated emotional state of the customer. For this reason, it has been difficult to provide information by voice in a way that matches the customer's emotions and is readily accepted by the customer.

The present disclosure has been devised in view of the conventional situation described above, and its object is to provide a voice characteristic change system and a voice characteristic change method that efficiently support the provision of information by voice from an operator to a customer in a manner suited to the customer's emotions.

The present disclosure provides a voice characteristic change system in which a server and a receiver that receives and outputs video and an operator's spoken voice from an operator terminal are communicably connected. The receiver is connected to a camera that captures a customer viewing the video and listening to the spoken voice, acquires the image of the customer captured by the camera, and sends it to the server. The server derives, based on the captured image of the customer sent from the receiver, emotion data indicating the customer's emotion toward the video and the spoken voice, generates a processing instruction for changing the characteristics of the operator's spoken voice based on the derived emotion data, and sends the processing instruction to the receiver. The receiver changes the characteristics of the operator's spoken voice based on the processing instruction sent from the server and outputs the result.

The present disclosure also provides a voice characteristic change method executed by a voice characteristic change system composed of a server and a receiver that receives and outputs video and an operator's spoken voice from an operator terminal. The method includes: a step in which the receiver, which has a camera that captures a customer viewing the video and listening to the spoken voice, acquires the image of the customer captured by the camera; a step in which the server derives, based on the captured image of the customer sent from the receiver, emotion data indicating the customer's emotion toward the video and the spoken voice; a step in which the server generates a processing instruction for changing the characteristics of the operator's spoken voice based on the derived emotion data and sends the processing instruction to the receiver; and a step in which the receiver changes the characteristics of the operator's spoken voice based on the processing instruction sent from the server and outputs the result.

According to the present disclosure, it is possible to efficiently support the provision of information by voice from an operator to a customer in a manner suited to the customer's emotions.

A diagram showing an example of the outline of the information display system according to Embodiment 1.
A block diagram showing a hardware configuration example of the information display system according to Embodiment 1.
A flowchart showing an example of the basic operation procedure for changing voice characteristics by the information display system according to Embodiment 1.
A flowchart showing an example of the voice characteristic change procedure in step S3 of FIG. 3.
A flowchart showing an example of an operation procedure of the information display system according to Embodiment 1.
A diagram showing an example of the registered contents of the emotion/modulation table.
A diagram showing an example of the outline of the TV viewing system according to Embodiment 2.
A block diagram showing a hardware configuration example of the TV viewing system according to Embodiment 2.
A flowchart showing an example of an operation procedure of the TV viewing system according to Embodiment 2.

Hereinafter, embodiments that specifically disclose the voice characteristic change system and the voice characteristic change method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed descriptions of well-known matters and duplicate descriptions of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily redundant and to make it easier for those skilled in the art to understand. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter described in the claims.

(Embodiment 1)
Embodiment 1 describes a use case in which the voice characteristic change system according to the present disclosure is applied to the information display system shown in FIG. 1. FIG. 1 is a diagram showing an example of the outline of the information display system 5 according to Embodiment 1. The information display system 5 includes a face-to-face information providing device 10, an operator terminal 50, and a server 80. The operator op uses the operator terminal 50 to present information on the face-to-face information providing device 10, which faces a customer. The face-to-face information providing device 10 and the operator terminal 50 can exchange various data (for example, text data, image data, voice data, or a combination of these) with each other via the network NW. Both the face-to-face information providing device 10 and the operator terminal 50 can access the server 80 connected to the network NW. The network NW is, for example, a wired LAN (Local Area Network), a wireless LAN, or a dedicated line connected to a wide area communication network such as the Internet.

The face-to-face information providing device 10, an example of the receiver, is placed on a box-shaped pedestal 31 of a certain height so as to suit the height of a person, such as a customer, standing in front of the device. The pedestal 31 is integrated with the face-to-face information providing device 10 by, for example, being painted the same color as the housing of the device or being covered with a cover shared with the device. The face-to-face information providing device 10 has a first housing 15 that projects upward so as to face the customer's face, and a second housing 18 that extends toward the customer so that it is within reach of the customer's hands (that is, the area around the hand, including parts such as the hand, arm, palm, and nails; the same applies hereinafter).

A display unit 29 is provided on the front surface of the first housing 15 and shows the video of the operator op's face and upper body captured by the camera 54 (see FIG. 2) of the operator terminal 50. The display unit 29 is a display device that displays images (for example, an LCD (Liquid Crystal Display) or an organic EL (Electroluminescence) display). The display unit 29 shows the operator op's face and upper body at approximately life size. This gives the customer the sense of actually being face to face with the operator op. A camera 24 that captures video of the customer's face and upper body is provided in the lower front portion of the first housing 15 (that is, the intermediate portion 15z of the housing on which the display unit 29 and the display unit 28 are arranged). A pair of left and right speakers 26 that output the voice data of the operator op's speech are provided near both ends of the intermediate portion 15z. A microphone 27 that picks up the customer's voice is provided near the center of the intermediate portion 15z.

A display unit 28 is provided on the upper surface of the second housing 18. The display unit 28 can display a UI (User Interface) screen, guidance information such as pamphlets, websites, and the like. The display unit 28 is implemented as a touch panel 14 (see FIG. 2) integrated with an input unit 23 (see FIG. 2) that accepts touch input.

The operator terminal 50, in turn, has an operation desk 60. In front of the operation desk 60, the operator op, wearing a headset 73, sits on a chair 71 in a stable posture. The headset 73 is part of the operator terminal 50 and has a speaker 55 (see FIG. 2) and a microphone 56 (see FIG. 2). It outputs the customer's speech from the speaker 55 and picks up the operator op's speech with the microphone 56.

A gate-shaped support base 61 is fixed to the operation surface of the operation desk 60. The support base 61 supports a display unit 53, which shows the video of the customer's face and upper body captured by the camera 24, and a camera 54, which captures the operator op's face and upper body. The video displayed on the display unit 53 is reflected by a half mirror 75 toward the operator op's line of sight and is viewed by the operator op.

The face-to-face information providing device 10 receives image data from the operator terminal 50 and displays the video of the operator op's face and upper body on the display unit 29. The face-to-face information providing device 10 outputs the voice data received from the operator terminal 50 from the speakers 26, and transmits the voice data picked up by the microphone 27 to the operator terminal 50.

The operator terminal 50, in turn, receives image data from the face-to-face information providing device 10 and displays the video of the customer's face and upper body on the display unit 53. The operator terminal 50 also receives voice data from the face-to-face information providing device 10 and outputs it from the speaker 55 of the headset 73 (see FIG. 2), and transmits the voice data picked up by the microphone 56 of the headset 73 (see FIG. 2) to the face-to-face information providing device 10.

FIG. 2 is a block diagram showing a hardware configuration example of the information display system 5 according to Embodiment 1. The information display system 5 includes the face-to-face information providing device 10, the operator terminal 50, and the server 80.

The face-to-face information providing device 10 is a device through which the operator op can talk with a customer via the operator terminal 50. It has a processor 21, a memory 22, a touch panel 14, a communication unit 20, a display unit 29, a camera 24, a voice control unit 25, speakers 26, and a microphone 27. The camera 24 and the microphone 27 may be separate from the face-to-face information providing device 10 and connected externally.

The processor 21 performs overall control of the face-to-face information providing device 10. The memory 22 is used as the working memory of the processor 21 and also stores various data, information, and programs. The memory 22 includes a primary storage device (for example, RAM (Random Access Memory) and ROM (Read Only Memory)). The memory 22 may also include a secondary storage device (for example, an HDD (Hard Disk Drive) or SSD (Solid State Drive)) or a tertiary storage device (for example, an optical disc or SD card).

The touch panel 14 integrates the display unit 28 and the input unit 23. The display unit 28 and the input unit 23 may instead be separate components. In that case, the display unit 28 is a display device such as an LCD or organic EL display, and the input unit 23 is an input device such as a mouse, keyboard, or touch pad.

The communication unit 20 is a network I/F circuit that communicates wirelessly or by wire with the communication unit 57 of the operator terminal 50 and the communication unit 83 of the server 80 via the network NW. Communication methods used by the communication unit 20 include, for example, WAN (Wide Area Network), LAN, mobile communication such as LTE (Long Term Evolution) or 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), and mobile phone communication. The communication unit 20 transmits the image data of the customer's face captured by the camera 24 and the operation information entered on the input unit 23 of the touch panel 14 to the operator terminal 50. The communication unit 20 receives the video of the operator op's face and upper body transmitted from the operator terminal 50.

The display unit 29 has an ultra-high-resolution display, for example a 4K (3840 x 2160 pixel) display, that shows the operator op's face and upper body. The display unit 29 may be implemented as a touch panel that accepts touch input from the customer.

The camera 24 is a built-in camera located in the lower part of the first housing 15 and captures video of the face and upper body of the customer standing in front of the face-to-face information providing device 10. The angle of view of the camera 24 may be remotely controllable from the operator terminal 50. The camera 24 may be, for example, a high-resolution 4K camera, a full high-definition camera, a high-definition camera, or a standard camera.

The voice control unit 25 performs compression and decompression on the voice data sent and received via the communication unit 20: it outputs decompressed voice data from the speakers 26 and compresses the voice data picked up by the microphone 27. The voice control unit 25 also performs noise removal, amplification, and similar processing on the voice data.

The speakers 26 are directional stereo speakers designed so that the customer in front of the face-to-face information providing device 10 can hear them easily, and they output the operator op's speech and other audio.

The microphone 27 is a directional microphone pointed toward the customer and picks up the customer's speech. The microphone 27 and the speakers 26 may be implemented as a headset, in which case the customer wears the headset while operating the face-to-face information providing device 10.

The operator terminal 50 is a terminal operated by the operator op and has a processor 51, a memory 52, a display unit 53, a camera 54, a speaker 55, a microphone 56, and a communication unit 57.

The processor 51 performs overall control of the operator terminal 50. The memory 52 is used as the working memory of the processor 51 and also stores various data, information, and programs. The memory 52 includes a primary storage device (for example, RAM and ROM). The memory 52 may also include a secondary storage device (for example, an HDD or SSD) or a tertiary storage device (for example, an optical disc or SD card).

The display unit 53 has an ultra-high-resolution display, for example a 4K (3840 x 2160 pixel) display, that shows the customer's face and upper body.

The camera 54 captures video of the operator op's face and upper body. The camera 54 may be, for example, a high-resolution 4K camera, a full high-definition camera, a high-definition camera, or a standard camera.

The communication unit 57 is a network I/F circuit that communicates wirelessly or by wire with the communication unit 20 of the face-to-face information providing device 10 and the communication unit 83 of the server 80 via the network NW. Communication methods used by the communication unit 57 include, for example, WAN, LAN, mobile communication such as LTE or 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), and mobile phone communication. The communication unit 57 transmits the image data of the operator op's face captured by the camera 54 to the face-to-face information providing device 10. The communication unit 57 receives the video of the customer's face and upper body transmitted from the face-to-face information providing device 10. The communication unit 57 also receives advice information corresponding to the customer's emotion transmitted from the server 80.

The speaker 55 is a directional stereo speaker designed so that the operator op can hear it easily, and it outputs the customer's speech and other audio. The microphone 56 is a directional microphone pointed toward the operator op and picks up the operator op's speech. The microphone 56 and the speaker 55 are implemented as the headset 73. The operator op wears the headset 73 while operating the operator terminal 50.

The server 80 changes the characteristics of the operator op's spoken voice in accordance with the customer's emotion data, and has a processor 81, a memory 82, a communication unit 83, and a storage 85. The emotion data indicates the customer's emotion toward the video and the spoken voice. In this description, changing the characteristics of the voice is also referred to as "modulation".

The processor 81 includes a modulation method determination unit 91 and an emotion analysis algorithm 92 as functions implemented by executing programs stored in the memory 82. The emotion analysis algorithm 92 estimates the customer's emotion. It includes an image analysis unit 93, which estimates the customer's emotion from the customer's face image data, and a voice analysis unit 94, which estimates the customer's emotion from the voice data of the customer's speech. The emotion analysis algorithm 92 may output a timestamp indicating when the customer's emotion was estimated.
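As an illustration of how the outputs of the image analysis unit 93 and the voice analysis unit 94 might be combined into one timestamped estimate, the following is a minimal sketch. The description does not specify how the two estimates are merged, so the weighted average, the score dictionaries, and all names (`EmotionEstimate`, `fuse_emotion_scores`, `image_weight`) are assumptions made for illustration only.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EmotionEstimate:
    emotion: str      # label with the highest combined score
    score: float      # combined score of that label
    timestamp: float  # time at which the estimate was made


def fuse_emotion_scores(
    image_scores: Dict[str, float],
    voice_scores: Dict[str, float],
    image_weight: float = 0.5,
    now: Optional[float] = None,
) -> EmotionEstimate:
    """Combine per-emotion scores from image and voice analysis.

    Hypothetical fusion rule: a weighted average of the two score sets.
    A label missing from one analyser is treated as having score 0.0.
    """
    labels = sorted(set(image_scores) | set(voice_scores))
    if not labels:
        raise ValueError("at least one emotion score is required")
    combined = {
        label: image_weight * image_scores.get(label, 0.0)
        + (1.0 - image_weight) * voice_scores.get(label, 0.0)
        for label in labels
    }
    # Ties are resolved in favour of the alphabetically first label.
    best = max(labels, key=lambda label: combined[label])
    timestamp = time.time() if now is None else now
    return EmotionEstimate(best, combined[best], timestamp)
```

A call such as `fuse_emotion_scores({"anger": 0.6, "joy": 0.4}, {"anger": 0.8, "joy": 0.2})` would return an estimate labelled `"anger"` together with the time of estimation.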

Based on the customer's emotion estimated by the emotion analysis algorithm 92, the modulation method determination unit 91 uses the emotion/modulation table Tb1 registered in the emotion database 95 to select the voice modulation method corresponding to that emotion.
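The selection performed by the modulation method determination unit 91 can be pictured as a table lookup. The sketch below assumes a dictionary keyed by emotion label; the actual contents of the emotion/modulation table Tb1 are not given in this part of the description, so the emotion labels, the modulation parameters, and the fallback to "no change" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modulation:
    pitch_shift_semitones: float  # negative values lower the voice
    rate_factor: float            # values below 1.0 slow the speech down
    gain_db: float                # negative values make the voice quieter


# Hypothetical stand-in for the emotion/modulation table Tb1.
EMOTION_MODULATION_TABLE = {
    "anger": Modulation(-2.0, 0.9, -3.0),
    "sadness": Modulation(-1.0, 0.95, 0.0),
    "joy": Modulation(1.0, 1.05, 0.0),
}

# Assumed fallback when the estimated emotion is not registered.
NO_CHANGE = Modulation(0.0, 1.0, 0.0)


def select_modulation(emotion: str) -> Modulation:
    """Return the modulation registered for the estimated emotion."""
    return EMOTION_MODULATION_TABLE.get(emotion, NO_CHANGE)
```

Keeping the mapping in a table rather than in code matches the description's use of a database: entries can be added or tuned without changing the lookup logic.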

The memory 82 is used as the working memory of the processor 81 and also stores various data, information, and programs. The memory 82 includes a primary storage device (for example, RAM and ROM).

The communication unit 83 is a network I/F circuit that communicates wirelessly or by wire with the communication unit 20 of the face-to-face information providing device 10 and the communication unit 57 of the operator terminal 50 via the network NW. Communication methods used by the communication unit 83 include, for example, WAN, LAN, mobile communication such as LTE or 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), and mobile phone communication. The communication unit 83 transmits advice information corresponding to the customer's emotion to the operator terminal 50. The communication unit 83 receives the video of the customer's face and upper body transmitted from the face-to-face information providing device 10, and transmits the modulation method for the operator op's spoken voice to the face-to-face information providing device 10.

The storage 85 includes an HDD or SSD and stores the emotion database 95. The emotion database 95 includes the emotion/modulation table Tb1 (see FIG. 6), in which customer emotions and the corresponding modulation methods for the operator op's voice are registered.

Next, the operation procedure of the information display system 5 according to Embodiment 1 will be described.

First, the basic operation of changing the voice characteristics will be described. As an example, assume a scene in which the operator op sends voice data through the operator terminal 50 to the face-to-face information providing device 10 that the customer is watching and listening to, and conveys information to the customer by voice. FIG. 3 is a flowchart showing an example of the basic operation procedure for changing voice characteristics by the information display system 5 according to Embodiment 1.

In FIG. 3, the server 80 acquires the customer's voice data and image data from the face-to-face information providing device 10 (S1). The server 80 estimates the customer's emotion based on the customer's voice data and image data (S2). The server 80 issues an instruction to change the characteristics of the operator op's spoken voice in accordance with the estimated emotion. Following the instruction from the server 80, the face-to-face information providing device 10 modulates the operator op's spoken voice and outputs it (S3). The details of step S3 are described later with reference to FIG. 4.
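The flow of steps S1 to S3 can be sketched as three small functions, one per step. This is only an illustration under stated assumptions: the emotion scores are assumed to have been computed already from the customer's image and voice data, the processing instruction is reduced to a single linear gain, and the names `ProcessingInstruction`, `estimate_emotion`, `make_instruction`, `apply_instruction`, and `run_once` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProcessingInstruction:
    emotion: str  # emotion the instruction was derived from
    gain: float   # hypothetical linear gain applied to the operator's voice


def estimate_emotion(scores: Dict[str, float]) -> str:
    """S2 (server): pick the emotion with the highest score.

    The scores stand in for the result of analysing the customer's
    image and voice data acquired in S1.
    """
    return max(scores, key=lambda label: scores[label])


def make_instruction(emotion: str) -> ProcessingInstruction:
    """Server: build the instruction that is sent to the receiver."""
    gains = {"anger": 0.5, "joy": 1.0}  # illustrative values only
    return ProcessingInstruction(emotion, gains.get(emotion, 1.0))


def apply_instruction(
    samples: List[float], instruction: ProcessingInstruction
) -> List[float]:
    """S3 (receiver): modulate the operator's voice samples before output."""
    return [sample * instruction.gain for sample in samples]


def run_once(scores: Dict[str, float], operator_samples: List[float]) -> List[float]:
    """One pass through S2 and S3 for a single block of operator audio."""
    emotion = estimate_emotion(scores)
    instruction = make_instruction(emotion)
    return apply_instruction(operator_samples, instruction)
```

In this sketch the server side only decides what to change, and the receiver side performs the change, which mirrors the division of roles between the server 80 and the face-to-face information providing device 10.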

FIG. 4 is a flowchart showing an example of the voice characteristic change procedure in step S3 of FIG. 3. The series of processes shown in FIG. 4 is a subroutine that details the voice characteristic change procedure in step S3 of FIG. 3.

図４において、サーバ８０は、図３のステップＳ２において推定された顧客の感情に変化が起きた時（例えば、顧客が突然怒りだした時）の生体情報の特徴を特定する（Ｓ３１）。生体情報の特徴として、サーバ８０は、図３のステップＳ１で取得された画像データを基に顔認識を行い、顧客の顔画像に現れた喜怒哀楽の表面感情の検知結果が挙げられる。また、生体情報の特徴として、図３のステップＳ１で取得された顧客の顔画像データを基にサーバ８０により導出される心拍数あるいは心拍変動のデータを用いてもよい。心拍変動を基に内面感情（特に、ストレス度）を分析する技術として、例えば、特許第６３５８５０６号公報には、被験者が撮像された画像データを入力し、入力された画像データの複数フレームにわたる肌色部分の画素値の周期を基に脈拍数を推定することが開示されている。同様に、国際公開第２０１７／１５４４７７公報には、撮像画像から肌色領域を検出し、肌色領域から抽出した情報に基づき脈波信号を検出し、脈波信号に基づき被検体の脈拍を推定することが開示されている。また、生体情報として、特許文献１に示すように、顧客が発話する声の音声データを用いて、顧客の感情を推定することが知られている。 In FIG. 4, the server 80 identifies the features of the biometric information at the moment the customer's emotion estimated in step S2 of FIG. 3 changed (for example, when the customer suddenly became angry) (S31). One example of such a feature is the result of detecting the surface emotions (joy, anger, sorrow, pleasure) that appear in the customer's face image, obtained when the server 80 performs face recognition on the image data acquired in step S1 of FIG. 3. Heart rate or heart rate variability data derived by the server 80 from the customer's face image data acquired in step S1 of FIG. 3 may also be used as a feature of the biometric information. As techniques for analyzing internal emotions (in particular, the degree of stress) from heart rate variability, Japanese Patent No. 6358506, for example, discloses inputting image data in which a subject is captured and estimating the pulse rate based on the period of the pixel values of the skin-color portion across multiple frames of the input image data. Similarly, International Publication No. 2017/154477 discloses detecting a skin-color region from a captured image, detecting a pulse wave signal based on information extracted from the skin-color region, and estimating the subject's pulse based on the pulse wave signal. It is also known, as shown in Patent Document 1, to estimate a customer's emotion using voice data of the customer's speech as biometric information.
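The pulse estimation referred to above relies on the periodicity of skin-color pixel values across frames. The following is a minimal sketch of that idea only, assuming a per-frame mean skin pixel value has already been extracted; the function name `estimate_pulse_bpm`, the 40 to 180 bpm search band, and the 1 bpm scan step are illustrative assumptions and are not taken from the cited publications.

```python
import math

def estimate_pulse_bpm(skin_means, fps, lo_bpm=40, hi_bpm=180):
    """Return the rate (beats per minute) whose frequency carries the most power.

    skin_means: mean skin-color pixel value per frame (hypothetical input).
    fps: frame rate of the camera in frames per second.
    """
    n = len(skin_means)
    mean = sum(skin_means) / n
    centered = [v - mean for v in skin_means]  # remove the constant offset

    best_bpm, best_power = None, -1.0
    for bpm in range(lo_bpm, hi_bpm + 1):
        freq = bpm / 60.0  # candidate pulse frequency in Hz
        re = sum(c * math.cos(2 * math.pi * freq * i / fps)
                 for i, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * freq * i / fps)
                 for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm
```

Heart rate variability would additionally require beat-to-beat intervals, which this sketch does not attempt to extract.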

サーバ８０のプロセッサ８１は、ストレージ８５に記憶された感情データベース９５を基に、特定した生体情報の特徴と類似する生体情報の特徴を検索する（Ｓ３２）。感情データベース９５には、感情あるいは感情の変化に対応する生体情報の特徴が登録されている。生体情報は、顔の喜怒哀楽の表情、心拍数、心拍変動、音声等、少なくとも１つ含む。 The processor 81 of the server 80 searches the emotion database 95 stored in the storage 85 for biometric information features similar to the identified features (S32). The emotion database 95 registers the biometric information features that correspond to each emotion or change in emotion. The biometric information includes at least one of facial expressions of emotion, heart rate, heart rate variability, voice, and the like.

プロセッサ８１は、感情データベース９５を検索した結果、生体情報の特徴が該当した場合、感情データベース９５に登録された感情・変調テーブルＴｂ１を基に、生体情報の特徴に対応する声の音声の変調方法を選択する（Ｓ３３）。プロセッサ８１は、通信部８３を介して対面型情報提供装置１０に生体情報の特徴に対応する声の変調方法を送信する。 When the search of the emotion database 95 finds a matching biometric information feature, the processor 81 selects the voice modulation method corresponding to that feature based on the emotion/modulation table Tb1 registered in the emotion database 95 (S33). The processor 81 transmits the selected voice modulation method to the face-to-face information providing device 10 via the communication unit 83.

対面型情報提供装置１０は、声の変調方法に従い、オペレータ端末５０から送信されたオペレータｏｐの声の音声を変調して出力する（Ｓ３４）。 The face-to-face information providing device 10 modulates and outputs the voice of the operator op transmitted from the operator terminal 50 according to the voice modulation method (S34).
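Steps S32 and S33 can be sketched as a nearest-neighbor lookup. This is only an illustration under stated assumptions: the two-element feature vectors (heart rate, facial anger score), the database entries, the Euclidean distance measure, and the `max_distance` threshold are all hypothetical, since the text does not specify how similarity is computed.

```python
import math

# Hypothetical entries: (feature vector, emotion label, modulation method).
# The feature layout (heart rate in bpm, facial anger score 0..1) is assumed.
EMOTION_DB = [
    ((70.0, 0.1), "normal", "no modulation"),
    ((95.0, 0.9), "anger", "lower pitch at end of phrases"),
]

def select_modulation(features, db=EMOTION_DB, max_distance=15.0):
    """Return the modulation method of the most similar entry, or None.

    None models the case in which no registered feature matches (S32 finds
    nothing), so no modulation instruction is produced.
    """
    best = min(db, key=lambda entry: math.dist(features, entry[0]))
    if math.dist(features, best[0]) > max_distance:
        return None
    return best[2]
```

In practice the features would need to be normalized to comparable scales before a distance is taken; that step is omitted here for brevity.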

次に、情報表示システム５における音声特性変更動作をより具体的に示す。図５は、実施の形態１に係る情報表示システム５の動作手順を示すフローチャートである。図３と同様、オペレータｏｐがオペレータ端末５０を通じて顧客が視聴する対面型情報提供装置１０に音声データを送信し、顧客に物事を音声で伝える場面を想定する。 Next, the voice characteristic changing operation in the information display system 5 will be shown more concretely. FIG. 5 is a flowchart showing an operation procedure of the information display system 5 according to the first embodiment. Similar to FIG. 3, it is assumed that the operator op transmits voice data to the face-to-face information providing device 10 that the customer watches through the operator terminal 50 and conveys things to the customer by voice.

図５において、サーバ８０は、通信部８３を介して、対面型情報提供装置１０から送信された顧客の音声データおよび画像データを受信して取得する（Ｓ４１）。 In FIG. 5, the server 80 receives and acquires the customer's voice data and image data transmitted from the face-to-face information providing device 10 via the communication unit 83 (S41).

プロセッサ８１の感情分析アルゴリズム９２は、顧客の音声データおよび画像データを基に、顧客の感情を推定する（Ｓ４２）。このとき、画像分析部９３は、画像データを基に顔認識を行い、顧客の顔画像に現れる喜怒哀楽の表面感情を推定する。また、画像分析部９３は、顔画像データを基に心拍変動を検知し、顧客の内面感情を推定する。また、音声分析部９４は、顧客が発話する声の音声を基に、顧客の共感、反感等の感情を推定する。 The emotion analysis algorithm 92 of the processor 81 estimates the customer's emotion based on the customer's voice data and image data (S42). At this time, the image analysis unit 93 performs face recognition on the image data and estimates the surface emotions (joy, anger, sorrow, pleasure) that appear in the customer's face image. The image analysis unit 93 also detects heart rate variability from the face image data and estimates the customer's internal emotion. The voice analysis unit 94 estimates emotions such as the customer's empathy or antipathy based on the customer's speech.

プロセッサ８１の変調方法決定部９１は、推定した顧客の感情に合わせて、オペレータｏｐの声の音声特性を変更するための指示を作成する（Ｓ４３）。この指示の作成に際し、変調方法決定部９１は、感情データベース９５に登録された感情・変調テーブルＴｂ１を基に、推定された感情に対応する声の変調方法を選択する。図６は、感情・変調テーブルＴｂ１の登録内容の一例を示す図である。感情・変調テーブルＴｂ１には、顧客の感情が「平常」である場合、オペレータが発話する声の「変調無し」が登録される。顧客の感情が「喜び」である場合、同様にオペレータが発話する声の「変調無し」が登録される。顧客の感情が「怒り」である場合、オペレータが発話する声の「語尾のピッチを下げる。怒り度合いに応じて下げる音量および音の長さの少なくもとも一方を変える。怒り度合が大きいほど音量を大きくかつ音の長さを長くする。」が登録される。顧客の感情が「悩み」である場合、オペレータが発話する声の「語気を強めて購買または契約を促す。」が登録される。 The modulation method determination unit 91 of the processor 81 creates an instruction for changing the voice characteristics of the operator op's voice to suit the estimated customer emotion (S43). In creating this instruction, the modulation method determination unit 91 selects the voice modulation method corresponding to the estimated emotion based on the emotion/modulation table Tb1 registered in the emotion database 95. FIG. 6 is a diagram showing an example of the registered contents of the emotion/modulation table Tb1. In the emotion/modulation table Tb1, when the customer's emotion is "normal", "no modulation" is registered for the operator's voice. When the customer's emotion is "joy", "no modulation" is likewise registered. When the customer's emotion is "anger", the registered modulation is "Lower the pitch at the end of words. Change at least one of the volume and the duration of the lowered sound according to the degree of anger. The greater the degree of anger, the louder the volume and the longer the duration." When the customer's emotion is "worry", the registered modulation is "Speak more forcefully to encourage a purchase or contract."

感情・変調テーブルＴｂ１では、顧客の感情を推定する一例として、声の音声データを例示したが、心拍数、心拍変動等のデータを組み合わせて感情を推定してもよい。また、心拍変動を組み合わせる場合、集中している状態であると心拍変動が安定し、リラックスしている状態であると心拍変動が不安定になる。また、感情・変調テーブルＴｂ１では、感情分析アルゴリズム９２によって推定された感情が「喜び」から「悲しみ」に変更される場合、発話速度を遅くしてピッチを下げるように、オペレータの声が登録されてもよい。また、推定された感情が「怒り」から「興奮」に変更される場合、興奮を煽るような特定の単語の強調を下げてピッチを上げるように、オペレータの声が登録されてもよい。 For the emotion/modulation table Tb1, voice data was used as an example of the basis for estimating the customer's emotion, but the emotion may also be estimated by combining data such as heart rate and heart rate variability. When heart rate variability is combined, note that heart rate variability is stable while the customer is concentrating and unstable while the customer is relaxed. The emotion/modulation table Tb1 may also register modulations for changes in emotion. When the emotion estimated by the emotion analysis algorithm 92 changes from "joy" to "sadness", the operator's voice may be registered so that the speech rate is slowed and the pitch is lowered. When the estimated emotion changes from "anger" to "excitement", the operator's voice may be registered so that emphasis on specific words that would stoke excitement is reduced and the pitch is raised.
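The emotion/modulation table Tb1 and the emotion-change entries described above can be pictured as two small lookup tables. This is a sketch, not the registered format of FIG. 6: the dictionary names, the English keys, the function `lookup_modulation`, and the fallback to "no modulation" for an unregistered emotion are assumptions made for illustration.

```python
# Entries paraphrased from the description of table Tb1.
TB1 = {
    "normal": "no modulation",
    "joy": "no modulation",
    "anger": "lower the pitch at the end of words; scale volume and "
             "duration with the degree of anger",
    "worry": "speak more forcefully to encourage a purchase or contract",
}

# Optional entries keyed by (previous emotion, current emotion).
TB1_TRANSITIONS = {
    ("joy", "sadness"): "slow the speech rate and lower the pitch",
    ("anger", "excitement"): "reduce emphasis on words that stoke "
                             "excitement and raise the pitch",
}

def lookup_modulation(current, previous=None):
    """Prefer a registered emotion transition, then the per-emotion entry."""
    if previous is not None and (previous, current) in TB1_TRANSITIONS:
        return TB1_TRANSITIONS[(previous, current)]
    # Assumption: an unregistered emotion leaves the voice unmodulated.
    return TB1.get(current, "no modulation")
```

For example, `lookup_modulation("sadness", previous="joy")` returns the transition entry, while `lookup_modulation("joy")` returns "no modulation".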

プロセッサ８１は、ステップＳ４３で作成された、オペレータの声の音声特性を変更するための指示を、通信部８３を介して対面型情報提供装置１０に送信する（Ｓ４４）。 The processor 81 transmits the instruction for changing the voice characteristic of the operator's voice created in step S43 to the face-to-face information providing device 10 via the communication unit 83 (S44).

対面型情報提供装置１０のプロセッサ２１は、通信部２０を介して上記指示を受信すると、指示された変調方法でオペレータｏｐの声を変調して出力する（Ｓ４５）。 When the processor 21 of the face-to-face information providing device 10 receives the above instruction via the communication unit 20, the processor 21 modulates the voice of the operator op by the instructed modulation method and outputs it (S45).

また、サーバ８０のプロセッサ８１は、通信部８３を介して、推定した顧客の感情のデータをオペレータ端末５０に送信する。オペレータ端末５０のプロセッサ５１は、推定した顧客の感情に基づく顧客の表情を表示部５３に表示する（Ｓ４６）。このとき、プロセッサ５１は、例えばメモリ５２に登録された、各種感情の顔アイコンのいずれかを選択して顧客の表情を表示してもよい。また、プロセッサ５１は、テキスト文字、マーク画像等で顧客の表情を表示してもよい。 Further, the processor 81 of the server 80 transmits the estimated customer emotion data to the operator terminal 50 via the communication unit 83. The processor 51 of the operator terminal 50 displays the customer's facial expression based on the estimated customer's emotion on the display unit 53 (S46). At this time, the processor 51 may select one of the face icons of various emotions registered in the memory 52, for example, and display the facial expression of the customer. Further, the processor 51 may display the facial expression of the customer with text characters, mark images, and the like.

サーバ８０のプロセッサ８１は、推定した顧客の感情を基に、オペレータの発話、例えば現在紹介している商品の営業を継続するべきか否かのアドバイスをオペレータ端末５０に送信する。オペレータ端末５０のプロセッサ５１は、このアドバイスを表示部５３に表示する（Ｓ４７）。例えば、想定を超えるような顧客の怒り（なお想定を超えなくてもよい）があった場合、営業の継続を中止するアドバイスが行われてもよい。一例として、サーバ８０のプロセッサ８１は、想定を超える文言、例えば「バカヤロー」、「出て来い！」等のフレーズ（テキストデータ）をメモリ８２に登録しておき、顧客が発話する内容に想定を超える文言が含まれた場合、営業の継続を中止するアドバイスを行う。なお、プロセッサ８１は、推定した顧客の感情、顧客の顔画像、顧客の声の音声等のデータで機械学習を行い、営業の継続を中止する否かのアドバイス行ってもよい。ここでは、アドバイスは、サーバ８０で決定されたが、オペレータ端末５０によって決定されてもよい。オペレータ端末５０が行う場合、サーバ８０は、推定した顧客の感情を表すデータをオペレータ端末５０に送信する。 Based on the estimated customer emotion, the processor 81 of the server 80 transmits to the operator terminal 50 advice on the operator's speech, for example, on whether to continue the sales pitch for the product currently being introduced. The processor 51 of the operator terminal 50 displays this advice on the display unit 53 (S47). For example, when the customer shows anger beyond what was anticipated (the anger need not exceed what was anticipated), advice to discontinue the sales pitch may be given. As one example, the processor 81 of the server 80 registers in the memory 82 phrases (text data) beyond what is anticipated, such as "Bakayaro" ("You idiot") and "Dete koi!" ("Get out here!"), and gives advice to discontinue the sales pitch when the customer's speech contains such a phrase. The processor 81 may also perform machine learning on data such as the estimated customer emotion, the customer's face image, and the customer's voice, and give advice on whether to discontinue the sales pitch. Although the advice is determined by the server 80 here, it may instead be determined by the operator terminal 50. In that case, the server 80 transmits data representing the estimated customer emotion to the operator terminal 50.
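The phrase-based variant of the advice in step S47 amounts to a substring check against a registered list. The sketch below assumes the customer's speech has already been converted to text by some speech recognizer, which the description does not detail; the names `UNEXPECTED_PHRASES` and `advise` and the "stop"/"continue" return values are hypothetical.

```python
# The two phrases come from the example in the text; a real list would be
# registered in the memory 82.
UNEXPECTED_PHRASES = ("バカヤロー", "出て来い")

def advise(transcript, phrases=UNEXPECTED_PHRASES):
    """Return "stop" if the customer's transcript contains a registered phrase."""
    if any(phrase in transcript for phrase in phrases):
        return "stop"      # advise the operator to discontinue the sales pitch
    return "continue"
```

The machine-learning variant mentioned in the text would replace this fixed list with a trained classifier.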

実施の形態１に係る情報表示システム５は、オペレータｏｐがオペレータ端末５０を通じて顧客が視聴する対面型情報提供装置１０にオペレータの発した音声の音声データを送信し、顧客に物事を音声で伝える場合、推定された顧客の感情に合わせてオペレータｏｐが発話する声の音声の特性を変更する。これにより、顧客の感情に合わせたオペレータｏｐから顧客への音声による情報提供がスムーズかつ効率的に行われるようになる。 When the operator op transmits voice data of the operator's speech through the operator terminal 50 to the face-to-face information providing device 10 viewed by the customer and conveys things to the customer by voice, the information display system 5 according to the first embodiment changes the characteristics of the operator op's spoken voice to suit the estimated customer emotion. As a result, voice-based information provision from the operator op to the customer, tailored to the customer's emotion, proceeds smoothly and efficiently.

このように、情報表示システム５では、映像およびオペレータｏｐの発話音声をオペレータ端末５０から受信して出力する対面型情報提供装置１０と、サーバ８０とが通信可能に接続される。対面型情報提供装置１０は、映像および発話音声を視聴する顧客を撮像するカメラ２４と接続されあるいはカメラ２４を有し、カメラ２４により撮像された顧客の撮像画像を取得してサーバ８０に送る。サーバ８０は、対面型情報提供装置１０から送られた顧客の撮像画像に基づいて、顧客の映像および発話音声に対する感情を示す感情データを導出する。サーバ８０は、顧客の感情データの導出結果に基づいて、オペレータｏｐの発話音声の特性の変更に関する処理指示を生成して対面型情報提供装置１０に送る。対面型情報提供装置１０は、サーバ８０から送られた処理指示に基づいて、オペレータｏｐの発話音声の特性を変更して出力する。 As described above, in the information display system 5, the face-to-face information providing device 10, which receives the video and the operator op's spoken voice from the operator terminal 50 and outputs them, and the server 80 are communicably connected. The face-to-face information providing device 10 is connected to, or includes, a camera 24 that images the customer viewing the video and listening to the spoken voice, and it acquires the image of the customer captured by the camera 24 and sends it to the server 80. Based on the captured image of the customer sent from the face-to-face information providing device 10, the server 80 derives emotion data indicating the customer's emotion toward the video and the spoken voice. Based on the derived emotion data, the server 80 generates a processing instruction for changing the characteristics of the operator op's spoken voice and sends it to the face-to-face information providing device 10. The face-to-face information providing device 10 changes the characteristics of the operator op's spoken voice based on the processing instruction sent from the server 80 and outputs the result.

これにより、対面型情報提供装置１０は、オペレータの映像を視聴した顧客の感情に合わせてオペレータの音声の特性を適応的に変更して出力できる。従って、情報表示システム５は、顧客の感情に合わせたオペレータから顧客への音声による情報提供の実現を効率的に支援できる。 As a result, the face-to-face information providing device 10 can adaptively change and output the characteristics of the operator's voice according to the emotions of the customer who has viewed the operator's video. Therefore, the information display system 5 can efficiently support the realization of voice information provision from the operator to the customer according to the customer's emotions.

また、対面型情報提供装置１０は、顧客の発話音声を収音するマイク２７と接続されあるいはマイク２７を有し、マイク２７により収音された顧客の発話音声を取得してサーバ８０に送る。サーバ８０は、対面型情報提供装置１０から送られた顧客の撮像画像および顧客の発話音声のうち少なくとも１つに基づいて、顧客の感情データを導出する。これにより、サーバ８０は、顧客の撮像画像または顧客の発話音声を基に、顧客の感情データを容易に推定できる。 Further, the face-to-face information providing device 10 is connected to or has a microphone 27 that collects the customer's uttered voice, and acquires the customer's uttered voice collected by the microphone 27 and sends it to the server 80. The server 80 derives the customer's emotion data based on at least one of the customer's captured image and the customer's utterance voice sent from the face-to-face information providing device 10. As a result, the server 80 can easily estimate the customer's emotional data based on the customer's captured image or the customer's utterance voice.

また、サーバ８０は、顧客の感情データが怒りを示すと判定した場合に、オペレータｏｐの発話音声の語尾部分のピッチを下げる旨の処理指示を生成する。これにより、対面型情報提供装置１０は、オペレータｏｐの発話音声の語尾部分の音程を低くして、顧客の怒りが静まるように仕向けることができる。 Further, when the server 80 determines that the customer's emotional data indicates anger, the server 80 generates a processing instruction to lower the pitch of the ending portion of the spoken voice of the operator op. As a result, the face-to-face information providing device 10 can lower the pitch of the ending portion of the utterance voice of the operator op so that the customer's anger is calmed down.

また、サーバ８０は、顧客の感情データが怒り（例えば想定範囲を超える怒り）を示すと判定した場合に、オペレータｏｐによる発話の継続の中止を促すアドバイス情報を生成してオペレータ端末５０に送信する。オペレータ端末５０は、このアドバイス情報を受信して表示する。これにより、オペレータｏｐは、顧客の怒りを逆なでするような発話を中止し、顧客の怒りが静まるまで待つことができる。 When the server 80 determines that the customer's emotion data indicates anger (for example, anger exceeding the anticipated range), it generates advice information urging the operator op to stop speaking and transmits it to the operator terminal 50. The operator terminal 50 receives and displays this advice information. As a result, the operator op can stop speech that would aggravate the customer's anger and wait until the anger subsides.

また、サーバ８０は、顧客の感情データが悩みを示すと判定した場合に、オペレータｏｐの発話音声のボリュームを上げる旨の処理指示を生成する。これにより、対面型情報提供装置１０は、オペレータｏｐの発話音声のボリュームを上げて、つまり語気を強めて購買または契約を促すように仕向けることができる。また、対面型情報提供装置１０は、悩みを解消して顧客が元気を取り戻すように導くことも可能である。 When the server 80 determines that the customer's emotion data indicates worry, it generates a processing instruction to raise the volume of the operator op's spoken voice. As a result, the face-to-face information providing device 10 can raise the volume of the operator op's spoken voice, that is, make the delivery more forceful, to encourage a purchase or contract. The face-to-face information providing device 10 can also help resolve the customer's worry and guide the customer toward regaining good spirits.
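Raising the volume is the simplest of the modulations described, and one way to picture it is as a gain applied to each audio sample. The sketch below assumes the operator's voice is available as signed 16-bit PCM samples, which the description does not state; the function name `apply_gain` is hypothetical.

```python
def apply_gain(samples, gain):
    """Scale signed 16-bit PCM samples by gain, clipping to the valid range."""
    out = []
    for s in samples:
        v = int(round(s * gain))
        # Clip instead of wrapping so that loud passages do not distort badly.
        out.append(max(-32768, min(32767, v)))
    return out
```

Lowering the pitch at the end of words, as registered for anger, would need a pitch-shifting method and word-boundary detection, which are beyond this sketch.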

また、サーバ８０は、対面型情報提供装置１０から送られた顧客の撮像画像および顧客の発話音声の両方に基づいて、顧客の感情データを導出する。これにより、サーバ８０は、顧客の撮像画像および顧客の発話音声の両方を用いて、感情データをより正確に推定できる。 Further, the server 80 derives the customer's emotional data based on both the customer's captured image and the customer's utterance voice sent from the face-to-face information providing device 10. This allows the server 80 to more accurately estimate emotional data using both the customer's captured image and the customer's utterance voice.

また、対面型情報提供装置１０は、顧客とオペレータｏｐとの間の対話を支援する。これにより、対面型情報提供装置１０が顧客の感情に合わせてオペレータｏｐの発話音声の特性を変更することで、顧客はオペレータと直接対話しているような臨場感を高めることができる。 The face-to-face information providing device 10 also supports dialogue between the customer and the operator op. Because the face-to-face information providing device 10 changes the characteristics of the operator op's spoken voice to suit the customer's emotion, the customer experiences a heightened sense of presence, as if conversing directly with the operator.

（実施の形態２） (Embodiment 2)
実施の形態２では、本開示に係る音声特性変更システムが図７に示すＴＶ視聴システムに適用されるユースケースを説明する。ＴＶ視聴システムでは、一例として、顧客はスポーツ（野球、相撲等）をＴＶ（テレビジョン受像機）を通じて観戦する視聴者である。オペレータは、スポーツを実況する実況者である。なお、ここでは、実況者が発話するスポーツ映像は、ライブ映像であるが、録画された映像であってもよい。 In the second embodiment, a use case in which the voice characteristic change system according to the present disclosure is applied to the TV viewing system shown in FIG. 7 will be described. In the TV viewing system, as an example, the customer is a viewer who watches sports (baseball, sumo, etc.) on a TV (television receiver). The operator is a commentator who gives live commentary on the sport. Here, the sports video on which the commentator speaks is live video, but it may instead be recorded video.

図７は、実施の形態２に係るＴＶ視聴システム５００の概要の一例を示す図である。ＴＶ視聴システム５００は、各家庭内、事業所内等に置かれた複数のテレビジョン受信機（以下、単にＴＶと称する）に対し、ＴＶ１００により出力されているＴＶ番組を視聴する視聴者ｖｗの感情に合わせて、実況者Ａｓが発話する声の音声の特性をＴＶ１００ごとに変更して出力する。図７では、一例として３箇所の家庭内ＨＡ，ＨＢ，ＨＣでそれぞれ同一の実況者が実況する同一のＴＶ番組が視聴される場合を示す。ここでは、ＴＶ番組は、ネットワークＮＷを介して各ＴＶ１００に配信されるが、デジタル放送波を用いて各ＴＶに双方向通信可能に放送されてもよい。 FIG. 7 is a diagram showing an example overview of the TV viewing system 500 according to the second embodiment. For a plurality of television receivers (hereinafter simply referred to as TVs) placed in homes, business offices, and the like, the TV viewing system 500 changes, for each TV 100, the characteristics of the voice spoken by the commentator As to suit the emotion of the viewer vw watching the TV program output by that TV 100, and outputs the result. FIG. 7 shows, as an example, a case where the same TV program with commentary by the same commentator is viewed in three homes HA, HB, and HC. Here, the TV program is distributed to each TV 100 via the network NW, but it may instead be broadcast to each TV using digital broadcast waves in a manner that allows bidirectional communication.

図８は、実施の形態２に係るＴＶ視聴システム５００のハードウェア構成例を示すブロック図である。実施の形態２に係るＴＶ視聴システム５００において、実施の形態１に係る情報表示システム５と同一の構成要素については同一もしくは対応する符号を用いることで、その説明を省略または簡略化し、異なる内容について説明する。 FIG. 8 is a block diagram showing an example hardware configuration of the TV viewing system 500 according to the second embodiment. In the TV viewing system 500 according to the second embodiment, components identical to those of the information display system 5 according to the first embodiment are given the same or corresponding reference numerals so that their description is omitted or simplified, and only the differences are described.

ＴＶ視聴システム５００は、複数のＴＶ１００と、サーバ１８０と、実況者端末１５０とを含む構成である。各ＴＶ１００、サーバ１８０、および実況者端末１５０は、ネットワークＮＷに接続され、相互にデータ通信可能である。 The TV viewing system 500 includes a plurality of TVs 100, a server 180, and a live broadcaster terminal 150. Each TV 100, the server 180, and the commentator terminal 150 are connected to the network NW and can perform data communication with each other.

複数のＴＶ１００は、それぞれの家庭内、事業所内等の場所に設置され、ネットワークＮＷを介して実況者端末１５０から受信したスポーツ等のＴＶ番組を映像および音声で出力する。ＴＶ１００は、プロセッサ１２１、メモリ１２２、表示部１２８、通信部１２０、カメラ１２４、音声制御部１２５、スピーカ１２６およびマイク１２７を有する。なお、カメラ１２４およびマイク１２７は、ＴＶ１００とは別体として外部接続されてもよい。 The plurality of TV 100s are installed in places such as homes and business establishments, and output TV programs such as sports received from the live broadcaster terminal 150 via the network NW as video and audio. The TV 100 includes a processor 121, a memory 122, a display unit 128, a communication unit 120, a camera 124, a voice control unit 125, a speaker 126, and a microphone 127. The camera 124 and the microphone 127 may be externally connected separately from the TV 100.

プロセッサ１２１は、ＴＶ１００を統括的に制御する。メモリ１２２は、プロセッサ１２１のワーキングメモリとして使用される他、各種データ、情報、プログラムを記憶する。メモリ１２２は、一次記憶装置（例えばＲＡＭおよびＲＯＭ）を含む。メモリ１２２は、二次記憶装置（例えばＨＤＤ、ＳＳＤ）、または三次記憶装置（例えば光ディスク、ＳＤカード）を含んでもよい。 The processor 121 controls the TV 100 in an integrated manner. The memory 122 is used as a working memory of the processor 121, and also stores various data, information, and programs. The memory 122 includes a primary storage device (eg, RAM and ROM). The memory 122 may include a secondary storage device (for example, HDD, SSD) or a tertiary storage device (for example, an optical disk, SD card).

通信部１２０は、ネットワークＮＷを介して実況者端末１５０の通信部１５７およびサーバ１８０の通信部１８３と無線または有線で通信を行うネットワークＩ／Ｆ回路である。通信部１２０による通信方式は、例えば、ＷＡＮ、ＬＡＮ、ＬＴＥ、５Ｇ等の移動体通信、電力線通信、近距離無線通信（例えばＢｌｕｅｔｏｏｔｈ（登録商標）通信）、携帯電話用の通信等である。通信部１２０は、カメラ１２４により撮像された視聴者ｖｗの顔の画像データをサーバ１８０に送信する。通信部１２０は、実況者端末１５０から送信された実況者ａｓの顔と上半身との映像を受信する。 The communication unit 120 is a network I / F circuit that wirelessly or wiredly communicates with the communication unit 157 of the commentator terminal 150 and the communication unit 183 of the server 180 via the network NW. The communication method by the communication unit 120 is, for example, mobile communication such as WAN, LAN, LTE, 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), communication for mobile phones, and the like. The communication unit 120 transmits the image data of the face of the viewer vw captured by the camera 124 to the server 180. The communication unit 120 receives the image of the face and upper body of the commentator as transmitted from the commentator terminal 150.

表示部１２８は、例えばＬＣＤあるいは有機ＥＬ等の表示デバイスである。表示部１２８は、スポーツ等のＴＶ番組を表示するとともに、ワイプ画面に実況者ａｓの顔と上半身とを表示する。表示部１２８は、超高解像度ディスプレイ、例えば４Ｋ（３８４０画素×２１６０画素）ディスプレイを有する。 The display unit 128 is a display device such as an LCD or an organic EL. The display unit 128 displays a TV program such as sports, and displays the face and upper body of the live broadcaster as on the wipe screen. The display unit 128 has an ultra-high resolution display, for example, a 4K (3840 pixel × 2160 pixel) display.

カメラ１２４は、ＴＶ１００の筐体前面に配置され、家庭内のリビング等で視聴する視聴者ｖｗの顔と上半身との映像を撮像する。カメラ１２４には、高解像度な４Ｋカメラ、フルハイビジョンカメラ、ハイビジョンカメラ、ノーマルカメラ等が用いられる。 The camera 124 is arranged on the front surface of the housing of the TV 100, and captures an image of the face and upper body of the viewer vw to be viewed in a living room or the like in the home. As the camera 124, a high-resolution 4K camera, a full high-definition camera, a high-definition camera, a normal camera, or the like is used.

音声制御部１２５は、通信部１２０を介して送受信される音声データに対し圧縮・伸長処理を行い、伸長した音声データをスピーカ１２６から出力し、マイク１２７で収音された音声の音声データを圧縮する。また、音声制御部１２５は、音声データのノイズ除去処理、増幅処理等を行う。 The voice control unit 125 compresses and decompresses the voice data transmitted and received via the communication unit 120: it outputs decompressed voice data from the speaker 126 and compresses the voice data of the sound picked up by the microphone 127. The voice control unit 125 also performs noise removal, amplification, and similar processing on the voice data.

スピーカ１２６は、ＴＶ１００の前にいる視聴者ｖｗが聞き取り易くなるように指向性を有するステレオスピーカであり、実況者ａｓが発話する声の音声等を出力する。マイク１２７は、視聴者ｖｗに対し指向方向を有する指向性マイクであり、視聴者ｖｗが発話する声の音声を収音する。 The speaker 126 is a stereo speaker having directivity so that the viewer vw in front of the TV 100 can easily hear it, and outputs the voice of the voice uttered by the commentator as. The microphone 127 is a directional microphone having a directivity direction with respect to the viewer vw, and collects the sound of the voice spoken by the viewer vw.

また、実況者端末１５０は、スポーツ等のＴＶ番組を実況する端末であり、プロセッサ１５１、メモリ１５２、カメラ１５４、マイク１５６および通信部１５７を有する。 Further, the commentator terminal 150 is a terminal that broadcasts a TV program such as sports, and has a processor 151, a memory 152, a camera 154, a microphone 156, and a communication unit 157.

プロセッサ１５１は、実況者端末１５０を統括的に制御する。メモリ１５２は、プロセッサ１５１のワーキングメモリとして使用される他、各種データ、情報、プログラムを記憶する。メモリ１５２は、一次記憶装置（例えばＲＡＭおよびＲＯＭ）を含む。メモリ１５２は、二次記憶装置（例えばＨＤＤ、ＳＳＤ）、または三次記憶装置（例えば光ディスク、ＳＤカード）を含んでもよい。 The processor 151 comprehensively controls the live broadcaster terminal 150. The memory 152 is used as a working memory of the processor 151, and also stores various data, information, and programs. The memory 152 includes a primary storage device (eg, RAM and ROM). The memory 152 may include a secondary storage device (for example, HDD, SSD) or a tertiary storage device (for example, an optical disk, SD card).

カメラ１５４は、実況者ａｓの顔と上半身との映像を撮像する。カメラ１５４には、高解像度な４Ｋカメラ、フルハイビジョンカメラ、ハイビジョンカメラ、ノーマルカメラ等が用いられる。 The camera 154 captures an image of the face and upper body of the commentator as. As the camera 154, a high-resolution 4K camera, a full high-definition camera, a high-definition camera, a normal camera, or the like is used.

通信部１５７は、ネットワークＮＷを介してＴＶ１００の通信部１２０と無線または有線で通信を行うネットワークＩ／Ｆ回路である。通信部１５７による通信方式は、例えば、ＷＡＮ、ＬＡＮ、ＬＴＥ、５Ｇ等の移動体通信、電力線通信、近距離無線通信（例えばＢｌｕｅｔｏｏｔｈ（登録商標）通信）、携帯電話用の通信等である。通信部１５７は、カメラ１５４により撮像された実況者ａｓの顔および上半身の画像データをＴＶ１００に送信する。 The communication unit 157 is a network I / F circuit that wirelessly or wiredly communicates with the communication unit 120 of the TV 100 via the network NW. The communication method by the communication unit 157 is, for example, mobile communication such as WAN, LAN, LTE, 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), communication for mobile phones, and the like. The communication unit 157 transmits the image data of the face and upper body of the commentator as captured by the camera 154 to the TV 100.

マイク１５６は、実況者ａｓに対し指向方向を有する指向性マイクであり、実況者ａｓが発話する声の音声を収音する。 The microphone 156 is a directional microphone having a directivity direction with respect to the live broadcaster as, and picks up the voice of the voice spoken by the live broadcaster as.

サーバ１８０は、実況者Ａｓが発話する声の音声の特性を、ＴＶ番組を視聴する視聴者ｖｗの感情データに合わせて変更するものであり、プロセッサ１８１、メモリ１８２、通信部１８３、およびストレージ１８５を有する。 The server 180 changes the characteristics of the voice spoken by the commentator As to suit the emotion data of the viewer vw watching the TV program, and has a processor 181, a memory 182, a communication unit 183, and a storage 185.

プロセッサ１８１は、メモリ１８２に記憶されたプログラムを実行することにより実現される機能として、変調方法決定部１９１および感情分析アルゴリズム１９２を含む。感情分析アルゴリズム１９２は、視聴者ｖｗの顔画像データを基に視聴者ｖｗの感情を推定する画像分析部１９３、および視聴者ｖｗが発話する声の音声データを基に視聴者ｖｗの感情を推定する音声分析部１９４を含む。変調方法決定部１９１は、感情データベース１９５に登録された感情・変調テーブルＴｂ２を基に、推定された視聴者ｖｗの感情に対応する声の変調方法を選択する。感情・変調テーブルＴｂ２は、前記実施の形態１における感情・変調テーブルＴｂ１と同様の登録内容を含む。例えば、視聴者の感情が「喜び」である場合、実況者の声の変調は「その場が興奮した雰囲気になるように音のピッチを上げて大きな音量にする」である。また、視聴者の感情が「落胆」である場合、実況者の声の変調は「その場が沈んだ雰囲気になるように音のピッチを下げて小さな音量にする」である。 The processor 181 includes, as functions realized by executing a program stored in the memory 182, a modulation method determination unit 191 and an emotion analysis algorithm 192. The emotion analysis algorithm 192 includes an image analysis unit 193 that estimates the emotion of the viewer vw based on the viewer's face image data, and a voice analysis unit 194 that estimates the emotion of the viewer vw based on voice data of the viewer's speech. The modulation method determination unit 191 selects the voice modulation method corresponding to the estimated emotion of the viewer vw based on the emotion/modulation table Tb2 registered in the emotion database 195. The emotion/modulation table Tb2 contains registered contents of the same kind as the emotion/modulation table Tb1 in the first embodiment. For example, when the viewer's emotion is "joy", the modulation of the commentator's voice is "raise the pitch and increase the volume so that the atmosphere becomes excited". When the viewer's emotion is "disappointment", the modulation of the commentator's voice is "lower the pitch and reduce the volume so that the atmosphere becomes subdued".

メモリ１８２は、一次記憶装置（例えばＲＡＭおよびＲＯＭ）を含む。メモリ１８２は、二次記憶装置（例えばＨＤＤ、ＳＳＤ）、または三次記憶装置（例えば光ディスク、ＳＤカード）を含んでもよい。 The memory 182 includes a primary storage device (eg, RAM and ROM). The memory 182 may include a secondary storage device (for example, HDD, SSD) or a tertiary storage device (for example, an optical disk, SD card).

通信部１８３は、ネットワークＮＷを介してＴＶ１００の通信部１２０と無線または有線で通信を行うネットワークＩ／Ｆ回路である。通信部１８３による通信方式は、例えば、ＷＡＮ、ＬＡＮ、ＬＴＥ、５Ｇ等の移動体通信、電力線通信、近距離無線通信（例えばＢｌｕｅｔｏｏｔｈ（登録商標）通信）、携帯電話用の通信等である。通信部１８３は、ＴＶ１００から送信された視聴者ｖｗの顔と上半身との映像を受信し、実況者ａｓが発話する声の音声の変調方法をＴＶ１００に送信する。 The communication unit 183 is a network I / F circuit that wirelessly or wiredly communicates with the communication unit 120 of the TV 100 via the network NW. The communication method by the communication unit 183 is, for example, mobile communication such as WAN, LAN, LTE, 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), communication for mobile phones, and the like. The communication unit 183 receives the video of the viewer vw's face and upper body transmitted from the TV 100, and transmits to the TV 100 a method of modulating the voice of the voice spoken by the commentator as.

ストレージ１８５は、ＨＤＤまたはＳＳＤを含み、感情データベース１９５を記憶する。感情データベース１９５は、視聴者ｖｗの感情と実況者ａｓの声の変調方法が登録された感情・変調テーブルＴｂ２を含む。感情・変調テーブルＴｂ２の登録内容は、実施の形態１に係る感情・変調テーブルＴｂ１と同様である。 The storage 185 includes an HDD or SSD and stores the emotion database 195. The emotion database 195 includes the emotion/modulation table Tb2, in which the emotions of the viewer vw and the corresponding modulation methods for the voice of the commentator as are registered. The registered contents of the emotion/modulation table Tb2 are similar to those of the emotion/modulation table Tb1 according to the first embodiment.

次に、実施の形態２に係るＴＶ視聴システム５００の動作手順例を説明する。 Next, an example of the operation procedure of the TV viewing system 500 according to the second embodiment will be described.

図９は、実施の形態２に係るＴＶ視聴システム５００の動作手順例を示すフローチャートである。 FIG. 9 is a flowchart showing an example of an operation procedure of the TV viewing system 500 according to the second embodiment.

図９において、サーバ１８０のプロセッサ１８１は、通信部１８３およびネットワークＮＷを介して、各家庭内ＨＡ，ＨＢ，ＨＣに置かれたＴＶ１００から送信されるカメラ１２４による各視聴者ｖｗの顔画像データおよびマイク１２７による各視聴者ｖｗの声の音声データを受信して取得する（Ｓ６１）。 In FIG. 9, the processor 181 of the server 180 receives and acquires, via the communication unit 183 and the network NW, the face image data of each viewer vw captured by the camera 124 and the voice data of each viewer vw picked up by the microphone 127, both transmitted from the TVs 100 placed in the homes HA, HB, and HC (S61).

プロセッサ１８１の感情分析アルゴリズム１９２は、各視聴者ｖｗの顔画像データおよび音声データを基に、各視聴者ｖｗの感情を推定する（Ｓ６２）。画像分析部１９３は、画像データを基に顔認識を行い、各視聴者ｖｗの顔画像に現れる喜怒哀楽の表面感情を推定する。また、画像分析部１９３は、顔画像データを基に心拍変動を検知し、視聴者ｖｗの内面感情を推定する。また、音声分析部１９４は、視聴者ｖｗが発話する声の音声を基に、視聴者ｖｗの共感あるいは反感等の感情を推定する。 The emotion analysis algorithm 192 of the processor 181 estimates the emotion of each viewer vw based on that viewer's face image data and voice data (S62). The image analysis unit 193 performs face recognition on the image data and estimates the surface emotions (joy, anger, sorrow, pleasure) that appear in each viewer's face image. The image analysis unit 193 also detects heart rate variability from the face image data and estimates the internal emotion of the viewer vw. The voice analysis unit 194 estimates emotions such as the viewer's empathy or antipathy based on the viewer's speech.

The processor 181 calculates the features of the voice data for the voice that the commentator speaks into the microphone 156, received from the commentator terminal 150 (S63). The features of the voice data include, for example, pitch, volume, and timbre. Based on an emotion/voice feature table (not shown) registered in the emotion database 195, the processor 181 selects the voice data features of the commentator's voice that match the estimated emotion of the viewer vw (S64). Note that the processor 181 may perform machine learning on the commentator voice features that match the estimated emotion of the viewer vw and identify the commentator's voice features using the resulting trained model.
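
The table lookup of step S64 can be illustrated with a short Python sketch. The emotion/voice feature table is not shown in the disclosure, so the table entries, the field names of `VoiceFeatures`, and the neutral fallback below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceFeatures:
    """Target features of the commentator's voice (pitch, volume, timbre)."""
    pitch_shift_semitones: float
    volume_gain_db: float
    timbre: str


# Hypothetical table contents; the real emotion/voice feature table is not shown.
EMOTION_VOICE_TABLE = {
    "joy": VoiceFeatures(2.0, 3.0, "bright"),
    "anger": VoiceFeatures(-2.0, -3.0, "soft"),
    "sadness": VoiceFeatures(-1.0, 0.0, "warm"),
}

# Assumed fallback for emotions that have no table entry.
NEUTRAL = VoiceFeatures(0.0, 0.0, "neutral")


def select_voice_features(emotion: str) -> VoiceFeatures:
    """Return the voice features registered for the estimated viewer emotion."""
    return EMOTION_VOICE_TABLE.get(emotion, NEUTRAL)


if __name__ == "__main__":
    print(select_voice_features("joy"))
    print(select_voice_features("surprise"))  # not registered, so neutral
```

A dictionary keyed by emotion label keeps the lookup explicit; a trained model, which the description permits as an alternative, would replace `select_voice_features` without changing its callers.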

The processor 181 of the server 180 creates, for each TV 100, an instruction (advice) to change the voice to the selected commentator voice data features, and transmits the instruction to each TV 100 (S65). The communication unit 120 of each TV 100 receives the instruction from the server 180. The voice control unit 125 of each TV 100 modulates the voice spoken by the commentator as in accordance with the instruction from the server 180 (S66). When a plurality of viewers vw are watching the TV 100 in a home, the processor 181 may modulate the voice to suit a predetermined emotion taken as the emotion of the plurality of viewers vw (for example, the most common emotion among all the viewers, the emotion of the eldest viewer, or an averaged emotion). As a result, when a plurality of viewers are watching the same TV, the commentator's announcement can match the emotions of those viewers as closely as possible.
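
As a rough illustration of step S65 for homes with several viewers, the sketch below picks the most common emotion per TV and builds one instruction per TV. It covers only the "most common emotion" option named above; the function names, the instruction fields, and the neutral fallback are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def representative_emotion(emotions: List[str]) -> str:
    """Return the most common emotion among the viewers of one TV.

    Ties are resolved in favor of the emotion seen first. An empty list
    falls back to "neutral" (an assumption, not part of the disclosure).
    """
    if not emotions:
        return "neutral"
    return Counter(emotions).most_common(1)[0][0]


def build_instructions(viewers_by_tv: Dict[str, List[str]]) -> Dict[str, dict]:
    """Create one processing instruction per TV from its viewers' emotions."""
    return {
        tv_id: {"tv_id": tv_id, "target_emotion": representative_emotion(emotions)}
        for tv_id, emotions in viewers_by_tv.items()
    }


if __name__ == "__main__":
    instructions = build_instructions(
        {"HA": ["excited", "excited", "calm"], "HB": ["subdued"]}
    )
    for instruction in instructions.values():
        print(instruction)
```

Because each TV receives its own instruction, two homes watching the same broadcast can hear the commentator with different voice characteristics.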

In the TV viewing system 500 according to the second embodiment, for example, when the TV is broadcasting a baseball game and team D is approaching victory, the TV in the home HA changes the characteristics of the commentator's voice for viewers who support team D so that the atmosphere becomes excited. In the home HB, on the other hand, the TV changes the characteristics of the commentator's voice for viewers who do not support team D so that the atmosphere becomes subdued. In each home, the TV can change the voice characteristics so that the commentator's voice matches the emotions of the respective viewers.

In this way, the TV 100 is arranged in a home. As a result, the TV viewing system 500 can broadcast smooth and comfortable commentary by the commentator as to the viewer vw in accordance with the emotions of the viewer vw.

Further, for the plurality of TVs 100, the server 180 generates, for each TV 100, a processing instruction regarding a change in the characteristics of the spoken voice of the commentator as and sends it to that TV 100. As a result, the TV viewing system 500 can broadcast the commentator's audio commentary to a plurality of homes with voice characteristics that differ from home to home.

Further, when there are a plurality of viewers vw viewing the video and spoken voice output from the TV 100, the processing instruction regarding a change in the characteristics of the spoken voice of the commentator as is generated based on the derivation result of predetermined emotional data, such as the most common emotion among the emotions of the plurality of viewers. As a result, even when a plurality of viewers vw are viewing a single TV 100, the TV viewing system 500 can broadcast smooth and comfortable commentary by the commentator to the viewers vw in accordance with their emotions as far as possible.

Although various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to these examples. It is clear that a person skilled in the art can conceive of various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and it is understood that these naturally belong to the technical scope of the present disclosure. Further, the components of the various embodiments described above may be combined in any manner without departing from the gist of the invention.

For example, in the above-described embodiments, the voice characteristic change system is applied to the information display system 5 and the TV viewing system 500. However, the system is not limited to these and is applicable in various fields, such as a remote tutoring service in which a teacher and a student interact, or a video conference system in which a company president gives a presentation to a plurality of employees. Operators also include sports game commentators, television hosts, and the like.

Further, the estimation of voice emotion based on voice data, the estimation of inner emotion based on heart rate variability data, and the estimation of the degree of empathy may be performed by algorithms using machine learning.

Further, in the above-described embodiments, a camera for obtaining image data and a microphone for obtaining voice data are used. However, a smart wearable device such as a smartwatch or wristband worn by the customer, viewer, or the like may be used to acquire data on both the spoken voice and the heart sound (heartbeat signal). Using a smart wearable device also makes it possible to obtain vital data such as blood pressure and blood glucose level, which can be reflected in the emotion estimation.

The present disclosure is useful as a voice characteristic change system and a voice characteristic change method that change the characteristics of an operator's spoken voice and efficiently support the realization of smooth and comfortable dialogue between the operator and a customer in accordance with the customer's emotions.

5 Information display system
10 Face-to-face information providing device
50 Operator terminal
80 Server
81 Processor
82 Memory
83 Communication unit
85 Storage
91 Modulation method determination unit
92 Emotion analysis algorithm
93 Image analysis unit
94 Voice analysis unit
95 Emotion database
500 TV viewing system

Claims

A voice characteristic change system in which a server and a receiver, which receives and outputs video and an operator's spoken voice from an operator terminal, are communicably connected, wherein
the receiver
is connected to a camera that captures a customer viewing the video and the spoken voice, acquires the captured image of the customer captured by the camera, and sends it to the server,
the server
derives, based on the captured image of the customer sent from the receiver, emotional data indicating the customer's emotion toward the video and the spoken voice, and
generates, based on the result of deriving the emotional data of the customer, a processing instruction regarding a change in the characteristics of the spoken voice of the operator and sends it to the receiver, and
the receiver
changes and outputs the characteristics of the spoken voice of the operator based on the processing instruction sent from the server.

The voice characteristic change system according to claim 1, wherein
the receiver
is connected to a microphone that picks up the customer's spoken voice, acquires the customer's spoken voice picked up by the microphone, and sends it to the server, and
the server
derives the emotional data of the customer based on the captured image of the customer or the spoken voice of the customer sent from the receiver.

The voice characteristic change system according to claim 1, wherein
the server,
when it determines that the emotional data of the customer indicates anger, generates the processing instruction for lowering the pitch of the ending portion of the spoken voice of the operator.

The voice characteristic change system according to claim 1, wherein
the server,
when it determines that the emotional data of the customer indicates anger, generates advice information urging the operator to stop speaking and transmits it to the operator terminal, and
the operator terminal
receives and displays the advice information sent from the server.

The voice characteristic change system according to claim 1, wherein
the server,
when it determines that the emotional data of the customer indicates trouble, generates the processing instruction for increasing the volume of the spoken voice of the operator.

The voice characteristic change system according to claim 2, wherein
the server
derives the emotional data of the customer based on both the captured image of the customer and the spoken voice of the customer sent from the receiver.

The voice characteristic change system according to any one of claims 1 to 6, wherein
the receiver is a face-to-face information providing device that supports dialogue with the operator.

The voice characteristic change system according to any one of claims 1 to 5, wherein
the receiver is a television receiver arranged in a home.

The voice characteristic change system according to claim 8, wherein
at least one receiver is arranged in each of a plurality of homes, and
the server generates, for the receiver in each home, a different processing instruction regarding a change in the characteristics of the spoken voice of the operator and sends it to the corresponding receiver.

The voice characteristic change system according to claim 8, wherein
the receiver,
when there are a plurality of customers viewing the video and the spoken voice output from the receiver, generates a processing instruction regarding a change in the characteristics of the spoken voice of the operator based on the derivation result of predetermined emotional data.

A voice characteristic change method executed by a voice characteristic change system comprising a server and a receiver that receives and outputs video and an operator's spoken voice from an operator terminal, the method comprising:
a step in which the receiver, which has a camera that captures a customer viewing the video and the spoken voice, acquires a captured image of the customer captured by the camera;
a step in which the server derives, based on the captured image of the customer sent from the receiver, emotional data indicating the customer's emotion toward the video and the spoken voice;
a step in which the server generates, based on the result of deriving the emotional data of the customer, a processing instruction regarding a change in the characteristics of the spoken voice of the operator and sends the processing instruction to the receiver; and
a step in which the receiver changes and outputs the characteristics of the spoken voice of the operator based on the processing instruction sent from the server.