JP2023077444A - Voice processing system, voice processing device and voice processing method

Info

Publication number: JP2023077444A
Application number: JP2021190678A
Authority: JP
Inventors: 敏之中谷; Toshiyuki Nakatani; 君慧末永; Kimisato Suenaga; 俊雄今村; Toshio Imamura; 啓祐阪下; Keisuke Sakashita; 周平 ▲高▼原; Shuhei Takahara
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2023-06-06
Anticipated expiration: 2041-11-25
Also published as: JP2023164770A; JP7394192B2; JP2023078068A; JP7164793B1

Abstract

PROBLEM TO BE SOLVED: To reduce the stress of listeners.

SOLUTION: A voice processing system 1 includes: an acquisition unit that acquires an uttered voice signal, which is a signal of the uttered voice of a first user; a voice recognition unit that inputs a feature amount extracted based on the uttered voice signal into a voice recognition model and generates text data including a word string consisting of one or more words; a voice synthesis unit that inputs a feature amount extracted based on the text data into a voice synthesis model and generates a synthesized voice signal, which is a signal of a synthesized voice; and a voice output unit that outputs the synthesized voice to a second user.

SELECTED DRAWING: Figure 4

Description

The present invention relates to a voice processing system, a voice processing device, and a voice processing method.

Conventionally, various call centers have been operated in which operators respond by telephone to customer complaints and the like in order to improve customer satisfaction (CS). In such customer service work, "customer harassment," in which customers behave coercively or make unreasonable demands toward operators, is regarded as a problem because it can harm operators' mental health and raise operator turnover.

In recent years, voice conversion systems have been studied as a way for companies to protect operators, who are their employees, from such customer harassment. For example, Patent Document 1 describes calculating the volume and the amount of pitch variation from an input voice signal and, when they exceed predetermined values, converting the volume and pitch so that the volume and the amount of pitch variation fall within a predetermined range before the signal is output.

Patent Document 1: JP 2004-252085 A

However, merely converting the speaker's uttered voice by, for example, the method described in Patent Document 1 may not sufficiently suppress the emotion of the speaker (first user), and so may not sufficiently reduce the stress of the listener (second user). On the other hand, if the speaker's uttered voice output to the listener is converted in order to reduce the listener's stress, the listener may be unable to fully recognize the speaker's emotion and therefore unable to respond appropriately.

Accordingly, the present invention provides a voice processing system, a voice processing device, and a voice processing method that sufficiently reduce the listener's stress and/or enable the listener to respond appropriately.

A voice processing system according to one aspect of the present invention includes: an acquisition unit that acquires an uttered voice signal, which is a signal of the uttered voice of a first user; a voice recognition unit that inputs a feature amount extracted based on the uttered voice signal into a voice recognition model to generate text data including a word string consisting of one or more words; a voice synthesis unit that inputs a feature amount extracted based on the text data into a voice synthesis model to generate a synthesized voice signal, which is a signal of a synthesized voice; and a voice output unit that outputs the synthesized voice to a second user.

According to this aspect, text data is generated based on the uttered voice signal of the first user, and a synthesized voice generated based on that text data is output to the second user. The second user therefore hears a synthesized voice in which the emotion contained in the first user's uttered voice is sufficiently suppressed, so the second user's stress caused by the first user's emotional utterances can be sufficiently reduced.

In the above voice processing system, the emotion recognition unit may generate emotion information of the first user corresponding to the uttered voice signal acquired by the acquisition unit by inputting, into an emotion recognition model, the uttered voice signal acquired by the acquisition unit, a voice feature amount extracted from that uttered voice signal, text data generated from that uttered voice signal, a text feature amount corresponding to that text data, or a combination of at least two of these. The emotion recognition model is machine-learned to take as input an uttered voice signal, a feature amount extracted from the uttered voice signal, text data generated from the uttered voice signal, a feature amount extracted from the text data, or a combination of at least two of these, and to output emotion information of the speaker of the uttered voice signal.

A diagram showing an example of an overview of the voice processing system 1 according to the present embodiment.
A diagram showing an example of the physical configuration of each device constituting the voice processing system 1 according to the present embodiment.
A diagram showing an example of the functional configuration of the voice processing device 10 according to the present embodiment.
A diagram showing an example of generation of a synthesized voice signal according to the present embodiment.
A diagram showing an example of generation of the customer's emotion information according to the present embodiment.
A diagram showing an example of generation of the customer's emotion information according to the present embodiment.
A diagram showing an example of the functional configuration of the operator terminal 20 according to the present embodiment.
A diagram showing an example of the screen D1 according to the present embodiment.
A diagram showing an example of the screen D2 according to the present embodiment.
A flowchart showing an example of the emotion suppression operation according to the present embodiment.
A flowchart showing the automatic switching operation of the emotion suppression function according to the present embodiment.
A diagram showing an example of generation of a synthesized voice signal according to a modification of the present embodiment.
A diagram showing an example of the screen D3 according to the present embodiment.

Embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, elements given the same reference numerals have the same or similar configurations.

The following description assumes that the voice processing system according to the present embodiment is used in customer service work such as a call center, but the application of the present invention is not limited to this. The present embodiment is applicable to any situation in which a voice generated by applying predetermined processing to a signal of the uttered voice of a first user (hereinafter referred to as an "uttered voice signal") is output to a second user. In the following, the first user is assumed to be a customer and the second user an operator, but this is not a limitation.

(Configuration of voice processing system)
<Overall configuration>
FIG. 1 is a diagram showing an example of an overview of the voice processing system 1 according to the present embodiment. As shown in FIG. 1, the voice processing system 1 includes a voice processing device 10, a terminal 20 used by the second user (hereinafter referred to as the "operator"; the terminal is hereinafter referred to as the "operator terminal"), and a terminal 30 used by the first user (hereinafter referred to as the "customer"; the terminal is hereinafter referred to as the "customer terminal").

The voice processing device 10 receives, via a network 40, the uttered voice signal acquired by the customer terminal 30. The network 40 may be an external network such as the Internet, or may include both an external network and an internal network such as a Local Area Network (LAN). The voice processing device 10 transmits to the operator terminal 20 a voice obtained by applying predetermined processing to the customer's uttered voice signal. The voice processing device 10 may be composed of one or more servers.

The operator terminal 20 is, for example, a telephone, a smartphone, a personal computer, or a tablet. The operator terminal 20 outputs voice to the operator based on a voice signal generated by the predetermined processing in the voice processing device 10 or on the uttered voice signal from the customer terminal 30.

The customer terminal 30 is, for example, a telephone, a smartphone, a personal computer, or a tablet. The customer terminal 30 picks up the customer's uttered voice with a microphone and transmits the uttered voice signal, which is the signal of that uttered voice, to the voice processing device 10.

<Physical configuration>
FIG. 2 is a diagram showing an example of the physical configuration of each device constituting the voice processing system 1 according to the present embodiment. Each device (for example, the voice processing device 10, the operator terminal 20, and the customer terminal 30) has a processor 10a corresponding to a computing unit, a RAM (Random Access Memory) 10b corresponding to a storage unit, a ROM (Read Only Memory) 10c corresponding to a storage unit, a communication unit 10d, an input unit 10e, a display unit 10f, a camera 10g, a voice input unit 10h, and a voice output unit 10i. These components are connected via a bus so that they can exchange data with one another. The configuration shown in FIG. 2 is an example; each device may have components other than these, or may lack some of them.

The processor 10a is, for example, a CPU (Central Processing Unit). The processor 10a is a control unit that controls various processes in each device by executing programs stored in the RAM 10b or the ROM 10c. The processor 10a realizes the functions of each device and controls the execution of processing through cooperation between the programs and the other components of the device. The processor 10a receives various data from the input unit 10e and the communication unit 10d, and displays the results of computation on the display unit 10f or stores them in the RAM 10b.

The RAM 10b and the ROM 10c are storage units that store data required for various processes and data resulting from those processes. In addition to the RAM 10b and the ROM 10c, each device may include a large-capacity storage unit such as a hard disk drive. The RAM 10b and the ROM 10c may be composed of, for example, semiconductor memory elements.

The communication unit 10d is an interface that connects each device to other equipment and communicates with that equipment. The input unit 10e is a device for accepting data input from a user or for inputting data from outside the device. The input unit 10e may include, for example, a keyboard, a mouse, and a touch panel. The display unit 10f is a device that displays information under the control of the processor 10a. The display unit 10f may be composed of, for example, an LCD (Liquid Crystal Display).

The camera 10g includes an image sensor that captures still images or moving images, and generates a captured image (for example, a still image or a moving image) by imaging a predetermined area. The voice input unit 10h is a device that picks up sound, for example a microphone. The voice output unit 10i is a device that outputs sound, for example a speaker.

The program that runs each device may be provided stored in a computer-readable storage medium such as the RAM 10b or the ROM 10c, or may be provided via the network 40 connected through the communication unit 10d. In each device, the processor 10a executes the program, thereby realizing the various operations for controlling the device. These physical configurations are examples and need not be independent components. For example, each device may include an LSI (Large-Scale Integration) in which the processor 10a is integrated with the RAM 10b and the ROM 10c.

<Functional configuration>
≪Voice processing device≫
FIG. 3 is a diagram showing an example of the functional configuration of the voice processing device 10 according to the present embodiment. The voice processing device 10 includes a storage unit 101, a transmission/reception unit 102, a voice recognition unit 103, a removal unit 104, a voice synthesis unit 105, an emotion recognition unit 106, a stress recognition unit 107, a control unit 108, and a learning unit 109.

The storage unit 101 stores various information, programs, algorithms, models, operation logs, and the like. Specifically, the storage unit 101 stores a voice recognition model 101a, a voice synthesis model 101b, an emotion recognition model 101c, a stress recognition model 101d, an emotion suppression switching model 101e, and the like, which are described later.

The transmission/reception unit 102 transmits and/or receives various information and/or signals to and from the operator terminal 20 and/or the customer terminal 30. For example, the transmission/reception unit 102 (acquisition unit) acquires the uttered voice signal, which is the signal of the customer's uttered voice picked up by the customer terminal 30. The transmission/reception unit 102 transmits the synthesized voice signal and/or the uttered voice signal to the operator terminal 20. The transmission/reception unit 102 may also acquire a log of the operator's operations from the operator terminal 20. The operation log may include information on the operator's subjective evaluation of the customer's emotion (hereinafter referred to as "subjective evaluation information"), the "degree of stress" described later, and the "manual switching history data" described later. The transmission/reception unit 102 may also transmit information on the customer's emotion (hereinafter referred to as "emotion information") and the like to the operator terminal 20.

The voice recognition unit 103 inputs a feature amount extracted based on the uttered voice signal acquired by the transmission/reception unit 102 (hereinafter referred to as a "voice feature amount") into the voice recognition model 101a, and generates text data including a word string consisting of one or more words. Specifically, the voice recognition unit 103 may generate a word string from the voice feature amount using the acoustic model of the voice recognition model 101a, and generate the text data according to the result of analyzing that word string with the language model. The voice recognition unit 103 may extract the voice feature amount after preprocessing the uttered voice signal (for example, digitizing an analog signal, removing noise, or applying a Fourier transform).
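The preprocessing mentioned above can be pictured with a small sketch. It assumes the signal is already digitized, splits it into overlapping frames, and takes the magnitude of a discrete Fourier transform per frame as a simple voice feature amount. The function names, frame length, and hop size are illustrative assumptions; the patent does not specify which features are used.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split a digitized signal into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Magnitude of the discrete Fourier transform for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

# Toy input: a sine wave that completes 2 cycles every 16 samples.
samples = [math.sin(2 * math.pi * 2 * t / 16) for t in range(32)]
frames = frame_signal(samples, frame_len=16, hop=8)
features = [magnitude_spectrum(f) for f in frames]
```

A practical system would more likely use a fast Fourier transform and derived features; the direct transform here only keeps the sketch dependency-free.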

The voice recognition model 101a is an algorithm that estimates the content of speech based on a voice signal. The voice recognition model 101a may include an acoustic model, which models what sounds a given word tends to appear as, and/or a language model, which models the probability with which a given word string appears in a specific language. As the acoustic model, for example, a Hidden Markov Model (HMM) and/or a Deep Neural Network (DNN) may be used. As the language model, for example, a probabilistic language model such as an n-gram language model may be used.
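As a rough illustration of the n-gram language model mentioned above, the following sketch estimates bigram probabilities from a toy corpus and scores a word string. The corpus, the function names, and the add-one smoothing are assumptions made for illustration; the patent does not describe how the language model is built.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sequence_probability(words, unigrams, bigrams):
    """Probability of a word string under the bigram model with add-one smoothing."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + list(words) + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob

# Hypothetical toy corpus of tokenized utterances.
corpus = [["the", "bill", "is", "wrong"], ["the", "bill", "is", "late"]]
uni, bi = train_bigram(corpus)
likely = sequence_probability(["the", "bill", "is", "wrong"], uni, bi)
unlikely = sequence_probability(["wrong", "is", "bill", "the"], uni, bi)
```

A word order seen in the corpus scores higher than a scrambled one, which is how a language model can help choose among candidate word strings produced by the acoustic model.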

The removal unit 104 detects a specific word string included in the text data generated by the voice recognition unit 103, generates text data in which that specific word string is removed or replaced with another word string, and outputs it to the voice synthesis unit 105. When no specific word string is detected in the text data generated by the voice recognition unit 103, the removal unit 104 may output that text data to the voice synthesis unit 105 as is.

The specific word string may be one or more words that have an adverse psychological effect on the listener, for example words that insult the listener, deny the listener's character, or make the listener uncomfortable. Here, each word may include at least one part of speech such as a noun, verb, adverb, particle, adjective, or auxiliary verb, or a phonetically altered form of such a part of speech. For example, the specific word string may be a "sentence" such as "You, I'll kill you," or a "part of a sentence" that indicates rough wording, such as "ttsutten" in "komaru ttsutten no" (roughly, "I'm telling you it's a problem"). The removal unit 104 may output to the voice synthesis unit 105 text data in which only the specific word string detected in the text data is replaced with another word string, or text data in which the entire sentence containing the specific word string is replaced with another word string. The other word string may be a blank or the like.

The removal unit 104 may detect the specific word string in the text data and/or replace it with another word string based on specific word strings stored in advance in the storage unit 101.

Alternatively, the removal unit 104 may, based on a model trained by machine learning, detect the specific word string in the text data and/or replace it with another word string whose semantic emotion is softened. For example, the specific word string "omae" (a rough form of "you") in the text data may be replaced with "anata" (a polite form of "you").
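The dictionary-based variant of the removal unit 104 described above might look like the following minimal sketch. The dictionary entries, the function name, and the case-insensitive matching are hypothetical choices for illustration; the machine-learning variant is not shown.

```python
import re

# Hypothetical dictionary: specific word string -> replacement ("" means remove).
NG_WORDS = {
    "I'll kill you": "",
    "idiot": "",
    "gimme": "please give me",
}

def remove_specific_words(text, ng_words=NG_WORDS):
    """Return (cleaned_text, detected), where detected lists the word strings found."""
    detected = []
    cleaned = text
    # Try longer entries first so a phrase wins over a shorter word inside it.
    for phrase in sorted(ng_words, key=len, reverse=True):
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        if pattern.search(cleaned):
            detected.append(phrase)
            cleaned = pattern.sub(ng_words[phrase], cleaned)
    # Collapse whitespace left behind by removals.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, detected
```

When nothing is detected the text is returned unchanged, matching the case where the removal unit 104 passes the text data straight to the voice synthesis unit 105.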

When a specific word string is detected in the text data, the removal unit 104 may generate information on the detection of that specific word string (hereinafter referred to as "detection information"). The detection information may include, for example, at least one of information indicating that the specific word string was detected (for example, the character string "NG word" or "NG word detected"), information indicating the specific word string itself, and information on a warning to the customer (hereinafter referred to as "warning information"). The warning information may be, for example, information for notifying the customer that what the customer said to the operator may be subject to criminal complaint for offenses such as insult or defamation. The detection information may be transmitted to the operator terminal 20 by the transmission/reception unit 102. When detection information is generated, the voice processing device 10 may cause the customer terminal 30 to output warning information (for example, "Your remarks to our operator may constitute an offense such as insult. We recognize that there may have been shortcomings on our side, but such remarks can place an excessive burden on our operators, so we would appreciate your cooperation."). Such warning information can be used as advance notice against customer harassment.
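One way to picture the detection information is as a small record built only when something was detected. The class name, field names, and warning text below are assumptions for illustration, not structures defined by the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical warning text to be played or shown on the customer terminal.
WARNING_TEXT = "Your remarks to our operator may constitute an offense such as insult."

@dataclass
class DetectionInfo:
    label: str                                      # e.g. "NG word detected"
    words: List[str] = field(default_factory=list)  # the specific word strings found
    warning: Optional[str] = None                   # warning information, if any

def build_detection_info(detected_words, include_warning=True):
    """Return DetectionInfo when at least one specific word string was detected, else None."""
    if not detected_words:
        return None
    return DetectionInfo(
        label="NG word detected",
        words=list(detected_words),
        warning=WARNING_TEXT if include_warning else None,
    )
```

Keeping the three parts optional mirrors the text, which requires only at least one of them to be present.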

The voice synthesis unit 105 inputs a feature amount extracted based on the text data input from the removal unit 104 (hereinafter referred to as a "text feature amount") into the voice synthesis model 101b, and generates a signal of a synthesized voice (hereinafter referred to as a "synthesized voice signal"). Specifically, the voice synthesis unit 105 may predict voice synthesis parameters based on the text feature amount and generate the synthesized voice signal using the predicted voice synthesis parameters. The voice synthesis unit 105 outputs the synthesized voice signal to the transmission/reception unit 102. The synthesized voice signal can also be described as the signal of a voice reading out the content of the text data.

The voice synthesis model 101b is an algorithm that takes text data as input and outputs a synthesized voice signal corresponding to the content of that text data. As the voice synthesis model 101b, for example, the HMM and/or DNN mentioned above may be used.

The voice synthesis model 101b may support a plurality of voice types. The voice synthesis unit 105 may select, from among the plurality of voice types, the voice type to be used for the synthesized voice signal, input the selected voice type and the text data into the voice synthesis model 101b, and synthesize a synthesized voice signal of the selected voice type. The plurality of voice types may be, for example, at least one of a voice with little intonation, a mechanical sound, a character's voice, a celebrity's voice, and a voice actor's voice. The voice synthesis unit 105 may accept the selection of the voice type from the operator via the operator terminal 20.

FIG. 4 is a diagram showing an example of generation of a synthesized voice signal according to the present embodiment. In FIG. 4, it is assumed that the voice recognition unit 103 generates text data T1 to T3 based on uttered voice signals S1 to S3 acquired by the transmission/reception unit 102. For example, in FIG. 4, the removal unit 104 detects no specific word string in the text data T1, so it outputs the text data T1 to the voice synthesis unit 105 as is. On the other hand, the removal unit 104 detects specific word strings in the text data T2 and T3 ("You, I'll kill you" in T2 and "ttsutten" in T3), so it outputs to the voice synthesis unit 105 text data T2' and T3' in which those specific word strings are removed or replaced. For example, in the text data T2', the specific word string in the text data T2 is replaced with a blank (□). In the text data T3', the specific word string "ttsutten" in the text data T3 is replaced with "to iu". The voice synthesis unit 105 generates synthesized voice signals S1, S2' and S3' from the text data T1, T2' and T3', respectively.
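The flow of FIG. 4 (voice recognition, then removal, then voice synthesis) can be sketched end to end with stand-in functions. Everything here is a placeholder: the "signal" already carries its transcript, the synthesizer returns a description rather than audio, and the dictionary entries are hypothetical. Real models 101a and 101b would replace the stubs.

```python
def recognize(speech_signal):
    """Stand-in for the voice recognition unit 103 (no real recognition is done)."""
    return speech_signal["transcript"]

def remove_words(text, ng_words):
    """Stand-in for the removal unit 104: plain substring removal or replacement."""
    for phrase, replacement in ng_words.items():
        text = text.replace(phrase, replacement)
    return " ".join(text.split())

def synthesize(text, voice_type="flat"):
    """Stand-in for the voice synthesis unit 105: describes the output instead of producing audio."""
    return {"voice_type": voice_type, "text": text}

def process(speech_signal, ng_words, voice_type="flat"):
    """Uttered voice signal -> text data -> cleaned text data -> synthesized voice."""
    text = recognize(speech_signal)
    cleaned = remove_words(text, ng_words)
    return synthesize(cleaned, voice_type)

# Hypothetical dictionary and inputs loosely mirroring T1 and T3 in FIG. 4.
ng_words = {"I'm telling ya": "I am saying"}
out_clean = process({"transcript": "My order has not arrived"}, ng_words)
out_rough = process({"transcript": "I'm telling ya this is a problem"}, ng_words)
```

Because the operator hears only the output of `synthesize`, the tone of the original utterance never reaches the operator, which is the point of routing the call through text.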

The emotion recognition unit 106 generates the customer's emotion information based on at least one of the uttered voice signal acquired by the transmission/reception unit 102, the text data generated by the voice recognition unit 103, and the subjective evaluation information received by the transmission/reception unit 102. The emotion recognition unit 106 may generate the customer's emotion information based on voice feature amounts (for example, intonation and volume) extracted from the uttered voice signal. The emotion recognition unit 106 may generate the customer's emotion information based on the fact that a specific word string was detected in the text data generated from the uttered voice signal, or that no specific word string was detected for a predetermined time or longer. The emotion recognition unit 106 may generate the customer's emotion information based on a captured image of the customer acquired by the camera 10g. The emotion recognition unit 106 may generate the customer's emotion information using the emotion recognition model 101c.

The emotion recognition model 101c is a model that takes as input an uttered voice signal, a voice feature amount extracted from that uttered voice signal, text data generated from that uttered voice signal, a text feature amount, or a combination of at least two of these, and outputs emotion information representing the customer's emotion corresponding to that uttered voice signal.

FIG. 5A is an explanatory diagram of the training process for the emotion recognition model 101c. For example, the emotion recognition model 101c may be trained using a collection of data items (hereinafter referred to as a "data set"), each of which includes at least one of voice feature quantities extracted from a speech signal, text feature quantities extracted from text data, and "subjective evaluation information" provided by an operator (or feature quantities extracted from the subjective evaluation information). The subjective evaluation information is information in which the operator, having listened to the customer's speech signal, subjectively evaluates the customer's emotion. For example, the operator may rate the customer's anger on multiple levels, such as anger levels 1 to 10. A data set for training the emotion recognition model 101c may be generated, for example, as follows. The operator listens to the customer's raw speech signal and annotates the customer's emotion as estimated from that speech signal (that is, assigns "subjective evaluation information" to the speech signal). This yields information in which the speech signal and the customer's emotion estimated from it are associated on the time axis. When a plurality of operators assign subjective evaluation information to a plurality of speech signals, a data set that is a collection of such information is obtained. The emotion recognition model 101c may be trained by supervised machine learning using such a data set. Note that the data set used to train the emotion recognition model 101c may include speech signals in addition to or instead of voice feature quantities, and may include text data in addition to or instead of text feature quantities.
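The annotation procedure described above can be pictured with a minimal sketch. It assumes that each annotation is stored as a time-stamped record and that feature quantities have already been extracted per segment; the `Annotation` class, its field names, and `build_dataset` are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One piece of subjective evaluation information (hypothetical layout)."""
    utterance_id: str   # which speech signal was annotated
    start_sec: float    # position of the segment on the time axis
    end_sec: float
    operator_id: str    # operator who gave the evaluation
    emotion: str        # e.g. "anger"
    level: int          # e.g. anger level 1 to 10


def build_dataset(annotations, features):
    """Pair each annotated segment with its features as (input, label)."""
    dataset = []
    for a in annotations:
        if not 1 <= a.level <= 10:
            raise ValueError(f"level out of range: {a.level}")
        key = (a.utterance_id, a.start_sec, a.end_sec)
        if key in features:  # segments without extracted features are skipped
            dataset.append((features[key], (a.emotion, a.level)))
    return dataset


annotations = [
    Annotation("call-1", 0.0, 2.5, "op-A", "anger", 7),
    Annotation("call-1", 2.5, 4.0, "op-B", "anger", 3),
]
# Assumed pre-computed voice feature quantities, keyed by segment.
features = {("call-1", 0.0, 2.5): [0.12, 0.80]}
dataset = build_dataset(annotations, features)
```

The resulting list of (features, label) pairs is the kind of input a supervised learner would consume; the choice of learner is left open here, as it is in the text.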

FIG. 5B is an explanatory diagram of estimation processing using the emotion recognition model 101c. For example, as shown in FIG. 5B, voice feature quantities extracted from the speech signal S1 and/or text feature quantities extracted from the text data T1 generated from the speech signal S1 are input to the emotion recognition model 101c, whereby the output corresponding to the input, that is, the emotion information corresponding to the speech signal, is obtained. Note that the speech signal S1 may be input to the emotion recognition model 101c in addition to or instead of the voice feature quantities, and the text data T1 may be input in addition to or instead of the text feature quantities.

The subjective evaluation information may numerically indicate the degree of one or more emotions (for example, at least one of "happiness", "surprise", "fear", "anger", "disgust", and "sadness"). Alternatively, the emotion information may indicate a specific emotion that the customer is likely to be feeling (for example, "anger").

The stress recognition unit 107 generates information on the operator's stress state (hereinafter referred to as "stress information"). For example, the stress recognition unit 107 may estimate the operator's stress state by a conventionally known method based on vital data such as the operator's heart rate, perspiration, and respiration, or on image information such as the operator's line of sight and facial expression collected with a camera. For example, the stress recognition unit 107 may estimate the operator's stress state based on the operator's uttered voice. Specifically, the stress recognition unit 107 may estimate the operator's stress state based on changes in the tone or speed of the operator's speech, the appearance of words of apology, the operator talking over the customer, and the like. For example, the stress recognition unit 107 may estimate the operator's stress state based on the operation log of the operator terminal 20. Specifically, the stress recognition unit 107 may estimate the operator's stress state from the movement of the mouse or the like, or from the absence of operation input in a situation where an operation is expected. The stress recognition unit 107 may generate the stress information based on the stress recognition model 101d. The stress recognition model 101d is a model that takes as input a speech signal, voice feature quantities extracted from the speech signal, text data generated from the speech signal, text feature quantities, or a combination of at least two of these, and outputs an estimate of the stress felt by the operator listening to the uttered voice. Measured values of the stress that operators actually felt on hearing customers' uttered voices may be used to train the stress recognition model 101d. A data set for training the stress recognition model 101d may be generated, for example, as follows. The operator annotates the degree of stress felt on hearing the customer's uttered voice (for example, a level from 1 to 10); that is, the operator assigns to the speech signal the "degree of stress" that he or she felt. This yields information in which the speech signal and the operator's stress on hearing that speech signal are associated on the time axis. When a plurality of operators assign degrees of stress to a plurality of speech signals, a data set that is a collection of such information is obtained. The stress recognition model 101d may be trained by supervised machine learning using such a data set.

The control unit 108 performs various kinds of control relating to the speech processing device 10. Specifically, based on the stress information generated by the stress recognition unit 107, the control unit 108 may switch whether the operator terminal 20 outputs the synthesized speech generated by the speech synthesis unit 105 or the customer's uttered voice. The control unit 108 may switch, based on the stress information, whether or not to generate a synthesized speech signal from the speech signal. For example, when the stress level indicated by the stress information is equal to or greater than (or strictly greater than) a predetermined threshold, the control unit 108 may perform control so that the synthesized speech, rather than the customer's uttered voice, is output to the operator. Conversely, when the stress level indicated by the stress information is less than (or equal to or less than) the predetermined threshold, the control unit 108 may perform control so that the uttered voice is output to the operator. When instruction information for automatic switching of the emotion suppression function is input by the operator, the control unit 108 may perform the above switching based on the stress information. The emotion suppression function is a function that outputs synthesized speech to the operator in place of the customer's uttered voice.

制御部１０８は、感情情報に基づいて上記切り替えを行ってもよい。制御部１０８は、当該切り替えを感情抑制切替モデル１０１ｅの出力に基づいて行ってもよい。感情抑制切替モデル１０１ｅは、発話音声信号、音声特徴量、テキストデータ、テキスト特徴量又はこれらの少なくとも二つの組み合わせを入力として、感情抑制機能のオン・オフを切り替えるタイミングを出力とするモデルである。感情抑制切替モデル１０１ｅは更にストレス情報又は感情情報を入力としてもよい。感情抑制切替モデル１０１ｅの詳細については後述する。 The control unit 108 may perform the switching based on emotion information. The control unit 108 may perform the switching based on the output of the emotion suppression switching model 101e. The emotion suppression switching model 101e is a model that receives speech signals, voice features, text data, text features, or a combination of at least two of these as inputs, and outputs timings for switching the emotion suppression function on and off. The emotion suppression switching model 101e may further receive stress information or emotion information. The details of the emotion suppression switching model 101e will be described later.

また、制御部１０８は、オペレータによって入力される切り替え情報に基づいて上記切り替えを行ってもよい。ここで、切り替え情報は、顧客の感情抑制機能の適用（オン）又は非適用（オフ）の切り替えに関する情報である。例えば、制御部１０８は、切り替え情報が顧客の感情抑制機能の適用を示す場合、合成音声をオペレータに出力するように制御してもよい。一方、制御部１０８は、切り替え情報が顧客の感情抑制機能の非適用を示す場合、発話音声をオペレータに出力するように制御してもよい。制御部１０８は、オペレータから感情抑制機能の手動切り替えについての指示情報が入力された場合、上記切り替え情報に基づいて上記切り替えを行ってもよい。 Further, the control unit 108 may perform the switching based on switching information input by the operator. Here, the switching information is information regarding switching between application (ON) and non-application (OFF) of the customer's emotion suppression function. For example, when the switching information indicates application of the customer's emotion suppression function, the control unit 108 may control to output synthesized speech to the operator. On the other hand, when the switching information indicates that the customer's emotion suppression function is not applied, the control unit 108 may control to output the uttered voice to the operator. When instruction information for manual switching of the emotion suppression function is input from the operator, control unit 108 may perform the switching based on the switching information.
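The automatic and manual switching behaviour of the control unit 108 can be summarized in a small sketch. The names `Mode` and `use_synthesized_speech` are hypothetical, and the comparison uses "equal to or greater than" although the text equally allows a strict comparison; the numeric scale of the stress level is also an assumption.

```python
from enum import Enum


class Mode(Enum):
    AUTO = "auto"      # switch based on stress information
    MANUAL = "manual"  # switch based on the operator's switching information


def use_synthesized_speech(mode, stress_level, threshold, manual_on):
    """Return True if synthesized speech should be output to the operator.

    mode         -- Mode.AUTO or Mode.MANUAL (instruction information)
    stress_level -- stress level indicated by the stress information
    threshold    -- predetermined threshold (assumed to share the same scale)
    manual_on    -- switching information entered by the operator
    """
    if mode is Mode.MANUAL:
        return manual_on
    # Assumption: ">=" is used; a strict ">" is also permitted by the text.
    return stress_level >= threshold
```

For instance, with a threshold of 0.5, a stress level of 0.7 in automatic mode selects synthesized speech, while in manual mode only the operator's own setting matters.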

学習部１０９は、感情認識モデル１０１ｃ、ストレス認識モデル１０１ｄ及び感情抑制切替モデル１０１ｅの学習処理を行ってよい。 The learning unit 109 may perform learning processing for the emotion recognition model 101c, the stress recognition model 101d, and the emotion suppression switching model 101e.

The speech processing device 10 may associate any one of the items of information listed in 1) to 7) below, or a combination of at least two of them, on the time axis and transmit the result to the operator terminal 20 via the transmission/reception unit 102: 1) the customer's speech signal, 2) text data generated from the speech signal, 3) text data after processing by the removal unit 104, 4) detection information, 5) the synthesized speech signal, 6) the customer's emotion information estimated from the customer's speech signal, and 7) the timing at which the emotion suppression function is switched on or off. When the emotion suppression function is on, the speech processing device 10 need not send the customer's speech signal to the operator terminal 20. When the emotion suppression function is off, the speech processing device 10 need not send the synthesized speech signal to the operator terminal 20. Regardless of whether the emotion suppression function is on or off, the speech processing device 10 may send both the customer's speech signal and the synthesized speech signal to the operator terminal 20.
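One way to picture items 1) to 7) is as a single time-stamped record from which the unused audio is dropped before transmission. The sketch below is only illustrative: the `Segment` class, its field names, and `payload_for_terminal` are hypothetical, and the `send_both` flag models the option of sending both signals regardless of the on/off state.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Segment:
    """Items 1) to 7), associated on the time axis via start_sec."""
    start_sec: float
    customer_audio: Optional[bytes] = None     # 1) customer's speech signal
    recognized_text: Optional[str] = None      # 2) text from speech recognition
    filtered_text: Optional[str] = None        # 3) text after the removal unit
    detection: Optional[str] = None            # 4) detection information
    synthesized_audio: Optional[bytes] = None  # 5) synthesized speech signal
    emotion: Optional[str] = None              # 6) customer's emotion information
    switched_on: Optional[bool] = None         # 7) on/off switch at this time


def payload_for_terminal(segment, suppression_on, send_both=False):
    """Return a copy of the segment with the unused audio omitted."""
    out = replace(segment)  # shallow copy; the original stays untouched
    if not send_both:
        if suppression_on:
            out.customer_audio = None     # operator hears synthesized speech
        else:
            out.synthesized_audio = None  # operator hears the customer's voice
    return out
```

Keeping all items in one record makes the time-axis association explicit, at the cost of carrying empty fields when only a subset is transmitted.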

≪オペレータ端末≫ ≪Operator terminal≫

図６は、本実施形態に係るオペレータ端末の機能構成の一例を示す図である。オペレータ端末２０は、送受信部２０１、入力受付部２０２、制御部２０３を備える。なお、図６に示す機能構成は一例にすぎず、図示しない他の構成を備えてもよい。 FIG. 6 is a diagram showing an example of the functional configuration of an operator terminal according to this embodiment. The operator terminal 20 includes a transmission/reception section 201 , an input reception section 202 and a control section 203 . Note that the functional configuration shown in FIG. 6 is merely an example, and other configurations not shown may be provided.

The transmission/reception unit 201 transmits and/or receives various information and/or signals to and from the speech processing device 10 and/or the customer terminal 30. For example, the transmission/reception unit 201 may receive a speech signal, which is a signal of the customer's uttered voice picked up by the customer terminal 30. The transmission/reception unit 201 may receive the synthesized speech signal from the speech processing device 10. The transmission/reception unit 201 may also transmit subjective evaluation information to the speech processing device 10, and may receive the customer's emotion information from the speech processing device 10.

The input reception unit 202 receives input of various information based on the operator's operation of the input unit 10e. For example, as part of the work of generating data sets for training the emotion recognition model 101c and the stress recognition model 101d, the input reception unit 202 may receive input of subjective evaluation information and degrees of stress for the customer's raw speech signal. Hereinafter, the work in which the operator enters subjective evaluation information and degrees of stress at the operator terminal 20 is referred to as "annotation work". Annotation work may be treated as a task separate from normal call center work. The input reception unit 202 may also receive input of switching information for the emotion suppression function. Further, the input reception unit 202 may receive input of instruction information specifying either manual switching or automatic switching of the emotion suppression function.

制御部２０３は、オペレータ端末２０に関する種々の制御を行う。例えば、制御部２０３は、表示部１０ｆにおける情報及び／又は画像の表示を制御する。また、制御部２０３は、音声出力部１０ｉにおける音声の出力を制御する。制御部２０３は、音声処理装置１０から送信される情報に基づいて音声の出力を制御してもよいし、入力受付部２０２が受け付けた情報に基づいて音声の出力を制御してもよい。 The control unit 203 performs various controls regarding the operator terminal 20 . For example, the control unit 203 controls display of information and/or images on the display unit 10f. Further, the control unit 203 controls the output of audio in the audio output unit 10i. The control unit 203 may control audio output based on information transmitted from the audio processing device 10 or may control audio output based on information received by the input receiving unit 202 .

The control unit 203 causes the voice output unit 10i to output synthesized speech based on the synthesized speech signal received from the speech processing device 10. The control unit 203 may cause the voice output unit 10i to output the uttered voice based on the speech signal from the customer terminal 30.

また、制御部２０３は、音声処理装置１０から受信した感情情報に基づいて、合成音声信号に対応する感情情報を表示部１０ｆに表示させてもよい。また、制御部２０３は、音声処理装置１０から受信した合成音声信号に対応するテキストデータを表示部１０ｆに表示させてもよい。例えば、制御部２０３は、感情情報、テキストデータ及び検出情報の少なくとも一つを含む画面Ｄ１を表示部１０ｆに表示させてもよい。また、制御部２０３は、ストレス情報を表示部１０ｆに表示させてもよい。例えば、制御部２０３は、ストレス情報を含む画面Ｄ２を表示部１０ｆに表示させてもよい。 Further, based on the emotional information received from the speech processing device 10, the control unit 203 may cause the display unit 10f to display emotional information corresponding to the synthesized speech signal. Further, the control unit 203 may display text data corresponding to the synthesized speech signal received from the speech processing device 10 on the display unit 10f. For example, the control unit 203 may cause the display unit 10f to display a screen D1 including at least one of emotion information, text data, and detection information. Further, the control unit 203 may display the stress information on the display unit 10f. For example, the control unit 203 may display a screen D2 including stress information on the display unit 10f.

図７は、本実施形態に係る画面Ｄ１の一例を示す図である。図７に示すように、画面Ｄ１において、制御部２０３は、音声出力部１０ｉからの合成音声の出力タイミングＴに合わせて、感情情報Ｉ１を表示部１０ｆに表示させてもよい。合成音声の出力タイミングＴ毎に感情情報Ｉ１を表示させることにより、オペレータは、感情抑制機能により顧客の感情が抑制された合成音声を聞く場合でも、顧客の感情をリアルタイムで認識することができる。 FIG. 7 is a diagram showing an example of the screen D1 according to this embodiment. As shown in FIG. 7, on the screen D1, the control unit 203 may cause the display unit 10f to display the emotion information I1 in accordance with the output timing T of the synthetic voice from the voice output unit 10i. By displaying the emotion information I1 at each output timing T of the synthetic voice, the operator can recognize the customer's emotion in real time even when listening to the synthetic voice in which the customer's emotion is suppressed by the emotion suppression function.

また、画面Ｄ１において、制御部２０３は、当該合成音声の出力タイミングＴに合わせて、当該合成音声に対応するテキストデータＩ２の内容を表示部１０ｆに表示させてもよい。テキストデータＩ２の内容を表示させることにより、オペレータは、合成音声だけでなく、視覚的にも顧客の発話内容を把握可能となる。 Further, on the screen D1, the control unit 203 may cause the display unit 10f to display the content of the text data I2 corresponding to the synthesized speech in accordance with the output timing T of the synthesized speech. By displaying the content of the text data I2, the operator can grasp the content of the customer's utterance visually as well as the synthesized voice.

また、画面Ｄ１では、制御部２０３は、音声処理装置１０から受信した検出情報に基づいて、特定の単語列そのものの表示に代えて、特定の単語列の検出を示す情報Ｉ３（例えば、「ＮＧワード検出」）を表示部１０ｆに表示させてもよい。この機能を「ＮＧワード非表示機能」と呼ぶ。これにより、心理的悪影響を与える顧客の発話の内容をそのままオペレータに認識させるのを回避できるのでオペレータのストレスを抑制できる。また、当該発話があったことはオペレータに通知できるので、オペレータが顧客に対する応対を適切に行うことができる。 On screen D1, based on the detection information received from the speech processing device 10, the control unit 203 displays information I3 (for example, "NG word detection”) may be displayed on the display unit 10f. This function is called "NG word non-display function". As a result, it is possible to prevent the operator from directly recognizing the content of the customer's utterance, which causes psychologically adverse effects, so that the operator's stress can be suppressed. In addition, since the operator can be notified of the speech, the operator can appropriately respond to the customer.

Further, on the screen D1, the control unit 203 may cause the display unit 10f to display, in time series, the level I4 of a specific emotion of the customer at each output timing T of the synthesized speech, based on the emotion information from the speech processing device 10. For example, in FIG. 7, the customer's "anger" level I4 at each output timing T of the synthesized speech is shown as a line graph. This allows the operator to easily grasp how a specific emotion of the customer (for example, "anger") changes over time, so that customer satisfaction with the operator's response can be improved.

On the screen D1, the control unit 203 may cause the display unit 10f to display a selection button I5. The selection button I5 is an interface that allows the operator to select whether application (on) or non-application (off) of the emotion suppression function is switched automatically or manually. The operator can switch between the "automatic switching mode" and the "manual switching mode" by clicking, tapping, sliding, or otherwise operating the selection button I5. In the automatic switching mode, the emotion suppression function is switched on and off automatically based on, for example, emotion information, stress information, or the output of the emotion suppression switching model 101e.

「手動切替モード」が選択された場合、制御部２０３は、感情抑制機能の適用又は非適用をオペレータが選択可能とするインターフェースである切替ボタンＩ６を表示部１０ｆに表示させてよい。オペレータが感情抑制機能のオンとオフを切り替えたタイミングは、顧客の発話音声（及び/又は発話音声に基づいて抽出される各種特徴量）と時間軸上で関連付けされて「手動切替履歴データ」として不図示の記憶部に蓄積される。「手動切替履歴データ」には更にオペレータの識別情報が関連付けられてもよい。 When the "manual switching mode" is selected, the control unit 203 may cause the display unit 10f to display a switching button I6, which is an interface that allows the operator to select application or non-application of the emotion suppression function. The timing at which the operator switches the emotion suppression function on and off is associated with the customer's speech (and/or various feature values extracted based on the speech) on the time axis, and is recorded as "manual switching history data." It is stored in a storage unit (not shown). The "manual switching history data" may further be associated with operator identification information.
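The "manual switching history data" can be thought of as a time-ordered log of on/off events per operator. The sketch below assumes such a log and a lookup of the state in force at a given time; the class `ManualSwitchHistory`, its methods, and the default initial state are hypothetical.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ManualSwitchHistory:
    """Time-ordered on/off events for one operator (hypothetical layout)."""
    operator_id: str
    times: list = field(default_factory=list)   # seconds on the call's time axis
    states: list = field(default_factory=list)  # True = suppression switched on

    def record(self, time_sec, turned_on):
        """Append a switching event; events must arrive in time order."""
        if self.times and time_sec < self.times[-1]:
            raise ValueError("events must be recorded in chronological order")
        self.times.append(time_sec)
        self.states.append(turned_on)

    def state_at(self, time_sec, initial_on=False):
        """Return the on/off state in force at time_sec."""
        i = bisect.bisect_right(self.times, time_sec) - 1
        return self.states[i] if i >= 0 else initial_on
```

Because the events share a time axis with the customer's uttered voice, `state_at` can label any speech segment with the state the operator had chosen at that moment, which is the pairing needed for later training.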

The switching button I7 is a button for switching the "NG word non-display function" on and off. When the "NG word non-display function" is off, the text data I2 as it was before processing by the removal unit 104 is displayed on the display unit 10f as-is, even if a specific word string is detected in the text data I2. When the emotion suppression function is on and the NG word non-display function is off, the operator does not directly hear the specific word string spoken by the customer, which reduces stress, while still being able to grasp the content of the customer's utterance accurately and therefore to understand the customer's emotion more accurately.

The data set for training the emotion suppression switching model 101e may be a collection of data in which stress information, emotion information, the speech signal S1, voice feature quantities, text data, text feature quantities, or a combination of at least two of these is associated on the time axis with the timing at which the operator switched the emotion suppression function on or off. The emotion suppression switching model 101e can be trained in various ways, for example as described in 1) to 3) below. 1) The emotion suppression switching model 101e may be trained for each operator. That is, the emotion suppression switching model 101e applied to a given operator may be trained only on the "manual switching history data" of the emotion suppression function for that operator. With this method, the emotion suppression switching model 101e can switch the emotion suppression function at timings that match that operator's preference. Alternatively, 2) the emotion suppression switching model 101e applied to a given operator may be trained on the "manual switching history data" of an unspecified large number of operators. With this method, more data is available for training, so the emotion suppression switching model 101e can be trained quickly. Alternatively, 3) the emotion suppression switching model 101e applied to a given operator may be trained on the "manual switching history data" of operators who are similar to that operator in age, sex, and other characteristics. With this method, more data is available for training than with method 1), so the emotion suppression switching model 101e can be trained quickly, and switching timings better suited to the operator's own preference can be learned than with method 2).
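The three training options differ only in which operators' manual switching history is fed to the learner. The sketch below illustrates that selection step under stated assumptions: records are dictionaries with an `operator_id` key, and "similar" is approximated as an identical (age band, sex) profile. The function name, the strategy labels, and the similarity rule are all hypothetical.

```python
def select_history(records, target, strategy, profiles=None):
    """Choose which manual switching history records to train on.

    records  -- list of dicts, each with an "operator_id" key
    target   -- operator the trained model will be applied to
    strategy -- "per_operator" (method 1), "all" (method 2),
                or "similar" (method 3)
    profiles -- operator_id -> (age_band, sex); needed for "similar"
    """
    if strategy == "per_operator":
        return [r for r in records if r["operator_id"] == target]
    if strategy == "all":
        return list(records)
    if strategy == "similar":
        if profiles is None:
            raise ValueError("profiles are required for the 'similar' strategy")
        # Assumption: similarity means an identical profile; the target
        # operator therefore always counts as similar to itself.
        group = {op for op, p in profiles.items() if p == profiles[target]}
        return [r for r in records if r["operator_id"] in group]
    raise ValueError(f"unknown strategy: {strategy}")
```

The trade-off stated in the text shows up directly in the size of the returned list: method 1) yields the fewest records, method 2) the most, and method 3) sits between them.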

図８は、本実施形態に係る画面Ｄ２の一例を示す図である。画面Ｄ２において、制御部２０３は、音声処理装置１０からのストレス情報を表示させてもよい。例えば、図８では、ストレス情報として、オペレータが感じるストレスの推定値を示す情報（例えば、「５６％」）と、当該オペレータの平常時の状態からの相対的な評価値を示す情報（例えば、「平常時より８．１％減」）とが表示される。 FIG. 8 is a diagram showing an example of the screen D2 according to this embodiment. The control unit 203 may display the stress information from the speech processing device 10 on the screen D2. For example, in FIG. 8, as stress information, information indicating an estimated value of stress felt by the operator (eg, "56%") and information indicating a relative evaluation value from the normal state of the operator (eg, "8.1% less than usual") is displayed.

FIG. 12 is a diagram showing an example of the screen D3 according to this embodiment. On the screen D3, the control unit 203 may display an interface I8 with which the operator performs annotation work. For example, while listening to a customer's raw voice (sample voice), the operator selects from the interface I8, each time, the customer emotion perceived from the sample voice. In FIG. 12, the customer emotion I1 is the operator's subjective evaluation information on the customer's emotion. For example, if the operator annotates the sample voice "Get it delivered somehow by this evening" with the emotion "anger", then, as shown in FIG. 12, the sample voice "Get it delivered somehow by this evening" and the information "anger" are associated on the time axis. Annotation may be performed sentence by sentence or at predetermined time intervals.

(Operation of the speech processing system)
FIG. 9 is a flowchart showing an example of the emotion suppression operation according to this embodiment. Note that FIG. 9 is merely an example: the order of at least some of the steps (for example, step S106) may be changed, steps not shown may be performed, and some steps may be omitted.

音声処理装置１０は、顧客端末３０の音声入力部１０ｈで収音される顧客の発話音声の信号である発話音声信号を取得する（Ｓ１０１）。 The voice processing device 10 acquires an uttered voice signal, which is a signal of the customer's uttered voice picked up by the voice input unit 10h of the customer terminal 30 (S101).

The speech processing device 10 inputs feature quantities extracted based on the speech signal acquired in S101 into the speech recognition model 101a and generates text data including a word string consisting of one or more words (S102).

音声処理装置１０は、Ｓ１０２で生成されたテキストデータ内に特定の単語列が含まれるか否かを判定する（Ｓ１０３）。当該テキストデータ内に特定の単語列が含まれる場合、音声処理装置１０は、当該特定の単語列を除去又は前記特定の単語列を他の単語列に変換したテキストデータを生成する（Ｓ１０４）。 The speech processing device 10 determines whether or not the text data generated in S102 includes a specific word string (S103). If the text data contains a specific word string, the speech processing device 10 generates text data by removing the specific word string or converting the specific word string into another word string (S104).
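Steps S103 and S104 amount to detecting specific word strings and either removing or converting them. The sketch below uses plain substring matching, which is an assumption; the specification does not state how detection is performed, and `filter_text` and its parameters are hypothetical names.

```python
def filter_text(text, ng_words, replacements=None):
    """Remove or convert specific word strings (S103/S104).

    text         -- text data produced by speech recognition
    ng_words     -- specific word strings to look for
    replacements -- optional mapping from a word string to its substitute;
                    word strings without an entry are simply removed
    Returns (filtered_text, detected), where detected lists the word
    strings that were found (usable as detection information).
    """
    replacements = replacements or {}
    detected = []
    for word in ng_words:
        if word in text:  # naive substring match, an assumption
            detected.append(word)
            text = text.replace(word, replacements.get(word, ""))
    # Collapse the whitespace left behind by removals.
    return " ".join(text.split()), detected
```

Returning the list of detected word strings alongside the filtered text lets the caller show a notice such as "NG word detected" without displaying the word string itself.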

The speech processing device 10 inputs feature quantities extracted based on the text data into the speech synthesis model 101b and generates a synthesized speech signal, which is the signal of the synthesized speech (S105).

音声処理装置１０は、Ｓ１０１で取得された発話音声信号、Ｓ１０２で生成されたテキストデータ、及び、オペレータによって入力される顧客の感情の主観的評価情報の少なくとも一つに基づいて抽出される特徴量を感情認識モデル１０１ｃに入力して、顧客の感情情報を生成する（Ｓ１０６）。 The speech processing device 10 extracts a feature amount based on at least one of the speech audio signal acquired in S101, the text data generated in S102, and subjective evaluation information of the customer's emotion input by the operator. is input to the emotion recognition model 101c to generate the customer's emotion information (S106).

オペレータ端末２０は、Ｓ１０５で生成された合成音声信号に基づいて合成音声を音声出力部１０ｉから出力させるとともに、当該合成音声の出力タイミングＴに合わせて当該合成音声に対応する感情情報を表示部１０ｆに表示させる（Ｓ１０７、例えば、図７）。 The operator terminal 20 outputs synthetic speech from the speech output unit 10i based on the synthesized speech signal generated in S105, and displays emotional information corresponding to the synthetic speech at the output timing T of the synthetic speech on the display unit 10f. (S107, for example, FIG. 7).

音声処理装置１０は、処理を終了するか否かを判定する（Ｓ１０８）。処理を終了しない場合（Ｓ１０８：ＮＯ）、音声処理装置１０は、処理Ｓ１０１～Ｓ１０７を再び実行する。一方、音声変換処理を終了する場合（Ｓ１０８：ＹＥＳ）、音声処理装置１０は、処理を終了する。 The speech processing device 10 determines whether or not to end the process (S108). If the process is not to end (S108: NO), the speech processing device 10 executes the processes S101 to S107 again. On the other hand, when ending the voice conversion process (S108: YES), the voice processing device 10 ends the process.
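The flow S101 to S108 can be read as a loop over incoming speech signals. The sketch below passes every processing stage in as a callable, so nothing is assumed about the actual models; the function name and parameter names are hypothetical, and ending the loop when the input is exhausted stands in for the termination check of S108.

```python
def run_emotion_suppression(signals, recognize, filter_text, synthesize,
                            estimate_emotion, output):
    """Run steps S101 to S107 for each speech signal until input ends (S108).

    Every stage is supplied by the caller as a plain function, so this
    sketch fixes only the order of the steps, not how they are realized.
    """
    for signal in signals:                            # S101: acquire signal
        raw_text = recognize(signal)                  # S102: speech recognition
        safe_text = filter_text(raw_text)             # S103/S104: remove or convert
        synthesized = synthesize(safe_text)           # S105: speech synthesis
        emotion = estimate_emotion(signal, raw_text)  # S106: emotion recognition
        output(synthesized, emotion)                  # S107: play and display
```

Emotion estimation is given the unfiltered text here, following the description of S106, which refers to the text data generated in S102; as the text notes, the position of S106 in the sequence may be changed.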

図１０は、本実施形態に係る感情抑制機能の自動切り替え動作を示すフローチャートである。なお、図１０は、例示にすぎず、少なくとも一部のステップの順番は入れ替えられてもよいし、不図示のステップが実施されてもよいし、一部のステップが省略されてもよい。 FIG. 10 is a flowchart showing an automatic switching operation of the emotion suppression function according to this embodiment. Note that FIG. 10 is merely an example, and the order of at least some steps may be changed, steps not shown may be performed, and some steps may be omitted.

音声処理装置１０は、オペレータのストレス情報を生成する（Ｓ２０１）。 The speech processing device 10 generates operator stress information (S201).

音声処理装置１０は、ストレス情報が所定の条件を満たすか否かを判定する（Ｓ２０２）。例えば、所定の条件は、ストレス情報が示すストレス度数が所定の閾値以上又はより大きいことであってもよい。 The speech processing device 10 determines whether or not the stress information satisfies a predetermined condition (S202). For example, the predetermined condition may be that the stress level indicated by the stress information is greater than or equal to a predetermined threshold.

If the stress information satisfies the predetermined condition (S202: YES), the speech processing device 10 may apply the emotion suppression function (that is, cause the operator terminal 20 to output synthesized speech) (S203). If the stress information does not satisfy the predetermined condition (S202: NO), the speech processing device 10 may leave the emotion suppression function unapplied (that is, cause the operator terminal 20 to output the customer's uttered voice) (S204).

音声処理装置１０は、処理を終了するか否かを判定する（Ｓ２０５）。処理を終了しない場合（Ｓ２０５：ＮＯ）、音声処理装置１０は、処理Ｓ２０１～Ｓ２０４を再び実行する。一方、音声変換処理を終了する場合（Ｓ２０５：ＹＥＳ）、音声処理装置１０は、処理を終了する。なお、Ｓ２０１及びＳ２０２において、音声処理装置１０は、感情情報や感情抑制切替モデル１０１ｅの出力に基づいて、感情抑制機能を適用するか否を決定してもよい。 The speech processing device 10 determines whether or not to end the process (S205). If the process is not to end (S205: NO), the speech processing device 10 executes the processes S201 to S204 again. On the other hand, when ending the voice conversion process (S205: YES), the voice processing device 10 ends the process. In S201 and S202, the speech processing device 10 may determine whether or not to apply the emotion suppression function based on the emotion information and the output of the emotion suppression switching model 101e.

As described above, the speech processing system 1 according to this embodiment generates text data based on the customer's speech signal and outputs to the operator synthesized speech generated based on that text data. The operator can therefore be given synthesized speech in which the customer's emotion contained in the customer's uttered voice is sufficiently suppressed, and the operator's stress caused by the customer's emotional utterances can be reduced. The inventors of the present invention conducted an experiment in which about 50 subjects listened to and compared four types of voice, namely 1) the customer's uttered voice itself, 2) the customer's uttered voice with its volume adjusted, 3) the customer's uttered voice with its voice quality converted, and 4) synthesized speech generated after converting the customer's uttered voice into text, and rated the degree of anger felt from each voice on a seven-point scale. As a result, the reduction in the anger conveyed to the subjects was markedly greater for 4) than for 2) or 3).

Further, the speech processing system 1 according to the present embodiment not only outputs the synthesized speech to the operator but can also notify the operator of the customer's emotion information in time with the output of the synthesized speech. The operator who hears the synthesized speech can therefore recognize the customer's emotion in real time and respond to the customer appropriately.

Further, in the speech processing system 1 according to the present embodiment, whether to apply the emotion suppression function (that is, whether to output the synthesized speech or the uttered speech to the operator) is switched based on, for example, the operator's stress information or the customer's emotion information. An appropriate balance can therefore be struck between the operator's stress and the customer's satisfaction.
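As a rough illustration of the switching described above, the following Python sketch chooses between the uttered speech and the synthesized speech from an operator stress score and a customer emotion label. The names `Frame`, `select_output`, `STRESS_THRESHOLD`, and `STRESSFUL_EMOTIONS`, the 0-to-1 score range, and the emotion labels are assumptions made for illustration only; the embodiment does not prescribe them.

```python
from dataclasses import dataclass

# Hypothetical values: the embodiment does not specify a threshold or label set.
STRESS_THRESHOLD = 0.7
STRESSFUL_EMOTIONS = {"anger", "fury"}


@dataclass
class Frame:
    uttered: bytes      # customer's original uttered speech signal
    synthesized: bytes  # synthesized speech generated from the text data


def select_output(frame: Frame, operator_stress: float, customer_emotion: str) -> bytes:
    """Return the signal to play to the operator.

    The emotion suppression function is applied (synthesized speech is chosen)
    when the operator's stress score is high or the customer's emotion is one
    assumed to be stressful; otherwise the uttered speech is passed through.
    """
    suppress = (
        operator_stress >= STRESS_THRESHOLD
        or customer_emotion in STRESSFUL_EMOTIONS
    )
    return frame.synthesized if suppress else frame.uttered
```

A simple OR of the two conditions is only one possible policy; the text leaves open how stress information and emotion information are combined.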

(Modification)
In the speech processing system 1 described above, the speech recognition unit 103 generates, from the uttered speech signal, text data including a word string finalized as one or more sentences, but the present invention is not limited to this. The speech recognition unit 103 may generate text data including a word string consisting of one or more words (parts of speech or morphemes) before the word string recognized from the uttered speech signal is finalized as one or more sentences. The removal unit 104 may remove a specific word string from the text data that has not yet been finalized as a sentence, and the speech synthesis unit 105 may generate a synthesized speech signal from the text data that has not yet been finalized as a sentence.

FIG. 11 is a diagram showing an example of generation of synthesized speech signals according to the modification of the present embodiment. In FIG. 11, it is assumed that the speech recognition unit 103 generates text data T41 to T43 based on the uttered speech signal S4 acquired by the transmitting/receiving unit 102. As shown in FIG. 11, the text data T41 to T43 differ from FIG. 4 in that the text data are generated in meaningful morpheme units ("quickly" (はやく), "send" (送って), "please" (ください)) before the single sentence "Please send it quickly" (はやく送ってください) is finalized. The removal unit 104 determines, for each of the text data T41 to T43, whether a specific word string is included, removes any such word string, and outputs the result to the speech synthesis unit 105. The speech synthesis unit 105 generates synthesized speech signals S41 to S43 from the text data T41 to T43, respectively.
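The flow of FIG. 11 can be sketched in Python as a generator that filters and synthesizes each morpheme-unit chunk as soon as it arrives, rather than waiting for the whole sentence. This is a minimal sketch under stated assumptions: `BLOCKED_WORDS`, `remove_blocked`, `synthesize_incrementally`, and `fake_tts` are hypothetical names, the removal list is invented, and the toy synthesizer merely encodes text so that the example runs without a real speech synthesis model.

```python
from typing import Callable, Iterable, Iterator, Set

# Hypothetical removal list; the specific word strings are not enumerated here.
BLOCKED_WORDS: Set[str] = {"idiot"}


def remove_blocked(chunk: str, blocked: Set[str] = BLOCKED_WORDS) -> str:
    """Stand-in for the removal unit 104: drop blocked words from one chunk."""
    return " ".join(w for w in chunk.split() if w.lower() not in blocked)


def synthesize_incrementally(
    chunks: Iterable[str],
    synthesize: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield one synthesized signal per chunk, before the sentence is finalized."""
    for chunk in chunks:
        cleaned = remove_blocked(chunk)
        if cleaned:  # skip chunks that became empty after removal
            yield synthesize(cleaned)


def fake_tts(text: str) -> bytes:
    """Toy stand-in for the speech synthesis unit 105: encodes text, no audio."""
    return text.encode("utf-8")
```

Calling `synthesize_incrementally(["quickly", "send", "please"], fake_tts)` produces three separate outputs, mirroring how T41 to T43 map to S41 to S43.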

As shown in FIG. 11, generating text data in units of one or more morphemes and outputting synthesized speech before a sentence is finalized reduces the delay in the operator's response that text data generation would otherwise cause. A model or the like that determines whether the plural pieces of morpheme-unit text data (or synthesized speech) are semantically unnatural may also be used.

Further, to reduce the response delay, a filler sound such as "ah" (あ～), "uh" (え～), or "well" (まあ) may be added before and/or after each of the synthesized speech signals S1 to S3 shown in FIG. 4 and the synthesized speech signals S41 to S43 shown in FIG. 11. This also allows the operator to prevent a decline in customer satisfaction caused by the response delay.

Further, the speech synthesis unit 105 may select, from among a plurality of speech synthesis models 101b, a speech synthesis model 101b that matches the customer's emotion, based on the customer's emotion estimated by the emotion recognition unit 106. For example, when the customer's emotion estimated by the emotion recognition unit 106 is "fury", the speech synthesis unit 105 may use a speech synthesis model 101b with a fast pace and strong intonation. When the estimated emotion is "sobbing", the speech synthesis unit 105 may use a speech synthesis model 101b that outputs speech resembling crying. Alternatively, the speech synthesis unit 105 may change the parameters of the speech synthesis model 101b based on the customer's emotion estimated by the emotion recognition unit 106 so that speech matching the customer's emotion is output. An operator who directly hears the raw voice of a furious customer feels extremely strong stress. On the other hand, the operator needs to grasp the customer's emotion accurately and in real time in order to carry out customer service work properly. Because the operator does not hear the uttered speech directly, the operator does not feel excessive stress, and because the customer's emotion is carried in the synthesized speech, the operator can grasp the customer's emotion in real time by ear.
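The selection of a synthesis model or parameter set by estimated emotion can be pictured as a simple lookup with a neutral fallback. The sketch below is illustrative only: `VoiceStyle`, `STYLES`, `DEFAULT_STYLE`, `style_for`, the two parameters, and all numeric values are assumptions, not parameters disclosed for the speech synthesis model 101b.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStyle:
    rate: float         # speaking-rate multiplier (1.0 = neutral)
    pitch_range: float  # intonation-range multiplier (1.0 = neutral)


# Hypothetical mapping from an estimated emotion label to synthesis parameters.
STYLES = {
    "fury": VoiceStyle(rate=1.3, pitch_range=1.5),     # faster, stronger intonation
    "sobbing": VoiceStyle(rate=0.8, pitch_range=0.7),  # slower, flatter
}
DEFAULT_STYLE = VoiceStyle(rate=1.0, pitch_range=1.0)


def style_for(emotion: str) -> VoiceStyle:
    """Return the style for the estimated emotion, or a neutral default."""
    return STYLES.get(emotion, DEFAULT_STYLE)
```

Falling back to a neutral style for unrecognized labels is a design choice of this sketch; the text does not say how unknown emotions are handled.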

(Other embodiments)
In the above embodiment, the customer's uttered speech signal is converted into text and a synthesized speech signal is output to the operator, but the present invention is not limited to this. The speech processing device 10 may input a speech feature amount extracted based on the customer's uttered speech signal into a speech conversion model to generate a converted speech signal, and the converted speech may be output from the operator terminal 20.

The "speech conversion model" recited in the claims is a concept that covers both a model that first converts the uttered speech signal into text and outputs it as synthesized speech, and a model that converts the voice quality of the uttered speech signal and outputs it without converting it into text. Outputting synthesized speech or converted speech to the operator in place of the customer's uttered speech reduces the stress felt by the operator, although the degree of the effect varies. On the other hand, for the operator to carry out customer service work, it is also essential that the operator grasp the customer's emotion in real time.

The speech processing device 10 in this modification inputs a speech feature amount extracted based on the customer's uttered speech signal into the speech conversion model to generate a converted speech signal. The speech processing device 10 generates information in which 1) the converted speech signal and 2) the customer's emotion information estimated from the customer's uttered speech are associated on the time axis, and transmits that information to the operator terminal 20. The information transmitted by the speech processing device 10 may further be associated with the uttered speech signal, the text data generated from the uttered speech signal, the text data after processing by the removal unit 104, detection information, and the timing at which the emotion suppression function is switched on or off.

The operator terminal 20 may output the converted speech signal received from the speech processing device 10 from the speech output unit 10i and, in time with the output timing T of the converted speech from the speech output unit 10i, display information indicating the emotion information on the display unit 10f. The operator terminal 20 may further display the text data on the display unit 10f in time with the output timing T of the converted speech from the speech output unit 10i. The display may take the form illustrated in FIG. 7.

The speech processing device 10 in this modification may generate the converted speech signal based on the emotion information so that the emotion indicated by the emotion information is reflected in the converted speech. For example, when the emotion indicated by the emotion information is "fury", a speech conversion model with a fast pace and strong intonation may be used. When the emotion indicated by the emotion information is "sobbing", a speech conversion model that outputs speech resembling crying may be used. Because the operator does not hear the uttered speech directly, the operator does not feel excessive stress, and because the customer's emotion is carried in the converted speech, the operator can grasp the customer's emotion in real time by ear.

In the speech processing system 1 according to this modification, the operator's annotation work may be performed on the converted speech during the operator's normal call center duties. When the operator annotates the converted speech with "anger", the speech conversion model may be adjusted in real time, based on the result of that annotation, so that it outputs softer speech.

The embodiments described above assume a call center in which the first user is a customer and the second user is an operator, but the present embodiment is not limited to call centers. It is applicable to any situation, such as a web meeting, in which speech with the first user's emotion suppressed is output to the second user. That is, the present embodiment can be used by companies as a countermeasure not only against customer harassment but also against various other kinds of harassment, such as power harassment within the company.

In the embodiments described above, the process of "associating on the time axis" the emotion information with the synthesized speech may take any specific form, as long as it makes it possible to display, in time with the output of the synthesized speech or converted speech, the emotion information estimated from the uttered speech from which that speech was generated, as shown in FIG. 7. The process may associate the items based on time-of-day information (hour, minute, and second), based on the time elapsed since the start of the uttered speech information (minutes and seconds), or in units of sentences, words, or morphemes.
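One of the options named above, association by elapsed time from the start of the uttered speech, could look like the following Python sketch. It stores emotion labels with their offsets and looks up the label that applies at a given playback position. `EmotionEvent` and `emotion_at` are hypothetical names, and the rule that a label stays in effect until the next one begins is an assumption of this sketch rather than something the text specifies.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmotionEvent:
    offset_s: float  # seconds elapsed since the start of the uttered speech
    emotion: str     # label estimated from the uttered speech at that offset


def emotion_at(events: List[EmotionEvent], playback_s: float) -> Optional[str]:
    """Return the most recent emotion label at or before playback_s.

    Assumes events are sorted by offset_s in ascending order.
    Returns None if playback_s precedes the first event.
    """
    offsets = [e.offset_s for e in events]
    i = bisect_right(offsets, playback_s)
    return events[i - 1].emotion if i > 0 else None
```

An operator terminal could call `emotion_at` with the current output position of the synthesized or converted speech to decide which emotion indicator to show; sentence-, word-, or morpheme-level association would replace the float offset with an index into those units.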

In the speech processing system 1 of the embodiments described above, the customer may be kept unaware that his or her speech reaches the operator with its emotion suppressed. That is, the customer may be unable to tell whether the emotion suppression function is on or off.

The annotation work may be performed by the operator on the operator terminal 20, or a dedicated application or terminal for annotation work may be provided separately.

The embodiments described above are intended to facilitate understanding of the present invention and are not to be construed as limiting it. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes, and the like are not limited to those illustrated and may be changed as appropriate. Configurations shown in different embodiments may be partially replaced with or combined with one another. The operator terminal 20 may have functions described as functions of the speech processing device 10, and the speech processing device 10 may have functions described as functions of the operator terminal 20.

REFERENCE SIGNS LIST
1: speech processing system; 10: speech processing device; 20: operator terminal; 30: customer terminal; 10a: processor; 10b: RAM; 10c: ROM; 10d: communication unit; 10e: input unit; 10f: display unit; 10g: camera; 10h: speech input unit; 10i: speech output unit; 101: storage unit; 102: transmitting/receiving unit; 103: speech recognition unit; 104: removal unit; 105: speech synthesis unit; 106: emotion recognition unit; 107: stress recognition unit; 108: control unit; 109: learning unit; 201: transmitting/receiving unit; 202: input receiving unit; 203: control unit

A speech processing system according to one aspect of the present invention includes: an acquisition unit that acquires an uttered speech signal, which is a signal of uttered speech of a first user; a speech recognition unit that inputs a feature amount extracted based on the uttered speech signal into a speech recognition model to generate text data including a word string consisting of one or more words; a speech synthesis unit that inputs a feature amount extracted based on the text data into a speech synthesis model to generate a synthesized speech signal, which is a signal of synthesized speech; a speech output unit that outputs the synthesized speech to a second user; an emotion recognition unit that generates emotion information of the first user corresponding to the uttered speech signal; and a control unit that switches, based on the emotion information, between outputting the synthesized speech and outputting the uttered speech from the speech output unit, wherein the control unit outputs the synthesized speech when the emotion information is such that it may cause stress to the second user.

A speech processing system according to one aspect of the present invention includes: an acquisition unit that acquires an uttered speech signal, which is a signal of uttered speech of a first user; a speech recognition unit that inputs a feature amount extracted based on the uttered speech signal into a speech recognition model to generate text data including a word string consisting of one or more words; a speech synthesis unit that inputs a feature amount extracted based on the text data into a speech synthesis model to generate a synthesized speech signal, which is a signal of synthesized speech; a speech output unit that outputs the synthesized speech to a second user; a stress recognition unit that generates stress information about the stress state of the second user; and a control unit that switches, based on the stress information, between outputting the synthesized speech and outputting the uttered speech from the speech output unit, wherein the control unit outputs the synthesized speech when the stress information indicates a state of high stress.

A speech processing system according to one aspect of the present invention includes: an acquisition unit that acquires an uttered speech signal, which is a signal of uttered speech of a first user; a speech recognition unit that inputs a feature amount extracted based on the uttered speech signal into a speech recognition model to generate text data including a word string consisting of one or more words; a speech synthesis unit that inputs a feature amount extracted based on the text data into a speech synthesis model to generate a synthesized speech signal, which is a signal of synthesized speech; a speech output unit that outputs the synthesized speech to a second user; a control unit that switches, based on switching information input by the second user, between outputting the synthesized speech and outputting the uttered speech from the speech output unit; and a learning unit that generates information in which the switching information input by the second user is associated on the time axis with the uttered speech signal at the time the switching information was input, and that, based on that information, machine-learns an emotion suppression switching model that takes as input an uttered speech signal, a feature amount extracted from the uttered speech signal, text data generated from the uttered speech signal, a feature amount extracted from the text data, or a combination of at least two of these, and outputs a timing for switching between the synthesized speech and the uttered speech, wherein the control unit generates the timing for switching between the synthesized speech and the uttered speech by inputting, into the emotion suppression switching model, the uttered speech signal acquired by the acquisition unit, a feature amount extracted from that uttered speech signal, text data generated from that uttered speech signal, a feature amount extracted from that text data, or a combination of at least two of these.

Claims

1. A speech processing system comprising:
an acquisition unit that acquires an uttered speech signal, which is a signal of uttered speech of a first user;
a speech recognition unit that inputs a feature amount extracted based on the uttered speech signal into a speech recognition model to generate text data including a word string consisting of one or more words;
a speech synthesis unit that inputs a feature amount extracted based on the text data into a speech synthesis model to generate a synthesized speech signal, which is a signal of synthesized speech; and
a speech output unit that outputs the synthesized speech to a second user.

2. The speech processing system according to claim 1, comprising:
an emotion recognition unit that generates emotion information of the first user corresponding to the uttered speech signal; and
a display unit that displays the emotion information to the second user,
wherein the display unit displays the emotion information corresponding to the synthesized speech in time with the output timing of the synthesized speech by the speech output unit.

3. The speech processing system according to claim 1 or 2, comprising:
an emotion recognition unit that generates emotion information of the first user corresponding to the uttered speech signal; and
a control unit that switches, based on the emotion information, between outputting the synthesized speech and outputting the uttered speech from the speech output unit.

4. The speech processing system according to claim 2 or 3, wherein the emotion recognition unit generates the emotion information of the first user corresponding to the uttered speech signal acquired by the acquisition unit by inputting, into an emotion recognition model machine-learned to take as input an uttered speech signal, a feature amount extracted from the uttered speech signal, text data generated from the uttered speech signal, a feature amount extracted from the text data, or a combination of at least two of these and to output emotion information of the speaker of that uttered speech signal, the uttered speech signal acquired by the acquisition unit, a speech feature amount extracted from that uttered speech signal, text data generated from that uttered speech signal, a text feature amount corresponding to that text data, or a combination of at least two of these.

The speech synthesis unit generates the synthetic speech signal based on the emotion information generated by the emotion recognition unit so that the emotion indicated by the emotion information is reflected in the synthetic speech. 5. The audio processing system according to any one of 4.

6. The speech processing system according to claim 1 or 2, comprising:
a stress recognition unit that generates stress information about the stress state of the second user; and
a control unit that switches, based on the stress information, between outputting the synthesized speech and outputting the uttered speech from the speech output unit.

7. The speech processing system according to any one of claims 1 to 6, comprising:
a control unit that switches, based on switching information input by the second user, between outputting the synthesized speech and outputting the uttered speech from the speech output unit; and
a learning unit that generates information in which the switching information input by the second user is associated on a time axis with the uttered speech signal at the time the switching information was input, and that, based on that information, machine-learns an emotion suppression switching model that takes as input an uttered speech signal, a feature amount extracted from the uttered speech signal, text data generated from the uttered speech signal, a feature amount extracted from the text data, or a combination of at least two of these, and outputs a timing for switching between the synthesized speech and the uttered speech,
wherein the control unit generates the timing for switching between the synthesized speech and the uttered speech by inputting, into the emotion suppression switching model, the uttered speech signal acquired by the acquisition unit, a feature amount extracted from that uttered speech signal, text data generated from that uttered speech signal, a feature amount extracted from that text data, or a combination of at least two of these.

8. A speech processing device comprising:
an acquisition unit that acquires an uttered speech signal, which is a signal of uttered speech of a first user;
a speech recognition unit that inputs a feature amount extracted based on the uttered speech signal into a speech recognition model to generate text data including a word string consisting of one or more words; and
a speech synthesis unit that inputs a feature amount extracted based on the text data into a speech synthesis model to generate a synthesized speech signal, which is a signal of synthesized speech to be output to a second user.

9. The speech processing device according to claim 8, comprising:
an emotion recognition unit that generates emotion information of the first user corresponding to the uttered speech signal; and
a transmitting unit that associates, on a time axis, the emotion information with the uttered speech signal and/or the synthesized speech signal corresponding to the emotion information, and transmits the result to an external device.

10. A speech processing method comprising:
a step of acquiring an uttered speech signal, which is a signal of uttered speech of a first user;
a step of inputting a feature amount extracted based on the uttered speech signal into a speech recognition model to generate text data including a word string consisting of one or more words;
a step of inputting a feature amount extracted based on the text data into a speech synthesis model to generate a synthesized speech signal, which is a signal of synthesized speech; and
a step of outputting the synthesized speech to a second user.

11. A speech processing system comprising:
an acquisition unit that acquires an uttered speech signal, which is a signal of uttered speech of a first user;
a speech conversion unit that inputs a feature amount extracted based on the uttered speech signal into a speech conversion model to generate a converted speech signal;
a speech output unit that outputs the converted speech to a second user;
an emotion recognition unit that generates emotion information of the first user corresponding to the uttered speech signal; and
a display unit that displays, to the second user, the emotion information corresponding to the converted speech in time with the output timing of the converted speech by the speech output unit.

12. A speech processing system comprising:
an acquisition unit that acquires an uttered speech signal, which is a signal of uttered speech of a first user;
a speech conversion unit that inputs a feature amount extracted based on the uttered speech signal into a speech conversion model to generate a converted speech signal;
a speech output unit that outputs the converted speech to a second user; and
an emotion recognition unit that generates emotion information of the first user corresponding to the uttered speech signal,
wherein the speech conversion unit generates the converted speech signal based on the emotion information generated by the emotion recognition unit so that the emotion indicated by the emotion information is reflected in the converted speech.