JP2023105607A - Program, information processing device, and information processing method

Info

Publication number: JP2023105607A
Application number: JP2022006544A
Authority: JP
Inventors: 集平加藤; Shuhei Kato
Original assignee: Revcomm; Revcomm Inc
Current assignee: Revcomm; Revcomm Inc
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2023-07-31

Abstract

PROBLEM TO BE SOLVED: To provide an information processing device, an information processing method, and a program that allow a customer to talk with a user in a more appropriate voice in calls between a plurality of users.
SOLUTION: A program causes a computer to conduct a call between a first user and a second user. The program causes a processor of an information processing device to perform a voice acquisition step, a conversion step, an output step, and an attribute acquisition step. The voice acquisition step acquires call voice from the first user. The conversion step converts the call voice acquired in the voice acquisition step. The output step outputs the call voice converted in the conversion step to the second user. The attribute acquisition step acquires call attributes related to the call. The conversion step converts the acquired call voice on the basis of the call attributes acquired in the attribute acquisition step.
SELECTED DRAWING: Figure 12

Description

The present disclosure relates to a program, an information processing device, and an information processing method.

Conventionally, among acoustic devices that users wear mainly on the head, such as earphones and headphones, there are known devices that can suppress environmental sound (so-called noise) from the external environment and thereby enhance the sound insulation effect.
Patent Literature 1 discloses a technique that enables listening to sound in a more suitable manner, without complicated operations, even when the user's state or situation changes from moment to moment.
Patent Literature 2 discloses a technique for automatically determining an appropriate noise-canceling filter.
Patent Literature 3 discloses a speech recognition apparatus that can cope with environments in which conditions change suddenly or in which a plurality of environments appear alternately.

Patent Literature 1: Domestic Re-publication of PCT International Publication No. 2018/61491
Patent Literature 2: JP 2020-86099 A
Patent Literature 3: JP 2000-330587 A

However, in a call between a plurality of users, such as a user and a customer, it has not been possible to realize a call better suited to the user.

The present disclosure has been made to solve the above problem, and its object is to provide a technique that enables a customer, in a call between a plurality of users, to talk with a user in a more suitable voice.

A program for conducting a call between a first user and a second user on a computer comprising a processor and a storage unit, the program causing the processor to execute: a voice acquisition step of acquiring call voice from the first user; a conversion step of converting the call voice acquired in the voice acquisition step; an output step of outputting the call voice converted in the conversion step to the second user; and an attribute acquisition step of acquiring call attributes related to the call, wherein the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the call attributes acquired in the attribute acquisition step.
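The four steps named above can be sketched as a small pipeline. This is only an illustrative sketch, not the disclosed implementation: the `CallAttributes` fields, the function names, the list-of-floats audio representation, and the placeholder gain rule are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List

Samples = List[float]  # hypothetical stand-in for call audio samples


@dataclass(frozen=True)
class CallAttributes:
    """Hypothetical subset of the call attributes."""
    call_category: str  # e.g. "customer_support"
    direction: str      # "inbound" or "outbound"


def acquire_voice(source: Samples) -> Samples:
    """Voice acquisition step: take the first user's call voice."""
    return list(source)


def acquire_attributes(call_category: str, direction: str) -> CallAttributes:
    """Attribute acquisition step: collect attributes of the call."""
    return CallAttributes(call_category, direction)


def convert_voice(voice: Samples, attrs: CallAttributes) -> Samples:
    """Conversion step: the conversion depends on the call attributes.

    Placeholder rule (an assumption): halve the amplitude for support calls.
    """
    gain = 0.5 if attrs.call_category == "customer_support" else 1.0
    return [sample * gain for sample in voice]


def output_voice(voice: Samples, sink: List[Samples]) -> None:
    """Output step: hand the converted voice to the second user's side."""
    sink.append(voice)


def run_call_step(source: Samples, call_category: str, direction: str,
                  sink: List[Samples]) -> Samples:
    """Run acquisition, attribute lookup, conversion, and output in order."""
    voice = acquire_voice(source)
    attrs = acquire_attributes(call_category, direction)
    converted = convert_voice(voice, attrs)
    output_voice(converted, sink)
    return converted
```

The point of the sketch is the data flow: the conversion function receives both the acquired voice and the acquired attributes, so the same input voice can be converted differently for different calls.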

According to the present disclosure, in a call between a plurality of users, the customer can talk with the user in a more suitable voice.

FIG. 1 is a diagram showing the overall configuration of the information processing system 1.
FIG. 2 is a block diagram showing the functional configuration of the server 10.
FIG. 3 is a block diagram showing the functional configuration of the user terminal 20.
FIG. 4 is a block diagram showing the functional configuration of the CRM system 30.
FIG. 5 is a block diagram showing the functional configuration of the customer terminal 50.
FIG. 6 is a diagram showing the data structure of the user table 1012.
FIG. 7 is a diagram showing the data structure of the organization table 1013.
FIG. 8 is a diagram showing the data structure of the call table 1014.
FIG. 9 is a diagram showing the data structure of the voice processing table 1015.
FIG. 10 is a diagram showing the data structure of the learning data set 1031.
FIG. 11 is a diagram showing the data structure of the customer table 3012.
FIG. 12 is a flowchart showing the operation of the voice conversion processing (first embodiment).
FIG. 13 is a flowchart showing the operation of the voice conversion processing (second embodiment).
FIG. 14 is a flowchart showing the operation of the voice conversion processing (third embodiment).
FIG. 15 is a diagram showing an example of a display screen of the user terminal 20 in the voice conversion processing (third embodiment).
FIG. 16 is a block diagram showing the basic hardware configuration of the computer 90.

Embodiments of the present disclosure are described below with reference to the drawings. Throughout the drawings, common components are denoted by the same reference numerals, and repeated descriptions are omitted. The following embodiments do not unduly limit the content of the present disclosure as set out in the claims. Not all components shown in the embodiments are essential components of the present disclosure. Each figure is schematic and is not necessarily drawn precisely.

<Overview of Information Processing System 1>
The information processing system 1 according to the present disclosure provides the call service according to the present disclosure. It provides services related to calls between users and customers, and stores and manages data related to those calls.

<Basic Configuration of Information Processing System 1>
The information processing system 1 comprises a server 10, a plurality of user terminals 20A, 20B, and 20C, a CRM system 30, and a voice server (PBX) 40, which are connected via a network N, and customer terminals 50A, 50B, and 50C, which are connected to the voice server (PBX) 40 via a telephone network T.

FIG. 1 is a diagram showing the overall configuration of the information processing system 1.
FIG. 2 is a block diagram showing the functional configuration of the server 10.
FIG. 3 is a block diagram showing the functional configuration of the user terminal 20.
FIG. 4 is a block diagram showing the functional configuration of the CRM system 30.
FIG. 5 is a block diagram showing the functional configuration of the customer terminal 50.

The server 10 is an information processing device that provides a service for storing and managing data (call data) related to calls between users and customers.

The user terminal 20 is an information processing device operated by a user who uses the service. The user terminal 20 may be, for example, a desktop PC (Personal Computer) or a laptop PC, or a mobile terminal such as a smartphone or a tablet. It may also be a wearable terminal such as an HMD (Head Mounted Display) or a wristwatch-type terminal.

The CRM system 30 is an information processing device managed and operated by a business operator (CRM operator) that provides CRM (Customer Relationship Management) services. Examples of CRM services include Salesforce, HubSpot, Zoho CRM, and kintone.

The voice server (PBX) 40 is an information processing device that functions as an exchange: by connecting the network N and the telephone network T to each other, it enables calls between the user terminal 20 and the customer terminal 50.

The customer terminal 50 is an information processing device that the customer operates when talking with the user. The customer terminal 50 may be, for example, a mobile terminal such as a smartphone or a tablet, or a desktop PC (Personal Computer) or a laptop PC. It may also be a wearable terminal such as an HMD (Head Mounted Display) or a wristwatch-type terminal.

Each information processing device is a computer having an arithmetic unit and a storage device. The basic hardware configuration of the computer, and the basic functional configuration realized by that hardware, are described later. For the server 10, the user terminal 20, the CRM system 30, the voice server (PBX) 40, and the customer terminal 50, descriptions that overlap with the basic hardware configuration and basic functional configuration of the computer described later are omitted.

The configuration and operation of each device are described below.

<Functional Configuration of Server 10>
FIG. 2 shows the functional configuration realized by the hardware configuration of the server 10. The server 10 includes a storage unit 101 and a control unit 104.

<Configuration of Storage Unit of Server 10>
The storage unit 101 of the server 10 holds an application program 1011, a user table 1012, an organization table 1013, a call table 1014, a voice processing table 1015, an evaluation model 1021, a generative model 1022, a voice processing model 1023, and a learning data set 1031.
FIG. 6 is a diagram showing the data structure of the user table 1012.
FIG. 7 is a diagram showing the data structure of the organization table 1013.
FIG. 8 is a diagram showing the data structure of the call table 1014.
FIG. 9 is a diagram showing the data structure of the voice processing table 1015.
FIG. 10 is a diagram showing the data structure of the learning data set 1031.

The user table 1012 stores and manages information on member users (hereinafter, users) who use the service. When a user registers to use the service, the user's information is stored in a new record of the user table 1012, after which the user can use the service according to the present disclosure. The user table 1012 has the user ID as its primary key and has columns for user ID, CRM ID, organization ID, user name, and user attributes.

The user ID is an item that stores user identification information for identifying a user.
The CRM ID is an item that stores identification information for identifying the user in the CRM system 30. By logging in to the CRM system 30 with the CRM ID, the user can receive CRM services. In other words, the user ID in the server 10 is linked to the CRM ID in the CRM system 30.
The organization ID is an item that stores the organization ID of the organization to which the user belongs.
The user name is an item that stores the name of the user.
The user attribute is an item that stores information on attributes of the user, such as age, gender, place of origin, dialect, and job type (sales, customer support, etc.).

The organization table 1013 defines information about the organizations to which users belong. An organization may be any organization or group, such as a company, a corporation, a corporate group, a club, or an association. Organizations may also be defined at the level of finer subgroups, such as company departments (sales, general affairs, customer support). The organization table 1013 has the organization ID as its primary key and has columns for organization ID, organization name, and organization attributes.

The organization ID is an item that stores organization identification information for identifying an organization.
The organization name is an item that stores the name of the organization. It may be any organization or group name, such as a company name, a corporation name, a corporate group name, a club name, or the name of an association.
The organization attribute is an item that stores information on attributes of the organization, such as the organization type (company, corporate group, other association, etc.) and the industry (real estate, finance, etc.).

The call table 1014 stores and manages call data related to calls between users and customers. The call table 1014 has the call ID as its primary key and has columns for call ID, user ID, customer ID, call category, incoming/outgoing type, and voice data.

The call ID is an item that stores call data identification information for identifying call data.
The user ID is an item that stores the user ID (user identification information) of the user in a call between the user and a customer.
The customer ID is an item that stores the customer ID (customer identification information) of the customer in a call between a user and the customer.
The call category is an item that stores the type (category) of the call made between the user and the customer. Call data is classified by call category. Depending on the purpose of the call, the call category stores a value such as telephone operator, telemarketing, customer support, or technical support.
The incoming/outgoing type is an item that stores information for distinguishing whether a call between the user and the customer was placed by the user (outbound) or received by the user (inbound).
The voice data is an item that stores the voice data of the call made between the user and the customer. Various audio data formats, such as mp4 and wav, can be used. Alternatively, the item may store reference information (a path) to an audio data file located elsewhere.
The voice data may be in a format in which identifiers are set so that the user's voice and the customer's voice can be identified independently. In this case, the control unit 104 of the server 10 can execute independent analysis processing on the user's voice and on the customer's voice.
In the present disclosure, video data that includes audio information may be used instead of voice data. Voice data in the present disclosure is a concept that also covers audio data contained in video data.
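As a rough illustration of the three tables described so far, the following sketch lays them out as an in-memory SQLite schema. The column names, the TEXT types, the `direction` values, and storing voice data as a file path are assumptions chosen for the example; the source only names the columns and primary keys.

```python
import sqlite3

# Hypothetical schema mirroring the user, organization, and call tables.
SCHEMA = """
CREATE TABLE organizations (
    organization_id         TEXT PRIMARY KEY,
    organization_name       TEXT,
    organization_attributes TEXT
);
CREATE TABLE users (
    user_id         TEXT PRIMARY KEY,
    crm_id          TEXT,
    organization_id TEXT REFERENCES organizations(organization_id),
    user_name       TEXT,
    user_attributes TEXT
);
CREATE TABLE calls (
    call_id         TEXT PRIMARY KEY,
    user_id         TEXT REFERENCES users(user_id),
    customer_id     TEXT,
    call_category   TEXT,
    direction       TEXT CHECK (direction IN ('inbound', 'outbound')),
    voice_data_path TEXT  -- reference (path) to an audio file stored elsewhere
);
"""


def create_database() -> sqlite3.Connection:
    """Create an in-memory database holding the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def calls_with_user_names(conn: sqlite3.Connection) -> list:
    """Join each call with the name of the user who handled it."""
    cursor = conn.execute(
        "SELECT c.call_id, u.user_name, c.call_category, c.direction "
        "FROM calls AS c JOIN users AS u ON u.user_id = c.user_id "
        "ORDER BY c.call_id"
    )
    return cursor.fetchall()
```

The customer ID is left as a plain column because the customer table 3012 lives in the CRM system 30 rather than in the server 10.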

The voice processing table 1015 stores information (voice processing information) on voice processing, such as effects and filters, to be applied to voice data.
The voice processing table 1015 has the voice processing ID as its primary key and has columns for voice processing ID and voice processing content.

The voice processing ID is an item that stores voice processing identification information for identifying voice processing content.
The voice processing content is an item that stores the voice processing to be applied to voice data. It may instead store a reference to a function, method, program, or the like, located elsewhere, that performs the voice processing.
The voice processing content includes voice conversion processing that changes the voice in the voice data. Voice conversion includes converting voice data into a male voice, a female voice, the voice of a specific person, or the voice of a specific character. Voice conversion includes converting voice data into a voice expressing a specific emotion (joy, sadness, anger, surprise, fear, disgust). Voice conversion includes processing that changes the shape of the per-frequency intensity contained in the voice data (the spectral structure or frequency distribution of the voice). Voice conversion includes processing that changes the fundamental frequency, the strength of intonation, or the speech rate, or that increases or decreases intonation. Voice conversion includes processing that removes fillers (hesitations such as "er" and "um") contained in the speech.
The voice conversion processing may be configured to include processing that converts the human voice component contained in the voice data, and not to include processing that converts other audio components such as background noise.
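One way to picture the voice processing table is as a lookup from a voice processing ID to a callable that performs the processing. The sketch below is an assumption-laden illustration: the IDs `VP001` and `VP002`, the list-of-floats audio representation, and the two toy effects are invented here, and the naive resampling changes pitch along with speed, which a real speech-rate conversion would avoid.

```python
from typing import Callable, Dict, List

Samples = List[float]  # hypothetical stand-in for decoded audio samples


def change_gain(samples: Samples, factor: float) -> Samples:
    """Scale the amplitude of every sample by the given factor."""
    return [sample * factor for sample in samples]


def change_speed(samples: Samples, rate: float) -> Samples:
    """Naive speed change by nearest-neighbour index resampling.

    This also shifts pitch; it only illustrates a table entry.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    output_length = int(len(samples) / rate)
    last_index = len(samples) - 1
    return [samples[min(int(i * rate), last_index)] for i in range(output_length)]


# Hypothetical voice processing table: ID -> reference to a processing function.
VOICE_PROCESSING_TABLE: Dict[str, Callable[[Samples], Samples]] = {
    "VP001": lambda samples: change_gain(samples, 0.5),
    "VP002": lambda samples: change_speed(samples, 2.0),
}


def apply_voice_processing(processing_id: str, samples: Samples) -> Samples:
    """Look up the processing by ID and apply it to the voice data."""
    try:
        process = VOICE_PROCESSING_TABLE[processing_id]
    except KeyError:
        raise ValueError(f"unknown voice processing ID: {processing_id}") from None
    return process(samples)
```

Storing a callable per ID corresponds to the option of keeping a reference to a function or program rather than the processing content itself.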

The evaluation model 1021 is a learning model that takes call attributes (such as user attributes, customer attributes, call category, and incoming/outgoing type) and voice data as input data, and outputs (infers) an evaluation index value.
The evaluation model 1021 need not be a single learning model. It may be realized by switching among a plurality of independent learning models, one for each type of evaluation index to be output (evaluation type). For example, the evaluation model 1021 includes a plurality of different, independent learning models corresponding to the type of evaluation index to be output (first index, second index, and so on).
One example of an evaluation model is the SIIB (Speech intelligibility in bits) model (hereinafter, the first model). Applying voice data to the first model yields an SIIB score (hereinafter, the first index). The SIIB score is described in, for example, arXiv:2104.08499, and is a quantitative evaluation index, computed from voice data as input, of how easy the speech is for the listener to hear (perceive).
Evaluation models may also be prepared for evaluation indices such as HASPI (Hearing-Aid Speech Perception Index), ESTOI (Extended Short-Time Objective Intelligibility), PESQ (Perceptual Evaluation of Speech Quality), and ViSQOL (Virtual Speech Quality Objective Listener).
In addition, any machine learning, deep learning, or artificial intelligence model may be built using call attributes and voice data as input data and, in particular, evaluation results such as listener questionnaires as teacher data. Such evaluation results may include items such as reliability, credibility, pleasantness, comfort, preference, stress level, degree of intimidation, and interest. In other words, the learning model may take call attributes and voice data as input data and output evaluation indices on the listener side, such as reliability, credibility, pleasantness, comfort, preference, stress level, degree of intimidation, and interest.

The generative model 1022 is a learning model that takes call attributes and voice data as input data and outputs (infers) converted voice data.
The learning process of the generative model 1022 is described later.
The generative model 1022 need not be a single learning model. It may be realized by switching among a plurality of independent learning models for each call attribute and evaluation type. Specifically, the generative model 1022 may include a plurality of independent learning models, one for each combination of input call attributes and output evaluation type.
For example, the generative model 1022 may be realized by selectively switching among a plurality of generative models, each of which outputs converted voice data suited to a particular combination of call attributes.
For example, the generative model 1022 may be realized by selectively switching among a first generative model, a second generative model, and so on, each of which outputs converted voice data suited to one of a plurality of evaluation indices such as the first index and the second index.
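The switching described above can be sketched as a registry keyed by call attributes and evaluation type. Everything concrete here is hypothetical: the key is reduced to a (call category, evaluation type) pair, the "models" are plain functions standing in for trained generative models, and the fallback to a default model is an assumption rather than something the source specifies.

```python
from typing import Callable, Dict, List, Tuple

Samples = List[float]               # hypothetical stand-in for voice data
Model = Callable[[Samples], Samples]
ModelKey = Tuple[str, str]          # (call category, evaluation type)


def identity_model(voice: Samples) -> Samples:
    """Placeholder model that returns the voice unchanged."""
    return list(voice)


def soften_model(voice: Samples) -> Samples:
    """Placeholder model that halves the amplitude."""
    return [sample * 0.5 for sample in voice]


class GenerativeModelSelector:
    """Selects one of several independent models per call."""

    def __init__(self, default: Model) -> None:
        self._default = default
        self._models: Dict[ModelKey, Model] = {}

    def register(self, call_category: str, evaluation_type: str, model: Model) -> None:
        """Associate a model with a call category and evaluation type."""
        self._models[(call_category, evaluation_type)] = model

    def select(self, call_category: str, evaluation_type: str) -> Model:
        """Return the registered model, or the default if none matches."""
        return self._models.get((call_category, evaluation_type), self._default)

    def convert(self, voice: Samples, call_category: str, evaluation_type: str) -> Samples:
        """Convert the voice with the model selected for this call."""
        return self.select(call_category, evaluation_type)(voice)
```

A caller would register, for instance, one model for customer support calls optimized for the first index, and let all other combinations fall through to the default.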

The voice processing model 1023 is a learning model that takes call attributes and voice data as input data and outputs (infers) a voice processing ID.
The learning process of the voice processing model 1023 is described later.
The voice processing model 1023 need not be a single learning model. It may be realized by switching among a plurality of independent learning models for each call attribute and evaluation type. Specifically, the voice processing model 1023 may include a plurality of independent learning models, one for each combination of input call attributes and output evaluation type.
For example, the voice processing model 1023 may be realized by selectively switching among a plurality of voice processing models, each of which outputs converted voice data suited to a particular combination of call attributes.
For example, the voice processing model 1023 may be realized by selectively switching among a first voice processing model, a second voice processing model, and so on, each of which outputs converted voice data suited to one of a plurality of evaluation indices such as the first index and the second index.

The evaluation model 1021, the generative model 1022, and the voice processing model 1023 are each, for example, a kind of machine learning, artificial intelligence, or deep learning model.
As an example of the evaluation model 1021, the generative model 1022, and the voice processing model 1023, a deep learning model based on a deep neural network is described. The deep learning model may be any deep learning model that takes time-series data as input, such as an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory), or a GRU (Gated Recurrent Unit). The learning model may be any deep learning model, including, for example, Attention and Transformer models.
The evaluation model 1021, the generative model 1022, and the voice processing model 1023 need not be deep learning models and may be any machine learning or artificial intelligence model.

The learning data set 1031 is a table that stores the data set used in the learning process of the generative model 1022. It is a data set for learning processes such as machine learning and deep learning, in which attribute information about the user, attribute information about the customer, and attribute information about the call are stored in association with one another for each piece of call data.
The learning data set 1031 may be created by combining the call table 1014, which stores past call data between users and customers, with the user table 1012, the organization table 1013, the customer table 3012, and the like.
The learning data set 1031 is a table having columns for attribute information about the user, attribute information about the customer, attribute information about the call, voice data, first index, second index, and third index.

The attribute information about the user is an item that stores, for the call data, the user attributes of the user and the organization name or organization attributes of the organization to which the user belongs. It may include information (emotion information) about the user's emotions (joy, sadness, anger, surprise, fear, disgust) in the call data.
The attribute information about the customer is an item that stores, for the call data, the customer attributes of the customer and the organization name or organization attributes of the organization to which the customer belongs. It may include the customer's emotion information in the call data.
The attribute information about the call is an item that stores, for the call data, the call category and the caller/receiver type. It may include the emotion information of the user and the customer in the call data.
The voice data is an item that stores the voice data of the call made between the user and the customer. It is the same as the voice data in the call table 1014, so its description is omitted.
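Combining the tables into learning records can be sketched as a simple join over in-memory dictionaries. The record layout, the dictionary keys, and the `index_scores` mapping (standing in for values that the evaluation model would supply) are assumptions made for illustration; skipping calls with missing user or customer records is likewise a design choice of this sketch.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def build_learning_dataset(
    calls: List[Record],
    users: Dict[str, Record],
    customers: Dict[str, Record],
    index_scores: Dict[str, Record],
) -> List[Record]:
    """Join past calls with user and customer attributes and index values."""
    dataset: List[Record] = []
    for call in calls:
        user = users.get(call["user_id"])
        customer = customers.get(call["customer_id"])
        if user is None or customer is None:
            continue  # skip calls whose participants cannot be resolved
        scores = index_scores.get(call["call_id"], {})
        dataset.append({
            "user_attributes": user["attributes"],
            "customer_attributes": customer["attributes"],
            "call_attributes": {
                "call_category": call["call_category"],
                "direction": call["direction"],
            },
            "voice_data": call["voice_data"],
            "first_index": scores.get("first_index"),
            "second_index": scores.get("second_index"),
            "third_index": scores.get("third_index"),
        })
    return dataset
```

Index values that have not been computed for a call are left as `None` so that later steps can decide whether to fill or drop them.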

<Configuration of Control Unit of Server 10>
The control unit 104 of the server 10 includes a user registration control unit 1041, a voice conversion unit 1042, and a learning unit 1051. Each functional unit is realized by the control unit 104 executing the application program 1011 stored in the storage unit 101.

ユーザ登録制御部１０４１は、本開示に係るサービスの利用を希望するユーザの情報をユーザテーブル１０１２に記憶する処理を行う。
ユーザテーブル１０１２に記憶される情報は、ユーザが任意の情報処理端末からサービス提供者が運営するウェブページなどを開き、所定の入力フォームに入力しサーバ１０へ送信する。サーバ１０のユーザ登録制御部１０４１は、受信した情報をユーザテーブル１０１２の新しいレコードに記憶し、ユーザ登録が完了する。これにより、ユーザテーブル１０１２に記憶されたユーザはサービスを利用することができるようになる。
ユーザ登録制御部１０４１によるユーザ情報のユーザテーブル１０１２への登録に先立ち、サービス提供者は所定の審査を行いユーザによるサービス利用可否を制限しても良い。
ユーザＩＤは、ユーザを識別できる任意の文字列または数字で良く、ユーザが希望する任意の文字列または数字、もしくはサーバ１０のユーザ登録制御部１０４１が自動的に任意の文字列または数字を設定しても良い。 The user registration control unit 1041 performs processing for storing information of users who wish to use the service according to the present disclosure in the user table 1012 .
The information stored in the user table 1012 is transmitted to the server 10 by the user opening a web page or the like operated by the service provider from any information processing terminal, inputting the information into a predetermined input form. The user registration control unit 1041 of the server 10 stores the received information in a new record of the user table 1012, and user registration is completed. As a result, the users stored in the user table 1012 can use the service.
Prior to registration of user information in the user table 1012 by the user registration control unit 1041, the service provider may perform a predetermined examination to limit whether the user can use the service.
The user ID may be any character string or number that can identify the user. It may be a character string or number chosen by the user, or one set automatically by the user registration control unit 1041 of the server 10.
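The two ways of assigning a user ID described above (chosen by the user, or set automatically by the server) can be sketched as follows. This is a minimal illustration only: the function name `assign_user_id`, the use of a random hexadecimal token, and the explicit uniqueness check are assumptions and are not taken from the disclosure.

```python
import secrets
from typing import Optional, Set


def assign_user_id(existing_ids: Set[str], requested: Optional[str] = None) -> str:
    """Return a user ID that is unique within existing_ids (hypothetical helper)."""
    if requested is not None:
        # The user supplied a preferred ID; reject it if it is already in use.
        if requested in existing_ids:
            raise ValueError(f"user ID already taken: {requested}")
        user_id = requested
    else:
        # No preference given: generate random IDs until an unused one is found.
        user_id = secrets.token_hex(8)
        while user_id in existing_ids:
            user_id = secrets.token_hex(8)
    existing_ids.add(user_id)
    return user_id
```

The set `existing_ids` stands in for the user IDs already stored in the user table 1012.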

音声変換部１０４２は、音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）、音声変換処理（第四実施例）を実行する。詳細は後述する。
学習部１０５１は、学習処理を実行する。詳細は後述する。 The voice conversion unit 1042 executes voice conversion processing (first example), voice conversion processing (second example), voice conversion processing (third example), and voice conversion processing (fourth example). Details will be described later.
The learning unit 1051 executes learning processing. Details will be described later.

＜ユーザ端末２０の機能構成＞
ユーザ端末２０のハードウェア構成が実現する機能構成を図３に示す。ユーザ端末２０は、記憶部２０１、制御部２０４、ユーザ端末２０に接続された入力装置２０６、出力装置２０８を備える。入力装置２０６は、カメラ２０６１、マイク２０６２、位置情報センサ２０６３、モーションセンサ２０６４、キーボード２０６５、マウス２０６６を含む。出力装置２０８は、ディスプレイ２０８１、スピーカ２０８２を含む。 <Functional Configuration of User Terminal 20>
A functional configuration realized by the hardware configuration of the user terminal 20 is shown in FIG. 3. The user terminal 20 includes a storage unit 201, a control unit 204, an input device 206 connected to the user terminal 20, and an output device 208. The input device 206 includes a camera 2061, a microphone 2062, a position information sensor 2063, a motion sensor 2064, a keyboard 2065, and a mouse 2066. The output device 208 includes a display 2081 and a speaker 2082.

＜ユーザ端末２０の記憶部の構成＞
ユーザ端末２０の記憶部２０１は、ユーザ端末２０を利用するユーザを識別するためのユーザＩＤ２０１１、アプリケーションプログラム２０１２、ＣＲＭＩＤ２０１３を記憶する。
ユーザＩＤは、サーバ１０に対するユーザのアカウントＩＤである。ユーザは、ユーザ端末２０からユーザＩＤ２０１１を、サーバ１０へ送信する。サーバ１０は、ユーザＩＤ２０１１に基づきユーザを識別し、本開示にかかるサービスをユーザに対して提供する。なお、ユーザＩＤには、ユーザ端末２０を利用しているユーザを識別するにあたりサーバ１０から一時的に付与されるセッションＩＤなどの情報を含む。
ＣＲＭＩＤは、ＣＲＭシステム３０に対するユーザのアカウントＩＤである。ユーザは、ユーザ端末２０からＣＲＭＩＤ２０１３を、ＣＲＭシステム３０へ送信する。ＣＲＭシステム３０は、ＣＲＭＩＤ２０１３に基づきユーザを識別し、ＣＲＭサービスをユーザに対して提供する。なお、ＣＲＭＩＤ２０１３には、ユーザ端末２０を利用しているユーザを識別するにあたりＣＲＭシステム３０から一時的に付与されるセッションＩＤなどの情報を含む。
アプリケーションプログラム２０１２は、記憶部２０１に予め記憶されていても良いし、通信ＩＦを介してサービス提供者が運営するウェブサーバ等からダウンロードする構成としても良い。アプリケーションプログラム２０１２は、ユーザ端末２０に記憶されているウェブブラウザアプリケーション上で実行されるＪａｖａＳｃｒｉｐｔ（登録商標）などのインタープリター型プログラミング言語を含む。 <Configuration of Storage Unit of User Terminal 20>
The storage unit 201 of the user terminal 20 stores a user ID 2011 for identifying a user who uses the user terminal 20, an application program 2012, and a CRM ID 2013.
The user ID is the user's account ID for the server 10 . The user transmits the user ID 2011 from the user terminal 20 to the server 10 . The server 10 identifies the user based on the user ID 2011 and provides the user with the service according to the present disclosure. The user ID includes information such as a session ID temporarily assigned by the server 10 to identify the user using the user terminal 20 .
CRMID is the user's account ID for the CRM system 30 . The user transmits the CRMID 2013 from the user terminal 20 to the CRM system 30 . The CRM system 30 identifies the user based on the CRMID 2013 and provides the CRM service to the user. The CRMID 2013 includes information such as a session ID temporarily assigned by the CRM system 30 to identify the user using the user terminal 20 .
The application program 2012 may be stored in the storage unit 201 in advance, or may be downloaded from a web server or the like operated by the service provider via the communication IF. Application program 2012 includes an interpreted programming language such as JavaScript (registered trademark) that runs on a web browser application stored in user terminal 20 .

＜ユーザ端末２０の制御部の構成＞
ユーザ端末２０の制御部２０４は、入力制御部２０４１および出力制御部２０４２を備える。制御部２０４は、記憶部２０１に記憶されたアプリケーションプログラム２０１２を実行することにより、入力制御部２０４１、出力制御部２０４２の機能ユニットが実現される。
ユーザ端末２０の入力制御部２０４１は、ユーザ端末２０に接続されたカメラ２０６１、マイク２０６２、位置情報センサ２０６３、モーションセンサ２０６４、キーボード２０６５、マウス２０６６などの入力装置から出力される情報を取得し各種処理を実行する。ユーザ端末２０の入力制御部２０４１は、入力装置２０６から取得した情報をユーザＩＤ２０１１とともにサーバ１０へ送信する処理を実行する。同様に、ユーザ端末２０の入力制御部２０４１は、入力装置２０６から取得した情報をＣＲＭＩＤ２０１３とともにＣＲＭシステム３０へ送信する処理を実行する。
ユーザ端末２０の出力制御部２０４２は、入力装置２０６に対するユーザによる操作およびサーバ１０、ＣＲＭシステム３０から情報を受信し、ユーザ端末２０に接続されたディスプレイ２０８１の表示内容、スピーカ２０８２の音声出力内容の制御処理を実行する。 <Configuration of Control Unit of User Terminal 20>
The control unit 204 of the user terminal 20 has an input control unit 2041 and an output control unit 2042 . By executing the application program 2012 stored in the storage unit 201 , the control unit 204 implements functional units of an input control unit 2041 and an output control unit 2042 .
The input control unit 2041 of the user terminal 20 acquires information output from input devices connected to the user terminal 20, such as the camera 2061, the microphone 2062, the position information sensor 2063, the motion sensor 2064, the keyboard 2065, and the mouse 2066, and executes various processes. The input control unit 2041 of the user terminal 20 executes a process of transmitting information acquired from the input device 206 to the server 10 together with the user ID 2011. Similarly, the input control unit 2041 of the user terminal 20 executes a process of transmitting information acquired from the input device 206 to the CRM system 30 together with the CRMID 2013.
The output control unit 2042 of the user terminal 20 receives the user's operations on the input device 206 and information from the server 10 and the CRM system 30, and executes control processing for the display content of the display 2081 connected to the user terminal 20 and the audio output content of the speaker 2082.

＜ＣＲＭシステム３０の機能構成＞
ＣＲＭシステム３０のハードウェア構成が実現する機能構成を図４に示す。ＣＲＭシステム３０は、記憶部３０１、制御部３０４を備える。
ユーザは、別途ＣＲＭ事業者とも契約を締結しており、ユーザごとに設定されたＣＲＭＩＤ２０１３を用いてＣＲＭ事業者が運営するウェブサイトへウェブブラウザなどを介してアクセス（ログイン）することにより、ＣＲＭサービスの提供を受ける事ができる。 <Functional Configuration of CRM System 30>
A functional configuration realized by the hardware configuration of the CRM system 30 is shown in FIG. 4. The CRM system 30 has a storage unit 301 and a control unit 304.
The user has separately concluded a contract with a CRM provider, and can receive the CRM service by accessing (logging in to) the website operated by the CRM provider via a web browser or the like, using the CRMID 2013 set for each user.

＜ＣＲＭシステム３０の記憶部の構成＞
ＣＲＭシステム３０の記憶部３０１は、顧客テーブル３０１２、を備える。
図１１は、顧客テーブル３０１２のデータ構造を示す図である。 <Configuration of Storage Unit of CRM System 30>
The storage unit 301 of the CRM system 30 has a customer table 3012 .
FIG. 11 is a diagram showing the data structure of the customer table 3012.

顧客テーブル３０１２は、顧客情報を記憶し管理するためのテーブルである。顧客テーブル３０１２は、顧客ＩＤを主キーとし、顧客ＩＤ、ユーザＩＤ、氏名、電話番号、顧客属性、顧客組織名、顧客組織属性のカラムを有するテーブルである。 The customer table 3012 is a table for storing and managing customer information. The customer table 3012 is a table having customer ID as a primary key and columns of customer ID, user ID, name, telephone number, customer attribute, customer organization name, and customer organization attribute.

顧客ＩＤは、顧客を識別するための顧客識別情報を記憶する項目である。
ユーザＩＤは、顧客に紐付けられたユーザのユーザＩＤ（ユーザ識別情報）を記憶する項目である。ユーザは、自身のユーザＩＤに紐付けられた顧客を一覧表示したり、顧客に対して発信（架電）することができる。
本開示において、顧客はユーザに対して紐付けられるものとしたが、組織（組織テーブル１０１３の組織ＩＤ）に対して紐付けても良い。その場合、組織に所属するユーザは、自身の組織ＩＤに紐付けられた顧客を一覧表示したり、顧客に対して発信することができる。
氏名は、顧客の氏名を記憶する項目である。
電話番号は、顧客の電話番号を記憶する項目である。
ユーザは、ＣＲＭシステムが提供するウェブサイトにアクセスし、電話を発信したい顧客を選択し「発信」などの所定の操作を行なうことにより、ユーザ端末２０から顧客の電話番号に対して電話を発信することができる。
顧客属性は、顧客の年齢、性別、出身地、方言、職種（営業、カスタマーサポートなど）などの顧客の属性に関する情報を記憶する項目である。
顧客組織名は、顧客の所属する組織の名称を記憶する項目である。組織の名称は、会社名、法人名、企業グループ名、サークル名、各種団体名など任意の組織名、グループ名を含む。
顧客組織属性は、顧客の組織種別（会社、企業グループ、その他団体など）、業種（不動産、金融など）などの組織の属性に関する情報を記憶する項目である。
顧客属性、顧客組織名、顧客組織属性は、ユーザが入力することにより記憶する構成としても良いし、所定のウェブサイトへ顧客がアクセスすることにより、顧客に入力させても良い。 The customer ID is an item that stores customer identification information for identifying the customer.
The user ID is an item that stores the user ID (user identification information) of the user associated with the customer. The user can display a list of the customers associated with his/her own user ID and can place calls to those customers.
In the present disclosure, the customer is associated with the user, but may instead be associated with an organization (organization ID of the organization table 1013). In that case, a user belonging to the organization can display a list of the customers associated with his/her own organization ID and can place calls to those customers.
The name is an item for storing the customer's name.
The phone number is an item that stores the customer's phone number.
The user can place a call from the user terminal 20 to a customer's telephone number by accessing a website provided by the CRM system, selecting the customer to call, and performing a predetermined operation such as "Call".
The customer attribute is an item that stores information related to customer attributes such as customer age, gender, hometown, dialect, occupation (sales, customer support, etc.).
The customer organization name is an item that stores the name of the organization to which the customer belongs. The name of the organization includes arbitrary organization names and group names such as company names, corporate names, corporate group names, circle names, and various organization names.
The customer organization attribute is an item that stores information related to organization attributes such as the customer's organization type (company, corporate group, other organization, etc.) and type of industry (real estate, finance, etc.).
The customer attribute, the customer organization name, and the customer organization attribute may be stored by inputting them by the user, or may be input by the customer when the customer accesses a predetermined website.
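The column layout of the customer table 3012 described above can be sketched as a simple record type with an in-memory table. The column set follows the description; the class names `CustomerRecord` and `CustomerTable`, the method names, and the use of dictionaries for the attribute columns are illustrative assumptions, not the actual storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CustomerRecord:
    """One row of the customer table; customer_id is the primary key."""
    customer_id: str
    user_id: str          # user associated with this customer
    name: str
    phone_number: str
    customer_attribute: Dict[str, str] = field(default_factory=dict)
    customer_org_name: str = ""
    customer_org_attribute: Dict[str, str] = field(default_factory=dict)


class CustomerTable:
    """In-memory stand-in for the customer table, keyed by customer ID."""

    def __init__(self) -> None:
        self._rows: Dict[str, CustomerRecord] = {}

    def upsert(self, row: CustomerRecord) -> None:
        # Insert a new row or overwrite the row with the same primary key.
        self._rows[row.customer_id] = row

    def list_for_user(self, user_id: str) -> List[CustomerRecord]:
        # Customers a given user may list and call.
        return [r for r in self._rows.values() if r.user_id == user_id]

    def find_by_phone(self, phone_number: str) -> Optional[CustomerRecord]:
        # Look up a customer by telephone number; None if no row matches.
        for r in self._rows.values():
            if r.phone_number == phone_number:
                return r
        return None
```

A lookup by telephone number is included because the description later retrieves the customer ID from the customer table using the telephone number.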

＜ＣＲＭシステム３０の制御部の構成＞
ＣＲＭシステム３０の制御部３０４は、ユーザ登録制御部３０４１を備える。制御部３０４は、記憶部３０１に記憶されたアプリケーションプログラム３０１１を実行することにより、各機能ユニットが実現される。 <Configuration of Control Unit of CRM System 30>
The control section 304 of the CRM system 30 has a user registration control section 3041 . Control unit 304 implements each functional unit by executing application program 3011 stored in storage unit 301 .

ＣＲＭシステム３０は、ＡＰＩ（ＡｐｐｌｉｃａｔｉｏｎＰｒｏｇｒａｍｍｉｎｇＩｎｔｅｒｆａｃｅ）、ＳＤＫ（ＳｏｆｔｗａｒｅＤｅｖｅｌｏｐｍｅｎｔＫｉｔ）、コードスニペッド（以下、「ビーコン」と呼ぶ）と呼ばれる機能を提供しており、ユーザは予め本開示にかかるサーバ１０およびＣＲＭシステム３０についてアカウント情報などの紐付け設定を行うことにより、サーバ１０の制御部１０４とＣＲＭシステム３０の制御部３０４は相互に通信し、任意の情報処理を実現することができる。 The CRM system 30 provides functions called an API (Application Programming Interface), an SDK (Software Development Kit), and code snippets (hereinafter referred to as "beacons"). When the user configures a link in advance, such as account information, between the server 10 according to the present disclosure and the CRM system 30, the control unit 104 of the server 10 and the control unit 304 of the CRM system 30 can communicate with each other to realize arbitrary information processing.

＜音声サーバ（ＰＢＸ）４０の概要＞
音声サーバ（ＰＢＸ）４０は、ユーザから顧客に対する発信があった場合に、顧客端末５０に対し発信（呼出し）を行う。
音声サーバ（ＰＢＸ）４０は、顧客からユーザに対する発信があった場合に、ユーザ端末２０に対し、その旨を示すメッセージ（以下、「着信通知メッセージ」と呼ぶ）を送る。また、音声サーバ（ＰＢＸ）４０は、サーバ１０が提供するビーコン、ＳＤＫ、ＡＰＩなどに着信通知メッセージを送ることができる。 <Outline of voice server (PBX) 40>
The voice server (PBX) 40 makes a call to the customer terminal 50 when the user makes a call to the customer.
The voice server (PBX) 40 sends a message (hereinafter referred to as an "incoming call notification message") to the user terminal 20 to indicate that the customer has made a call to the user. Also, the voice server (PBX) 40 can send an incoming call notification message to a beacon, SDK, API, etc. provided by the server 10 .

＜顧客端末５０の機能構成＞
顧客端末５０のハードウェア構成が実現する機能構成を図５に示す。顧客端末５０は、記憶部５０１、制御部５０４、タッチパネル５０６、タッチセンシティブデバイス５０６１、ディスプレイ５０６２、マイク５０８１、スピーカ５０８２、位置情報センサ５０８３、カメラ５０８４、モーションセンサ５０８５を備える。 <Functional Configuration of Customer Terminal 50>
FIG. 5 shows a functional configuration realized by the hardware configuration of the customer terminal 50. The customer terminal 50 includes a storage unit 501, a control unit 504, a touch panel 506, a touch sensitive device 5061, a display 5062, a microphone 5081, a speaker 5082, a position information sensor 5083, a camera 5084, and a motion sensor 5085.

＜顧客端末５０の記憶部の構成＞
顧客端末５０の記憶部５０１は、顧客端末５０を利用する顧客の電話番号５０１１、アプリケーションプログラム５０１２を記憶する。
アプリケーションプログラム５０１２は、記憶部５０１に予め記憶されていても良いし、通信ＩＦを介してサービス提供者が運営するウェブサーバ等からダウンロードする構成としても良い。アプリケーションプログラム５０１２は、顧客端末５０に記憶されているウェブブラウザアプリケーション上で実行されるＪａｖａＳｃｒｉｐｔ（登録商標）などのインタープリター型プログラミング言語を含む。 <Configuration of Storage Unit of Customer Terminal 50>
The storage unit 501 of the customer terminal 50 stores telephone numbers 5011 and application programs 5012 of customers who use the customer terminal 50 .
The application program 5012 may be stored in advance in the storage unit 501, or may be downloaded from a web server or the like operated by the service provider via the communication IF. The application program 5012 includes an interpreted programming language such as JavaScript (registered trademark) executed on a web browser application stored in the customer terminal 50 .

＜顧客端末５０の制御部の構成＞
顧客端末５０の制御部５０４は、入力制御部５０４１および出力制御部５０４２を備える。制御部５０４は、記憶部５０１に記憶されたアプリケーションプログラム５０１２を実行することにより、入力制御部５０４１、出力制御部５０４２の機能ユニットが実現される。
顧客端末５０の入力制御部５０４１は、ユーザによるタッチパネル５０６のタッチセンシティブデバイス５０６１への操作内容、マイク５０８１への音声入力、位置情報センサ５０８３、カメラ５０８４、モーションセンサ５０８５などの入力装置から出力される情報を取得し各種処理を実行する。
顧客端末５０の出力制御部５０４２は、入力装置に対するユーザによる操作およびサーバ１０から情報を受信し、ディスプレイ５０６２の表示内容、スピーカ５０８２の音声出力内容などの制御処理を実行する。 <Configuration of Control Unit of Customer Terminal 50>
The control section 504 of the customer terminal 50 has an input control section 5041 and an output control section 5042 . By executing the application program 5012 stored in the storage unit 501 , the control unit 504 realizes the functional units of the input control unit 5041 and the output control unit 5042 .
The input control unit 5041 of the customer terminal 50 acquires the content of the user's operations on the touch sensitive device 5061 of the touch panel 506, voice input to the microphone 5081, and information output from input devices such as the position information sensor 5083, the camera 5084, and the motion sensor 5085, and executes various processes.
The output control unit 5042 of the customer terminal 50 receives the user's operations on the input devices and information from the server 10, and executes control processing for the display content of the display 5062, the audio output content of the speaker 5082, and the like.

＜情報処理システム１の動作＞
以下、情報処理システム１の各処理について説明する。
図１２は、音声変換処理（第一実施例）の動作を示すフローチャートである。
図１３は、音声変換処理（第二実施例）の動作を示すフローチャートである。
図１４は、音声変換処理（第三実施例）の動作を示すフローチャートである。
図１５は、音声変換処理（第三実施例）におけるユーザ端末２０の表示画面例を示した図である。 <Operation of information processing system 1>
Each process of the information processing system 1 will be described below.
FIG. 12 is a flow chart showing the operation of voice conversion processing (first embodiment).
FIG. 13 is a flow chart showing the operation of voice conversion processing (second embodiment).
FIG. 14 is a flow chart showing the operation of voice conversion processing (third embodiment).
FIG. 15 is a diagram showing an example of the display screen of the user terminal 20 in the voice conversion process (third embodiment).

＜用語定義＞
情報処理システム１の各処理について説明するにあたり、用語を以下の通り定義する。
通話データは、ユーザと顧客との間で行われる通話に関するデータであり、通話テーブル１０１４の各項目に記憶されたデータを含むデータである。
通話属性は、ユーザと顧客との間で行われる通話の属性に関するデータであり、ユーザ属性、ユーザの所属する組織の組織名または組織属性、通話におけるユーザの感情に関する情報（ユーザに関する属性情報）、顧客属性、顧客の所属する組織の組織名または組織属性、通話における顧客の感情に関する情報（顧客に関する属性情報）、通話カテゴリ、受発信者種別、通話における感情に関する情報（通話に関する属性情報）などを含む。つまり、通話データは、ユーザに関する属性情報、顧客に関する属性情報、通話に関する属性情報などの通話属性により特徴づけられることになる。
本開示における通話属性は、ユーザ個人および顧客個人に関する属性情報を含み、ユーザおよび顧客の周辺環境、通話環境に関する情報は含まない。例えば、ユーザおよび顧客周辺のノイズ、騒音状況に関する情報は含まない。 <Definition of terms>
In describing each process of the information processing system 1, terms are defined as follows.
The call data is data relating to calls made between the user and the customer, and is data including data stored in each item of the call table 1014 .
The call attribute is data related to the attributes of the call made between the user and the customer, and includes the user attribute, the organization name or organization attribute of the organization to which the user belongs, information on the user's emotion in the call (attribute information on the user), Customer attributes, organization name or organization attributes of the organization to which the customer belongs, information on the customer's emotions in the call (attribute information on the customer), call category, type of recipient and caller, information on the emotion in the call (attribute information on the call), etc. include. In other words, call data is characterized by call attributes such as user attribute information, customer attribute information, and call attribute information.
Call attributes in the present disclosure include attribute information about individual users and customers, and do not include information about the surrounding environment of users and customers and the call environment. For example, it does not include information about noise and noise conditions around the user and customer.
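The definition of call attributes above can be sketched as a record with three groups of fields (user, customer, call) that deliberately has no field for the surrounding environment. The class name `CallAttributes`, the field names, and the helper `build_call_attributes` are hypothetical; the sketch only illustrates that environment information such as ambient noise is outside the call attributes.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class CallAttributes:
    # Attribute information about the user
    user_attribute: Optional[str] = None
    user_org: Optional[str] = None
    user_emotion: Optional[str] = None
    # Attribute information about the customer
    customer_attribute: Optional[str] = None
    customer_org: Optional[str] = None
    customer_emotion: Optional[str] = None
    # Attribute information about the call
    call_category: Optional[str] = None
    direction: Optional[str] = None      # caller/receiver type
    call_emotion: Optional[str] = None


def build_call_attributes(raw: Dict[str, Any]) -> CallAttributes:
    """Keep only keys that are call attributes; drop everything else."""
    allowed = {f.name for f in fields(CallAttributes)}
    return CallAttributes(**{k: v for k, v in raw.items() if k in allowed})
```

For example, a key such as `ambient_noise_db` in the input dictionary is silently dropped, because environment information is not part of the call attributes as defined here.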

＜発信処理＞
発信処理は、ユーザから顧客に対し発信（架電）する処理である。 <Outgoing process>
The calling process is a process of making a call (calling) from the user to the customer.

＜発信処理の概要＞
発信処理は、ユーザはユーザ端末２０の画面に表示された複数の顧客のうち発信を希望する顧客を選択し、発信操作を行うことにより、顧客に対して発信を行なう一連の処理である。 <Outline of call processing>
The calling process is a series of processes in which the user selects the customer to be called from among a plurality of customers displayed on the screen of the user terminal 20 and performs a calling operation, whereby a call is placed to that customer.

＜発信処理の詳細＞
ユーザから顧客に発信する場合における情報処理システム１の発信処理について説明する。 <Details of outgoing processing>
A call processing of the information processing system 1 when a user makes a call to a customer will be described.

ユーザが顧客に発信する場合、情報処理システム１において以下の処理が実行される。 When a user calls a customer, the information processing system 1 performs the following processes.

ユーザはユーザ端末２０を操作することにより、ウェブブラウザを起動し、ＣＲＭシステム３０が提供するＣＲＭサービスのウェブサイトへアクセスする。ユーザは、CRMサービスが提供する顧客管理画面を開くことにより自身の顧客をユーザ端末２０のディスプレイ２０８１へ一覧表示することができる。
具体的に、ユーザ端末２０は、ＣＲＭＩＤ２０１３および顧客を一覧表示する旨のリクエストをＣＲＭシステム３０へ送信する。ＣＲＭシステム３０は、リクエストを受信すると、顧客テーブル３０１２を検索し、顧客ＩＤ、氏名、電話番号、顧客属性、顧客組織名、顧客組織属性などのユーザの顧客に関する情報をユーザ端末２０に送信する。ユーザ端末２０は、受信した顧客に関する情報をユーザ端末２０のディスプレイ２０８１に表示する。 By operating the user terminal 20 , the user activates the web browser and accesses the CRM service website provided by the CRM system 30 . The user can display a list of his/her own customers on the display 2081 of the user terminal 20 by opening the customer management screen provided by the CRM service.
Specifically, the user terminal 20 transmits to the CRM system 30 a CRM ID 2013 and a request to display a list of customers. When the CRM system 30 receives the request, it searches the customer table 3012 and transmits to the user terminal 20 the user's customer information such as the customer ID, name, telephone number, customer attributes, customer organization name, and customer organization attributes. The user terminal 20 displays the received customer information on the display 2081 of the user terminal 20 .

ユーザは、ユーザ端末２０のディスプレイ２０８１に一覧表示された顧客から発信を希望する顧客を押下し選択する。顧客が選択された状態で、ユーザ端末２０のディスプレイ２０８１に表示された「発信」ボタンまたは、電話番号ボタンを押下することにより、ＣＲＭシステム３０に対し電話番号を含むリクエストを送信する。リクエストを受信したＣＲＭシステム３０は、電話番号を含むリクエストをサーバ１０へ送信する。リクエストを受信したサーバ１０は、音声サーバ（ＰＢＸ）４０に対し、発信リクエストを送信する。音声サーバ（ＰＢＸ）４０は、発信リクエストを受信すると、受信した電話番号に基づき顧客端末５０に対し発信（呼出し）を行う。 From the customers listed on the display 2081 of the user terminal 20, the user presses and selects the customer to be called. With the customer selected, pressing the "Call" button or the phone number button displayed on the display 2081 of the user terminal 20 sends a request including the phone number to the CRM system 30. The CRM system 30 that has received the request transmits a request including the telephone number to the server 10. The server 10 that has received the request transmits a call origination request to the voice server (PBX) 40. Upon receiving the call origination request, the voice server (PBX) 40 places a call to (rings) the customer terminal 50 based on the received telephone number.

これに伴い、ユーザ端末２０は、スピーカ２０８２などを制御し音声サーバ（ＰＢＸ）４０により発信（呼出し）が行われている旨を示す鳴動を行う。また、ユーザ端末２０のディスプレイ２０８１は、音声サーバ（ＰＢＸ）４０により顧客に対して発信（呼出し）が行われている旨を示す情報を表示する。例えば、ユーザ端末２０のディスプレイ２０８１は、「呼出中」という文字を表示してもよい。 Along with this, the user terminal 20 controls the speaker 2082 and the like to ring to indicate that the voice server (PBX) 40 is making a call (calling). Also, the display 2081 of the user terminal 20 displays information indicating that the voice server (PBX) 40 is making a call to the customer. For example, the display 2081 of the user terminal 20 may display the characters "calling".

顧客は、顧客端末５０において不図示の受話器を持ち上げたり、顧客端末５０のタッチパネル５０６に着信時に表示される「受信」ボタンなどを押下することにより、顧客端末５０は通話可能状態となる。これに伴い、音声サーバ（ＰＢＸ）４０は、顧客端末５０による応答がなされたことを示す情報（以下、「応答イベント」と呼ぶ）を、サーバ１０、ＣＲＭシステム３０などを介してユーザ端末２０に送信する。
これにより、ユーザと顧客は、それぞれユーザ端末２０、顧客端末５０を用いて通話可能状態となり、ユーザと顧客との間で通話することができるようになる。具体的には、ユーザ端末２０のマイク２０６２により集音されたユーザの音声は、顧客端末５０のスピーカ５０８２から出力される。同様に、顧客端末５０のマイク５０８１から集音された顧客の音声は、ユーザ端末２０のスピーカ２０８２から出力される。 When the customer picks up the receiver (not shown) of the customer terminal 50 or presses the "Receive" button or the like displayed on the touch panel 506 of the customer terminal 50 when a call arrives, the customer terminal 50 becomes ready for a call. The voice server (PBX) 40 then transmits information indicating that the customer terminal 50 has answered (hereinafter referred to as a "response event") to the user terminal 20 via the server 10, the CRM system 30, and the like.
As a result, the user and the customer are ready to talk using the user terminal 20 and the customer terminal 50, respectively, so that the user and the customer can talk to each other. Specifically, the user's voice collected by the microphone 2062 of the user terminal 20 is output from the speaker 5082 of the customer terminal 50 . Similarly, the customer's voice collected from the microphone 5081 of the customer terminal 50 is output from the speaker 2082 of the user terminal 20 .

ユーザ端末２０のディスプレイ２０８１は、通話可能状態になると、応答イベントを受信し、通話が行われていることを示す情報を表示する。例えば、ユーザ端末２０のディスプレイ２０８１は、「応答中」という文字を表示してもよい。 The display 2081 of the user terminal 20 receives the response event and displays information indicating that a call is being made when the call is ready. For example, the display 2081 of the user terminal 20 may display the characters "answering".
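The outgoing call flow described above (user terminal 20 to CRM system 30 to server 10 to voice server (PBX) 40, followed by a response event back to the user terminal) can be sketched as a chain of small objects. All class and method names below are hypothetical, and the real components communicate over a network rather than by direct method calls; the sketch only shows the order of the relay and the display states "calling" and "answering".

```python
from enum import Enum
from typing import List


class CallState(Enum):
    IDLE = "idle"
    CALLING = "calling"      # shown while the PBX is ringing the customer
    ANSWERING = "answering"  # shown after the response event arrives


class VoiceServer:
    """Stand-in for the voice server (PBX) 40."""

    def __init__(self) -> None:
        self.dialed: List[str] = []

    def originate(self, phone_number: str) -> None:
        # Ring the customer terminal that owns this number.
        self.dialed.append(phone_number)


class Server:
    """Stand-in for the server 10: forwards the call request to the PBX."""

    def __init__(self, pbx: VoiceServer) -> None:
        self.pbx = pbx

    def handle_call_request(self, phone_number: str) -> None:
        self.pbx.originate(phone_number)


class CrmSystem:
    """Stand-in for the CRM system 30: forwards the request to the server."""

    def __init__(self, server: Server) -> None:
        self.server = server

    def handle_call_request(self, crm_id: str, phone_number: str) -> None:
        self.server.handle_call_request(phone_number)


class UserTerminal:
    """Stand-in for the user terminal 20."""

    def __init__(self, crm: CrmSystem, crm_id: str) -> None:
        self.crm = crm
        self.crm_id = crm_id
        self.state = CallState.IDLE

    def press_call(self, phone_number: str) -> None:
        # Send the request including the phone number, then show "calling".
        self.crm.handle_call_request(self.crm_id, phone_number)
        self.state = CallState.CALLING

    def on_response_event(self) -> None:
        # The customer answered; only meaningful while a call is ringing.
        if self.state is CallState.CALLING:
            self.state = CallState.ANSWERING
```

Keeping each hop as its own object mirrors the description, where the request passes through the CRM system and the server before the voice server places the call.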

＜着信処理＞
着信処理は、ユーザが顧客から着信（受電）する処理である。 <Incoming processing>
Incoming call processing is processing in which a user receives a call (receives a call) from a customer.

＜着信処理の概要＞
着信処理は、ユーザがユーザ端末２０においてアプリケーションを立ち上げている場合に、顧客がユーザに対して発信した場合に、ユーザが着信する一連の処理である。 <Overview of Incoming Call Processing>
The incoming call process is a series of processes by which the user receives a call when the customer calls the user while the user has the application running on the user terminal 20.

＜着信処理の詳細＞
ユーザが顧客から着信（受電）する場合における情報処理システム１の着信処理について説明する。 <Details of incoming call processing>
Incoming call processing of the information processing system 1 when the user receives an incoming call (receives a call) from a customer will be described.

ユーザが顧客から着信する場合、情報処理システム１において以下の処理が実行される。 When a user receives an incoming call from a customer, the information processing system 1 performs the following processes.

ユーザはユーザ端末２０を操作することにより、ウェブブラウザを起動し、ＣＲＭシステム３０が提供するＣＲＭサービスのウェブサイトへアクセスする。このとき、ユーザはウェブブラウザにおいて、自身のアカウントにてＣＲＭシステム３０にログインし待機しているものとする。なお、ユーザはＣＲＭシステム３０にログインしていれば良く、ＣＲＭサービスにかかる他の作業などを行っていても良い。 By operating the user terminal 20 , the user activates the web browser and accesses the CRM service website provided by the CRM system 30 . At this time, it is assumed that the user logs in to the CRM system 30 with his own account on the web browser and waits. It is sufficient for the user to be logged in to the CRM system 30, and the user may be performing other work related to the CRM service.

顧客は、顧客端末５０を操作し、音声サーバ（ＰＢＸ）４０に割り当てられた所定の電話番号を入力し、音声サーバ（ＰＢＸ）４０に対して発信する。音声サーバ（ＰＢＸ）４０は、顧客端末５０の発信を着信イベントとして受信する。 The customer operates the customer terminal 50 , inputs a predetermined telephone number assigned to the voice server (PBX) 40 , and makes a call to the voice server (PBX) 40 . The voice server (PBX) 40 receives the outgoing call from the customer terminal 50 as an incoming call event.

音声サーバ（ＰＢＸ）４０は、サーバ１０に対し、着信イベントを送信する。具体的には、音声サーバ（ＰＢＸ）４０は、サーバ１０に対して顧客の電話番号５０１１を含む着信リクエストを送信する。サーバ１０は、ＣＲＭシステム３０を介してユーザ端末２０に対して着信リクエストを送信する。
これに伴い、ユーザ端末２０は、スピーカ２０８２などを制御し音声サーバ（ＰＢＸ）４０により着信が行われている旨を示す鳴動を行う。ユーザ端末２０のディスプレイ２０８１は、音声サーバ（ＰＢＸ）４０により顧客から着信があること旨を示す情報を表示する。例えば、ユーザ端末２０のディスプレイ２０８１は、「着信中」という文字を表示してもよい。 A voice server (PBX) 40 sends an incoming call event to the server 10 . Specifically, the voice server (PBX) 40 transmits an incoming call request including the customer's telephone number 5011 to the server 10 . The server 10 transmits an incoming call request to the user terminal 20 via the CRM system 30 .
Along with this, the user terminal 20 controls the speaker 2082 and the like to ring to indicate that the voice server (PBX) 40 is receiving an incoming call. The display 2081 of the user terminal 20 displays information indicating that the voice server (PBX) 40 has received an incoming call from the customer. For example, the display 2081 of the user terminal 20 may display the characters "incoming call".

ユーザ端末２０は、ユーザによる応答操作を受付ける。応答操作は、例えば、ユーザ端末２０において不図示の受話器を持ち上げたり、ユーザ端末２０のディスプレイ２０８１に「電話に出る」と表示されたボタンを、ユーザがマウス２０６６を操作して押下する操作などにより実現される。
ユーザ端末２０は、応答操作を受付けると、音声サーバ（ＰＢＸ）４０に対し、ＣＲＭシステム３０、サーバ１０を介して応答リクエストを送信する。音声サーバ（ＰＢＸ）４０は、送信されてきた応答リクエストを受信し、音声通信を確立する。これにより、ユーザ端末２０は、顧客端末５０と通話可能状態となる。
ユーザ端末２０のディスプレイ２０８１は、通話が行われていることを示す情報を表示する。例えば、ユーザ端末２０のディスプレイ２０８１は、「通話中」という文字を表示してもよい。 The user terminal 20 accepts a response operation by the user. The response operation is realized, for example, by lifting the handset (not shown) of the user terminal 20, or by the user operating the mouse 2066 to press a button labeled "answer the call" on the display 2081 of the user terminal 20.
Upon accepting the response operation, the user terminal 20 transmits a response request to the voice server (PBX) 40 via the CRM system 30 and the server 10. The voice server (PBX) 40 receives the transmitted response request and establishes voice communication. As a result, the user terminal 20 becomes ready for a call with the customer terminal 50.
The display 2081 of the user terminal 20 displays information indicating that a call is in progress. For example, the display 2081 of the user terminal 20 may display the characters "in call".

通話可能状態になると、後述する音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）、音声変換処理（第四実施例）が実行される。
特定の、話者および聞き手のペアごとに、音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）のいずれかの音声変換処理が行われる構成としても良い。音声変換処理（第四実施例）は音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）と同時に実行されても良い。
３人以上の通話が行われている場合には、３人のうちの任意の２人の話者および聞き手のペアの組み合わせごとに、音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）のいずれかの音声変換処理を実行しても構わない。つまり、３人のうちの異なる２人の話者および聞き手のペアの組み合わせごとに、異なる音声変換処理が実行されても良い。 When the call becomes ready, voice conversion processing (first embodiment), voice conversion processing (second embodiment), voice conversion processing (third embodiment), and voice conversion processing (fourth embodiment), which will be described later, are executed.
A configuration may be adopted in which one of voice conversion processing (first embodiment), voice conversion processing (second embodiment), and voice conversion processing (third embodiment) is performed for each specific pair of speaker and listener. The voice conversion processing (fourth embodiment) may be executed simultaneously with the voice conversion processing (first embodiment), the voice conversion processing (second embodiment), and the voice conversion processing (third embodiment).
When three or more people are on a call, one of voice conversion processing (first embodiment), voice conversion processing (second embodiment), and voice conversion processing (third embodiment) may be executed for each combination of any two of them as a speaker-listener pair. That is, a different voice conversion process may be executed for each distinct speaker-listener pair among the participants.
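Selecting a voice conversion process per speaker-listener pair can be sketched as a lookup table keyed by ordered pairs of participants. The function `assign_conversions`, the policy function `example_policy`, and the labels "first", "second", and "third" are hypothetical stand-ins; the disclosure does not specify how a process is chosen for a pair, and treating a pair as ordered (speaker first, listener second) is an assumption of this sketch.

```python
from itertools import permutations
from typing import Callable, Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (speaker, listener)


def assign_conversions(
    participants: Iterable[str],
    choose: Callable[[str, str], str],
) -> Dict[Pair, str]:
    """Map every ordered (speaker, listener) pair to a conversion label."""
    return {(s, l): choose(s, l) for s, l in permutations(participants, 2)}


def example_policy(speaker: str, listener: str) -> str:
    # Purely illustrative rule; any selection rule could be plugged in here.
    if speaker.startswith("user") and listener.startswith("customer"):
        return "first"
    if speaker.startswith("customer"):
        return "second"
    return "third"
```

With three participants the table has six entries, one per ordered pair, so the voice heard by each listener from each speaker can be converted differently.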

＜変形例＞
なお、ユーザが顧客との間で通話可能状態となる方法は、発信処理、着信処理に限られず、ユーザと顧客との間で通話を実現するための任意の方法を用いても構わない。例えば、サーバ１０上に、ユーザと顧客との間で通話を行うためのルームとよばれる仮想的な通話空間を作成し、ユーザおよび顧客が当該ルームへユーザ端末２０、顧客端末５０に記憶されたウェブブラウザまたはアプリケーションプログラムを介してアクセスすることにより通話可能状態となる方法でも構わない。この場合、音声サーバ（ＰＢＸ）４０は不要となる。
具体的には、通話の主催者となるユーザがユーザ端末２０の入力装置２０６を操作し、サーバ１０へ通話開催に関するリクエストを送信する。サーバ１０の制御部１０４は、リクエストを受信するとユニークなルームＩＤなどのルーム識別情報を発行し、ユーザ端末２０へレスポンスを送信する。ユーザは、受信したルーム識別情報を、通話相手の顧客へメールなど任意の通信手段により送信する。ユーザは、ユーザ端末２０の入力装置２０６を操作し、ウェブブラウザなどでサーバ１０のルームに関するサービスを提供するＵＲＬへアクセスし、ルーム識別情報を入力することによりルームに入室することができる。同様に、顧客は顧客端末５０のタッチパネル５０６を操作し、ウェブブラウザなどでサーバ１０のルームに関するサービスを提供するＵＲＬへアクセスし、ルーム識別情報を入力することによりルームに入室することができる。これにより、ユーザと顧客とはルーム識別情報により関連付けられたルームとよばれる仮想的な通話空間内で、それぞれユーザ端末２０、顧客端末５０を介して通話を行うことができる。
ルーム識別情報を入力することにより、複数のユーザ、複数の顧客が１つのルームに入室することができる。これにより、複数のユーザと、複数の顧客とはルーム識別情報により関連付けられたルームとよばれる仮想的な通話空間内で、それぞれがユーザ端末２０、顧客端末５０を介して通話を行うことができる。 <Modification>
Note that the method by which the user becomes able to talk with the customer is not limited to the outgoing call process and the incoming call process, and any method for realizing a call between the user and the customer may be used. For example, a virtual communication space called a room may be created on the server 10 for calls between the user and the customer, and the user and the customer may become able to talk by accessing the room via a web browser or an application program stored in the user terminal 20 and the customer terminal 50, respectively. In this case, the voice server (PBX) 40 becomes unnecessary.
Specifically, the user who is the organizer of the call operates the input device 206 of the user terminal 20 to transmit a request for holding the call to the server 10 . Upon receiving the request, the control unit 104 of the server 10 issues room identification information such as a unique room ID and transmits a response to the user terminal 20 . The user transmits the received room identification information to the customer who is the other party of the call by any means of communication such as e-mail. The user can enter the room by operating the input device 206 of the user terminal 20, accessing the URL of the server 10 providing room-related services using a web browser, etc., and entering the room identification information. Similarly, the customer can enter the room by operating the touch panel 506 of the customer terminal 50, accessing the URL of the server 10 providing room-related services using a web browser, etc., and entering the room identification information. As a result, the user and the customer can talk via the user terminal 20 and the customer terminal 50, respectively, within a virtual communication space called a room associated with the room identification information.
By inputting the room identification information, multiple users and multiple customers can enter one room. As a result, a plurality of users and a plurality of customers can talk with one another via their respective user terminals 20 and customer terminals 50 in a virtual communication space called a room associated with the room identification information.
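The room-based variant above can be sketched as a small service that issues a unique room ID and lets any number of participants join by presenting it. The class `RoomService`, its method names, and the use of a random UUID as the room identification information are assumptions for illustration; the disclosure only requires that the identifier be unique.

```python
import uuid
from typing import Dict, FrozenSet, Set


class RoomService:
    """Stand-in for the room function of the server 10."""

    def __init__(self) -> None:
        self._rooms: Dict[str, Set[str]] = {}

    def create_room(self) -> str:
        # Issue a unique room ID in response to the organizer's request.
        room_id = uuid.uuid4().hex
        self._rooms[room_id] = set()
        return room_id

    def join(self, room_id: str, participant_id: str) -> FrozenSet[str]:
        # Entering requires a valid room ID; returns the current occupants.
        if room_id not in self._rooms:
            raise KeyError(f"unknown room: {room_id}")
        self._rooms[room_id].add(participant_id)
        return frozenset(self._rooms[room_id])
```

In use, the organizer would call `create_room`, pass the returned ID to the other party by e-mail or another channel, and each party would then call `join` with that ID.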

＜通話記憶処理＞
通話記憶処理は、ユーザと顧客との間で行われる通話に関するデータを記憶する処理である。 <Call memory processing>
The call storage process is a process of storing data relating to calls made between a user and a customer.

＜通話記憶処理の概要＞
通話記憶処理は、ユーザと顧客との間で通話が開始された場合に、通話に関するデータを通話テーブル１０１４に記憶する一連の処理である。 <Overview of call memory processing>
The call storage process is a series of processes for storing data related to calls in the call table 1014 when a call is started between a user and a customer.

＜通話記憶処理の詳細＞
ユーザと顧客との間で通話が開始されると、音声サーバ（ＰＢＸ）４０は、ユーザと顧客との間で行われる通話に関する音声データを録音し、サーバ１０へ送信する。サーバ１０の制御部１０４は、音声データを受信すると、通話テーブル１０１４に新たなレコードを作成し、ユーザと顧客との間で行われる通話に関するデータを記憶する。具体的に、サーバ１０の制御部１０４は、ユーザＩＤ、顧客ＩＤ、通話カテゴリ、受発信種別、音声データの内容を通話テーブル１０１４に記憶する。 <Details of call memory processing>
When a call is started between the user and the customer, the voice server (PBX) 40 records voice data of the call between the user and the customer and transmits the data to the server 10. Upon receiving the voice data, the control unit 104 of the server 10 creates a new record in the call table 1014 and stores data relating to the call made between the user and the customer. Specifically, the control unit 104 of the server 10 stores the user ID, customer ID, call category, incoming/outgoing type, and contents of the voice data in the call table 1014.

サーバ１０の制御部１０４は、発信処理または着信処理においてユーザ端末２０から、ユーザのユーザＩＤ２０１１を取得し、新たなレコードのユーザＩＤの項目に記憶する。
サーバ１０の制御部１０４は、発信処理または着信処理において電話番号に基づきＣＲＭシステム３０へ問い合わせを行なう。ＣＲＭシステム３０は、顧客テーブル３０１２を電話番号により検索することにより、顧客ＩＤを取得し、サーバ１０へ送信する。サーバ１０の制御部１０４は、取得した顧客ＩＤを新たなレコードの顧客ＩＤの項目に記憶する。
サーバ１０の制御部１０４は、予めユーザまたは顧客ごとに設定された通話カテゴリの値を、新たなレコードの通話カテゴリの項目に記憶する。なお、通話カテゴリは、通話ごとにユーザが値を選択したり入力することにより記憶しても良い。
サーバ１０の制御部１０４は、行われている通話がユーザにより発信したものか、顧客から発信されたものかを識別し、新たなレコードの受発信種別の項目にアウトバウンド（ユーザから発信）、インバウンド（顧客から発信）のいずれかの値を記憶する。
サーバ１０の制御部１０４は、音声サーバ（ＰＢＸ）４０から受信する音声データを、新たなレコードの音声データの項目に記憶する。なお、音声データは他の場所に音声データファイルとして記憶し、通話終了後に、音声データファイルに対する参照情報（パス）を記憶するものとしても良い。また、サーバ１０の制御部１０４は、通話終了後にデータを記憶する構成としても良い。 The control unit 104 of the server 10 acquires the user ID 2011 of the user from the user terminal 20 in the outgoing call process or the incoming call process, and stores it in the user ID item of the new record.
The control unit 104 of the server 10 makes an inquiry to the CRM system 30 based on the telephone number in the outgoing call process or the incoming call process. The CRM system 30 acquires the customer ID by searching the customer table 3012 by telephone number and transmits it to the server 10. The control unit 104 of the server 10 stores the acquired customer ID in the customer ID item of the new record.
The control unit 104 of the server 10 stores the value of the call category set in advance for each user or customer in the call category item of the new record. Note that the call category may instead be stored by the user selecting or inputting a value for each call.
The control unit 104 of the server 10 identifies whether the call in progress was originated by the user or by the customer, and stores either outbound (originated by the user) or inbound (originated by the customer) in the incoming/outgoing type item of the new record.
The control unit 104 of the server 10 stores the voice data received from the voice server (PBX) 40 in the voice data item of the new record. The voice data may be stored as a voice data file in another location, and the reference information (path) to the voice data file may be stored after the call is finished. Also, the control unit 104 of the server 10 may be configured to store data after the end of the call.
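As a rough sketch of the call storage process, the record creation and the deferred storage of an audio reference might look like the following. The field names, the sample IDs, the category value, and the file path are hypothetical; the actual layout of the call table 1014 is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    """Hypothetical shape of one record in the call table 1014."""
    call_id: int
    user_id: str
    customer_id: str
    call_category: str
    direction: str                    # "outbound" (from user) or "inbound" (from customer)
    audio_path: Optional[str] = None  # reference to the audio file, set after the call


call_table = {}  # call ID -> CallRecord


def store_call(user_id, customer_id, call_category, originated_by_user):
    # Create a new record when the call starts.
    call_id = len(call_table) + 1
    direction = "outbound" if originated_by_user else "inbound"
    record = CallRecord(call_id, user_id, customer_id, call_category, direction)
    call_table[call_id] = record
    return record


def finish_call(call_id, audio_path):
    # After the call ends, store only a reference (path) to the audio file.
    call_table[call_id].audio_path = audio_path


record = store_call("U001", "C042", "sales", originated_by_user=True)
finish_call(record.call_id, "/recordings/1.wav")
```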

＜音声変換処理（第一実施例）＞
音声変換処理（第一実施例）は、ユーザが発話した音声データに対して通話属性に基づき選択された生成モデル１０２２を適用することにより得られる変換音声データを顧客に対して出力する処理である。
これにより、顧客は、より適した音声でユーザと通話を行うことができる。 <Voice conversion processing (first embodiment)>
The voice conversion process (first embodiment) is a process of outputting to the customer converted voice data obtained by applying a generative model 1022, selected based on call attributes, to voice data uttered by the user.
This allows the customer to talk with the user in a more suitable voice.

＜音声変換処理（第一実施例）の概要＞
音声変換処理（第一実施例）は、ユーザと顧客とが通話可能状態となると開始される。音声変換処理（第一実施例）は、通話属性を取得し、通話属性、ユーザが発話した音声データを入力データとして生成モデル１０２２に入力し、出力される変換音声データを顧客に対して出力する一連の処理である。 <Outline of voice conversion processing (first embodiment)>
The voice conversion process (first embodiment) is started when the user and the customer are ready to talk. It is a series of processes that acquires call attributes, inputs the call attributes and the voice data uttered by the user into the generative model 1022 as input data, and outputs the resulting converted voice data to the customer.

＜音声変換処理（第一実施例）の詳細＞
ステップＳ１０１において、ユーザと顧客とが通話可能状態となると音声変換処理（第一実施例）が開始される。 <Details of voice conversion processing (first embodiment)>
In step S101, when the user and the customer are ready to talk, the voice conversion process (first embodiment) is started.

ステップＳ１０２において、サーバ１０の音声変換部１０４２は、通話に関する通話属性を取得する。
具体的に、サーバ１０の音声変換部１０４２は、ユーザのユーザＩＤに基づきユーザテーブル１０１２のユーザＩＤの項目を検索し、組織ＩＤおよびユーザ属性の項目を取得する。サーバ１０の音声変換部１０４２は、取得した組織ＩＤに基づき、組織テーブル１０１３の組織ＩＤの項目を検索し、組織名、組織属性の項目を取得する。つまり、サーバ１０の音声変換部１０４２は、ユーザに関する属性情報を取得する。サーバ１０の音声変換部１０４２は、ユーザの発話音声からユーザの感情状態を推定しユーザの感情情報を、ユーザに関する属性情報として取得しても良い。
サーバ１０の音声変換部１０４２は、顧客の顧客ＩＤを含む照会リクエストをＣＲＭシステム３０に送信する。ＣＲＭシステム３０は、受信したリクエストに含まれる顧客ＩＤに基づき、顧客テーブル３０１２の顧客ＩＤの項目を検索し、顧客属性、顧客組織名、顧客組織属性の項目を取得し、サーバ１０へ送信する。サーバ１０の音声変換部１０４２は、ＣＲＭシステム３０から顧客の顧客属性、顧客組織名、顧客組織属性の項目を取得する。つまり、サーバ１０の音声変換部１０４２は、顧客に関する属性情報を取得する。サーバ１０の音声変換部１０４２は、顧客の発話音声から顧客の感情状態を推定し顧客の感情情報を、顧客に関する属性情報として取得しても良い。
サーバ１０の音声変換部１０４２は、通話テーブル１０１４を参照し、通話記憶処理により記憶された通話データに含まれる通話カテゴリ、受発信種別の情報を取得する。サーバ１０の音声変換部１０４２は、通話に関する通話ＩＤに基づき、通話テーブル１０１４の通話ＩＤの項目を検索し、通話カテゴリ、受発信種別の項目を取得する。つまり、サーバ１０の音声変換部１０４２は、通話に関する属性情報を取得する。サーバ１０の音声変換部１０４２は、ユーザおよび顧客の発話音声からユーザおよび顧客の感情状態を推定しユーザおよび顧客の感情情報を、通話に関する属性情報として取得しても良い。 In step S102, the voice conversion unit 1042 of the server 10 acquires call attributes related to the call.
Specifically, the speech conversion unit 1042 of the server 10 searches the user ID item of the user table 1012 based on the user ID of the user, and acquires the organization ID and user attribute items. The speech conversion unit 1042 of the server 10 searches the organization ID item in the organization table 1013 based on the acquired organization ID, and acquires the organization name and organization attribute items. That is, the voice conversion unit 1042 of the server 10 acquires attribute information regarding the user. The speech conversion unit 1042 of the server 10 may estimate the user's emotional state from the user's uttered voice and acquire the user's emotional information as attribute information about the user.
The voice conversion unit 1042 of the server 10 transmits an inquiry request including the customer's customer ID to the CRM system 30. The CRM system 30 searches the customer ID item in the customer table 3012 based on the customer ID included in the received request, acquires the customer attribute, customer organization name, and customer organization attribute items, and transmits them to the server 10. The voice conversion unit 1042 of the server 10 acquires the items of customer attribute, customer organization name, and customer organization attribute from the CRM system 30. That is, the voice conversion unit 1042 of the server 10 acquires the attribute information regarding the customer. The voice conversion unit 1042 of the server 10 may estimate the customer's emotional state from the customer's uttered voice and acquire the customer's emotion information as attribute information regarding the customer.
The voice conversion unit 1042 of the server 10 refers to the call table 1014 and acquires the information on the call category and the type of incoming and outgoing calls included in the call data stored by the call storage process. The voice conversion unit 1042 of the server 10 searches the item of call ID in the call table 1014 based on the call ID related to the call, and acquires the items of the call category and the type of reception/transmission. That is, the voice conversion unit 1042 of the server 10 acquires the attribute information regarding the call. The speech conversion unit 1042 of the server 10 may estimate the emotional state of the user and the customer from the uttered voice of the user and the customer, and acquire the emotional information of the user and the customer as attribute information regarding the call.

サーバ１０の音声変換部１０４２は、ユーザに関する属性情報、顧客に関する属性情報、通話に関する属性情報の少なくともいずれか１つを通話属性として取得しても良い。例えば、ユーザに関する属性情報のみを通話属性として取得しても良い。
また、サーバ１０の音声変換部１０４２は、ユーザ属性、ユーザの所属する組織の組織名または組織属性、ユーザの感情情報、顧客属性、顧客の所属する組織の組織名または組織属性、顧客の感情情報、通話カテゴリ、受発信者種別、通話の感情情報のいずれか１つを通話属性として取得しても良い。例えば、通話カテゴリのみを通話属性として取得しても良い。 The voice conversion unit 1042 of the server 10 may acquire at least one of attribute information about the user, attribute information about the customer, and attribute information about the call as the call attribute. For example, only attribute information about the user may be acquired as a call attribute.
In addition, the voice conversion unit 1042 of the server 10 may acquire any one of the user attribute, the organization name or organization attribute of the organization to which the user belongs, the user's emotion information, the customer attribute, the organization name or organization attribute of the organization to which the customer belongs, the customer's emotion information, the call category, the incoming/outgoing type, and the emotion information of the call as the call attribute. For example, only the call category may be acquired as the call attribute.
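The attribute acquisition of step S102 can be sketched as a set of lookups whose results are combined into one call-attribute structure. All table contents, key names, and the `include` parameter below are assumptions made for illustration; emotion estimation is omitted.

```python
# Hypothetical contents standing in for the user table 1012, the organization
# table 1013, the customer table 3012, and the call table 1014.
USER_TABLE = {"U001": {"org_id": "O01", "user_attr": "sales rep"}}
ORG_TABLE = {"O01": {"org_name": "Example Inc", "org_attr": "software"}}
CUSTOMER_TABLE = {"C042": {"customer_attr": "manager", "org_name": "Acme", "org_attr": "retail"}}
CALL_TABLE = {1: {"user_id": "U001", "customer_id": "C042",
                  "category": "sales", "direction": "outbound"}}


def get_call_attributes(call_id, include=("user", "customer", "call")):
    """Collect call attributes; `include` picks which attribute groups are used."""
    call = CALL_TABLE[call_id]
    attrs = {}
    if "user" in include:
        user = USER_TABLE[call["user_id"]]
        org = ORG_TABLE[user["org_id"]]  # organization found via the organization ID
        attrs["user"] = {"user_attr": user["user_attr"], **org}
    if "customer" in include:
        # Stands in for the inquiry request sent to the CRM system 30.
        attrs["customer"] = dict(CUSTOMER_TABLE[call["customer_id"]])
    if "call" in include:
        attrs["call"] = {"category": call["category"], "direction": call["direction"]}
    return attrs


all_attrs = get_call_attributes(1)
call_only = get_call_attributes(1, include=("call",))
```

Passing a narrower `include` corresponds to the case where only some of the attribute information is used as the call attribute.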

ステップＳ１０４において、サーバ１０の音声変換部１０４２は、ユーザから通話音声を取得し、取得した通話音声を変換する。このとき、サーバ１０の音声変換部１０４２は、ステップＳ１０２において取得した通話属性に基づき、取得した通話音声を変換する。サーバ１０の音声変換部１０４２は、ステップＳ１０２において取得した通話属性および通話音声に対して生成モデル１０２２を適用することにより、取得した通話音声を変換する。
具体的に、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０からユーザにより発話された音声データを逐次的に取得する。サーバ１０の音声変換部１０４２は、ユーザにより発話後、発話された音声データをできるだけ遅延なく取得することが望ましい。
サーバ１０の音声変換部１０４２は、取得した通話属性、音声データを入力データとして生成モデル１０２２に入力し、出力される変換音声データを取得する。 In step S104, the voice conversion unit 1042 of the server 10 acquires the call voice from the user and converts the acquired call voice. At this time, the voice conversion unit 1042 of the server 10 converts the acquired call voice based on the call attribute acquired in step S102. The voice conversion unit 1042 of the server 10 converts the acquired call voice by applying the generation model 1022 to the call attributes and call voice acquired in step S102.
Specifically, the voice conversion unit 1042 of the server 10 sequentially acquires the voice data uttered by the user from the voice server (PBX) 40. It is desirable that the voice conversion unit 1042 of the server 10 acquire the uttered voice data with as little delay as possible after the user speaks.
The speech conversion unit 1042 of the server 10 inputs the acquired call attributes and speech data to the generation model 1022 as input data, and acquires converted speech data to be output.
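One way to keep the delay small is to convert the audio frame by frame as it arrives, rather than waiting for a whole utterance. The sketch below shows only that data flow. The function names and the frame size are assumptions, and `identity_model` is a placeholder that returns its input unchanged; it is not a generative model.

```python
def frames(samples, frame_size):
    # Split a sample sequence into fixed-size frames (the last may be shorter).
    for start in range(0, len(samples), frame_size):
        yield samples[start:start + frame_size]


def identity_model(call_attributes, frame):
    """Placeholder for the generative model 1022: returns the frame unchanged."""
    return list(frame)


def convert_stream(samples, call_attributes, model, frame_size=2):
    # Feed the call attributes and each frame to the model as soon as the frame is ready.
    for frame in frames(samples, frame_size):
        yield model(call_attributes, frame)


converted = list(convert_stream([1, 2, 3, 4, 5], {"category": "sales"}, identity_model))
```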

サーバ１０の音声変換部１０４２は、顧客に対して出力される変換音声データの評価指標の種別に応じて複数の生成モデル１０２２を選択的に切り替えて適用し、音声データを変換しても良い。
例えば、顧客からより信頼性が得られるような生成モデル１０２２を用いて、音声データを変換しても良い。例えば、顧客がより聴きやすい（聴き取りやすい）生成モデル１０２２を用いて、音声データを変換しても良い。例えば、顧客に対して、ＳＩＩＢ、ＨＡＳＰＩ、ＥＳＴＯＩ、ＰＥＳＱ、ＶｉＳＱＯＬなどの評価指標や、信頼性、信用性、心地良さ、快適性、好み、ストレス値、威圧度、興趣性などの評価指標が適したものとなるような生成モデル１０２２を用いて、音声データを変換しても良い。 The speech conversion unit 1042 of the server 10 may selectively switch and apply a plurality of generation models 1022 according to the type of evaluation index of the converted speech data output to the customer, and convert the speech data.
For example, the voice data may be converted using a generative model 1022 that earns more trust from the customer. For example, the voice data may be converted using a generative model 1022 that makes the voice easier for the customer to hear. For example, the voice data may be converted using a generative model 1022 under which evaluation indices such as SIIB, HASPI, ESTOI, PESQ, and ViSQOL, or evaluation indices such as reliability, credibility, pleasantness, comfort, preference, stress value, degree of intimidation, and interest, take values suitable for the customer.

サーバ１０の音声変換部１０４２は、ユーザに関する属性情報、顧客に関する属性情報、通話に関する属性情報の少なくともいずれか１つを通話属性として用いて、通話音声を変換しても良い。例えば、ユーザに関する属性情報のみを通話属性として用いて、通話音声を変換しても良い。
また、サーバ１０の音声変換部１０４２は、ユーザ属性、ユーザの所属する組織の組織名または組織属性、ユーザの感情情報、顧客属性、顧客の所属する組織の組織名または組織属性、顧客の感情情報、通話カテゴリ、受発信者種別、通話の感情情報のいずれか１つを通話属性として用いて、通話音声を変換しても良い。例えば、通話カテゴリのみを通話属性として用いて、通話音声を変換しても良い。 The voice conversion unit 1042 of the server 10 may convert the call voice using at least one of the user attribute information, the customer attribute information, and the call attribute information as the call attribute. For example, the call voice may be converted using only the attribute information about the user as the call attribute.
In addition, the voice conversion unit 1042 of the server 10 may convert the call voice using any one of the user attribute, the organization name or organization attribute of the organization to which the user belongs, the user's emotion information, the customer attribute, the organization name or organization attribute of the organization to which the customer belongs, the customer's emotion information, the call category, the incoming/outgoing type, and the emotion information of the call as the call attribute. For example, only the call category may be used as the call attribute to convert the call voice.

ステップＳ１０５において、サーバ１０の音声変換部１０４２は、ステップＳ１０４において変換された通話音声を顧客へ出力する。サーバ１０の音声変換部１０４２は、変換音声データを音声サーバ（ＰＢＸ）４０に送信する。音声サーバ（ＰＢＸ）４０は、受信した変換音声データを、顧客端末５０に対して出力する。顧客端末５０のスピーカ５０８２は、受信した変換音声データをユーザの通話音声として出力する。
つまり、ユーザ端末２０のマイク２０６２により集音されたユーザの音声に関する音声データは、サーバ１０の音声変換部１０４２により変換音声データに変換され、顧客端末５０のスピーカ５０８２から出力される。 In step S105, the voice conversion unit 1042 of the server 10 outputs the call voice converted in step S104 to the customer. The voice converter 1042 of the server 10 transmits the converted voice data to the voice server (PBX) 40 . The voice server (PBX) 40 outputs the received converted voice data to the customer terminal 50 . The speaker 5082 of the customer terminal 50 outputs the received converted voice data as the user's call voice.
That is, voice data relating to the user's voice collected by the microphone 2062 of the user terminal 20 is converted into converted voice data by the voice conversion unit 1042 of the server 10 and output from the speaker 5082 of the customer terminal 50.

＜音声変換処理（第二実施例）＞
音声変換処理（第二実施例）は、ユーザが発話した音声データに対して音声処理モデル１０２３を適用することにより特定された音声処理内容を適用した変換音声データを顧客に対して出力する処理である。
これにより、ユーザは、顧客にとってより適した音声へ変換することができる音声処理を選択することができる。 <Voice Conversion Processing (Second Embodiment)>
The voice conversion process (second embodiment) is a process of outputting to the customer converted voice data obtained by applying, to the voice data uttered by the user, the voice processing content specified by applying the voice processing model 1023.
This allows the user to select voice processing that converts the voice into one more suitable for the customer.

＜音声変換処理（第二実施例）の概要＞
音声変換処理（第二実施例）は、ユーザと顧客とが通話可能状態となると開始される。音声変換処理（第二実施例）は、通話属性を取得し、通話属性を入力データとして音声処理モデル１０２３に入力し、出力される音声処理ＩＤにより音声処理内容を選択し、ユーザが発話した音声データに対して選択された音声処理内容を適用して出力される変換音声データを顧客に対して出力する一連の処理である。 <Outline of voice conversion processing (second embodiment)>
The voice conversion process (second embodiment) is started when the user and the customer are ready to talk. It is a series of processes that acquires call attributes, inputs the call attributes as input data to the voice processing model 1023, selects voice processing content based on the output voice processing ID, applies the selected voice processing content to the voice data uttered by the user, and outputs the resulting converted voice data to the customer.

＜音声変換処理（第二実施例）の詳細＞
ステップＳ３０１において、ユーザと顧客とが通話可能状態となると音声変換処理（第二実施例）が開始される。 <Details of voice conversion processing (second embodiment)>
In step S301, when the user and the customer are ready to talk, voice conversion processing (second embodiment) is started.

ステップＳ３０２において、サーバ１０の音声変換部１０４２は、通話に関する通話属性を取得する。ステップＳ３０２は、音声変換処理（第一実施例）におけるステップＳ１０２と同様であるため説明を省略する。 In step S302, the voice conversion unit 1042 of the server 10 acquires call attributes related to the call. Since step S302 is the same as step S102 in the voice conversion process (first embodiment), description thereof is omitted.

ステップＳ３０３において、サーバ１０の音声変換部１０４２は、取得した通話属性に基づき、複数の音声処理のうち所定の音声処理を選択する。サーバ１０の音声変換部１０４２は、ステップＳ３０２において取得した通話属性に対して音声処理モデル１０２３を適用することにより、所定の音声処理を選択する。
具体的には、サーバ１０の音声変換部１０４２は、取得した通話属性を入力データとして音声処理モデル１０２３に入力し、出力される音声処理ＩＤを取得する。サーバ１０の音声変換部１０４２は、取得した音声処理ＩＤに基づき、音声処理テーブル１０１５の音声処理ＩＤの項目を検索し、音声処理内容を取得する。つまり、サーバ１０の音声変換部１０４２は、通話属性に基づき音声処理内容を特定し選択する。 In step S303, the voice conversion unit 1042 of the server 10 selects a predetermined voice process from among a plurality of voice processes based on the acquired call attribute. The voice conversion unit 1042 of the server 10 selects predetermined voice processing by applying the voice processing model 1023 to the call attributes acquired in step S302.
Specifically, the voice conversion unit 1042 of the server 10 inputs the acquired call attribute as input data to the voice processing model 1023, and acquires the output voice processing ID. The voice conversion unit 1042 of the server 10 searches the item of the voice processing ID in the voice processing table 1015 based on the acquired voice processing ID, and acquires the content of the voice processing. That is, the voice conversion unit 1042 of the server 10 specifies and selects voice processing content based on the call attribute.
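Step S303 amounts to two stages: a model maps call attributes to a voice processing ID, and the ID is then looked up in a table. A minimal sketch follows. The table entries, the ID `M001`, the parameter names, and the hand-written rule that stands in for the voice processing model 1023 are all assumptions.

```python
# Hypothetical contents of the voice processing table 1015: ID -> processing content.
VOICE_PROCESSING_TABLE = {
    "M001": {"name": "reduce intonation", "pitch_range_scale": 0.7},
    "M002": {"name": "lower volume", "gain": 0.6},
}


def rule_based_model(call_attributes):
    """Stand-in for the voice processing model 1023.

    A hand-written rule is used here purely for illustration.
    """
    if call_attributes.get("category") == "complaint":
        return "M001"
    return "M002"


def select_voice_processing(call_attributes, model=rule_based_model):
    processing_id = model(call_attributes)      # the model outputs a voice processing ID
    content = VOICE_PROCESSING_TABLE[processing_id]  # look up the processing content
    return processing_id, content


pid, content = select_voice_processing({"category": "complaint"})
```

Keeping the model output as an ID, rather than as processing parameters, lets the table be edited without touching the model.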

ステップＳ３０４において、サーバ１０の音声変換部１０４２は、ユーザから通話音声を取得し、取得した通話音声を変換する。このとき、サーバ１０の音声変換部１０４２は、ステップＳ３０２において取得した通話属性に基づき、取得した通話音声を変換する。サーバ１０の音声変換部１０４２は、通話音声に、ステップＳ３０２において選択された所定の音声処理を適用することにより、通話音声を変換する。
具体的に、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０からユーザにより発話された音声データを逐次的に取得する。サーバ１０の音声変換部１０４２は、ユーザにより発話後、発話された音声データをできるだけ遅延なく取得することが望ましい。
サーバ１０の音声変換部１０４２は、取得した通話属性、音声データに対してステップＳ３０３において選択した音声処理内容を適用し、出力される変換音声データを取得する。 In step S304, the voice conversion unit 1042 of the server 10 acquires the call voice from the user and converts the acquired call voice. At this time, the voice conversion unit 1042 of the server 10 converts the acquired call voice based on the call attribute acquired in step S302. The voice conversion unit 1042 of the server 10 converts the call voice by applying the predetermined voice processing selected in step S302 to the call voice.
Specifically, the voice conversion unit 1042 of the server 10 sequentially acquires the voice data uttered by the user from the voice server (PBX) 40. It is desirable that the voice conversion unit 1042 of the server 10 acquire the uttered voice data with as little delay as possible after the user speaks.
The voice conversion unit 1042 of the server 10 applies the voice processing content selected in step S303 to the acquired call attribute and voice data, and acquires converted voice data to be output.
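As one concrete, hypothetical instance of applying a selected voice processing content, the sketch below treats "lower volume" as a gain applied to each chunk of samples. The sample range of [-1.0, 1.0], the `gain` key, and the function name are assumptions; real voice processing such as intonation change would need a signal processing library.

```python
def apply_voice_processing(samples, content):
    """Apply one processing content to a chunk of samples in the range [-1.0, 1.0]."""
    gain = content.get("gain", 1.0)
    # Scale, then clip so the output stays within the valid sample range.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


lower_volume = {"name": "lower volume", "gain": 0.5}  # hypothetical processing content
chunks = [[0.8, -0.4], [1.0, 0.2]]                    # chunks arriving from the voice server
converted = [apply_voice_processing(c, lower_volume) for c in chunks]
```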

サーバ１０の音声変換部１０４２は、顧客に対して出力される変換音声データの評価指標の種別に応じて複数の音声処理モデル１０２３を選択的に切り替えて適用し、音声データを変換しても良い。
例えば、顧客からより信頼性が得られるような音声処理モデル１０２３を用いて、音声データを変換しても良い。例えば、顧客がより聴きやすい（聴き取りやすい）音声処理モデル１０２３を用いて、音声データを変換しても良い。例えば、顧客に対して、ＳＩＩＢ、ＨＡＳＰＩ、ＥＳＴＯＩ、ＰＥＳＱ、ＶｉＳＱＯＬなどの評価指標や、信頼性、信用性、心地良さ、快適性、好み、ストレス値、威圧度、興趣性などの評価指標が適したものとなるような音声処理モデル１０２３を用いて、音声データを変換しても良い。 The speech conversion unit 1042 of the server 10 may selectively switch and apply a plurality of speech processing models 1023 according to the type of the evaluation index of the converted speech data output to the customer, and convert the speech data. .
For example, the voice data may be converted using a voice processing model 1023 that earns more trust from the customer. For example, the voice data may be converted using a voice processing model 1023 that makes the voice easier for the customer to hear. For example, the voice data may be converted using a voice processing model 1023 under which evaluation indices such as SIIB, HASPI, ESTOI, PESQ, and ViSQOL, or evaluation indices such as reliability, credibility, pleasantness, comfort, preference, stress value, degree of intimidation, and interest, take values suitable for the customer.

サーバ１０の音声変換部１０４２は、ユーザに関する属性情報、顧客に関する属性情報、通話に関する属性情報の少なくともいずれか１つを通話属性として用いて、通話音声を変換しても良い。例えば、ユーザに関する属性情報のみを通話属性として用いて、通話音声を変換しても良い。
また、サーバ１０の音声変換部１０４２は、ユーザ属性、ユーザの所属する組織の組織名または組織属性、ユーザの感情情報、顧客属性、顧客の所属する組織の組織名または組織属性、顧客の感情情報、通話カテゴリ、受発信者種別、通話の感情情報のいずれか１つを通話属性として用いて、通話音声を変換しても良い。例えば、通話カテゴリのみを通話属性として用いて、通話音声を変換しても良い。 The voice conversion unit 1042 of the server 10 may convert the call voice using at least one of the user attribute information, the customer attribute information, and the call attribute information as the call attribute. For example, the call voice may be converted using only the attribute information about the user as the call attribute.
In addition, the voice conversion unit 1042 of the server 10 may convert the call voice using any one of the user attribute, the organization name or organization attribute of the organization to which the user belongs, the user's emotion information, the customer attribute, the organization name or organization attribute of the organization to which the customer belongs, the customer's emotion information, the call category, the incoming/outgoing type, and the emotion information of the call as the call attribute. For example, only the call category may be used as the call attribute to convert the call voice.

ステップＳ３０５において、サーバ１０の音声変換部１０４２は、ステップＳ３０４において変換された通話音声を顧客へ出力する。ステップＳ３０５は、音声変換処理（第一実施例）におけるステップＳ１０５と同様であるため説明を省略する。 In step S305, the voice conversion unit 1042 of the server 10 outputs the call voice converted in step S304 to the customer. Since step S305 is the same as step S105 in the voice conversion process (first embodiment), description thereof is omitted.

＜音声変換処理（第三実施例）＞
音声変換処理（第三実施例）は、ユーザが発話した音声データに対して、ユーザが選択した音声処理モデル１０２３を適用することにより得られる変換音声データを顧客に対して出力する処理である。
これにより、ユーザの選択指示に応じて、顧客はより快適な音声でユーザとの通話を行うことができる。 <Voice Conversion Processing (Third Embodiment)>
The voice conversion process (third embodiment) is a process of outputting to the customer converted voice data obtained by applying the voice processing model 1023 selected by the user to the voice data uttered by the user.
As a result, the customer can talk with the user in a more comfortable voice in accordance with the user's selection instruction.

＜音声変換処理（第三実施例）の概要＞
音声変換処理（第三実施例）は、ユーザと顧客とが通話可能状態となると開始される。音声変換処理（第三実施例）は、ユーザは音声処理内容を選択し、ユーザが発話した音声データに対して選択された音声処理内容を適用して出力される変換音声データを顧客に対して出力する一連の処理である。 <Outline of voice conversion processing (third embodiment)>
The voice conversion process (third embodiment) is started when the user and the customer are ready to talk. It is a series of processes in which the user selects voice processing content, the selected voice processing content is applied to the voice data uttered by the user, and the resulting converted voice data is output to the customer.

＜音声変換処理（第三実施例）の詳細＞
ステップＳ５０１において、ユーザと顧客とが通話可能状態となると音声変換処理（第三実施例）が開始される。 <Details of voice conversion processing (third embodiment)>
In step S501, when the user and the customer are ready to talk, voice conversion processing (third embodiment) is started.

ステップＳ５０３において、ユーザから受け付けた選択指示に基づき、複数の音声処理のうち所定の音声処理を選択する。例えば、複数の音声処理のうち、抑揚を小さくする音声処理を選択すると、顧客に対する威圧的な印象を軽減することができる。顧客が女性である場合などには、声量の大きな音声を小さくする音声処理を選択することにより、顧客に対する威圧的な印象を軽減することができる。
ユーザは、ユーザと顧客との通話中の任意のタイミングで音声処理を選択しても構わない。また、ユーザは、通話の開始前に音声処理を予め選択しておく構成としても構わない。
具体的に、ユーザはユーザ端末２０の入力装置２０６を操作して、適用を希望する音声処理内容に関する音声処理ＩＤを含むリクエストをサーバ１０へ送信する。サーバ１０の音声変換部１０４２は、受信したリクエストに含まれる音声処理ＩＤに基づき、音声処理テーブル１０１５の音声処理ＩＤの項目を検索し、音声処理内容を取得する。つまり、サーバ１０の音声変換部１０４２は、ユーザから受け付けた選択指示に基づき音声処理内容を特定し選択する。 In step S503, predetermined audio processing is selected from the plurality of audio processing based on the selection instruction received from the user. For example, by selecting voice processing that reduces intonation from a plurality of voice processing, it is possible to reduce the intimidating impression on the customer. If the customer is a woman, the intimidating impression on the customer can be reduced by selecting voice processing that reduces loud voices.
The user may select voice processing at any time during the call between the user and the customer. Also, the user may be configured to select voice processing in advance before starting a call.
Specifically, the user operates the input device 206 of the user terminal 20 to transmit to the server 10 a request including an audio processing ID related to the desired audio processing content. The voice conversion unit 1042 of the server 10 searches the voice processing ID item of the voice processing table 1015 based on the voice processing ID included in the received request, and acquires the voice processing content. That is, the voice conversion unit 1042 of the server 10 specifies and selects voice processing content based on the selection instruction received from the user.
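The selection request of step S503 can be sketched as per-call state that is updated whenever a request arrives, which covers both pre-selection and a change during the call. The class `CallSession`, the request key `voice_processing_id`, and the table entries are hypothetical.

```python
# Hypothetical contents of the voice processing table 1015.
VOICE_PROCESSING_TABLE = {"M001": "reduce intonation", "M002": "lower volume"}


class CallSession:
    """Hypothetical per-call state kept by the server 10."""

    def __init__(self, preselected_id=None):
        # The user may pre-select a voice process before the call starts.
        self.processing_id = preselected_id

    def handle_selection(self, request):
        # The request carries the voice processing ID chosen on the user terminal 20.
        processing_id = request["voice_processing_id"]
        if processing_id not in VOICE_PROCESSING_TABLE:
            raise ValueError(f"unknown voice processing ID: {processing_id}")
        self.processing_id = processing_id
        return VOICE_PROCESSING_TABLE[processing_id]


session = CallSession(preselected_id="M001")
selected = session.handle_selection({"voice_processing_id": "M002"})  # changed mid-call
```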

図１５に、音声変換処理（第三実施例）におけるユーザ端末２０の表示画面例を図示する。ユーザ端末２０のディスプレイ２０８１には、通話画面８０が表示される。通話画面８０には、現在、通話中の顧客情報８０１、音声処理内容を選択するためのユーザインタフェース８０２が表示される。顧客情報８０１は、顧客に関する属性情報である、顧客に関する顧客属性、顧客の所属する組織の組織名または組織属性の情報、顧客の感情情報を含んでも良い。ユーザは、ユーザ端末２０の入力装置２０６を操作することにより、音声処理内容に関連付けられたスイッチ８０３を押下することにより、複数の音声処理のうち所定の音声処理を選択する。図１５では、音声処理ＩＤがM００２の音声処理内容が選択されていることが示されている。 FIG. 15 shows an example of the display screen of the user terminal 20 in the voice conversion process (third embodiment). A call screen 80 is displayed on the display 2081 of the user terminal 20 . The call screen 80 displays customer information 801 currently on the phone and a user interface 802 for selecting voice processing content. The customer information 801 may include customer attribute information about the customer, information on the organization name or organization attribute of the organization to which the customer belongs, and customer emotion information, which are attribute information about the customer. By operating the input device 206 of the user terminal 20 to press the switch 803 associated with the content of the audio processing, the user selects a predetermined audio processing from among the plurality of audio processing. FIG. 15 shows that the audio processing content with the audio processing ID M002 is selected.

＜変形例＞
なお、ユーザは、ユーザ端末２０の入力装置２０６を操作して、最適化したい評価指標の種別を選択する構成としても良い。具体的に、ユーザは、ユーザ端末２０の入力装置２０６を操作して、ＳＩＩＢ、ＨＡＳＰＩ、ＥＳＴＯＩ、ＰＥＳＱ、ＶｉＳＱＯＬなどの評価指標や、信頼性、信用性、心地良さ、快適性、好み、ストレス値、威圧度、興趣性などの評価指標を選択しても良い。例えば、ユーザは、ユーザ端末２０の入力装置２０６を操作して、顧客がより聴きやすい（聴き取りやすい）といった選択肢や、顧客からより信頼性が得られるといった選択肢を選択する構成としても良い。
なお、ユーザ端末２０のディスプレイ２０８１は、ユーザが選択可能な評価指標をユーザに対して一覧して提示する構成としても良い。ユーザは、ユーザ端末２０の入力装置２０６を操作して、一覧して提示された評価指標から最適化したい項目を選択することにより、最適化したい評価指標の種別を選択する構成としても良い。 <Modification>
Note that the user may operate the input device 206 of the user terminal 20 to select the type of evaluation index to be optimized. Specifically, the user may operate the input device 206 of the user terminal 20 to select an evaluation index such as SIIB, HASPI, ESTOI, PESQ, or ViSQOL, or an evaluation index such as reliability, credibility, pleasantness, comfort, preference, stress value, degree of intimidation, or interest. For example, the user may operate the input device 206 of the user terminal 20 to select an option such as making the voice easier for the customer to hear, or an option such as earning more trust from the customer.
Note that the display 2081 of the user terminal 20 may be configured to present a list of user-selectable evaluation indices to the user. The user may operate the input device 206 of the user terminal 20 to select an item to be optimized from the presented evaluation indices, thereby selecting the type of evaluation index to be optimized.

ユーザ端末２０は、選択した評価指標の種別を含むリクエストをサーバ１０に送信する。 The user terminal 20 transmits a request including the selected evaluation index type to the server 10 .

サーバ１０の音声変換部１０４２は、受信したリクエストに含まれる評価指標の種別に基づき、音声処理モデル１０２３を選択する。具体的には、サーバ１０の音声変換部１０４２は、受信した評価指標の種別に応じて、より大きな評価指標を得るために最適化（学習）された音声処理モデル１０２３を選択する。
このとき、サーバ１０の音声変換部１０４２は、音声変換処理（第二実施例）のステップＳ３０２、ステップＳ３０３と同様に、通話に関する通話属性を取得し、取得した通話属性に基づき、複数の音声処理のうち所定の音声処理を選択する。 The speech conversion unit 1042 of the server 10 selects the speech processing model 1023 based on the type of evaluation index included in the received request. Specifically, the speech conversion unit 1042 of the server 10 selects an optimized (learned) speech processing model 1023 to obtain a larger evaluation index according to the type of the received evaluation index.
At this time, in the same manner as steps S302 and S303 of the voice conversion process (second embodiment), the voice conversion unit 1042 of the server 10 acquires call attributes related to the call and, based on the acquired call attributes, selects a predetermined voice process from among the plurality of voice processes.

また、サーバ１０の音声変換部１０４２は、受信したリクエストに含まれる評価指標の種別に基づき、生成モデル１０２２を選択しても良い。具体的には、サーバ１０の音声変換部１０４２は、受信した評価指標の種別に応じて、より大きな評価指標を得るために最適化（学習）された生成モデル１０２２を選択しても良い。 Also, the speech conversion unit 1042 of the server 10 may select the generative model 1022 based on the type of evaluation index included in the received request. Specifically, the speech conversion unit 1042 of the server 10 may select the optimized (learned) generation model 1022 to obtain a larger evaluation index according to the received evaluation index type.
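Selecting a model by evaluation index type can be sketched as a lookup keyed by the index named in the request. The model identifiers, the request key `evaluation_index`, and the fallback behavior are assumptions; the source does not say what happens when the index type is unknown.

```python
# Hypothetical registry: evaluation index type -> identifier of a model tuned for it.
MODELS_BY_INDEX = {
    "ESTOI": "model-intelligibility",
    "PESQ": "model-quality",
    "credibility": "model-trust",
}


def select_model(request, default="model-quality"):
    """Pick the model tuned for the evaluation index named in the request."""
    index_type = request.get("evaluation_index")
    # Falling back to a default for unknown index types is a design assumption.
    return MODELS_BY_INDEX.get(index_type, default)


chosen = select_model({"evaluation_index": "ESTOI"})
fallback = select_model({"evaluation_index": "unknown"})
```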

つまり、ステップＳ５０３において、サーバ１０の音声変換部１０４２は、ユーザから受け付けた選択指示に基づき、直接的に音声処理ＩＤを受け付けて音声処理内容を特定し選択しても良いし、間接的に音声処理モデル１０２３を用いて音声処理ＩＤを特定し音声処理内容を特定し選択しても良い。また、サーバ１０の音声変換部１０４２は、ユーザから受け付けた選択指示に基づき、生成モデル１０２２を特定し選択しても良い。 That is, in step S503, based on the selection instruction received from the user, the voice conversion unit 1042 of the server 10 may directly receive a voice processing ID and specify and select the voice processing content, or may indirectly specify a voice processing ID using the voice processing model 1023 and then specify and select the voice processing content. The voice conversion unit 1042 of the server 10 may also specify and select the generative model 1022 based on the selection instruction received from the user.

ステップＳ５０４において、サーバ１０の音声変換部１０４２は、ユーザから通話音声を取得し、取得した通話音声を変換する。このとき、サーバ１０の音声変換部１０４２は、ステップＳ５０３において選択した音声処理内容に基づき、取得した通話音声を変換する。なお、サーバ１０の音声変換部１０４２は、ステップＳ５０３において選択した生成モデル１０２２に基づき、取得した通話音声を変換しても良い。
具体的に、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０からユーザにより発話された音声データを逐次的に取得する。サーバ１０の音声変換部１０４２は、ユーザにより発話後、発話された音声データをできるだけ遅延なく取得することが望ましい。
サーバ１０の音声変換部１０４２は、取得した通話属性、音声データに対してステップＳ５０３において選択した音声処理内容を適用し、出力される変換音声データを取得する。また、サーバ１０の音声変換部１０４２は、取得した通話属性、音声データに対してステップＳ５０３において選択した生成モデル１０２２を適用し、出力される変換音声データを取得しても良い。 In step S504, the voice conversion unit 1042 of the server 10 acquires the call voice from the user and converts the acquired call voice. At this time, the voice conversion unit 1042 of the server 10 converts the acquired call voice based on the voice processing content selected in step S503. Note that the voice conversion unit 1042 of the server 10 may convert the acquired call voice based on the generation model 1022 selected in step S503.
Specifically, the voice conversion unit 1042 of the server 10 sequentially acquires the voice data uttered by the user from the voice server (PBX) 40. It is desirable that the voice conversion unit 1042 of the server 10 acquire the uttered voice data with as little delay as possible after the user speaks.
The voice conversion unit 1042 of the server 10 applies the voice processing content selected in step S503 to the acquired call attribute and voice data, and acquires converted voice data to be output. Also, the speech conversion unit 1042 of the server 10 may apply the generation model 1022 selected in step S503 to the acquired call attributes and speech data, and acquire converted speech data to be output.

ステップＳ５０５において、サーバ１０の音声変換部１０４２は、ステップＳ５０４において変換された通話音声を顧客へ出力する。ステップＳ５０５は、音声変換処理（第一実施例）におけるステップＳ１０５と同様であるため説明を省略する。 In step S505, the voice conversion unit 1042 of the server 10 outputs the call voice converted in step S504 to the customer. Since step S505 is the same as step S105 in the voice conversion process (first embodiment), description thereof is omitted.

＜変形例＞
音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）においては、音声変換を行うことができるのはユーザの発話音声のみとし、顧客の発話音声に対しては音声変換を行うことができない構成としても良い。
具体的には、顧客は音声処理などの選択指示を行うことができず、サーバ１０の音声変換部１０４２は、顧客から音声処理に関する選択指示は受け付けない構成としても良い。このとき、サーバ１０の音声変換部１０４２は、顧客の発話音声については変換せずに、ユーザに対して出力する構成としても良い。具体的には、顧客端末５０のマイク５０８１により集音された顧客の音声に関する音声データは、サーバ１０の音声変換部１０４２により変換されずに、ユーザ端末２０のスピーカ２０８２から出力される。
これにより、ユーザは、顧客の音声を変換せずに顧客の音声を確認しつつ顧客との通話を行うことができる。 <Modification>
The voice conversion process (first embodiment), the voice conversion process (second embodiment), and the voice conversion process (third embodiment) may be configured so that only the user's uttered voice can be converted and voice conversion cannot be performed on the customer's uttered voice.
Specifically, the customer cannot issue a selection instruction for voice processing or the like, and the voice conversion unit 1042 of the server 10 may be configured not to accept selection instructions regarding voice processing from the customer. In this case, the voice conversion unit 1042 of the server 10 may be configured to output the customer's uttered voice to the user without converting it. Specifically, voice data of the customer's voice collected by the microphone 5081 of the customer terminal 50 is output from the speaker 2082 of the user terminal 20 without being converted by the voice conversion unit 1042 of the server 10.
As a result, the user can talk with the customer while hearing the customer's voice unconverted.
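The one-way configuration described in this modification can be sketched as follows. This is a minimal illustration only: the function names `route_voice` and `halve_volume`, the speaker labels, and the representation of audio as a list of float samples are assumptions made for the example and do not appear in the disclosure.

```python
from typing import Callable, List

Chunk = List[float]  # one chunk of audio samples (hypothetical representation)


def route_voice(speaker: str, chunk: Chunk,
                convert: Callable[[Chunk], Chunk]) -> Chunk:
    """Convert only the user's speech; pass the customer's speech through."""
    if speaker == "user":
        return convert(chunk)   # user -> customer direction is converted
    if speaker == "customer":
        return chunk            # customer -> user direction is left unchanged
    raise ValueError(f"unknown speaker: {speaker}")


def halve_volume(chunk: Chunk) -> Chunk:
    """Hypothetical stand-in for a selected voice process."""
    return [0.5 * sample for sample in chunk]


to_customer = route_voice("user", [1.0, -1.0], halve_volume)
to_user = route_voice("customer", [1.0, -1.0], halve_volume)
```

Here `to_customer` holds the converted samples, while `to_user` is identical to what the customer said.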

＜音声変換処理（第四実施例）＞
音声変換処理（第四実施例）は、顧客が発話した音声データに対して、ユーザが選択した音声処理モデル１０２３を適用することにより得られる変換音声データをユーザに対して出力する処理である。 <Voice Conversion Processing (Fourth Embodiment)>
The voice conversion process (fourth embodiment) is a process of outputting to the user converted voice data obtained by applying the voice processing model 1023 selected by the user to the voice data uttered by the customer.

＜音声変換処理（第四実施例）の概要＞
音声変換処理（第四実施例）は、ユーザと顧客とが通話可能状態となると開始される。音声変換処理（第四実施例）は、ユーザは音声処理内容を選択し、顧客が発話した音声データに対して選択された音声処理内容を適用して出力される変換音声データをユーザに対して出力する一連の処理である。 <Outline of voice conversion processing (fourth embodiment)>
The voice conversion process (fourth embodiment) is started when the user and the customer become able to talk. It is a series of processes in which the user selects voice processing content, the selected voice processing content is applied to the voice data uttered by the customer, and the resulting converted voice data is output to the user.

＜音声変換処理（第四実施例）の詳細＞
ユーザと顧客とが通話可能状態となると音声変換処理（第四実施例）が開始される。
ユーザから受け付けた選択指示に基づき、複数の音声処理のうち所定の音声処理を選択する。例えば、複数の音声処理のうち、ユーザにとってより聴きやすい音声となるような所定の音声処理を選択しても良い。例えば、複数の音声処理のうち、ユーザにとってより快適性が得られるような所定の音声処理を選択しても良い。例えば、複数の音声処理のうち、ユーザの好みの音声、ストレス値が小さくなる、興趣性が高まるような所定の音声処理を選択しても良い。例えば、顧客が怒っている場合などには、ユーザは、複数の音声処理のうち、抑揚を小さくしたり、音声を小さくする音声処理を選択することにより、顧客との応対に伴う心理的ストレスを低減させることができる。
ユーザは、ユーザと顧客との通話中の任意のタイミングで音声処理を選択しても構わない。また、ユーザは、通話の開始前に音声処理を予め選択しておく構成としても構わない。
具体的に、ユーザはユーザ端末２０の入力装置２０６を操作して、適用を希望する音声処理内容に関する音声処理ＩＤを含むリクエストをサーバ１０へ送信する。サーバ１０の音声変換部１０４２は、受信したリクエストに含まれる音声処理ＩＤに基づき、音声処理テーブル１０１５の音声処理ＩＤの項目を検索し、音声処理内容を取得する。つまり、サーバ１０の音声変換部１０４２は、ユーザから受け付けた選択指示に基づき音声処理内容を特定し選択する。 <Details of voice conversion processing (fourth embodiment)>
When the user and the customer are ready to talk, the voice conversion process (fourth embodiment) is started.
A predetermined voice process is selected from among a plurality of voice processes based on a selection instruction received from the user. For example, a voice process that makes the voice easier for the user to hear may be selected. Alternatively, a voice process that gives the user more comfort may be selected, or a voice process that yields a voice the user prefers, lowers the stress value, or increases interest may be selected. For example, when the customer is angry, the user can reduce the psychological stress of dealing with the customer by selecting a voice process that flattens the intonation or lowers the volume.
The user may select voice processing at any time during the call between the user and the customer. Also, the user may be configured to select voice processing in advance before starting a call.
Specifically, the user operates the input device 206 of the user terminal 20 to transmit to the server 10 a request including the voice processing ID of the voice processing content the user wishes to apply. The voice conversion unit 1042 of the server 10 searches the voice processing ID item of the voice processing table 1015 using the voice processing ID included in the received request, and acquires the voice processing content. That is, the voice conversion unit 1042 of the server 10 identifies and selects the voice processing content based on the selection instruction received from the user.
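The lookup by voice processing ID can be illustrated with a small sketch. The table contents, the field names (`gain`, `intonation`), the ID values, and the request key `voice_processing_id` are all hypothetical; the disclosure only states that the voice processing table 1015 is searched by voice processing ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProcessing:
    processing_id: str
    description: str
    gain: float        # hypothetical parameter: volume multiplier
    intonation: float  # hypothetical parameter: intonation scale


# Hypothetical stand-in for the voice processing table 1015.
VOICE_PROCESSING_TABLE = {
    "P001": VoiceProcessing("P001", "pass-through", 1.0, 1.0),
    "P002": VoiceProcessing("P002", "softer, flatter voice", 0.6, 0.5),
}


def select_voice_processing(request: dict) -> VoiceProcessing:
    """Return the processing content named by the request's voice processing ID."""
    processing_id = request["voice_processing_id"]
    try:
        return VOICE_PROCESSING_TABLE[processing_id]
    except KeyError:
        raise ValueError(f"unknown voice processing ID: {processing_id}") from None


selected = select_voice_processing({"voice_processing_id": "P002"})
```

Rejecting an unknown ID explicitly is a design choice of this sketch; the disclosure does not say how such a request is handled.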

＜変形例＞
なお、ユーザは、ユーザ端末２０の入力装置２０６を操作して、最適化したい評価指標の種別を選択する構成としても良い。具体的に、ユーザは、ユーザ端末２０の入力装置２０６を操作して、ＳＩＩＢ、ＨＡＳＰＩ、ＥＳＴＯＩ、ＰＥＳＱ、ＶｉＳＱＯＬなどの評価指標や、信頼性、信用性、心地良さ、快適性、好み、ストレス値、威圧度、興趣性などの評価指標を選択しても良い。例えば、ユーザは、ユーザ端末２０の入力装置２０６を操作して、ユーザがより聴きやすい（聴き取りやすい）といった選択肢や、ユーザがより快適性が得られるといった選択肢を選択する構成としても良い。
なお、ユーザ端末２０のディスプレイ２０８１は、ユーザが選択可能な評価指標をユーザに対して一覧して提示する構成としても良い。ユーザは、ユーザ端末２０の入力装置２０６を操作して、一覧して提示された評価指標から最適化したい項目を選択することにより、最適化したい評価指標の種別を選択する構成としても良い。
ユーザ端末２０は、選択した評価指標の種別を含むリクエストをサーバ１０に送信する。
サーバ１０の音声変換部１０４２は、受信したリクエストに含まれる評価指標の種別に基づき、音声処理モデル１０２３を選択する。具体的には、サーバ１０の音声変換部１０４２は、受信した評価指標の種別に応じて、より大きな評価指標を得るために最適化（学習）された音声処理モデル１０２３を選択する。
このとき、サーバ１０の音声変換部１０４２は、音声変換処理（第二実施例）のステップＳ３０２、ステップＳ３０３と同様に、通話に関する通話属性を取得し、取得した通話属性に基づき、複数の音声処理のうち所定の音声処理を選択する。なお、このとき、通話属性としては、音声変換処理（第二実施例）のステップＳ３０２、ステップＳ３０３と異なり、ユーザに関する属性情報と、顧客に関する属性情報とを入れ替えて適用する。音声変換処理（第四実施例）においては、顧客が音声データの話者となり、ユーザが変換音声データの聞き手となるためである。 <Modification>
Note that the user may operate the input device 206 of the user terminal 20 to select the type of evaluation index to be optimized. Specifically, the user may operate the input device 206 of the user terminal 20 to select an evaluation index such as SIIB, HASPI, ESTOI, PESQ, or ViSQOL, or an evaluation index such as reliability, credibility, pleasantness, comfort, preference, stress value, degree of intimidation, or interest. For example, the user may operate the input device 206 of the user terminal 20 to select an option such as making the voice easier for the user to hear or giving the user more comfort.
Note that the display 2081 of the user terminal 20 may be configured to present a list of user-selectable evaluation indices to the user. The user may operate the input device 206 of the user terminal 20 to select the item to be optimized from the listed evaluation indices, thereby selecting the type of evaluation index to be optimized.
The user terminal 20 transmits a request including the selected evaluation index type to the server 10.
The voice conversion unit 1042 of the server 10 selects the voice processing model 1023 based on the evaluation index type included in the received request. Specifically, the voice conversion unit 1042 of the server 10 selects, according to the received evaluation index type, the voice processing model 1023 that has been optimized (trained) to obtain a larger value of that evaluation index.
At this time, as in steps S302 and S303 of the voice conversion process (second embodiment), the voice conversion unit 1042 of the server 10 acquires call attributes related to the call and selects a predetermined voice process from among the plurality of voice processes based on the acquired call attributes. Unlike steps S302 and S303 of the voice conversion process (second embodiment), however, the attribute information regarding the user and the attribute information regarding the customer are swapped when applied as the call attributes. This is because, in the voice conversion process (fourth embodiment), the customer is the speaker of the voice data and the user is the listener of the converted voice data.

つまり、サーバ１０の音声変換部１０４２は、ユーザから受け付けた選択指示に基づき、直接的に音声処理ＩＤを受け付けて音声処理内容を特定し選択しても良いし、間接的に音声処理モデル１０２３を用いて音声処理ＩＤを特定し音声処理内容を特定し選択しても良い。 That is, based on the selection instruction received from the user, the voice conversion unit 1042 of the server 10 may directly receive a voice processing ID and thereby identify and select the voice processing content, or may indirectly identify the voice processing ID using the voice processing model 1023 and then identify and select the voice processing content.
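The two selection paths (direct, by voice processing ID, and indirect, by evaluation index type with the user and customer attributes swapped) might be sketched as below. The rule-based functions stand in for trained voice processing models 1023; their names, the attribute keys, the request keys, and the ID values are all assumptions made for illustration.

```python
from typing import Callable, Dict

CallAttributes = Dict[str, dict]


def swap_roles(attrs: CallAttributes) -> CallAttributes:
    """Exchange user and customer attributes (the customer is now the speaker)."""
    swapped = dict(attrs)
    swapped["user"], swapped["customer"] = attrs["customer"], attrs["user"]
    return swapped


# Hypothetical stand-ins for models 1023, one per evaluation index type.
# Each treats the "user" slot as the speaker, as in the second embodiment.
def stress_model(attrs: CallAttributes) -> str:
    return "P_SOFT" if attrs["user"].get("emotion") == "angry" else "P_NEUTRAL"


def intelligibility_model(attrs: CallAttributes) -> str:
    return "P_CLEAR"


MODELS: Dict[str, Callable[[CallAttributes], str]] = {
    "stress": stress_model,
    "intelligibility": intelligibility_model,
}


def resolve_processing_id(request: dict, call_attributes: CallAttributes) -> str:
    """Return a voice processing ID chosen directly or via a model."""
    if "voice_processing_id" in request:          # direct selection
        return request["voice_processing_id"]
    model = MODELS[request["evaluation_index"]]   # indirect selection
    return model(swap_roles(call_attributes))


call_attributes = {"user": {"emotion": "calm"}, "customer": {"emotion": "angry"}}
direct = resolve_processing_id({"voice_processing_id": "P002"}, call_attributes)
indirect = resolve_processing_id({"evaluation_index": "stress"}, call_attributes)
```

Because the attributes are swapped before the model is applied, the angry customer occupies the speaker slot and the sketch returns the softening process.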

サーバ１０の音声変換部１０４２は、顧客から通話音声を取得し、取得した通話音声を変換する。このとき、サーバ１０の音声変換部１０４２は、選択した音声処理内容に基づき、取得した通話音声を変換する。
具体的に、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０から顧客により発話された音声データを逐次的に取得する。サーバ１０の音声変換部１０４２は、顧客により発話後、発話された音声データをできるだけ遅延なく取得することが望ましい。
サーバ１０の音声変換部１０４２は、取得した通話属性、音声データに対して選択した音声処理内容を適用し、出力される変換音声データを取得する。 The voice conversion unit 1042 of the server 10 acquires call voice from the customer and converts the acquired call voice. At this time, the voice conversion unit 1042 of the server 10 converts the acquired call voice based on the selected voice processing content.
Specifically, the voice conversion unit 1042 of the server 10 sequentially acquires voice data uttered by the customer from the voice server (PBX) 40 . After the customer speaks, the voice conversion unit 1042 of the server 10 preferably acquires voice data spoken by the customer with as little delay as possible.
The voice conversion unit 1042 of the server 10 applies the selected voice processing content to the acquired call attribute and voice data, and acquires converted voice data to be output.
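Sequential acquisition and conversion with as little delay as possible suggests a chunk-by-chunk pipeline. The following is a minimal sketch under that assumption: audio arrives as an iterable of sample chunks and each chunk is converted as soon as it is received. The names `convert_stream` and `make_gain_processing`, and the use of a simple gain as the voice process, are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]  # one chunk of audio samples (hypothetical representation)


def make_gain_processing(gain: float) -> Callable[[Chunk], Chunk]:
    """Build a toy voice process that scales the volume by `gain`."""
    def process(chunk: Chunk) -> Chunk:
        return [gain * sample for sample in chunk]
    return process


def convert_stream(chunks: Iterable[Chunk],
                   process: Callable[[Chunk], Chunk]) -> Iterator[Chunk]:
    """Yield each converted chunk as soon as its input chunk arrives."""
    for chunk in chunks:
        yield process(chunk)


incoming = [[1.0, -1.0], [0.5]]  # e.g. chunks received from the voice server
converted = list(convert_stream(incoming, make_gain_processing(0.5)))
```

A generator is used so that output for the first chunk is available before later chunks have been spoken, which is the property the passage asks for.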

サーバ１０の音声変換部１０４２は、ステップＳ５０４において変換された通話音声をユーザへ出力する。ステップＳ５０５は、ユーザと顧客が入れ替わっていることを除き、音声変換処理（第一実施例）におけるステップＳ１０５と同様であるため説明を省略する。 The voice conversion unit 1042 of the server 10 outputs the call voice converted in step S504 to the user. Step S505 is the same as step S105 in the voice conversion process (first embodiment) except that the user and the customer are replaced, so the description is omitted.

＜変形例＞
音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）のそれぞれステップＳ１０４、Ｓ３０４、Ｓ５０４において、上述したユーザと顧客がルームとよばれる仮想的な通話空間内で通話を行う場合は、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０を介さずに、サーバ１０が受け付けたユーザにより発話された音声データを逐次的に取得する構成としても良い。同様に、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０を介さずに、サーバ１０が受け付けた顧客により発話された音声データを逐次的に取得する構成としても良い。 <Modification>
In steps S104, S304, and S504 of the voice conversion process (first embodiment), the voice conversion process (second embodiment), and the voice conversion process (third embodiment), respectively, when the above-described user and customer talk in a virtual call space called a room, the voice conversion unit 1042 of the server 10 may be configured to sequentially acquire the voice data uttered by the user and received by the server 10, without going through the voice server (PBX) 40. Similarly, the voice conversion unit 1042 of the server 10 may be configured to sequentially acquire the voice data uttered by the customer and received by the server 10, without going through the voice server (PBX) 40.

同様に、音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）のそれぞれステップＳ１０５、Ｓ３０５、Ｓ５０５において、上述したユーザと顧客がルームとよばれる仮想的な通話空間内で通話を行う場合は、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０を介さずに、変換音声データを顧客端末５０に対して出力する構成としても良い。つまり、音声変換処理（第一実施例）、音声変換処理（第二実施例）、音声変換処理（第三実施例）において音声サーバ（ＰＢＸ）４０は必須の構成要件ではない。 Similarly, in steps S105, S305, and S505 of the voice conversion process (first embodiment), the voice conversion process (second embodiment), and the voice conversion process (third embodiment), respectively, when the above-described user and customer talk in a virtual call space called a room, the voice conversion unit 1042 of the server 10 may be configured to output the converted voice data to the customer terminal 50 without going through the voice server (PBX) 40. That is, the voice server (PBX) 40 is not an essential component in the voice conversion process (first embodiment), the voice conversion process (second embodiment), or the voice conversion process (third embodiment).

音声変換処理（第四実施例）において、上述したユーザと顧客がルームとよばれる仮想的な通話空間内で通話を行う場合は、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０を介さずに、サーバ１０が受け付けた顧客により発話された音声データを逐次的に取得する構成としても良い。同様に、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０を介さずに、サーバ１０が受け付けたユーザにより発話された音声データを逐次的に取得する構成としても良い。 In the voice conversion process (fourth embodiment), when the above-described user and customer talk in a virtual call space called a room, the voice conversion unit 1042 of the server 10 may be configured to sequentially acquire the voice data uttered by the customer and received by the server 10, without going through the voice server (PBX) 40. Similarly, the voice conversion unit 1042 of the server 10 may be configured to sequentially acquire the voice data uttered by the user and received by the server 10, without going through the voice server (PBX) 40.

同様に、音声変換処理（第四実施例）において、上述したユーザと顧客がルームとよばれる仮想的な通話空間内で通話を行う場合は、サーバ１０の音声変換部１０４２は、音声サーバ（ＰＢＸ）４０を介さずに、変換音声データをユーザ端末２０に対して出力する構成としても良い。つまり、音声変換処理（第四実施例）において音声サーバ（ＰＢＸ）４０は必須の構成要件ではない。 Similarly, in the voice conversion process (fourth embodiment), when the above-described user and customer talk in a virtual call space called a room, the voice conversion unit 1042 of the server 10 may be configured to output the converted voice data to the user terminal 20 without going through the voice server (PBX) 40. That is, the voice server (PBX) 40 is not an essential component in the voice conversion process (fourth embodiment).

＜学習処理＞
生成モデル１０２２、音声処理モデル１０２３の学習処理を以下に説明する。なお、以下の学習処理は特定の評価指標（例えば、第１指標）に対する学習処理に関するもので、複数の評価指標を用いる場合は、第１指標、第２指標などの評価種別ごとに用意された複数の生成モデル１０２２、音声処理モデル１０２３のそれぞれに対して学習処理が行われる。 <Learning processing>
The learning processes of the generative model 1022 and the voice processing model 1023 are described below. The following describes the learning process for one specific evaluation index (for example, a first index). When a plurality of evaluation indices are used, the learning process is performed on each of the plurality of generative models 1022 and voice processing models 1023 prepared for each evaluation type, such as the first index and the second index.

＜生成モデル１０２２の学習処理＞
生成モデル１０２２の学習処理は、生成モデル１０２２に含まれるディープニューラルネットワークの学習パラメータを深層学習により学習させる処理である。 <Learning processing of generative model 1022>
The learning process of the generative model 1022 is a process of learning the learning parameters of the deep neural network included in the generative model 1022 by deep learning.

＜生成モデル１０２２の学習処理の概要＞
生成モデル１０２２の学習処理は、ユーザに関する属性情報、顧客に関する属性情報、通話に関する属性情報、音声データを入力データ（入力ベクトル）として、より大きな評価指標が得られる変換音声データを出力するように、生成モデル１０２２に含まれるディープニューラルネットワークの学習パラメータを深層学習により学習させる処理である。 <Overview of Learning Processing of Generative Model 1022>
The learning process of the generative model 1022 uses attribute information about users, attribute information about customers, attribute information about calls, and voice data as input data (input vectors) so that converted voice data that provides a larger evaluation index is output. This is processing for learning the learning parameters of the deep neural network included in the generative model 1022 by deep learning.

＜生成モデル１０２２の学習処理の詳細＞
サーバ１０の学習部１０５１は、通話属性、音声データの各項目を学習用データセット１０３１から取得する。サーバ１０の学習部１０５１は、通話属性、音声データを入力データとして、生成モデル１０２２に含まれるディープニューラルネットワークの学習パラメータを変化させながら適用し、複数の変換音声データを生成する。
このとき、サーバ１０の学習部１０５１は、通話属性として、ユーザに関する属性情報、顧客に関する属性情報、通話に関する属性情報の少なくとも１つを入力データに含めて、それ以外を除外して学習処理を実行しても構わない。サーバ１０の学習部１０５１は、通話属性として、ユーザ属性、ユーザの所属する組織の組織名または組織属性、ユーザの感情情報、顧客属性、顧客の所属する組織の組織名または組織属性、顧客の感情情報、通話カテゴリ、受発信者種別、通話の感情情報のいずれか１つを入力データに含めて、それ以外を除外して学習処理を実行しても構わない。
サーバ１０の学習部１０５１は、通話属性、音声データを入力データとして、評価モデル１０２１に適用することにより、複数の変換音声データのそれぞれに対する聞き手側における評価指標が得られる。サーバ１０の学習部１０５１は、より大きな評価指標が得られるように生成モデル１０２２に含まれるディープニューラルネットワークの学習パラメータを最適化する。
これにより、通話属性、音声データを入力データとして、より大きな評価指標が得られる変換音声データを出力するような生成モデル１０２２を得ることができる。 <Details of the learning process of the generative model 1022>
The learning unit 1051 of the server 10 acquires each item of call attributes and voice data from the learning data set 1031 . The learning unit 1051 of the server 10 uses the call attribute and voice data as input data and applies them while changing the learning parameters of the deep neural network included in the generation model 1022 to generate a plurality of converted voice data.
At this time, the learning unit 1051 of the server 10 may execute the learning process with at least one of the attribute information regarding the user, the attribute information regarding the customer, and the attribute information regarding the call included in the input data as call attributes, and the rest excluded. The learning unit 1051 of the server 10 may also execute the learning process with any one of the following included in the input data as a call attribute and the rest excluded: the user attribute, the organization name or organization attribute of the organization to which the user belongs, the user's emotion information, the customer attribute, the organization name or organization attribute of the organization to which the customer belongs, the customer's emotion information, the call category, the caller/recipient type, and the emotion information of the call.
The learning unit 1051 of the server 10 applies the call attribute and voice data as input data to the evaluation model 1021, thereby obtaining an evaluation index on the listener side for each of the plurality of converted voice data. The learning unit 1051 of the server 10 optimizes the learning parameters of the deep neural network included in the generative model 1022 so as to obtain a larger evaluation index.
As a result, it is possible to obtain a generative model 1022 that outputs converted speech data that provides a larger evaluation index by using call attributes and speech data as input data.
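The idea of adjusting the generator's parameters so that an evaluation model scores its output more highly can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "generative model" is reduced to a single gain parameter, the "evaluation model" is a hand-written function that prefers a listener-specific loudness (`preferred_rms`), and a simple step-halving search replaces deep learning. It is not the training procedure of the disclosure.

```python
import math
from typing import Dict, List, Tuple

Voice = List[float]
Example = Tuple[Dict[str, float], Voice]  # (call attributes, voice data)


def generate(gain: float, voice: Voice) -> Voice:
    """Toy generative model: its only learnable parameter is `gain`."""
    return [gain * sample for sample in voice]


def evaluation_model(attrs: Dict[str, float], converted: Voice) -> float:
    """Toy evaluation index: higher when loudness matches the listener's preference."""
    rms = math.sqrt(sum(s * s for s in converted) / len(converted))
    return -(rms - attrs["preferred_rms"]) ** 2


def mean_score(gain: float, dataset: List[Example]) -> float:
    scores = [evaluation_model(a, generate(gain, v)) for a, v in dataset]
    return sum(scores) / len(scores)


def train(dataset: List[Example], gain: float = 1.0,
          step: float = 0.5, rounds: int = 20) -> float:
    """Vary the parameter and keep whichever value scores best."""
    for _ in range(rounds):
        candidates = [gain - step, gain, gain + step]
        best = max(candidates, key=lambda g: mean_score(g, dataset))
        if best == gain:
            step /= 2  # no neighbour is better: search more finely
        gain = best
    return gain


dataset = [
    ({"preferred_rms": 0.3}, [0.6, -0.6]),
    ({"preferred_rms": 0.2}, [0.4, -0.4]),
]
trained_gain = train(dataset)
```

In this toy data both examples are best served by halving the volume, so the search settles near a gain of 0.5. A real generative model would also condition its output on the call attributes, which this sketch passes only to the evaluation function.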

サーバ１０の学習部１０５１は、生成モデル１０２２をＧＡＮ（ＧｅｎｅｒａｔｉｖｅＡｄｖｅｒｓａｒｉａｌＮｅｔｗｏｒｋ、敵対的生成ネットワーク）などの任意の学習モデルとして構成しても良い。 The learning unit 1051 of the server 10 may configure the generative model 1022 as an arbitrary learning model such as a GAN (Generative Adversarial Network).

＜音声処理モデル１０２３の学習処理＞
音声処理モデル１０２３の学習処理は、音声処理モデル１０２３に含まれるディープニューラルネットワークの学習パラメータを深層学習により学習させる処理である。 <Learning processing of the speech processing model 1023>
The learning process of the speech processing model 1023 is a process of learning the learning parameters of the deep neural network included in the speech processing model 1023 by deep learning.

＜音声処理モデル１０２３の学習処理の概要＞
音声処理モデル１０２３の学習処理は、ユーザ属性、顧客属性、通話カテゴリ、受発信種別などの通話属性を入力データ（入力ベクトル）として、より大きな評価指標が得られる音声処理内容を出力するように、音声処理モデル１０２３に含まれるディープニューラルネットワークの学習パラメータを深層学習により学習させる処理である。
具体的には、音声処理モデル１０２３の学習処理における音声処理モデル１０２３は、通話属性を入力データ（入力ベクトル）として、より大きな評価指標が得られる音声処理テーブル１０１５における音声処理ＩＤを出力する学習モデルである。 <Overview of Learning Processing of Audio Processing Model 1023>
The learning process of the voice processing model 1023 uses call attributes such as user attributes, customer attributes, call categories, and incoming and outgoing call types as input data (input vectors), and outputs voice processing content that can obtain a larger evaluation index. This is processing for learning the learning parameters of the deep neural network included in the speech processing model 1023 by deep learning.
Specifically, the voice processing model 1023 trained in this learning process is a learning model that takes call attributes as input data (input vector) and outputs the voice processing ID in the voice processing table 1015 that yields a larger evaluation index.

＜音声処理モデル１０２３の学習処理の詳細＞
サーバ１０の学習部１０５１は、学習用データセット１０３１に含まれる通話属性に関連づけられた音声データに対して、音声処理テーブル１０１５に格納されている音声処理内容をそれぞれ適用した複数の音声処理データを生成する。
サーバ１０の学習部１０５１は、通話属性を入力データとして、当該通話属性に関連付けられた複数の音声処理データを評価モデル１０２１に適用することにより、複数の音声処理データのそれぞれに対する評価指標が得られる。サーバ１０の学習部１０５１は、より大きな評価指標が得られるような音声処理ＩＤを得られるように、音声処理モデル１０２３に含まれるディープニューラルネットワークの学習パラメータを最適化する。
これにより、通話属性を入力データとして、より大きな評価指標が得られる音声処理ＩＤを出力するような音声処理モデル１０２３を得ることができる。 <Details of the learning process of the speech processing model 1023>
The learning unit 1051 of the server 10 generates a plurality of voice processing data by applying each of the voice processing contents stored in the voice processing table 1015 to the voice data associated with the call attributes included in the learning data set 1031.
The learning unit 1051 of the server 10 applies, with the call attributes as input data, the plurality of voice processing data associated with those call attributes to the evaluation model 1021, thereby obtaining an evaluation index for each of the plurality of voice processing data. The learning unit 1051 of the server 10 optimizes the learning parameters of the deep neural network included in the voice processing model 1023 so as to output the voice processing ID that yields a larger evaluation index.
As a result, it is possible to obtain a voice processing model 1023 that outputs a voice processing ID that provides a larger evaluation index using call attributes as input data.
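The procedure above (apply every voice process to each training utterance, score the results, and learn which voice processing ID scores best for given call attributes) can be sketched in miniature. The table of gains, the hand-written evaluation function, the attribute keys, and the use of a per-category majority vote in place of a deep neural network are all assumptions of this example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Voice = List[float]
Example = Tuple[dict, Voice]  # (call attributes, voice data)

# Hypothetical stand-in for the voice processing table 1015: ID -> gain.
PROCESSINGS: Dict[str, float] = {"P_LOUD": 1.5, "P_NEUTRAL": 1.0, "P_SOFT": 0.5}


def evaluation_model(attrs: dict, processed: Voice) -> float:
    """Toy evaluation index: higher when the peak level suits the listener."""
    peak = max(abs(sample) for sample in processed)
    return -abs(peak - attrs["preferred_peak"])


def best_processing_id(attrs: dict, voice: Voice) -> str:
    """Apply every voice process and return the ID with the best score."""
    scores = {
        pid: evaluation_model(attrs, [gain * s for s in voice])
        for pid, gain in PROCESSINGS.items()
    }
    return max(scores, key=scores.get)


def train(dataset: List[Example]) -> Dict[str, str]:
    """Learn, per call category, the ID that most often scores best."""
    votes: Dict[str, Counter] = {}
    for attrs, voice in dataset:
        winner = best_processing_id(attrs, voice)
        votes.setdefault(attrs["call_category"], Counter())[winner] += 1
    return {cat: counts.most_common(1)[0][0] for cat, counts in votes.items()}


dataset = [
    ({"call_category": "complaint", "preferred_peak": 0.4}, [0.8, -0.7]),
    ({"call_category": "sales", "preferred_peak": 0.9}, [0.6, -0.5]),
]
model = train(dataset)
```

The resulting `model` maps a call attribute (here only the call category) to a voice processing ID, which mirrors the input and output of the voice processing model 1023 described above.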

なお、音声処理モデル１０２３の学習処理においては、入力データ（入力ベクトル）に通話属性に加えて音声データを含めても構わない。 In the learning process of the speech processing model 1023, the input data (input vector) may include speech data in addition to call attributes.

＜コンピュータの基本ハードウェア構成＞
図１６は、コンピュータ９０の基本的なハードウェア構成を示すブロック図である。コンピュータ９０は、プロセッサ９０１、主記憶装置９０２、補助記憶装置９０３、通信ＩＦ９９１（インタフェース、Interface）を少なくとも備える。これらは通信バス９２１により相互に電気的に接続される。 <Computer hardware configuration>
FIG. 16 is a block diagram showing the basic hardware configuration of the computer 90. The computer 90 includes at least a processor 901, a main storage device 902, an auxiliary storage device 903, and a communication IF 991 (interface). These are electrically connected to each other by a communication bus 921.

プロセッサ９０１とは、プログラムに記述された命令セットを実行するためのハードウェアである。プロセッサ９０１は、演算装置、レジスタ、周辺回路等から構成される。 The processor 901 is hardware for executing an instruction set described in a program. The processor 901 is composed of an arithmetic unit, registers, peripheral circuits, and the like.

主記憶装置９０２とは、プログラム、及びプログラム等で処理されるデータ等を一時的に記憶するためのものである。例えば、ＤＲＡＭ（Dynamic Random Access Memory）等の揮発性のメモリである。 The main storage device 902 is for temporarily storing programs and data processed by the programs. For example, it is a volatile memory such as a DRAM (Dynamic Random Access Memory).

補助記憶装置９０３とは、データ及びプログラムを保存するための記憶装置である。例えば、フラッシュメモリ、ＨＤＤ（Hard Disc Drive）、光磁気ディスク、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、半導体メモリ等である。 Auxiliary storage device 903 is a storage device for storing data and programs. For example, flash memory, HDD (Hard Disc Drive), magneto-optical disk, CD-ROM, DVD-ROM, semiconductor memory, and the like.

通信ＩＦ９９１とは、有線又は無線の通信規格を用いて、他のコンピュータとネットワークを介して通信するための信号を入出力するためのインタフェースである。
ネットワークは、インターネット、ＬＡＮ、無線基地局等によって構築される各種移動通信システム等で構成される。例えば、ネットワークには、３Ｇ、４Ｇ、５Ｇ移動通信システム、ＬＴＥ（Long Term Evolution）、所定のアクセスポイントによってインターネットに接続可能な無線ネットワーク（例えばWi-Fi（登録商標））等が含まれる。無線で接続する場合、通信プロトコルとして例えば、Ｚ－Ｗａｖｅ（登録商標）、ＺｉｇＢｅｅ（登録商標）、Ｂｌｕｅｔｏｏｔｈ（登録商標）等が含まれる。有線で接続する場合は、ネットワークには、ＵＳＢ（Universal Serial Bus）ケーブル等により直接接続するものも含む。 The communication IF 991 is an interface for inputting and outputting signals for communicating with other computers via a network using a wired or wireless communication standard.
The network is composed of various mobile communication systems constructed by the Internet, LAN, wireless base stations, and the like. For example, networks include 3G, 4G, and 5G mobile communication systems, LTE (Long Term Evolution), wireless networks (for example, Wi-Fi (registered trademark)) that can be connected to the Internet through predetermined access points, and the like. For wireless connection, communication protocols include, for example, Z-Wave (registered trademark), ZigBee (registered trademark), Bluetooth (registered trademark), and the like. In the case of wired connection, the network includes direct connection using a USB (Universal Serial Bus) cable or the like.

なお、各ハードウェア構成の全部または一部を複数のコンピュータ９０に分散して設け、ネットワークを介して相互に接続することによりコンピュータ９０を仮想的に実現することができる。このように、コンピュータ９０は、単一の筐体、ケースに収納されたコンピュータ９０だけでなく、仮想化されたコンピュータシステムも含む概念である。 It should be noted that the computer 90 can be virtually realized by distributing all or part of each hardware configuration to a plurality of computers 90 and connecting them to each other via a network. Thus, the computer 90 is a concept that includes not only the computer 90 housed in a single housing or case, but also a virtualized computer system.

＜コンピュータ９０の基本機能構成＞
コンピュータ９０の基本ハードウェア構成（図１６）により実現されるコンピュータの機能構成を説明する。コンピュータは、制御部、記憶部、通信部の機能ユニットを少なくとも備える。 <Basic Functional Configuration of Computer 90>
A functional configuration of the computer realized by the basic hardware configuration (FIG. 16) of the computer 90 will be described. The computer includes at least functional units of a control section, a storage section, and a communication section.

なお、コンピュータ９０が備える機能ユニットは、それぞれの機能ユニットの全部または一部を、ネットワークで相互に接続された複数のコンピュータ９０に分散して設けても実現することができる。コンピュータ９０は、単一のコンピュータ９０だけでなく、仮想化されたコンピュータシステムも含む概念である。 Note that the functional units included in the computer 90 can also be implemented by distributing all or part of each functional unit to a plurality of computers 90 interconnected via a network. The computer 90 is a concept that includes not only a single computer 90 but also a virtualized computer system.

制御部は、プロセッサ９０１が補助記憶装置９０３に記憶された各種プログラムを読み出して主記憶装置９０２に展開し、当該プログラムに従って処理を実行することにより実現される。制御部は、プログラムの種類に応じて様々な情報処理を行う機能ユニットを実現することができる。これにより、コンピュータは情報処理を行う情報処理装置として実現される。 The control unit is implemented by the processor 901 reading out various programs stored in the auxiliary storage device 903, developing them in the main storage device 902, and executing processing according to the programs. The control unit can implement functional units that perform various information processing according to the type of program. Thereby, the computer is implemented as an information processing device that performs information processing.

記憶部は、主記憶装置９０２、補助記憶装置９０３により実現される。記憶部は、データ、各種プログラム、各種データベースを記憶する。また、プロセッサ９０１は、プログラムに従って記憶部に対応する記憶領域を主記憶装置９０２または補助記憶装置９０３に確保することができる。また、制御部は、各種プログラムに従ってプロセッサ９０１に、記憶部に記憶されたデータの追加、更新、削除処理を実行させることができる。 A storage unit is realized by the main storage device 902 and the auxiliary storage device 903 . The storage unit stores data, various programs, and various databases. Also, the processor 901 can secure a storage area corresponding to the storage unit in the main storage device 902 or the auxiliary storage device 903 according to a program. In addition, the control unit can cause the processor 901 to execute addition, update, and deletion processing of data stored in the storage unit according to various programs.

データベースは、リレーショナルデータベースを指し、行と列によって構造的に規定された表形式のテーブル、マスタと呼ばれるデータ集合を、互いに関連づけて管理するためのものである。データベースでは、表をテーブル、マスタ、表の列をカラム、表の行をレコードと呼ぶ。リレーショナルデータベースでは、テーブル、マスタ同士の関係を設定し、関連づけることができる。
通常、各テーブル、各マスタにはレコードを一意に特定するための主キーとなるカラムが設定されるが、カラムへの主キーの設定は必須ではない。制御部は、各種プログラムに従ってプロセッサ９０１に、記憶部に記憶された特定のテーブル、マスタにレコードを追加、削除、更新を実行させることができる。 A database refers to a relational database, which manages tabular data sets structurally defined by rows and columns, called tables or masters, in association with each other. In a database, a table is called a table or a master, a column of the table is called a column, and a row of the table is called a record. In a relational database, relationships between tables and masters can be set so that they are associated with each other.
Normally, each table and each master has a primary key column for uniquely identifying a record, but setting a primary key to a column is not essential. The control unit can cause the processor 901 to add, delete, and update records in specific tables and masters stored in the storage unit according to various programs.

なお、本開示におけるデータベース、マスタは、情報が構造的に規定された任意のデータ構造体（リスト、辞書、連想配列、オブジェクトなど）を含み得る。データ構造体には、データと、任意のプログラミング言語により記述された関数、クラス、メソッドなどを組み合わせることにより、データ構造体と見なし得るデータも含むものとする。 Note that the database and master in the present disclosure may include any data structure (list, dictionary, associative array, object, etc.) in which information is structurally defined. The data structure also includes data that can be regarded as a data structure by combining data with functions, classes, methods, etc. written in any programming language.
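Since the disclosure allows a table or master to be any structured data structure, the add, update, and delete operations on records can be illustrated with a dictionary keyed by a primary key. The class name `Table`, its methods, and the sample column names are hypothetical and chosen only for this sketch.

```python
from typing import Any, Dict


class Table:
    """A minimal in-memory table whose records are identified by a primary key."""

    def __init__(self, primary_key: str) -> None:
        self.primary_key = primary_key
        self._records: Dict[Any, Dict[str, Any]] = {}

    def add(self, record: Dict[str, Any]) -> None:
        key = record[self.primary_key]
        if key in self._records:
            raise KeyError(f"duplicate primary key: {key}")
        self._records[key] = dict(record)  # store a copy of the record

    def update(self, key: Any, **changes: Any) -> None:
        self._records[key].update(changes)

    def delete(self, key: Any) -> None:
        del self._records[key]

    def get(self, key: Any) -> Dict[str, Any]:
        return dict(self._records[key])

    def __len__(self) -> int:
        return len(self._records)


# Hypothetical table resembling a voice processing table.
table = Table(primary_key="voice_processing_id")
table.add({"voice_processing_id": "P001", "content": "pass-through"})
table.add({"voice_processing_id": "P002", "content": "softer voice"})
table.update("P002", content="softer, flatter voice")
table.delete("P001")
```

After these operations the table holds a single record, the updated entry for `P002`.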

通信部は、通信ＩＦ９９１により実現される。通信部は、ネットワークを介して他のコンピュータ９０と通信を行う機能を実現する。通信部は、他のコンピュータ９０から送信された情報を受信し、制御部へ入力することができる。制御部は、各種プログラムに従ってプロセッサ９０１に、受信した情報に対する情報処理を実行させることができる。また、通信部は、制御部から出力された情報を他のコンピュータ９０へ送信することができる。 A communication unit is implemented by the communication IF 991 . The communication unit implements a function of communicating with another computer 90 via a network. The communication section can receive information transmitted from another computer 90 and input it to the control section. The control unit can cause the processor 901 to execute information processing on the received information according to various programs. Also, the communication section can transmit information output from the control section to another computer 90 .

＜付記＞
以上の各実施形態で説明した事項を以下に付記する。 <Appendix>
The items described in the above embodiments will be added below.

（付記１）
プロセッサと、記憶部とを備え、コンピュータに第１ユーザと第２ユーザとの間で行われる通話を行うプログラムであって、プログラムは、プロセッサに、第１ユーザから通話音声を取得する音声取得ステップ（Ｓ１０４、Ｓ３０４）と、音声取得ステップにおいて取得した通話音声を変換する変換ステップ（Ｓ１０４、Ｓ３０４）と、変換ステップにおいて変換された通話音声を第２ユーザへ出力する出力ステップ（Ｓ１０５、Ｓ３０５）と、通話に関する通話属性を取得する属性取得ステップ（Ｓ１０２、S３０２）と、を実行させ、変換ステップは、属性取得ステップにおいて取得した通話属性に基づき、音声取得ステップにおいて取得した通話音声を変換するステップを含む、プログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 1)
A program for causing a computer comprising a processor and a storage unit to carry out a call between a first user and a second user, the program causing the processor to execute: a voice acquisition step (S104, S304) of acquiring call voice from the first user; a conversion step (S104, S304) of converting the call voice acquired in the voice acquisition step; an output step (S105, S305) of outputting the call voice converted in the conversion step to the second user; and an attribute acquisition step (S102, S302) of acquiring call attributes related to the call, wherein the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the call attributes acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.

（付記２）
変換ステップは、属性取得ステップにおいて取得した通話属性および音声取得ステップにおいて取得した通話音声に対して生成モデルを適用することにより、通話音声を変換するステップである、付記１記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 2)
The program according to appendix 1, wherein the conversion step is a step of converting the call voice by applying a generative model to the call attribute acquired in the attribute acquisition step and the call voice acquired in the voice acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.

（付記３）
プログラムは、プロセッサに、属性取得ステップにおいて取得した通話属性に基づき、複数の音声処理のうち所定の音声処理を選択する選択ステップ（Ｓ３０３）と、を実行させ、変換ステップは、音声取得ステップにおいて取得した通話音声に、選択ステップにおいて選択された所定の音声処理を適用することにより、通話音声を変換するステップである、付記１記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 3)
The program according to appendix 1, wherein the program causes the processor to execute a selection step (S303) of selecting a predetermined voice process from among a plurality of voice processes based on the call attributes acquired in the attribute acquisition step, and the conversion step is a step of converting the call voice by applying the predetermined voice process selected in the selection step to the call voice acquired in the voice acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.

（付記４）
選択ステップは、複数の音声処理のうち、第２ユーザにとってより聴きやすい音声となるような所定の音声処理を選択するステップである、付記３記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より聴きやすい音声でユーザと通話を行うことができる。 (Appendix 4)
The program according to appendix 3, wherein the selection step is a step of selecting, from among the plurality of voice processes, a predetermined voice process that makes the voice easier for the second user to hear.
As a result, in a call between a plurality of users, the customer can make a call with the user in a more audible voice according to the call attribute.

（付記５）
選択ステップは、複数の音声処理のうち、第２ユーザにとってより信頼性が得られるような所定の音声処理を選択するステップである、付記３または４記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、通話を通じて、ユーザは顧客に対して信頼感のある印象を与えることができる。 (Appendix 5)
The program according to Appendix 3 or 4, wherein the selection step is a step of selecting, from among the plurality of voice processes, a predetermined voice process that makes the voice sound more trustworthy to the second user.
As a result, in a call between a plurality of users, the user can give the customer a trustworthy impression through the call, according to the call attribute.

（付記６）
選択ステップは、属性取得ステップにおいて取得した通話属性に対して音声処理モデルを適用することにより、所定の音声処理を選択する（Ｓ３０４、Ｓ５０４）ステップである、付記３から５のいずれか記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 6)
The program according to any one of Appendices 3 to 5, wherein the selection step is a step (S304, S504) of selecting the predetermined voice process by applying a voice processing model to the call attribute acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.

（付記７）
プロセッサと、記憶部とを備え、コンピュータに第１ユーザと第２ユーザとの間で行われる通話を行うプログラムであって、プログラムは、プロセッサに、第１ユーザから通話音声を取得する音声取得ステップ（Ｓ５０４）と、音声取得ステップにおいて取得した通話音声を変換する変換ステップ（Ｓ５０４）と、変換ステップにおいて変換された通話音声を第２ユーザへ出力する出力ステップ（Ｓ５０５）と、第１ユーザから受け付けた選択指示に基づき、複数の音声処理のうち所定の音声処理を選択する選択ステップ（Ｓ５０３）と、を実行させ、変換ステップは、音声取得ステップにおいて取得した通話音声に、選択ステップにおいて選択された所定の音声処理を適用することにより、通話音声を変換するステップである、プログラム。
これにより、複数のユーザ間で行われる通話において、ユーザからの選択指示に基づき、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 7)
A program for causing a computer comprising a processor and a storage unit to carry out a call between a first user and a second user, the program causing the processor to execute: a voice acquisition step (S504) of acquiring a call voice from the first user; a conversion step (S504) of converting the call voice acquired in the voice acquisition step; an output step (S505) of outputting the call voice converted in the conversion step to the second user; and a selection step (S503) of selecting a predetermined voice process from among a plurality of voice processes based on a selection instruction received from the first user, wherein the conversion step is a step of converting the call voice by applying the predetermined voice process selected in the selection step to the call voice acquired in the voice acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice based on the selection instruction from the user.

（付記８）
プログラムは、プロセッサに、通話に関する通話属性を取得する属性取得ステップと、を実行させ、選択ステップは、第１ユーザから評価指標の種別の選択指示を受け付けるステップと、属性取得ステップにおいて取得した通話属性および受け付けた評価指標に基づき、複数の音声処理のうち所定の音声処理を選択するステップと、を含む、付記７記載のプログラム。
これにより、複数のユーザ間で行われる通話において、ユーザから選択指示を受け付けた評価指標の種別に基づき、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 8)
The program according to Appendix 7, wherein the program causes the processor to execute an attribute acquisition step of acquiring a call attribute related to the call, and the selection step includes: a step of receiving, from the first user, a selection instruction specifying a type of evaluation index; and a step of selecting the predetermined voice process from among the plurality of voice processes based on the call attribute acquired in the attribute acquisition step and the received evaluation index.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute based on the type of evaluation index selected by the user.

（付記９）
選択ステップは、第１ユーザから最適化したい評価指標の種別の選択指示を受け付けるステップと、属性取得ステップにおいて取得した通話属性および受け付けた評価指標に基づき、複数の音声処理のうち、評価指標を最適化するような所定の音声処理を選択するステップと、を含む、付記８記載のプログラム。
これにより、複数のユーザ間で行われる通話において、ユーザから選択指示を受け付けた評価指標の種別に基づき、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 9)
The program according to Appendix 8, wherein the selection step includes: a step of receiving, from the first user, a selection instruction specifying a type of evaluation index to be optimized; and a step of selecting, from among the plurality of voice processes, a predetermined voice process that optimizes the evaluation index, based on the call attribute acquired in the attribute acquisition step and the received evaluation index.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute based on the type of evaluation index selected by the user.
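One way to read the selection in Appendix 9 is as an argmax: for the evaluation index chosen by the first user, estimate how each candidate voice process would score under the current call attributes, then pick the best one. The Python sketch below assumes that "optimize" means "maximize"; the index names, the process names, the score table, and the `customer_emotion` attribute are all hypothetical stand-ins for whatever predictor an actual system would use.

```python
from typing import Dict

# Hypothetical predicted scores per evaluation index and voice process.
# In a real system these would come from a trained model, not a fixed table.
PREDICTED_SCORES: Dict[str, Dict[str, float]] = {
    "appointment_rate": {"bright": 0.30, "calm": 0.20, "none": 0.10},
    "customer_satisfaction": {"bright": 0.60, "calm": 0.80, "none": 0.50},
}


def predicted_score(index_type: str, process: str, call_attributes: Dict[str, str]) -> float:
    """Estimate the chosen evaluation index for one process under the call attributes."""
    score = PREDICTED_SCORES[index_type][process]
    # Illustrative attribute dependence: a bright voice is assumed to work
    # poorly when the customer is angry.
    if process == "bright" and call_attributes.get("customer_emotion") == "angry":
        score -= 0.25
    return score


def select_voice_process(index_type: str, call_attributes: Dict[str, str]) -> str:
    """Pick the process with the highest predicted value of the chosen index."""
    candidates = PREDICTED_SCORES[index_type]
    return max(candidates, key=lambda p: predicted_score(index_type, p, call_attributes))
```

With this table, choosing `"appointment_rate"` normally selects `"bright"`, while the same index with an angry customer selects `"calm"`, which shows the selection depending on both inputs named in the appendix.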

（付記１０）
選択ステップは、第１ユーザから評価指標の種別の選択指示を受け付けるステップと、属性取得ステップにおいて取得した通話属性および受け付けた評価指標に対して音声処理モデルを適用することにより、複数の音声処理のうち所定の音声処理を選択するステップと、を含む、付記８または９記載のプログラム。
これにより、複数のユーザ間で行われる通話において、ユーザから選択指示を受け付けた評価指標の種別に基づき、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 10)
The program according to Appendix 8 or 9, wherein the selection step includes: a step of receiving, from the first user, a selection instruction specifying a type of evaluation index; and a step of selecting the predetermined voice process from among the plurality of voice processes by applying a voice processing model to the call attribute acquired in the attribute acquisition step and the received evaluation index.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute based on the type of evaluation index selected by the user.

（付記１１）
音声取得ステップは、第２ユーザから第２通話音声を取得するステップを含み、選択ステップは、第２ユーザからは選択指示を受け付けることができず、変換ステップは、音声取得ステップにおいて取得した第２通話音声は変換しないステップである、付記８から１０のいずれか記載のプログラム。
これにより、第２ユーザの音声を変換せずに、第１ユーザは、第２ユーザの音声を確認しつつ第２ユーザとの通話を行うことができる。 (Appendix 11)
The program according to any one of Appendices 8 to 10, wherein the voice acquisition step includes a step of acquiring a second call voice from the second user, the selection step cannot accept a selection instruction from the second user, and the conversion step does not convert the second call voice acquired in the voice acquisition step.
Accordingly, because the second user's voice is not converted, the first user can talk with the second user while hearing the second user's voice as it is.
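The asymmetry described in Appendix 11 (only the first user may select a voice process, and only the first user's voice is converted) can be sketched as a small session object. This is an illustrative Python sketch, not the disclosed implementation; the class name `CallSession` and the sender labels `"first_user"` and `"second_user"` are hypothetical.

```python
from typing import Callable, List, Optional

Samples = List[float]
VoiceProcess = Callable[[Samples], Samples]


class CallSession:
    """Holds the voice process selected for one call."""

    def __init__(self) -> None:
        self.selected: Optional[VoiceProcess] = None

    def accept_selection(self, sender: str, process: VoiceProcess) -> bool:
        """Accept a selection instruction only from the first user."""
        if sender != "first_user":
            return False  # instructions from the second user are rejected
        self.selected = process
        return True

    def route(self, sender: str, samples: Samples) -> Samples:
        """Return the audio to deliver to the other party.

        Only the first user's voice is converted; the second user's voice
        is passed through unchanged.
        """
        if sender == "first_user" and self.selected is not None:
            return self.selected(samples)
        return list(samples)
```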

（付記１２）
属性取得ステップは、第２ユーザに関する属性情報を取得するステップを含み、変換ステップは、属性取得ステップにおいて取得した第２ユーザに関する属性情報に基づき、音声取得ステップにおいて取得した通話音声を変換するステップを含む、付記１から６、８から１１のいずれか記載のプログラム。
これにより、複数のユーザ間で行われる通話において、顧客に関する属性情報に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 12)
The program according to any one of Appendices 1 to 6 and 8 to 11, wherein the attribute acquisition step includes a step of acquiring attribute information about the second user, and the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the attribute information about the second user acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the customer-related attribute information.

（付記１３）
属性取得ステップは、第１ユーザに関する属性情報を取得するステップを含み、変換ステップは、属性取得ステップにおいて取得した第１ユーザに関する属性情報に基づき、音声取得ステップにおいて取得した通話音声を変換するステップを含む、付記１から６、８から１２のいずれか記載のプログラム。
これにより、複数のユーザ間で行われる通話において、ユーザに関する属性情報に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 13)
The program according to any one of Appendices 1 to 6 and 8 to 12, wherein the attribute acquisition step includes a step of acquiring attribute information about the first user, and the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the attribute information about the first user acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the user attribute information.

（付記１４）
属性取得ステップは、通話に関する属性情報を取得するステップを含み、変換ステップは、属性取得ステップにおいて取得した通話に関する属性情報に基づき、音声取得ステップにおいて取得した通話音声を変換するステップを含む、付記１から６、８から１３のいずれか記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話に関する属性情報に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 14)
The program according to any one of Appendices 1 to 6 and 8 to 13, wherein the attribute acquisition step includes a step of acquiring attribute information about the call, and the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the attribute information about the call acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the attribute information regarding the call.

（付記１５）
属性取得ステップは、通話に関する属性情報を取得するステップを含み、変換ステップは、属性取得ステップにおいて取得した通話におけるユーザまたは顧客の感情に関する情報に基づき、音声取得ステップにおいて取得した通話音声を変換するステップを含む、付記１から６、８から１４のいずれか記載のプログラム。
これにより、複数のユーザ間で行われる通話において、ユーザまたは顧客の感情情報に応じて、顧客は、より適した音声でユーザと通話を行うことができる。例えば、ユーザまたは顧客の感情状態に応じて、より適した音声でユーザと通話を行うことができる。 (Appendix 15)
The program according to any one of Appendices 1 to 6 and 8 to 14, wherein the attribute acquisition step includes a step of acquiring attribute information about the call, and the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on information about the emotion of the user or the customer in the call acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the emotional information of the user or the customer. For example, depending on the user's or customer's emotional state, it is possible to communicate with the user with a more suitable voice.

（付記１６）
属性取得ステップにおいて取得する通話属性は、ユーザおよび顧客の周辺環境、通話環境に関する情報は含まない、付記１から１５のいずれか記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 16)
The program according to any one of Appendices 1 to 15, wherein the call attribute acquired in the attribute acquisition step does not include information about the surroundings of the user and the customer or about the call environment.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.

（付記１７）
変換ステップは、音声取得ステップにおいて取得した通話音声のうち、人物の音声成分を変換するステップを含み、音声取得ステップにおいて取得した通話音声のうち、人物の音声成分以外の背景雑音、ノイズ、騒音などの音声成分を変換するステップを含まない、付記１から１６のいずれか記載のプログラム。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 17)
The program according to any one of Appendices 1 to 16, wherein the conversion step includes a step of converting the human voice component of the call voice acquired in the voice acquisition step, and does not include a step of converting components of that call voice other than the human voice component, such as background noise and other noise.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.
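The restriction in Appendix 17 can be illustrated as follows: the voice process is applied to the human voice component only, and the remaining component is mixed back in untouched. The Python sketch below assumes that the call voice has already been split into a voice component and a background component by some upstream separator, which is not shown and is not described in this section, and that the voice process preserves the number of samples. The function name `convert_voice_only` is hypothetical.

```python
from typing import Callable, List

Samples = List[float]


def convert_voice_only(
    voice: Samples,
    background: Samples,
    process: Callable[[Samples], Samples],
) -> Samples:
    """Convert only the human voice component and remix with the background.

    `voice` and `background` are assumed to be time-aligned components of the
    same call voice, produced by a separator that is outside this sketch.
    """
    if len(voice) != len(background):
        raise ValueError("components must have the same length")
    converted = process(voice)  # the background component is never passed to the process
    return [v + b for v, b in zip(converted, background)]
```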

（付記１８）
プログラムは、プロセッサに、第１ユーザから受け付けた第２選択指示に基づき、複数の音声処理のうち第２音声処理を選択する第２選択ステップと、を実行させ、音声取得ステップは、第２ユーザから第２通話音声を取得するステップを含み、変換ステップは、取得ステップにおいて取得した第２通話音声に、第２選択ステップにおいて選択された第２音声処理を適用することにより、第２通話音声を変換するステップを含み、出力ステップは、変換ステップにおいて変換された第２通話音声を第１ユーザへ出力するステップを含む、付記１から１７のいずれか記載のプログラム。
これにより、例えば、複数の音声処理のうち、ユーザにとってより聴きやすい音声で顧客と通話を行うことができる。例えば、複数の音声処理のうち、ユーザにとってより快適性が得られるような所定の音声で顧客と通話を行うことができる。例えば、複数の音声処理のうち、ユーザの好みの音声、ストレス値が小さくなる、興趣性が高まるような所定の音声で顧客と通話を行うことができる。例えば、顧客が怒っている場合などには、ユーザは、複数の音声処理のうち、抑揚を小さくしたり、音声を小さくする音声処理を選択することにより、顧客との応対に伴う心理的ストレスを低減させることができる。 (Appendix 18)
The program according to any one of Appendices 1 to 17, wherein the program causes the processor to execute a second selection step of selecting a second voice process from among the plurality of voice processes based on a second selection instruction received from the first user, the voice acquisition step includes a step of acquiring a second call voice from the second user, the conversion step includes a step of converting the second call voice by applying the second voice process selected in the second selection step to the second call voice acquired in the voice acquisition step, and the output step includes a step of outputting the second call voice converted in the conversion step to the first user.
As a result, the user can, for example, choose from among the plurality of voice processes one that makes the customer's voice easier for the user to hear, or one that makes the call more comfortable for the user. The user can likewise choose a voice process that yields a voice the user prefers, lowers the user's stress value, or makes the call more engaging. For example, when the customer is angry, the user can select a voice process that flattens the intonation or lowers the volume, thereby reducing the psychological stress of dealing with the customer.
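The two example processes mentioned above, flattening intonation and lowering volume, can be sketched on simple signal representations. The Python sketch below is illustrative only: it assumes the intonation is available as a pitch contour (fundamental frequency in Hz per frame) and the audio as normalized samples, and it omits resynthesizing audio from the modified contour. The function names and the `amount` and `gain` parameters are hypothetical.

```python
from typing import List


def flatten_intonation(f0_hz: List[float], amount: float) -> List[float]:
    """Pull each pitch value toward the mean of the contour.

    amount=0.0 keeps the contour unchanged; amount=1.0 makes it completely flat.
    """
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0.0 and 1.0")
    if not f0_hz:
        return []
    mean = sum(f0_hz) / len(f0_hz)
    return [mean + (1.0 - amount) * (f - mean) for f in f0_hz]


def lower_volume(samples: List[float], gain: float) -> List[float]:
    """Scale every sample by a gain below 1.0 to make the voice quieter."""
    return [s * gain for s in samples]
```

For instance, `flatten_intonation([100.0, 200.0, 300.0], 0.5)` halves each deviation from the 200 Hz mean, so an agitated voice with wide pitch swings would be presented to the user with a narrower pitch range.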

（付記１９）
プロセッサと、記憶部とを備える情報処理装置であって、プロセッサに、第１ユーザから通話音声を取得する音声取得ステップ（Ｓ１０４、Ｓ３０４）と、音声取得ステップにおいて取得した通話音声を変換する変換ステップ（Ｓ１０４、Ｓ３０４）と、変換ステップにおいて変換された通話音声を第２ユーザへ出力する出力ステップ（Ｓ１０５、Ｓ３０５）と、通話に関する通話属性を取得する属性取得ステップ（Ｓ１０２、S３０２）と、を実行させ、変換ステップは、属性取得ステップにおいて取得した通話属性に基づき、音声取得ステップにおいて取得した通話音声を変換するステップを含む、情報処理装置。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 19)
An information processing device comprising a processor and a storage unit, the information processing device causing the processor to execute: a voice acquisition step (S104, S304) of acquiring a call voice from a first user; a conversion step (S104, S304) of converting the call voice acquired in the voice acquisition step; an output step (S105, S305) of outputting the call voice converted in the conversion step to a second user; and an attribute acquisition step (S102, S302) of acquiring a call attribute related to the call, wherein the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the call attribute acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.

（付記２０）
プロセッサと、記憶部とを備えるコンピュータにより実行される情報処理方法であって、プロセッサに、第１ユーザから通話音声を取得する音声取得ステップ（Ｓ１０４、Ｓ３０４）と、音声取得ステップにおいて取得した通話音声を変換する変換ステップ（Ｓ１０４、Ｓ３０４）と、変換ステップにおいて変換された通話音声を第２ユーザへ出力する出力ステップ（Ｓ１０５、Ｓ３０５）と、通話に関する通話属性を取得する属性取得ステップ（Ｓ１０２、S３０２）と、を実行させ、変換ステップは、属性取得ステップにおいて取得した通話属性に基づき、音声取得ステップにおいて取得した通話音声を変換するステップを含む、情報処理方法。
これにより、複数のユーザ間で行われる通話において、通話属性に応じて、顧客は、より適した音声でユーザと通話を行うことができる。 (Appendix 20)
An information processing method executed by a computer comprising a processor and a storage unit, the method causing the processor to execute: a voice acquisition step (S104, S304) of acquiring a call voice from a first user; a conversion step (S104, S304) of converting the call voice acquired in the voice acquisition step; an output step (S105, S305) of outputting the call voice converted in the conversion step to a second user; and an attribute acquisition step (S102, S302) of acquiring a call attribute related to the call, wherein the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the call attribute acquired in the attribute acquisition step.
As a result, in a call between a plurality of users, the customer can make a call with the user with a more suitable voice according to the call attribute.

１情報処理システム、１０サーバ、１０１記憶部、１０３制御部、２０Ａ，２０Ｂ，２０Ｃユーザ端末、２０１記憶部、２０４制御部、３０ＣＲＭシステム、３０１記憶部、３０４制御部、５０Ａ，５０Ｂ，５０Ｃ顧客端末、５０１記憶部、５０４制御部

1 information processing system, 10 server, 101 storage unit, 103 control unit, 20A, 20B, 20C user terminal, 201 storage unit, 204 control unit, 30 CRM system, 301 storage unit, 304 control unit, 50A, 50B, 50C customer terminal, 501 storage unit, 504 control unit

Claims

A program for causing a computer comprising a processor and a storage unit to carry out a call between a first user and a second user,
the program causing the processor to execute:
a voice acquisition step of acquiring a call voice from the first user;
a conversion step of converting the call voice acquired in the voice acquisition step;
an output step of outputting the call voice converted in the conversion step to the second user; and
an attribute acquisition step of acquiring a call attribute related to the call,
wherein the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the call attribute acquired in the attribute acquisition step.

The program according to claim 1,
wherein the conversion step is a step of converting the call voice by applying a generative model to the call attribute acquired in the attribute acquisition step and the call voice acquired in the voice acquisition step.

The program according to claim 1,
wherein the program causes the processor to execute a selection step of selecting a predetermined voice process from among a plurality of voice processes based on the call attribute acquired in the attribute acquisition step, and
the conversion step is a step of converting the call voice by applying the predetermined voice process selected in the selection step to the call voice acquired in the voice acquisition step.

The program according to claim 3,
wherein the selection step is a step of selecting, from among the plurality of voice processes, a predetermined voice process that makes the voice easier for the second user to hear.

The program according to claim 3 or 4,
wherein the selection step is a step of selecting, from among the plurality of voice processes, a predetermined voice process that makes the voice sound more trustworthy to the second user.

The program according to any one of claims 3 to 5,
wherein the selection step is a step of selecting the predetermined voice process by applying a voice processing model to the call attribute acquired in the attribute acquisition step.

A program for causing a computer comprising a processor and a storage unit to carry out a call between a first user and a second user,
the program causing the processor to execute:
a voice acquisition step of acquiring a call voice from the first user;
a conversion step of converting the call voice acquired in the voice acquisition step;
an output step of outputting the call voice converted in the conversion step to the second user; and
a selection step of selecting a predetermined voice process from among a plurality of voice processes based on a selection instruction received from the first user,
wherein the conversion step is a step of converting the call voice by applying the predetermined voice process selected in the selection step to the call voice acquired in the voice acquisition step.

The program according to claim 7,
wherein the program causes the processor to execute an attribute acquisition step of acquiring a call attribute related to the call, and
the selection step includes:
a step of receiving, from the first user, a selection instruction specifying a type of evaluation index; and
a step of selecting the predetermined voice process from among the plurality of voice processes based on the call attribute acquired in the attribute acquisition step and the received evaluation index.

The program according to claim 8,
wherein the selection step includes:
a step of receiving, from the first user, a selection instruction specifying a type of evaluation index to be optimized; and
a step of selecting, from among the plurality of voice processes, a predetermined voice process that optimizes the evaluation index, based on the call attribute acquired in the attribute acquisition step and the received evaluation index.

The program according to claim 8 or 9,
wherein the selection step includes:
a step of receiving, from the first user, a selection instruction specifying a type of evaluation index; and
a step of selecting the predetermined voice process from among the plurality of voice processes by applying a voice processing model to the call attribute acquired in the attribute acquisition step and the received evaluation index.

The program according to any one of claims 8 to 10,
wherein the voice acquisition step includes a step of acquiring a second call voice from the second user,
the selection step cannot accept a selection instruction from the second user, and
the conversion step does not convert the second call voice acquired in the voice acquisition step.

The program according to any one of claims 1 to 6 and 8 to 11,
wherein the attribute acquisition step includes a step of acquiring attribute information about the second user, and
the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the attribute information about the second user acquired in the attribute acquisition step.

The program according to any one of claims 1 to 6 and 8 to 12,
wherein the attribute acquisition step includes a step of acquiring attribute information about the first user, and
the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the attribute information about the first user acquired in the attribute acquisition step.

The program according to any one of claims 1 to 6 and 8 to 13,
wherein the attribute acquisition step includes a step of acquiring attribute information about the call, and
the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the attribute information about the call acquired in the attribute acquisition step.

The program according to any one of claims 1 to 6 and 8 to 14,
wherein the attribute acquisition step includes a step of acquiring attribute information about the call, and
the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on information about the emotion of the user or the customer in the call acquired in the attribute acquisition step.

The program according to any one of claims 1 to 6 and 8 to 15,
wherein the call attribute acquired in the attribute acquisition step does not include information about the surroundings of the user and the customer or about the call environment.

The program according to any one of claims 1 to 16,
wherein the conversion step includes a step of converting the human voice component of the call voice acquired in the voice acquisition step, and
does not include a step of converting components of that call voice other than the human voice component, such as background noise and other noise.

The program according to any one of claims 1 to 17,
wherein the program causes the processor to execute a second selection step of selecting a second voice process from among the plurality of voice processes based on a second selection instruction received from the first user,
the voice acquisition step includes a step of acquiring a second call voice from the second user,
the conversion step includes a step of converting the second call voice by applying the second voice process selected in the second selection step to the second call voice acquired in the voice acquisition step, and
the output step includes a step of outputting the second call voice converted in the conversion step to the first user.

An information processing device comprising a processor and a storage unit,
the information processing device causing the processor to execute:
a voice acquisition step of acquiring a call voice from a first user;
a conversion step of converting the call voice acquired in the voice acquisition step;
an output step of outputting the call voice converted in the conversion step to a second user; and
an attribute acquisition step of acquiring a call attribute related to the call,
wherein the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the call attribute acquired in the attribute acquisition step.

An information processing method executed by a computer comprising a processor and a storage unit,
the method causing the processor to execute:
a voice acquisition step of acquiring a call voice from a first user;
a conversion step of converting the call voice acquired in the voice acquisition step;
an output step of outputting the call voice converted in the conversion step to a second user; and
an attribute acquisition step of acquiring a call attribute related to the call,
wherein the conversion step includes a step of converting the call voice acquired in the voice acquisition step based on the call attribute acquired in the attribute acquisition step.